docs(tree): added roadmap

2026-06-23 21:00:30 +08:00 · 2024-05-02 19:20:28 +02:00 · 2024-05-02 19:20:28 +02:00 · c8eeff873d
commit c8eeff873d
parent 15be111422
2 changed files with 122 additions and 0 deletions
--- a/README.md
+++ b/README.md
--- a/docs/roadmap.md
+++ b/docs/roadmap.md
@@ -0,0 +1,75 @@
+---
+title: ScrapeGraphAI Roadmap
+markmap:
+  colorFreezeLevel: 2
+  maxWidth: 500
+---
+
+# **ScrapeGraphAI Roadmap**
+
+## **Short-Term Goals**
+
+- Integration with more LLM APIs
+
+- Test proxy rotation implementation
+
+- Add more search engines inside the SearchInternetNode
+
+- Improve the documentation (ReadTheDocs)
+    - [Issue #102](https://github.com/VinciGit00/Scrapegraph-ai/issues/102)
+
+- Create tutorials for the library
+
+## **Medium-Term Goals**
+
+- Node for handling API requests
+
+- Improve SearchGraph to look into the first 5 results of the search engine
+
+- Make scraping more deterministic
+    - Create DOM tree of the website
+    - HTML tag text embeddings with tag metadata
+    - Study tree forks from root node
+    - How do we use the tags parameters?
+
+- Create scraping folder with report
+    - Folder contains .scrape files, DOM tree files, report
+    - Report could be an HTML page with scraping speed, costs, LLM info, scraped content and DOM tree visualization
+    - We can use pyecharts with R Markdown
+
+- Scrape multiple pages of the same website
+    - Create a new node that instantiates multiple graphs at the same time
+    - Make graphs run in parallel
+    - Scrape only relevant URLs from user prompt
+    - Use the multidimensional DOM tree of the website for retrieval
+    - [Issue #112](https://github.com/VinciGit00/Scrapegraph-ai/issues/112)
+
+- Crawler graph
+    - Scrape all the URLs with the same domain in all the pages
+    - Build many DOM trees and link them together
+    - Save the multidimensional tree in a file
+
+- Compare two DOM trees to assess the similarity
+    - Save the DOM tree of the scraped website in a file as a cache, to compare against the website's future structure
+    - Create similarity metrics with multiple DOM trees (overall tree? only relevant tags structure?)
+
+- Nodes for handling authentication
+    - Use Selenium or Playwright to handle authentication
+    - Pass the cookies to the other nodes
+
+- Nodes that attach to an open browser
+    - Use Selenium or Playwright to attach to an open browser
+    - Navigate inside the browser and scrape the content
+
+- Nodes for taking screenshots and understanding the page layout
+    - Use Selenium or Playwright to take screenshots
+    - Use an LLM to assess whether it is a block-like page, a paragraph-like page, etc.
+    - [Issue #88](https://github.com/VinciGit00/Scrapegraph-ai/issues/88)
+
+## **Long-Term Goals**
+
+- Automatic generation of scraping pipelines from a given prompt
+
+- Create API for the library
+
+- Fine-tune an LLM for HTML content
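
The "Compare two DOM trees" goal in the roadmap above leaves the similarity metric open. As one possible starting point, and not the project's implementation, the sketch below reduces each page to the set of its root-to-element tag paths and compares two pages with the Jaccard index. It uses only Python's standard-library `html.parser`; the names `TagPathCollector`, `tag_paths` and `structural_similarity` are hypothetical and do not exist in the library.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Hypothetical helper: records the tag path of every element, e.g. 'html/body/div'."""

    def __init__(self):
        super().__init__()
        self._stack = []    # currently open tags, outermost first
        self.paths = set()  # every distinct root-to-element path seen

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self._stack + [tag]))
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_similarity(html_a, html_b):
    """Jaccard index of the two path sets: 1.0 means the same set of paths."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    cached = "<html><body><div><p>Hi</p></div></body></html>"
    current = "<html><body><div><p>Hello</p><span>x</span></div></body></html>"
    print(structural_similarity(cached, current))
```

Because the metric ignores text content and sibling order, it only signals that a layout has gained or lost element paths. Restricting the comparison to the relevant tags, which the roadmap also raises, would need a filtering step on top of this.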