mirror of https://github.com/VinciGit00/Scrapegraph-ai.git synced 2026-06-12 21:01:54 +08:00


Marco Perini ecb7601be7 docs(version): fixed compatible versions	2024-06-18 23:31:49 +02:00
assets	docs(scriptcreator): enhance documentation	2024-06-12 01:16:50 +02:00
source	docs(version): fixed compatible versions	2024-06-18 23:31:49 +02:00
chinese.md	beautofy readmes	2024-06-08 11:20:06 +02:00
japanese.md	beautofy readmes	2024-06-08 11:20:06 +02:00
korean.md	add new readmes	2024-06-13 11:39:57 +02:00
make.bat	add: readthedocs structure	2024-01-31 16:46:05 +01:00
Makefile	add: readthedocs structure	2024-01-31 16:46:05 +01:00
README.md	docs(roadmap): open contributions	2024-05-02 19:37:41 +02:00
russian.md	docs: fixed readme по русский	2024-06-18 22:19:16 +02:00

README.md

title: ScrapeGraphAI Roadmap
markmap:
  colorFreezeLevel: 2
  maxWidth: 500

ScrapeGraphAI Roadmap

Short-Term Goals

Integrate with more LLM APIs
Test proxy rotation implementation
Add more search engines inside the SearchInternetNode
Improve the documentation (ReadTheDocs)
- Issue #102
Create tutorials for the library

Medium-Term Goals

Node for handling API requests
Improve SearchGraph to look into the first 5 results of the search engine
Make scraping more deterministic
- Create DOM tree of the website
- HTML tag text embeddings with tags metadata
- Study tree forks from root node
- How do we use the tags parameters?
Create scraping folder with report
- Folder contains .scrape files, DOM tree files, report
- Report could be an HTML page with scraping speed, costs, LLM info, scraped content, and DOM tree visualization
- We can use pyecharts with R Markdown
Scrape multiple pages of the same website
- Create a new node that instantiates multiple graphs at the same time
- Make graphs run in parallel
- Scrape only relevant URLs from user prompt
- Use the multidimensional DOM tree of the website for retrieval
- Issue #112
Crawler graph
- Scrape all the URLs with the same domain in all the pages
- Build many DOM trees and link them together
- Save the multidimensional tree in a file
Compare two DOM trees to assess their similarity
- Save the DOM tree of the scraped website to a file as a cache for comparison with the website's future structure
- Create similarity metrics with multiple DOM trees (overall tree? only relevant tags structure?)
Nodes for handling authentication
- Use Selenium or Playwright to handle authentication
- Pass the cookies to the other nodes
Nodes that attach to an open browser
- Use Selenium or Playwright to attach to an open browser
- Navigate inside the browser and scrape the content
Nodes for taking screenshots and understanding the page layout
- Use Selenium or Playwright to take screenshots
- Use an LLM to assess whether it is a block-like page, paragraph-like page, etc.
- Issue #88
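
The DOM tree goals above (building a tree for a page, caching it, and comparing it with a later version of the same page) could be prototyped with the Python standard library alone. The sketch below is an illustration under stated assumptions, not part of the ScrapeGraphAI API: `TagPathCollector`, `tag_paths`, and `dom_similarity` are hypothetical names, and the Jaccard index over root-to-node tag paths is only one candidate for the similarity metric, which the roadmap leaves open.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Hypothetical helper: records the root-to-node tag path of every element."""

    def __init__(self):
        super().__init__()
        self._stack = []    # currently open tags, outermost first
        self.paths = set()  # e.g. {"html", "html/body", "html/body/div"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self._stack + [tag]))
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closing tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def dom_similarity(html_a, html_b):
    """Jaccard index of the two tag-path sets: 1.0 means identical structure."""
    paths_a, paths_b = tag_paths(html_a), tag_paths(html_b)
    if not paths_a and not paths_b:
        return 1.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)


cached = "<html><body><div><p>a</p></div></body></html>"
current = "<html><body><div><p>b</p><span>c</span></div></body></html>"
score = dom_similarity(cached, current)
```

A path set ignores text content and repeated siblings, so it only flags structural changes; a cached set of paths could be written to a file and compared against a fresh scrape to detect layout drift before reusing an old extraction strategy.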

Long-Term Goals

Automatic generation of scraping pipelines from a given prompt
Create an API for the library
Fine-tune an LLM for HTML content