Scrapegraph-ai/docs
Marco Perini ecb7601be7
docs(version): fixed compatible versions
2024-06-18 23:31:49 +02:00

ScrapeGraphAI Roadmap

Short-Term Goals

  • Integration with more LLM APIs

  • Test proxy rotation implementation

  • Add more search engines inside the SearchInternetNode

  • Improve the documentation (ReadTheDocs)

  • Create tutorials for the library
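
As an illustration of the proxy rotation item above, here is a minimal round-robin sketch. It is not the library's implementation: `ProxyRotator` and `next_proxy` are hypothetical names, and the proxy URLs in the usage comment are placeholders. The returned dictionary follows the `{"http": ..., "https": ...}` shape that HTTP clients such as `requests` accept for their `proxies` argument.

```python
from itertools import cycle


class ProxyRotator:
    """Hypothetical helper: hands out proxies from a fixed list in round-robin order."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        # cycle() repeats the list endlessly, so the rotation never runs out.
        self._pool = cycle(proxies)

    def next_proxy(self):
        """Return the next proxy as a mapping usable for both HTTP and HTTPS."""
        proxy = next(self._pool)
        return {"http": proxy, "https": proxy}


# Usage (placeholder addresses, not real proxies):
# rotator = ProxyRotator(["http://proxy-a.example.com:8080",
#                         "http://proxy-b.example.com:8080"])
# settings = rotator.next_proxy()
```

A test for such a rotation could simply check that the sequence wraps around after the last proxy, without making any network call.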

Medium-Term Goals

  • Node for handling API requests

  • Improve SearchGraph to look into the first 5 results of the search engine

  • Make scraping more deterministic

    • Create DOM tree of the website
    • HTML tag text embeddings with tag metadata
    • Study tree forks from root node
    • How do we use the tag parameters?
  • Create scraping folder with report

    • Folder contains .scrape files, DOM tree files, report
    • Report could be an HTML page with scraping speed, costs, LLM info, scraped content and DOM tree visualization
    • We can use pyecharts with R-markdown
  • Scrape multiple pages of the same website

    • Create a new node that instantiates multiple graphs at the same time
    • Make graphs run in parallel
    • Scrape only relevant URLs from user prompt
    • Use the multi-dimensional DOM tree of the website for retrieval
    • Issue #112
  • Crawler graph

    • Scrape all the URLs with the same domain in all the pages
    • Build many DOM trees and link them together
    • Save the multi-dimensional tree in a file
  • Compare two DOM trees to assess the similarity

    • Save the DOM tree of the scraped website in a file as a cache, to compare against the website's future structure
    • Create similarity metrics with multiple DOM trees (overall tree? only relevant tags structure?)
  • Nodes for handling authentication

    • Use Selenium or Playwright to handle authentication
    • Pass the cookies to the other nodes
  • Nodes that attach to an open browser

    • Use Selenium or Playwright to attach to an open browser
    • Navigate inside the browser and scrape the content
  • Nodes for taking screenshots and understanding the page layout

    • Use Selenium or Playwright to take screenshots
    • Use an LLM to assess whether it is a block-like page, paragraph-like page, etc.
    • Issue #88
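
The DOM tree and similarity items above can be sketched with the standard library alone. This is one possible approach under stated assumptions, not the project's design: `DomNode`, `DomTreeBuilder`, `tag_paths` and `dom_similarity` are hypothetical names, and the metric (a multiset Jaccard index over root-to-node tag paths) is only one candidate for the "overall tree" option; it ignores text and attributes.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have children, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomNode:
    """Hypothetical tree node: tag name, attributes, parent and children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []


class DomTreeBuilder(HTMLParser):
    """Builds a DomNode tree from the start/end tag events of HTMLParser."""

    def __init__(self):
        super().__init__()
        self.root = DomNode("[document]")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = DomNode(tag, attrs, self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent


def build_dom_tree(html):
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_paths(node, prefix=""):
    """Count every root-to-node tag path, e.g. '/html/body/div'."""
    counts = Counter()
    for child in node.children:
        path = f"{prefix}/{child.tag}"
        counts[path] += 1
        counts.update(tag_paths(child, path))
    return counts


def dom_similarity(html_a, html_b):
    """Multiset Jaccard index of tag paths: 1.0 means identical structure."""
    a = tag_paths(build_dom_tree(html_a))
    b = tag_paths(build_dom_tree(html_b))
    union = sum((a | b).values())  # element-wise maximum of the counts
    if union == 0:
        return 1.0
    return sum((a & b).values()) / union  # element-wise minimum of the counts
```

Because only tag paths are compared, two snapshots of a page whose text changed but whose layout did not score 1.0, which is the property a structural cache check would rely on. A metric restricted to relevant tags could be built by filtering the paths before comparing.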

Long-Term Goals

  • Automatic generation of scraping pipelines from a given prompt

  • Create API for the library

  • Fine-tune an LLM for HTML content