docs(faq): added faq section and refined installation

2026-06-25 21:11:11 +08:00 · 2024-05-25 17:03:02 +02:00 · 545374c17e
commit 545374c17e
parent 15b7682967
8 changed files with 94 additions and 84 deletions
--- a/.python-version
+++ b/.python-version
@ -1,2 +0,0 @@
-3.10.14
-
--- a/README.md
+++ b/README.md
@ -168,7 +168,7 @@ Feel free to contribute and join our Discord server to discuss with us improveme

 Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md).

-[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/gkxQDAjfeX)
+[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/uJN7TYcpNa)
 [![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
 [![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)

@ -179,13 +179,14 @@ Wanna visualize the roadmap in a more interactive way? Check out the [markmap](h

 ## ❤️ Contributors
 [![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
+
 ## Sponsors
 <div style="text-align: center;">
  <a href="https://serpapi.com?utm_source=scrapegraphai">
    <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
  </a>
  <a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
-    <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 10%;">
+    <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
  </a>
 </div>

--- a/docs/source/getting_started/installation.rst
+++ b/docs/source/getting_started/installation.rst
@ -25,11 +25,18 @@ The library is available on PyPI, so it can be installed using the following com
   
   It is higly recommended to install the library in a virtual environment (conda, venv, etc.)

-If your clone the repository, you can install the library using `poetry <https://python-poetry.org/docs/>`_:
+If you clone the repository, it is recommended to use a package manager like `rye <https://rye.astral.sh/>`_.
+To install the library using rye, you can run the following command:

 .. code-block:: bash

-   poetry install
+   rye pin 3.10
+   rye sync
+   rye build
+
+.. caution::
+   
+      **Rye** must be installed first by following the instructions on the `official website <https://rye.astral.sh/>`_.

 Additionally on Windows when using WSL
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@ -32,6 +32,15 @@

   modules/modules

+.. toctree::
+   :hidden:
+   :caption: EXTERNAL RESOURCES
+
+   GitHub <https://github.com/VinciGit00/Scrapegraph-ai>
+   Discord <https://discord.gg/uJN7TYcpNa>
+   LinkedIn <https://www.linkedin.com/company/scrapegraphai/>
+   Twitter <https://twitter.com/scrapegraphai>
+
 Indices and tables
 ==================

--- a/docs/source/introduction/overview.rst
+++ b/docs/source/introduction/overview.rst
@ -6,13 +6,11 @@
 Overview 
 ========

-ScrapeGraphAI is a open-source web scraping python library designed to usher in a new era of scraping tools.
-In today's rapidly evolving and data-intensive digital landscape, this library stands out by integrating LLM and
-direct graph logic to automate the creation of scraping pipelines for websites and various local documents, including XML,
-HTML, JSON, and more.
+ScrapeGraphAI is an **open-source** Python library designed to revolutionize **scraping** tools.
+In today's data-intensive digital landscape, this library stands out by integrating **Large Language Models** (LLMs) 
+and modular **graph-based** pipelines to automate the scraping of data from various sources (e.g., websites and local files).

-Simply specify the information you need to extract, and ScrapeGraphAI handles the rest,
-providing a more flexible and low-maintenance solution compared to traditional scraping tools.
+Simply specify the information you need to extract, and ScrapeGraphAI handles the rest, providing a more **flexible** and **low-maintenance** solution compared to traditional scraping tools.

 Why ScrapegraphAI?
 ==================
@ -21,17 +19,75 @@ Traditional web scraping tools often rely on fixed patterns or manual configurat
 ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention. 
 This flexibility ensures that scrapers remain functional even when website layouts change.

-We support many Large Language Models (LLMs) including GPT, Gemini, Groq, Azure, Hugging Face etc.
-as well as local models which can run on your machine using Ollama.
+We support many LLMs including **GPT, Gemini, Groq, Azure, Hugging Face** etc.
+as well as local models which can run on your machine using **Ollama**.

 Library Diagram
 ===============

-With ScrapegraphAI you first construct a pipeline of steps you want to execute by combining nodes into a graph.
-Executing the graph takes care of all the steps that are often part of scraping: fetching, parsing etc...
-Finally the scraped and processed data gets fed to an LLM which generates a response.
+With ScrapegraphAI you can use many already implemented scraping pipelines or create your own.
+
+The diagram below illustrates the high-level architecture of ScrapeGraphAI:

 .. image:: ../../assets/project_overview_diagram.png
   :align: center
   :width: 70%
   :alt: ScrapegraphAI Overview
+
+FAQ
+===
+
+1. **What is ScrapeGraphAI?**
+
+   ScrapeGraphAI is an open-source Python library that uses large language models (LLMs) and graph logic to automate the creation of scraping pipelines for websites and various document types.
+
+2. **How does ScrapeGraphAI differ from traditional scraping tools?**
+
+   Traditional scraping tools rely on fixed patterns and manual configurations, whereas ScrapeGraphAI adapts to website structure changes using LLMs, reducing the need for constant developer intervention.
+
+3. **Which LLMs are supported by ScrapeGraphAI?**
+
+   ScrapeGraphAI supports several LLMs, including GPT, Gemini, Groq, Azure, Hugging Face, and local models that can run on your machine using Ollama.
+
+4. **Can ScrapeGraphAI handle different document formats?**
+
+   Yes, ScrapeGraphAI can scrape information from various document formats such as XML, HTML, JSON, and more.
+
+5. **I get an empty or incorrect output when scraping a website. What should I do?**
+
+   This issue can have several causes. In most cases, you can try the following:
+
+      - Set the `headless` parameter to `False` in the graph_config. Some JavaScript-heavy websites might require it.
+
+      - Check your internet connection. A slow or unstable connection can prevent the HTML from loading properly.
+
+      - Try using a proxy server to mask your IP address. Check out the :ref:`Proxy` section for more information on how to configure proxy settings.
+      
+      - Use a different LLM. Some models might perform better on certain websites than others.
+
+      - Set the `verbose` parameter to `True` in the graph_config to see more detailed logs.
+
+      - Visualize the pipeline graphically using :ref:`Burr`.
+   
+   If the issue persists, please report it on the GitHub repository.
+
+6. **How does ScrapeGraphAI handle the context window limit of LLMs?**
+
+   ScrapeGraphAI splits large websites and documents into overlapping chunks and applies compression techniques to reduce the number of tokens. When there are multiple chunks, each one produces its own answer to the user prompt, and these answers are merged in the last step of the scraping pipeline.
+
+7. **How can I contribute to ScrapeGraphAI?**
+
+   You can contribute to ScrapeGraphAI by submitting bug reports, feature requests, or pull requests on the GitHub repository. Join our `Discord <https://discord.gg/uJN7TYcpNa>`_ community and follow us on social media!
+
+Sponsors
+========
+
+.. image:: ../../assets/serp_api_logo.png
+   :width: 10%
+   :alt: Serp API
+   :target: https://serpapi.com?utm_source=scrapegraphai
+
+.. image:: ../../assets/transparent_stat.png
+   :width: 15%
+   :alt: Stat Proxies
+   :target: https://dashboard.statproxies.com/?refferal=scrapegraph
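Note on FAQ item 6 in overview.rst above: the answer describes splitting large inputs into overlapping chunks before they reach the LLM. The sketch below is only an illustration of that idea and is not taken from the ScrapeGraphAI code base. The function name `split_with_overlap`, its parameters, and the use of whitespace-separated words as stand-ins for tokens are all assumptions made for the example.

```python
def split_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into chunks of at most `chunk_size` items,
    where consecutive chunks share `overlap` items.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input,
        # so no trailing chunk made purely of overlap is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace-separated words stand in for real LLM tokens (an assumption).
words = "a b c d e f g h i j".split()
for chunk in split_with_overlap(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk would then be sent to the model separately. The merging of the per-chunk answers that the FAQ mentions is not shown here.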
--- a/docs/source/scrapers/graph_config.rst
+++ b/docs/source/scrapers/graph_config.rst
@ -14,6 +14,8 @@ Some interesting ones are:
 - `burr_kwargs`: A dictionary with additional parameters to enable `Burr` graphical user interface.
 - `max_images`: The maximum number of images to be analyzed. Useful in `OmniScraperGraph` and `OmniSearchGraph`.

+.. _Burr:
+
 Burr Integration
 ^^^^^^^^^^^^^^^^

@ -43,6 +45,8 @@ To log your graph execution in the platform, you need to set the `burr_kwargs` p
        }
    }

+.. _Proxy:
+
 Proxy Rotation
 ^^^^^^^^^^^^^^

--- a/requirements-dev.lock
+++ b/requirements-dev.lock
@ -81,8 +81,6 @@ cycler==0.12.1
 dataclasses-json==0.6.6
    # via langchain
    # via langchain-community
-decorator==5.1.1
-    # via ipython
 defusedxml==0.7.1
    # via langchain-anthropic
 distro==1.9.0
@ -97,10 +95,7 @@ email-validator==2.1.1
    # via fastapi
 exceptiongroup==1.2.1
    # via anyio
-    # via ipython
    # via pytest
-executing==2.0.1
-    # via stack-data
 faiss-cpu==1.8.0
    # via scrapegraphai
 fastapi==0.111.0
@ -119,7 +114,6 @@ free-proxy==1.1.1
 frozenlist==1.4.1
    # via aiohttp
    # via aiosignal
-fsspec==2024.5.0
 fsspec==2024.5.0
    # via huggingface-hub
 furo==2024.5.6
@ -208,8 +202,6 @@ jmespath==1.0.1
 jsonpatch==1.33
    # via langchain
    # via langchain-core
-jsonpickle==3.0.4
-    # via pyvis
 jsonpointer==2.4
    # via jsonpatch
 jsonschema==4.22.0
@ -268,9 +260,6 @@ multidict==6.0.5
    # via yarl
 mypy-extensions==1.0.0
    # via typing-inspect
-networkx==3.3
-    # via pyvis
-    # via scrapegraphai
 numpy==1.26.4
    # via altair
    # via contourpy
@ -312,8 +301,6 @@ playwright==1.43.0
    # via undetected-playwright
 pluggy==1.5.0
    # via pytest
-prompt-toolkit==3.0.43
-    # via ipython
 proto-plus==1.23.0
    # via google-ai-generativelanguage
    # via google-api-core
@ -354,8 +341,6 @@ pygments==2.18.0
    # via furo
    # via rich
    # via sphinx
-pygments==2.18.0
-    # via ipython
 pyparsing==3.1.2
    # via httplib2
    # via matplotlib
@ -373,8 +358,6 @@ python-multipart==0.0.9
    # via fastapi
 pytz==2024.1
    # via pandas
-pyvis==0.3.2
-    # via scrapegraphai
 pyyaml==6.0.1
    # via huggingface-hub
    # via langchain
@ -414,7 +397,6 @@ sf-hamilton==1.63.0
 shellingham==1.5.4
    # via typer
 six==1.16.0
-    # via asttokens
    # via python-dateutil
 smmap==5.0.1
    # via gitdb
@ -453,8 +435,6 @@ starlette==0.37.2
    # via fastapi
 streamlit==1.34.0
    # via burr
-stack-data==0.6.3
-    # via ipython
 tenacity==8.3.0
    # via langchain
    # via langchain-community
@ -480,9 +460,6 @@ tqdm==4.66.4
    # via scrapegraphai
 typer==0.12.3
    # via fastapi-cli
-traitlets==5.14.3
-    # via ipython
-    # via matplotlib-inline
 typing-extensions==4.11.0
    # via altair
    # via anthropic
@ -492,7 +469,6 @@ typing-extensions==4.11.0
    # via google-generativeai
    # via groq
    # via huggingface-hub
-    # via ipython
    # via openai
    # via pydantic
    # via pydantic-core
@ -508,10 +484,10 @@ typing-inspect==0.9.0
    # via sf-hamilton
 tzdata==2024.1
    # via pandas
-undetected-playwright==0.3.0
-    # via scrapegraphai
 ujson==5.10.0
    # via fastapi
+undetected-playwright==0.3.0
+    # via scrapegraphai
 uritemplate==4.1.1
    # via google-api-python-client
 urllib3==2.2.1
--- a/requirements.lock
+++ b/requirements.lock
@ -22,8 +22,6 @@ anyio==4.3.0
    # via groq
    # via httpx
    # via openai
-asttokens==2.4.1
-    # via stack-data
 async-timeout==4.0.3
    # via aiohttp
    # via langchain
@ -50,8 +48,6 @@ colorama==0.4.6
 dataclasses-json==0.6.6
    # via langchain
    # via langchain-community
-decorator==5.1.1
-    # via ipython
 defusedxml==0.7.1
    # via langchain-anthropic
 distro==1.9.0
@ -60,9 +56,6 @@ distro==1.9.0
    # via openai
 exceptiongroup==1.2.1
    # via anyio
-    # via ipython
-executing==2.0.1
-    # via stack-data
 faiss-cpu==1.8.0
    # via scrapegraphai
 filelock==3.14.0
@ -72,7 +65,6 @@ free-proxy==1.1.1
 frozenlist==1.4.1
    # via aiohttp
    # via aiosignal
-fsspec==2024.5.0
 fsspec==2024.5.0
    # via huggingface-hub
 google==3.0.0
@ -139,8 +131,6 @@ jmespath==1.0.1
 jsonpatch==1.33
    # via langchain
    # via langchain-core
-jsonpickle==3.0.4
-    # via pyvis
 jsonpointer==2.4
    # via jsonpatch
 langchain==0.1.15
@ -174,12 +164,8 @@ langsmith==0.1.60
    # via langchain-core
 lxml==5.2.2
    # via free-proxy
-markupsafe==2.1.5
-    # via jinja2
 marshmallow==3.21.2
    # via dataclasses-json
-matplotlib-inline==0.1.7
-    # via ipython
 minify-html==0.15.0
    # via scrapegraphai
 multidict==6.0.5
@ -187,9 +173,6 @@ multidict==6.0.5
    # via yarl
 mypy-extensions==1.0.0
    # via typing-inspect
-networkx==3.3
-    # via pyvis
-    # via scrapegraphai
 numpy==1.26.4
    # via faiss-cpu
    # via langchain
@ -206,15 +189,9 @@ packaging==23.2
    # via marshmallow
 pandas==2.2.2
    # via scrapegraphai
-parso==0.8.4
-    # via jedi
-pexpect==4.9.0
-    # via ipython
 playwright==1.43.0
    # via scrapegraphai
    # via undetected-playwright
-prompt-toolkit==3.0.43
-    # via ipython
 proto-plus==1.23.0
    # via google-ai-generativelanguage
    # via google-api-core
@ -225,10 +202,6 @@ protobuf==4.25.3
    # via googleapis-common-protos
    # via grpcio-status
    # via proto-plus
-ptyprocess==0.7.0
-    # via pexpect
-pure-eval==0.2.2
-    # via stack-data
 pyasn1==0.6.0
    # via pyasn1-modules
    # via rsa
@ -247,8 +220,6 @@ pydantic-core==2.18.2
    # via pydantic
 pyee==11.1.0
    # via playwright
-pygments==2.18.0
-    # via ipython
 pyparsing==3.1.2
    # via httplib2
 python-dateutil==2.9.0.post0
@ -258,8 +229,6 @@ python-dotenv==1.0.1
    # via scrapegraphai
 pytz==2024.1
    # via pandas
-pyvis==0.3.2
-    # via scrapegraphai
 pyyaml==6.0.1
    # via huggingface-hub
    # via langchain
@ -282,7 +251,6 @@ s3transfer==0.10.1
 selectolax==0.3.21
    # via yahoo-search-py
 six==1.16.0
-    # via asttokens
    # via python-dateutil
 sniffio==1.3.1
    # via anthropic
@ -295,8 +263,6 @@ soupsieve==2.5
 sqlalchemy==2.0.30
    # via langchain
    # via langchain-community
-stack-data==0.6.3
-    # via ipython
 tenacity==8.3.0
    # via langchain
    # via langchain-community
@ -311,16 +277,12 @@ tqdm==4.66.4
    # via huggingface-hub
    # via openai
    # via scrapegraphai
-traitlets==5.14.3
-    # via ipython
-    # via matplotlib-inline
 typing-extensions==4.11.0
    # via anthropic
    # via anyio
    # via google-generativeai
    # via groq
    # via huggingface-hub
-    # via ipython
    # via openai
    # via pydantic
    # via pydantic-core
@ -335,13 +297,10 @@ undetected-playwright==0.3.0
    # via scrapegraphai
 uritemplate==4.1.1
    # via google-api-python-client
-urllib3==2.2.1
 urllib3==2.2.1
    # via botocore
    # via requests
    # via yahoo-search-py
-wcwidth==0.2.13
-    # via prompt-toolkit
 yahoo-search-py==0.3
    # via scrapegraphai
 yarl==1.9.4