docs(faq): added faq section and refined installation

Marco Perini 2024-05-25 17:03:02 +02:00
parent 15b7682967
commit 545374c17e
8 changed files with 94 additions and 84 deletions


@ -1,2 +0,0 @@
3.10.14


@ -168,7 +168,7 @@ Feel free to contribute and join our Discord server to discuss with us improveme
Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md). Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md).
[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/gkxQDAjfeX) [![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/uJN7TYcpNa)
[![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/) [![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai) [![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)
@ -179,13 +179,14 @@ Wanna visualize the roadmap in a more interactive way? Check out the [markmap](h
## ❤️ Contributors ## ❤️ Contributors
[![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors) [![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
## Sponsors ## Sponsors
<div style="text-align: center;"> <div style="text-align: center;">
<a href="https://serpapi.com?utm_source=scrapegraphai"> <a href="https://serpapi.com?utm_source=scrapegraphai">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;"> <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
</a> </a>
<a href="https://dashboard.statproxies.com/?refferal=scrapegraph"> <a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 10%;"> <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
</a> </a>
</div> </div>


@ -25,11 +25,18 @@ The library is available on PyPI, so it can be installed using the following com
It is highly recommended to install the library in a virtual environment (conda, venv, etc.) It is highly recommended to install the library in a virtual environment (conda, venv, etc.)
If you clone the repository, you can install the library using `poetry <https://python-poetry.org/docs/>`_: If you clone the repository, it is recommended to use a package manager like `rye <https://rye.astral.sh/>`_.
To install the library using rye, you can run the following command:
.. code-block:: bash .. code-block:: bash
poetry install rye pin 3.10
rye sync
rye build
.. caution::
**Rye** must be installed first by following the instructions on the `official website <https://rye.astral.sh/>`_.
Additionally on Windows when using WSL Additionally on Windows when using WSL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


@ -32,6 +32,15 @@
modules/modules modules/modules
.. toctree::
:hidden:
:caption: EXTERNAL RESOURCES
GitHub <https://github.com/VinciGit00/Scrapegraph-ai>
Discord <https://discord.gg/uJN7TYcpNa>
LinkedIn <https://www.linkedin.com/company/scrapegraphai/>
Twitter <https://twitter.com/scrapegraphai>
Indices and tables Indices and tables
================== ==================


@ -6,13 +6,11 @@
Overview Overview
======== ========
ScrapeGraphAI is a open-source web scraping python library designed to usher in a new era of scraping tools. ScrapeGraphAI is an **open-source** Python library designed to revolutionize **scraping** tools.
In today's rapidly evolving and data-intensive digital landscape, this library stands out by integrating LLM and In today's data-intensive digital landscape, this library stands out by integrating **Large Language Models** (LLMs)
direct graph logic to automate the creation of scraping pipelines for websites and various local documents, including XML, and modular **graph-based** pipelines to automate the scraping of data from various sources (e.g., websites, local files etc.).
HTML, JSON, and more.
Simply specify the information you need to extract, and ScrapeGraphAI handles the rest, Simply specify the information you need to extract, and ScrapeGraphAI handles the rest, providing a more **flexible** and **low-maintenance** solution compared to traditional scraping tools.
providing a more flexible and low-maintenance solution compared to traditional scraping tools.
Why ScrapegraphAI? Why ScrapegraphAI?
================== ==================
@ -21,17 +19,75 @@ Traditional web scraping tools often rely on fixed patterns or manual configurat
ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention. ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention.
This flexibility ensures that scrapers remain functional even when website layouts change. This flexibility ensures that scrapers remain functional even when website layouts change.
We support many Large Language Models (LLMs) including GPT, Gemini, Groq, Azure, Hugging Face etc. We support many LLMs including **GPT, Gemini, Groq, Azure, Hugging Face** etc.
as well as local models which can run on your machine using Ollama. as well as local models which can run on your machine using **Ollama**.
Library Diagram Library Diagram
=============== ===============
With ScrapegraphAI you first construct a pipeline of steps you want to execute by combining nodes into a graph. With ScrapegraphAI you can use many already implemented scraping pipelines or create your own.
Executing the graph takes care of all the steps that are often part of scraping: fetching, parsing etc...
Finally the scraped and processed data gets fed to an LLM which generates a response. The diagram below illustrates the high-level architecture of ScrapeGraphAI:
.. image:: ../../assets/project_overview_diagram.png .. image:: ../../assets/project_overview_diagram.png
:align: center :align: center
:width: 70% :width: 70%
:alt: ScrapegraphAI Overview :alt: ScrapegraphAI Overview
FAQ
===
1. **What is ScrapeGraphAI?**
ScrapeGraphAI is an open-source Python library that uses large language models (LLMs) and graph logic to automate the creation of scraping pipelines for websites and various document types.
2. **How does ScrapeGraphAI differ from traditional scraping tools?**
Traditional scraping tools rely on fixed patterns and manual configurations, whereas ScrapeGraphAI adapts to website structure changes using LLMs, reducing the need for constant developer intervention.
3. **Which LLMs are supported by ScrapeGraphAI?**
ScrapeGraphAI supports several LLMs, including GPT, Gemini, Groq, Azure, Hugging Face, and local models that can run on your machine using Ollama.
4. **Can ScrapeGraphAI handle different document formats?**
Yes, ScrapeGraphAI can scrape information from various document formats such as XML, HTML, JSON, and more.
5. **I get an empty or incorrect output when scraping a website. What should I do?**
Several things can cause this issue. In most cases, you can try the following:
- Set the `headless` parameter to `False` in the graph_config. Some JavaScript-heavy websites might require it.
- Check your internet connection. A slow or unstable connection can prevent the HTML from loading properly.
- Try using a proxy server to mask your IP address. Check out the :ref:`Proxy` section for more information on how to configure proxy settings.
- Use a different LLM model. Some models might perform better on certain websites than others.
- Set the `verbose` parameter to `True` in the graph_config to see more detailed logs.
- Visualize the pipeline graphically using :ref:`Burr`.
If the issue persists, please report it on the GitHub repository.
6. **How does ScrapeGraphAI handle the context window limit of LLMs?**
ScrapeGraphAI splits large websites and documents into overlapping chunks and applies compression techniques to reduce the number of tokens. When there are multiple chunks, each one produces its own answer to the user prompt, and these answers are merged in the last step of the scraping pipeline.
7. **How can I contribute to ScrapeGraphAI?**
You can contribute to ScrapeGraphAI by submitting bug reports, feature requests, or pull requests on the GitHub repository. Join our `Discord <https://discord.gg/uJN7TYcpNa>`_ community and follow us on social media!
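The chunking with overlaps mentioned in question 6 can be illustrated with a minimal sketch. This is not ScrapeGraphAI's actual implementation: `split_with_overlap`, `chunk_size`, and `overlap` are hypothetical names, and tokens are modeled as a plain Python list.

```python
def split_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into chunks that share `overlap` tokens with their neighbor.

    Hypothetical helper for illustration only, not part of the library.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


tokens = list(range(10))
chunks = split_with_overlap(tokens, chunk_size=4, overlap=1)
```

Each chunk would then be sent to the LLM separately, and the partial answers merged in a final step, as the answer to question 6 describes.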
Sponsors
========
.. image:: ../../assets/serp_api_logo.png
:width: 10%
:alt: Serp API
:target: https://serpapi.com?utm_source=scrapegraphai
.. image:: ../../assets/transparent_stat.png
:width: 15%
:alt: Stat Proxies
:target: https://dashboard.statproxies.com/?refferal=scrapegraph


@ -14,6 +14,8 @@ Some interesting ones are:
- `burr_kwargs`: A dictionary with additional parameters to enable `Burr` graphical user interface. - `burr_kwargs`: A dictionary with additional parameters to enable `Burr` graphical user interface.
- `max_images`: The maximum number of images to be analyzed. Useful in `OmniScraperGraph` and `OmniSearchGraph`. - `max_images`: The maximum number of images to be analyzed. Useful in `OmniScraperGraph` and `OmniSearchGraph`.
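As a rough illustration, a `graph_config` dictionary combining `max_images` with the `verbose` and `headless` flags mentioned in the FAQ might look like the sketch below. The `llm` entry, the model name, the API key, the value of `max_images`, and the `with_debug_options` helper are all assumptions made for this example. They are not taken from this documentation.

```python
import copy

# Base configuration. The "llm" block, model name and API key are
# placeholders for illustration, not values from this documentation.
base_config = {
    "llm": {
        "api_key": "YOUR_API_KEY",
        "model": "gpt-3.5-turbo",
    },
    "max_images": 5,  # assumed value
}


def with_debug_options(config):
    """Return a copy of `config` with verbose logging on and headless mode off.

    Hypothetical helper, not part of the library.
    """
    debug_config = copy.deepcopy(config)
    debug_config["verbose"] = True    # print more detailed logs
    debug_config["headless"] = False  # run with a visible browser window
    return debug_config


graph_config = with_debug_options(base_config)
```

Keeping the debugging flags in a separate copy leaves the base configuration untouched, so the same settings can be reused once the scraping issue is resolved.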
.. _Burr:
Burr Integration Burr Integration
^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^
@ -43,6 +45,8 @@ To log your graph execution in the platform, you need to set the `burr_kwargs` p
} }
} }
.. _Proxy:
Proxy Rotation Proxy Rotation
^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^


@ -81,8 +81,6 @@ cycler==0.12.1
dataclasses-json==0.6.6 dataclasses-json==0.6.6
# via langchain # via langchain
# via langchain-community # via langchain-community
decorator==5.1.1
# via ipython
defusedxml==0.7.1 defusedxml==0.7.1
# via langchain-anthropic # via langchain-anthropic
distro==1.9.0 distro==1.9.0
@ -97,10 +95,7 @@ email-validator==2.1.1
# via fastapi # via fastapi
exceptiongroup==1.2.1 exceptiongroup==1.2.1
# via anyio # via anyio
# via ipython
# via pytest # via pytest
executing==2.0.1
# via stack-data
faiss-cpu==1.8.0 faiss-cpu==1.8.0
# via scrapegraphai # via scrapegraphai
fastapi==0.111.0 fastapi==0.111.0
@ -119,7 +114,6 @@ free-proxy==1.1.1
frozenlist==1.4.1 frozenlist==1.4.1
# via aiohttp # via aiohttp
# via aiosignal # via aiosignal
fsspec==2024.5.0
fsspec==2024.5.0 fsspec==2024.5.0
# via huggingface-hub # via huggingface-hub
furo==2024.5.6 furo==2024.5.6
@ -208,8 +202,6 @@ jmespath==1.0.1
jsonpatch==1.33 jsonpatch==1.33
# via langchain # via langchain
# via langchain-core # via langchain-core
jsonpickle==3.0.4
# via pyvis
jsonpointer==2.4 jsonpointer==2.4
# via jsonpatch # via jsonpatch
jsonschema==4.22.0 jsonschema==4.22.0
@ -268,9 +260,6 @@ multidict==6.0.5
# via yarl # via yarl
mypy-extensions==1.0.0 mypy-extensions==1.0.0
# via typing-inspect # via typing-inspect
networkx==3.3
# via pyvis
# via scrapegraphai
numpy==1.26.4 numpy==1.26.4
# via altair # via altair
# via contourpy # via contourpy
@ -312,8 +301,6 @@ playwright==1.43.0
# via undetected-playwright # via undetected-playwright
pluggy==1.5.0 pluggy==1.5.0
# via pytest # via pytest
prompt-toolkit==3.0.43
# via ipython
proto-plus==1.23.0 proto-plus==1.23.0
# via google-ai-generativelanguage # via google-ai-generativelanguage
# via google-api-core # via google-api-core
@ -354,8 +341,6 @@ pygments==2.18.0
# via furo # via furo
# via rich # via rich
# via sphinx # via sphinx
pygments==2.18.0
# via ipython
pyparsing==3.1.2 pyparsing==3.1.2
# via httplib2 # via httplib2
# via matplotlib # via matplotlib
@ -373,8 +358,6 @@ python-multipart==0.0.9
# via fastapi # via fastapi
pytz==2024.1 pytz==2024.1
# via pandas # via pandas
pyvis==0.3.2
# via scrapegraphai
pyyaml==6.0.1 pyyaml==6.0.1
# via huggingface-hub # via huggingface-hub
# via langchain # via langchain
@ -414,7 +397,6 @@ sf-hamilton==1.63.0
shellingham==1.5.4 shellingham==1.5.4
# via typer # via typer
six==1.16.0 six==1.16.0
# via asttokens
# via python-dateutil # via python-dateutil
smmap==5.0.1 smmap==5.0.1
# via gitdb # via gitdb
@ -453,8 +435,6 @@ starlette==0.37.2
# via fastapi # via fastapi
streamlit==1.34.0 streamlit==1.34.0
# via burr # via burr
stack-data==0.6.3
# via ipython
tenacity==8.3.0 tenacity==8.3.0
# via langchain # via langchain
# via langchain-community # via langchain-community
@ -480,9 +460,6 @@ tqdm==4.66.4
# via scrapegraphai # via scrapegraphai
typer==0.12.3 typer==0.12.3
# via fastapi-cli # via fastapi-cli
traitlets==5.14.3
# via ipython
# via matplotlib-inline
typing-extensions==4.11.0 typing-extensions==4.11.0
# via altair # via altair
# via anthropic # via anthropic
@ -492,7 +469,6 @@ typing-extensions==4.11.0
# via google-generativeai # via google-generativeai
# via groq # via groq
# via huggingface-hub # via huggingface-hub
# via ipython
# via openai # via openai
# via pydantic # via pydantic
# via pydantic-core # via pydantic-core
@ -508,10 +484,10 @@ typing-inspect==0.9.0
# via sf-hamilton # via sf-hamilton
tzdata==2024.1 tzdata==2024.1
# via pandas # via pandas
undetected-playwright==0.3.0
# via scrapegraphai
ujson==5.10.0 ujson==5.10.0
# via fastapi # via fastapi
undetected-playwright==0.3.0
# via scrapegraphai
uritemplate==4.1.1 uritemplate==4.1.1
# via google-api-python-client # via google-api-python-client
urllib3==2.2.1 urllib3==2.2.1


@ -22,8 +22,6 @@ anyio==4.3.0
# via groq # via groq
# via httpx # via httpx
# via openai # via openai
asttokens==2.4.1
# via stack-data
async-timeout==4.0.3 async-timeout==4.0.3
# via aiohttp # via aiohttp
# via langchain # via langchain
@ -50,8 +48,6 @@ colorama==0.4.6
dataclasses-json==0.6.6 dataclasses-json==0.6.6
# via langchain # via langchain
# via langchain-community # via langchain-community
decorator==5.1.1
# via ipython
defusedxml==0.7.1 defusedxml==0.7.1
# via langchain-anthropic # via langchain-anthropic
distro==1.9.0 distro==1.9.0
@ -60,9 +56,6 @@ distro==1.9.0
# via openai # via openai
exceptiongroup==1.2.1 exceptiongroup==1.2.1
# via anyio # via anyio
# via ipython
executing==2.0.1
# via stack-data
faiss-cpu==1.8.0 faiss-cpu==1.8.0
# via scrapegraphai # via scrapegraphai
filelock==3.14.0 filelock==3.14.0
@ -72,7 +65,6 @@ free-proxy==1.1.1
frozenlist==1.4.1 frozenlist==1.4.1
# via aiohttp # via aiohttp
# via aiosignal # via aiosignal
fsspec==2024.5.0
fsspec==2024.5.0 fsspec==2024.5.0
# via huggingface-hub # via huggingface-hub
google==3.0.0 google==3.0.0
@ -139,8 +131,6 @@ jmespath==1.0.1
jsonpatch==1.33 jsonpatch==1.33
# via langchain # via langchain
# via langchain-core # via langchain-core
jsonpickle==3.0.4
# via pyvis
jsonpointer==2.4 jsonpointer==2.4
# via jsonpatch # via jsonpatch
langchain==0.1.15 langchain==0.1.15
@ -174,12 +164,8 @@ langsmith==0.1.60
# via langchain-core # via langchain-core
lxml==5.2.2 lxml==5.2.2
# via free-proxy # via free-proxy
markupsafe==2.1.5
# via jinja2
marshmallow==3.21.2 marshmallow==3.21.2
# via dataclasses-json # via dataclasses-json
matplotlib-inline==0.1.7
# via ipython
minify-html==0.15.0 minify-html==0.15.0
# via scrapegraphai # via scrapegraphai
multidict==6.0.5 multidict==6.0.5
@ -187,9 +173,6 @@ multidict==6.0.5
# via yarl # via yarl
mypy-extensions==1.0.0 mypy-extensions==1.0.0
# via typing-inspect # via typing-inspect
networkx==3.3
# via pyvis
# via scrapegraphai
numpy==1.26.4 numpy==1.26.4
# via faiss-cpu # via faiss-cpu
# via langchain # via langchain
@ -206,15 +189,9 @@ packaging==23.2
# via marshmallow # via marshmallow
pandas==2.2.2 pandas==2.2.2
# via scrapegraphai # via scrapegraphai
parso==0.8.4
# via jedi
pexpect==4.9.0
# via ipython
playwright==1.43.0 playwright==1.43.0
# via scrapegraphai # via scrapegraphai
# via undetected-playwright # via undetected-playwright
prompt-toolkit==3.0.43
# via ipython
proto-plus==1.23.0 proto-plus==1.23.0
# via google-ai-generativelanguage # via google-ai-generativelanguage
# via google-api-core # via google-api-core
@ -225,10 +202,6 @@ protobuf==4.25.3
# via googleapis-common-protos # via googleapis-common-protos
# via grpcio-status # via grpcio-status
# via proto-plus # via proto-plus
ptyprocess==0.7.0
# via pexpect
pure-eval==0.2.2
# via stack-data
pyasn1==0.6.0 pyasn1==0.6.0
# via pyasn1-modules # via pyasn1-modules
# via rsa # via rsa
@ -247,8 +220,6 @@ pydantic-core==2.18.2
# via pydantic # via pydantic
pyee==11.1.0 pyee==11.1.0
# via playwright # via playwright
pygments==2.18.0
# via ipython
pyparsing==3.1.2 pyparsing==3.1.2
# via httplib2 # via httplib2
python-dateutil==2.9.0.post0 python-dateutil==2.9.0.post0
@ -258,8 +229,6 @@ python-dotenv==1.0.1
# via scrapegraphai # via scrapegraphai
pytz==2024.1 pytz==2024.1
# via pandas # via pandas
pyvis==0.3.2
# via scrapegraphai
pyyaml==6.0.1 pyyaml==6.0.1
# via huggingface-hub # via huggingface-hub
# via langchain # via langchain
@ -282,7 +251,6 @@ s3transfer==0.10.1
selectolax==0.3.21 selectolax==0.3.21
# via yahoo-search-py # via yahoo-search-py
six==1.16.0 six==1.16.0
# via asttokens
# via python-dateutil # via python-dateutil
sniffio==1.3.1 sniffio==1.3.1
# via anthropic # via anthropic
@ -295,8 +263,6 @@ soupsieve==2.5
sqlalchemy==2.0.30 sqlalchemy==2.0.30
# via langchain # via langchain
# via langchain-community # via langchain-community
stack-data==0.6.3
# via ipython
tenacity==8.3.0 tenacity==8.3.0
# via langchain # via langchain
# via langchain-community # via langchain-community
@ -311,16 +277,12 @@ tqdm==4.66.4
# via huggingface-hub # via huggingface-hub
# via openai # via openai
# via scrapegraphai # via scrapegraphai
traitlets==5.14.3
# via ipython
# via matplotlib-inline
typing-extensions==4.11.0 typing-extensions==4.11.0
# via anthropic # via anthropic
# via anyio # via anyio
# via google-generativeai # via google-generativeai
# via groq # via groq
# via huggingface-hub # via huggingface-hub
# via ipython
# via openai # via openai
# via pydantic # via pydantic
# via pydantic-core # via pydantic-core
@ -335,13 +297,10 @@ undetected-playwright==0.3.0
# via scrapegraphai # via scrapegraphai
uritemplate==4.1.1 uritemplate==4.1.1
# via google-api-python-client # via google-api-python-client
urllib3==2.2.1
urllib3==2.2.1 urllib3==2.2.1
# via botocore # via botocore
# via requests # via requests
# via yahoo-search-py # via yahoo-search-py
wcwidth==0.2.13
# via prompt-toolkit
yahoo-search-py==0.3 yahoo-search-py==0.3
# via scrapegraphai # via scrapegraphai
yarl==1.9.4 yarl==1.9.4