docs(refactor): added proxy-rotation usage and refactor readthedocs

This commit is contained in:
Marco Perini 2024-05-13 11:03:33 +02:00
parent 5d6d996e8f
commit e256b758b2
16 changed files with 398 additions and 230 deletions

BIN
docs/assets/searchgraph.png Normal file
Binary file not shown. Size: 50 KiB

Binary file not shown. Size: 58 KiB

BIN
docs/assets/speechgraph.png Normal file
Binary file not shown. Size: 46 KiB

View File

@ -30,4 +30,3 @@ exclude_patterns = []
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = 'sphinx_rtd_theme' html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']

View File

@ -1,7 +1,9 @@
Examples Examples
======== ========
Here some example of the different ways to scrape with ScrapegraphAI Let's suppose you want to scrape a website to get a list of projects with their descriptions.
You can use the `SmartScraperGraph` class to do that.
The following examples show how to use the `SmartScraperGraph` class with OpenAI models and local models.
OpenAI models OpenAI models
^^^^^^^^^^^^^ ^^^^^^^^^^^^^
@ -78,7 +80,7 @@ After that, you can run the following code, using only your machine resources br
# ************************************************ # ************************************************
smart_scraper_graph = SmartScraperGraph( smart_scraper_graph = SmartScraperGraph(
prompt="List me all the news with their description.", prompt="List me all the projects with their description.",
# also accepts a string with the already downloaded HTML code # also accepts a string with the already downloaded HTML code
source="https://perinim.github.io/projects", source="https://perinim.github.io/projects",
config=graph_config config=graph_config
@ -87,3 +89,4 @@ After that, you can run the following code, using only your machine resources br
result = smart_scraper_graph.run() result = smart_scraper_graph.run()
print(result) print(result)
To find out how you can customize the `graph_config` dictionary, by using a different LLM or adding new parameters, check the `Scrapers` section!

View File

@ -7,26 +7,35 @@ for this project.
Prerequisites Prerequisites
^^^^^^^^^^^^^ ^^^^^^^^^^^^^
- `Python 3.8+ <https://www.python.org/downloads/>`_ - `Python >=3.9,<3.12 <https://www.python.org/downloads/>`_
- `pip <https://pip.pypa.io/en/stable/getting-started/>` - `pip <https://pip.pypa.io/en/stable/getting-started/>`_
- `ollama <https://ollama.com/>` *optional for local models - `Ollama <https://ollama.com/>`_ (optional for local models)
Install the library Install the library
^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^
The library is available on PyPI, so it can be installed using the following command:
.. code-block:: bash .. code-block:: bash
pip install scrapegraphai pip install scrapegraphai
**Note:** It is highly recommended to install the library in a virtual environment (conda, venv, etc.)
If you clone the repository, you can install the library using `poetry <https://python-poetry.org/docs/>`_:
.. code-block:: bash
poetry install
Additionally on Windows when using WSL Additionally on Windows when using WSL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you are using Windows Subsystem for Linux (WSL) and you are facing issues with the installation of the library, you might need to install the following packages:
.. code-block:: bash .. code-block:: bash
sudo apt-get -y install libnss3 libnspr4 libgbm1 libasound2 sudo apt-get -y install libnss3 libnspr4 libgbm1 libasound2
As simple as that! You are now ready to scrape gnamgnamgnam 👿👿👿

View File

@ -3,12 +3,6 @@
You can adapt this file completely to your liking, but it should at least You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive. contain the root `toctree` directive.
Welcome to scrapegraphai-ai's documentation!
=======================================
Here you will find all the information you need to get started.
The following sections will guide you through the installation process and the usage of the library.
.. toctree:: .. toctree::
:maxdepth: 2 :maxdepth: 2
:caption: Introduction :caption: Introduction
@ -22,6 +16,19 @@ The following sections will guide you through the installation process and the u
getting_started/installation getting_started/installation
getting_started/examples getting_started/examples
.. toctree::
:maxdepth: 2
:caption: Scrapers
scrapers/graphs
scrapers/llm
scrapers/graph_config
.. toctree::
:maxdepth: 2
:caption: Modules
modules/modules modules/modules
Indices and tables Indices and tables

View File

@ -2,7 +2,7 @@ Contributing
============ ============
Hey, you want to contribute? Awesome! Hey, you want to contribute? Awesome!
Just fork the repo, make your changes, and send me a pull request. Just fork the repo, make your changes, and send a pull request.
If you're not sure if it's a good idea, open an issue and we'll discuss it. If you're not sure if it's a good idea, open an issue and we'll discuss it.
Go and check out the `contributing guidelines <https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md>`__ for more information. Go and check out the `contributing guidelines <https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md>`__ for more information.

View File

@ -1,20 +1,25 @@
.. image:: ../../assets/scrapegraphai_logo.png
:align: center
:width: 50%
:alt: ScrapegraphAI
Overview Overview
======== ========
In a world where web pages are constantly changing and in a data-hungry world there is a need for a new generation of scrapers, and this is where ScrapegraphAI was born. ScrapeGraphAI is an open-source web scraping Python library designed to usher in a new era of scraping tools.
An opensource library with the aim of starting a new era of scraping tools that are more flexible and require less maintenance by developers, with the use of LLMs. In today's rapidly evolving and data-intensive digital landscape, this library stands out by integrating LLM and
direct graph logic to automate the creation of scraping pipelines for websites and various local documents, including XML,
HTML, JSON, and more.
.. image:: ../../assets/scrapegraphai_logo.png Simply specify the information you need to extract, and ScrapeGraphAI handles the rest,
:align: center providing a more flexible and low-maintenance solution compared to traditional scraping tools.
:width: 100px
:alt: ScrapegraphAI
Why ScrapegraphAI? Why ScrapegraphAI?
================== ==================
ScrapegraphAI in our vision represents a significant step forward in the field of web scraping, offering an open-source solution designed to meet the needs of a constantly evolving web landscape. Here's why ScrapegraphAI stands out: Traditional web scraping tools often rely on fixed patterns or manual configuration to extract data from web pages.
ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention.
Flexibility and Adaptability
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Traditional web scraping tools often rely on fixed patterns or manual configuration to extract data from web pages. ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention.
This flexibility ensures that scrapers remain functional even when website layouts change. This flexibility ensures that scrapers remain functional even when website layouts change.
We support many Large Language Models (LLMs) including GPT, Gemini, Groq, Azure, Hugging Face etc.
as well as local models which can run on your machine using Ollama.

View File

@ -1,6 +1,3 @@
scrapegraphai
=============
.. toctree:: .. toctree::
:maxdepth: 4 :maxdepth: 4

View File

@ -1,29 +0,0 @@
scrapegraphai.graphs package
=====================
Submodules
----------
scrapegraphai.graphs.base\_graph module
--------------------------------
.. automodule:: scrapegraphai.graphs.base_graph
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.graphs.smart\_scraper\_graph module
------------------------------------------
.. automodule:: scrapegraphai.graphs.smart_scraper_graph
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: scrapegraphai.graphs
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,61 +0,0 @@
scrapegraphai.nodes package
====================
Submodules
----------
scrapegraphai.nodes.base\_node module
------------------------------
.. automodule:: scrapegraphai.nodes.base_node
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.nodes.conditional\_node module
-------------------------------------
.. automodule:: scrapegraphai.nodes.conditional_node
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.nodes.fetch\_html\_node module
-------------------------------------
.. automodule:: scrapegraphai.nodes.fetch_html_node
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.nodes.generate\_answer\_node module
------------------------------------------
.. automodule:: scrapegraphai.nodes.generate_answer_node
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.nodes.get\_probable\_tags\_node module
---------------------------------------------
.. automodule:: scrapegraphai.nodes.get_probable_tags_node
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.nodes.parse\_html\_node module
-------------------------------------
.. automodule:: scrapegraphai.nodes.parse_html_node
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: scrapegraphai.nodes
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,110 +0,0 @@
scrapegraphai package
==============
Subpackages
-----------
.. toctree::
:maxdepth: 4
scrapegraphai.graphs
scrapegraphai.nodes
Submodules
----------
scrapegraphai.class\_creator module
----------------------------
.. automodule:: scrapegraphai.class_creator
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.class\_generator module
------------------------------
.. automodule:: scrapegraphai.class_generator
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.convert\_to\_csv module
------------------------------
.. automodule:: scrapegraphai.convert_to_csv
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.convert\_to\_json module
-------------------------------
.. automodule:: scrapegraphai.convert_to_json
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.dictionaries module
--------------------------
.. automodule:: scrapegraphai.dictionaries
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.getter module
--------------------
.. automodule:: scrapegraphai.getter
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.json\_getter module
--------------------------
.. automodule:: scrapegraphai.json_getter
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.pydantic\_class module
-----------------------------
.. automodule:: scrapegraphai.pydantic_class
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.remover module
---------------------
.. automodule:: scrapegraphai.remover
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.request module
---------------------
.. automodule:: scrapegraphai.request
:members:
:undoc-members:
:show-inheritance:
scrapegraphai.token\_calculator module
-------------------------------
.. automodule:: scrapegraphai.token_calculator
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: scrapegraphai
:members:
:undoc-members:
:show-inheritance:

View File

@ -0,0 +1,49 @@
Additional Parameters
=====================
It is possible to customize the behavior of the graphs by setting some configuration options.
Some interesting ones are:
- `verbose`: If set to `True`, some debug information will be printed to the console.
- `headless`: If set to `False`, the web browser will be opened on the URL requested and close right after the HTML is fetched.
- `max_results`: The maximum number of results to be fetched from the search engine. Useful in `SearchGraph`.
- `output_path`: The path where the output files will be saved. Useful in `SpeechGraph`.
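As a rough sketch, the options above could be set together as shown here. We assume they sit at the top level of the configuration dictionary next to `llm`; the values, the `answer.mp3` file name, and the placeholder API key are illustrative only.

```python
# Illustrative values only; adjust them to your use case.
graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",  # placeholder, not a real key
        "model": "gpt-3.5-turbo",
    },
    "verbose": True,              # print debug information to the console
    "headless": False,            # open a visible browser while fetching
    "max_results": 5,             # only relevant for SearchGraph
    "output_path": "answer.mp3",  # only relevant for SpeechGraph (assumed file name)
}

# List the options that are set besides the model configuration.
extra_options = sorted(key for key in graph_config if key != "llm")
print(extra_options)
```

Options that a given graph does not use (for example `output_path` in a `SmartScraperGraph`) are presumably ignored, but we have not verified that here.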
Proxy Rotation
^^^^^^^^^^^^^^
It is possible to rotate proxies by setting the `proxy` option in the graph configuration.
We provide a free proxy service, which is based on the `free-proxy <https://pypi.org/project/free-proxy/>`_ library and can be used as follows:
.. code-block:: python

    graph_config = {
        "llm": {...},
        "loader_kwargs": {
            "proxy": {
                "server": "broker",
                "criteria": {
                    "anonymous": True,
                    "secure": True,
                    "countryset": {"IT"},
                    "timeout": 10.0,
                    "max_shape": 3
                },
            },
        },
    }

Do you have a proxy server? You can use it as follows:
.. code-block:: python

    graph_config = {
        "llm": {...},
        "loader_kwargs": {
            "proxy": {
                "server": "http://your_proxy_server:port",
                "username": "your_username",
                "password": "your_password",
            },
        },
    }

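To avoid hard-coding proxy credentials, the `loader_kwargs` entry could be built from environment variables. This is only a sketch: the helper `build_loader_kwargs` and the variable names `PROXY_SERVER`, `PROXY_USERNAME` and `PROXY_PASSWORD` are hypothetical, and we assume that a reduced `criteria` dictionary is accepted when falling back to the free broker.

```python
import os


def build_loader_kwargs(env=None):
    """Build the `loader_kwargs` entry from environment variables.

    PROXY_SERVER, PROXY_USERNAME and PROXY_PASSWORD are hypothetical
    variable names chosen for this sketch.
    """
    env = os.environ if env is None else env
    server = env.get("PROXY_SERVER")
    if not server:
        # No private proxy configured: fall back to the free broker.
        return {
            "proxy": {
                "server": "broker",
                "criteria": {"anonymous": True, "secure": True, "timeout": 10.0},
            }
        }
    proxy = {"server": server}
    # Credentials are optional; add them only when both are present.
    if env.get("PROXY_USERNAME") and env.get("PROXY_PASSWORD"):
        proxy["username"] = env["PROXY_USERNAME"]
        proxy["password"] = env["PROXY_PASSWORD"]
    return {"proxy": proxy}


graph_config = {
    "llm": {...},  # elided as in the examples above
    "loader_kwargs": build_loader_kwargs(),
}
```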

View File

@ -0,0 +1,109 @@
Graphs
======
Graphs are scraping pipelines aimed at solving specific tasks. They are composed of nodes, which can be configured individually to address different aspects of the task (fetching data, extracting information, etc.).
There are currently three types of graphs available in the library:
- **SmartScraperGraph**: single-page scraper that takes a user-defined prompt and a URL (or local file) and extracts the requested information using an LLM.
- **SearchGraph**: multi-page scraper that only requires a user-defined prompt and extracts information from search engine results using an LLM. It is built on top of SmartScraperGraph.
- **SpeechGraph**: text-to-speech pipeline that generates an answer as well as the requested audio file. It is built on top of SmartScraperGraph and requires a user-defined prompt and a URL (or local file).
**Note:** They all use a graph configuration to set up LLM models and other parameters. To find out more about the configurations, check the `LLM`_ and `Configuration`_ sections.
SmartScraperGraph
^^^^^^^^^^^^^^^^^
.. image:: ../../assets/smartscrapergraph.png
:align: center
:width: 90%
:alt: SmartScraperGraph
|
First we define the graph configuration, which includes the LLM model and other parameters. Then we create an instance of the SmartScraperGraph class, passing the prompt, source, and configuration as arguments. Finally, we run the graph and print the result.
It will fetch the data from the source and extract the information based on the prompt in JSON format.
.. code-block:: python

    from scrapegraphai.graphs import SmartScraperGraph

    graph_config = {
        "llm": {...},
    }

    smart_scraper_graph = SmartScraperGraph(
        prompt="List me all the projects with their descriptions",
        source="https://perinim.github.io/projects",
        config=graph_config
    )

    result = smart_scraper_graph.run()
    print(result)

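Since the result comes back in JSON format, it can be handy to serialize it or store it on disk. The sketch below assumes that `run()` returns a JSON-serializable Python object; the `result` dictionary is a stand-in, and the helpers `result_to_json` and `save_result` are hypothetical names, not part of the library.

```python
import json

# Stand-in for the value returned by smart_scraper_graph.run(); the real
# content depends on the prompt, the page and the model.
result = {
    "projects": [
        {"title": "Example project", "description": "Placeholder description"},
    ]
}


def result_to_json(data):
    """Serialize the result as pretty-printed JSON text."""
    return json.dumps(data, indent=4, ensure_ascii=False)


def save_result(data, path):
    """Write the result to a UTF-8 encoded JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(result_to_json(data))


print(result_to_json(result))
```

Calling `save_result(result, "projects.json")` would then write the same text to a file.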
SearchGraph
^^^^^^^^^^^
.. image:: ../../assets/searchgraph.png
:align: center
:width: 80%
:alt: SearchGraph
|
Similar to SmartScraperGraph, we define the graph configuration, create an instance of the SearchGraph class, and run the graph.
It will create a search query, fetch the first n results from the search engine, run n SmartScraperGraph instances, and return the results in JSON format.
.. code-block:: python

    from scrapegraphai.graphs import SearchGraph

    graph_config = {
        "llm": {...},
        "embeddings": {...},
    }

    # Create the SearchGraph instance
    search_graph = SearchGraph(
        prompt="List me all the traditional recipes from Chioggia",
        config=graph_config
    )

    # Run the graph
    result = search_graph.run()
    print(result)

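To make the flow described above more concrete (build a query, take the first n results, scrape each one), here is a conceptual sketch. It is not the library's implementation: `search_pipeline` and the `fake_*` callables are hypothetical stand-ins so that the example runs without network access or an LLM, and how the real graph merges the n answers is not covered here.

```python
def search_pipeline(prompt, make_query, search, scrape, max_results=3):
    """Run one scraper per search result and collect the answers."""
    query = make_query(prompt)            # 1. turn the prompt into a search query
    urls = search(query)[:max_results]    # 2. keep only the first n results
    # 3. run one scraper per URL, all with the same user prompt
    return [{"url": url, "answer": scrape(prompt, url)} for url in urls]


# Stand-in callables (hypothetical) replacing the LLM, the search engine
# and SmartScraperGraph.
def fake_make_query(prompt):
    return prompt.lower()


def fake_search(query):
    return [f"https://example.com/{i}" for i in range(10)]


def fake_scrape(prompt, url):
    return {"source": url}


results = search_pipeline(
    "List me all the traditional recipes from Chioggia",
    fake_make_query,
    fake_search,
    fake_scrape,
    max_results=2,
)
print(results)
```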
SpeechGraph
^^^^^^^^^^^
.. image:: ../../assets/speechgraph.png
:align: center
:width: 90%
:alt: SpeechGraph
|
Similar to SmartScraperGraph, we define the graph configuration, create an instance of the SpeechGraph class, and run the graph.
It will fetch the data from the source, extract the information based on the prompt, and generate an audio file with the answer, as well as the answer itself, in JSON format.
.. code-block:: python

    from scrapegraphai.graphs import SpeechGraph

    graph_config = {
        "llm": {...},
        "tts_model": {...},
    }

    # ************************************************
    # Create the SpeechGraph instance and run it
    # ************************************************

    speech_graph = SpeechGraph(
        prompt="Make a detailed audio summary of the projects.",
        source="https://perinim.github.io/projects/",
        config=graph_config,
    )

    result = speech_graph.run()
    print(result)


View File

@ -0,0 +1,190 @@
LLM
===
We support many well-known LLMs and providers, which are used to analyze web pages and extract the information requested by the user. Models can be split into **Chat Models** and **Embedding Models** (the latter are mainly used for Retrieval Augmented Generation, or RAG).
These models are specified inside the graph configuration dictionary and can be combined freely, for example by defining a different model for `llm` and `embeddings`.
- **Local Models**: These models are hosted on the local machine and can be used without any API key.
- **API-based Models**: These models are hosted in the cloud and require an API key to access them (e.g. OpenAI, Groq, etc.).
**Note**: If the embedding model is not specified, the library will use the default one for that LLM, if available.
Local Models
------------
Currently, local models are supported through the Ollama integration. Ollama is a provider of LLMs, which can be downloaded from `Ollama <https://ollama.com/>`_.
Let's say we want to use **llama3** as the chat model and **nomic-embed-text** as the embedding model. We first need to pull them from Ollama using:
.. code-block:: bash

    ollama pull llama3
    ollama pull nomic-embed-text

Then we can use them in the graph configuration as follows:
.. code-block:: python

    graph_config = {
        "llm": {
            "model": "llama3",
            "temperature": 0.0,
            "format": "json",
        },
        "embeddings": {
            "model": "nomic-embed-text",
        },
    }

You can also set the **base_url** parameter to specify the model's endpoint. By default, it is set to http://localhost:11434. This is useful if you are running Ollama in a Docker container or on a different machine.
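A minimal sketch of such a configuration follows. Placing `base_url` inside both the `llm` and `embeddings` entries is our assumption, since the text above only names the parameter; the URL is the default mentioned above and would be replaced by the address of your Ollama host.

```python
# Default endpoint mentioned above; replace it with your Docker or remote host.
OLLAMA_BASE_URL = "http://localhost:11434"

graph_config = {
    "llm": {
        "model": "llama3",
        "temperature": 0.0,
        "format": "json",
        "base_url": OLLAMA_BASE_URL,  # assumed placement inside the "llm" entry
    },
    "embeddings": {
        "model": "nomic-embed-text",
        "base_url": OLLAMA_BASE_URL,  # assumed placement inside the "embeddings" entry
    },
}
```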
If you want to host Ollama in a Docker container, you can use the following command:
.. code-block:: bash

    docker-compose up -d
    docker exec -it ollama ollama pull llama3

API-based Models
----------------
OpenAI
^^^^^^
You can get the API key from `here <https://platform.openai.com/api-keys>`_.
.. code-block:: python

    graph_config = {
        "llm": {
            "api_key": "OPENAI_API_KEY",
            "model": "gpt-3.5-turbo",
        },
    }

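Rather than writing the key into the source file, it can be read from the environment. The helper below is a sketch, not part of the library: `openai_config` is a hypothetical name, and we assume the key is stored in an environment variable called `OPENAI_API_KEY`.

```python
import os


def openai_config(model="gpt-3.5-turbo", env=None):
    """Return a graph configuration with the API key read from the environment."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    return {"llm": {"api_key": api_key, "model": model}}
```

With the variable exported in your shell, `graph_config = openai_config()` gives the same dictionary as above, with the real key filled in.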
If you want to use text to speech models, you can specify the `tts_model` parameter:
.. code-block:: python

    graph_config = {
        "llm": {
            "api_key": "OPENAI_API_KEY",
            "model": "gpt-3.5-turbo",
            "temperature": 0.7,
        },
        "tts_model": {
            "api_key": "OPENAI_API_KEY",
            "model": "tts-1",
            "voice": "alloy"
        },
    }

Gemini
^^^^^^
You can get the API key from `here <https://ai.google.dev/gemini-api/docs/api-key>`_.
**Note**: some countries are not supported and therefore it won't be possible to request an API key. A possible workaround is to use a VPN or run the library on Colab.
.. code-block:: python

    graph_config = {
        "llm": {
            "api_key": "GEMINI_API_KEY",
            "model": "gemini-pro"
        },
    }

Groq
^^^^
You can get the API key from `here <https://console.groq.com/keys>`_. Groq doesn't support embedding models, so in the following example we are using an Ollama one.
.. code-block:: python

    graph_config = {
        "llm": {
            "model": "groq/gemma-7b-it",
            "api_key": "GROQ_API_KEY",
            "temperature": 0
        },
        "embeddings": {
            "model": "ollama/nomic-embed-text",
        },
    }

Azure
^^^^^
We can also pass a model instance for the chat model and the embedding model. For Azure, a possible configuration would be:
.. code-block:: python

    # The Azure classes are provided by the langchain-openai package
    from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

    llm_model_instance = AzureChatOpenAI(
        openai_api_version="AZURE_OPENAI_API_VERSION",
        azure_deployment="AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"
    )

    embedder_model_instance = AzureOpenAIEmbeddings(
        azure_deployment="AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME",
        openai_api_version="AZURE_OPENAI_API_VERSION",
    )

    graph_config = {
        "llm": {
            "model_instance": llm_model_instance
        },
        "embeddings": {
            "model_instance": embedder_model_instance
        }
    }

Hugging Face Hub
^^^^^^^^^^^^^^^^
We can also pass a model instance for the chat model and the embedding model. For Hugging Face, a possible configuration would be:
.. code-block:: python

    # Both classes are provided by the langchain-community package
    from langchain_community.llms import HuggingFaceEndpoint
    from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings

    llm_model_instance = HuggingFaceEndpoint(
        repo_id="mistralai/Mistral-7B-Instruct-v0.2",
        max_length=128,
        temperature=0.5,
        token="HUGGINGFACEHUB_API_TOKEN"
    )

    embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(
        api_key="HUGGINGFACEHUB_API_TOKEN",
        model_name="sentence-transformers/all-MiniLM-l6-v2"
    )

    graph_config = {
        "llm": {
            "model_instance": llm_model_instance
        },
        "embeddings": {
            "model_instance": embedder_model_instance
        }
    }

Anthropic
^^^^^^^^^
For Anthropic, the chat model is configured with an API key and a model name, while the embedding model is passed as a model instance. A possible configuration would be:
.. code-block:: python

    # The embedding class is provided by the langchain-community package
    from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings

    embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(
        api_key="HUGGINGFACEHUB_API_TOKEN",
        model_name="sentence-transformers/all-MiniLM-l6-v2"
    )

    graph_config = {
        "llm": {
            "api_key": "ANTHROPIC_API_KEY",
            "model": "claude-3-haiku-20240307",
            "max_tokens": 4000
        },
        "embeddings": {
            "model_instance": embedder_model_instance
        }
    }