docs(refactor): added proxy-rotation usage and refactor readthedocs

2026-06-25 21:11:11 +08:00 · 2024-05-13 11:03:33 +02:00 · 2024-05-13 11:03:33 +02:00 · e256b758b2
commit e256b758b2
parent 5d6d996e8f
16 changed files with 398 additions and 230 deletions
--- a/docs/assets/searchgraph.png
+++ b/docs/assets/searchgraph.png
--- a/docs/assets/smartscrapergraph.png
+++ b/docs/assets/smartscrapergraph.png
--- a/docs/assets/speechgraph.png
+++ b/docs/assets/speechgraph.png
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -30,4 +30,3 @@ exclude_patterns = []
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
 html_theme = 'sphinx_rtd_theme'
 html_static_path = ['_static']
--- a/docs/source/getting_started/examples.rst
+++ b/docs/source/getting_started/examples.rst
@@ -1,7 +1,9 @@
 Examples
 ========
-Here some example of the different ways to scrape with ScrapegraphAI
+Let's suppose you want to scrape a website to get a list of projects with their descriptions.
 You can use the `SmartScraperGraph` class to do that.
 The following examples show how to use the `SmartScraperGraph` class with OpenAI models and local models.
 OpenAI models
 ^^^^^^^^^^^^^
@@ -78,7 +80,7 @@ After that, you can run the following code, using only your machine resources br
   # ************************************************
   smart_scraper_graph = SmartScraperGraph(
-      prompt="List me all the news with their description.",
+      prompt="List me all the projects with their description.",
      # also accepts a string with the already downloaded HTML code
      source="https://perinim.github.io/projects",
      config=graph_config
@@ -87,3 +89,4 @@ After that, you can run the following code, using only your machine resources br
   result = smart_scraper_graph.run()
   print(result)
 To find out how you can customize the `graph_config` dictionary by using different LLMs and adding new parameters, check the `Scrapers` section!
--- a/docs/source/getting_started/installation.rst
+++ b/docs/source/getting_started/installation.rst
@@ -7,26 +7,35 @@ for this project.
 Prerequisites
 ^^^^^^^^^^^^^
- `Python 3.8+ <https://www.python.org/downloads/>`_
+- `Python >=3.9,<3.12 <https://www.python.org/downloads/>`_
- `pip <https://pip.pypa.io/en/stable/getting-started/>`
+- `pip <https://pip.pypa.io/en/stable/getting-started/>`_
- `ollama <https://ollama.com/>` *optional for local models 
+- `Ollama <https://ollama.com/>`_ (optional for local models)
 Install the library
 ^^^^^^^^^^^^^^^^^^^^
 The library is available on PyPI, so it can be installed using the following command:
 .. code-block:: bash
   pip install scrapegraphai
 **Note:** It is highly recommended to install the library in a virtual environment (conda, venv, etc.)
 If you clone the repository, you can install the library using `poetry <https://python-poetry.org/docs/>`_:
 .. code-block:: bash
   poetry install
 Additionally on Windows when using WSL
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 If you are using Windows Subsystem for Linux (WSL) and you are facing issues with the installation of the library, you might need to install the following packages:
 .. code-block:: bash
   sudo apt-get -y install libnss3 libnspr4 libgbm1 libasound2
 As simple as that! You are now ready to scrape gnamgnamgnam 👿👿👿
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -3,12 +3,6 @@
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.
 Welcome to scrapegraphai-ai's documentation!
 =======================================
 Here you will find all the information you need to get started.
 The following sections will guide you through the installation process and the usage of the library.
 .. toctree::
   :maxdepth: 2
   :caption: Introduction
@@ -22,6 +16,19 @@ The following sections will guide you through the installation process and the u
   getting_started/installation
   getting_started/examples
 .. toctree::
   :maxdepth: 2
   :caption: Scrapers
   scrapers/graphs
   scrapers/llm
   scrapers/graph_config
 .. toctree::
   :maxdepth: 2
   :caption: Modules
   modules/modules
 Indices and tables
--- a/docs/source/introduction/contributing.rst
+++ b/docs/source/introduction/contributing.rst
@@ -2,7 +2,7 @@ Contributing
 ============
 Hey, you want to contribute? Awesome!
-Just fork the repo, make your changes, and send me a pull request.
+Just fork the repo, make your changes, and send a pull request.
 If you're not sure if it's a good idea, open an issue and we'll discuss it.
 Go and check out the `contributing guidelines <https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md>`__ for more information.
--- a/docs/source/introduction/overview.rst
+++ b/docs/source/introduction/overview.rst
@@ -1,20 +1,25 @@
 .. image:: ../../assets/scrapegraphai_logo.png
   :align: center
   :width: 50%
   :alt: ScrapegraphAI
 Overview 
 ========
-In a world where web pages are constantly changing and in a data-hungry world there is a need for a new generation of scrapers, and this is where ScrapegraphAI was born. 
+ScrapeGraphAI is an open-source web scraping Python library designed to usher in a new era of scraping tools.
-An opensource library with the aim of starting a new era of scraping tools that are more flexible and require less maintenance by developers, with the use of LLMs.
+In today's rapidly evolving and data-intensive digital landscape, this library stands out by integrating LLMs and
 direct graph logic to automate the creation of scraping pipelines for websites and various local documents, including XML,
 HTML, JSON, and more.
-.. image:: ../../assets/scrapegraphai_logo.png
+Simply specify the information you need to extract, and ScrapeGraphAI handles the rest,
-   :align: center
+providing a more flexible and low-maintenance solution compared to traditional scraping tools.
   :width: 100px
   :alt: ScrapegraphAI
 Why ScrapegraphAI?
 ==================
-ScrapegraphAI in our vision represents a significant step forward in the field of web scraping, offering an open-source solution designed to meet the needs of a constantly evolving web landscape. Here's why ScrapegraphAI stands out:
+Traditional web scraping tools often rely on fixed patterns or manual configuration to extract data from web pages.
-
+ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention. 
 Flexibility and Adaptability
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 Traditional web scraping tools often rely on fixed patterns or manual configuration to extract data from web pages. ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention. 
 This flexibility ensures that scrapers remain functional even when website layouts change.
 We support many Large Language Models (LLMs) including GPT, Gemini, Groq, Azure, Hugging Face etc.
 as well as local models which can run on your machine using Ollama.
--- a/docs/source/modules/modules.rst
+++ b/docs/source/modules/modules.rst
@@ -1,6 +1,3 @@
 scrapegraphai
 =============
 .. toctree::
   :maxdepth: 4
--- a/docs/source/modules/yosoai.graphs.rst
+++ b/docs/source/modules/yosoai.graphs.rst
@@ -1,29 +0,0 @@
 scrapegraphai.graphs package
 =====================
 Submodules
 ----------
 scrapegraphai.graphs.base\_graph module
 --------------------------------
 .. automodule:: scrapegraphai.graphs.base_graph
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.graphs.smart\_scraper\_graph module
 ------------------------------------------
 .. automodule:: scrapegraphai.graphs.smart_scraper_graph
   :members:
   :undoc-members:
   :show-inheritance:
 Module contents
 ---------------
 .. automodule:: scrapegraphai.graphs
   :members:
   :undoc-members:
   :show-inheritance:
--- a/docs/source/modules/yosoai.nodes.rst
+++ b/docs/source/modules/yosoai.nodes.rst
@@ -1,61 +0,0 @@
 scrapegraphai.nodes package
 ====================
 Submodules
 ----------
 scrapegraphai.nodes.base\_node module
 ------------------------------
 .. automodule:: scrapegraphai.nodes.base_node
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.nodes.conditional\_node module
 -------------------------------------
 .. automodule:: scrapegraphai.nodes.conditional_node
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.nodes.fetch\_html\_node module
 -------------------------------------
 .. automodule:: scrapegraphai.nodes.fetch_html_node
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.nodes.generate\_answer\_node module
 ------------------------------------------
 .. automodule:: scrapegraphai.nodes.generate_answer_node
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.nodes.get\_probable\_tags\_node module
 ---------------------------------------------
 .. automodule:: scrapegraphai.nodes.get_probable_tags_node
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.nodes.parse\_html\_node module
 -------------------------------------
 .. automodule:: scrapegraphai.nodes.parse_html_node
   :members:
   :undoc-members:
   :show-inheritance:
 Module contents
 ---------------
 .. automodule:: scrapegraphai.nodes
   :members:
   :undoc-members:
   :show-inheritance:
--- a/docs/source/modules/yosoai.rst
+++ b/docs/source/modules/yosoai.rst
@@ -1,110 +0,0 @@
 scrapegraphai package
 ==============
 Subpackages
 -----------
 .. toctree::
   :maxdepth: 4
   scrapegraphai.graphs
   scrapegraphai.nodes
 Submodules
 ----------
 scrapegraphai.class\_creator module
 ----------------------------
 .. automodule:: scrapegraphai.class_creator
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.class\_generator module
 ------------------------------
 .. automodule:: scrapegraphai.class_generator
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.convert\_to\_csv module
 ------------------------------
 .. automodule:: scrapegraphai.convert_to_csv
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.convert\_to\_json module
 -------------------------------
 .. automodule:: scrapegraphai.convert_to_json
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.dictionaries module
 --------------------------
 .. automodule:: scrapegraphai.dictionaries
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.getter module
 --------------------
 .. automodule:: scrapegraphai.getter
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.json\_getter module
 --------------------------
 .. automodule:: scrapegraphai.json_getter
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.pydantic\_class module
 -----------------------------
 .. automodule:: scrapegraphai.pydantic_class
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.remover module
 ---------------------
 .. automodule:: scrapegraphai.remover
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.request module
 ---------------------
 .. automodule:: scrapegraphai.request
   :members:
   :undoc-members:
   :show-inheritance:
 scrapegraphai.token\_calculator module
 -------------------------------
 .. automodule:: scrapegraphai.token_calculator
   :members:
   :undoc-members:
   :show-inheritance:
 Module contents
 ---------------
 .. automodule:: scrapegraphai
   :members:
   :undoc-members:
   :show-inheritance:
--- a/docs/source/scrapers/graph_config.rst
+++ b/docs/source/scrapers/graph_config.rst
@@ -0,0 +1,49 @@
 Additional Parameters
 =====================
 It is possible to customize the behavior of the graphs by setting some configuration options.
 Some interesting ones are:
 - `verbose`: If set to `True`, some debug information will be printed to the console.
 - `headless`: If set to `False`, the web browser will open at the requested URL and close right after the HTML is fetched.
 - `max_results`: The maximum number of results to be fetched from the search engine. Useful in `SearchGraph`.
 - `output_path`: The path where the output files will be saved. Useful in `SpeechGraph`.
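As a minimal sketch, the options listed above could be combined in a single configuration dictionary. The values are illustrative, the `"llm"` entry is only a placeholder model choice, and placing these keys at the top level of `graph_config` is an assumption rather than something stated above:

```python
# Illustrative values only; the "llm" entry is a placeholder model choice.
graph_config = {
    "llm": {"model": "llama3"},
    "verbose": True,                     # print debug information to the console
    "headless": False,                   # show the browser window while fetching
    "max_results": 5,                    # assumed value; useful in SearchGraph
    "output_path": "audio_summary.mp3",  # hypothetical file name; useful in SpeechGraph
}
```

Options that a given graph does not use, such as `output_path` outside `SpeechGraph`, can simply be left out.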
 Proxy Rotation
 ^^^^^^^^^^^^^^
 It is possible to rotate the proxy by setting the `proxy` option, under `loader_kwargs`, in the graph configuration.
 We provide a free proxy service, based on the `free-proxy <https://pypi.org/project/free-proxy/>`_ library, which can be used as follows:
 .. code-block:: python
    graph_config = {
        "llm":{...},
        "loader_kwargs": {
            "proxy" : {
                "server": "broker",
                "criteria": {
                    "anonymous": True,
                    "secure": True,
                    "countryset": {"IT"},
                    "timeout": 10.0,
                    "max_shape": 3
                },
            },
        },
    }
 Do you have a proxy server? You can use it as follows:
 .. code-block:: python
    graph_config = {
        "llm":{...},
        "loader_kwargs": {
            "proxy" : {
                "server": "http://your_proxy_server:port",
                "username": "your_username",
                "password": "your_password",
            },
        },
    }
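To keep proxy credentials out of source code, they can be read from environment variables. The following is only a sketch: the helper `build_proxy_config` and the variable names `PROXY_SERVER`, `PROXY_USERNAME`, and `PROXY_PASSWORD` are hypothetical and not part of the library; only the resulting dictionary layout follows the example above.

```python
import os


def build_proxy_config(server, username=None, password=None):
    """Return a graph_config fragment with the proxy layout shown above.

    Hypothetical helper, not part of the library.
    """
    proxy = {"server": server}
    # Only include credentials when both are provided.
    if username and password:
        proxy["username"] = username
        proxy["password"] = password
    return {"loader_kwargs": {"proxy": proxy}}


graph_config = {
    "llm": {"model": "llama3"},  # placeholder model choice
    **build_proxy_config(
        os.environ.get("PROXY_SERVER", "http://your_proxy_server:port"),
        os.environ.get("PROXY_USERNAME"),
        os.environ.get("PROXY_PASSWORD"),
    ),
}
```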
--- a/docs/source/scrapers/graphs.rst
+++ b/docs/source/scrapers/graphs.rst
@@ -0,0 +1,109 @@
 Graphs
 ======
 Graphs are scraping pipelines aimed at solving specific tasks. They are composed of nodes which can be configured individually to address different aspects of the task (fetching data, extracting information, etc.).
 There are currently three types of graphs available in the library:
 - **SmartScraperGraph**: one-page scraper that requires a user-defined prompt and a URL (or local file), and extracts the requested information using an LLM.
 - **SearchGraph**: multi-page scraper that only requires a user-defined prompt to extract information from search engine results using an LLM. It is built on top of SmartScraperGraph.
 - **SpeechGraph**: text-to-speech pipeline that generates an answer as well as a requested audio file. It is built on top of SmartScraperGraph and requires a user-defined prompt and a URL (or local file).
 **Note:** they all use a graph configuration to set up LLMs and other parameters. To find out more about the configurations, check the :doc:`llm` and :doc:`graph_config` sections.
 SmartScraperGraph
 ^^^^^^^^^^^^^^^^^
 .. image:: ../../assets/smartscrapergraph.png
   :align: center
   :width: 90%
   :alt: SmartScraperGraph
 |
 First we define the graph configuration, which includes the LLM model and other parameters. Then we create an instance of the SmartScraperGraph class, passing the prompt, source, and configuration as arguments. Finally, we run the graph and print the result.
 It will fetch the data from the source and extract the information based on the prompt in JSON format.
 .. code-block:: python
   from scrapegraphai.graphs import SmartScraperGraph
   graph_config = {
      "llm": {...},
   }
   smart_scraper_graph = SmartScraperGraph(
      prompt="List me all the projects with their descriptions",
      source="https://perinim.github.io/projects",
      config=graph_config
   )
   result = smart_scraper_graph.run()
   print(result)
 SearchGraph
 ^^^^^^^^^^^
 .. image:: ../../assets/searchgraph.png
   :align: center
   :width: 80%
   :alt: SearchGraph
 |
 Similar to SmartScraperGraph, we define the graph configuration, create an instance of the SearchGraph class, and run the graph.
 It will create a search query, fetch the first n results from the search engine, run n SmartScraperGraph instances, and return the results in JSON format.
 .. code-block:: python
   from scrapegraphai.graphs import SearchGraph
   graph_config = {
      "llm": {...},
      "embeddings": {...},
   }
   # Create the SearchGraph instance
   search_graph = SearchGraph(
      prompt="List me all the traditional recipes from Chioggia",
      config=graph_config
   )
   # Run the graph
   result = search_graph.run()
   print(result)
 SpeechGraph
 ^^^^^^^^^^^
 .. image:: ../../assets/speechgraph.png
   :align: center
   :width: 90%
   :alt: SpeechGraph
 |
 Similar to SmartScraperGraph, we define the graph configuration, create an instance of the SpeechGraph class, and run the graph.
 It will fetch the data from the source, extract the information based on the prompt, and generate an audio file with the answer, as well as the answer itself, in JSON format.
 .. code-block:: python
   from scrapegraphai.graphs import SpeechGraph
   graph_config = {
      "llm": {...},
      "tts_model": {...},
   }
   # ************************************************
   # Create the SpeechGraph instance and run it
   # ************************************************
   speech_graph = SpeechGraph(
      prompt="Make a detailed audio summary of the projects.",
      source="https://perinim.github.io/projects/",
      config=graph_config,
   )
   result = speech_graph.run()
   print(result)
--- a/docs/source/scrapers/llm.rst
+++ b/docs/source/scrapers/llm.rst
@@ -0,0 +1,190 @@
 LLM
 ===
 We support many known LLM models and providers used to analyze the web pages and extract the information requested by the user. Models can be split into **Chat Models** and **Embedding Models** (the latter are mainly used for Retrieval Augmented Generation, RAG).
 These models are specified inside the graph configuration dictionary and can be used interchangeably, for example by defining a different model for llm and embeddings.
 - **Local Models**: These models are hosted on the local machine and can be used without any API key.
 - **API-based Models**: These models are hosted on the cloud and require an API key to access them (e.g. OpenAI, Groq, etc.).
 **Note**: If the embedding model is not specified, the library will use the default one for that LLM, if available.
 Local Models
 ------------
 Currently, local models are supported through Ollama integration. Ollama is a provider of LLM models which can be downloaded from `Ollama <https://ollama.com/>`_.
 Let's say we want to use **llama3** as chat model and **nomic-embed-text** as embedding model. We first need to pull them from Ollama using:
 .. code-block:: bash
   ollama pull llama3
   ollama pull nomic-embed-text
 Then we can use them in the graph configuration as follows:
 .. code-block:: python
    graph_config = {
        "llm": {
            "model": "llama3",
            "temperature": 0.0,
            "format": "json",
        },
        "embeddings": {
            "model": "nomic-embed-text",
        },
    }
 You can also set the **base_url** parameter to point to the model's endpoint. By default, it is set to http://localhost:11434. This is useful if you are running Ollama in a Docker container or on a different machine.
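As a sketch of how `base_url` might be set, the configuration below reuses the models from the previous example and points both at the default endpoint. Placing `base_url` inside each model dictionary is an assumption here, and `OLLAMA_BASE_URL` is just an illustrative variable name:

```python
# Default Ollama endpoint; replace with the address of your container or remote host.
OLLAMA_BASE_URL = "http://localhost:11434"

graph_config = {
    "llm": {
        "model": "llama3",
        "temperature": 0.0,
        "format": "json",
        "base_url": OLLAMA_BASE_URL,  # assumed location of the parameter
    },
    "embeddings": {
        "model": "nomic-embed-text",
        "base_url": OLLAMA_BASE_URL,  # assumed location of the parameter
    },
}
```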
 If you want to host Ollama in a Docker container, you can use the following command:
 .. code-block:: bash
    docker-compose up -d
    docker exec -it ollama ollama pull llama3
 API-based Models
 ----------------
 OpenAI
 ^^^^^^
 You can get the API key from `here <https://platform.openai.com/api-keys>`_.
 .. code-block:: python
    graph_config = {
        "llm": {
            "api_key": "OPENAI_API_KEY",
            "model": "gpt-3.5-turbo",
        },
    }
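The `"OPENAI_API_KEY"` string above is a placeholder for the actual key. One common way to avoid hard-coding it, sketched here on the assumption that the key has been exported as an environment variable of the same name, is to read it at runtime:

```python
import os

# Read the key from the environment; falls back to an empty string if it is unset.
openai_key = os.environ.get("OPENAI_API_KEY", "")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
}
```

The same pattern applies to the other API-based providers in this section.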
 If you want to use text to speech models, you can specify the `tts_model` parameter:
 .. code-block:: python
    graph_config = {
        "llm": {
            "api_key": "OPENAI_API_KEY",
            "model": "gpt-3.5-turbo",
            "temperature": 0.7,
        },
        "tts_model": {
            "api_key": "OPENAI_API_KEY",
            "model": "tts-1",
            "voice": "alloy"
        },
    }
 Gemini
 ^^^^^^
 You can get the API key from `here <https://ai.google.dev/gemini-api/docs/api-key>`_.
 **Note**: some countries are not supported and therefore it won't be possible to request an API key. A possible workaround is to use a VPN or run the library on Colab.
 .. code-block:: python
    graph_config = {
        "llm": {
            "api_key": "GEMINI_API_KEY",
            "model": "gemini-pro"
        },
    }
 Groq
 ^^^^
 You can get the API key from `here <https://console.groq.com/keys>`_. Groq doesn't support embedding models, so the following example uses an Ollama embedding model instead.
 .. code-block:: python
    graph_config = {
        "llm": {
            "model": "groq/gemma-7b-it",
            "api_key": "GROQ_API_KEY",
            "temperature": 0
        },
        "embeddings": {
            "model": "ollama/nomic-embed-text",
        },
    }
 Azure
 ^^^^^
 We can also pass a model instance for the chat model and the embedding model. For Azure, a possible configuration would be:
 .. code-block:: python
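    # Requires the langchain-openai package
    from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
    llm_model_instance = AzureChatOpenAI(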
        openai_api_version="AZURE_OPENAI_API_VERSION",
        azure_deployment="AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"
    )
    embedder_model_instance = AzureOpenAIEmbeddings(
        azure_deployment="AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME",
        openai_api_version="AZURE_OPENAI_API_VERSION",
    )
    graph_config = {
        "llm": {
            "model_instance": llm_model_instance
        },
        "embeddings": {
            "model_instance": embedder_model_instance
        }
    }
 Hugging Face Hub
 ^^^^^^^^^^^^^^^^
 We can also pass a model instance for the chat model and the embedding model. For Hugging Face, a possible configuration would be:
 .. code-block:: python
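    # Requires the langchain-community package
    from langchain_community.llms import HuggingFaceEndpoint
    from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
    llm_model_instance = HuggingFaceEndpoint(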
        repo_id="mistralai/Mistral-7B-Instruct-v0.2",
        max_length=128,
        temperature=0.5,
        token="HUGGINGFACEHUB_API_TOKEN"
    )
    embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(
        api_key="HUGGINGFACEHUB_API_TOKEN",
        model_name="sentence-transformers/all-MiniLM-l6-v2"
    )
    graph_config = {
        "llm": {
            "model_instance": llm_model_instance
        },
        "embeddings": {
            "model_instance": embedder_model_instance
        }
    }
 Anthropic
 ^^^^^^^^^
 For Anthropic, a possible configuration specifies the chat model by name and passes a model instance for the embedding model (here, one from Hugging Face):
 .. code-block:: python
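    # Requires the langchain-community package
    from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
    embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(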
        api_key="HUGGINGFACEHUB_API_TOKEN",
        model_name="sentence-transformers/all-MiniLM-l6-v2"
    )
    graph_config = {
        "llm": {
            "api_key": "ANTHROPIC_API_KEY",
            "model": "claude-3-haiku-20240307",
            "max_tokens": 4000
        },
        "embeddings": {
            "model_instance": embedder_model_instance
        }
    }