diff --git a/docs/assets/searchgraph.png b/docs/assets/searchgraph.png new file mode 100644 index 00000000..f57c652e Binary files /dev/null and b/docs/assets/searchgraph.png differ diff --git a/docs/assets/smartscrapergraph.png b/docs/assets/smartscrapergraph.png new file mode 100644 index 00000000..021531a3 Binary files /dev/null and b/docs/assets/smartscrapergraph.png differ diff --git a/docs/assets/speechgraph.png b/docs/assets/speechgraph.png new file mode 100644 index 00000000..70b13062 Binary files /dev/null and b/docs/assets/speechgraph.png differ diff --git a/docs/source/conf.py b/docs/source/conf.py index 8c46d4c2..3f323d6a 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -30,4 +30,3 @@ exclude_patterns = [] # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output html_theme = 'sphinx_rtd_theme' -html_static_path = ['_static'] diff --git a/docs/source/getting_started/examples.rst b/docs/source/getting_started/examples.rst index b6e2eb36..b406f7b3 100644 --- a/docs/source/getting_started/examples.rst +++ b/docs/source/getting_started/examples.rst @@ -1,7 +1,9 @@ Examples ======== -Here some example of the different ways to scrape with ScrapegraphAI +Let's suppose you want to scrape a website to get a list of projects with their descriptions. +You can use the `SmartScraperGraph` class to do that. +The following examples show how to use the `SmartScraperGraph` class with OpenAI models and local models. 
OpenAI models ^^^^^^^^^^^^^ @@ -78,7 +80,7 @@ After that, you can run the following code, using only your machine resources br # ************************************************ smart_scraper_graph = SmartScraperGraph( - prompt="List me all the news with their description.", + prompt="List me all the projects with their description.", # also accepts a string with the already downloaded HTML code source="https://perinim.github.io/projects", config=graph_config @@ -87,3 +89,4 @@ After that, you can run the following code, using only your machine resources br result = smart_scraper_graph.run() print(result) +To find out how you can customize the `graph_config` dictionary by using different LLMs and adding new parameters, check the `Scrapers` section! \ No newline at end of file diff --git a/docs/source/getting_started/installation.rst b/docs/source/getting_started/installation.rst index 3bca044b..3e40f1c3 100644 --- a/docs/source/getting_started/installation.rst +++ b/docs/source/getting_started/installation.rst @@ -7,26 +7,35 @@ for this project. Prerequisites ^^^^^^^^^^^^^ -- `Python 3.8+ `_ -- `pip ` -- `ollama ` *optional for local models +- `Python >=3.9,<3.12 `_ +- `pip `_ +- `Ollama `_ (optional for local models) Install the library ^^^^^^^^^^^^^^^^^^^^ +The library is available on PyPI, so it can be installed using the following command: + .. code-block:: bash pip install scrapegraphai +**Note:** It is highly recommended to install the library in a virtual environment (conda, venv, etc.). + +If you clone the repository, you can install the library using `poetry `_: + +.. code-block:: bash + + poetry install + Additionally on Windows when using WSL ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +If you are using Windows Subsystem for Linux (WSL) and you are facing issues with the installation of the library, you might need to install the following packages: + .. code-block:: bash sudo apt-get -y install libnss3 libnspr4 libgbm1 libasound2 -As simple as that! 
You are now ready to scrape gnamgnamgnam 👿👿👿 - - diff --git a/docs/source/index.rst b/docs/source/index.rst index 712bb7c3..ab0c6180 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -3,12 +3,6 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. -Welcome to scrapegraphai-ai's documentation! -======================================= - -Here you will find all the information you need to get started. -The following sections will guide you through the installation process and the usage of the library. - .. toctree:: :maxdepth: 2 :caption: Introduction @@ -22,6 +16,19 @@ The following sections will guide you through the installation process and the u getting_started/installation getting_started/examples + +.. toctree:: + :maxdepth: 2 + :caption: Scrapers + + scrapers/graphs + scrapers/llm + scrapers/graph_config + +.. toctree:: + :maxdepth: 2 + :caption: Modules + modules/modules Indices and tables diff --git a/docs/source/introduction/contributing.rst b/docs/source/introduction/contributing.rst index dd0d529a..75f5adab 100644 --- a/docs/source/introduction/contributing.rst +++ b/docs/source/introduction/contributing.rst @@ -2,7 +2,7 @@ Contributing ============ Hey, you want to contribute? Awesome! -Just fork the repo, make your changes, and send me a pull request. +Just fork the repo, make your changes, and send a pull request. If you're not sure if it's a good idea, open an issue and we'll discuss it. Go and check out the `contributing guidelines `__ for more information. diff --git a/docs/source/introduction/overview.rst b/docs/source/introduction/overview.rst index 46ed21a5..ffb0a5b3 100644 --- a/docs/source/introduction/overview.rst +++ b/docs/source/introduction/overview.rst @@ -1,20 +1,25 @@ +.. 
image:: ../../assets/scrapegraphai_logo.png + :align: center + :width: 50% + :alt: ScrapegraphAI + Overview ======== -In a world where web pages are constantly changing and in a data-hungry world there is a need for a new generation of scrapers, and this is where ScrapegraphAI was born. -An opensource library with the aim of starting a new era of scraping tools that are more flexible and require less maintenance by developers, with the use of LLMs. +ScrapeGraphAI is an open-source web scraping Python library designed to usher in a new era of scraping tools. +In today's rapidly evolving and data-intensive digital landscape, this library stands out by integrating LLMs and +direct graph logic to automate the creation of scraping pipelines for websites and various local documents, including XML, +HTML, JSON, and more. -.. image:: ../../assets/scrapegraphai_logo.png - :align: center - :width: 100px - :alt: ScrapegraphAI +Simply specify the information you need to extract, and ScrapeGraphAI handles the rest, +providing a more flexible and low-maintenance solution compared to traditional scraping tools. Why ScrapegraphAI? ================== -ScrapegraphAI in our vision represents a significant step forward in the field of web scraping, offering an open-source solution designed to meet the needs of a constantly evolving web landscape. Here's why ScrapegraphAI stands out: - -Flexibility and Adaptability -^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Traditional web scraping tools often rely on fixed patterns or manual configuration to extract data from web pages. ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention. +Traditional web scraping tools often rely on fixed patterns or manual configuration to extract data from web pages. +ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention. 
This flexibility ensures that scrapers remain functional even when website layouts change. + +We support many Large Language Models (LLMs) including GPT, Gemini, Groq, Azure, Hugging Face etc. +as well as local models which can run on your machine using Ollama. \ No newline at end of file diff --git a/docs/source/modules/modules.rst b/docs/source/modules/modules.rst index eaa8b0f6..f22d1cea 100644 --- a/docs/source/modules/modules.rst +++ b/docs/source/modules/modules.rst @@ -1,6 +1,3 @@ -scrapegraphai -============= - .. toctree:: :maxdepth: 4 diff --git a/docs/source/modules/yosoai.graphs.rst b/docs/source/modules/yosoai.graphs.rst deleted file mode 100644 index 5d096474..00000000 --- a/docs/source/modules/yosoai.graphs.rst +++ /dev/null @@ -1,29 +0,0 @@ -scrapegraphai.graphs package -===================== - -Submodules ----------- - -scrapegraphai.graphs.base\_graph module --------------------------------- - -.. automodule:: scrapegraphai.graphs.base_graph - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.graphs.smart\_scraper\_graph module ------------------------------------------- - -.. automodule:: scrapegraphai.graphs.smart_scraper_graph - :members: - :undoc-members: - :show-inheritance: - -Module contents ---------------- - -.. automodule:: scrapegraphai.graphs - :members: - :undoc-members: - :show-inheritance: diff --git a/docs/source/modules/yosoai.nodes.rst b/docs/source/modules/yosoai.nodes.rst deleted file mode 100644 index 167f83fa..00000000 --- a/docs/source/modules/yosoai.nodes.rst +++ /dev/null @@ -1,61 +0,0 @@ -scrapegraphai.nodes package -==================== - -Submodules ----------- - -scrapegraphai.nodes.base\_node module ------------------------------- - -.. automodule:: scrapegraphai.nodes.base_node - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.nodes.conditional\_node module -------------------------------------- - -.. 
automodule:: scrapegraphai.nodes.conditional_node - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.nodes.fetch\_html\_node module -------------------------------------- - -.. automodule:: scrapegraphai.nodes.fetch_html_node - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.nodes.generate\_answer\_node module ------------------------------------------- - -.. automodule:: scrapegraphai.nodes.generate_answer_node - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.nodes.get\_probable\_tags\_node module ---------------------------------------------- - -.. automodule:: scrapegraphai.nodes.get_probable_tags_node - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.nodes.parse\_html\_node module -------------------------------------- - -.. automodule:: scrapegraphai.nodes.parse_html_node - :members: - :undoc-members: - :show-inheritance: - -Module contents ---------------- - -.. automodule:: scrapegraphai.nodes - :members: - :undoc-members: - :show-inheritance: diff --git a/docs/source/modules/yosoai.rst b/docs/source/modules/yosoai.rst deleted file mode 100644 index 43251cb3..00000000 --- a/docs/source/modules/yosoai.rst +++ /dev/null @@ -1,110 +0,0 @@ -scrapegraphai package -============== - -Subpackages ------------ - -.. toctree:: - :maxdepth: 4 - - scrapegraphai.graphs - scrapegraphai.nodes - -Submodules ----------- - -scrapegraphai.class\_creator module ----------------------------- - -.. automodule:: scrapegraphai.class_creator - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.class\_generator module ------------------------------- - -.. automodule:: scrapegraphai.class_generator - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.convert\_to\_csv module ------------------------------- - -.. automodule:: scrapegraphai.convert_to_csv - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.convert\_to\_json module -------------------------------- - -.. 
automodule:: scrapegraphai.convert_to_json - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.dictionaries module --------------------------- - -.. automodule:: scrapegraphai.dictionaries - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.getter module --------------------- - -.. automodule:: scrapegraphai.getter - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.json\_getter module --------------------------- - -.. automodule:: scrapegraphai.json_getter - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.pydantic\_class module ------------------------------ - -.. automodule:: scrapegraphai.pydantic_class - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.remover module ---------------------- - -.. automodule:: scrapegraphai.remover - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.request module ---------------------- - -.. automodule:: scrapegraphai.request - :members: - :undoc-members: - :show-inheritance: - -scrapegraphai.token\_calculator module -------------------------------- - -.. automodule:: scrapegraphai.token_calculator - :members: - :undoc-members: - :show-inheritance: - -Module contents ---------------- - -.. automodule:: scrapegraphai - :members: - :undoc-members: - :show-inheritance: diff --git a/docs/source/scrapers/graph_config.rst b/docs/source/scrapers/graph_config.rst new file mode 100644 index 00000000..a5ade9c5 --- /dev/null +++ b/docs/source/scrapers/graph_config.rst @@ -0,0 +1,49 @@ +Additional Parameters +===================== + +It is possible to customize the behavior of the graphs by setting some configuration options. +Some interesting ones are: + +- `verbose`: If set to `True`, some debug information will be printed to the console. +- `headless`: If set to `False`, the web browser will be opened on the URL requested and close right after the HTML is fetched. 
+- `max_results`: The maximum number of results to be fetched from the search engine. Useful in `SearchGraph`. +- `output_path`: The path where the output files will be saved. Useful in `SpeechGraph`. + +Proxy Rotation +^^^^^^^^^^^^^^ + +It is possible to rotate the proxy by setting the `proxy` option in the graph configuration. +We provide a free proxy service which is based on the `free-proxy `_ library and can be used as follows: + +.. code-block:: python + + graph_config = { + "llm":{...}, + "loader_kwargs": { + "proxy" : { + "server": "broker", + "criteria": { + "anonymous": True, + "secure": True, + "countryset": {"IT"}, + "timeout": 10.0, + "max_shape": 3 + }, + }, + }, + } + +Do you have a proxy server? You can use it as follows: + +.. code-block:: python + + graph_config = { + "llm":{...}, + "loader_kwargs": { + "proxy" : { + "server": "http://your_proxy_server:port", + "username": "your_username", + "password": "your_password", + }, + }, + } diff --git a/docs/source/scrapers/graphs.rst b/docs/source/scrapers/graphs.rst new file mode 100644 index 00000000..efd87537 --- /dev/null +++ b/docs/source/scrapers/graphs.rst @@ -0,0 +1,109 @@ +Graphs +====== + +Graphs are scraping pipelines aimed at solving specific tasks. They are composed of nodes, which can be configured individually to address different aspects of the task (fetching data, extracting information, etc.). + +There are currently three types of graphs available in the library: + +- **SmartScraperGraph**: one-page scraper that requires a user-defined prompt and a URL (or local file) and extracts information from it using an LLM. +- **SearchGraph**: multi-page scraper that only requires a user-defined prompt to extract information from a search engine using an LLM. It is built on top of SmartScraperGraph. +- **SpeechGraph**: text-to-speech pipeline that generates an answer as well as a requested audio file. It is built on top of SmartScraperGraph and requires a user-defined prompt and a URL (or local file). 
+ +**Note:** they all use a graph configuration to set up LLM models and other parameters. To find out more about the configurations, check the `LLM`_ and `Configuration`_ sections. + +SmartScraperGraph +^^^^^^^^^^^^^^^^^ + +.. image:: ../../assets/smartscrapergraph.png + :align: center + :width: 90% + :alt: SmartScraperGraph +| + +First we define the graph configuration, which includes the LLM model and other parameters. Then we create an instance of the SmartScraperGraph class, passing the prompt, source, and configuration as arguments. Finally, we run the graph and print the result. +It will fetch the data from the source and extract the information based on the prompt in JSON format. + +.. code-block:: python + + from scrapegraphai.graphs import SmartScraperGraph + + graph_config = { + "llm": {...}, + } + + smart_scraper_graph = SmartScraperGraph( + prompt="List me all the projects with their descriptions", + source="https://perinim.github.io/projects", + config=graph_config + ) + + result = smart_scraper_graph.run() + print(result) + + +SearchGraph +^^^^^^^^^^^ + +.. image:: ../../assets/searchgraph.png + :align: center + :width: 80% + :alt: SearchGraph +| + +Similar to SmartScraperGraph, we define the graph configuration, create an instance of the SearchGraph class, and run the graph. +It will create a search query, fetch the first n results from the search engine, run n SmartScraperGraph instances, and return the results in JSON format. + + +.. code-block:: python + + from scrapegraphai.graphs import SearchGraph + + graph_config = { + "llm": {...}, + "embeddings": {...}, + } + + # Create the SearchGraph instance + search_graph = SearchGraph( + prompt="List me all the traditional recipes from Chioggia", + config=graph_config + ) + + # Run the graph + result = search_graph.run() + print(result) + + +SpeechGraph +^^^^^^^^^^^ + +.. 
image:: ../../assets/speechgraph.png + :align: center + :width: 90% + :alt: SpeechGraph +| + +Similar to SmartScraperGraph, we define the graph configuration, create an instance of the SpeechGraph class, and run the graph. +It will fetch the data from the source, extract the information based on the prompt, and generate an audio file with the answer, as well as the answer itself, in JSON format. + +.. code-block:: python + + from scrapegraphai.graphs import SpeechGraph + + graph_config = { + "llm": {...}, + "tts_model": {...}, + } + + # ************************************************ + # Create the SpeechGraph instance and run it + # ************************************************ + + speech_graph = SpeechGraph( + prompt="Make a detailed audio summary of the projects.", + source="https://perinim.github.io/projects/", + config=graph_config, + ) + + result = speech_graph.run() + print(result) \ No newline at end of file diff --git a/docs/source/scrapers/llm.rst b/docs/source/scrapers/llm.rst new file mode 100644 index 00000000..486668b1 --- /dev/null +++ b/docs/source/scrapers/llm.rst @@ -0,0 +1,190 @@ +LLM +=== + +We support many known LLM models and providers used to analyze the web pages and extract the information requested by the user. Models can be split into **Chat Models** and **Embedding Models** (the latter are mainly used for Retrieval Augmented Generation, or RAG). +These models are specified inside the graph configuration dictionary and can be used interchangeably, for example by defining a different model for llm and embeddings. + +- **Local Models**: These models are hosted on the local machine and can be used without any API key. +- **API-based Models**: These models are hosted on the cloud and require an API key to access them (e.g. OpenAI, Groq, etc.). + +**Note**: If the embedding model is not specified, the library will use the default one for that LLM, if available. 
+ +Local Models +------------ + +Currently, local models are supported through Ollama integration. Ollama is a provider of LLM models, which can be downloaded from `Ollama `_. +Let's say we want to use **llama3** as the chat model and **nomic-embed-text** as the embedding model. We first need to pull them from Ollama using: + +.. code-block:: bash + + ollama pull llama3 + ollama pull nomic-embed-text + +Then we can use them in the graph configuration as follows: + +.. code-block:: python + + graph_config = { + "llm": { + "model": "llama3", + "temperature": 0.0, + "format": "json", + }, + "embeddings": { + "model": "nomic-embed-text", + }, + } + +You can also set the **base_url** parameter to specify the model's endpoint. By default, it is set to http://localhost:11434. This is useful if you are running Ollama in a Docker container or on a different machine. + +If you want to host Ollama in a Docker container, you can use the following commands: + +.. code-block:: bash + + docker-compose up -d + docker exec -it ollama ollama pull llama3 + +API-based Models +---------------- + +OpenAI +^^^^^^ + +You can get the API key from `here `_. + +.. code-block:: python + + graph_config = { + "llm": { + "api_key": "OPENAI_API_KEY", + "model": "gpt-3.5-turbo", + }, + } + +If you want to use text-to-speech models, you can specify the `tts_model` parameter: + +.. code-block:: python + + graph_config = { + "llm": { + "api_key": "OPENAI_API_KEY", + "model": "gpt-3.5-turbo", + "temperature": 0.7, + }, + "tts_model": { + "api_key": "OPENAI_API_KEY", + "model": "tts-1", + "voice": "alloy" + }, + } + +Gemini +^^^^^^ + +You can get the API key from `here `_. + +**Note**: some countries are not supported and therefore it won't be possible to request an API key. A possible workaround is to use a VPN or run the library on Colab. + +.. 
code-block:: python + + graph_config = { + "llm": { + "api_key": "GEMINI_API_KEY", + "model": "gemini-pro" + }, + } + +Groq +^^^^ + +You can get the API key from `here `_. Groq doesn't support embedding models, so in the following example we are using an Ollama one. + +.. code-block:: python + + graph_config = { + "llm": { + "model": "groq/gemma-7b-it", + "api_key": "GROQ_API_KEY", + "temperature": 0 + }, + "embeddings": { + "model": "ollama/nomic-embed-text", + }, + } + +Azure +^^^^^ + +We can also pass a model instance for the chat model and the embedding model. For Azure, a possible configuration would be: + +.. code-block:: python + + llm_model_instance = AzureChatOpenAI( + openai_api_version="AZURE_OPENAI_API_VERSION", + azure_deployment="AZURE_OPENAI_CHAT_DEPLOYMENT_NAME" + ) + + embedder_model_instance = AzureOpenAIEmbeddings( + azure_deployment="AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME", + openai_api_version="AZURE_OPENAI_API_VERSION", + ) + + graph_config = { + "llm": { + "model_instance": llm_model_instance + }, + "embeddings": { + "model_instance": embedder_model_instance + } + } + +Hugging Face Hub +^^^^^^^^^^^^^^^^ + +We can also pass a model instance for the chat model and the embedding model. For Hugging Face, a possible configuration would be: + +.. code-block:: python + + llm_model_instance = HuggingFaceEndpoint( + repo_id="mistralai/Mistral-7B-Instruct-v0.2", + max_length=128, + temperature=0.5, + token="HUGGINGFACEHUB_API_TOKEN" + ) + + embedder_model_instance = HuggingFaceInferenceAPIEmbeddings( + api_key="HUGGINGFACEHUB_API_TOKEN", + model_name="sentence-transformers/all-MiniLM-l6-v2" + ) + + graph_config = { + "llm": { + "model_instance": llm_model_instance + }, + "embeddings": { + "model_instance": embedder_model_instance + } + } + +Anthropic +^^^^^^^^^ + +We can also pass a model instance for the chat model and the embedding model. For Anthropic, a possible configuration would be: + +.. 
code-block:: python + + embedder_model_instance = HuggingFaceInferenceAPIEmbeddings( + api_key="HUGGINGFACEHUB_API_TOKEN", + model_name="sentence-transformers/all-MiniLM-l6-v2" + ) + + graph_config = { + "llm": { + "api_key": "ANTHROPIC_API_KEY", + "model": "claude-3-haiku-20240307", + "max_tokens": 4000 + }, + "embeddings": { + "model_instance": embedder_model_instance + } + } \ No newline at end of file
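A possible follow-up for `docs/source/scrapers/llm.rst`: the configurations in this patch use placeholder strings such as "OPENAI_API_KEY" and "GROQ_API_KEY" for the keys. The sketch below shows one way a reader might assemble the same `graph_config` shape while reading the key from an environment variable. `build_graph_config` and its parameters are hypothetical and not part of scrapegraphai. Only the dictionary layout (`llm`, `embeddings`, and top-level options such as `verbose`) is taken from the documentation in this diff.

```python
import os
from typing import Any, Dict, Optional


def build_graph_config(
    llm_model: str,
    api_key_env: Optional[str] = None,
    embeddings_model: Optional[str] = None,
    **extra: Any,
) -> Dict[str, Any]:
    """Assemble a graph_config dictionary in the shape documented in this patch.

    Hypothetical helper for illustration; it is not provided by scrapegraphai.
    """
    llm: Dict[str, Any] = {"model": llm_model}
    if api_key_env is not None:
        # Read the key from the environment instead of hard-coding it.
        api_key = os.environ.get(api_key_env)
        if not api_key:
            raise KeyError(f"environment variable {api_key_env} is not set")
        llm["api_key"] = api_key

    config: Dict[str, Any] = {"llm": llm}
    if embeddings_model is not None:
        config["embeddings"] = {"model": embeddings_model}
    # Top-level options (for example verbose or headless) are passed through unchanged.
    config.update(extra)
    return config


# Local models need no API key, so only the model names are given.
local_config = build_graph_config("llama3", embeddings_model="nomic-embed-text")
print(local_config)
```

For an API-based provider, the call would presumably look like `build_graph_config("groq/gemma-7b-it", api_key_env="GROQ_API_KEY", embeddings_model="ollama/nomic-embed-text")`, with the resulting dictionary passed as `config=` to one of the graph classes.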