mirror of
https://github.com/VinciGit00/Scrapegraph-ai.git
synced 2026-06-25 21:11:11 +08:00
docs(refactor): added proxy-rotation usage and refactor readthedocs
This commit is contained in:
parent
5d6d996e8f
commit
e256b758b2
BIN
docs/assets/searchgraph.png
Normal file
Binary file not shown.
Size: 50 KiB
BIN
docs/assets/smartscrapergraph.png
Normal file
Binary file not shown.
Size: 58 KiB
BIN
docs/assets/speechgraph.png
Normal file
Binary file not shown.
Size: 46 KiB
@ -30,4 +30,3 @@ exclude_patterns = []
|
|||||||
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
|
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
|
||||||
|
|
||||||
html_theme = 'sphinx_rtd_theme'
|
html_theme = 'sphinx_rtd_theme'
|
||||||
html_static_path = ['_static']
|
|
||||||
|
|||||||
@ -1,7 +1,9 @@
|
|||||||
Examples
|
Examples
|
||||||
========
|
========
|
||||||
|
|
||||||
Here some example of the different ways to scrape with ScrapegraphAI
|
Let's suppose you want to scrape a website to get a list of projects with their descriptions.
|
||||||
|
You can use the `SmartScraperGraph` class to do that.
|
||||||
|
The following examples show how to use the `SmartScraperGraph` class with OpenAI models and local models.
|
||||||
|
|
||||||
OpenAI models
|
OpenAI models
|
||||||
^^^^^^^^^^^^^
|
^^^^^^^^^^^^^
|
||||||
@ -78,7 +80,7 @@ After that, you can run the following code, using only your machine resources br
|
|||||||
# ************************************************
|
# ************************************************
|
||||||
|
|
||||||
smart_scraper_graph = SmartScraperGraph(
|
smart_scraper_graph = SmartScraperGraph(
|
||||||
prompt="List me all the news with their description.",
|
prompt="List me all the projects with their description.",
|
||||||
# also accepts a string with the already downloaded HTML code
|
# also accepts a string with the already downloaded HTML code
|
||||||
source="https://perinim.github.io/projects",
|
source="https://perinim.github.io/projects",
|
||||||
config=graph_config
|
config=graph_config
|
||||||
@ -87,3 +89,4 @@ After that, you can run the following code, using only your machine resources br
|
|||||||
result = smart_scraper_graph.run()
|
result = smart_scraper_graph.run()
|
||||||
print(result)
|
print(result)
|
||||||
|
|
||||||
|
To learn how to customize the `graph_config` dictionary, for example by using a different LLM or adding new parameters, see the `Scrapers` section.
|
||||||
@ -7,26 +7,35 @@ for this project.
|
|||||||
Prerequisites
|
Prerequisites
|
||||||
^^^^^^^^^^^^^
|
^^^^^^^^^^^^^
|
||||||
|
|
||||||
- `Python 3.8+ <https://www.python.org/downloads/>`_
|
- `Python >=3.9,<3.12 <https://www.python.org/downloads/>`_
|
||||||
- `pip <https://pip.pypa.io/en/stable/getting-started/>`
|
- `pip <https://pip.pypa.io/en/stable/getting-started/>`_
|
||||||
- `ollama <https://ollama.com/>` *optional for local models
|
- `Ollama <https://ollama.com/>`_ (optional for local models)
|
||||||
|
|
||||||
|
|
||||||
Install the library
|
Install the library
|
||||||
^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
The library is available on PyPI, so it can be installed using the following command:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
pip install scrapegraphai
|
pip install scrapegraphai
|
||||||
|
|
||||||
|
**Note:** It is highly recommended to install the library in a virtual environment (conda, venv, etc.).
|
||||||
|
|
||||||
|
If you clone the repository, you can install the library using `poetry <https://python-poetry.org/docs/>`_:
|
||||||
|
|
||||||
|
.. code-block:: bash
|
||||||
|
|
||||||
|
poetry install
|
||||||
|
|
||||||
Additionally on Windows when using WSL
|
Additionally on Windows when using WSL
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
If you are using Windows Subsystem for Linux (WSL) and you are facing issues with the installation of the library, you might need to install the following packages:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
sudo apt-get -y install libnss3 libnspr4 libgbm1 libasound2
|
sudo apt-get -y install libnss3 libnspr4 libgbm1 libasound2
|
||||||
|
|
||||||
As simple as that! You are now ready to scrape gnamgnamgnam 👿👿👿
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@ -3,12 +3,6 @@
|
|||||||
You can adapt this file completely to your liking, but it should at least
|
You can adapt this file completely to your liking, but it should at least
|
||||||
contain the root `toctree` directive.
|
contain the root `toctree` directive.
|
||||||
|
|
||||||
Welcome to scrapegraphai-ai's documentation!
|
|
||||||
=======================================
|
|
||||||
|
|
||||||
Here you will find all the information you need to get started.
|
|
||||||
The following sections will guide you through the installation process and the usage of the library.
|
|
||||||
|
|
||||||
.. toctree::
|
.. toctree::
|
||||||
:maxdepth: 2
|
:maxdepth: 2
|
||||||
:caption: Introduction
|
:caption: Introduction
|
||||||
@ -22,6 +16,19 @@ The following sections will guide you through the installation process and the u
|
|||||||
|
|
||||||
getting_started/installation
|
getting_started/installation
|
||||||
getting_started/examples
|
getting_started/examples
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 2
|
||||||
|
:caption: Scrapers
|
||||||
|
|
||||||
|
scrapers/graphs
|
||||||
|
scrapers/llm
|
||||||
|
scrapers/graph_config
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 2
|
||||||
|
:caption: Modules
|
||||||
|
|
||||||
modules/modules
|
modules/modules
|
||||||
|
|
||||||
Indices and tables
|
Indices and tables
|
||||||
|
|||||||
@ -2,7 +2,7 @@ Contributing
|
|||||||
============
|
============
|
||||||
|
|
||||||
Hey, you want to contribute? Awesome!
|
Hey, you want to contribute? Awesome!
|
||||||
Just fork the repo, make your changes, and send me a pull request.
|
Just fork the repo, make your changes, and send a pull request.
|
||||||
If you're not sure if it's a good idea, open an issue and we'll discuss it.
|
If you're not sure if it's a good idea, open an issue and we'll discuss it.
|
||||||
|
|
||||||
Go and check out the `contributing guidelines <https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md>`__ for more information.
|
Go and check out the `contributing guidelines <https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md>`__ for more information.
|
||||||
|
|||||||
@ -1,20 +1,25 @@
|
|||||||
|
.. image:: ../../assets/scrapegraphai_logo.png
|
||||||
|
:align: center
|
||||||
|
:width: 50%
|
||||||
|
:alt: ScrapegraphAI
|
||||||
|
|
||||||
Overview
|
Overview
|
||||||
========
|
========
|
||||||
|
|
||||||
In a world where web pages are constantly changing and in a data-hungry world there is a need for a new generation of scrapers, and this is where ScrapegraphAI was born.
|
ScrapeGraphAI is an open-source web scraping Python library designed to usher in a new era of scraping tools.
|
||||||
An opensource library with the aim of starting a new era of scraping tools that are more flexible and require less maintenance by developers, with the use of LLMs.
|
In today's rapidly evolving and data-intensive digital landscape, this library stands out by integrating LLMs and
|
||||||
|
direct graph logic to automate the creation of scraping pipelines for websites and various local documents, including XML,
|
||||||
|
HTML, JSON, and more.
|
||||||
|
|
||||||
.. image:: ../../assets/scrapegraphai_logo.png
|
Simply specify the information you need to extract, and ScrapeGraphAI handles the rest,
|
||||||
:align: center
|
providing a more flexible and low-maintenance solution compared to traditional scraping tools.
|
||||||
:width: 100px
|
|
||||||
:alt: ScrapegraphAI
|
|
||||||
|
|
||||||
Why ScrapegraphAI?
|
Why ScrapegraphAI?
|
||||||
==================
|
==================
|
||||||
|
|
||||||
ScrapegraphAI in our vision represents a significant step forward in the field of web scraping, offering an open-source solution designed to meet the needs of a constantly evolving web landscape. Here's why ScrapegraphAI stands out:
|
Traditional web scraping tools often rely on fixed patterns or manual configuration to extract data from web pages.
|
||||||
|
ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention.
|
||||||
Flexibility and Adaptability
|
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
||||||
Traditional web scraping tools often rely on fixed patterns or manual configuration to extract data from web pages. ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention.
|
|
||||||
This flexibility ensures that scrapers remain functional even when website layouts change.
|
This flexibility ensures that scrapers remain functional even when website layouts change.
|
||||||
|
|
||||||
|
We support many Large Language Models (LLMs), including GPT, Gemini, Groq, Azure, and Hugging Face,
|
||||||
|
as well as local models which can run on your machine using Ollama.
|
||||||
@ -1,6 +1,3 @@
|
|||||||
scrapegraphai
|
|
||||||
=============
|
|
||||||
|
|
||||||
.. toctree::
|
.. toctree::
|
||||||
:maxdepth: 4
|
:maxdepth: 4
|
||||||
|
|
||||||
|
|||||||
@ -1,29 +0,0 @@
|
|||||||
scrapegraphai.graphs package
|
|
||||||
=====================
|
|
||||||
|
|
||||||
Submodules
|
|
||||||
----------
|
|
||||||
|
|
||||||
scrapegraphai.graphs.base\_graph module
|
|
||||||
--------------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.graphs.base_graph
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.graphs.smart\_scraper\_graph module
|
|
||||||
------------------------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.graphs.smart_scraper_graph
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
Module contents
|
|
||||||
---------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.graphs
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
@ -1,61 +0,0 @@
|
|||||||
scrapegraphai.nodes package
|
|
||||||
====================
|
|
||||||
|
|
||||||
Submodules
|
|
||||||
----------
|
|
||||||
|
|
||||||
scrapegraphai.nodes.base\_node module
|
|
||||||
------------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.nodes.base_node
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.nodes.conditional\_node module
|
|
||||||
-------------------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.nodes.conditional_node
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.nodes.fetch\_html\_node module
|
|
||||||
-------------------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.nodes.fetch_html_node
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.nodes.generate\_answer\_node module
|
|
||||||
------------------------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.nodes.generate_answer_node
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.nodes.get\_probable\_tags\_node module
|
|
||||||
---------------------------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.nodes.get_probable_tags_node
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.nodes.parse\_html\_node module
|
|
||||||
-------------------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.nodes.parse_html_node
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
Module contents
|
|
||||||
---------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.nodes
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
@ -1,110 +0,0 @@
|
|||||||
scrapegraphai package
|
|
||||||
==============
|
|
||||||
|
|
||||||
Subpackages
|
|
||||||
-----------
|
|
||||||
|
|
||||||
.. toctree::
|
|
||||||
:maxdepth: 4
|
|
||||||
|
|
||||||
scrapegraphai.graphs
|
|
||||||
scrapegraphai.nodes
|
|
||||||
|
|
||||||
Submodules
|
|
||||||
----------
|
|
||||||
|
|
||||||
scrapegraphai.class\_creator module
|
|
||||||
----------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.class_creator
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.class\_generator module
|
|
||||||
------------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.class_generator
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.convert\_to\_csv module
|
|
||||||
------------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.convert_to_csv
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.convert\_to\_json module
|
|
||||||
-------------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.convert_to_json
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.dictionaries module
|
|
||||||
--------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.dictionaries
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.getter module
|
|
||||||
--------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.getter
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.json\_getter module
|
|
||||||
--------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.json_getter
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.pydantic\_class module
|
|
||||||
-----------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.pydantic_class
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.remover module
|
|
||||||
---------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.remover
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.request module
|
|
||||||
---------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.request
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
scrapegraphai.token\_calculator module
|
|
||||||
-------------------------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai.token_calculator
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
|
|
||||||
Module contents
|
|
||||||
---------------
|
|
||||||
|
|
||||||
.. automodule:: scrapegraphai
|
|
||||||
:members:
|
|
||||||
:undoc-members:
|
|
||||||
:show-inheritance:
|
|
||||||
49
docs/source/scrapers/graph_config.rst
Normal file
@ -0,0 +1,49 @@
|
|||||||
|
Additional Parameters
|
||||||
|
=====================
|
||||||
|
|
||||||
|
It is possible to customize the behavior of the graphs by setting some configuration options.
|
||||||
|
Some interesting ones are:
|
||||||
|
|
||||||
|
- `verbose`: If set to `True`, some debug information will be printed to the console.
|
||||||
|
- `headless`: If set to `False`, the web browser opens at the requested URL and closes right after the HTML is fetched.
|
||||||
|
- `max_results`: The maximum number of results to be fetched from the search engine. Useful in `SearchGraph`.
|
||||||
|
- `output_path`: The path where the output files will be saved. Useful in `SpeechGraph`.
|
||||||
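As a rough illustration of the options listed above, the sketch below assembles a `graph_config` dictionary. The helper name `build_graph_config` and the placement of the options at the top level of the dictionary are assumptions made for this example, not something this page confirms; the `llm` values are placeholders.

```python
# Hypothetical helper (not part of the library): collects the optional
# settings described above into a single graph_config dictionary.
ALLOWED_OPTIONS = {"verbose", "headless", "max_results", "output_path"}


def build_graph_config(llm, **options):
    """Return a config dict, rejecting option names not listed above."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"Unknown options: {sorted(unknown)}")
    # Assumption: the options sit next to "llm" at the top level.
    return {"llm": dict(llm), **options}


graph_config = build_graph_config(
    {"model": "llama3"},  # placeholder LLM settings
    verbose=True,         # print debug information to the console
    headless=False,       # show the browser window while fetching
    max_results=5,        # only meaningful for SearchGraph
)
```

Rejecting unknown keys early is a design choice of this sketch; it catches typos such as `headles` before a graph is created.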
|
|
||||||
|
Proxy Rotation
|
||||||
|
^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
It is possible to rotate the proxy by setting the `proxy` option in the graph configuration.
|
||||||
|
We provide a free proxy service based on the `free-proxy <https://pypi.org/project/free-proxy/>`_ library, which can be used as follows:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
graph_config = {
|
||||||
|
"llm":{...},
|
||||||
|
"loader_kwargs": {
|
||||||
|
"proxy" : {
|
||||||
|
"server": "broker",
|
||||||
|
"criteria": {
|
||||||
|
"anonymous": True,
|
||||||
|
"secure": True,
|
||||||
|
"countryset": {"IT"},
|
||||||
|
"timeout": 10.0,
|
||||||
|
"max_shape": 3
|
||||||
|
},
|
||||||
|
},
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
Do you have a proxy server? You can use it as follows:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
graph_config = {
|
||||||
|
"llm":{...},
|
||||||
|
"loader_kwargs": {
|
||||||
|
"proxy" : {
|
||||||
|
"server": "http://your_proxy_server:port",
|
||||||
|
"username": "your_username",
|
||||||
|
"password": "your_password",
|
||||||
|
},
|
||||||
|
},
|
||||||
|
}
|
||||||
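If you prefer not to hard-code proxy credentials, one possible approach is to read them from environment variables. This is only a sketch: the helper `proxy_from_env` and the variable names `SCRAPER_PROXY_SERVER`, `SCRAPER_PROXY_USERNAME`, and `SCRAPER_PROXY_PASSWORD` are hypothetical and are not defined by the library.

```python
import os


def proxy_from_env(env=os.environ):
    """Build a loader_kwargs dict from (hypothetical) environment variables."""
    proxy = {"server": env["SCRAPER_PROXY_SERVER"]}
    # Credentials are optional; add them only when both are present.
    username = env.get("SCRAPER_PROXY_USERNAME")
    password = env.get("SCRAPER_PROXY_PASSWORD")
    if username and password:
        proxy["username"] = username
        proxy["password"] = password
    return {"proxy": proxy}


# A plain dict stands in for os.environ so the sketch runs anywhere.
example_env = {
    "SCRAPER_PROXY_SERVER": "http://your_proxy_server:port",
    "SCRAPER_PROXY_USERNAME": "your_username",
    "SCRAPER_PROXY_PASSWORD": "your_password",
}

graph_config = {
    "llm": {"model": "llama3"},  # placeholder LLM settings
    "loader_kwargs": proxy_from_env(example_env),
}
```

In real use you would call `proxy_from_env()` without arguments so that the values come from the process environment.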
109
docs/source/scrapers/graphs.rst
Normal file
109
docs/source/scrapers/graphs.rst
Normal file
@ -0,0 +1,109 @@
|
|||||||
|
Graphs
|
||||||
|
======
|
||||||
|
|
||||||
|
Graphs are scraping pipelines aimed at solving specific tasks. They are composed of nodes, which can be configured individually to address different aspects of the task (fetching data, extracting information, etc.).
|
||||||
|
|
||||||
|
There are currently three types of graphs available in the library:
|
||||||
|
|
||||||
|
- **SmartScraperGraph**: one-page scraper that takes a user-defined prompt and a URL (or local file) and extracts the requested information from it using an LLM.
|
||||||
|
- **SearchGraph**: multi-page scraper that only requires a user-defined prompt and extracts information from search engine results using an LLM. It is built on top of SmartScraperGraph.
|
||||||
|
- **SpeechGraph**: text-to-speech pipeline that generates an answer as well as a requested audio file. It is built on top of SmartScraperGraph and requires a user-defined prompt and a URL (or local file).
|
||||||
|
|
||||||
|
**Note:** They all use a graph configuration to set up the LLM and other parameters. To find out more about the configurations, check the `LLM`_ and `Configuration`_ sections.
|
||||||
|
|
||||||
|
SmartScraperGraph
|
||||||
|
^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
.. image:: ../../assets/smartscrapergraph.png
|
||||||
|
:align: center
|
||||||
|
:width: 90%
|
||||||
|
:alt: SmartScraperGraph
|
||||||
|
|
|
||||||
|
|
||||||
|
First we define the graph configuration, which includes the LLM model and other parameters. Then we create an instance of the SmartScraperGraph class, passing the prompt, source, and configuration as arguments. Finally, we run the graph and print the result.
|
||||||
|
It will fetch the data from the source and extract the information based on the prompt in JSON format.
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
from scrapegraphai.graphs import SmartScraperGraph
|
||||||
|
|
||||||
|
graph_config = {
|
||||||
|
"llm": {...},
|
||||||
|
}
|
||||||
|
|
||||||
|
smart_scraper_graph = SmartScraperGraph(
|
||||||
|
prompt="List me all the projects with their descriptions",
|
||||||
|
source="https://perinim.github.io/projects",
|
||||||
|
config=graph_config
|
||||||
|
)
|
||||||
|
|
||||||
|
result = smart_scraper_graph.run()
|
||||||
|
print(result)
|
||||||
|
|
||||||
|
|
||||||
|
SearchGraph
|
||||||
|
^^^^^^^^^^^
|
||||||
|
|
||||||
|
.. image:: ../../assets/searchgraph.png
|
||||||
|
:align: center
|
||||||
|
:width: 80%
|
||||||
|
:alt: SearchGraph
|
||||||
|
|
|
||||||
|
|
||||||
|
Similar to SmartScraperGraph, we define the graph configuration, create an instance of the SearchGraph class, and run the graph.
|
||||||
|
It will create a search query, fetch the first n results from the search engine, run n SmartScraperGraph instances, and return the results in JSON format.
|
||||||
|
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
from scrapegraphai.graphs import SearchGraph
|
||||||
|
|
||||||
|
graph_config = {
|
||||||
|
"llm": {...},
|
||||||
|
"embeddings": {...},
|
||||||
|
}
|
||||||
|
|
||||||
|
# Create the SearchGraph instance
|
||||||
|
search_graph = SearchGraph(
|
||||||
|
prompt="List me all the traditional recipes from Chioggia",
|
||||||
|
config=graph_config
|
||||||
|
)
|
||||||
|
|
||||||
|
# Run the graph
|
||||||
|
result = search_graph.run()
|
||||||
|
print(result)
|
||||||
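The SearchGraph flow described above (build a query, take the first n results, run one scraper per result) can be sketched conceptually as follows. This is not the library's implementation: `search_and_scrape` and the three stand-in callables are hypothetical, and they replace the LLM, the search engine, and SmartScraperGraph so that the example runs offline.

```python
def search_and_scrape(prompt, make_query, search, scrape, max_results=3):
    """Conceptual flow: query -> first n URLs -> one scraper run per URL."""
    query = make_query(prompt)           # turn the prompt into a search query
    urls = search(query)[:max_results]   # keep only the first n results
    # Run one single-page scraper per URL and collect the answers.
    return {url: scrape(prompt, url) for url in urls}


# Stand-ins so the sketch runs without network access or an LLM.
def fake_make_query(prompt):
    return prompt.lower()


def fake_search(query):
    return [f"https://example.com/{i}" for i in range(5)]


def fake_scrape(prompt, url):
    return {"source": url, "answer": f"answer to {prompt!r}"}


results = search_and_scrape(
    "Traditional recipes from Chioggia",
    fake_make_query,
    fake_search,
    fake_scrape,
    max_results=2,
)
```

Passing the three steps in as callables keeps the control flow visible and makes each step easy to swap or test in isolation.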
|
|
||||||
|
|
||||||
|
SpeechGraph
|
||||||
|
^^^^^^^^^^^
|
||||||
|
|
||||||
|
.. image:: ../../assets/speechgraph.png
|
||||||
|
:align: center
|
||||||
|
:width: 90%
|
||||||
|
:alt: SpeechGraph
|
||||||
|
|
|
||||||
|
|
||||||
|
Similar to SmartScraperGraph, we define the graph configuration, create an instance of the SpeechGraph class, and run the graph.
|
||||||
|
It will fetch the data from the source, extract the information based on the prompt, and generate an audio file with the answer, as well as the answer itself, in JSON format.
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
from scrapegraphai.graphs import SpeechGraph
|
||||||
|
|
||||||
|
graph_config = {
|
||||||
|
"llm": {...},
|
||||||
|
"tts_model": {...},
|
||||||
|
}
|
||||||
|
|
||||||
|
# ************************************************
|
||||||
|
# Create the SpeechGraph instance and run it
|
||||||
|
# ************************************************
|
||||||
|
|
||||||
|
speech_graph = SpeechGraph(
|
||||||
|
prompt="Make a detailed audio summary of the projects.",
|
||||||
|
source="https://perinim.github.io/projects/",
|
||||||
|
config=graph_config,
|
||||||
|
)
|
||||||
|
|
||||||
|
result = speech_graph.run()
|
||||||
|
print(result)
|
||||||
190
docs/source/scrapers/llm.rst
Normal file
@ -0,0 +1,190 @@
|
|||||||
|
LLM
|
||||||
|
===
|
||||||
|
|
||||||
|
We support many well-known LLMs and providers, which are used to analyze web pages and extract the information requested by the user. Models can be split into **Chat Models** and **Embedding Models** (the latter are mainly used for Retrieval-Augmented Generation, RAG).
|
||||||
|
These models are specified inside the graph configuration dictionary and can be combined freely, for example by defining different models for `llm` and `embeddings`.
|
||||||
|
|
||||||
|
- **Local Models**: These models are hosted on the local machine and can be used without any API key.
|
||||||
|
- **API-based Models**: These models are hosted in the cloud and require an API key to access them (e.g., OpenAI, Groq, etc.).
|
||||||
|
|
||||||
|
**Note**: If the embedding model is not specified, the library will use the default one for that LLM, if available.
|
||||||
|
|
||||||
|
Local Models
|
||||||
|
------------
|
||||||
|
|
||||||
|
Currently, local models are supported through the Ollama integration. Ollama is a provider of LLMs and can be downloaded from the `Ollama <https://ollama.com/>`_ website.
|
||||||
|
Let's say we want to use **llama3** as the chat model and **nomic-embed-text** as the embedding model. We first need to pull them with Ollama:
|
||||||
|
|
||||||
|
.. code-block:: bash
|
||||||
|
|
||||||
|
ollama pull llama3
|
||||||
|
ollama pull nomic-embed-text
|
||||||
|
|
||||||
|
Then we can use them in the graph configuration as follows:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
graph_config = {
|
||||||
|
"llm": {
|
||||||
|
"model": "llama3",
|
||||||
|
"temperature": 0.0,
|
||||||
|
"format": "json",
|
||||||
|
},
|
||||||
|
"embeddings": {
|
||||||
|
"model": "nomic-embed-text",
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
You can also set the **base_url** parameter to specify the model endpoint. By default, it is set to http://localhost:11434. This is useful if you are running Ollama in a Docker container or on a different machine.
|
||||||
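A minimal sketch of a configuration that sets **base_url** is shown below. It assumes that `base_url` is placed inside both the `llm` and `embeddings` entries; the helper `ollama_config` and the host name `ollama-host` are hypothetical and only serve this example.

```python
DEFAULT_OLLAMA_URL = "http://localhost:11434"  # default stated above


def ollama_config(base_url=DEFAULT_OLLAMA_URL):
    """Return a graph_config that points both models at the same endpoint."""
    return {
        "llm": {
            "model": "llama3",
            "temperature": 0.0,
            "format": "json",
            "base_url": base_url,  # assumption: key lives inside "llm"
        },
        "embeddings": {
            "model": "nomic-embed-text",
            "base_url": base_url,  # assumption: key lives inside "embeddings"
        },
    }


# Example: Ollama running on another machine (hypothetical host name).
graph_config = ollama_config("http://ollama-host:11434")
```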
|
|
||||||
|
If you want to host Ollama in a Docker container, you can use the following command:
|
||||||
|
|
||||||
|
.. code-block:: bash
|
||||||
|
|
||||||
|
docker-compose up -d
|
||||||
|
docker exec -it ollama ollama pull llama3
|
||||||
|
|
||||||
|
API-based Models
|
||||||
|
----------------
|
||||||
|
|
||||||
|
OpenAI
|
||||||
|
^^^^^^
|
||||||
|
|
||||||
|
You can get the API key from `here <https://platform.openai.com/api-keys>`_.
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
graph_config = {
|
||||||
|
"llm": {
|
||||||
|
"api_key": "OPENAI_API_KEY",
|
||||||
|
"model": "gpt-3.5-turbo",
|
||||||
|
},
|
||||||
|
}
|
||||||
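The string `"OPENAI_API_KEY"` in the configuration above stands in for your actual key. One common way to avoid hard-coding it is to read it from an environment variable, sketched below. The helper `openai_llm_config` is hypothetical, and using an environment variable named `OPENAI_API_KEY` is an assumption of this example rather than a requirement stated on this page.

```python
import os


def openai_llm_config(model="gpt-3.5-turbo", env=os.environ):
    """Build the "llm" section using a key taken from the environment."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    return {"llm": {"api_key": api_key, "model": model}}


# A plain dict stands in for os.environ so the sketch runs anywhere;
# the key value is a placeholder, not a real credential.
graph_config = openai_llm_config(env={"OPENAI_API_KEY": "sk-placeholder"})
```

Failing early with a clear message is usually easier to debug than an authentication error raised later during the scrape.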
|
|
||||||
|
If you want to use text to speech models, you can specify the `tts_model` parameter:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
graph_config = {
|
||||||
|
"llm": {
|
||||||
|
"api_key": "OPENAI_API_KEY",
|
||||||
|
"model": "gpt-3.5-turbo",
|
||||||
|
"temperature": 0.7,
|
||||||
|
},
|
||||||
|
"tts_model": {
|
||||||
|
"api_key": "OPENAI_API_KEY",
|
||||||
|
"model": "tts-1",
|
||||||
|
"voice": "alloy"
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
Gemini
|
||||||
|
^^^^^^
|
||||||
|
|
||||||
|
You can get the API key from `here <https://ai.google.dev/gemini-api/docs/api-key>`_.
|
||||||
|
|
||||||
|
**Note**: some countries are not supported and therefore it won't be possible to request an API key. A possible workaround is to use a VPN or run the library on Colab.
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
graph_config = {
|
||||||
|
"llm": {
|
||||||
|
"api_key": "GEMINI_API_KEY",
|
||||||
|
"model": "gemini-pro"
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
Groq
|
||||||
|
^^^^
|
||||||
|
|
||||||
|
You can get the API key from `here <https://console.groq.com/keys>`_. Groq doesn't support embedding models, so the following example uses an Ollama embedding model instead.
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
graph_config = {
|
||||||
|
"llm": {
|
||||||
|
"model": "groq/gemma-7b-it",
|
||||||
|
"api_key": "GROQ_API_KEY",
|
||||||
|
"temperature": 0
|
||||||
|
},
|
||||||
|
"embeddings": {
|
||||||
|
"model": "ollama/nomic-embed-text",
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
Azure
|
||||||
|
^^^^^
|
||||||
|
|
||||||
|
We can also pass a model instance for the chat model and the embedding model. For Azure, a possible configuration would be:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
llm_model_instance = AzureChatOpenAI(
|
||||||
|
openai_api_version="AZURE_OPENAI_API_VERSION",
|
||||||
|
azure_deployment="AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"
|
||||||
|
)
|
||||||
|
|
||||||
|
embedder_model_instance = AzureOpenAIEmbeddings(
|
||||||
|
azure_deployment="AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME",
|
||||||
|
openai_api_version="AZURE_OPENAI_API_VERSION",
|
||||||
|
)
|
||||||
|
|
||||||
|
graph_config = {
|
||||||
|
"llm": {
|
||||||
|
"model_instance": llm_model_instance
|
||||||
|
},
|
||||||
|
"embeddings": {
|
||||||
|
"model_instance": embedder_model_instance
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
Hugging Face Hub
|
||||||
|
^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
We can also pass a model instance for the chat model and the embedding model. For Hugging Face, a possible configuration would be:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
llm_model_instance = HuggingFaceEndpoint(
|
||||||
|
repo_id="mistralai/Mistral-7B-Instruct-v0.2",
|
||||||
|
max_length=128,
|
||||||
|
temperature=0.5,
|
||||||
|
token="HUGGINGFACEHUB_API_TOKEN"
|
||||||
|
)
|
||||||
|
|
||||||
|
embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(
|
||||||
|
api_key="HUGGINGFACEHUB_API_TOKEN",
|
||||||
|
model_name="sentence-transformers/all-MiniLM-l6-v2"
|
||||||
|
)
|
||||||
|
|
||||||
|
graph_config = {
|
||||||
|
"llm": {
|
||||||
|
"model_instance": llm_model_instance
|
||||||
|
},
|
||||||
|
"embeddings": {
|
||||||
|
"model_instance": embedder_model_instance
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
Anthropic
|
||||||
|
^^^^^^^^^
|
||||||
|
|
||||||
|
We can also combine a chat model specified by name with a model instance for the embedding model. For Anthropic, a possible configuration would be:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(
|
||||||
|
api_key="HUGGINGFACEHUB_API_TOKEN",
|
||||||
|
model_name="sentence-transformers/all-MiniLM-l6-v2"
|
||||||
|
)
|
||||||
|
|
||||||
|
graph_config = {
|
||||||
|
"llm": {
|
||||||
|
"api_key": "ANTHROPIC_API_KEY",
|
||||||
|
"model": "claude-3-haiku-20240307",
|
||||||
|
"max_tokens": 4000
|
||||||
|
},
|
||||||
|
"embeddings": {
|
||||||
|
"model_instance": embedder_model_instance
|
||||||
|
}
|
||||||
|
}
|
||||||