Mirror of https://github.com/VinciGit00/Scrapegraph-ai.git
Synced 2026-06-25 21:11:11 +08:00

feat(omni-search): added omni search graph and updated docs

Parent: a296927624
Commit: fcb3abb01d
docs/assets/omniscrapergraph.png (binary, normal file)
Binary file not shown. Size after: 72 KiB

docs/assets/omnisearchgraph.png (binary, normal file)
Binary file not shown. Size after: 57 KiB
@@ -10,6 +10,8 @@ Some interesting ones are:
 - `headless`: If set to `False`, the web browser will be opened on the URL requested and close right after the HTML is fetched.
 - `max_results`: The maximum number of results to be fetched from the search engine. Useful in `SearchGraph`.
 - `output_path`: The path where the output files will be saved. Useful in `SpeechGraph`.
+- `loader_kwargs`: A dictionary with additional parameters to be passed to the `Loader` class, such as `proxy`.
+- `max_images`: The maximum number of images to be analyzed. Useful in `OmniScraperGraph` and `OmniSearchGraph`.
 
 Proxy Rotation
 ^^^^^^^^^^^^^^
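As a rough illustration of the two options added in this hunk, a configuration dictionary might look like the following. This is a minimal sketch: the `llm` values are placeholders, and the shape of the `proxy` entry inside `loader_kwargs` (a single `server` key) is an assumption made for illustration, not something this commit documents.

```python
# Hypothetical configuration; the API key and proxy address are placeholders.
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",  # placeholder, not a real key
        "model": "gpt-4o",
    },
    "headless": True,
    # Extra parameters forwarded to the loader; the "proxy" layout is assumed.
    "loader_kwargs": {
        "proxy": {"server": "http://localhost:8080"},
    },
    # Upper bound on the number of images analyzed by the Omni graphs.
    "max_images": 5,
}

print(sorted(graph_config))
```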
@@ -3,16 +3,80 @@ Graphs
 
 Graphs are scraping pipelines aimed at solving specific tasks. They are composed by nodes which can be configured individually to address different aspects of the task (fetching data, extracting information, etc.).
 
-There are currently three types of graphs available in the library:
+There are three types of graphs available in the library:
 
 - **SmartScraperGraph**: one-page scraper that requires a user-defined prompt and a URL (or local file) to extract information from using LLM.
 - **SearchGraph**: multi-page scraper that only requires a user-defined prompt to extract information from a search engine using LLM. It is built on top of SmartScraperGraph.
 - **SpeechGraph**: text-to-speech pipeline that generates an answer as well as a requested audio file. It is built on top of SmartScraperGraph and requires a user-defined prompt and a URL (or local file).
 
+With the introduction of `GPT-4o`, two new powerful graphs have been created:
+
+- **OmniScraperGraph**: similar to `SmartScraperGraph`, but with the ability to scrape images and describe them.
+- **OmniSearchGraph**: similar to `SearchGraph`, but with the ability to scrape images and describe them.
+
 .. note::
 
    They all use a graph configuration to set up LLM models and other parameters. To find out more about the configurations, check the :ref:`LLM` and :ref:`Configuration` sections.
 
+OmniScraperGraph
+^^^^^^^^^^^^^^^^
+
+.. image:: ../../assets/omniscrapergraph.png
+   :align: center
+   :width: 90%
+   :alt: OmniScraperGraph
+
+
+First we define the graph configuration, which includes the LLM model and other parameters. Then we create an instance of the OmniScraperGraph class, passing the prompt, source, and configuration as arguments. Finally, we run the graph and print the result.
+It will fetch the data from the source and extract the information based on the prompt in JSON format.
+
+.. code-block:: python
+
+   from scrapegraphai.graphs import OmniScraperGraph
+
+   graph_config = {
+       "llm": {...},
+   }
+
+   omni_scraper_graph = OmniScraperGraph(
+       prompt="List me all the projects with their titles and image links and descriptions.",
+       source="https://perinim.github.io/projects",
+       config=graph_config
+   )
+
+   result = omni_scraper_graph.run()
+   print(result)
+
+OmniSearchGraph
+^^^^^^^^^^^^^^^
+
+.. image:: ../../assets/omnisearchgraph.png
+   :align: center
+   :width: 80%
+   :alt: OmniSearchGraph
+
+
+Similar to OmniScraperGraph, we define the graph configuration, create an instance of the OmniSearchGraph class, and run the graph.
+It will create a search query, fetch the first n results from the search engine, run n OmniScraperGraph instances, and return the results in JSON format.
+
+.. code-block:: python
+
+   from scrapegraphai.graphs import OmniSearchGraph
+
+   graph_config = {
+       "llm": {...},
+   }
+
+   # Create the OmniSearchGraph instance
+   omni_search_graph = OmniSearchGraph(
+       prompt="List me all Chioggia's famous dishes and describe their pictures.",
+       config=graph_config
+   )
+
+   # Run the graph
+   result = omni_search_graph.run()
+   print(result)
+
 SmartScraperGraph
 ^^^^^^^^^^^^^^^^^
 
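The OmniSearchGraph flow described in the added documentation (build a search query, fetch the first n results, run one scraper per result, merge the answers) can be sketched in plain Python. This is not the library's code: `fake_search`, `fake_scrape`, and `merge_answers` are hypothetical stand-ins that return canned data, so the control flow can be followed without network access or an LLM.

```python
from typing import Dict, List


def fake_search(query: str, max_results: int) -> List[str]:
    """Hypothetical stand-in for a search engine: returns the first n URLs."""
    urls = [f"https://example.com/result/{i}" for i in range(10)]
    return urls[:max_results]


def fake_scrape(prompt: str, url: str) -> Dict[str, str]:
    """Hypothetical stand-in for one per-URL scraper run."""
    return {"source": url, "answer": f"answer to {prompt!r} from {url}"}


def merge_answers(partials: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Combine the per-URL answers into one JSON-like result."""
    return {
        "answers": [p["answer"] for p in partials],
        "sources": [p["source"] for p in partials],
    }


def omni_search_sketch(prompt: str, max_results: int = 3) -> Dict[str, List[str]]:
    urls = fake_search(prompt, max_results)            # 1. search
    partials = [fake_scrape(prompt, u) for u in urls]  # 2. one scrape per URL
    return merge_answers(partials)                     # 3. merge


result = omni_search_sketch("famous dishes of Chioggia", max_results=2)
print(len(result["sources"]))
```

Raising `max_results` widens the fan-out in step 2, which in the real pipeline would presumably mean more page fetches and more LLM calls per query.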
@@ -5,7 +5,7 @@ Basic example of scraping pipeline using OmniScraper
 import os, json
 from dotenv import load_dotenv
 from scrapegraphai.graphs import OmniScraperGraph
-from scrapegraphai.utils import prettify_exec_info, convert_to_csv
+from scrapegraphai.utils import prettify_exec_info
 
 load_dotenv()
 
@@ -22,7 +22,8 @@ graph_config = {
         "model": "gpt-4o",
     },
     "verbose": True,
-    "headless": False,
+    "headless": True,
+    "max_images": 5
 }
 
 # ************************************************
examples/openai/omni_search_graph_openai.py (normal file, 45 changed lines)
@@ -0,0 +1,45 @@
+"""
+Example of OmniSearchGraph
+"""
+
+import os, json
+from dotenv import load_dotenv
+from scrapegraphai.graphs import OmniSearchGraph
+from scrapegraphai.utils import prettify_exec_info
+load_dotenv()
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+openai_key = os.getenv("OPENAI_APIKEY")
+
+graph_config = {
+    "llm": {
+        "api_key": openai_key,
+        "model": "gpt-4o",
+    },
+    "max_results": 2,
+    "max_images": 5,
+    "verbose": True,
+}
+
+# ************************************************
+# Create the OmniSearchGraph instance and run it
+# ************************************************
+
+omni_search_graph = OmniSearchGraph(
+    prompt="List me all Chioggia's famous dishes and describe their pictures.",
+    config=graph_config
+)
+
+result = omni_search_graph.run()
+print(json.dumps(result, indent=2))
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = omni_search_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))
+
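`os.getenv` returns `None` when a variable is not set, so the example above would pass `None` as `api_key` if `OPENAI_APIKEY` is missing from the environment or the `.env` file. A small guard, sketched here with a hypothetical helper named `require_env`, fails earlier with a clearer message.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if unset or empty.

    Hypothetical helper; it is not part of scrapegraphai or python-dotenv.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with a throwaway variable name so no real key is needed.
os.environ["DEMO_APIKEY"] = "sk-demo"
print(require_env("DEMO_APIKEY"))
```

In the example script, this would replace the bare `os.getenv("OPENAI_APIKEY")` call, under the assumption that failing fast is preferable to a later authentication error.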
@@ -14,3 +14,4 @@ from .json_scraper_graph import JSONScraperGraph
 from .csv_scraper_graph import CSVScraperGraph
 from .pdf_scraper_graph import PDFScraperGraph
 from .omni_scraper_graph import OmniScraperGraph
+from .omni_search_graph import OmniSearchGraph
@@ -29,6 +29,7 @@ class OmniScraperGraph(AbstractGraph):
             configured for generating embeddings.
         verbose (bool): A flag indicating whether to show print statements during execution.
         headless (bool): A flag indicating whether to run the graph in headless mode.
+        max_images (int): The maximum number of images to process.
 
     Args:
         prompt (str): The prompt for the graph.
@@ -48,7 +49,7 @@ class OmniScraperGraph(AbstractGraph):
     def __init__(self, prompt: str, source: str, config: dict):
 
         self.max_images = 5 if config is None else config.get("max_images", 5)
 
         super().__init__(prompt, config, source)
 
         self.input_key = "url" if source.startswith("http") else "local_dir"
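The two constructor expressions visible in this hunk can be tried in isolation. The sketch below copies them into free functions (`resolve_max_images` and `resolve_input_key` are names made up here, not library functions) to show the default of 5 and how the `http` prefix selects the input key.

```python
from typing import Optional


def resolve_max_images(config: Optional[dict]) -> int:
    # Same expression as in the constructor: fall back to 5 when the
    # config is missing or has no "max_images" entry.
    return 5 if config is None else config.get("max_images", 5)


def resolve_input_key(source: str) -> str:
    # Sources starting with "http" are treated as URLs, anything else as local.
    return "url" if source.startswith("http") else "local_dir"


print(resolve_max_images(None), resolve_max_images({"max_images": 2}))
print(resolve_input_key("https://perinim.github.io/projects"))
print(resolve_input_key("./pages/projects.html"))
```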
scrapegraphai/graphs/omni_search_graph.py (normal file, 119 changed lines)
@@ -0,0 +1,119 @@
+"""
+OmniSearchGraph Module
+"""
+
+from copy import deepcopy
+
+from .base_graph import BaseGraph
+from ..nodes import (
+    SearchInternetNode,
+    GraphIteratorNode,
+    MergeAnswersNode
+)
+from .abstract_graph import AbstractGraph
+from .omni_scraper_graph import OmniScraperGraph
+
+
+class OmniSearchGraph(AbstractGraph):
+    """
+    OmniSearchGraph is a scraping pipeline that searches the internet for answers to a given prompt.
+    It only requires a user prompt to search the internet and generate an answer.
+
+    Attributes:
+        prompt (str): The user prompt to search the internet.
+        llm_model (dict): The configuration for the language model.
+        embedder_model (dict): The configuration for the embedder model.
+        headless (bool): A flag to run the browser in headless mode.
+        verbose (bool): A flag to display the execution information.
+        model_token (int): The token limit for the language model.
+        max_results (int): The maximum number of results to return.
+
+    Args:
+        prompt (str): The user prompt to search the internet.
+        config (dict): Configuration parameters for the graph.
+
+    Example:
+        >>> omni_search_graph = OmniSearchGraph(
+        ...     "What is Chioggia famous for?",
+        ...     {"llm": {"model": "gpt-4o"}}
+        ... )
+        >>> result = omni_search_graph.run()
+    """
+
+    def __init__(self, prompt: str, config: dict):
+
+        self.max_results = config.get("max_results", 3)
+        self.copy_config = deepcopy(config)
+
+        super().__init__(prompt, config)
+
+    def _create_graph(self) -> BaseGraph:
+        """
+        Creates the graph of nodes representing the workflow for web scraping and searching.
+
+        Returns:
+            BaseGraph: A graph instance representing the web scraping and searching workflow.
+        """
+
+        # ************************************************
+        # Create an OmniScraperGraph instance
+        # ************************************************
+
+        omni_scraper_instance = OmniScraperGraph(
+            prompt="",
+            source="",
+            config=self.copy_config
+        )
+
+        # ************************************************
+        # Define the graph nodes
+        # ************************************************
+
+        search_internet_node = SearchInternetNode(
+            input="user_prompt",
+            output=["urls"],
+            node_config={
+                "llm_model": self.llm_model,
+                "max_results": self.max_results
+            }
+        )
+        graph_iterator_node = GraphIteratorNode(
+            input="user_prompt & urls",
+            output=["results"],
+            node_config={
+                "graph_instance": omni_scraper_instance,
+            }
+        )
+
+        merge_answers_node = MergeAnswersNode(
+            input="user_prompt & results",
+            output=["answer"],
+            node_config={
+                "llm_model": self.llm_model,
+            }
+        )
+
+        return BaseGraph(
+            nodes=[
+                search_internet_node,
+                graph_iterator_node,
+                merge_answers_node
+            ],
+            edges=[
+                (search_internet_node, graph_iterator_node),
+                (graph_iterator_node, merge_answers_node)
+            ],
+            entry_point=search_internet_node
+        )
+
+    def run(self) -> str:
+        """
+        Executes the web scraping and searching process.
+
+        Returns:
+            str: The answer to the prompt.
+        """
+        inputs = {"user_prompt": self.prompt}
+        self.final_state, self.execution_info = self.graph.execute(inputs)
+
+        return self.final_state.get("answer", "No answer found.")
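To make the node and edge wiring in `_create_graph` easier to follow, here is a toy executor that runs a linear chain of nodes over a shared state dictionary. It is only a sketch of the general idea: `ToyNode` and `run_chain` are hypothetical, and the real `BaseGraph.execute` may work differently (for instance, the code above shows it also returning execution info).

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]


class ToyNode:
    """Hypothetical node: reads from the state and writes one output key."""

    def __init__(self, name: str, output: str, fn: Callable[[State], object]):
        self.name = name
        self.output = output
        self.fn = fn


def run_chain(nodes: List[ToyNode], edges: List[Tuple[ToyNode, ToyNode]],
              entry_point: ToyNode, state: State) -> State:
    if entry_point.name not in {n.name for n in nodes}:
        raise ValueError("entry_point must be one of the nodes")
    # Follow edges from the entry point; each node has at most one successor.
    successor = {src.name: dst for src, dst in edges}
    current: Optional[ToyNode] = entry_point
    while current is not None:
        state[current.output] = current.fn(state)
        current = successor.get(current.name)
    return state


# Canned stand-ins for the search, per-URL scraping, and merge steps.
search = ToyNode("search", "urls",
                 lambda s: ["https://example.com/a", "https://example.com/b"])
iterate = ToyNode("iterate", "results",
                  lambda s: [f"scraped {u}" for u in s["urls"]])
merge = ToyNode("merge", "answer", lambda s: " | ".join(s["results"]))

final_state = run_chain(
    nodes=[search, iterate, merge],
    edges=[(search, iterate), (iterate, merge)],
    entry_point=search,
    state={"user_prompt": "What is Chioggia famous for?"},
)
print(final_state.get("answer", "No answer found."))
```

Each node only sees the shared state, which mirrors how the `input` and `output` keys (`user_prompt`, `urls`, `results`, `answer`) connect the three nodes in the commit.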