mirror of
https://github.com/VinciGit00/Scrapegraph-ai.git
synced 2026-06-04 21:01:04 +08:00
## 🚀 **Looking for an even faster and simpler way to scrape at scale (only 5 lines of code)?** Check out our enhanced version at [**ScrapeGraphAI.com**](https://scrapegraphai.com/?utm_source=github&utm_medium=readme&utm_campaign=oss_cta&utm_content=top_banner)! 🚀

---

# 🕷️ ScrapeGraphAI: You Only Scrape Once

[English](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/README.md) | [中文](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/chinese.md) | [日本語](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/japanese.md)
| [한국어](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/korean.md)
| [Русский](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/russian.md) | [Türkçe](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/turkish.md)
| [Deutsch](https://www.readme-i18n.com/ScrapeGraphAI/Scrapegraph-ai?lang=de)
| [Español](https://www.readme-i18n.com/ScrapeGraphAI/Scrapegraph-ai?lang=es)
| [français](https://www.readme-i18n.com/ScrapeGraphAI/Scrapegraph-ai?lang=fr)
| [Português](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/portuguese.md)

[Downloads](https://pepy.tech/projects/scrapegraphai)
[Pylint](https://github.com/pylint-dev/pylint)
[Code quality](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/code-quality.yml)
[CodeQL](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml)
[License: MIT](https://opensource.org/licenses/MIT)
[Discord](https://discord.gg/gkxQDAjfeX)

<p align="center">
<a href="https://trendshift.io/repositories/9761" target="_blank"><img src="https://trendshift.io/api/badge/repositories/9761" alt="VinciGit00%2FScrapegraph-ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</p>

[ScrapeGraphAI](https://scrapegraphai.com) is a *web scraping* Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.).

Just say which information you want to extract and the library will do it for you!

<p align="center">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/sgai-hero.png" alt="ScrapeGraphAI Hero" style="width: 100%;">
</p>

## 🚀 Integrations
ScrapeGraphAI offers seamless integration with popular frameworks and tools to enhance your scraping capabilities. Whether you're building with Python or Node.js, using LLM frameworks, or working with no-code platforms, we've got you covered with our comprehensive integration options.

You can find more information at the following [link](https://scrapegraphai.com).

**Integrations**:
- **API**: [Documentation](https://docs.scrapegraphai.com/introduction)
- **SDKs**: [Python](https://docs.scrapegraphai.com/sdks/python), [Node](https://docs.scrapegraphai.com/sdks/javascript)
- **LLM Frameworks**: [Langchain](https://docs.scrapegraphai.com/integrations/langchain), [Llama Index](https://docs.scrapegraphai.com/integrations/llamaindex), [Crew.ai](https://docs.scrapegraphai.com/integrations/crewai), [Agno](https://docs.scrapegraphai.com/integrations/agno), [CamelAI](https://github.com/camel-ai/camel)
- **Low-code Frameworks**: [Pipedream](https://pipedream.com/apps/scrapegraphai), [Bubble](https://bubble.io/plugin/scrapegraphai-1745408893195x213542371433906180), [Zapier](https://zapier.com/apps/scrapegraphai/integrations), n8n, [Dify](https://dify.ai), [Toolhouse](https://app.toolhouse.ai/mcp-servers/scrapegraph_smartscraper)
- **MCP server**: [Link](https://smithery.ai/server/@ScrapeGraphAI/scrapegraph-mcp)

## 🚀 Quick install

The reference page for Scrapegraph-ai is available on the official PyPI page: [pypi](https://pypi.org/project/scrapegraphai/).

```bash
pip install scrapegraphai

# IMPORTANT (for fetching website content)
playwright install
```

**Note**: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱
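
To confirm that the install landed in a virtual environment, a small check like the one below can help. It is only a sketch: the helper names `in_virtualenv` and `is_installed` are illustrative and not part of ScrapeGraphAI. It relies on two standard-library facts: inside a `venv`-style environment `sys.prefix` differs from `sys.base_prefix`, and `importlib.util.find_spec` returns `None` for a top-level package that cannot be found.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_installed(package: str) -> bool:
    """Return True when a top-level package can be located for import."""
    return importlib.util.find_spec(package) is not None


if not in_virtualenv():
    print("Warning: not running inside a virtual environment.")
print("scrapegraphai installed:", is_installed("scrapegraphai"))
```

Other environment managers (for example conda) may not change `sys.prefix` in the same way, so treat a warning here as a hint rather than a verdict.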

## 💻 Usage
There are multiple standard scraping pipelines that can be used to extract information from a website (or local file).

The most common one is the `SmartScraperGraph`, which extracts information from a single page given a user prompt and a source URL.

```python
import json

from scrapegraphai.graphs import SmartScraperGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "model_tokens": 8192,
        "format": "json",
    },
    "verbose": True,
    "headless": False,
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract useful information from the webpage, including a description of what the company does, founders and social media links",
    source="https://scrapegraphai.com/",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()

# Pretty-print the extracted data
print(json.dumps(result, indent=4))
```

> [!NOTE]
> For OpenAI and other models you just need to change the `llm` config!
> ```python
> graph_config = {
>     "llm": {
>         "api_key": "YOUR_OPENAI_API_KEY",
>         "model": "openai/gpt-4o-mini",
>     },
>     "verbose": True,
>     "headless": False,
> }
> ```

The output will be a dictionary like the following:

```python
{
    "description": "ScrapeGraphAI transforms websites into clean, organized data for AI agents and data analytics. It offers an AI-powered API for effortless and cost-effective data extraction.",
    "founders": [
        {
            "name": "",
            "role": "Founder & Technical Lead",
            "linkedin": "https://www.linkedin.com/in/perinim/"
        },
        {
            "name": "Marco Vinciguerra",
            "role": "Founder & Software Engineer",
            "linkedin": "https://www.linkedin.com/in/marco-vinciguerra-7ba365242/"
        },
        {
            "name": "Lorenzo Padoan",
            "role": "Founder & Product Engineer",
            "linkedin": "https://www.linkedin.com/in/lorenzo-padoan-4521a2154/"
        }
    ],
    "social_media_links": {
        "linkedin": "https://www.linkedin.com/company/101881123",
        "twitter": "https://x.com/scrapegraphai",
        "github": "https://github.com/ScrapeGraphAI/Scrapegraph-ai"
    }
}
```
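
Because the result is a plain dictionary, ordinary Python is enough to post-process it. The sketch below assumes the result has the same shape as the sample output (a `founders` list whose entries carry `name`, `role` and `linkedin` keys). An LLM may return different keys for a different prompt, so the helper `founder_links` (a hypothetical name, not a library function) reads every key defensively with `.get`.

```python
def founder_links(result: dict) -> list[tuple[str, str]]:
    """Return (name, linkedin) pairs, skipping entries without a link."""
    pairs = []
    for founder in result.get("founders", []):
        link = founder.get("linkedin")
        if link:
            # Fall back to a placeholder when the model left the name empty.
            pairs.append((founder.get("name") or "<unknown>", link))
    return pairs


# Trimmed-down sample with the same shape as the output shown in this README.
sample = {
    "founders": [
        {
            "name": "",
            "role": "Founder & Technical Lead",
            "linkedin": "https://www.linkedin.com/in/perinim/",
        },
        {
            "name": "Marco Vinciguerra",
            "role": "Founder & Software Engineer",
            "linkedin": "https://www.linkedin.com/in/marco-vinciguerra-7ba365242/",
        },
    ]
}

for name, link in founder_links(sample):
    print(f"{name}: {link}")
```
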
There are other pipelines that can be used to extract information from multiple pages, generate Python scripts, or even generate audio files.

| Pipeline Name           | Description                                                                                                      |
|-------------------------|------------------------------------------------------------------------------------------------------------------|
| SmartScraperGraph       | Single-page scraper that only needs a user prompt and an input source.                                           |
| SearchGraph             | Multi-page scraper that extracts information from the top n search results of a search engine.                   |
| SpeechGraph             | Single-page scraper that extracts information from a website and generates an audio file.                        |
| ScriptCreatorGraph      | Single-page scraper that extracts information from a website and generates a Python script.                      |
| SmartScraperMultiGraph  | Multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources.    |
| ScriptCreatorMultiGraph | Multi-page scraper that generates a Python script for extracting information from multiple pages and sources.    |

Each of these graphs also has a multi version, which makes the LLM calls in parallel.
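
As a sketch of the multi-page variant, the snippet below passes a list of sources to `SmartScraperMultiGraph`. It assumes the class takes the same `prompt`, `source` and `config` keyword arguments as `SmartScraperGraph`, with `source` being a list of URLs; check the documentation for the exact signature in your installed version. The wrapper name `run_multi_scrape` and the choice of the two URLs are illustrative.

```python
# Sources to scrape with one shared prompt (both URLs appear in this README).
SOURCES = [
    "https://scrapegraphai.com/",
    "https://github.com/ScrapeGraphAI/Scrapegraph-ai",
]

graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "model_tokens": 8192,
        "format": "json",
    },
    "verbose": True,
    "headless": False,
}


def run_multi_scrape(prompt: str, sources: list[str], config: dict):
    """Run one prompt against several pages and return the combined result."""
    # Imported lazily so this module loads even without the library installed.
    from scrapegraphai.graphs import SmartScraperMultiGraph

    # Assumption: same keyword arguments as SmartScraperGraph, with a list as source.
    graph = SmartScraperMultiGraph(prompt=prompt, source=sources, config=config)
    return graph.run()
```

Calling `run_multi_scrape("What does the project do?", SOURCES, graph_config)` needs the library, the Playwright browsers, and a running Ollama instance with the model pulled.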

It is possible to use different LLMs through APIs, such as **OpenAI**, **Groq**, **Azure**, **Gemini**, **MiniMax** and more, or local models using **Ollama**.

If you want to use local models, remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command.

## 📖 Documentation

[Open In Colab](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)

The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).
Also check out the Docusaurus documentation [here](https://docs-oss.scrapegraphai.com/).

## 🤝 Contributing

Feel free to contribute, and join our Discord server to discuss improvements and share suggestions with us!

Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md).

[Discord](https://discord.gg/uJN7TYcpNa)
[LinkedIn](https://www.linkedin.com/company/scrapegraphai/)
[Twitter](https://twitter.com/scrapegraphai)

## 🔗 ScrapeGraph API & SDKs
If you are looking for a quick solution to integrate ScrapeGraph in your system, check out our powerful API [here](https://dashboard.scrapegraphai.com/login)!

We offer SDKs in both Python and Node.js, making it easy to integrate into your projects. Check them out below:

| SDK         | Language | GitHub Link                                                                                 |
|-------------|----------|---------------------------------------------------------------------------------------------|
| Python SDK  | Python   | [scrapegraph-py](https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-py) |
| Node.js SDK | Node.js  | [scrapegraph-js](https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-js) |

The official API documentation can be found [here](https://docs.scrapegraphai.com/).

## 📈 Telemetry
We collect anonymous usage metrics to enhance our package's quality and user experience. The data helps us prioritize improvements and ensure compatibility. If you wish to opt out, set the environment variable `SCRAPEGRAPHAI_TELEMETRY_ENABLED=false`. For more information, please refer to the documentation [here](https://scrapegraph-ai.readthedocs.io/en/latest/scrapers/telemetry.html).
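
One way to opt out from within a script, rather than from the shell, is to set the variable through `os.environ`. This sketch assumes the library reads the variable when it is first imported, so the assignment is placed before any `scrapegraphai` import; that ordering is a precaution, not documented behavior.

```python
import os

# Set the opt-out flag before the library is imported
# (assumption: the variable is read at import time).
os.environ["SCRAPEGRAPHAI_TELEMETRY_ENABLED"] = "false"

# Import the library only after the flag is set, for example:
# from scrapegraphai.graphs import SmartScraperGraph
```
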

## ❤️ Contributors
[Contributors](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)

## 🎓 Citations
If you have used our library for research purposes, please cite us using the following reference:
```text
@misc{scrapegraph-ai,
  author = {Lorenzo Padoan and Marco Vinciguerra},
  title = {Scrapegraph-ai},
  year = {2024},
  url = {https://github.com/VinciGit00/Scrapegraph-ai},
  note = {A Python library for scraping leveraging large language models}
}
```

## Authors

| Author             | Contact Info         |
|--------------------|----------------------|
| Marco Vinciguerra  | [LinkedIn](https://www.linkedin.com/in/marco-vinciguerra-7ba365242/) |
| Lorenzo Padoan     | [LinkedIn](https://www.linkedin.com/in/lorenzo-padoan-4521a2154/) |

## 📜 License

ScrapeGraphAI is licensed under the MIT License. See the [LICENSE](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/LICENSE) file for more information.

## Acknowledgements

- We would like to thank all the contributors to the project and the open-source community for their support.
- ScrapeGraphAI is meant to be used for data exploration and research purposes only. We are not responsible for any misuse of the library.

Made with ❤️ by [ScrapeGraph AI](https://scrapegraphai.com)