AI-based Python web scraper

🤖 YOSO-ai: You Only Scrape Once

YOSO-ai is an open-source Python library that uses LLMs and LangChain for faster, more efficient web scraping. Just describe the information you want to extract and the library will do it for you.

Official documentation page: yoso-ai.readthedocs.io

🔍 Demo

Try out YOSO-ai in your browser:

Open in GitHub Codespaces

🔧 Quick Setup

Follow these steps:

  1. Clone the repository:
git clone https://github.com/VinciGit00/yoso-ai.git
  2. (Optional) Create and activate a virtual environment:
python -m venv venv
source ./venv/bin/activate
  3. Install the dependencies:
pip install -r requirements.txt
# if you want to install it as a library
pip install .

# or, if you plan on developing new features, install the extra development dependencies as well
pip install -r requirements-dev.txt
# if you want to install it as a library
pip install .[dev]
  4. Create your personal OpenAI API key from here
  5. (Optional) Create a .env file in the main project folder and paste the API key into it:
API_KEY="your openai.com api key"
  6. You are ready to go! 🚀
  7. Try running the examples using:
python -m examples.html_scraping
# or if you are outside of the project folder
python -m yoso-ai.examples.html_scraping
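
To make the optional .env step more concrete, here is a minimal sketch of what reading that file amounts to. The examples in this README rely on `load_dotenv()` from python-dotenv for this; the `read_env_file` helper below is hypothetical, uses only the standard library, and handles only simple `KEY=VALUE` lines.

```python
import os
import tempfile
from pathlib import Path


def read_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove optional surrounding quotes from the value
        values[key.strip()] = value.strip().strip("\"'")
    return values


# Demonstrate with a throwaway file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = Path(tmp_dir) / ".env"
    env_path.write_text(
        '# OpenAI credentials\nAPI_KEY="your openai.com api key"\n',
        encoding="utf-8",
    )
    env_values = read_env_file(env_path)

# A value already set in the real environment takes precedence over the file
api_key = os.environ.get("API_KEY", env_values.get("API_KEY"))
print("API key found:", bool(api_key))
```

In a real project, prefer python-dotenv as the examples do, since it also covers quoting and multi-line values that this sketch ignores.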

📖 Examples

Case 1: Extracting information from a web page URL

import os
from dotenv import load_dotenv
from yosoai import _get_function, send_request

load_dotenv()

def main():
    # Get OpenAI API key from environment variables
    openai_key = os.getenv("API_KEY")
    if not openai_key:
        print("Error: OpenAI API key not found in environment variables.")
        return

    # Example values for the request
    request_settings = [
        {
            "title": "title_news",
            "type": "str",
            "description": "Give me the name of the news"
        }
    ]

    # Choose the desired model and other parameters
    selected_model = "gpt-3.5-turbo"
    temperature_value = 0.7

    # URL of the page to scrape
    target_url = "https://sport.sky.it/nba?gr=www"

    # Invoke send_request function
    result = send_request(
        openai_key,
        _get_function(target_url),
        request_settings,
        selected_model,
        temperature_value,
        'cl100k_base',
    )

    # Print or process the result as needed
    print("Result:", result)

if __name__ == "__main__":
    main()

Case 2: Passing your own HTML code

import os
from dotenv import load_dotenv
from yosoai import send_request

load_dotenv()

# Example using raw HTML code
query_info = '''
        Given this code extract all the information in a json format about the news.

        <article class="c-card__wrapper aem_card_check_wrapper" data-cardindex="0">
            <div class="c-card__content">
                <h2 class="c-card__title">Booker show with 52 points, whoever has the most games over 50</h2>
                <div class="c-card__label-wrapper c-label-wrapper">
                    <span class="c-label c-label--article-heading">Standings</span>
                </div>
                <p class="c-card__abstract">The Suns' No. 1 dominated the match won in New Orleans, scoring 52 points. It's about...</p>
                <div class="c-card__info">
                    <time class="c-card__date" datetime="20 gen - 07:54">20 gen - 07:54</time>
                ...
                </div>
            </div>
            <div class="c-card__img-wrapper">
                <figure class="o-aspect-ratio o-aspect-ratio--16-10 ">
                    <img crossorigin="anonymous" class="c-card__img j-lazyload" alt="Partite con 50+ punti: Booker in Top-20" data-srcset="..." sizes="..." loading="lazy" data-src="...">
                    <noscript>
                        <img crossorigin="anonymous" class="c-card__img" alt="Partite con 50+ punti: Booker in Top-20" srcset="..." sizes="..." src="...">
                    </noscript>
                </figure>
                <i class="icon icon--media icon--gallery icon--medium icon--c-primary">
                </i>
            </div>
        </article>
    '''


def main():
    # Get OpenAI API key from environment variables
    openai_key = os.getenv("API_KEY")
    if not openai_key:
        print("Error: OpenAI API key not found in environment variables.")
        return

    # Example values for the request
    request_settings = [
        {
            "title": "title",
            "type": "str",
            "description": "Title of the news"
        }
    ]

    # Choose the desired model and other parameters
    selected_model = "gpt-3.5-turbo"
    temperature_value = 0.7

    # Invoke send_request function
    result = send_request(openai_key, query_info, request_settings, selected_model, temperature_value, 'cl100k_base')

    # Print or process the result as needed
    print("Result:", result)

if __name__ == "__main__":
    main()

Note: all available models are listed at https://platform.openai.com/docs/models. Make sure your API key has access to the model you select.
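
Both examples describe each field to extract with a `title`, a `type`, and a `description`. The sketch below checks a `request_settings` list before it is sent. It assumes all three keys are required, which is inferred from the examples rather than confirmed by the library, and `check_request_settings` is a hypothetical helper, not part of yosoai.

```python
# Keys used by every entry in the README examples (assumed to be required)
REQUIRED_KEYS = ("title", "type", "description")


def check_request_settings(request_settings):
    """Return a list of problems found in request_settings (empty if none)."""
    problems = []
    for index, field in enumerate(request_settings):
        if not isinstance(field, dict):
            problems.append(f"entry {index}: expected a dict, got {type(field).__name__}")
            continue
        for key in REQUIRED_KEYS:
            value = field.get(key)
            # Each key must be present and hold a non-empty string
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {index}: missing or empty '{key}'")
    return problems


request_settings = [
    {"title": "title_news", "type": "str", "description": "Give me the name of the news"},
    {"title": "author", "type": "str"},  # hypothetical incomplete entry
]

for problem in check_request_settings(request_settings):
    print(problem)
```

Running a check like this first avoids spending an API call on a request that is obviously malformed.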

Example of output

Given the following input

    [
        {
            "title": "title",
            "type": "str",
            "description": "Title of the news"
        }
    ]

and using the website https://sport.sky.it/nba?gr=www as input,

the output is a dict like the following:

    {
        'title': 'Booker show with 52 points, whoever has the most games over 50'
    }
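
Since the result is a plain dict, it can be saved with the standard library alone. The following is a minimal sketch, assuming the result has the flat shape shown in the example output; it is not the implementation of the library's own `convert_to_csv.py` or `convert_to_json.py` modules.

```python
import csv
import io
import json

# Result copied from the example output in this README
result = {"title": "Booker show with 52 points, whoever has the most games over 50"}

# JSON: a single object with the keys as returned
json_text = json.dumps(result, ensure_ascii=False, indent=2)

# CSV: keys become the header row, values the single data row
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(result))
writer.writeheader()
writer.writerow(result)
csv_text = buffer.getvalue()

print(json_text)
print(csv_text)
```

The title contains a comma, so the csv module quotes that value automatically; building the CSV line by hand with string joins would split it into two columns.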

Credits

Thanks to:

  • nicolapiazzalunga: for inspiring yosoai/convert_to_csv.py and yosoai/convert_to_json.py functions

Developed by

Vincios, Lurenss, PeriniLab