From b1df16181831d8bc9f5b40595b924f68c213496c Mon Sep 17 00:00:00 2001 From: kahwoo Date: Wed, 8 May 2024 20:39:54 +1000 Subject: [PATCH 1/6] Update examples.rst fix formatting and add other needed models --- docs/source/getting_started/examples.rst | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/source/getting_started/examples.rst b/docs/source/getting_started/examples.rst index 11fb5a05..b6e2eb36 100644 --- a/docs/source/getting_started/examples.rst +++ b/docs/source/getting_started/examples.rst @@ -44,9 +44,12 @@ Local models Remember to have installed in your pc ollama `ollama ` Remember to pull the right model for LLM and for the embeddings, like: + .. code-block:: bash ollama pull llama3 + ollama pull nomic-embed-text + ollama pull mistral After that, you can run the following code, using only your machine resources brum brum brum: From 0ca52b1da672d7e6f126d25c0658f4a114b206d5 Mon Sep 17 00:00:00 2001 From: semantic-release-bot Date: Wed, 8 May 2024 13:52:51 +0000 Subject: [PATCH 2/6] ci(release): 0.10.0 [skip ci] ## [0.10.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.9.0...v0.10.0) (2024-05-08) ### Features * add claude documentation ([5bdee55](https://github.com/VinciGit00/Scrapegraph-ai/commit/5bdee558760521bab818efc6725739e2a0f55d20)) * add gemini embeddings ([79daa4c](https://github.com/VinciGit00/Scrapegraph-ai/commit/79daa4c112e076e9c5f7cd70bbbc6f5e4930832c)) * add llava integration ([019b722](https://github.com/VinciGit00/Scrapegraph-ai/commit/019b7223dc969c87c3c36b6a42a19b4423b5d2af)) * add new hugging_face models ([d5547a4](https://github.com/VinciGit00/Scrapegraph-ai/commit/d5547a450ccd8908f1cf73707142b3481fbc6baa)) * Fix bug for gemini case when embeddings config not passed ([726de28](https://github.com/VinciGit00/Scrapegraph-ai/commit/726de288982700dab8ab9f22af8e26f01c6198a7)) * fixed custom_graphs example and robots_node 
([84fcb44](https://github.com/VinciGit00/Scrapegraph-ai/commit/84fcb44aaa36e84f775884138d04f4a60bb389be)) * multiple graph instances ([dbb614a](https://github.com/VinciGit00/Scrapegraph-ai/commit/dbb614a8dd88d7667fe3daaf0263f5d6e9be1683)) * **node:** multiple url search in SearchGraph + fixes ([930adb3](https://github.com/VinciGit00/Scrapegraph-ai/commit/930adb38f2154ba225342466bfd1846c47df72a0)) * refactoring search function ([aeb1acb](https://github.com/VinciGit00/Scrapegraph-ai/commit/aeb1acbf05e63316c91672c99d88f8a6f338147f)) ### Bug Fixes * bug on .toml ([f7d66f5](https://github.com/VinciGit00/Scrapegraph-ai/commit/f7d66f51818dbdfddd0fa326f26265a3ab686b20)) * **llm:** fixed gemini api_key ([fd01b73](https://github.com/VinciGit00/Scrapegraph-ai/commit/fd01b73b71b515206cfdf51c1d52136293494389)) * **examples:** local, mixed models and fixed SearchGraph embeddings problem ([6b71ec1](https://github.com/VinciGit00/Scrapegraph-ai/commit/6b71ec1d2be953220b6767bc429f4cf6529803fd)) * **examples:** openai std examples ([186c0d0](https://github.com/VinciGit00/Scrapegraph-ai/commit/186c0d035d1d211aff33c38c449f2263d9716a07)) * removed .lock file for deployment ([d4c7d4e](https://github.com/VinciGit00/Scrapegraph-ai/commit/d4c7d4e7fcc2110beadcb2fc91efc657ec6a485c)) ### Docs * update README.md ([17ec992](https://github.com/VinciGit00/Scrapegraph-ai/commit/17ec992b498839e001277e7bc3f0ebea49fbd00d)) ### CI * **release:** 0.10.0-beta.1 [skip ci] ([c47a505](https://github.com/VinciGit00/Scrapegraph-ai/commit/c47a505750ee63e0220b339478953155ef1f1771)) * **release:** 0.10.0-beta.2 [skip ci] ([3f0e069](https://github.com/VinciGit00/Scrapegraph-ai/commit/3f0e0694f3b08463f025586777f7c0594b5ecb14)) * **release:** 0.9.0-beta.2 [skip ci] ([5aa600c](https://github.com/VinciGit00/Scrapegraph-ai/commit/5aa600cb0a85d320ad8dc786af26ffa46dd4d097)) * **release:** 0.9.0-beta.3 [skip ci] ([da8c72c](https://github.com/VinciGit00/Scrapegraph-ai/commit/da8c72ce138bcfe2627924d25a67afcd22cfafd5)) * 
**release:** 0.9.0-beta.4 [skip ci] ([8c5397f](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c5397f67a9f05e0c00f631dd297b5527263a888)) * **release:** 0.9.0-beta.5 [skip ci] ([532adb6](https://github.com/VinciGit00/Scrapegraph-ai/commit/532adb639d58640bc89e8b162903b2ed97be9853)) * **release:** 0.9.0-beta.6 [skip ci] ([8c0b46e](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c0b46eb40b446b270c665c11b2c6508f4d5f4be)) * **release:** 0.9.0-beta.7 [skip ci] ([6911e21](https://github.com/VinciGit00/Scrapegraph-ai/commit/6911e21584767460c59c5a563c3fd010857cbb67)) * **release:** 0.9.0-beta.8 [skip ci] ([739aaa3](https://github.com/VinciGit00/Scrapegraph-ai/commit/739aaa33c39c12e7ab7df8a0656cad140b35c9db)) --- CHANGELOG.md | 42 ++++++++++++++++++++++++++++++++++++++++++ pyproject.toml | 2 +- 2 files changed, 43 insertions(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index bdd1ccf4..03ea0c69 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,45 @@ +## [0.10.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.9.0...v0.10.0) (2024-05-08) + + +### Features + +* add claude documentation ([5bdee55](https://github.com/VinciGit00/Scrapegraph-ai/commit/5bdee558760521bab818efc6725739e2a0f55d20)) +* add gemini embeddings ([79daa4c](https://github.com/VinciGit00/Scrapegraph-ai/commit/79daa4c112e076e9c5f7cd70bbbc6f5e4930832c)) +* add llava integration ([019b722](https://github.com/VinciGit00/Scrapegraph-ai/commit/019b7223dc969c87c3c36b6a42a19b4423b5d2af)) +* add new hugging_face models ([d5547a4](https://github.com/VinciGit00/Scrapegraph-ai/commit/d5547a450ccd8908f1cf73707142b3481fbc6baa)) +* Fix bug for gemini case when embeddings config not passed ([726de28](https://github.com/VinciGit00/Scrapegraph-ai/commit/726de288982700dab8ab9f22af8e26f01c6198a7)) +* fixed custom_graphs example and robots_node ([84fcb44](https://github.com/VinciGit00/Scrapegraph-ai/commit/84fcb44aaa36e84f775884138d04f4a60bb389be)) +* multiple graph instances 
([dbb614a](https://github.com/VinciGit00/Scrapegraph-ai/commit/dbb614a8dd88d7667fe3daaf0263f5d6e9be1683)) +* **node:** multiple url search in SearchGraph + fixes ([930adb3](https://github.com/VinciGit00/Scrapegraph-ai/commit/930adb38f2154ba225342466bfd1846c47df72a0)) +* refactoring search function ([aeb1acb](https://github.com/VinciGit00/Scrapegraph-ai/commit/aeb1acbf05e63316c91672c99d88f8a6f338147f)) + + +### Bug Fixes + +* bug on .toml ([f7d66f5](https://github.com/VinciGit00/Scrapegraph-ai/commit/f7d66f51818dbdfddd0fa326f26265a3ab686b20)) +* **llm:** fixed gemini api_key ([fd01b73](https://github.com/VinciGit00/Scrapegraph-ai/commit/fd01b73b71b515206cfdf51c1d52136293494389)) +* **examples:** local, mixed models and fixed SearchGraph embeddings problem ([6b71ec1](https://github.com/VinciGit00/Scrapegraph-ai/commit/6b71ec1d2be953220b6767bc429f4cf6529803fd)) +* **examples:** openai std examples ([186c0d0](https://github.com/VinciGit00/Scrapegraph-ai/commit/186c0d035d1d211aff33c38c449f2263d9716a07)) +* removed .lock file for deployment ([d4c7d4e](https://github.com/VinciGit00/Scrapegraph-ai/commit/d4c7d4e7fcc2110beadcb2fc91efc657ec6a485c)) + + +### Docs + +* update README.md ([17ec992](https://github.com/VinciGit00/Scrapegraph-ai/commit/17ec992b498839e001277e7bc3f0ebea49fbd00d)) + + +### CI + +* **release:** 0.10.0-beta.1 [skip ci] ([c47a505](https://github.com/VinciGit00/Scrapegraph-ai/commit/c47a505750ee63e0220b339478953155ef1f1771)) +* **release:** 0.10.0-beta.2 [skip ci] ([3f0e069](https://github.com/VinciGit00/Scrapegraph-ai/commit/3f0e0694f3b08463f025586777f7c0594b5ecb14)) +* **release:** 0.9.0-beta.2 [skip ci] ([5aa600c](https://github.com/VinciGit00/Scrapegraph-ai/commit/5aa600cb0a85d320ad8dc786af26ffa46dd4d097)) +* **release:** 0.9.0-beta.3 [skip ci] ([da8c72c](https://github.com/VinciGit00/Scrapegraph-ai/commit/da8c72ce138bcfe2627924d25a67afcd22cfafd5)) +* **release:** 0.9.0-beta.4 [skip ci] 
([8c5397f](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c5397f67a9f05e0c00f631dd297b5527263a888)) +* **release:** 0.9.0-beta.5 [skip ci] ([532adb6](https://github.com/VinciGit00/Scrapegraph-ai/commit/532adb639d58640bc89e8b162903b2ed97be9853)) +* **release:** 0.9.0-beta.6 [skip ci] ([8c0b46e](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c0b46eb40b446b270c665c11b2c6508f4d5f4be)) +* **release:** 0.9.0-beta.7 [skip ci] ([6911e21](https://github.com/VinciGit00/Scrapegraph-ai/commit/6911e21584767460c59c5a563c3fd010857cbb67)) +* **release:** 0.9.0-beta.8 [skip ci] ([739aaa3](https://github.com/VinciGit00/Scrapegraph-ai/commit/739aaa33c39c12e7ab7df8a0656cad140b35c9db)) + ## [0.10.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.10.0-beta.1...v0.10.0-beta.2) (2024-05-08) diff --git a/pyproject.toml b/pyproject.toml index 39b0d030..498ac4c0 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,7 +1,7 @@ [tool.poetry] name = "scrapegraphai" -version = "0.10.0b2" +version = "0.10.0" description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines." 
authors = [ From f8ce3d5916eab926275d59d4d48b0d89ec9cd43f Mon Sep 17 00:00:00 2001 From: mayurdb Date: Fri, 10 May 2024 13:28:53 +0530 Subject: [PATCH 3/6] fix: Augment the information getting fetched from a webpage --- scrapegraphai/nodes/fetch_node.py | 21 ++++++++++++++++--- .../utils/{remover.py => cleanup_html.py} | 11 ++++++---- 2 files changed, 25 insertions(+), 7 deletions(-) rename scrapegraphai/utils/{remover.py => cleanup_html.py} (78%) diff --git a/scrapegraphai/nodes/fetch_node.py b/scrapegraphai/nodes/fetch_node.py index bcd207f3..2667f0be 100644 --- a/scrapegraphai/nodes/fetch_node.py +++ b/scrapegraphai/nodes/fetch_node.py @@ -6,7 +6,9 @@ from typing import List, Optional from langchain_community.document_loaders import AsyncChromiumLoader from langchain_core.documents import Document from .base_node import BaseNode -from ..utils.remover import remover +from ..utils.cleanup_html import cleanup_html +import requests +from bs4 import BeautifulSoup class FetchNode(BaseNode): @@ -32,6 +34,7 @@ class FetchNode(BaseNode): def __init__(self, input: str, output: List[str], node_config: Optional[dict]=None, node_name: str = "Fetch"): super().__init__(node_name, "node", input, output, 1) + self.useSoup = True if node_config is None else node_config.get("useSoup", True) self.headless = True if node_config is None else node_config.get("headless", True) self.verbose = False if node_config is None else node_config.get("verbose", False) @@ -67,10 +70,22 @@ class FetchNode(BaseNode): })] # if it is a local directory elif not source.startswith("http"): - compressed_document = [Document(page_content=remover(source), metadata={ + compressed_document = [Document(page_content=cleanup_html(source), metadata={ "source": "local_dir" })] + elif self.useSoup: + response = requests.get(source) + if response.status_code == 200: + soup = BeautifulSoup(response.text, 'html.parser') + links = soup.find_all('a') + link_urls = [] + for link in links: + if 'href' in link.attrs: + 
link_urls.append(link['href']) + compressed_document = [Document(page_content=cleanup_html(soup.prettify(), link_urls))] + else: + print(f"Failed to retrieve contents from the webpage at url: {url}") else: if self.node_config is not None and self.node_config.get("endpoint") is not None: @@ -87,7 +102,7 @@ class FetchNode(BaseNode): document = loader.load() compressed_document = [ - Document(page_content=remover(str(document[0].page_content)))] + Document(page_content=cleanup_html(str(document[0].page_content)))] state.update({self.output[0]: compressed_document}) return state diff --git a/scrapegraphai/utils/remover.py b/scrapegraphai/utils/cleanup_html.py similarity index 78% rename from scrapegraphai/utils/remover.py rename to scrapegraphai/utils/cleanup_html.py index 5e203249..aab1db65 100644 --- a/scrapegraphai/utils/remover.py +++ b/scrapegraphai/utils/cleanup_html.py @@ -5,7 +5,7 @@ from bs4 import BeautifulSoup from minify_html import minify -def remover(html_content: str) -> str: +def cleanup_html(html_content: str, urls: list = []) -> str: """ Processes HTML content by removing unnecessary tags, minifying the HTML, and extracting the title and body content. @@ -17,7 +17,7 @@ def remover(html_content: str) -> str: Example: >>> html_content = "<html><head><title>Example</title></head><body><p>Hello World!</p></body></html>" - >>> remover(html_content) + >>> cleanup_html(html_content) 'Title: Example, Body: <body><p>Hello World!</p></body>
' This function is particularly useful for preparing HTML content for environments where bandwidth usage needs to be minimized. @@ -35,9 +35,12 @@ def remover(html_content: str) -> str: # Body Extraction (if it exists) body_content = soup.find('body') + urls_content = "" + if urls: + urls_content = f", URLs in page: {urls}" if body_content: # Minify the HTML within the body tag minimized_body = minify(str(body_content)) - return "Title: " + title + ", Body: " + minimized_body + return "Title: " + title + ", Body: " + minimized_body + urls_content - return "Title: " + title + ", Body: No body content found" + return "Title: " + title + ", Body: No body content found" + urls_content From 63c0dd93723c2ab55df0a66b555e7fbb4716ea77 Mon Sep 17 00:00:00 2001 From: semantic-release-bot Date: Fri, 10 May 2024 09:15:24 +0000 Subject: [PATCH 4/6] ci(release): 0.11.0-beta.1 [skip ci] ## [0.11.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.10.0...v0.11.0-beta.1) (2024-05-10) ### Features * Add support for passing pdf path as source ([f10f3b1](https://github.com/VinciGit00/Scrapegraph-ai/commit/f10f3b1438e0c625b7f2fa52faeb5a6c12116113)) * update info ([4ed0fb8](https://github.com/VinciGit00/Scrapegraph-ai/commit/4ed0fb89c3e6068190a7775bedcb6ae65ba59d18)) ### Bug Fixes * add json integration ([0ab31c3](https://github.com/VinciGit00/Scrapegraph-ai/commit/0ab31c3fdbd56652ed306e60109301f60e8042d3)) * Augment the information getting fetched from a webpage ([f8ce3d5](https://github.com/VinciGit00/Scrapegraph-ai/commit/f8ce3d5916eab926275d59d4d48b0d89ec9cd43f)) * fixed bugs for csv and xml ([324e977](https://github.com/VinciGit00/Scrapegraph-ai/commit/324e977b853ecaa55bac4bf86e7cd927f7f43d0d)) * limit python version to < 3.12 ([a37fbbc](https://github.com/VinciGit00/Scrapegraph-ai/commit/a37fbbcbcfc3ddd0cc66f586f279676b52c4abfe)) ### CI * **release:** 0.10.0-beta.3 [skip ci] 
([ad32298](https://github.com/VinciGit00/Scrapegraph-ai/commit/ad32298e70fc626fd62c897e153b806f79dba9b9)) * **release:** 0.10.0-beta.4 [skip ci] ([548bff9](https://github.com/VinciGit00/Scrapegraph-ai/commit/548bff9d77c8b4d2aadee40e966a06cc9d7fd4ab)) * **release:** 0.10.0-beta.5 [skip ci] ([28c9dce](https://github.com/VinciGit00/Scrapegraph-ai/commit/28c9dce7cbda49750172bafd7767fa48a0c33859)) * **release:** 0.10.0-beta.6 [skip ci] ([460d292](https://github.com/VinciGit00/Scrapegraph-ai/commit/460d292af21fabad3fdd2b66110913ccee22ba91)) --- CHANGELOG.md | 23 +++++++++++++++++++++++ pyproject.toml | 2 +- 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index dffb9062..5e781284 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,26 @@ +## [0.11.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.10.0...v0.11.0-beta.1) (2024-05-10) + + +### Features + +* Add support for passing pdf path as source ([f10f3b1](https://github.com/VinciGit00/Scrapegraph-ai/commit/f10f3b1438e0c625b7f2fa52faeb5a6c12116113)) +* update info ([4ed0fb8](https://github.com/VinciGit00/Scrapegraph-ai/commit/4ed0fb89c3e6068190a7775bedcb6ae65ba59d18)) + + +### Bug Fixes + +* add json integration ([0ab31c3](https://github.com/VinciGit00/Scrapegraph-ai/commit/0ab31c3fdbd56652ed306e60109301f60e8042d3)) +* Augment the information getting fetched from a webpage ([f8ce3d5](https://github.com/VinciGit00/Scrapegraph-ai/commit/f8ce3d5916eab926275d59d4d48b0d89ec9cd43f)) +* fixed bugs for csv and xml ([324e977](https://github.com/VinciGit00/Scrapegraph-ai/commit/324e977b853ecaa55bac4bf86e7cd927f7f43d0d)) +* limit python version to < 3.12 ([a37fbbc](https://github.com/VinciGit00/Scrapegraph-ai/commit/a37fbbcbcfc3ddd0cc66f586f279676b52c4abfe)) + + +### CI + +* **release:** 0.10.0-beta.3 [skip ci] ([ad32298](https://github.com/VinciGit00/Scrapegraph-ai/commit/ad32298e70fc626fd62c897e153b806f79dba9b9)) +* **release:** 0.10.0-beta.4 [skip ci] 
([548bff9](https://github.com/VinciGit00/Scrapegraph-ai/commit/548bff9d77c8b4d2aadee40e966a06cc9d7fd4ab)) +* **release:** 0.10.0-beta.5 [skip ci] ([28c9dce](https://github.com/VinciGit00/Scrapegraph-ai/commit/28c9dce7cbda49750172bafd7767fa48a0c33859)) +* **release:** 0.10.0-beta.6 [skip ci] ([460d292](https://github.com/VinciGit00/Scrapegraph-ai/commit/460d292af21fabad3fdd2b66110913ccee22ba91)) ### Bug Fixes diff --git a/pyproject.toml b/pyproject.toml index 9cd6f618..074aedcc 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,7 +1,7 @@ [tool.poetry] name = "scrapegraphai" -version = "0.10.0b6" +version = "0.11.0b1" description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines." authors = [ From 864aa91326c360992326e04811d272e55eac8355 Mon Sep 17 00:00:00 2001 From: Marco Perini Date: Fri, 10 May 2024 15:11:54 +0200 Subject: [PATCH 5/6] feat: revert fetch_node --- scrapegraphai/nodes/fetch_node.py | 23 ++++--------------- scrapegraphai/utils/__init__.py | 1 + .../utils/{cleanup_html.py => remover.py} | 11 ++++----- 3 files changed, 9 insertions(+), 26 deletions(-) rename scrapegraphai/utils/{cleanup_html.py => remover.py} (78%) diff --git a/scrapegraphai/nodes/fetch_node.py b/scrapegraphai/nodes/fetch_node.py index eeb2d0b4..3eabc66f 100644 --- a/scrapegraphai/nodes/fetch_node.py +++ b/scrapegraphai/nodes/fetch_node.py @@ -8,9 +8,7 @@ from langchain_community.document_loaders import AsyncChromiumLoader from langchain_core.documents import Document from langchain_community.document_loaders import PyPDFLoader from .base_node import BaseNode -from ..utils.cleanup_html import cleanup_html -import requests -from bs4 import BeautifulSoup +from ..utils.remover import remover class FetchNode(BaseNode): @@ -36,7 +34,6 @@ class FetchNode(BaseNode): def __init__(self, input: str, output: List[str], node_config: Optional[dict] = None, node_name: str = "Fetch"): super().__init__(node_name, "node", input, output, 
1) - self.headless = True if node_config is None else node_config.get( "headless", True) self.verbose = False if node_config is None else node_config.get( @@ -97,22 +94,10 @@ class FetchNode(BaseNode): pass elif not source.startswith("http"): - compressed_document = [Document(page_content=cleanup_html(source), metadata={ + compressed_document = [Document(page_content=remover(source), metadata={ "source": "local_dir" })] - elif self.useSoup: - response = requests.get(source) - if response.status_code == 200: - soup = BeautifulSoup(response.text, 'html.parser') - links = soup.find_all('a') - link_urls = [] - for link in links: - if 'href' in link.attrs: - link_urls.append(link['href']) - compressed_document = [Document(page_content=cleanup_html(soup.prettify(), link_urls))] - else: - print(f"Failed to retrieve contents from the webpage at url: {url}") else: if self.node_config is not None and self.node_config.get("endpoint") is not None: @@ -129,7 +114,7 @@ class FetchNode(BaseNode): document = loader.load() compressed_document = [ - Document(page_content=cleanup_html(str(document[0].page_content)))] + Document(page_content=remover(str(document[0].page_content)))] state.update({self.output[0]: compressed_document}) - return state + return state \ No newline at end of file diff --git a/scrapegraphai/utils/__init__.py b/scrapegraphai/utils/__init__.py index 0aee7839..218506f3 100644 --- a/scrapegraphai/utils/__init__.py +++ b/scrapegraphai/utils/__init__.py @@ -6,3 +6,4 @@ from .convert_to_csv import convert_to_csv from .convert_to_json import convert_to_json from .prettify_exec_info import prettify_exec_info from .proxy_rotation import proxy_generator +from .remover import remover diff --git a/scrapegraphai/utils/cleanup_html.py b/scrapegraphai/utils/remover.py similarity index 78% rename from scrapegraphai/utils/cleanup_html.py rename to scrapegraphai/utils/remover.py index aab1db65..c5a0507b 100644 --- a/scrapegraphai/utils/cleanup_html.py +++ 
b/scrapegraphai/utils/remover.py @@ -5,7 +5,7 @@ from bs4 import BeautifulSoup from minify_html import minify -def cleanup_html(html_content: str, urls: list = []) -> str: +def remover(html_content: str) -> str: """ Processes HTML content by removing unnecessary tags, minifying the HTML, and extracting the title and body content. @@ -17,7 +17,7 @@ def cleanup_html(html_content: str, urls: list = []) -> str: Example: >>> html_content = "<html><head><title>Example</title></head><body><p>Hello World!</p></body></html>" - >>> cleanup_html(html_content) + >>> remover(html_content) 'Title: Example, Body: <body><p>Hello World!</p></body>
' This function is particularly useful for preparing HTML content for environments where bandwidth usage needs to be minimized. @@ -35,12 +35,9 @@ def cleanup_html(html_content: str, urls: list = []) -> str: # Body Extraction (if it exists) body_content = soup.find('body') - urls_content = "" - if urls: - urls_content = f", URLs in page: {urls}" if body_content: # Minify the HTML within the body tag minimized_body = minify(str(body_content)) - return "Title: " + title + ", Body: " + minimized_body + urls_content + return "Title: " + title + ", Body: " + minimized_body - return "Title: " + title + ", Body: No body content found" + urls_content + return "Title: " + title + ", Body: No body content found" \ No newline at end of file From 7ae50c035e87be9a3d7b5eef42232dae6e345914 Mon Sep 17 00:00:00 2001 From: semantic-release-bot Date: Fri, 10 May 2024 13:13:20 +0000 Subject: [PATCH 6/6] ci(release): 0.11.0-beta.2 [skip ci] ## [0.11.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.11.0-beta.1...v0.11.0-beta.2) (2024-05-10) ### Features * revert fetch_node ([864aa91](https://github.com/VinciGit00/Scrapegraph-ai/commit/864aa91326c360992326e04811d272e55eac8355)) --- CHANGELOG.md | 7 +++++++ pyproject.toml | 2 +- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 5e781284..4d89d3f4 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,10 @@ +## [0.11.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.11.0-beta.1...v0.11.0-beta.2) (2024-05-10) + + +### Features + +* revert fetch_node ([864aa91](https://github.com/VinciGit00/Scrapegraph-ai/commit/864aa91326c360992326e04811d272e55eac8355)) + ## [0.11.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.10.0...v0.11.0-beta.1) (2024-05-10) diff --git a/pyproject.toml b/pyproject.toml index 074aedcc..df00dfce 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,7 +1,7 @@ [tool.poetry] name = "scrapegraphai" -version = "0.11.0b1" +version 
= "0.11.0b2" description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines." authors = [