2256 Commits

Author SHA1 Message Date
Umut CAN
827f7260ad This commit focuses on optimizing the utility modules in the codebase for better performance and maintainability. Key improvements include:
- More efficient HTML processing with combined regex operations and optimized tag handling
- Enhanced deep copy functionality with better type handling and optimized recursion
- Refactored web search with improved error handling and modular helper functions

The changes maintain all existing functionality while improving code quality, performance, and maintainability. Documentation and type hints have been enhanced throughout.
Optimize utils modules for better performance and maintainability

- Improve HTML cleanup and minification:
  - Combine regex operations for better performance
  - Add better error handling for HTML processing
  - Optimize tag removal and attribute filtering

- Enhance deep copy functionality:
  - Add special case handling for primitive types
  - Improve type checking and error handling
  - Optimize recursive copying for collections

- Refactor web search functionality:
  - Add input validation and error handling
  - Split search logic into separate helper functions
  - Improve proxy handling and configuration
  - Add better timeout and error management
  - Optimize URL filtering and processing

Technical improvements:
- Better type hints and documentation
- More efficient data structures
- Improved error handling and validation
- Reduced code duplication
- Better separation of concerns

No breaking changes - all existing functionality maintained
2024-10-28 22:40:32 +03:00
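The deep copy items in the commit above (a fast path for primitive types, recursive copying for collections, better error handling) could look roughly like the sketch below. This is an illustration only, not the code from commit 827f7260ad: the function name `safe_deepcopy_sketch`, the set of types treated as primitives, and the fallback of returning the original reference when an object cannot be copied are all assumptions.

```python
import copy
from typing import Any

# Immutable built-ins that can be returned as-is (assumed fast path).
_PRIMITIVES = (int, float, str, bool, bytes, type(None))


def safe_deepcopy_sketch(obj: Any) -> Any:
    """Hypothetical helper; not the project's actual function."""
    kind = type(obj)

    # Special case: primitives are immutable, so no copy is needed.
    if kind in _PRIMITIVES:
        return obj

    # Recurse into plain built-in collections (exact types only, so that
    # subclasses such as namedtuples go through copy.deepcopy below).
    if kind is list:
        return [safe_deepcopy_sketch(item) for item in obj]
    if kind is tuple:
        return tuple(safe_deepcopy_sketch(item) for item in obj)
    if kind is set:
        return {safe_deepcopy_sketch(item) for item in obj}
    if kind is dict:
        return {key: safe_deepcopy_sketch(value) for key, value in obj.items()}

    # Everything else: try the standard library, and share the reference
    # if the object cannot be copied (assumed policy, e.g. for locks).
    try:
        return copy.deepcopy(obj)
    except (TypeError, ValueError):
        return obj
```

With this sketch, copying a nested dict or list yields independent containers, while an uncopyable object such as a `threading.Lock` is passed through unchanged rather than raising.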
Marco Vinciguerra
2d91848b76 a
2024-10-28 14:16:47 +01:00
Marco Vinciguerra
15415eebbc Update funding.json 2024-10-28 14:15:24 +01:00
Marco Vinciguerra
bc19e898ae Update funding.json 2024-10-28 14:07:28 +01:00
Marco Vinciguerra
5ed28976d2 Update funding.json 2024-10-28 14:05:47 +01:00
Marco Vinciguerra
6418479f49 Update funding.json 2024-10-28 13:59:46 +01:00
Marco Vinciguerra
e97add5daf Update funding.json 2024-10-28 13:58:01 +01:00
Marco Vinciguerra
8a69fb5ccc Update funding.json 2024-10-28 13:55:23 +01:00
Marco Vinciguerra
300fd5ac5b Create funding.json 2024-10-28 13:38:56 +01:00
Marco Vinciguerra
eb24da5a8d Update overview.rst
2024-10-26 10:29:37 +02:00
Marco Vinciguerra
a7df68490e Merge branch 'main' of https://github.com/ScrapeGraphAI/Scrapegraph-ai 2024-10-26 10:27:55 +02:00
Marco Vinciguerra
849fe395da update doc 2024-10-26 10:27:53 +02:00
semantic-release-bot
3933d64601 ci(release): 1.27.0 [skip ci]
## [1.27.0](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.26.7...v1.27.0) (2024-10-26)

### Features

* add conditional node structure to the smart_scraper_graph and implement a structured way to check conditions ([cacd9cd](cacd9cde00))
* add integration with scrape.do ([ae275ec](ae275ec5e8))
* add model integration gpt4 ([51c55eb](51c55eb3a2))
* implement ScrapeGraph class for only web scraping automation ([612c644](612c644623))
* Implement SmartScraperMultiParseMergeFirstGraph class that scrapes a list of URLs, merges the content first, and then generates answers to a given prompt. ([3e3e1b2](3e3e1b2f3a))
* refactoring of export functions ([0ea00c0](0ea00c078f))
* refactoring of get_probable_tags node ([f658092](f658092dff))
* refactoring of ScrapeGraph to SmartScraperLiteGraph ([52b6bf5](52b6bf5fb8))

### Bug Fixes

* fix export function ([c8a000f](c8a000f1d9))
* fix the example variable name ([69ff649](69ff649556))
* remove variable "max_result" not being used in the code ([e76a68a](e76a68a782))

### chore

* fix example ([9cd9a87](9cd9a874f9))

### Test

* Add scrape_graph test ([cdb3c11](cdb3c1100e))
* Add smart_scraper_multi_parse_merge_first_graph test ([464b8b0](464b8b04ea))

### CI

* **release:** 1.26.6-beta.1 [skip ci] ([e0fc457](e0fc457d1a))
* **release:** 1.27.0-beta.1 [skip ci] ([9266a36](9266a36b2e))
* **release:** 1.27.0-beta.10 [skip ci] ([eee131e](eee131e959))
* **release:** 1.27.0-beta.2 [skip ci] ([d84d295](d84d295389))
* **release:** 1.27.0-beta.3 [skip ci] ([f576afa](f576afaf0c))
* **release:** 1.27.0-beta.4 [skip ci] ([3d6bbcd](3d6bbcdaa3))
* **release:** 1.27.0-beta.5 [skip ci] ([5002c71](5002c713d5))
* **release:** 1.27.0-beta.6 [skip ci] ([94b9836](94b9836ef6))
* **release:** 1.27.0-beta.7 [skip ci] ([407f1ce](407f1ce4eb))
* **release:** 1.27.0-beta.8 [skip ci] ([4f1ed93](4f1ed939e6))
* **release:** 1.27.0-beta.9 [skip ci] ([fd57cc7](fd57cc7c12))
2024-10-26 08:06:36 +00:00
Marco Vinciguerra
b7d5a20ae0
Merge pull request #764 from ScrapeGraphAI/pre/beta 2024-10-26 10:05:15 +02:00
semantic-release-bot
eee131e959 ci(release): 1.27.0-beta.10 [skip ci]
## [1.27.0-beta.10](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.27.0-beta.9...v1.27.0-beta.10) (2024-10-25)

### Bug Fixes

* fix export function ([c8a000f](c8a000f1d9))
2024-10-25 06:45:23 +00:00
Marco Vinciguerra
f9c1432342
Merge pull request #767 from ScrapeGraphAI/fix-export-function 2024-10-25 08:43:40 +02:00
semantic-release-bot
fd57cc7c12 ci(release): 1.27.0-beta.9 [skip ci]
## [1.27.0-beta.9](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.27.0-beta.8...v1.27.0-beta.9) (2024-10-24)

### Features

* add model integration gpt4 ([51c55eb](51c55eb3a2))
2024-10-24 22:39:44 +00:00
Marco Vinciguerra
9e5e76abbb
Merge pull request #765 from ScrapeGraphAI/add-model-integration-for-images
feat: add model integration gpt4
2024-10-25 00:38:16 +02:00
Marco Vinciguerra
4cd5ef296e add docstring files
2024-10-24 15:28:27 +02:00
Marco Vinciguerra
6179ab99a4 Update data_export.py 2024-10-24 15:20:36 +02:00
Marco Vinciguerra
c8a000f1d9 fix: fix export function 2024-10-24 10:11:36 +02:00
Marco Vinciguerra
51c55eb3a2 feat: add model integration gpt4 2024-10-24 09:10:51 +02:00
semantic-release-bot
4f1ed939e6 ci(release): 1.27.0-beta.8 [skip ci]
## [1.27.0-beta.8](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.27.0-beta.7...v1.27.0-beta.8) (2024-10-24)

### Bug Fixes

* removed tokenizer ([a184716](a18471688f))

### CI

* **release:** 1.26.7 [skip ci] ([ec9ef2b](ec9ef2bcda))
2024-10-24 06:55:58 +00:00
Marco Vinciguerra
066e77dbe7
Merge branch 'main' into pre/beta 2024-10-24 08:54:17 +02:00
semantic-release-bot
407f1ce4eb ci(release): 1.27.0-beta.7 [skip ci]
## [1.27.0-beta.7](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.27.0-beta.6...v1.27.0-beta.7) (2024-10-24)

### Features

* refactoring of get_probable_tags node ([f658092](f658092dff))
2024-10-24 06:45:14 +00:00
Marco Vinciguerra
a1bd05da10
Merge pull request #763 from ScrapeGraphAI/refactoring-get-probable-tags
feat: refactoring of get_probable_tags node
2024-10-24 08:43:49 +02:00
Marco Vinciguerra
f658092dff feat: refactoring of get_probable_tags node 2024-10-23 12:15:16 +02:00
semantic-release-bot
94b9836ef6 ci(release): 1.27.0-beta.6 [skip ci]
## [1.27.0-beta.6](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.27.0-beta.5...v1.27.0-beta.6) (2024-10-23)

### Features

* add integration with scrape.do ([ae275ec](ae275ec5e8))
2024-10-23 10:09:36 +00:00
Marco Vinciguerra
ae275ec5e8 feat: add integration with scrape.do 2024-10-23 12:08:00 +02:00
semantic-release-bot
5002c713d5 ci(release): 1.27.0-beta.5 [skip ci]
## [1.27.0-beta.5](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.27.0-beta.4...v1.27.0-beta.5) (2024-10-22)

### Features

* refactoring of export functions ([0ea00c0](0ea00c078f))
2024-10-22 07:06:26 +00:00
Marco Vinciguerra
34d2964f08
Merge pull request #761 from ScrapeGraphAI/refactoring-export-functions
feat: refactoring of export functions
2024-10-22 09:04:57 +02:00
Marco Vinciguerra
11ae717623 add new doc
2024-10-21 11:16:29 +02:00
Marco Vinciguerra
0ea00c078f feat: refactoring of export functions 2024-10-21 10:30:21 +02:00
semantic-release-bot
3d6bbcdaa3 ci(release): 1.27.0-beta.4 [skip ci]
## [1.27.0-beta.4](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.27.0-beta.3...v1.27.0-beta.4) (2024-10-21)

### Features

* refactoring of ScrapeGraph to SmartScraperLiteGraph ([52b6bf5](52b6bf5fb8))
2024-10-21 08:14:25 +00:00
Marco Vinciguerra
52b6bf5fb8 feat: refactoring of ScrapeGraph to SmartScraperLiteGraph 2024-10-21 10:12:53 +02:00
Marco Vinciguerra
b84883bfd1 add smartscraper lite 2024-10-21 09:39:17 +02:00
Marco Vinciguerra
2991ca8dd2 add examples smart scraper lite 2024-10-21 09:33:40 +02:00
semantic-release-bot
f576afaf0c ci(release): 1.27.0-beta.3 [skip ci]
## [1.27.0-beta.3](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.27.0-beta.2...v1.27.0-beta.3) (2024-10-20)

### Features

* implement ScrapeGraph class for only web scraping automation ([612c644](612c644623))
* Implement SmartScraperMultiParseMergeFirstGraph class that scrapes a list of URLs, merges the content first, and then generates answers to a given prompt. ([3e3e1b2](3e3e1b2f3a))

### Bug Fixes

* fix the example variable name ([69ff649](69ff649556))

### chore

* fix example ([9cd9a87](9cd9a874f9))

### Test

* Add scrape_graph test ([cdb3c11](cdb3c1100e))
* Add smart_scraper_multi_parse_merge_first_graph test ([464b8b0](464b8b04ea))
2024-10-20 08:15:19 +00:00
Marco Vinciguerra
ffa1067f0d
Merge pull request #756 from shenghongtw/pre/beta
The smart_scraper_multi_graph method is too expensive
2024-10-20 10:13:47 +02:00
Marco Vinciguerra
b912904313
Merge pull request #758 from ScrapeGraphAI/fix-together-ai
chore: fix example
2024-10-19 07:25:57 +02:00
semantic-release-bot
ec9ef2bcda ci(release): 1.26.7 [skip ci]
## [1.26.7](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.26.6...v1.26.7) (2024-10-19)

### Bug Fixes

* removed tokenizer ([a184716](a18471688f))
2024-10-19 05:20:39 +00:00
Marco Vinciguerra
a18471688f fix: removed tokenizer 2024-10-19 07:18:56 +02:00
Federico Aguzzi
9cd9a874f9 chore: fix example
Committing even though this is not the bug we were looking for
2024-10-18 22:35:33 +02:00
semantic-release-bot
d84d295389 ci(release): 1.27.0-beta.2 [skip ci]
## [1.27.0-beta.2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.27.0-beta.1...v1.27.0-beta.2) (2024-10-18)

### Bug Fixes

* refactoring of gpt2 tokenizer ([44c3f9c](44c3f9c989))

### CI

* **release:** 1.26.6 [skip ci] ([a4634c7](a4634c7331))
2024-10-18 20:18:25 +00:00
Federico Aguzzi
8cb9646a45 Merge branch 'main' into pre/beta 2024-10-18 22:16:39 +02:00
Marco Vinciguerra
58b11334d3 Merge branch 'main' of https://github.com/ScrapeGraphAI/Scrapegraph-ai 2024-10-18 17:11:36 +02:00
Marco Vinciguerra
3f71f103a7 scrape do key added 2024-10-18 17:11:33 +02:00
semantic-release-bot
a4634c7331 ci(release): 1.26.6 [skip ci]
## [1.26.6](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.26.5...v1.26.6) (2024-10-18)

### Bug Fixes

* refactoring of gpt2 tokenizer ([44c3f9c](44c3f9c989))
2024-10-18 07:00:26 +00:00
Marco Vinciguerra
44c3f9c989 fix: refactoring of gpt2 tokenizer 2024-10-18 08:58:53 +02:00
Marco Vinciguerra
bde1e0fbad
Merge pull request #757 from yusefes/fix-tokenizer-loading
Fix tokenizer loading for GPT2
2024-10-18 08:57:42 +02:00