mirror of
https://github.com/VinciGit00/Scrapegraph-ai.git
synced 2026-06-04 21:01:04 +08:00
Add comprehensive timeout feature documentation
Co-authored-by: VinciGit00 <88108002+VinciGit00@users.noreply.github.com>
This commit is contained in:
parent
9439fe5932
commit
323f26a7a5
292
docs/timeout_configuration.md
Normal file
292
docs/timeout_configuration.md
Normal file
@ -0,0 +1,292 @@
|
||||
# FetchNode Timeout Configuration
|
||||
|
||||
## Overview
|
||||
|
||||
The `FetchNode` in ScrapeGraphAI supports configurable timeouts for all blocking operations to prevent indefinite hangs when fetching web content or parsing files. This feature allows you to control execution time limits for:
|
||||
|
||||
- HTTP requests (when using `use_soup=True`)
|
||||
- PDF file parsing
|
||||
- ChromiumLoader operations
|
||||
|
||||
## Configuration
|
||||
|
||||
### Default Behavior
|
||||
|
||||
By default, `FetchNode` uses a **30-second timeout** for all blocking operations when a `node_config` is provided:
|
||||
|
||||
```python
|
||||
from scrapegraphai.nodes import FetchNode
|
||||
|
||||
# Default 30-second timeout
|
||||
node = FetchNode(
|
||||
input="url",
|
||||
output=["doc"],
|
||||
node_config={}
|
||||
)
|
||||
```
|
||||
|
||||
### Custom Timeout
|
||||
|
||||
You can specify a custom timeout value (in seconds) via the `timeout` parameter:
|
||||
|
||||
```python
|
||||
# Custom 10-second timeout
|
||||
node = FetchNode(
|
||||
input="url",
|
||||
output=["doc"],
|
||||
node_config={"timeout": 10}
|
||||
)
|
||||
```
|
||||
|
||||
### Disabling Timeout
|
||||
|
||||
To disable timeout and allow operations to run indefinitely, set `timeout` to `None`:
|
||||
|
||||
```python
|
||||
# No timeout - operations will wait indefinitely
|
||||
node = FetchNode(
|
||||
input="url",
|
||||
output=["doc"],
|
||||
node_config={"timeout": None}
|
||||
)
|
||||
```
|
||||
|
||||
### No Configuration
|
||||
|
||||
If you don't provide any `node_config`, the timeout defaults to `None` (no timeout):
|
||||
|
||||
```python
|
||||
# No timeout (backward compatible)
|
||||
node = FetchNode(
|
||||
input="url",
|
||||
output=["doc"],
|
||||
node_config=None
|
||||
)
|
||||
```
|
||||
|
||||
## Use Cases
|
||||
|
||||
### HTTP Requests
|
||||
|
||||
When `use_soup=True`, the timeout applies to `requests.get()` calls:
|
||||
|
||||
```python
|
||||
node = FetchNode(
|
||||
input="url",
|
||||
output=["doc"],
|
||||
node_config={
|
||||
"use_soup": True,
|
||||
"timeout": 15 # HTTP request will timeout after 15 seconds
|
||||
}
|
||||
)
|
||||
|
||||
state = {"url": "https://example.com"}
|
||||
result = node.execute(state)
|
||||
```
|
||||
|
||||
If the timeout is `None`, no timeout parameter is passed to `requests.get()`:
|
||||
|
||||
```python
|
||||
node = FetchNode(
|
||||
input="url",
|
||||
output=["doc"],
|
||||
node_config={
|
||||
"use_soup": True,
|
||||
"timeout": None # No timeout for HTTP requests
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
### PDF Parsing
|
||||
|
||||
The timeout applies to PDF file parsing operations using `PyPDFLoader`:
|
||||
|
||||
```python
|
||||
node = FetchNode(
|
||||
input="pdf",
|
||||
output=["doc"],
|
||||
node_config={
|
||||
"timeout": 60 # PDF parsing will timeout after 60 seconds
|
||||
}
|
||||
)
|
||||
|
||||
state = {"pdf": "/path/to/large_document.pdf"}
|
||||
try:
|
||||
result = node.execute(state)
|
||||
except TimeoutError as e:
|
||||
print(f"PDF parsing took too long: {e}")
|
||||
```
|
||||
|
||||
If parsing exceeds the timeout, a `TimeoutError` is raised with a descriptive message:
|
||||
|
||||
```
|
||||
TimeoutError: PDF parsing exceeded timeout of 60 seconds
|
||||
```
|
||||
|
||||
### ChromiumLoader
|
||||
|
||||
The timeout is automatically propagated to `ChromiumLoader` via `loader_kwargs`:
|
||||
|
||||
```python
|
||||
node = FetchNode(
|
||||
input="url",
|
||||
output=["doc"],
|
||||
node_config={
|
||||
"timeout": 30, # ChromiumLoader will use 30-second timeout
|
||||
"headless": True
|
||||
}
|
||||
)
|
||||
|
||||
state = {"url": "https://example.com"}
|
||||
result = node.execute(state)
|
||||
```
|
||||
|
||||
If you need different timeout behavior for ChromiumLoader specifically, you can override it in `loader_kwargs`:
|
||||
|
||||
```python
|
||||
node = FetchNode(
|
||||
input="url",
|
||||
output=["doc"],
|
||||
node_config={
|
||||
"timeout": 30, # General timeout for other operations
|
||||
"loader_kwargs": {
|
||||
"timeout": 60 # ChromiumLoader gets 60-second timeout
|
||||
}
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
## Graph Examples
|
||||
|
||||
### SmartScraperGraph
|
||||
|
||||
```python
|
||||
from scrapegraphai.graphs import SmartScraperGraph
|
||||
|
||||
graph_config = {
|
||||
"llm": {
|
||||
"model": "gpt-3.5-turbo",
|
||||
"api_key": "your-api-key"
|
||||
},
|
||||
"timeout": 20 # 20-second timeout for fetch operations
|
||||
}
|
||||
|
||||
smart_scraper = SmartScraperGraph(
|
||||
prompt="Extract all article titles",
|
||||
source="https://news.example.com",
|
||||
config=graph_config
|
||||
)
|
||||
|
||||
result = smart_scraper.run()
|
||||
```
|
||||
|
||||
### Custom Graph with FetchNode
|
||||
|
||||
```python
|
||||
from scrapegraphai.nodes import FetchNode
|
||||
from langgraph.graph import StateGraph
|
||||
|
||||
# Create a custom graph with timeout
|
||||
fetch_node = FetchNode(
|
||||
input="url",
|
||||
output=["doc"],
|
||||
node_config={
|
||||
"timeout": 15,
|
||||
"headless": True
|
||||
}
|
||||
)
|
||||
|
||||
# Add to graph...
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Choose appropriate timeouts**: Consider the expected response time of your target websites
|
||||
- Fast APIs: 5-10 seconds
|
||||
- Regular websites: 15-30 seconds
|
||||
- Large PDFs or slow sites: 60+ seconds
|
||||
|
||||
2. **Handle TimeoutError**: Always wrap your code in try-except when using timeouts:
|
||||
|
||||
```python
|
||||
try:
|
||||
result = node.execute(state)
|
||||
except TimeoutError as e:
|
||||
logger.error(f"Operation timed out: {e}")
|
||||
# Handle timeout gracefully
|
||||
```
|
||||
|
||||
3. **Use different timeouts for different operations**: Set higher timeouts for PDF parsing and lower for HTTP requests:
|
||||
|
||||
```python
|
||||
# For PDFs
|
||||
pdf_node = FetchNode("pdf", ["doc"], {"timeout": 120})
|
||||
|
||||
# For web pages
|
||||
web_node = FetchNode("url", ["doc"], {"timeout": 15})
|
||||
```
|
||||
|
||||
4. **Monitor timeout occurrences**: Log timeout errors to identify problematic sources:
|
||||
|
||||
```python
|
||||
import logging
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
try:
|
||||
result = node.execute(state)
|
||||
except TimeoutError as e:
|
||||
logger.warning(f"Timeout for {state.get('url', 'unknown')}: {e}")
|
||||
```
|
||||
|
||||
## Implementation Details
|
||||
|
||||
The timeout feature is implemented using:
|
||||
|
||||
- **HTTP requests**: `requests.get(url, timeout=X)` parameter
|
||||
- **PDF parsing**: `concurrent.futures.ThreadPoolExecutor` with `future.result(timeout=X)`
|
||||
- **ChromiumLoader**: Propagated via `loader_kwargs` dictionary
|
||||
|
||||
When `timeout=None`, no timeout constraints are applied, allowing operations to run until completion.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Timeout is too short
|
||||
|
||||
If you're seeing frequent timeout errors, increase the timeout value:
|
||||
|
||||
```python
|
||||
node_config = {"timeout": 60} # Increase from 30 to 60 seconds
|
||||
```
|
||||
|
||||
### Need different timeouts for different operations
|
||||
|
||||
Use separate FetchNode instances with different configurations:
|
||||
|
||||
```python
|
||||
fast_fetcher = FetchNode("url", ["doc"], {"timeout": 10})
|
||||
slow_fetcher = FetchNode("pdf", ["doc"], {"timeout": 120})
|
||||
```
|
||||
|
||||
### ChromiumLoader timeout not working
|
||||
|
||||
Ensure you're not overriding the timeout in `loader_kwargs`:
|
||||
|
||||
```python
|
||||
# ❌ Wrong - explicit loader_kwargs timeout overrides node timeout
|
||||
node_config = {
|
||||
"timeout": 30,
|
||||
"loader_kwargs": {"timeout": 10} # This takes precedence
|
||||
}
|
||||
|
||||
# ✅ Correct - let node timeout propagate
|
||||
node_config = {
|
||||
"timeout": 30 # ChromiumLoader will use 30 seconds
|
||||
}
|
||||
```
|
||||
|
||||
## See Also
|
||||
|
||||
- [FetchNode API Documentation](../api/nodes/fetch_node.md)
|
||||
- [Graph Configuration](./graph_configuration.md)
|
||||
- [Error Handling](./error_handling.md)
|
||||
Loading…
Reference in New Issue
Block a user