Add comprehensive timeout feature documentation

Co-authored-by: VinciGit00 <88108002+VinciGit00@users.noreply.github.com>
2026-06-04 21:01:04 +08:00 · 2025-11-26 17:36:16 +00:00 · 2025-11-26 17:36:16 +00:00 · 323f26a7a5
commit 323f26a7a5
parent 9439fe5932
1 changed files with 292 additions and 0 deletions
--- a/docs/timeout_configuration.md
+++ b/docs/timeout_configuration.md
@@ -0,0 +1,292 @@
+# FetchNode Timeout Configuration
+
+## Overview
+
+The `FetchNode` in ScrapeGraphAI supports configurable timeouts for all blocking operations to prevent indefinite hangs when fetching web content or parsing files. This feature allows you to control execution time limits for:
+
+- HTTP requests (when using `use_soup=True`)
+- PDF file parsing
+- ChromiumLoader operations
+
+## Configuration
+
+### Default Behavior
+
+By default, `FetchNode` applies a **30-second timeout** to all blocking operations whenever a `node_config` is provided without a `timeout` key:
+
+```python
+from scrapegraphai.nodes import FetchNode
+
+# Default 30-second timeout
+node = FetchNode(
+    input="url",
+    output=["doc"],
+    node_config={}
+)
+```
+
+### Custom Timeout
+
+You can specify a custom timeout value (in seconds) via the `timeout` parameter:
+
+```python
+# Custom 10-second timeout
+node = FetchNode(
+    input="url",
+    output=["doc"],
+    node_config={"timeout": 10}
+)
+```
+
+### Disabling Timeout
+
+To disable the timeout and let operations run indefinitely, set `timeout` to `None`:
+
+```python
+# No timeout - operations will wait indefinitely
+node = FetchNode(
+    input="url",
+    output=["doc"],
+    node_config={"timeout": None}
+)
+```
+
+### No Configuration
+
+If you don't provide any `node_config`, the timeout defaults to `None` (no timeout):
+
+```python
+# No timeout (backward compatible)
+node = FetchNode(
+    input="url",
+    output=["doc"],
+    node_config=None
+)
+```
+
+## Use Cases
+
+### HTTP Requests
+
+When `use_soup=True`, the timeout applies to `requests.get()` calls:
+
+```python
+node = FetchNode(
+    input="url",
+    output=["doc"],
+    node_config={
+        "use_soup": True,
+        "timeout": 15  # HTTP request will time out after 15 seconds
+    }
+)
+
+state = {"url": "https://example.com"}
+result = node.execute(state)
+```
+
+If the timeout is `None`, no timeout parameter is passed to `requests.get()`:
+
+```python
+node = FetchNode(
+    input="url",
+    output=["doc"],
+    node_config={
+        "use_soup": True,
+        "timeout": None  # No timeout for HTTP requests
+    }
+)
+```
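+
+One way to express this rule is a small helper that builds the keyword arguments for the HTTP call. This is a sketch under assumptions: `request_kwargs` is a hypothetical name, not part of ScrapeGraphAI, and the `requests.get` call appears only in a comment.
+
+```python
+def request_kwargs(timeout):
+    """Build keyword arguments for an HTTP call (hypothetical helper).
+
+    The timeout key is omitted entirely when timeout is None.
+    """
+    return {} if timeout is None else {"timeout": timeout}
+
+
+# Intended use, assuming the requests library is installed:
+#     requests.get(url, **request_kwargs(15))
+```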
+
+### PDF Parsing
+
+The timeout applies to PDF file parsing operations using `PyPDFLoader`:
+
+```python
+node = FetchNode(
+    input="pdf",
+    output=["doc"],
+    node_config={
+        "timeout": 60  # PDF parsing will time out after 60 seconds
+    }
+)
+
+state = {"pdf": "/path/to/large_document.pdf"}
+try:
+    result = node.execute(state)
+except TimeoutError as e:
+    print(f"PDF parsing took too long: {e}")
+```
+
+If parsing exceeds the timeout, a `TimeoutError` is raised with a descriptive message:
+
+```
+TimeoutError: PDF parsing exceeded timeout of 60 seconds
+```
+
+### ChromiumLoader
+
+The timeout is automatically propagated to `ChromiumLoader` via `loader_kwargs`:
+
+```python
+node = FetchNode(
+    input="url",
+    output=["doc"],
+    node_config={
+        "timeout": 30,  # ChromiumLoader will use 30-second timeout
+        "headless": True
+    }
+)
+
+state = {"url": "https://example.com"}
+result = node.execute(state)
+```
+
+If you need different timeout behavior for ChromiumLoader specifically, you can override it in `loader_kwargs`:
+
+```python
+node = FetchNode(
+    input="url",
+    output=["doc"],
+    node_config={
+        "timeout": 30,  # General timeout for other operations
+        "loader_kwargs": {
+            "timeout": 60  # ChromiumLoader gets 60-second timeout
+        }
+    }
+)
+```
+
+## Graph Examples
+
+### SmartScraperGraph
+
+```python
+from scrapegraphai.graphs import SmartScraperGraph
+
+graph_config = {
+    "llm": {
+        "model": "gpt-3.5-turbo",
+        "api_key": "your-api-key"
+    },
+    "timeout": 20  # 20-second timeout for fetch operations
+}
+
+smart_scraper = SmartScraperGraph(
+    prompt="Extract all article titles",
+    source="https://news.example.com",
+    config=graph_config
+)
+
+result = smart_scraper.run()
+```
+
+### Custom Graph with FetchNode
+
+```python
+from scrapegraphai.nodes import FetchNode
+from langgraph.graph import StateGraph
+
+# Create a custom graph with timeout
+fetch_node = FetchNode(
+    input="url",
+    output=["doc"],
+    node_config={
+        "timeout": 15,
+        "headless": True
+    }
+)
+
+# Add to graph...
+```
+
+## Best Practices
+
+1. **Choose appropriate timeouts**: Consider the expected response time of your target websites
+   - Fast APIs: 5-10 seconds
+   - Regular websites: 15-30 seconds
+   - Large PDFs or slow sites: 60+ seconds
+
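+2. **Handle timeouts explicitly**: Wrap `execute()` calls in try-except. PDF parsing raises the built-in `TimeoutError`. The `requests` library signals HTTP timeouts with `requests.exceptions.Timeout`, which is not a subclass of `TimeoutError`, so catch it as well unless your version of `FetchNode` converts it:
+
+```python
+import logging
+import requests
+
+logger = logging.getLogger(__name__)
+
+try:
+    result = node.execute(state)
+except (TimeoutError, requests.exceptions.Timeout) as e:
+    logger.error(f"Operation timed out: {e}")
+    # Handle timeout gracefully
+```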
+
+3. **Use different timeouts for different operations**: Set higher timeouts for PDF parsing and lower ones for HTTP requests:
+
+```python
+# For PDFs
+pdf_node = FetchNode("pdf", ["doc"], {"timeout": 120})
+
+# For web pages
+web_node = FetchNode("url", ["doc"], {"timeout": 15})
+```
+
+4. **Monitor timeout occurrences**: Log timeout errors to identify problematic sources:
+
+```python
+import logging
+
+logger = logging.getLogger(__name__)
+
+try:
+    result = node.execute(state)
+except TimeoutError as e:
+    logger.warning(f"Timeout for {state.get('url', 'unknown')}: {e}")
+```
+
+## Implementation Details
+
+The timeout feature is implemented using:
+
+- **HTTP requests**: `requests.get(url, timeout=X)` parameter
+- **PDF parsing**: `concurrent.futures.ThreadPoolExecutor` with `future.result(timeout=X)`
+- **ChromiumLoader**: Propagated via `loader_kwargs` dictionary
+
+When `timeout=None`, no timeout constraints are applied, allowing operations to run until completion.
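+
+The thread-based approach listed above for PDF parsing can be sketched as follows. This is a minimal illustration, not `FetchNode`'s actual code: `run_with_timeout` and its `label` parameter are hypothetical names, and the sketch assumes the blocking work is a zero-argument callable.
+
+```python
+import time
+from concurrent.futures import ThreadPoolExecutor
+from concurrent.futures import TimeoutError as FuturesTimeoutError
+
+
+def run_with_timeout(func, timeout=None, label="Operation"):
+    """Run func() in a worker thread (hypothetical helper).
+
+    Raises the built-in TimeoutError if func() does not finish in time.
+    """
+    if timeout is None:
+        return func()  # No limit requested: run until completion
+    executor = ThreadPoolExecutor(max_workers=1)
+    future = executor.submit(func)
+    try:
+        return future.result(timeout=timeout)
+    except FuturesTimeoutError:
+        raise TimeoutError(
+            f"{label} exceeded timeout of {timeout} seconds"
+        ) from None
+    finally:
+        # Return without waiting for the worker thread to finish
+        executor.shutdown(wait=False)
+
+
+if __name__ == "__main__":
+    try:
+        run_with_timeout(lambda: time.sleep(0.5), timeout=0.05, label="PDF parsing")
+    except TimeoutError as e:
+        print(e)
+```
+
+Note that `future.result(timeout=...)` only stops waiting. Python cannot forcibly stop the worker thread, so the underlying work keeps running in the background until it finishes. The sketch calls `shutdown(wait=False)` for that reason, since a `with ThreadPoolExecutor(...)` block would wait for the worker on exit.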
+
+## Troubleshooting
+
+### Timeout is too short
+
+If you're seeing frequent timeout errors, increase the timeout value:
+
+```python
+node_config = {"timeout": 60}  # Increase from 30 to 60 seconds
+```
+
+### Need different timeouts for different operations
+
+Use separate FetchNode instances with different configurations:
+
+```python
+fast_fetcher = FetchNode("url", ["doc"], {"timeout": 10})
+slow_fetcher = FetchNode("pdf", ["doc"], {"timeout": 120})
+```
+
+### ChromiumLoader timeout not working
+
+A timeout set in `loader_kwargs` takes precedence over the node-level `timeout`. If ChromiumLoader seems to ignore the node timeout, check for an unintended override:
+
+```python
+# Override present: ChromiumLoader uses 10 seconds, not 30
+node_config = {
+    "timeout": 30,
+    "loader_kwargs": {"timeout": 10}  # This takes precedence
+}
+
+# No override: the node timeout propagates to ChromiumLoader
+node_config = {
+    "timeout": 30  # ChromiumLoader will use 30 seconds
+}
+```
+
+## See Also
+
+- [FetchNode API Documentation](../api/nodes/fetch_node.md)
+- [Graph Configuration](./graph_configuration.md)
+- [Error Handling](./error_handling.md)