# FetchNode Timeout Configuration

## Overview

The `FetchNode` in ScrapeGraphAI supports configurable timeouts for all blocking operations to prevent indefinite hangs when fetching web content or parsing files. This feature lets you control execution time limits for:

- HTTP requests (when using `use_soup=True`)
- PDF file parsing
- ChromiumLoader operations

## Configuration

### Default Behavior

By default, `FetchNode` uses a **30-second timeout** for all blocking operations when a `node_config` is provided:

```python
from scrapegraphai.nodes import FetchNode

# Default 30-second timeout
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={}
)
```

### Custom Timeout

You can specify a custom timeout value (in seconds) via the `timeout` parameter:

```python
# Custom 10-second timeout
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={"timeout": 10}
)
```

### Disabling Timeout

To disable the timeout and allow operations to run indefinitely, set `timeout` to `None`:

```python
# No timeout - operations will wait indefinitely
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={"timeout": None}
)
```

### No Configuration

If you don't provide any `node_config`, the timeout defaults to `None` (no timeout):

```python
# No timeout (backward compatible)
node = FetchNode(
    input="url",
    output=["doc"],
    node_config=None
)
```

## Use Cases

### HTTP Requests

When `use_soup=True`, the timeout applies to `requests.get()` calls:

```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "use_soup": True,
        "timeout": 15  # HTTP request times out after 15 seconds
    }
)

state = {"url": "https://example.com"}
result = node.execute(state)
```

If the timeout is `None`, no timeout parameter is passed to `requests.get()`:

```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "use_soup": True,
        "timeout": None  # No timeout for HTTP requests
    }
)
```

### PDF Parsing

The timeout applies to PDF file parsing operations using `PyPDFLoader`:

```python
node = FetchNode(
    input="pdf",
    output=["doc"],
    node_config={
        "timeout": 60  # PDF parsing times out after 60 seconds
    }
)

state = {"pdf": "/path/to/large_document.pdf"}
try:
    result = node.execute(state)
except TimeoutError as e:
    print(f"PDF parsing took too long: {e}")
```

If parsing exceeds the timeout, a `TimeoutError` is raised with a descriptive message:

```
TimeoutError: PDF parsing exceeded timeout of 60 seconds
```

### ChromiumLoader

The timeout is automatically propagated to `ChromiumLoader` via `loader_kwargs`:

```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 30,  # ChromiumLoader will use 30-second timeout
        "headless": True
    }
)

state = {"url": "https://example.com"}
result = node.execute(state)
```

If you need different timeout behavior for ChromiumLoader specifically, you can override it in `loader_kwargs`:

```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 30,  # General timeout for other operations
        "loader_kwargs": {
            "timeout": 60  # ChromiumLoader gets 60-second timeout
        }
    }
)
```

## Graph Examples

### SmartScraperGraph

```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "gpt-3.5-turbo",
        "api_key": "your-api-key"
    },
    "timeout": 20  # 20-second timeout for fetch operations
}

smart_scraper = SmartScraperGraph(
    prompt="Extract all article titles",
    source="https://news.example.com",
    config=graph_config
)

result = smart_scraper.run()
```

### Custom Graph with FetchNode

```python
from scrapegraphai.nodes import FetchNode
from langgraph.graph import StateGraph

# Create a custom graph with timeout
fetch_node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 15,
        "headless": True
    }
)

# Add to graph...
```

## Best Practices

1. **Choose appropriate timeouts**: Consider the expected response time of your target websites
   - Fast APIs: 5-10 seconds
   - Regular websites: 15-30 seconds
   - Large PDFs or slow sites: 60+ seconds

2. **Handle TimeoutError**: Always wrap your code in try-except when using timeouts:

   ```python
   import logging

   logger = logging.getLogger(__name__)

   try:
       result = node.execute(state)
   except TimeoutError as e:
       logger.error(f"Operation timed out: {e}")
       # Handle timeout gracefully
   ```

3. **Use different timeouts for different operations**: Set higher timeouts for PDF parsing and lower ones for HTTP requests:

   ```python
   # For PDFs
   pdf_node = FetchNode("pdf", ["doc"], {"timeout": 120})

   # For web pages
   web_node = FetchNode("url", ["doc"], {"timeout": 15})
   ```

4. **Monitor timeout occurrences**: Log timeout errors to identify problematic sources:

   ```python
   import logging

   logger = logging.getLogger(__name__)

   try:
       result = node.execute(state)
   except TimeoutError as e:
       logger.warning(f"Timeout for {state.get('url', 'unknown')}: {e}")
   ```

## Implementation Details

The timeout feature is implemented using:

- **HTTP requests**: the `timeout` parameter of `requests.get(url, timeout=X)`
- **PDF parsing**: `concurrent.futures.ThreadPoolExecutor` with `future.result(timeout=X)`
- **ChromiumLoader**: propagated via the `loader_kwargs` dictionary

When `timeout=None`, no timeout constraints are applied, allowing operations to run until completion.
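
The thread-pool and `loader_kwargs` mechanisms listed above can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not FetchNode's actual source: `run_with_timeout`, `resolve_loader_kwargs`, and `slow_parse` are hypothetical names introduced here, and `slow_parse` merely stands in for a blocking PDF parse.

```python
import concurrent.futures
import time


def run_with_timeout(func, timeout, *args, **kwargs):
    """Hypothetical helper: call func, raising TimeoutError after `timeout` seconds.

    A timeout of None means "wait until the call completes".
    """
    if timeout is None:
        return func(*args, **kwargs)

    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        # Blocks for at most `timeout` seconds.
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(
            f"Operation exceeded timeout of {timeout} seconds"
        ) from None
    finally:
        # Do not wait for the worker here; waiting would defeat the timeout.
        executor.shutdown(wait=False)


def resolve_loader_kwargs(node_config):
    """Hypothetical helper: copy the node timeout into loader_kwargs.

    An explicit loader_kwargs["timeout"] takes precedence over the node value.
    """
    loader_kwargs = dict(node_config.get("loader_kwargs", {}))
    timeout = node_config.get("timeout")
    if timeout is not None:
        loader_kwargs.setdefault("timeout", timeout)
    return loader_kwargs


def slow_parse(seconds):
    """Stand-in for a blocking operation such as parsing a large PDF."""
    time.sleep(seconds)
    return "parsed"


if __name__ == "__main__":
    print(resolve_loader_kwargs({"timeout": 30}))
    # → {'timeout': 30}
    print(resolve_loader_kwargs({"timeout": 30, "loader_kwargs": {"timeout": 60}}))
    # → {'timeout': 60}
```

One caveat of this pattern: `future.result(timeout=...)` only stops waiting. The worker thread is not cancelled and keeps running in the background until the underlying call returns, so a timed-out parse can still consume CPU and memory for a while.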

## Troubleshooting

### Timeout is too short

If you're seeing frequent timeout errors, increase the timeout value:

```python
node_config = {"timeout": 60}  # Increase from 30 to 60 seconds
```

### Need different timeouts for different operations

Use separate FetchNode instances with different configurations:

```python
fast_fetcher = FetchNode("url", ["doc"], {"timeout": 10})
slow_fetcher = FetchNode("pdf", ["doc"], {"timeout": 120})
```

### ChromiumLoader timeout not working

Ensure you're not overriding the timeout in `loader_kwargs`:

```python
# ❌ Wrong - explicit loader_kwargs timeout overrides node timeout
node_config = {
    "timeout": 30,
    "loader_kwargs": {"timeout": 10}  # This takes precedence
}

# ✅ Correct - let node timeout propagate
node_config = {
    "timeout": 30  # ChromiumLoader will use 30 seconds
}
```

## See Also

- [FetchNode API Documentation](../api/nodes/fetch_node.md)
- [Graph Configuration](./graph_configuration.md)
- [Error Handling](./error_handling.md)