FetchNode Timeout Configuration
Overview
The FetchNode in ScrapeGraphAI supports configurable timeouts for all blocking operations to prevent indefinite hangs when fetching web content or parsing files. This feature allows you to control execution time limits for:
- HTTP requests (when using use_soup=True)
- PDF file parsing
- ChromiumLoader operations
Configuration
Default Behavior
By default, FetchNode uses a 30-second timeout for all blocking operations when a node_config is provided:
from scrapegraphai.nodes import FetchNode
# Default 30-second timeout
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={}
)
Custom Timeout
You can specify a custom timeout value (in seconds) via the timeout key in node_config:
# Custom 10-second timeout
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={"timeout": 10}
)
Disabling Timeout
To disable timeout and allow operations to run indefinitely, set timeout to None:
# No timeout - operations will wait indefinitely
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={"timeout": None}
)
No Configuration
If you don't provide any node_config, the timeout defaults to None (no timeout):
# No timeout (backward compatible)
node = FetchNode(
    input="url",
    output=["doc"],
    node_config=None
)
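The four cases above (no node_config, an empty node_config, an explicit number, an explicit None) can be summarized in a small sketch. The helper name resolve_timeout and the constant DEFAULT_TIMEOUT are hypothetical and only mirror the behavior described in this section; they are not taken from the FetchNode source.

```python
DEFAULT_TIMEOUT = 30  # seconds; the default described above (name is hypothetical)


def resolve_timeout(node_config):
    """Hypothetical helper mirroring the documented defaults."""
    if node_config is None:
        # No configuration at all: no timeout (backward compatible)
        return None
    # A config without a "timeout" key falls back to the default;
    # an explicit value, including None, is kept as-is.
    return node_config.get("timeout", DEFAULT_TIMEOUT)


for config in (None, {}, {"timeout": 10}, {"timeout": None}):
    print(config, "->", resolve_timeout(config))
```

Note the difference between a missing key and an explicit None: only the missing key picks up the default.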
Use Cases
HTTP Requests
When use_soup=True, the timeout applies to requests.get() calls:
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "use_soup": True,
        "timeout": 15  # HTTP request will time out after 15 seconds
    }
)
state = {"url": "https://example.com"}
result = node.execute(state)
If the timeout is None, no timeout parameter is passed to requests.get():
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "use_soup": True,
        "timeout": None  # No timeout for HTTP requests
    }
)
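One way to pass the timeout only when it is set is to build the keyword arguments conditionally. This is a minimal sketch of that pattern; build_request_kwargs is a hypothetical name, not a function from ScrapeGraphAI.

```python
def build_request_kwargs(timeout):
    """Hypothetical helper: include `timeout` only when one is configured."""
    kwargs = {}
    if timeout is not None:
        kwargs["timeout"] = timeout
    return kwargs


# With the requests library the result would be used like this:
#   response = requests.get(url, **build_request_kwargs(15))
print(build_request_kwargs(15))
print(build_request_kwargs(None))
```

requests itself treats timeout=None as "wait indefinitely", so omitting the argument and passing None behave the same way for requests.get().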
PDF Parsing
The timeout applies to PDF file parsing operations using PyPDFLoader:
node = FetchNode(
    input="pdf",
    output=["doc"],
    node_config={
        "timeout": 60  # PDF parsing will time out after 60 seconds
    }
)
state = {"pdf": "/path/to/large_document.pdf"}
try:
    result = node.execute(state)
except TimeoutError as e:
    print(f"PDF parsing took too long: {e}")
If parsing exceeds the timeout, a TimeoutError is raised with a descriptive message:
TimeoutError: PDF parsing exceeded timeout of 60 seconds
ChromiumLoader
The timeout is automatically propagated to ChromiumLoader via loader_kwargs:
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 30,  # ChromiumLoader will use 30-second timeout
        "headless": True
    }
)
state = {"url": "https://example.com"}
result = node.execute(state)
If you need different timeout behavior for ChromiumLoader specifically, you can override it in loader_kwargs:
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 30,  # General timeout for other operations
        "loader_kwargs": {
            "timeout": 60  # ChromiumLoader gets 60-second timeout
        }
    }
)
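The precedence described here (an explicit loader_kwargs timeout wins, otherwise the node timeout is propagated) can be sketched with dict.setdefault. The helper resolve_loader_kwargs is hypothetical, and skipping propagation when the node timeout is None is an assumption rather than confirmed FetchNode behavior.

```python
def resolve_loader_kwargs(node_config):
    """Hypothetical sketch of how a node-level timeout could reach the loader."""
    # Copy so the caller's configuration is not mutated
    loader_kwargs = dict(node_config.get("loader_kwargs", {}))
    timeout = node_config.get("timeout", 30)
    if timeout is not None:
        # setdefault only fills the key when the user has not set it
        loader_kwargs.setdefault("timeout", timeout)
    return loader_kwargs


print(resolve_loader_kwargs({"timeout": 30}))
print(resolve_loader_kwargs({"timeout": 30, "loader_kwargs": {"timeout": 60}}))
```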
Graph Examples
SmartScraperGraph
from scrapegraphai.graphs import SmartScraperGraph
graph_config = {
    "llm": {
        "model": "gpt-3.5-turbo",
        "api_key": "your-api-key"
    },
    "timeout": 20  # 20-second timeout for fetch operations
}
smart_scraper = SmartScraperGraph(
    prompt="Extract all article titles",
    source="https://news.example.com",
    config=graph_config
)
result = smart_scraper.run()
Custom Graph with FetchNode
from scrapegraphai.nodes import FetchNode
from langgraph.graph import StateGraph
# Create a custom graph with timeout
fetch_node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 15,
        "headless": True
    }
)
# Add to graph...
Best Practices
- Choose appropriate timeouts: Consider the expected response time of your target websites
  - Fast APIs: 5-10 seconds
  - Regular websites: 15-30 seconds
  - Large PDFs or slow sites: 60+ seconds
- Handle TimeoutError: Always wrap your code in try-except when using timeouts:
try:
    result = node.execute(state)
except TimeoutError as e:
    logger.error(f"Operation timed out: {e}")
    # Handle timeout gracefully
- Use different timeouts for different operations: Set higher timeouts for PDF parsing and lower for HTTP requests:
# For PDFs
pdf_node = FetchNode("pdf", ["doc"], {"timeout": 120})
# For web pages
web_node = FetchNode("url", ["doc"], {"timeout": 15})
- Monitor timeout occurrences: Log timeout errors to identify problematic sources:
import logging
logger = logging.getLogger(__name__)
try:
    result = node.execute(state)
except TimeoutError as e:
    logger.warning(f"Timeout for {state.get('url', 'unknown')}: {e}")
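The handling and monitoring practices above can be combined into one wrapper. The following is a sketch only: execute_or_none and the stand-in node class are hypothetical and exist so the example runs without ScrapeGraphAI installed.

```python
import logging

logger = logging.getLogger(__name__)


def execute_or_none(node, state, source_key="url"):
    """Hypothetical wrapper: run a node, log a timeout, and return None on failure."""
    try:
        return node.execute(state)
    except TimeoutError as exc:
        logger.warning("Timeout for %s: %s", state.get(source_key, "unknown"), exc)
        return None


class AlwaysTimesOut:
    """Stand-in for a FetchNode whose fetch exceeds the timeout."""

    def execute(self, state):
        raise TimeoutError("simulated timeout")


result = execute_or_none(AlwaysTimesOut(), {"url": "https://example.com"})
print(result is None)
```

If the HTTP path lets the requests library's own exception propagate, keep in mind that requests.exceptions.Timeout is not a subclass of the builtin TimeoutError and would need its own except clause.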
Implementation Details
The timeout feature is implemented using:
- HTTP requests: requests.get(url, timeout=X) parameter
- PDF parsing: concurrent.futures.ThreadPoolExecutor with future.result(timeout=X)
- ChromiumLoader: Propagated via loader_kwargs dictionary
When timeout=None, no timeout constraints are applied, allowing operations to run until completion.
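The ThreadPoolExecutor pattern named above can be illustrated with a generic helper. This is a sketch of the technique, not the FetchNode source; run_with_timeout and the error message wording are assumptions.

```python
import concurrent.futures
import time


def run_with_timeout(func, timeout, *args):
    """Run func(*args) in a worker thread and wait at most `timeout` seconds.

    timeout=None waits until the call completes.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(func, *args)
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # Re-raise as the builtin TimeoutError (they are distinct before Python 3.11)
        raise TimeoutError(f"Operation exceeded timeout of {timeout} seconds") from None
    finally:
        # wait=False so a timed-out caller is not blocked until the worker finishes
        executor.shutdown(wait=False)


def slow_parse():
    time.sleep(0.3)  # stands in for a long-running PDF parse
    return "done"


try:
    run_with_timeout(slow_parse, 0.05)
except TimeoutError as exc:
    print(exc)
```

One limitation of this pattern is that the worker thread is not cancelled when the timeout fires: the caller regains control, but the underlying work keeps running in the background until it finishes.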
Troubleshooting
Timeout is too short
If you're seeing frequent timeout errors, increase the timeout value:
node_config = {"timeout": 60} # Increase from 30 to 60 seconds
Need different timeouts for different operations
Use separate FetchNode instances with different configurations:
fast_fetcher = FetchNode("url", ["doc"], {"timeout": 10})
slow_fetcher = FetchNode("pdf", ["doc"], {"timeout": 120})
ChromiumLoader timeout not working
Ensure you're not overriding the timeout in loader_kwargs:
# ❌ Wrong - explicit loader_kwargs timeout overrides node timeout
node_config = {
    "timeout": 30,
    "loader_kwargs": {"timeout": 10}  # This takes precedence
}
# ✅ Correct - let node timeout propagate
node_config = {
    "timeout": 30  # ChromiumLoader will use 30 seconds
}