mirror of https://github.com/VinciGit00/Scrapegraph-ai.git synced 2026-06-04 21:01:04 +08:00

copilot-swe-agent[bot] 323f26a7a5 Add comprehensive timeout feature documentation

Co-authored-by: VinciGit00 <88108002+VinciGit00@users.noreply.github.com>

2025-11-26 17:36:16 +00:00


FetchNode Timeout Configuration

Overview

The FetchNode in ScrapeGraphAI supports configurable timeouts for all blocking operations to prevent indefinite hangs when fetching web content or parsing files. This feature allows you to control execution time limits for:

- HTTP requests (when using use_soup=True)
- PDF file parsing
- ChromiumLoader operations

Configuration

Default Behavior

By default, FetchNode uses a 30-second timeout for all blocking operations when a node_config is provided:

from scrapegraphai.nodes import FetchNode

# Default 30-second timeout
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={}
)

Custom Timeout

You can specify a custom timeout value (in seconds) via the timeout parameter:

# Custom 10-second timeout
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={"timeout": 10}
)

Disabling Timeout

To disable timeout and allow operations to run indefinitely, set timeout to None:

# No timeout - operations will wait indefinitely
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={"timeout": None}
)

No Configuration

If you don't provide any node_config, the timeout defaults to None (no timeout):

# No timeout (backward compatible)
node = FetchNode(
    input="url",
    output=["doc"],
    node_config=None
)
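
The four cases above reduce to one small lookup rule. The sketch below restates that rule in plain Python. `resolve_timeout` and `DEFAULT_TIMEOUT` are hypothetical names chosen for illustration and are not part of the ScrapeGraphAI API, so the real FetchNode may organize this logic differently.

```python
from typing import Optional

DEFAULT_TIMEOUT = 30  # seconds; the default described above


def resolve_timeout(node_config: Optional[dict]) -> Optional[float]:
    """Return the effective timeout for a node_config (illustrative only)."""
    if node_config is None:
        # No configuration at all: no timeout (backward compatible).
        return None
    # A "timeout" key that is present wins, even when its value is None.
    return node_config.get("timeout", DEFAULT_TIMEOUT)


print(resolve_timeout({}))                 # default applies
print(resolve_timeout({"timeout": 10}))    # custom value
print(resolve_timeout({"timeout": None}))  # explicitly disabled
print(resolve_timeout(None))               # no configuration
```

The point to notice is the difference between a missing key (default of 30 seconds) and a key explicitly set to None (timeout disabled).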

Use Cases

HTTP Requests

When use_soup=True, the timeout applies to requests.get() calls:

node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "use_soup": True,
        "timeout": 15  # HTTP request will time out after 15 seconds
    }
)

state = {"url": "https://example.com"}
result = node.execute(state)

If the timeout is None, no timeout parameter is passed to requests.get():

node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "use_soup": True,
        "timeout": None  # No timeout for HTTP requests
    }
)
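
One way to get the "omit the parameter when the timeout is None" behavior is to build the keyword arguments conditionally. This is a sketch under that assumption, not FetchNode's actual code, and `build_request_kwargs` is a hypothetical helper. Separately, the requests library reports timeouts as requests.exceptions.Timeout, which is not a subclass of the built-in TimeoutError. This document does not say whether FetchNode converts it, so it may be prudent to catch both on the use_soup path.

```python
from typing import Any, Dict, Optional


def build_request_kwargs(timeout: Optional[float]) -> Dict[str, Any]:
    """Keyword arguments for an HTTP call (hypothetical helper)."""
    kwargs: Dict[str, Any] = {}
    if timeout is not None:
        # Only pass a timeout when one is configured.
        kwargs["timeout"] = timeout
    return kwargs


# Intended use (assumes the `requests` package is installed):
#   response = requests.get(url, **build_request_kwargs(15))
print(build_request_kwargs(15))
print(build_request_kwargs(None))
```

Since requests itself treats timeout=None as "wait indefinitely", omitting the argument and passing None should behave the same way.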

PDF Parsing

The timeout applies to PDF file parsing operations using PyPDFLoader:

node = FetchNode(
    input="pdf",
    output=["doc"],
    node_config={
        "timeout": 60  # PDF parsing will time out after 60 seconds
    }
)

state = {"pdf": "/path/to/large_document.pdf"}
try:
    result = node.execute(state)
except TimeoutError as e:
    print(f"PDF parsing took too long: {e}")

If parsing exceeds the timeout, a TimeoutError is raised with a descriptive message:

TimeoutError: PDF parsing exceeded timeout of 60 seconds

ChromiumLoader

The timeout is automatically propagated to ChromiumLoader via loader_kwargs:

node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 30,  # ChromiumLoader will use 30-second timeout
        "headless": True
    }
)

state = {"url": "https://example.com"}
result = node.execute(state)

If you need different timeout behavior for ChromiumLoader specifically, you can override it in loader_kwargs:

node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 30,  # General timeout for other operations
        "loader_kwargs": {
            "timeout": 60  # ChromiumLoader gets 60-second timeout
        }
    }
)
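
The precedence between the node-level timeout and loader_kwargs can be pictured as a small merge step. The sketch below is an assumption about how such a merge could look, based only on the behavior described above. `build_loader_kwargs` is a hypothetical name and not a ScrapeGraphAI function.

```python
from typing import Any, Dict


def build_loader_kwargs(node_config: Dict[str, Any]) -> Dict[str, Any]:
    """Merge the node-level timeout into loader_kwargs (illustrative only)."""
    # Copy so the caller's dictionary is not mutated.
    loader_kwargs = dict(node_config.get("loader_kwargs", {}))
    timeout = node_config.get("timeout")
    # Propagate only when a timeout is set and there is no explicit override.
    if timeout is not None and "timeout" not in loader_kwargs:
        loader_kwargs["timeout"] = timeout
    return loader_kwargs


print(build_loader_kwargs({"timeout": 30, "headless": True}))
print(build_loader_kwargs({"timeout": 30, "loader_kwargs": {"timeout": 60}}))
```

In this sketch an explicit loader_kwargs value always wins, which matches the override example above.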

Graph Examples

SmartScraperGraph

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "gpt-3.5-turbo",
        "api_key": "your-api-key"
    },
    "timeout": 20  # 20-second timeout for fetch operations
}

smart_scraper = SmartScraperGraph(
    prompt="Extract all article titles",
    source="https://news.example.com",
    config=graph_config
)

result = smart_scraper.run()

Custom Graph with FetchNode

from scrapegraphai.nodes import FetchNode
from scrapegraphai.graphs import BaseGraph  # graph container the node is added to

# Create a custom graph with timeout
fetch_node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 15,
        "headless": True
    }
)

# Add to graph...

Best Practices

Choose appropriate timeouts: Base the value on the expected response time of your target sources:
- Fast APIs: 5-10 seconds
- Regular websites: 15-30 seconds
- Large PDFs or slow sites: 60+ seconds

Handle TimeoutError: Wrap calls to node.execute() in try-except whenever a timeout is set:

import logging

logger = logging.getLogger(__name__)

try:
    result = node.execute(state)
except TimeoutError as e:
    logger.error(f"Operation timed out: {e}")
    # Handle timeout gracefully

Use different timeouts for different operations: Set higher timeouts for PDF parsing and lower for HTTP requests:

# For PDFs
pdf_node = FetchNode("pdf", ["doc"], {"timeout": 120})

# For web pages
web_node = FetchNode("url", ["doc"], {"timeout": 15})
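
When sources are mixed, the timeout can be chosen from the source itself before a node is built. The helper below is a minimal sketch of that idea. `pick_timeout` and the two constants are hypothetical, and the values are simply the ones used in the example above.

```python
from urllib.parse import urlparse

PDF_TIMEOUT = 120  # seconds, for PDF parsing
WEB_TIMEOUT = 15   # seconds, for regular web pages


def pick_timeout(source: str) -> int:
    """Choose a timeout from the source's file extension (illustrative only)."""
    # urlparse handles both URLs and plain file paths.
    path = urlparse(source).path.lower()
    return PDF_TIMEOUT if path.endswith(".pdf") else WEB_TIMEOUT


print(pick_timeout("/path/to/large_document.pdf"))
print(pick_timeout("https://example.com"))
```

The result can then be passed as the "timeout" value in node_config.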

Monitor timeout occurrences: Log timeout errors to identify problematic sources:

import logging

logger = logging.getLogger(__name__)

try:
    result = node.execute(state)
except TimeoutError as e:
    logger.warning(f"Timeout for {state.get('url', 'unknown')}: {e}")
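
To see which sources time out most often, the warning can be paired with a simple counter. The sketch below assumes only that the wrapped callable raises the built-in TimeoutError. `execute_and_track` and `timeout_counts` are hypothetical names, and the callable stands in for node.execute.

```python
import logging
from collections import Counter
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger(__name__)
timeout_counts: Counter = Counter()  # number of timeouts seen per source


def execute_and_track(
    execute: Callable[[Dict[str, Any]], Any], state: Dict[str, Any]
) -> Optional[Any]:
    """Run execute(state); on timeout, log it, count it, and return None."""
    source = state.get("url", "unknown")
    try:
        return execute(state)
    except TimeoutError as exc:
        timeout_counts[source] += 1
        logger.warning("Timeout for %s: %s", source, exc)
        return None
```

A call would look like `execute_and_track(node.execute, state)`, and `timeout_counts.most_common()` then lists the most problematic sources first.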

Implementation Details

The timeout feature is implemented using:

- HTTP requests: the timeout argument of requests.get(url, timeout=X)
- PDF parsing: concurrent.futures.ThreadPoolExecutor with future.result(timeout=X)
- ChromiumLoader: the timeout propagated via the loader_kwargs dictionary

When timeout=None, no timeout constraints are applied, allowing operations to run until completion.
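
The thread-pool pattern named above for PDF parsing can be sketched with the standard library alone. This is an illustration of the general technique, not a copy of FetchNode's code. `run_with_timeout` and its `label` parameter are hypothetical. One caveat applies to the pattern in general: future.result(timeout=...) stops waiting, but the worker thread keeps running until the function returns.

```python
import concurrent.futures
import time
from typing import Any, Callable, Optional


def run_with_timeout(
    func: Callable[..., Any],
    timeout: Optional[float],
    *args: Any,
    label: str = "PDF parsing",
) -> Any:
    """Run func(*args), raising TimeoutError if it exceeds timeout seconds."""
    if timeout is None:
        # No timeout configured: call directly and wait for completion.
        return func(*args)
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(func, *args)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            raise TimeoutError(
                f"{label} exceeded timeout of {timeout} seconds"
            ) from None
    finally:
        # wait=False so the caller is not blocked by a still-running worker.
        executor.shutdown(wait=False)


try:
    run_with_timeout(time.sleep, 0.05, 0.3)  # sleeps 0.3 s, limit is 0.05 s
except TimeoutError as exc:
    print(exc)
```

Using shutdown(wait=False) instead of a `with` block matters here, because leaving a `with ThreadPoolExecutor(...)` block waits for the running task and would defeat the timeout.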

Troubleshooting

Timeout is too short

If you're seeing frequent timeout errors, increase the timeout value:

node_config = {"timeout": 60}  # Increase from 30 to 60 seconds

Need different timeouts for different operations

Use separate FetchNode instances with different configurations:

fast_fetcher = FetchNode("url", ["doc"], {"timeout": 10})
slow_fetcher = FetchNode("pdf", ["doc"], {"timeout": 120})

ChromiumLoader timeout not working

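An explicit timeout in loader_kwargs takes precedence over the node-level timeout. If ChromiumLoader appears to ignore the node timeout, check that loader_kwargs is not overriding it unintentionally:

# ❌ Unintended override - the loader_kwargs timeout replaces the node timeout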
node_config = {
    "timeout": 30,
    "loader_kwargs": {"timeout": 10}  # This takes precedence
}

# ✅ Correct - let node timeout propagate
node_config = {
    "timeout": 30  # ChromiumLoader will use 30 seconds
}
