# HTTPX Scraper
HTTPX provides a modern, fast async HTTP client for scraping platforms with APIs or simple HTML pages.
## Quick Start

```python
from sentimatrix import Sentimatrix
from sentimatrix.config import SentimatrixConfig, ScraperConfig

config = SentimatrixConfig(
    scrapers=ScraperConfig(
        provider="httpx",
        timeout=30,
    )
)

async with Sentimatrix(config) as sm:
    reviews = await sm.scrape_reviews(url, platform="steam")
```
## Configuration

```python
from sentimatrix.config import RateLimitConfig, RetryConfig, ScraperConfig

ScraperConfig(
    provider="httpx",
    timeout=30,                 # Request timeout in seconds
    user_agent="custom-agent",  # Custom User-Agent header
    rate_limit=RateLimitConfig(
        requests_per_second=2.0,
        concurrent_requests=10,
    ),
    retry=RetryConfig(
        max_retries=3,
        initial_delay=1.0,      # Seconds before the first retry
    ),
)
```
## Features

- **Async Native**: Built for asyncio
- **HTTP/2 Support**: Modern protocol support
- **Connection Pooling**: Efficient connection reuse
- **Automatic Retries**: Built-in retry logic
- **Lightweight**: No browser overhead
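The retry behavior above can be pictured as exponential backoff driven by the `max_retries` and `initial_delay` knobs from `RetryConfig`. A minimal stand-alone sketch of that pattern (plain asyncio with doubling delays and jitter; Sentimatrix's actual retry schedule may differ):

```python
import asyncio
import random

async def fetch_with_retry(fetch, max_retries=3, initial_delay=1.0):
    """Retry an async callable with exponentially growing delays.

    Sketch only: assumes doubling delays plus a little jitter,
    which may not match the library's built-in schedule.
    """
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return await fetch()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries, surface the error
            await asyncio.sleep(delay + random.uniform(0, 0.1 * delay))
            delay *= 2  # 1.0s, 2.0s, 4.0s, ...
```

With `max_retries=3` and `initial_delay=1.0`, a failing request is retried after roughly 1, 2, and 4 seconds before the error is raised.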
## When to Use HTTPX
| Scenario | HTTPX | Playwright |
|---|---|---|
| API endpoints | Best | Overkill |
| Static HTML | Good | Good |
| JavaScript pages | No | Required |
| High volume | Best | Slower |
| Resource usage | Low | High |
## Best For

- **Steam**: Uses API endpoints
- **Reddit**: API-based with OAuth
- **IMDB**: Static HTML pages
- **APIs**: Any REST API
## Example: High-Volume Scraping

```python
import asyncio

from sentimatrix import Sentimatrix
from sentimatrix.config import SentimatrixConfig, ScraperConfig, RateLimitConfig

config = SentimatrixConfig(
    scrapers=ScraperConfig(
        provider="httpx",
        rate_limit=RateLimitConfig(
            requests_per_second=5.0,
            concurrent_requests=20,
        ),
    )
)

async with Sentimatrix(config) as sm:
    # Fast parallel scraping: one task per game URL
    tasks = [
        sm.scrape_reviews(url, platform="steam")
        for url in game_urls
    ]
    results = await asyncio.gather(*tasks)
```
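The `concurrent_requests` setting caps how many requests are in flight at once even when you gather many tasks. A minimal stand-alone sketch of that pattern using `asyncio.Semaphore` (illustrative only, not the library's internals):

```python
import asyncio

async def bounded_gather(coros, limit=20):
    """Run coroutines with at most `limit` in flight at a time.

    Sketch of the concurrent_requests idea: a semaphore gates entry,
    so extra coroutines wait until a slot frees up.
    """
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

Capping concurrency keeps memory and socket usage bounded and, together with `requests_per_second`, helps avoid tripping a platform's anti-abuse limits.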
## Headers Configuration

```python
# Custom headers for specific platforms
config = SentimatrixConfig(
    scrapers=ScraperConfig(
        provider="httpx",
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
    )
)
```
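Conceptually, a configured `user_agent` is merged with default request headers before each request. A hypothetical sketch of that merge (the function and header names here are assumptions for illustration, not Sentimatrix's API):

```python
# Defaults a scraper might send with every request (assumed values).
DEFAULT_HEADERS = {
    "Accept": "application/json",
    "Accept-Encoding": "gzip",
}

def build_headers(user_agent, extra=None):
    """Merge defaults, the configured User-Agent, and per-platform overrides."""
    headers = {**DEFAULT_HEADERS, "User-Agent": user_agent}
    if extra:
        headers.update(extra)  # platform-specific overrides win
    return headers

headers = build_headers(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    extra={"Accept-Language": "en-US"},
)
```

Later entries win on key collisions, so a per-platform override can replace any default.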