# HTTPX Scraper
HTTPX provides a modern, fast async HTTP client for scraping platforms with APIs or simple HTML pages.
## Quick Start

```python
from sentimatrix import Sentimatrix
from sentimatrix.config import SentimatrixConfig, ScraperConfig

config = SentimatrixConfig(
    scrapers=ScraperConfig(
        provider="httpx",
        timeout=30,
    )
)

async with Sentimatrix(config) as sm:
    reviews = await sm.scrape_reviews(url, platform="steam")
```
## Configuration

```python
from sentimatrix.config import RateLimitConfig, RetryConfig, ScraperConfig

ScraperConfig(
    provider="httpx",
    timeout=30,                 # Request timeout in seconds
    user_agent="custom-agent",  # Custom User-Agent header
    rate_limit=RateLimitConfig(
        requests_per_second=2.0,
        concurrent_requests=10,
    ),
    retry=RetryConfig(
        max_retries=3,
        initial_delay=1.0,      # Seconds before the first retry
    ),
)
```
## Features

- **Async Native**: Built for asyncio
- **HTTP/2 Support**: Modern protocol support
- **Connection Pooling**: Efficient connection reuse
- **Automatic Retries**: Built-in retry logic
- **Lightweight**: No browser overhead
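The retry behavior above can be pictured as exponential backoff driven by the `max_retries` and `initial_delay` knobs from `RetryConfig`. A minimal stand-alone sketch of that pattern (plain asyncio with doubling delays and jitter; Sentimatrix's actual retry schedule may differ):

```python
import asyncio
import random

async def fetch_with_retry(fetch, max_retries=3, initial_delay=1.0):
    """Retry an async callable with exponentially growing delays.

    Sketch only: assumes doubling delays plus a little jitter,
    which may not match the library's built-in schedule.
    """
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return await fetch()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries, surface the error
            await asyncio.sleep(delay + random.uniform(0, 0.1 * delay))
            delay *= 2  # 1.0s, 2.0s, 4.0s, ...
```

With `max_retries=3` and `initial_delay=1.0`, a failing request is retried after roughly 1, 2, and 4 seconds before the error is raised.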
## When to Use HTTPX
| Scenario | HTTPX | Playwright |
|---|---|---|
| API endpoints | Best | Overkill |
| Static HTML | Good | Good |
| JavaScript pages | No | Required |
| High volume | Best | Slower |
| Resource usage | Low | High |
## Best For

- **Steam**: Uses API endpoints
- **Reddit**: API-based with OAuth
- **IMDB**: Static HTML pages
- **APIs**: Any REST API
## Example: High-Volume Scraping

```python
import asyncio

from sentimatrix import Sentimatrix
from sentimatrix.config import SentimatrixConfig, ScraperConfig, RateLimitConfig

config = SentimatrixConfig(
    scrapers=ScraperConfig(
        provider="httpx",
        rate_limit=RateLimitConfig(
            requests_per_second=5.0,
            concurrent_requests=20,
        ),
    )
)

async with Sentimatrix(config) as sm:
    # Fast parallel scraping: one task per game URL
    tasks = [
        sm.scrape_reviews(url, platform="steam")
        for url in game_urls
    ]
    results = await asyncio.gather(*tasks)
```
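The `concurrent_requests` setting caps how many requests are in flight at once even when you gather many tasks. A minimal stand-alone sketch of that pattern using `asyncio.Semaphore` (illustrative only, not the library's internals):

```python
import asyncio

async def bounded_gather(coros, limit=20):
    """Run coroutines with at most `limit` in flight at a time.

    Sketch of the concurrent_requests idea: a semaphore gates entry,
    so extra coroutines wait until a slot frees up.
    """
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

Capping concurrency keeps memory and socket usage bounded and, together with `requests_per_second`, helps avoid tripping a platform's anti-abuse limits.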
## Headers Configuration

```python
# Custom headers for specific platforms
config = SentimatrixConfig(
    scrapers=ScraperConfig(
        provider="httpx",
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
    )
)
```
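Conceptually, a configured `user_agent` is merged with default request headers before each request. A hypothetical sketch of that merge (the function and header names here are assumptions for illustration, not Sentimatrix's API):

```python
# Defaults a scraper might send with every request (assumed values).
DEFAULT_HEADERS = {
    "Accept": "application/json",
    "Accept-Encoding": "gzip",
}

def build_headers(user_agent, extra=None):
    """Merge defaults, the configured User-Agent, and per-platform overrides."""
    headers = {**DEFAULT_HEADERS, "User-Agent": user_agent}
    if extra:
        headers.update(extra)  # platform-specific overrides win
    return headers

headers = build_headers(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    extra={"Accept-Language": "en-US"},
)
```

Later entries win on key collisions, so a per-platform override can replace any default.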