# Web Scrapers

Sentimatrix provides a comprehensive scraping infrastructure to collect reviews and feedback from popular platforms.
## Scraper Categories

- :material-web: **Core Scrapers**: foundational scrapers for any website.
  HTTPX (static), Playwright (dynamic)
- :material-store: **Platform Scrapers**: pre-built scrapers for popular platforms.
  Amazon, Steam, YouTube, Reddit, IMDB, Yelp, Trustpilot, Google Reviews
- :material-api: **Commercial APIs**: enterprise-grade scraping services.
  ScraperAPI, Apify, Bright Data, Oxylabs, Zyte, ScrapingBee, ScrapingAnt
## Platform Support Matrix

| Platform | Browser Required | Auth Required | Rate Limit | Status |
|---|---|---|---|---|
| Amazon | Yes | — | 10/min | Stable |
| Steam | No | — | 20/min | Stable |
| YouTube | — | API Key | 100/min | Stable |
| Reddit | No | OAuth | 60/min | Stable |
| IMDB | Yes | — | 15/min | Stable |
| Yelp | — | API Key | 50/min | Stable |
| Trustpilot | — | — | 10/min | Stable |
| Google Reviews | — | — | 5/min | Beta |
## Quick Start

### Basic Scraping (Steam - No Browser)

```python
import asyncio

from sentimatrix import Sentimatrix


async def main():
    async with Sentimatrix() as sm:
        reviews = await sm.scrape_reviews(
            url="https://store.steampowered.com/app/1245620/ELDEN_RING/",
            platform="steam",
            max_reviews=50,
        )
        for review in reviews[:5]:
            print(f"Rating: {review.rating}")
            print(f"Text: {review.text[:100]}...")
            print()


asyncio.run(main())
```
### Browser-Based Scraping (Amazon)

```python
async with Sentimatrix() as sm:
    reviews = await sm.scrape_reviews(
        url="https://www.amazon.com/dp/B0BSHF7WHW",
        platform="amazon",
        max_reviews=30,
        use_browser=True,  # Enables Playwright
    )
```
!!! note "Playwright Required"

    For browser-based scraping, install Playwright and a browser binary:

    ```bash
    pip install playwright
    playwright install chromium
    ```
### Using Commercial APIs

```python
from sentimatrix.config import SentimatrixConfig, ScraperConfig

config = SentimatrixConfig(
    scraper=ScraperConfig(
        api_provider="scraperapi",
        api_key="your-scraperapi-key",
    )
)

async with Sentimatrix(config) as sm:
    reviews = await sm.scrape_reviews(
        url="https://www.amazon.com/dp/B0BSHF7WHW",
        platform="amazon",
        max_reviews=100,
    )
```
## Commercial API Comparison

| Service | Proxy Pool | JS Rendering | Starting Price | Best For |
|---|---|---|---|---|
| ScraperAPI | 40M+ | Yes | $49/mo | General use |
| Apify | Varies | Yes | Pay-per-use | Pre-built actors |
| Bright Data | 72M+ | Yes | $500/mo | Enterprise |
| Oxylabs | 100M+ | Yes | Custom | E-commerce |
| Zyte | 50M+ | Yes | $450/mo | AI extraction |
| ScrapingBee | 1M+ | Yes | $49/mo | Screenshots |
| ScrapingAnt | 1M+ | Yes | $19/mo | Budget |
## Configuration

### YAML Configuration

```yaml title="sentimatrix.yaml"
scraper:
  # Rate limiting
  rate_limit:
    requests_per_second: 2
    burst_size: 5

  # Retry settings
  retry:
    max_retries: 3
    backoff_factor: 2.0

  # Browser settings
  browser:
    headless: true
    timeout: 30000

  # Commercial API (optional)
  api:
    provider: scraperapi
    # api_key loaded from SCRAPERAPI_KEY env var
```
### Python Configuration

```python
from sentimatrix.config import (
    SentimatrixConfig,
    ScraperConfig,
    RateLimitConfig,
    RetryConfig,
    BrowserConfig,
)

config = SentimatrixConfig(
    scraper=ScraperConfig(
        rate_limit=RateLimitConfig(
            requests_per_second=2,
            burst_size=5,
        ),
        retry=RetryConfig(
            max_retries=3,
            backoff_factor=2.0,
        ),
        browser=BrowserConfig(
            headless=True,
            timeout=30000,
        ),
    )
)
```
## Error Handling

```python
from sentimatrix.exceptions import (
    ScraperError,
    RateLimitError,
    BlockedError,
    ParseError,
)

async with Sentimatrix() as sm:
    try:
        reviews = await sm.scrape_reviews(url, platform="amazon")
    except RateLimitError as e:
        print(f"Rate limited, retry after {e.retry_after}s")
    except BlockedError:
        print("IP blocked, try using a commercial API")
    except ParseError as e:
        print(f"Failed to parse response: {e}")
    except ScraperError as e:
        print(f"Scraping failed: {e}")
```
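A common pattern on top of these exceptions is to honor `retry_after` automatically instead of failing. The sketch below uses a stand-in `RateLimitError` so it runs without Sentimatrix installed; in practice you would import the real exception and pass a `scrape_reviews` call as the factory:

```python
import asyncio


class RateLimitError(Exception):
    """Stand-in for sentimatrix.exceptions.RateLimitError."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


async def with_rate_limit_retry(coro_factory, max_attempts: int = 3):
    """Call coro_factory(), sleeping for retry_after between rate-limited attempts."""
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except RateLimitError as e:
            if attempt == max_attempts - 1:
                raise  # Out of attempts; let the caller handle it.
            await asyncio.sleep(e.retry_after)
```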
## Scraping Infrastructure

Sentimatrix includes a robust scraping infrastructure with:

### Rate Limiting

A built-in token-bucket rate limiter with configurable strategies:
```python
from sentimatrix.config import RateLimitConfig

rate_limit = RateLimitConfig(
    requests_per_second=2,   # Base rate
    concurrent_requests=5,   # Max concurrent
    backoff_factor=2.0,      # Exponential backoff
)
```
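A token bucket admits short bursts up to a fixed capacity while holding the long-run average to the configured rate. A minimal synchronous sketch of the algorithm (illustrative only; the limiter shipped with Sentimatrix is internal and async):

```python
import time


class TokenBucket:
    """Allow bursts up to `burst_size`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, burst_size: int):
        self.rate = rate
        self.capacity = burst_size
        self.tokens = float(burst_size)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `rate=2` and `burst_size=5`, five requests pass immediately, then requests are admitted at two per second.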
### Retry Handler

Automatic retry with exponential backoff:

```python
from sentimatrix.config import RetryConfig

retry = RetryConfig(
    max_retries=3,
    initial_delay=1.0,
    exponential_base=2.0,
    jitter=True,  # Add randomness to prevent thundering herd
)
```
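The delay schedule these settings produce is `initial_delay * exponential_base ** attempt`, optionally randomized. A small illustrative helper (not part of the library) that computes it:

```python
import random


def backoff_delay(attempt: int, initial_delay: float = 1.0,
                  exponential_base: float = 2.0, jitter: bool = True) -> float:
    """Delay before retry `attempt` (0-based), with optional full jitter."""
    delay = initial_delay * exponential_base ** attempt
    if jitter:
        # Full jitter: pick uniformly in [0, delay] so many clients
        # retrying at once do not hit the server in lockstep.
        delay = random.uniform(0, delay)
    return delay
```

Without jitter the schedule is deterministic: 1s, 2s, 4s for the three retries above.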
### Proxy Support

Rotate proxies to avoid blocks:

```python
from sentimatrix.config import ProxyConfig

proxy = ProxyConfig(
    enabled=True,
    rotation=True,
    country="us",
)
```
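Rotation simply hands each request the next endpoint from a pool. A round-robin sketch with hypothetical proxy URLs (commercial providers handle this server-side):

```python
import itertools

PROXIES = [  # illustrative endpoints, not real defaults
    "http://us-proxy-1.example.com:8080",
    "http://us-proxy-2.example.com:8080",
]

# Each call returns the next proxy, wrapping around at the end of the pool.
get_proxy = itertools.cycle(PROXIES).__next__
```

Each outgoing request would then use `get_proxy()` as its HTTP/HTTPS proxy, spreading traffic across the pool.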
## Best Practices

1. **Respect Rate Limits**
    - Don't exceed platform limits
    - Implement exponential backoff
    - Use delays between requests

2. **Use Browser Only When Needed**
    - Steam, Reddit: no browser required
    - Amazon, IMDB: browser recommended

3. **Handle Blocks Gracefully**
    - Rotate user agents
    - Use commercial APIs for scale
    - Implement retry logic

4. **Cache Responses**
    - Avoid repeated requests
    - Store scraped data locally

5. **Stay Compliant**
    - Check robots.txt
    - Follow terms of service
    - Don't overload servers
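The robots.txt check from the compliance list can be automated with the standard library's `urllib.robotparser`. A small sketch that parses an inline, made-up robots.txt so the example stays offline (in practice you would fetch the file from the target site):

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules before scraping it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ROBOTS = """\
User-agent: *
Disallow: /gp/
Allow: /
"""
```

Here `is_allowed(ROBOTS, "my-bot", "https://example.com/dp/B0BSHF7WHW")` passes, while any path under `/gp/` is refused.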
## Scraper Documentation

### Core Scrapers
- HTTPX Scraper - Static pages
- Playwright Scraper - Dynamic pages
### Platform Scrapers
- Amazon - Product reviews
- Steam - Game reviews
- YouTube - Video comments
- Reddit - Posts and comments
- IMDB - Movie reviews
- Yelp - Business reviews
- Trustpilot - Company reviews
- Google Reviews - Local business reviews