# Web Scrapers

Sentimatrix provides a comprehensive scraping infrastructure to collect reviews and feedback from popular platforms.
## Scraper Categories

- :material-web: **Core Scrapers**: foundational scrapers for any website.
  HTTPX (static), Playwright (dynamic)
- :material-store: **Platform Scrapers**: pre-built scrapers for popular platforms.
  Amazon, Steam, YouTube, Reddit, IMDB, Yelp, Trustpilot, Google Reviews
- :material-api: **Commercial APIs**: enterprise-grade scraping services.
  ScraperAPI, Apify, Bright Data, Oxylabs, Zyte, ScrapingBee, ScrapingAnt
## Platform Support Matrix

| Platform | Browser Required | Auth Required | Rate Limit | Status |
|---|---|---|---|---|
| Amazon | Yes | — | 10/min | Stable |
| Steam | No | — | 20/min | Stable |
| YouTube | — | API Key | 100/min | Stable |
| Reddit | No | OAuth | 60/min | Stable |
| IMDB | Yes | — | 15/min | Stable |
| Yelp | — | API Key | 50/min | Stable |
| Trustpilot | — | — | 10/min | Stable |
| Google Reviews | — | — | 5/min | Beta |
## Quick Start

### Basic Scraping (Steam - No Browser)

```python
import asyncio

from sentimatrix import Sentimatrix


async def main():
    async with Sentimatrix() as sm:
        reviews = await sm.scrape_reviews(
            url="https://store.steampowered.com/app/1245620/ELDEN_RING/",
            platform="steam",
            max_reviews=50,
        )
        for review in reviews[:5]:
            print(f"Rating: {review.rating}")
            print(f"Text: {review.text[:100]}...")
            print()


asyncio.run(main())
```
### Browser-Based Scraping (Amazon)

```python
async with Sentimatrix() as sm:
    reviews = await sm.scrape_reviews(
        url="https://www.amazon.com/dp/B0BSHF7WHW",
        platform="amazon",
        max_reviews=30,
        use_browser=True,  # Enables Playwright
    )
```
!!! note "Playwright Required"

    For browser-based scraping, install Playwright and a browser binary:

    ```bash
    pip install playwright
    playwright install chromium
    ```
### Using Commercial APIs

```python
from sentimatrix.config import SentimatrixConfig, ScraperConfig

config = SentimatrixConfig(
    scraper=ScraperConfig(
        api_provider="scraperapi",
        api_key="your-scraperapi-key",
    )
)

async with Sentimatrix(config) as sm:
    reviews = await sm.scrape_reviews(
        url="https://www.amazon.com/dp/B0BSHF7WHW",
        platform="amazon",
        max_reviews=100,
    )
```
## Commercial API Comparison

| Service | Proxy Pool | JS Rendering | Starting Price | Best For |
|---|---|---|---|---|
| ScraperAPI | 40M+ | Yes | $49/mo | General use |
| Apify | Varies | Yes | Pay-per-use | Pre-built actors |
| Bright Data | 72M+ | Yes | $500/mo | Enterprise |
| Oxylabs | 100M+ | Yes | Custom | E-commerce |
| Zyte | 50M+ | Yes | $450/mo | AI extraction |
| ScrapingBee | 1M+ | Yes | $49/mo | Screenshots |
| ScrapingAnt | 1M+ | Yes | $19/mo | Budget |
## Configuration

### YAML Configuration

```yaml title="sentimatrix.yaml"
scraper:
  # Rate limiting
  rate_limit:
    requests_per_second: 2
    burst_size: 5

  # Retry settings
  retry:
    max_retries: 3
    backoff_factor: 2.0

  # Browser settings
  browser:
    headless: true
    timeout: 30000

  # Commercial API (optional)
  api:
    provider: scraperapi
    # api_key loaded from SCRAPERAPI_KEY env var
```
### Python Configuration

```python
from sentimatrix.config import (
    SentimatrixConfig,
    ScraperConfig,
    RateLimitConfig,
    RetryConfig,
    BrowserConfig,
)

config = SentimatrixConfig(
    scraper=ScraperConfig(
        rate_limit=RateLimitConfig(
            requests_per_second=2,
            burst_size=5,
        ),
        retry=RetryConfig(
            max_retries=3,
            backoff_factor=2.0,
        ),
        browser=BrowserConfig(
            headless=True,
            timeout=30000,
        ),
    )
)
```
## Error Handling

```python
from sentimatrix.exceptions import (
    ScraperError,
    RateLimitError,
    BlockedError,
    ParseError,
)

async with Sentimatrix() as sm:
    try:
        reviews = await sm.scrape_reviews(url, platform="amazon")
    except RateLimitError as e:
        print(f"Rate limited, retry after {e.retry_after}s")
    except BlockedError:
        print("IP blocked, try using a commercial API")
    except ParseError as e:
        print(f"Failed to parse response: {e}")
    except ScraperError as e:
        print(f"Scraping failed: {e}")
```
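A common pattern on top of these exceptions is to honor `retry_after` automatically instead of failing. The sketch below uses a stand-in `RateLimitError` so it runs without Sentimatrix installed; in practice you would import the real exception and pass a `scrape_reviews` call as the factory:

```python
import asyncio


class RateLimitError(Exception):
    """Stand-in for sentimatrix.exceptions.RateLimitError."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


async def with_rate_limit_retry(coro_factory, max_attempts: int = 3):
    """Call coro_factory(), sleeping for retry_after between rate-limited attempts."""
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except RateLimitError as e:
            if attempt == max_attempts - 1:
                raise  # Out of attempts; let the caller handle it.
            await asyncio.sleep(e.retry_after)
```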
## Scraping Infrastructure

Sentimatrix includes a robust scraping infrastructure with:

### Rate Limiting

A built-in token-bucket rate limiter with configurable strategies:
```python
from sentimatrix.config import RateLimitConfig

rate_limit = RateLimitConfig(
    requests_per_second=2,   # Base rate
    concurrent_requests=5,   # Max concurrent
    backoff_factor=2.0,      # Exponential backoff
)
```
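A token bucket admits short bursts up to a fixed capacity while holding the long-run average to the configured rate. A minimal synchronous sketch of the algorithm (illustrative only; the limiter shipped with Sentimatrix is internal and async):

```python
import time


class TokenBucket:
    """Allow bursts up to `burst_size`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, burst_size: int):
        self.rate = rate
        self.capacity = burst_size
        self.tokens = float(burst_size)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `rate=2` and `burst_size=5`, five requests pass immediately, then requests are admitted at two per second.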
### Retry Handler

Automatic retry with exponential backoff:

```python
from sentimatrix.config import RetryConfig

retry = RetryConfig(
    max_retries=3,
    initial_delay=1.0,
    exponential_base=2.0,
    jitter=True,  # Add randomness to prevent thundering herd
)
```
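The delay schedule these settings produce is `initial_delay * exponential_base ** attempt`, optionally randomized. A small illustrative helper (not part of the library) that computes it:

```python
import random


def backoff_delay(attempt: int, initial_delay: float = 1.0,
                  exponential_base: float = 2.0, jitter: bool = True) -> float:
    """Delay before retry `attempt` (0-based), with optional full jitter."""
    delay = initial_delay * exponential_base ** attempt
    if jitter:
        # Full jitter: pick uniformly in [0, delay] so many clients
        # retrying at once do not hit the server in lockstep.
        delay = random.uniform(0, delay)
    return delay
```

Without jitter the schedule is deterministic: 1s, 2s, 4s for the three retries above.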
### Proxy Support

Rotate proxies to avoid blocks:

```python
from sentimatrix.config import ProxyConfig

proxy = ProxyConfig(
    enabled=True,
    rotation=True,
    country="us",
)
```
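Rotation simply hands each request the next endpoint from a pool. A round-robin sketch with hypothetical proxy URLs (commercial providers handle this server-side):

```python
import itertools

PROXIES = [  # illustrative endpoints, not real defaults
    "http://us-proxy-1.example.com:8080",
    "http://us-proxy-2.example.com:8080",
]

# Each call returns the next proxy, wrapping around at the end of the pool.
get_proxy = itertools.cycle(PROXIES).__next__
```

Each outgoing request would then use `get_proxy()` as its HTTP/HTTPS proxy, spreading traffic across the pool.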
## Best Practices

1. **Respect Rate Limits**
    - Don't exceed platform limits
    - Implement exponential backoff
    - Use delays between requests

2. **Use Browser Only When Needed**
    - Steam, Reddit: no browser required
    - Amazon, IMDB: browser recommended

3. **Handle Blocks Gracefully**
    - Rotate user agents
    - Use commercial APIs for scale
    - Implement retry logic

4. **Cache Responses**
    - Avoid repeated requests
    - Store scraped data locally

5. **Stay Compliant**
    - Check robots.txt
    - Follow terms of service
    - Don't overload servers
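The robots.txt check from the compliance list can be automated with the standard library's `urllib.robotparser`. A small sketch that parses an inline, made-up robots.txt so the example stays offline (in practice you would fetch the file from the target site):

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules before scraping it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ROBOTS = """\
User-agent: *
Disallow: /gp/
Allow: /
"""
```

Here `is_allowed(ROBOTS, "my-bot", "https://example.com/dp/B0BSHF7WHW")` passes, while any path under `/gp/` is refused.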
## Scraper Documentation

### Core Scrapers
- HTTPX Scraper - Static pages
- Playwright Scraper - Dynamic pages
### Platform Scrapers
- Amazon - Product reviews
- Steam - Game reviews
- YouTube - Video comments
- Reddit - Posts and comments
- IMDB - Movie reviews
- Yelp - Business reviews
- Trustpilot - Company reviews
- Google Reviews - Local business reviews