# Web Scrapers

Sentimatrix provides a comprehensive scraping infrastructure to collect reviews and feedback from popular platforms.

## Scraper Categories

- **:material-web: Core Scrapers**: foundational scrapers for any website: HTTPX (static pages), Playwright (dynamic, JavaScript-rendered pages)
- **:material-store: Platform Scrapers**: pre-built scrapers for popular platforms: Amazon, Steam, YouTube, Reddit, IMDB, Yelp, Trustpilot, Google Reviews
- **:material-api: Commercial APIs**: enterprise-grade scraping services: ScraperAPI, Apify, Bright Data, Oxylabs, Zyte, ScrapingBee, ScrapingAnt

## Platform Support Matrix

| Platform       | Browser Required | Auth Required | Rate Limit | Status |
|----------------|------------------|---------------|------------|--------|
| Amazon         | Yes              | No            | 10/min     | Stable |
| Steam          | No               | No            | 20/min     | Stable |
| YouTube        | No               | API Key       | 100/min    | Stable |
| Reddit         | No               | OAuth         | 60/min     | Stable |
| IMDB           | Yes              | No            | 15/min     | Stable |
| Yelp           | No               | API Key       | 50/min     | Stable |
| Trustpilot     | No               | No            | 10/min     | Stable |
| Google Reviews | Yes              | No            | 5/min      | Beta   |

## Quick Start

### Basic Scraping (Steam - No Browser)

```python
import asyncio
from sentimatrix import Sentimatrix

async def main():
    async with Sentimatrix() as sm:
        reviews = await sm.scrape_reviews(
            url="https://store.steampowered.com/app/1245620/ELDEN_RING/",
            platform="steam",
            max_reviews=50
        )

        for review in reviews[:5]:
            print(f"Rating: {review.rating}")
            print(f"Text: {review.text[:100]}...")
            print()

asyncio.run(main())
```

### Browser-Based Scraping (Amazon)

```python
async with Sentimatrix() as sm:
    reviews = await sm.scrape_reviews(
        url="https://www.amazon.com/dp/B0BSHF7WHW",
        platform="amazon",
        max_reviews=30,
        use_browser=True  # Enables Playwright
    )
```

!!! note "Playwright Required"

    For browser-based scraping, install the scraping extras and a Chromium build:

    ```bash
    pip install sentimatrix[scraping]
    playwright install chromium
    ```

## Using Commercial APIs

```python
from sentimatrix import Sentimatrix
from sentimatrix.config import SentimatrixConfig, ScraperConfig

config = SentimatrixConfig(
    scraper=ScraperConfig(
        api_provider="scraperapi",
        api_key="your-scraperapi-key"
    )
)

async with Sentimatrix(config) as sm:
    reviews = await sm.scrape_reviews(
        url="https://www.amazon.com/dp/B0BSHF7WHW",
        platform="amazon",
        max_reviews=100
    )
```

## Commercial API Comparison

| Service     | Proxy Pool | JS Rendering | Starting Price | Best For         |
|-------------|------------|--------------|----------------|------------------|
| ScraperAPI  | 40M+       | Yes          | $49/mo         | General use      |
| Apify       | Varies     | Yes          | Pay-per-use    | Pre-built actors |
| Bright Data | 72M+       | Yes          | $500/mo        | Enterprise       |
| Oxylabs     | 100M+      | Yes          | Custom         | E-commerce       |
| Zyte        | 50M+       | Yes          | $450/mo        | AI extraction    |
| ScrapingBee | 1M+        | Yes          | $49/mo         | Screenshots      |
| ScrapingAnt | 1M+        | Yes          | $19/mo         | Budget           |

## Configuration

### YAML Configuration

```yaml title="sentimatrix.yaml"
scraper:
  # Rate limiting
  rate_limit:
    requests_per_second: 2
    burst_size: 5

  # Retry settings
  retry:
    max_retries: 3
    backoff_factor: 2.0

  # Browser settings
  browser:
    headless: true
    timeout: 30000

  # Commercial API (optional)
  api:
    provider: scraperapi
    # api_key loaded from SCRAPERAPI_KEY env var
```

### Python Configuration

```python
from sentimatrix.config import (
    SentimatrixConfig,
    ScraperConfig,
    RateLimitConfig,
    RetryConfig,
    BrowserConfig,
)

config = SentimatrixConfig(
    scraper=ScraperConfig(
        rate_limit=RateLimitConfig(
            requests_per_second=2,
            burst_size=5,
        ),
        retry=RetryConfig(
            max_retries=3,
            backoff_factor=2.0,
        ),
        browser=BrowserConfig(
            headless=True,
            timeout=30000,
        ),
    )
)
```

## Error Handling

```python
from sentimatrix import Sentimatrix
from sentimatrix.exceptions import (
    ScraperError,
    RateLimitError,
    BlockedError,
    ParseError,
)

url = "https://www.amazon.com/dp/B0BSHF7WHW"

async with Sentimatrix() as sm:
    try:
        reviews = await sm.scrape_reviews(url, platform="amazon")
    except RateLimitError as e:
        print(f"Rate limited, retry after {e.retry_after}s")
    except BlockedError:
        print("IP blocked, try using a commercial API")
    except ParseError as e:
        print(f"Failed to parse response: {e}")
    except ScraperError as e:
        print(f"Scraping failed: {e}")
```
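Because `RateLimitError` carries a `retry_after` hint, a small wrapper can sleep for exactly that long and try again. The sketch below is self-contained for illustration: it defines a stand-in exception and a fake scrape function rather than importing Sentimatrix, so only the retry pattern itself is the point:

```python
import asyncio

class RateLimitError(Exception):
    """Stand-in for sentimatrix.exceptions.RateLimitError (carries retry_after)."""
    def __init__(self, retry_after: float):
        super().__init__(f"rate limited; retry after {retry_after}s")
        self.retry_after = retry_after

async def with_rate_limit_retry(operation, max_attempts=3):
    """Re-run an async operation, sleeping for retry_after on each rate limit."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except RateLimitError as e:
            if attempt == max_attempts:
                raise  # out of attempts: propagate to the caller
            await asyncio.sleep(e.retry_after)

# Demo: fail twice with short retry_after values, then succeed.
calls = 0
async def flaky_scrape():
    global calls
    calls += 1
    if calls < 3:
        raise RateLimitError(retry_after=0.01)
    return ["review-1", "review-2"]

result = asyncio.run(with_rate_limit_retry(flaky_scrape))
print(result)  # ['review-1', 'review-2'] after two retried rate limits
```

The same shape works with the real `scrape_reviews` call as the `operation`.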

## Scraping Infrastructure

Sentimatrix includes a robust scraping infrastructure with:

### Rate Limiting

Built-in token bucket rate limiter with configurable strategies:

```python
from sentimatrix.config import RateLimitConfig

rate_limit = RateLimitConfig(
    requests_per_second=2,      # Base rate
    concurrent_requests=5,      # Max concurrent
    backoff_factor=2.0,         # Exponential backoff
)
```
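The token-bucket idea behind this limiter fits in a few lines: tokens refill at the base rate, accumulate up to a burst cap, and each request spends one. This is a sketch of the algorithm, not Sentimatrix's internal implementation:

```python
import time

class TokenBucket:
    """Minimal token bucket: `rate` tokens/second, at most `burst` stored."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)       # start full
        self.last = time.monotonic()

    def acquire(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                     # caller should wait and retry

bucket = TokenBucket(rate=2, burst=5)
allowed = sum(bucket.acquire() for _ in range(10))
print(allowed)  # 5: the burst is spent instantly; later tokens refill at 2/s
```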

### Retry Handler

Automatic retry with exponential backoff:

```python
from sentimatrix.config import RetryConfig

retry = RetryConfig(
    max_retries=3,
    initial_delay=1.0,
    exponential_base=2.0,
    jitter=True,  # Add randomness to prevent thundering herd
)
```
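The schedule these settings describe is `initial_delay * exponential_base ** attempt`, optionally scaled by random jitter. A hypothetical helper (not part of Sentimatrix's API) makes the arithmetic concrete:

```python
import random

def backoff_delay(attempt: int, initial_delay=1.0, exponential_base=2.0, jitter=True):
    """Seconds to wait before retry `attempt` (0-based)."""
    delay = initial_delay * exponential_base ** attempt
    if jitter:
        # Spread retries over 50%-150% of the nominal delay so many
        # clients rate-limited at once don't all retry simultaneously.
        delay *= random.uniform(0.5, 1.5)
    return delay

delays = [backoff_delay(a, jitter=False) for a in range(4)]
print(delays)  # [1.0, 2.0, 4.0, 8.0]
```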

### Proxy Support

Rotate proxies to avoid blocks:

```python
from sentimatrix.config import ProxyConfig

proxy = ProxyConfig(
    enabled=True,
    rotation=True,
    country="us",
)
```

## Best Practices

1. **Respect rate limits**
    - Don't exceed platform limits
    - Implement exponential backoff
    - Use delays between requests
2. **Use a browser only when needed**
    - Steam, Reddit: no browser required
    - Amazon, IMDB: browser recommended
3. **Handle blocks gracefully**
    - Rotate user agents
    - Use commercial APIs at scale
    - Implement retry logic
4. **Cache responses**
    - Avoid repeated requests
    - Store scraped data locally
5. **Stay compliant**
    - Check robots.txt
    - Follow each platform's terms of service
    - Don't overload servers
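Caching responses can be as simple as keying results by a hash of the URL and writing them to disk. The sketch below is illustrative (the layout and helper names are not part of Sentimatrix); the demo uses a fake fetcher and a temporary directory so it stands alone:

```python
import hashlib
import json
import pathlib
import tempfile

CACHE_DIR = pathlib.Path(tempfile.mkdtemp())  # use a persistent dir in practice

def cache_path(url: str) -> pathlib.Path:
    # One JSON file per URL, named by a stable hash of the URL.
    return CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".json")

def cached_fetch(url, fetch):
    """Return cached data for `url`; call `fetch` only on a cache miss."""
    path = cache_path(url)
    if path.exists():
        return json.loads(path.read_text())
    data = fetch(url)
    path.write_text(json.dumps(data))
    return data

# Demo with a fake fetcher that records how often it is actually called.
calls = []
def fake_fetch(url):
    calls.append(url)
    return [{"rating": 5, "text": "Great product"}]

first = cached_fetch("https://example.com/dp/123", fake_fetch)
second = cached_fetch("https://example.com/dp/123", fake_fetch)  # cache hit
print(len(calls))  # 1: the second call never reached the network
```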

## Scraper Documentation

- Core Scrapers
- Platform Scrapers
- Commercial APIs