Amazon Scraper¶
Scrape product reviews from Amazon product pages. A browser (Playwright) is required for JavaScript rendering.
Stable
Quick Facts¶
| Property | Value |
|---|---|
| Browser Required | Yes (Playwright) |
| Authentication | None |
| Rate Limit | 10 requests/min |
| Data Available | Reviews, ratings, titles, verified purchase, helpful votes |
Setup¶
Install Playwright:
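Assuming the standard Playwright-for-Python distribution on PyPI, the usual install is the package plus the Chromium browser binaries (check the project's own requirements if Sentimatrix pins a different browser):

```shell
pip install playwright
playwright install chromium  # download the Chromium browser binaries
```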
Quick Start¶
```python
import asyncio

from sentimatrix import Sentimatrix


async def main():
    async with Sentimatrix() as sm:
        reviews = await sm.scrape_reviews(
            url="https://www.amazon.com/dp/B0BSHF7WHW",
            platform="amazon",
            max_reviews=50,
            use_browser=True,
        )
        print(f"Scraped {len(reviews)} reviews")
        for review in reviews[:3]:
            print(f"\n[{review.rating}/5] {review.title}")
            print(f"Verified: {review.verified_purchase}")
            print(f"Helpful: {review.helpful_count}")
            print(f"Text: {review.text[:150]}...")


asyncio.run(main())
```
URL Formats¶
```python
# Full product URL
url = "https://www.amazon.com/dp/B0BSHF7WHW"

# URL with product name slug
url = "https://www.amazon.com/Apple-iPhone-15-Pro-256GB/dp/B0BSHF7WHW"

# Review page URL
url = "https://www.amazon.com/product-reviews/B0BSHF7WHW"

# Just the ASIN
reviews = await sm.scrape_reviews(
    url="B0BSHF7WHW",
    platform="amazon",
)
```
Options¶
Filter by Rating¶
```python
reviews = await sm.scrape_reviews(
    url="https://www.amazon.com/dp/B0BSHF7WHW",
    platform="amazon",
    max_reviews=100,
    rating_filter=5,  # 1-5, or None for all ratings
)
```
Filter by Verified Purchase¶
```python
reviews = await sm.scrape_reviews(
    url="https://www.amazon.com/dp/B0BSHF7WHW",
    platform="amazon",
    max_reviews=100,
    verified_only=True,
)
```
Sort Order¶
```python
reviews = await sm.scrape_reviews(
    url="https://www.amazon.com/dp/B0BSHF7WHW",
    platform="amazon",
    max_reviews=100,
    sort_by="recent",  # "recent", "helpful", or "top"
)
```
Different Marketplaces¶
```python
# Amazon UK
reviews = await sm.scrape_reviews(
    url="https://www.amazon.co.uk/dp/B0BSHF7WHW",
    platform="amazon",
)

# Amazon Germany
reviews = await sm.scrape_reviews(
    url="https://www.amazon.de/dp/B0BSHF7WHW",
    platform="amazon",
)

# Amazon Japan
reviews = await sm.scrape_reviews(
    url="https://www.amazon.co.jp/dp/B0BSHF7WHW",
    platform="amazon",
)
```
Response Schema¶
```python
class AmazonReview:
    text: str                # Review text content
    title: str               # Review title
    rating: int              # 1-5 star rating
    helpful_count: int       # Number of helpful votes
    verified_purchase: bool  # Whether the purchase is verified
    author_name: str         # Reviewer name
    posted_date: datetime    # Review date
    images: list[str]        # Image URLs (if any)
    variant: str             # Product variant purchased
    platform: str            # Always "amazon"
    marketplace: str         # "amazon.com", "amazon.co.uk", etc.
```
Using Commercial APIs¶
For high-volume scraping, use commercial APIs to avoid blocks:
```python
from sentimatrix import Sentimatrix
from sentimatrix.config import SentimatrixConfig, ScraperConfig

# Using ScraperAPI
config = SentimatrixConfig(
    scraper=ScraperConfig(
        api_provider="scraperapi",
        api_key="your-scraperapi-key",
    )
)

async with Sentimatrix(config) as sm:
    reviews = await sm.scrape_reviews(
        url="https://www.amazon.com/dp/B0BSHF7WHW",
        platform="amazon",
        max_reviews=500,  # Commercial APIs can sustain larger volumes
    )
```
Example: Product Analysis¶
```python
import asyncio

from sentimatrix import Sentimatrix
from sentimatrix.config import SentimatrixConfig, LLMConfig


async def analyze_product(asin: str):
    config = SentimatrixConfig(
        llm=LLMConfig(
            provider="groq",
            model="llama-3.3-70b-versatile",
        )
    )

    async with Sentimatrix(config) as sm:
        # Scrape reviews
        reviews = await sm.scrape_reviews(
            url=asin,
            platform="amazon",
            max_reviews=100,
            use_browser=True,
        )

        # Calculate the rating distribution
        ratings = {}
        for review in reviews:
            ratings[review.rating] = ratings.get(review.rating, 0) + 1

        print("Rating Distribution:")
        for rating in sorted(ratings.keys(), reverse=True):
            count = ratings[rating]
            pct = count / len(reviews) * 100
            bar = "=" * int(pct / 2)
            print(f"{rating}: {bar} {pct:.1f}%")

        # Analyze sentiments
        results = await sm.analyze_batch([r.text for r in reviews])

        # Generate aspect-based analysis
        aspects = await sm.analyze_aspects(
            [r.text for r in reviews],
            aspects=["quality", "price", "shipping", "durability", "design"],
        )
        print("\nAspect Sentiments:")
        for aspect, sentiment in aspects.items():
            print(f"  {aspect}: {sentiment}")

        # Generate a summary
        summary = await sm.summarize_reviews(
            [{"text": r.text, "rating": r.rating} for r in reviews[:50]]
        )
        print(f"\nSummary:\n{summary}")


asyncio.run(analyze_product("B0BSHF7WHW"))
```
Handling Anti-Bot Measures¶
Amazon has aggressive anti-bot measures. Recommended approaches:
1. Use Stealth Mode¶
```python
from sentimatrix.config import SentimatrixConfig, ScraperConfig, BrowserConfig

config = SentimatrixConfig(
    scraper=ScraperConfig(
        browser=BrowserConfig(
            headless=True,
            stealth_mode=True,   # Use stealth browser settings
            random_delays=True,  # Add random delays between actions
        )
    )
)
```
2. Rotate User Agents¶
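One generic approach is to keep a pool of realistic user-agent strings and pick one per browser session. This sketch shows only the selection logic; how the chosen string is handed to the browser (for example, a user-agent field on `BrowserConfig`) is an assumption, so check the library's browser options for the actual hook.

```python
import random

# Illustrative pool of realistic desktop user agents.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def pick_user_agent() -> str:
    """Return a random user agent for the next browser session."""
    return random.choice(USER_AGENTS)
```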
3. Use Commercial APIs (Recommended)¶
For reliable scraping at scale:
```python
from sentimatrix.config import SentimatrixConfig, ScraperConfig

config = SentimatrixConfig(
    scraper=ScraperConfig(
        api_provider="scraperapi",  # or "brightdata", "oxylabs"
        api_key="your-key",
    )
)
```
Rate Limiting¶
```python
from sentimatrix.config import SentimatrixConfig, ScraperConfig, RateLimitConfig

config = SentimatrixConfig(
    scraper=ScraperConfig(
        rate_limit=RateLimitConfig(
            requests_per_second=0.15,  # ~9 requests per minute
            burst_size=2,
            cooldown_on_429=60,        # Wait 60s after a rate-limit response
        )
    )
)
```
Error Handling¶
```python
from sentimatrix.exceptions import (
    ScraperError,
    BlockedError,
    CaptchaError,
    RateLimitError,
)

try:
    reviews = await sm.scrape_reviews(
        url="https://www.amazon.com/dp/B0BSHF7WHW",
        platform="amazon",
        use_browser=True,
    )
except CaptchaError:
    print("CAPTCHA detected, try using a commercial API")
except BlockedError:
    print("IP blocked, rotate IP or use a proxy")
except RateLimitError as e:
    print(f"Rate limited, wait {e.retry_after}s")
except ScraperError as e:
    print(f"Failed: {e}")
```
Best Practices¶
- **Use Commercial APIs for Scale**
    - Direct scraping works for small volumes
    - Commercial APIs handle anti-bot measures
- **Implement Delays**
    - Wait between requests
    - Randomize timing
- **Monitor for Blocks**
    - Check for CAPTCHA responses
    - Rotate IPs if blocked
- **Cache Results**
    - Store scraped reviews
    - Avoid repeated requests
- **Filter Verified Purchases**
    - Higher quality reviews
    - More trustworthy data