# Commercial Scraping APIs

Commercial scraping APIs provide reliable, scalable web scraping without the hassle of managing proxies, handling CAPTCHAs, or dealing with blocks.

## Why Use Commercial APIs?
| Challenge | Direct Scraping | Commercial API |
|---|---|---|
| IP Blocks | Manage proxies yourself | Automatic rotation |
| CAPTCHAs | Manual solving | Automatic solving |
| Rate Limits | Careful throttling | Higher limits |
| JavaScript | Run browsers | Cloud rendering |
| Maintenance | Constant updates | Provider handles |
| Scale | Limited | Millions of requests |
## Supported Providers

| Provider | Proxy Pool | Starting Price |
|---|---|---|
| ScraperAPI | 40M+ | $49/mo |
| Apify | Varies | Pay-per-use |
| Bright Data | 72M+ | $500/mo |
| Oxylabs | 100M+ | Custom |
| Zyte | 50M+ | $450/mo |
| ScrapingBee | 1M+ | $49/mo |
| ScrapingAnt | 1M+ | $19/mo |
## Quick Start

### Using ScraperAPI

```python
import asyncio

from sentimatrix.providers.scrapers.commercial import ScraperAPIClient

async def main() -> None:
    async with ScraperAPIClient(api_key="your-api-key") as client:
        result = await client.scrape("https://example.com")
        print(f"Status: {result.status_code}")
        print(f"Content: {result.content[:200]}")

asyncio.run(main())
```
### Using Apify

```python
import asyncio

from sentimatrix.providers.scrapers.commercial import ApifyClient

async def main() -> None:
    async with ApifyClient(api_token="your-token") as client:
        # Basic scraping (uses cheerio-scraper)
        result = await client.scrape("https://example.com")

        # Or run specific actors
        run = await client.run_actor(
            "apify/web-scraper",
            input={"startUrls": [{"url": "https://example.com"}]},
        )
        items = await client.get_dataset_items(run["defaultDatasetId"])

asyncio.run(main())
```
### Using ScrapingBee

```python
import asyncio

from sentimatrix.providers.scrapers.commercial import ScrapingBeeClient

async def main() -> None:
    async with ScrapingBeeClient(api_key="your-api-key") as client:
        result = await client.scrape(
            "https://example.com",
            render_js=True,      # Enable JavaScript rendering
            premium_proxy=True,  # Use premium proxies
        )

asyncio.run(main())
```
## Environment Variables

```bash
# ScraperAPI
export SCRAPERAPI_KEY="your-key"

# Apify
export APIFY_TOKEN="your-token"

# Bright Data
export BRIGHTDATA_USERNAME="your-username"
export BRIGHTDATA_PASSWORD="your-password"

# Oxylabs
export OXYLABS_USERNAME="your-username"
export OXYLABS_PASSWORD="your-password"

# Zyte
export ZYTE_API_KEY="your-key"

# ScrapingBee
export SCRAPINGBEE_API_KEY="your-key"

# ScrapingAnt
export SCRAPINGANT_API_KEY="your-key"
```
## Provider Comparison

### By Use Case
| Use Case | Recommended |
|---|---|
| Budget | ScrapingAnt ($19/mo) |
| General | ScraperAPI, ScrapingBee |
| E-commerce | Oxylabs, Bright Data |
| Enterprise | Bright Data, Zyte |
| AI Extraction | Zyte, Apify |
### By Feature
| Feature | Best Provider |
|---|---|
| Largest Proxy Pool | Oxylabs (100M+) |
| Best for Amazon | Oxylabs |
| AI-Powered Extraction | Zyte, Apify |
| Lowest Price | ScrapingAnt |
| Screenshots | ScrapingBee |
| Pre-built Scrapers | Apify |
## Configuration Options

```python
ScraperConfig(
    # Provider selection
    api_provider="scraperapi",  # or "brightdata", "oxylabs", etc.

    # Authentication
    api_key="your-key",   # For most providers
    username="user",      # For Bright Data, Oxylabs
    password="pass",

    # Request options
    render_js=True,       # Enable JS rendering
    country="us",         # Geo-targeting
    premium_proxy=False,  # Use premium proxies

    # Retry settings
    retry=RetryConfig(
        max_retries=3,
        backoff_factor=2.0,
    ),
)
```
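With `backoff_factor=2.0`, retry delays typically grow exponentially with each attempt. The exact formula sentimatrix applies is an assumption here; the common convention is:

```python
def backoff_delay(attempt: int, backoff_factor: float = 2.0,
                  base: float = 1.0) -> float:
    """Delay in seconds before retry number `attempt` (0-indexed),
    using exponential backoff: base * factor ** attempt."""
    return base * (backoff_factor ** attempt)

# With the defaults: attempt 0 -> 1s, attempt 1 -> 2s, attempt 2 -> 4s.
```

So `max_retries=3` with a factor of 2.0 caps total retry wait at roughly 1 + 2 + 4 = 7 seconds per request under these assumptions.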
## Cost Estimation
| Volume | ScrapingAnt | ScraperAPI | Bright Data |
|---|---|---|---|
| 10K requests | $19 | $49 | ~$50 |
| 50K requests | ~$50 | ~$100 | ~$150 |
| 100K requests | ~$80 | ~$200 | ~$300 |
| 500K requests | ~$300 | ~$600 | ~$1000 |
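A quick way to compare plans is the effective cost per 1,000 requests: divide the plan price by the volume. Since the figures in the table above are approximate, treat the results as rough estimates:

```python
def cost_per_1k(plan_price: float, requests: int) -> float:
    """Effective price (in dollars) per 1,000 requests for a given plan."""
    return plan_price / (requests / 1000)

# From the table: ScrapingAnt at 10K requests is $19 / 10 = $1.90 per 1K,
# versus ScraperAPI at $49 / 10 = $4.90 per 1K.
```

Re-running this at each volume tier shows where a nominally pricier plan becomes the cheaper option per request.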
## Best Practices

1. **Start with Budget Providers**
    - Test with ScrapingAnt or ScraperAPI
    - Scale up as needed
2. **Use Geo-Targeting**
    - Match the target site's location
    - Reduces blocks
3. **Enable JS Rendering Selectively**
    - Only when needed
    - JS rendering costs more credits
4. **Implement Caching**
    - Avoid repeated requests for the same URL
    - Saves costs
5. **Monitor Usage**
    - Track API calls
    - Set budget alerts