Basic Usage
Learn the fundamentals of web scraping with ScrapeHub
Simple Scraping
The most basic scraping operation requires just a URL:
simple_scrape.py
from scrapehub import ScrapeHubClient
client = ScrapeHubClient(api_key="sk_live_xxxx_449x")
# Scrape a single page
result = client.scrape(
url="https://example.com",
engine="neural-x1"
)
# Access the extracted data
print(f"Extracted {len(result.data)} items")
for item in result.data:
    print(item)

Request Configuration
Basic Parameters
| Parameter | Type | Description |
|---|---|---|
| url | string | Target URL to scrape |
| engine | string | Scraper engine (neural-x1, stealth, etc.) |
| format | string | Output format (json, csv, xml) |
| render_js | boolean | Enable JavaScript rendering |
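These parameters can be combined in a single call. A short illustration using the values from the table above (the URL is a placeholder):

# Combine the basic parameters from the table above
result = client.scrape(
    url="https://example.com/catalog",  # target URL
    engine="neural-x1",                 # scraper engine
    format="json",                      # output format: json, csv, or xml
    render_js=False                     # JavaScript rendering off for a static page
)
print(f"Extracted {len(result.data)} items")
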
JavaScript Rendering
For pages that load content dynamically with JavaScript:
result = client.scrape(
url="https://example.com/dynamic-page",
engine="neural-x1",
render_js=True,
wait_for_selector=".content-loaded" # Wait for specific element
)Custom Headers
Add custom headers to your requests:
result = client.scrape(
url="https://example.com",
engine="neural-x1",
headers={
"User-Agent": "Mozilla/5.0...",
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://google.com"
}
)Working with Results
Accessing Data
result = client.scrape("https://example.com/products")
# Access extracted data
data = result.data # List of dictionaries
# Access metadata
print(f"Status: {result.status}")
print(f"Duration: {result.duration}s")
print(f"Pages scraped: {result.pages_scraped}")
# Iterate through results
for item in data:
print(f"Title: {item.get('title')}")
print(f"Price: {item.get('price')}")
print(f"URL: {item.get('url')}")Exporting Data
import pandas as pd
import json
# Convert to DataFrame
df = pd.DataFrame(result.data)
# Export to CSV
df.to_csv('results.csv', index=False)
# Export to JSON
with open('results.json', 'w') as f:
    json.dump(result.data, f, indent=2)

# Export to Excel
df.to_excel('results.xlsx', index=False)

Pagination
Automatically scrape multiple pages:
result = client.scrape(
url="https://example.com/products",
engine="neural-x1",
pagination={
"enabled": True,
"max_pages": 10,
"selector": "a.next-page" # CSS selector for "Next" button
}
)
print(f"Scraped {result.pages_scraped} pages")
print(f"Total items: {len(result.data)}")Async vs Sync Scraping
Synchronous
Waits for results before returning. Best for single pages or small batches.
client.scrape()

Asynchronous
Returns immediately with a job ID. Best for large datasets.
client.create_job()

Synchronous Scraping
# Blocks until complete
result = client.scrape(
url="https://example.com",
engine="neural-x1"
)
# Data is immediately available
print(result.data)Asynchronous Scraping
import time
# Create job (returns immediately)
job = client.create_job(
url="https://example.com/large-dataset",
engine="neural-x1"
)
print(f"Job ID: {job.id}")
# Poll for completion
while not job.is_complete():
job.refresh()
print(f"Progress: {job.progress}%")
time.sleep(5)
# Get results when complete
if job.is_successful():
results = job.get_results()
print(f"Extracted {len(results)} items")Best Practices
- Use render_js=True for dynamic content
- Enable pagination for complete data collection
- Set appropriate timeouts for slow websites
- Handle errors gracefully with try/except blocks (a sketch follows this list)
- Use async jobs for large-scale scraping
- Monitor your rate limits and plan usage
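Below is a minimal error-handling sketch for the try/except practice above. The timeout argument is an assumption for illustration and may be named differently in the SDK; the generic Exception catch can be narrowed to the SDK's own exception class if one is provided.

from scrapehub import ScrapeHubClient

client = ScrapeHubClient(api_key="sk_live_xxxx_449x")

try:
    result = client.scrape(
        url="https://example.com/slow-page",
        engine="neural-x1",
        timeout=60  # assumed parameter: allow extra time for slow sites
    )
    print(f"Extracted {len(result.data)} items")
except Exception as exc:  # narrow to the SDK's own exception class if available
    print(f"Scrape failed: {exc}")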