Basic Usage

Learn the fundamentals of web scraping with ScrapeHub

Simple Scraping

The most basic scraping operation requires just a URL:

simple_scrape.py
from scrapehub import ScrapeHubClient

client = ScrapeHubClient(api_key="sk_live_xxxx_449x")

# Scrape a single page
result = client.scrape(
    url="https://example.com",
    engine="neural-x1"
)

# Access the extracted data
print(f"Extracted {len(result.data)} items")
for item in result.data:
    print(item)

Request Configuration

Basic Parameters

Parameter   Type     Description
url         string   Target URL to scrape
engine      string   Scraper engine (neural-x1, stealth, etc.)
format      string   Output format (json, csv, xml)
render_js   boolean  Enable JavaScript rendering
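
These parameters can be combined in a single call. The snippet below is a short illustrative sketch using the values listed in the table; render_js is covered in more detail in the next section.

from scrapehub import ScrapeHubClient

client = ScrapeHubClient(api_key="sk_live_xxxx_449x")

# Combine the basic parameters from the table above
result = client.scrape(
    url="https://example.com/products",
    engine="neural-x1",
    format="json",      # json, csv, or xml
    render_js=True      # enable JavaScript rendering for dynamic pages
)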

JavaScript Rendering

For pages that load content dynamically with JavaScript:

result = client.scrape(
    url="https://example.com/dynamic-page",
    engine="neural-x1",
    render_js=True,
    wait_for_selector=".content-loaded"  # Wait for specific element
)

Custom Headers

Add custom headers to your requests:

result = client.scrape(
    url="https://example.com",
    engine="neural-x1",
    headers={
        "User-Agent": "Mozilla/5.0...",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://google.com"
    }
)

Working with Results

Accessing Data

result = client.scrape("https://example.com/products")

# Access extracted data
data = result.data  # List of dictionaries

# Access metadata
print(f"Status: {result.status}")
print(f"Duration: {result.duration}s")
print(f"Pages scraped: {result.pages_scraped}")

# Iterate through results
for item in data:
    print(f"Title: {item.get('title')}")
    print(f"Price: {item.get('price')}")
    print(f"URL: {item.get('url')}")

Exporting Data

import pandas as pd
import json

# Convert to DataFrame
df = pd.DataFrame(result.data)

# Export to CSV
df.to_csv('results.csv', index=False)

# Export to JSON
with open('results.json', 'w') as f:
    json.dump(result.data, f, indent=2)

# Export to Excel
df.to_excel('results.xlsx', index=False)

Pagination

Automatically scrape multiple pages:

result = client.scrape(
    url="https://example.com/products",
    engine="neural-x1",
    pagination={
        "enabled": True,
        "max_pages": 10,
        "selector": "a.next-page"  # CSS selector for "Next" button
    }
)

print(f"Scraped {result.pages_scraped} pages")
print(f"Total items: {len(result.data)}")

Async vs Sync Scraping

Synchronous

Waits for results before returning. Best for single pages or small batches.

client.scrape()

Asynchronous

Returns immediately with a job ID. Best for large datasets.

client.create_job()

Synchronous Scraping

# Blocks until complete
result = client.scrape(
    url="https://example.com",
    engine="neural-x1"
)

# Data is immediately available
print(result.data)

Asynchronous Scraping

import time

# Create job (returns immediately)
job = client.create_job(
    url="https://example.com/large-dataset",
    engine="neural-x1"
)

print(f"Job ID: {job.id}")

# Poll for completion
while not job.is_complete():
    job.refresh()
    print(f"Progress: {job.progress}%")
    time.sleep(5)

# Get results when complete
if job.is_successful():
    results = job.get_results()
    print(f"Extracted {len(results)} items")

Best Practices

  • Use render_js=True for dynamic content
  • Enable pagination for complete data collection
  • Set appropriate timeouts for slow websites
  • Handle errors gracefully with try/except blocks (see the sketch after this list)
  • Use async jobs for large-scale scraping
  • Monitor your rate limits and plan usage
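
The sketch below illustrates the timeout and error-handling points above. The timeout keyword and the broad Exception catch are illustrative assumptions; check the SDK reference for the exact parameter name and the specific exception classes the client raises.

import logging

from scrapehub import ScrapeHubClient

client = ScrapeHubClient(api_key="sk_live_xxxx_449x")

try:
    # timeout is assumed here as a per-request limit (seconds) for slow sites
    result = client.scrape(
        url="https://example.com/slow-page",
        engine="neural-x1",
        timeout=60
    )
    print(f"Extracted {len(result.data)} items")
except Exception as exc:
    # Replace with the SDK's specific exception classes if it exposes them
    logging.error("Scrape failed: %s", exc)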