Security Best Practices

Protect your account, data, and API keys by following the practices below

Security is Critical

Compromised API keys can lead to unauthorized access, data breaches, and unexpected charges. Following these security best practices is essential to protect your account and data.

API Key Security

Never Commit Keys to Version Control

Don't Do This

bad_example.py
python
# ❌ NEVER hardcode API keys
api_key = "sk_live_1234567890"  # This is BAD!

client = ScrapeHubClient(api_key=api_key)

Do This Instead

good_example.py
python
# ✅ Use environment variables
import os

api_key = os.getenv('SCRAPEHUB_API_KEY')
client = ScrapeHubClient(api_key=api_key)

Environment Variables

Store API keys in environment variables or secure credential managers:

.env
bash
# .env file (add to .gitignore!)
SCRAPEHUB_API_KEY=sk_live_xxxx_xxxx
SCRAPEHUB_WEBHOOK_SECRET=whsec_xxxx_xxxx

# Never commit this file to git!
.gitignore
bash
# Add to .gitignore
.env
.env.local
.env.*.local
*.key
secrets.json
credentials.json

Loading Environment Variables

load_env.py
python
# Python with python-dotenv
from dotenv import load_dotenv
import os

load_dotenv()  # Load from .env file
api_key = os.getenv('SCRAPEHUB_API_KEY')

if not api_key:
    raise ValueError("API key not found in environment variables")

load_env.js
javascript
// Node.js with dotenv
require('dotenv').config();

const apiKey = process.env.SCRAPEHUB_API_KEY;

if (!apiKey) {
  throw new Error('API key not found in environment variables');
}
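
If you keep keys in a managed credential store rather than a local .env file (the other option mentioned above), the loading step looks much the same. Below is a minimal sketch assuming AWS Secrets Manager via boto3; the secret name scrapehub/api-key is purely illustrative, and other stores (Vault, GCP Secret Manager, etc.) follow the same pattern.

load_secret.py
python
# Sketch: fetch the ScrapeHub API key from AWS Secrets Manager
# (an assumption for illustration, not a ScrapeHub feature)
import boto3

def load_api_key(secret_id="scrapehub/api-key"):  # secret name is hypothetical
    """Fetch the API key from a managed secret store."""
    secrets_client = boto3.client("secretsmanager")
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return response["SecretString"]

api_key = load_api_key()
if not api_key:
    raise ValueError("API key not found in secret store")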

API Key Management

Use Multiple Keys

Create separate API keys for development, staging, and production environments, as sketched below

Rotate Regularly

Rotate API keys every 90 days or immediately if compromised

Monitor Usage

Regularly review API key usage in your dashboard for suspicious activity

Revoke Unused Keys

Delete API keys that are no longer in use to minimize risk
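
One simple way to apply the first of these practices is to keep one key per environment and select it at startup. The sketch below assumes an APP_ENV variable and per-environment variable names; both are illustrative conventions, not part of the ScrapeHub SDK.

select_key.py
python
import os

# Hypothetical convention: one key per environment, selected by APP_ENV
ENV_KEY_VARS = {
    "development": "SCRAPEHUB_API_KEY_DEV",
    "staging": "SCRAPEHUB_API_KEY_STAGING",
    "production": "SCRAPEHUB_API_KEY_PROD",
}

def get_api_key():
    env = os.getenv("APP_ENV", "development")
    var_name = ENV_KEY_VARS.get(env)
    if not var_name:
        raise ValueError(f"Unknown environment: {env}")
    api_key = os.getenv(var_name)
    if not api_key:
        raise ValueError(f"{var_name} is not set")
    return api_key

client = ScrapeHubClient(api_key=get_api_key())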

Secure Data Handling

Sanitize Scraped Data

Always sanitize and validate scraped data before using it:

sanitize_data.py
python
import html
import re

def sanitize_text(text):
    """Remove potentially harmful content from scraped text"""
    if not text:
        return ""

    # HTML entity decode
    text = html.unescape(text)

    # Remove script tags and content
    text = re.sub(r'<script[^>]*>.*?</script>', '', text, flags=re.DOTALL | re.IGNORECASE)

    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)

    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Use sanitization
result = client.scrape("https://example.com")
clean_title = sanitize_text(result.data.get('title'))
clean_description = sanitize_text(result.data.get('description'))

Validate URLs

validate_urls.py
python
from urllib.parse import urlparse

def is_safe_url(url, allowed_domains=None):
    """Validate URL before scraping"""
    try:
        parsed = urlparse(url)

        # Check scheme
        if parsed.scheme not in ['http', 'https']:
            return False

        # Use hostname, not netloc: netloc may include a port or credentials
        hostname = parsed.hostname
        if not hostname:
            return False

        # Check domain whitelist
        if allowed_domains and hostname not in allowed_domains:
            return False

        # Avoid localhost and loopback addresses
        if hostname in ['localhost', '127.0.0.1', '0.0.0.0']:
            return False

        return True

    except Exception:
        return False

# Use validation
url = user_provided_url  # From user input
if is_safe_url(url, allowed_domains=['example.com', 'api.example.com']):
    result = client.scrape(url)
else:
    raise ValueError("Invalid or unsafe URL")

Encrypt Sensitive Data

encrypt_data.py
python
from cryptography.fernet import Fernet
import os

# Generate or load encryption key
encryption_key = os.getenv('ENCRYPTION_KEY')
if not encryption_key:
    encryption_key = Fernet.generate_key()
    print(f"Save this key securely: {encryption_key.decode()}")

cipher = Fernet(encryption_key)

# Encrypt sensitive scraped data
result = client.scrape("https://example.com")
sensitive_data = str(result.data).encode()
encrypted_data = cipher.encrypt(sensitive_data)

# Store encrypted_data in database

# Later, decrypt when needed
decrypted_data = cipher.decrypt(encrypted_data).decode()
print(decrypted_data)

Network Security

Use HTTPS

Always use HTTPS for API requests and webhook endpoints:

https_only.py
python
# Configure client to enforce HTTPS
client = ScrapeHubClient(
    api_key=api_key,
    enforce_https=True  # Reject non-HTTPS URLs
)

# Webhook endpoints must use HTTPS in production
webhook_url = "https://your-server.com/webhook"  # ✅ HTTPS
# webhook_url = "http://your-server.com/webhook"  # ❌ HTTP

IP Whitelisting

Restrict API access to specific IP addresses in the dashboard:

Terminal
# In ScrapeHub Dashboard → Settings → Security
# Add allowed IP addresses:
#   203.0.113.0/24  (your server IP range)
#   198.51.100.50   (your office IP)
# API requests from other IPs will be rejected

Rate Limiting

rate_limiting.py
python
from time import sleep
from datetime import datetime, timedelta

class RateLimiter:
    def __init__(self, max_requests, time_window):
        self.max_requests = max_requests
        self.time_window = time_window  # seconds
        self.requests = []

    def wait_if_needed(self):
        now = datetime.now()
        cutoff = now - timedelta(seconds=self.time_window)

        # Remove old requests
        self.requests = [r for r in self.requests if r > cutoff]

        if len(self.requests) >= self.max_requests:
            # Wait until oldest request expires
            sleep_time = (self.requests[0] - cutoff).total_seconds()
            sleep(sleep_time)
            self.requests.pop(0)

        self.requests.append(datetime.now())  # Record the actual request time (after any wait)

# Use rate limiter
limiter = RateLimiter(max_requests=10, time_window=60)

for url in urls:
    limiter.wait_if_needed()
    result = client.scrape(url)

Error Handling

Don't Expose Sensitive Info in Errors

safe_error_handling.py
python
import logging

# Configure logging (don't log sensitive data)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    result = client.scrape(url)
except Exception as e:
    # ❌ Bad: Exposes API key in logs
    # logger.error(f"Scraping failed with key {api_key}: {e}")

    # ✅ Good: Generic error message
    logger.error(f"Scraping failed for URL: {e}")

    # Don't expose internal errors to end users
    # ❌ Bad: raise e
    # ✅ Good: raise generic error
    raise Exception("Unable to fetch data. Please try again later.")

Webhook Security

Always Verify Signatures

webhook_security.py
python
import hmac
import hashlib
import logging
import os

from flask import Flask, request, abort

app = Flask(__name__)
logger = logging.getLogger(__name__)

WEBHOOK_SECRET = os.getenv('SCRAPEHUB_WEBHOOK_SECRET')
if not WEBHOOK_SECRET:
    raise RuntimeError("SCRAPEHUB_WEBHOOK_SECRET is not set")

@app.route('/webhook', methods=['POST'])
def webhook():
    # Get signature
    signature = request.headers.get('X-ScrapeHub-Signature')
    if not signature:
        logger.warning("Webhook request missing signature")
        abort(401)

    # Verify signature
    payload = request.get_data()
    expected = hmac.new(
        WEBHOOK_SECRET.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()

    if not hmac.compare_digest(signature, expected):
        logger.warning("Invalid webhook signature")
        abort(401)

    # Process webhook
    data = request.json
    process_webhook(data)

    return {'status': 'success'}, 200

Compliance & Privacy

Respect robots.txt

Check and honor robots.txt directives before scraping. ScrapeHub provides built-in robots.txt parsing.
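
If you also want an explicit check in your own code, Python's standard library can fetch and parse robots.txt. A minimal sketch, independent of ScrapeHub's built-in handling (the user agent string is just an example):

check_robots.py
python
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

def is_allowed_by_robots(url, user_agent="MyScraperBot"):
    """Check robots.txt before scraping a URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # Fetches and parses robots.txt
    return parser.can_fetch(user_agent, url)

url = "https://example.com/products"
if is_allowed_by_robots(url):
    result = client.scrape(url)
else:
    print("Skipping: disallowed by robots.txt")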

Handle Personal Data Carefully

If scraping personal information, ensure compliance with GDPR, CCPA, and other privacy regulations.
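
What compliance requires depends on your jurisdiction and use case, but a common first step is redacting obvious identifiers before storing scraped content. A rough sketch follows; the regex patterns catch only simple email and phone formats and are no substitute for a proper compliance review.

redact_pii.py
python
import re

# Illustrative patterns only; real PII detection needs more than regex
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
PHONE_RE = re.compile(r'\+?\d[\d\s().-]{7,}\d')

def redact_pii(text):
    """Mask email addresses and phone-like numbers in scraped text."""
    if not text:
        return ""
    text = EMAIL_RE.sub('[REDACTED EMAIL]', text)
    text = PHONE_RE.sub('[REDACTED PHONE]', text)
    return text

result = client.scrape("https://example.com")
safe_text = redact_pii(result.data.get('description', ''))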

Terms of Service

Always review and comply with website terms of service before scraping.

Security Checklist

Pre-Production Security Checklist

  • API keys stored in environment variables, not in code
  • .env and credential files added to .gitignore
  • Separate keys for development and production
  • HTTPS enforced for all API requests and webhooks
  • Webhook signature verification implemented
  • Input validation for URLs and scraped data
  • Rate limiting configured appropriately
  • Error handling doesn't expose sensitive information
  • Logging configured to exclude API keys and secrets
  • Regular API key rotation scheduled
  • IP whitelisting configured if applicable
  • Security monitoring and alerts set up

Incident Response

If Your API Key is Compromised

  1. Immediately revoke the compromised key in the ScrapeHub dashboard
  2. Generate a new API key with a different name
  3. Update your applications with the new key
  4. Review recent activity for unauthorized usage
  5. Check your billing for unexpected charges
  6. Contact support if you notice suspicious activity
  7. Audit your codebase to prevent future exposures

Report Security Issues

If you discover a security vulnerability in ScrapeHub, please email security@scrapehub.io immediately. Do not post security issues publicly.