Security Best Practices

Protect your account, data, and API keys by following the practices below

Security is Critical

Compromised API keys can lead to unauthorized access, data breaches, and unexpected charges. Following these security best practices is essential to protect your account and data.

API Key Security

Never Commit Keys to Version Control

Don't Do This

bad_example.py
python
# ❌ NEVER hardcode API keys
api_key = "sk_live_1234567890"  # This is BAD!

client = ScrapeHubClient(api_key=api_key)

Do This Instead

good_example.py
python
# ✅ Use environment variables
import os

api_key = os.getenv('SCRAPEHUB_API_KEY')
client = ScrapeHubClient(api_key=api_key)

Environment Variables

Store API keys in environment variables or secure credential managers:

.env
bash
# .env file (add to .gitignore!)
SCRAPEHUB_API_KEY=sk_live_xxxx_xxxx
SCRAPEHUB_WEBHOOK_SECRET=whsec_xxxx_xxxx

# Never commit this file to git!
.gitignore
bash
# Add to .gitignore
.env
.env.local
.env.*.local
*.key
secrets.json
credentials.json

Loading Environment Variables

load_env.py
python
# Python with python-dotenv
from dotenv import load_dotenv
import os

load_dotenv()  # Load from .env file
api_key = os.getenv('SCRAPEHUB_API_KEY')

if not api_key:
    raise ValueError("API key not found in environment variables")

load_env.js
javascript
// Node.js with dotenv
require('dotenv').config();

const apiKey = process.env.SCRAPEHUB_API_KEY;

if (!apiKey) {
  throw new Error('API key not found in environment variables');
}
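
If you keep keys in a managed credential store rather than a local .env file (the other option mentioned above), the loading step looks much the same. Below is a minimal sketch assuming AWS Secrets Manager via boto3; the secret name scrapehub/api-key is purely illustrative, and other stores (Vault, GCP Secret Manager, etc.) follow the same pattern.

load_secret.py
python
# Sketch: fetch the ScrapeHub API key from AWS Secrets Manager
# (an assumption for illustration, not a ScrapeHub feature)
import boto3

def load_api_key(secret_id="scrapehub/api-key"):  # secret name is hypothetical
    """Fetch the API key from a managed secret store."""
    secrets_client = boto3.client("secretsmanager")
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return response["SecretString"]

api_key = load_api_key()
if not api_key:
    raise ValueError("API key not found in secret store")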

API Key Management

Use Multiple Keys

Create separate API keys for development, staging, and production environments, as sketched below

Rotate Regularly

Rotate API keys every 90 days or immediately if compromised

Monitor Usage

Regularly review API key usage in your dashboard for suspicious activity

Revoke Unused Keys

Delete API keys that are no longer in use to minimize risk
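
One simple way to apply the first of these practices is to keep one key per environment and select it at startup. The sketch below assumes an APP_ENV variable and per-environment variable names; both are illustrative conventions, not part of the ScrapeHub SDK.

select_key.py
python
import os

# Hypothetical convention: one key per environment, selected by APP_ENV
ENV_KEY_VARS = {
    "development": "SCRAPEHUB_API_KEY_DEV",
    "staging": "SCRAPEHUB_API_KEY_STAGING",
    "production": "SCRAPEHUB_API_KEY_PROD",
}

def get_api_key():
    env = os.getenv("APP_ENV", "development")
    var_name = ENV_KEY_VARS.get(env)
    if not var_name:
        raise ValueError(f"Unknown environment: {env}")
    api_key = os.getenv(var_name)
    if not api_key:
        raise ValueError(f"{var_name} is not set")
    return api_key

client = ScrapeHubClient(api_key=get_api_key())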

Secure Data Handling

Sanitize Scraped Data

Always sanitize and validate scraped data before using it:

sanitize_data.py
python
import html
import re

def sanitize_text(text):
    """Remove potentially harmful content from scraped text"""
    if not text:
        return ""

    # HTML entity decode
    text = html.unescape(text)

    # Remove script tags and content
    text = re.sub(r'<script[^>]*>.*?</script>', '', text, flags=re.DOTALL | re.IGNORECASE)

    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)

    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Use sanitization
result = client.scrape("https://example.com")
clean_title = sanitize_text(result.data.get('title'))
clean_description = sanitize_text(result.data.get('description'))

Validate URLs

validate_urls.py
python
from urllib.parse import urlparse

def is_safe_url(url, allowed_domains=None):
    """Validate URL before scraping"""
    try:
        parsed = urlparse(url)

        # Check scheme
        if parsed.scheme not in ['http', 'https']:
            return False

        # Use hostname, not netloc: netloc may include a port or credentials
        hostname = parsed.hostname
        if not hostname:
            return False

        # Check domain whitelist
        if allowed_domains and hostname not in allowed_domains:
            return False

        # Avoid localhost and loopback addresses
        if hostname in ['localhost', '127.0.0.1', '0.0.0.0']:
            return False

        return True

    except Exception:
        return False

# Use validation
url = user_provided_url  # From user input
if is_safe_url(url, allowed_domains=['example.com', 'api.example.com']):
    result = client.scrape(url)
else:
    raise ValueError("Invalid or unsafe URL")

Encrypt Sensitive Data

encrypt_data.py
python
from cryptography.fernet import Fernet
import os

# Generate or load encryption key
encryption_key = os.getenv('ENCRYPTION_KEY')
if not encryption_key:
    encryption_key = Fernet.generate_key()
    print(f"Save this key securely: {encryption_key.decode()}")

cipher = Fernet(encryption_key)

# Encrypt sensitive scraped data
result = client.scrape("https://example.com")
sensitive_data = str(result.data).encode()
encrypted_data = cipher.encrypt(sensitive_data)

# Store encrypted_data in database

# Later, decrypt when needed
decrypted_data = cipher.decrypt(encrypted_data).decode()
print(decrypted_data)

Network Security

Use HTTPS

Always use HTTPS for API requests and webhook endpoints:

https_only.py
python
# Configure client to enforce HTTPS
client = ScrapeHubClient(
    api_key=api_key,
    enforce_https=True  # Reject non-HTTPS URLs
)

# Webhook endpoints must use HTTPS in production
webhook_url = "https://your-server.com/webhook"  # ✅ HTTPS
# webhook_url = "http://your-server.com/webhook"  # ❌ HTTP

IP Whitelisting

Restrict API access to specific IP addresses in the dashboard:

Terminal
# In ScrapeHub Dashboard → Settings → Security
# Add allowed IP addresses:
#   203.0.113.0/24  (your server IP range)
#   198.51.100.50   (your office IP)
# API requests from other IPs will be rejected

Rate Limiting

rate_limiting.py
python
from time import sleep
from datetime import datetime, timedelta

class RateLimiter:
    def __init__(self, max_requests, time_window):
        self.max_requests = max_requests
        self.time_window = time_window  # seconds
        self.requests = []

    def wait_if_needed(self):
        now = datetime.now()
        cutoff = now - timedelta(seconds=self.time_window)

        # Remove old requests
        self.requests = [r for r in self.requests if r > cutoff]

        if len(self.requests) >= self.max_requests:
            # Wait until oldest request expires
            sleep_time = (self.requests[0] - cutoff).total_seconds()
            sleep(sleep_time)
            self.requests.pop(0)

        self.requests.append(datetime.now())  # Record the actual request time (after any wait)

# Use rate limiter
limiter = RateLimiter(max_requests=10, time_window=60)

for url in urls:
    limiter.wait_if_needed()
    result = client.scrape(url)

Error Handling

Don't Expose Sensitive Info in Errors

safe_error_handling.py
python
import logging

# Configure logging (don't log sensitive data)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    result = client.scrape(url)
except Exception as e:
    # ❌ Bad: Exposes API key in logs
    # logger.error(f"Scraping failed with key {api_key}: {e}")

    # ✅ Good: Generic error message
    logger.error(f"Scraping failed for URL: {e}")

    # Don't expose internal errors to end users
    # ❌ Bad: raise e
    # ✅ Good: raise generic error
    raise Exception("Unable to fetch data. Please try again later.")

Webhook Security

Always Verify Signatures

webhook_security.py
python
import hmac
import hashlib
import logging
import os

from flask import Flask, request, abort

app = Flask(__name__)
logger = logging.getLogger(__name__)

WEBHOOK_SECRET = os.getenv('SCRAPEHUB_WEBHOOK_SECRET')
if not WEBHOOK_SECRET:
    raise RuntimeError("SCRAPEHUB_WEBHOOK_SECRET is not set")

@app.route('/webhook', methods=['POST'])
def webhook():
    # Get signature
    signature = request.headers.get('X-ScrapeHub-Signature')
    if not signature:
        logger.warning("Webhook request missing signature")
        abort(401)

    # Verify signature
    payload = request.get_data()
    expected = hmac.new(
        WEBHOOK_SECRET.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()

    if not hmac.compare_digest(signature, expected):
        logger.warning("Invalid webhook signature")
        abort(401)

    # Process webhook
    data = request.json
    process_webhook(data)

    return {'status': 'success'}, 200

Compliance & Privacy

Respect robots.txt

Check and honor robots.txt directives before scraping. ScrapeHub provides built-in robots.txt parsing.
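
If you also want an explicit check in your own code, Python's standard library can fetch and parse robots.txt. A minimal sketch, independent of ScrapeHub's built-in handling (the user agent string is just an example):

check_robots.py
python
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

def is_allowed_by_robots(url, user_agent="MyScraperBot"):
    """Check robots.txt before scraping a URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # Fetches and parses robots.txt
    return parser.can_fetch(user_agent, url)

url = "https://example.com/products"
if is_allowed_by_robots(url):
    result = client.scrape(url)
else:
    print("Skipping: disallowed by robots.txt")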

Handle Personal Data Carefully

If scraping personal information, ensure compliance with GDPR, CCPA, and other privacy regulations.
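
What compliance requires depends on your jurisdiction and use case, but a common first step is redacting obvious identifiers before storing scraped content. A rough sketch follows; the regex patterns catch only simple email and phone formats and are no substitute for a proper compliance review.

redact_pii.py
python
import re

# Illustrative patterns only; real PII detection needs more than regex
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
PHONE_RE = re.compile(r'\+?\d[\d\s().-]{7,}\d')

def redact_pii(text):
    """Mask email addresses and phone-like numbers in scraped text."""
    if not text:
        return ""
    text = EMAIL_RE.sub('[REDACTED EMAIL]', text)
    text = PHONE_RE.sub('[REDACTED PHONE]', text)
    return text

result = client.scrape("https://example.com")
safe_text = redact_pii(result.data.get('description', ''))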

Terms of Service

Always review and comply with website terms of service before scraping.

Security Checklist

Pre-Production Security Checklist

  • API keys stored in environment variables, not in code
  • .env and credential files added to .gitignore
  • Separate keys for development and production
  • HTTPS enforced for all API requests and webhooks
  • Webhook signature verification implemented
  • Input validation for URLs and scraped data
  • Rate limiting configured appropriately
  • Error handling doesn't expose sensitive information
  • Logging configured to exclude API keys and secrets
  • Regular API key rotation scheduled
  • IP whitelisting configured if applicable
  • Security monitoring and alerts set up

Incident Response

If Your API Key is Compromised

  1. Immediately revoke the compromised key in the ScrapeHub dashboard
  2. Generate a new API key with a different name
  3. Update your applications with the new key
  4. Review recent activity for unauthorized usage
  5. Check your billing for unexpected charges
  6. Contact support if you notice suspicious activity
  7. Audit your codebase to prevent future exposures

Report Security Issues

If you discover a security vulnerability in ScrapeHub, please email security@scrapehub.io immediately. Do not post security issues publicly.