Compliance

Understand legal and regulatory compliance requirements

ScrapeHub Compliance Commitment

ScrapeHub is committed to maintaining the highest standards of data protection and regulatory compliance. We continuously monitor and adapt to evolving regulations to ensure our platform meets global compliance requirements.

GDPR Compliance

The General Data Protection Regulation (GDPR) governs the processing of personal data of individuals in the European Union. ScrapeHub helps you stay compliant when scraping and processing personal information.

Key GDPR Principles

Data Minimization

Only collect and process data that is necessary for your specific purpose

Purpose Limitation

Use collected data only for the purposes you specified when collecting it

Data Accuracy

Ensure personal data is accurate, up-to-date, and corrected when necessary

Storage Limitation

Retain personal data only for as long as necessary for processing purposes

Implementing GDPR-Compliant Scraping

gdpr_compliant_scraping.py
python
from scrapehub import ScrapeHubClient
import hashlib
from datetime import datetime, timedelta

client = ScrapeHubClient(api_key="your_api_key")

# Configure data minimization
result = client.scrape(
    url="https://example.com",
    selectors={
        # Only extract necessary fields
        "product_name": ".product-title",
        "price": ".product-price",
        # Don't extract personal data unless required
        # "user_email": ".user-email"  # ❌ Avoid unless necessary
    }
)

# Pseudonymize personal data if collection is required
def pseudonymize(data):
    """Hash personal identifiers for GDPR compliance"""
    if 'email' in data:
        data['email_hash'] = hashlib.sha256(
            data['email'].encode()
        ).hexdigest()
        del data['email']  # Remove original email
    return data

# Apply retention policy
retention_period = timedelta(days=90)
data_expiry = datetime.now() + retention_period

# Store with expiry metadata
scraped_data = {
    'data': result.data,
    'collected_at': datetime.now().isoformat(),
    'expires_at': data_expiry.isoformat(),
    'purpose': 'price_monitoring'  # Document purpose
}

print(f"Data will be deleted after: {data_expiry}")

Data Subject Rights

GDPR grants individuals specific rights over their personal data. Ensure your systems can support the rights below (a minimal handler sketch follows the list):

Right to Access

Individuals can request copies of their personal data you hold

Right to Erasure

Individuals can request deletion of their personal data ("right to be forgotten")

Right to Rectification

Individuals can request corrections to inaccurate personal data

Right to Data Portability

Individuals can request their data in a machine-readable format
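
A minimal sketch of how these rights might map onto handler functions. The in-memory store, record shape, and function names here are hypothetical; a real implementation would query your own datastore.

dsr_handlers.py
python
import json

# Hypothetical store: pseudonymized subject ID -> list of records
store = {}

def handle_access(subject_id):
    """Right to access: return copies of all data held on the subject."""
    return store.get(subject_id, [])

def handle_erasure(subject_id):
    """Right to erasure: delete all data held on the subject."""
    store.pop(subject_id, None)

def handle_rectification(subject_id, field, corrected_value):
    """Right to rectification: correct an inaccurate field."""
    for record in store.get(subject_id, []):
        if field in record:
            record[field] = corrected_value

def handle_portability(subject_id):
    """Right to data portability: export in a machine-readable format."""
    return json.dumps(store.get(subject_id, []), indent=2)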

CCPA Compliance

The California Consumer Privacy Act (CCPA) provides privacy rights and consumer protection for California residents.

CCPA Requirements

ccpa_compliance.py
python
from datetime import datetime
from scrapehub import ScrapeHubClient

client = ScrapeHubClient(api_key="your_api_key")

class CCPACompliantScraper:
    def __init__(self):
        self.data_collection_log = []

    def scrape_with_notice(self, url, purpose):
        """Scrape with documented purpose (CCPA notice requirement)"""

        # Log collection activity
        self.data_collection_log.append({
            'url': url,
            'timestamp': datetime.now().isoformat(),
            'purpose': purpose,
            'categories': ['commercial_info', 'online_identifiers']
        })

        result = client.scrape(url)
        return result

    def opt_out_handler(self, user_id):
        """Handle California residents' opt-out requests"""
        # Delete all data associated with user
        # Stop future data collection for this user
        print(f"Processing opt-out for user: {user_id}")
        # Implementation depends on your database

    def provide_data_access(self, user_id):
        """Provide users access to their collected data"""
        # Return all data collected about the user
        # Must be provided within 45 days of request
        user_data = self.get_user_data(user_id)
        return {
            'data': user_data,
            'categories': ['personal_info', 'commercial_info'],
            'sources': ['web_scraping'],
            'purposes': ['analytics', 'price_monitoring']
        }

    def get_user_data(self, user_id):
        """Placeholder: look up this user's records in your own datastore"""
        return []

scraper = CCPACompliantScraper()
result = scraper.scrape_with_notice(
    "https://example.com",
    purpose="competitive_price_analysis"
)

robots.txt Compliance

Respecting robots.txt is a fundamental ethical requirement for web scraping, and in some jurisdictions ignoring it can carry legal consequences.

Automatic robots.txt Checking

robots_txt.py
python
from scrapehub import ScrapeHubClient

client = ScrapeHubClient(
    api_key="your_api_key",
    respect_robots_txt=True  # Enable automatic checking
)

# ScrapeHub will automatically check robots.txt
# and reject requests to disallowed paths
try:
    result = client.scrape("https://example.com/admin")
except client.RobotsTxtError as e:
    print(f"Scraping blocked by robots.txt: {e}")
    # Handle appropriately

# Check robots.txt manually
robots_info = client.check_robots_txt("https://example.com")
print(f"Can scrape: {robots_info.can_scrape}")
print(f"Crawl delay: {robots_info.crawl_delay} seconds")
print(f"Disallowed paths: {robots_info.disallowed_paths}")

Understanding robots.txt

robots.txt
text
# Example robots.txt file
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 10

User-agent: ScrapeHubBot
Allow: /public-data/
Crawl-delay: 5

# Always respect these directives
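
If you want to verify directives yourself outside of ScrapeHub, Python's standard library can parse robots.txt directly. A minimal sketch evaluating the example file above with urllib.robotparser:

parse_robots.py
python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 10

User-agent: ScrapeHubBot
Allow: /public-data/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # Parse the example file above

print(parser.can_fetch("ScrapeHubBot", "/public-data/"))  # True
print(parser.can_fetch("*", "/admin/"))                   # False
print(parser.crawl_delay("ScrapeHubBot"))                 # 5 (seconds)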

Terms of Service Compliance

Review Before Scraping

  • Always read the target website's Terms of Service before scraping
  • Some websites explicitly prohibit automated data collection
  • Respect rate limits and crawl delays specified by websites (see the throttling sketch after this list)
  • Identify your scraper with a proper User-Agent string
  • Consider reaching out to site owners for permission or API access
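
One practical way to respect crawl delays is to sleep between requests based on the value the site advertises. A minimal sketch using the check_robots_txt call shown earlier; the URL list and the one-second fallback are assumptions:

throttled_scraping.py
python
import time
from scrapehub import ScrapeHubClient

client = ScrapeHubClient(api_key="your_api_key")

urls = [  # Hypothetical targets on the same site
    "https://example.com/page1",
    "https://example.com/page2",
]

robots_info = client.check_robots_txt("https://example.com")
delay = robots_info.crawl_delay or 1  # Fall back to a polite one-second pause

for url in urls:
    result = client.scrape(url)
    time.sleep(delay)  # Honor the site's advertised crawl delay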

Setting a Proper User-Agent

user_agent.py
python
from scrapehub import ScrapeHubClient

client = ScrapeHubClient(
    api_key="your_api_key",
    user_agent="MyCompanyBot/1.0 (+https://mycompany.com/bot)"
)

# This helps website owners:
# - Identify your scraper
# - Contact you if needed
# - Apply appropriate rate limits

result = client.scrape(
    url="https://example.com",
    headers={
        "User-Agent": "MyCompanyBot/1.0 (+https://mycompany.com/bot)",
        "From": "bot@mycompany.com"  # Contact email
    }
)

Data Protection Certifications

SOC 2 Type II

ScrapeHub maintains a SOC 2 Type II attestation for security, availability, and confidentiality

ISO 27001

Certified for information security management system standards

GDPR Compliant

Full compliance with the EU General Data Protection Regulation

CCPA Compliant

Adheres to California Consumer Privacy Act requirements

Industry-Specific Compliance

Healthcare (HIPAA)

If scraping healthcare-related data, ensure HIPAA compliance:

Do not scrape Protected Health Information (PHI) without proper authorization

Implement encryption for PHI at rest and in transit

Maintain audit logs of all PHI access and processing
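
As a rough illustration of the second and third points, the sketch below encrypts a PHI field with the cryptography package and writes an audit log entry on every access. The record IDs and file paths are hypothetical, and real HIPAA controls also require key management, access controls, and much more:

phi_protection.py
python
import logging
from cryptography.fernet import Fernet

# Audit log: record every store and access event for PHI
logging.basicConfig(filename="phi_audit.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

key = Fernet.generate_key()  # In production, load from a key management service
fernet = Fernet(key)

def store_phi(record_id, phi_value):
    """Encrypt a PHI field before it is written to storage."""
    logging.info("PHI stored: record=%s", record_id)
    return fernet.encrypt(phi_value.encode())

def read_phi(record_id, token, accessed_by):
    """Decrypt a PHI field and log who accessed it."""
    logging.info("PHI accessed: record=%s by=%s", record_id, accessed_by)
    return fernet.decrypt(token).decode()

token = store_phi("rec-001", "example diagnosis text")
print(read_phi("rec-001", token, accessed_by="analyst@mycompany.com"))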

Financial Services (PCI DSS)

Never Scrape Payment Card Data

  • Do not extract credit card numbers, CVV codes, or cardholder data
  • Collecting payment information by scraping is illegal in most jurisdictions
  • PCI DSS compliance requires strict controls that scraping cannot meet

Compliance Best Practices

Compliance Checklist

  • Document the purpose and legal basis for data collection
  • Implement data retention and deletion policies
  • Maintain records of processing activities (see the register sketch after this checklist)
  • Conduct regular compliance audits
  • Provide clear privacy notices to data subjects
  • Establish processes for data subject rights requests
  • Train team members on compliance requirements
  • Review and update compliance practices regularly
  • Consult with legal counsel for specific use cases
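
A record of processing activities (GDPR Article 30) can start small. Below is a minimal sketch of an append-only register; the JSON-lines format and field names are assumptions:

processing_register.py
python
import json
from datetime import datetime

def log_processing_activity(path, purpose, categories, legal_basis, retention):
    """Append one processing activity to a JSON-lines register."""
    entry = {
        'recorded_at': datetime.now().isoformat(),
        'purpose': purpose,            # Why the data is processed
        'data_categories': categories, # What kinds of data are involved
        'legal_basis': legal_basis,    # e.g. legitimate interest, consent
        'retention': retention,        # How long the data is kept
    }
    with open(path, 'a') as f:
        f.write(json.dumps(entry) + '\n')

log_processing_activity(
    'processing_register.jsonl',
    purpose='price_monitoring',
    categories=['commercial_info'],
    legal_basis='legitimate_interest',
    retention='90 days',
)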

Data Processing Agreements

ScrapeHub acts as a data processor when you use our service. We provide standard Data Processing Agreements (DPAs) for GDPR compliance.

Terminal
# Request a DPA
# 1. Log in to ScrapeHub Dashboard
# 2. Navigate to Settings → Legal & Compliance
# 3. Download the standard DPA
# 4. For custom DPAs, contact enterprise@scrapehub.io