Overview

The Observatorio scrapes legislative data from two chambers, each requiring a fundamentally different approach. The Chamber of Deputies (Diputados) exposes an open data portal with no anti-bot protection, so a straightforward HTTP client suffices. The Senate (Senado) sits behind Imperva’s Incapsula Web Application Firewall (WAF), which detects and blocks automated requests via TLS fingerprinting. Getting data out of the Senado required building a specialized anti-WAF client.

Two chambers, two stacks:

| Chamber | Client | Difficulty | Reason |
|---|---|---|---|
| Diputados | httpx | Low | Open data portal, no protection |
| Senado | curl_cffi | High | Incapsula WAF with TLS fingerprinting |

Data Sources

| Source | Chamber | Base URL | Method | Volume |
|---|---|---|---|---|
| Datos Abiertos | Diputados | datos.abiertos.diputados.gob.mx | httpx + delay | ~4,600 vote events |
| Portal LXVI | Senado | senado.gob.mx/66/ | curl_cffi + TLS impersonation | 5,047 vote events |
| Directorio XLS | Senado | Official XLS files | pandas read_excel | LVIII-LXV legislatures |

Diputados Scraper

The Diputados scraper targets the open data portal at datos.abiertos.diputados.gob.mx. No anti-bot protection is present, so the stack is minimal:

  • HTTP client: httpx with a configurable delay between requests (default 2.0 seconds)
  • Parser: BeautifulSoup for HTML parsing where JSON endpoints are unavailable
  • Data scope: voting records keyed by SITL IDs, legislator profiles, and party affiliations

The scraper pulls approximately 4,600 vote events. All data loads into a SQLite database with deduplication handled by the source_id field on each vote_event record.
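The pacing behavior can be sketched as a small client-agnostic helper. This is a minimal sketch, not the project's actual code: `fetch_paced` is a hypothetical name, and the `fetch` callable stands in for a bound httpx client method.

```python
import time
from collections.abc import Callable, Iterable


def fetch_paced(urls: Iterable[str], fetch: Callable[[str], str],
                delay: float = 2.0) -> list[str]:
    """Fetch each URL in order, sleeping `delay` seconds between requests.

    `fetch` is any callable that takes a URL and returns the body text;
    in the real scraper this role is played by an httpx client. The portal
    has no anti-bot protection, so the delay is pure politeness.
    """
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # throttle between requests, not before the first
        pages.append(fetch(url))
    return pages
```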

:::tip
The Diputados open data portal is well-structured and stable. If you are extending coverage to new data types, start here — the lack of anti-bot protection means faster iteration.
:::

Senado Scraper — The Anti-WAF Case Study

The Problem

The Senado portal at senado.gob.mx is protected by Incapsula (Imperva WAF). Standard HTTP clients — Python requests, httpx, even curl — get blocked immediately. The WAF detects automated traffic through three mechanisms:

  1. TLS fingerprinting: The JA3 hash of the TLS handshake identifies non-browser clients
  2. JavaScript challenges: Incapsula serves JS that real browsers execute automatically
  3. Behavioral analysis: Request patterns, timing, and cookie behavior are monitored

The first two iterations of the Senado scraper were blocked within minutes of starting a scrape run.

The Solution

The SenadoLXVIClient in scraper_congreso/senadores/client.py uses curl_cffi with TLS fingerprint impersonation. This library wraps libcurl-impersonate, which can reproduce the exact TLS handshake of real browsers — matching JA3 hashes, cipher suites, and extensions.

```python
class SenadoLXVIClient:
    _IMPERSONATE_TARGETS: tuple[BrowserTypeLiteral, ...] = (
        "chrome", "safari", "chrome116", "chrome131", "edge", "chrome_android",
    )

    MAX_REQUESTS_PER_SESSION: int = 10
    WAF_CONSECUTIVE_THRESHOLD = 2
```

Fingerprint Pool

Six browser impersonation targets rotate across sessions. Each target presents a distinct JA3 hash to the WAF:

| Target | Profile |
|---|---|
| chrome | Latest Chrome desktop |
| safari | Safari desktop |
| chrome116 | Chrome 116 desktop |
| chrome131 | Chrome 131 desktop |
| edge | Edge desktop |
| chrome_android | Chrome mobile |

Session Management Strategy

The client uses a layered session strategy designed to minimize WAF detection while recovering gracefully from blocks:

  1. Active session: Fixed fingerprint from the pool, shared persistent cookies across requests within the session
  2. WAF block detected: Close the session immediately, discard all cookies (burned cookies carry WAF flags)
  3. New session: Rotate to the next fingerprint from the pool, perform a warm-up GET request to populate fresh cookies before scraping
  4. Proactive rotation: Rotate the session every 10 requests (MAX_REQUESTS_PER_SESSION) before the WAF has a chance to flag the pattern
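The rotation bookkeeping can be sketched as a small helper. This is an illustrative sketch, not the project's code: `FingerprintRotator` and its method names are hypothetical, while the pool and the per-session limit mirror the constants shown in the class excerpt above.

```python
import itertools

IMPERSONATE_TARGETS = (
    "chrome", "safari", "chrome116", "chrome131", "edge", "chrome_android",
)
MAX_REQUESTS_PER_SESSION = 10


class FingerprintRotator:
    """Cycle through the impersonation pool, rotating proactively."""

    def __init__(self) -> None:
        self._pool = itertools.cycle(IMPERSONATE_TARGETS)
        self._requests_in_session = 0
        self.current = next(self._pool)

    def rotate(self) -> str:
        """Start a fresh session: next fingerprint, zeroed request count.

        The caller is responsible for discarding the old cookies and
        issuing a warm-up request before real scraping resumes.
        """
        self.current = next(self._pool)
        self._requests_in_session = 0
        return self.current

    def record_request(self) -> bool:
        """Count one request; return True when the session should rotate."""
        self._requests_in_session += 1
        return self._requests_in_session >= MAX_REQUESTS_PER_SESSION
```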

:::caution
Proactive rotation is critical. Waiting for the WAF to block you before rotating means the new session starts with elevated scrutiny. Rotating early keeps all sessions under the radar.
:::

Circuit Breaker

A circuit breaker tracks consecutive WAF blocks. After WAF_CONSECUTIVE_THRESHOLD (2) consecutive blocks, the session is declared burned. The client raises SessionBurnedError, forces a mandatory pause, and must be restarted with a fresh session.

This prevents the scraper from hammering the WAF with requests that will never succeed.
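The breaker logic amounts to a counter that resets on success. A minimal sketch, assuming a standalone class (`WafCircuitBreaker` is a hypothetical name; `SessionBurnedError` and the threshold of 2 come from the text):

```python
WAF_CONSECUTIVE_THRESHOLD = 2


class SessionBurnedError(RuntimeError):
    """Raised when consecutive WAF blocks reach the threshold."""


class WafCircuitBreaker:
    def __init__(self, threshold: int = WAF_CONSECUTIVE_THRESHOLD) -> None:
        self.threshold = threshold
        self.consecutive_blocks = 0

    def record_success(self) -> None:
        self.consecutive_blocks = 0  # any successful response resets the count

    def record_block(self) -> None:
        self.consecutive_blocks += 1
        if self.consecutive_blocks >= self.threshold:
            raise SessionBurnedError(
                f"{self.consecutive_blocks} consecutive WAF blocks; "
                "pause and restart with a fresh session"
            )
```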

Warm-up Procedure

After creating a new session, the client issues a dummy GET request to the portal before making any real data requests. This warm-up request allows Incapsula to set its challenge cookies. A cold session without cookies gets blocked far more aggressively than one that has already passed the initial JS challenge.
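The warm-up step is just one throwaway GET against the portal root. A sketch under stated assumptions: `warm_up` is a hypothetical helper, `session` is any object with a `.get(url)` method (the real client passes its curl_cffi session), and the URL is the portal base from the data-sources table.

```python
def warm_up(session, portal_url: str = "https://senado.gob.mx/66/") -> bool:
    """Issue one throwaway GET before any real data request.

    The response body is discarded; what matters are the Incapsula
    challenge cookies the request leaves on the session. Returns False
    if the warm-up itself was blocked, so the caller can rotate again.
    """
    response = session.get(portal_url)
    return response.status_code == 200
```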

Results

| Metric | Count |
|---|---|
| Vote events scraped | 5,047 |
| Senator profiles scraped | 1,754 |
| Iterations to get right | 3 |

Anti-WAF Strategy Diagram

```
Request → Check Cache
  ├─ Cache Hit → Return cached data
  └─ Cache Miss → Send via curl_cffi
       ├─ Response OK → Cache + Return
       └─ WAF Detected (Incapsula markers)
            ├─ Consecutive < 2 → New session, rotate fingerprint, warm-up, retry
            └─ Consecutive ≥ 2 → SessionBurnedError → Pause + restart
```

The cache layer is not optional. Every cached page is one fewer request to the Senado portal, which means one fewer opportunity for the WAF to detect and block the scraper. For repeated scrape runs, the cache dramatically reduces exposure.

:::tip
The cache has a configurable TTL (time-to-live) to prevent stale data from accumulating. Adjust the TTL based on how frequently the source data updates — longer TTLs for historical data, shorter for active legislatures.
:::

Data Quality and Processing

Deduplication

Each vote_event record carries a source_id field that maps back to the original identifier from the source portal. This enables idempotent scraping: running the scraper multiple times does not create duplicate records. The SQLite INSERT OR IGNORE pattern on source_id handles this at the database level.
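The pattern can be demonstrated with an in-memory database. The schema below is a simplified stand-in for the project's real table; only the `source_id` uniqueness constraint and the `INSERT OR IGNORE` statement are the point.

```python
import sqlite3

# In-memory database for illustration; the real scraper writes to a file.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE vote_event (
        id INTEGER PRIMARY KEY,
        source_id TEXT UNIQUE NOT NULL,  -- original portal identifier
        title TEXT
    )
""")


def upsert_vote_event(source_id: str, title: str) -> None:
    # INSERT OR IGNORE: a row with a duplicate source_id is silently
    # skipped, which makes repeated scrape runs idempotent.
    conn.execute(
        "INSERT OR IGNORE INTO vote_event (source_id, title) VALUES (?, ?)",
        (source_id, title),
    )
    conn.commit()


upsert_vote_event("example-1", "example vote")
upsert_vote_event("example-1", "example vote")  # second run: no duplicate
```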

Profile Enrichment

Legislator profiles are enriched with demographic and electoral data:

  • Gender: 480 female / 598 male (across all loaded records)
  • Seat type: MR (majority-relative) or PL (proportional-list)
  • Circunscripción: Electoral district assignment for PL seats

Party Normalization

The normalize_party() function maps the mixed vote.group values returned by the portals to canonical organization IDs. Raw party names from the source data are inconsistent — abbreviations vary, coalitions create compound names, and historical parties have multiple labels. Normalization collapses all variants to a single canonical ID.

Membership Resolution

Some legislators hold memberships in multiple parties over the course of their careers. The scraper resolves this by vote frequency: the legislator is assigned to the party under which they cast the most votes. This is a pragmatic heuristic — it correctly handles party switches and expulsions without requiring manual disambiguation.
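The heuristic reduces to a frequency count per legislator. A minimal sketch (function name and input shape are assumptions, not the project's API):

```python
from collections import Counter


def resolve_primary_party(votes: list[tuple[str, str]]) -> dict[str, str]:
    """Assign each legislator to the party under which they voted most often.

    `votes` is a list of (legislator_id, party_id) pairs, one per vote cast.
    """
    counts: dict[str, Counter] = {}
    for legislator, party in votes:
        counts.setdefault(legislator, Counter())[party] += 1
    # most_common(1) picks the party with the highest vote count
    return {leg: c.most_common(1)[0][0] for leg, c in counts.items()}
```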

Lessons Learned

  1. TLS fingerprinting is the primary bot detection mechanism for WAFs like Incapsula. Headers and user-agent strings are easy to spoof; the JA3 hash of the TLS handshake is not. Libraries like curl_cffi that can impersonate real browser TLS stacks are essential.

  2. Proactive rotation beats reactive rotation. Rotating sessions before the WAF detects a pattern is far more effective than rotating after a block. The 10-request limit per session is conservative but reliable.

  3. Cookie management matters. Burned cookies carry WAF flags. Discarding them entirely and starting fresh is better than trying to “fix” a flagged session.

  4. Warm-up requests are essential. A cold session without Incapsula challenge cookies gets blocked on the first real request. The warm-up GET populates the necessary cookies.

  5. Caching reduces exposure. Each cached page is one fewer request to the portal. For a scraper operating behind a WAF, minimizing total requests is a survival strategy, not just a performance optimization.

  6. The Senado scraper took three iterations to get right. The first two were blocked within minutes. Iteration three introduced curl_cffi, fingerprint rotation, and proactive session management — and has been running reliably since.

Build System and Entry Points

The project uses hatchling as its build system via pyproject.toml:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```

Each scraper is executable as a Python module:

```bash
# Diputados scraper
python -m scraper_congreso.diputados --leg LXVI --all-periods

# Senado voting records
python -m scraper_congreso.senadores.votaciones --range 1 5070 --delay 2.0

# Senado profiles
python -m scraper_congreso.senadores.perfiles
```

Key Dependencies

| Package | Version | Purpose |
|---|---|---|
| curl_cffi | >= 0.15.0 | TLS fingerprint impersonation (Senado anti-WAF) |
| httpx | >= 0.27 | HTTP client for open data portals (Diputados) |
| beautifulsoup4 | >= 4.12 | HTML parsing |
| lxml | >= 5.0 | Fast XML/HTML processing |
| pydantic | >= 2.5 | Data model validation |

Logging

scraper_congreso/utils/logging_config.py provides centralized logging configuration for the entire scraping package. All modules use the standard logging.getLogger(__name__) pattern, ensuring consistent log formatting and configurable log levels across the project.

For analysis runners, analysis/runner_utils.py provides a setup_logging() utility that configures logging with the same conventions.
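The convention can be sketched in a few lines. This is a generic illustration of the pattern described, not the project's actual `setup_logging()` (the format string in particular is an assumption):

```python
import logging


def setup_logging(level: int = logging.INFO) -> None:
    """Configure root logging once; modules then use getLogger(__name__)."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)-8s %(name)s: %(message)s",
    )


# Each module obtains its own named logger, inheriting the root config.
logger = logging.getLogger("scraper_congreso.example")
```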