Scraping & Data Collection

The Observatorio scrapes legislative data from two chambers, each requiring a fundamentally different approach. The Chamber of Deputies (Diputados) exposes an open data portal with no anti-bot protection, so a straightforward HTTP client suffices. The Senate (Senado) sits behind Imperva’s Incapsula Web Application Firewall (WAF), which detects and blocks automated requests via TLS fingerprinting. Getting data out of the Senado required building a specialized anti-WAF client.

Two chambers, two stacks:

| Chamber | Client | Difficulty | Reason |
| --- | --- | --- | --- |
| Diputados | httpx | Low | Open data portal, no protection |
| Senado | curl_cffi | High | Incapsula WAF with TLS fingerprinting |

| Source | Chamber | Base URL | Method | Volume |
| --- | --- | --- | --- | --- |
| Datos Abiertos | Diputados | datos.abiertos.diputados.gob.mx | httpx + delay | ~4,600 vote events |
| Portal LXVI | Senado | senado.gob.mx/66/ | curl_cffi + TLS impersonation | 5,047 vote events |
| Directorio XLS | Senado | Official XLS files | pandas read_excel | LVIII–LXV legislatures |

The Diputados scraper targets the open data portal at datos.abiertos.diputados.gob.mx. No anti-bot protection is present, so the stack is minimal:

  • HTTP client: httpx with a configurable delay between requests (default 2.0 seconds)
  • Parser: BeautifulSoup for HTML parsing where JSON endpoints are unavailable
  • Data scope: voting records keyed by SITL IDs, legislator profiles, and party affiliations

The scraper pulls approximately 4,600 vote events. All data loads into a SQLite database with deduplication handled by the source_id field on each vote_event record.

Senado Scraper — The Anti-WAF Case Study

The Senado portal at senado.gob.mx is protected by Incapsula (Imperva WAF). Standard HTTP clients — Python requests, httpx, even curl — get blocked immediately. The WAF detects automated traffic through three mechanisms:

  1. TLS fingerprinting: The JA3 hash of the TLS handshake identifies non-browser clients
  2. JavaScript challenges: Incapsula serves JS that real browsers execute automatically
  3. Behavioral analysis: Request patterns, timing, and cookie behavior are monitored

The first two iterations of the Senado scraper were blocked within minutes of starting a scrape run.

The SenadoLXVIClient in senado/scrapers/shared/client.py uses curl_cffi with TLS fingerprint impersonation. This library wraps libcurl-impersonate, which can reproduce the exact TLS handshake of real browsers — matching JA3 hashes, cipher suites, and extensions.

class SenadoLXVIClient:
    _IMPERSONATE_TARGETS: tuple[BrowserTypeLiteral, ...] = (
        "chrome", "safari", "chrome116", "chrome131", "edge", "chrome_android",
    )
    MAX_REQUESTS_PER_SESSION: int = 10
    WAF_CONSECUTIVE_THRESHOLD: int = 2

Six browser impersonation targets rotate across sessions. Each target presents a distinct JA3 hash to the WAF:

| Target | Profile |
| --- | --- |
| chrome | Latest Chrome desktop |
| safari | Safari desktop |
| chrome116 | Chrome 116 desktop |
| chrome131 | Chrome 131 desktop |
| edge | Edge desktop |
| chrome_android | Chrome mobile |
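The rotation itself reduces to a round-robin over the target pool. The helper name below is hypothetical; the real client rotates on session boundaries rather than per call:

```python
from itertools import cycle

_IMPERSONATE_TARGETS = (
    "chrome", "safari", "chrome116", "chrome131", "edge", "chrome_android",
)

# Round-robin iterator over the fingerprint pool; each new session pulls
# the next target, presenting a different JA3 hash to the WAF.
_rotation = cycle(_IMPERSONATE_TARGETS)


def next_fingerprint() -> str:
    return next(_rotation)
```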

The client uses a layered session strategy designed to minimize WAF detection while recovering gracefully from blocks:

  1. Active session: Fixed fingerprint from the pool, shared persistent cookies across requests within the session
  2. WAF block detected: Close the session immediately, discard all cookies (burned cookies carry WAF flags)
  3. New session: Rotate to the next fingerprint from the pool, perform a warm-up GET request to populate fresh cookies before scraping
  4. Proactive rotation: Rotate the session every 10 requests (MAX_REQUESTS_PER_SESSION) before the WAF has a chance to flag the pattern
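The four steps above can be sketched as a small state holder. All names here are hypothetical; only the request limit of 10 and the six targets come from the class shown earlier:

```python
class SessionRotator:
    """Sketch of the layered session strategy."""

    MAX_REQUESTS_PER_SESSION = 10
    TARGETS = ("chrome", "safari", "chrome116", "chrome131", "edge", "chrome_android")

    def __init__(self) -> None:
        self._target_index = 0
        self._request_count = 0
        self.cookies: dict[str, str] = {}

    @property
    def fingerprint(self) -> str:
        return self.TARGETS[self._target_index % len(self.TARGETS)]

    def before_request(self) -> None:
        # Proactive rotation: rotate every N requests, before the WAF
        # has a chance to flag the traffic pattern.
        if self._request_count >= self.MAX_REQUESTS_PER_SESSION:
            self.rotate(discard_cookies=False)
        self._request_count += 1

    def on_waf_block(self) -> None:
        # Burned cookies carry WAF flags: discard everything and rotate.
        self.rotate(discard_cookies=True)

    def rotate(self, discard_cookies: bool) -> None:
        self._target_index += 1
        self._request_count = 0
        if discard_cookies:
            self.cookies.clear()
```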

A circuit breaker tracks consecutive WAF blocks. After WAF_CONSECUTIVE_THRESHOLD (2) consecutive blocks, the session is declared burned. The client raises SessionBurnedError, forces a mandatory pause, and must be restarted with a fresh session.

This prevents the scraper from hammering the WAF with requests that will never succeed.
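A minimal version of that circuit breaker, with hypothetical method names around the documented SessionBurnedError and threshold of 2:

```python
class SessionBurnedError(RuntimeError):
    """Raised after too many consecutive WAF blocks."""


class CircuitBreaker:
    THRESHOLD = 2  # WAF_CONSECUTIVE_THRESHOLD

    def __init__(self) -> None:
        self.consecutive_blocks = 0

    def record_success(self) -> None:
        # Any successful response resets the counter.
        self.consecutive_blocks = 0

    def record_block(self) -> None:
        self.consecutive_blocks += 1
        if self.consecutive_blocks >= self.THRESHOLD:
            raise SessionBurnedError("session burned; pause and restart")
```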

After creating a new session, the client issues a dummy GET request to the portal before making any real data requests. This warm-up request allows Incapsula to set its challenge cookies. A cold session without cookies gets blocked far more aggressively than one that has already passed the initial JS challenge.
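The warm-up step amounts to a single throwaway GET; the sketch below uses a hypothetical helper and a generic session object:

```python
def warm_up(session, base_url: str) -> None:
    """Hypothetical helper: one throwaway GET so Incapsula can set its
    challenge cookies before any real data request goes out."""
    session.get(base_url)  # body ignored; only the Set-Cookie side effect matters
```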

| Metric | Count |
| --- | --- |
| Vote events scraped | 5,047 |
| Senator profiles scraped | 1,754 |
| Iterations to get right | 3 |

Request → Check Cache
├─ Cache Hit → Return cached data
└─ Cache Miss → Send via curl_cffi
   ├─ Response OK → Cache + Return
   └─ WAF Detected (Incapsula markers)
      ├─ Consecutive < 2 → New session, rotate fingerprint, warm-up, retry
      └─ Consecutive ≥ 2 → SessionBurnedError → Pause + restart

The cache layer is not optional. Every cached page is one fewer request to the Senado portal, which means one fewer opportunity for the WAF to detect and block the scraper. For repeated scrape runs, the cache dramatically reduces exposure.

Each vote_event record carries a source_id field that maps back to the original identifier from the source portal. This enables idempotent scraping: running the scraper multiple times does not create duplicate records. The SQLite INSERT OR IGNORE pattern on source_id handles this at the database level.
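The pattern can be demonstrated against an in-memory SQLite database, with the schema simplified to the fields the text mentions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE vote_event (
        id INTEGER PRIMARY KEY,
        source_id TEXT UNIQUE,  -- original identifier from the source portal
        title TEXT
    )
""")


def upsert_vote_event(source_id: str, title: str) -> None:
    # INSERT OR IGNORE: a duplicate source_id is silently skipped, so
    # re-running the scraper never creates duplicate records.
    conn.execute(
        "INSERT OR IGNORE INTO vote_event (source_id, title) VALUES (?, ?)",
        (source_id, title),
    )


upsert_vote_event("senado-66-0001", "Dictamen A")
upsert_vote_event("senado-66-0001", "Dictamen A")  # second run: ignored
count = conn.execute("SELECT COUNT(*) FROM vote_event").fetchone()[0]  # 1
```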

Legislator profiles are enriched with demographic and electoral data:

  • Gender: 480 female / 598 male (across all loaded records)
  • Seat type: MR (relative majority) or PL (proportional list)
  • Circunscripción: Electoral district assignment for PL seats

The normalize_party() function maps the mixed vote.group values returned by the portals to canonical organization IDs. Raw party names from the source data are inconsistent — abbreviations vary, coalitions create compound names, and historical parties have multiple labels. Normalization collapses all variants to a single canonical ID.

Some legislators have multi-party memberships across their career. The scraper resolves this by vote frequency: the legislator is assigned to the party where they cast the most votes. This is a pragmatic heuristic — it correctly handles party switches and expulsions without requiring manual disambiguation.
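The heuristic reduces to a frequency count over the party attached to each vote (helper name hypothetical):

```python
from collections import Counter


def resolve_primary_party(vote_parties: list[str]) -> str:
    """Assign a legislator to the party under which they cast the most votes."""
    return Counter(vote_parties).most_common(1)[0][0]


# A legislator who switched from "pri" to "morena" mid-career:
# resolve_primary_party(["pri"] * 40 + ["morena"] * 260) → "morena"
```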

  1. TLS fingerprinting is the primary bot detection mechanism for WAFs like Incapsula. Headers and user-agent strings are easy to spoof; the JA3 hash of the TLS handshake is not. Libraries like curl_cffi that can impersonate real browser TLS stacks are essential.

  2. Proactive rotation beats reactive rotation. Rotating sessions before the WAF detects a pattern is far more effective than rotating after a block. The 10-request limit per session is conservative but reliable.

  3. Cookie management matters. Burned cookies carry WAF flags. Discarding them entirely and starting fresh is better than trying to “fix” a flagged session.

  4. Warm-up requests are essential. A cold session without Incapsula challenge cookies gets blocked on the first real request. The warm-up GET populates the necessary cookies.

  5. Caching reduces exposure. Each cached page is one fewer request to the portal. For a scraper operating behind a WAF, minimizing total requests is a survival strategy, not just a performance optimization.

  6. The Senado scraper took three iterations to get right. The first two were blocked within minutes. Iteration three introduced curl_cffi, fingerprint rotation, and proactive session management — and has been running reliably since.