Senatorial extractor architecture
camara-senadores-mex is a standalone Scrapy extractor that turns pages from the Mexican Senate into a local SQLite database. Its responsibility ends at downloading, parsing, and persisting available roll-call votes and profiles; it does not try to resolve Popolo representation or artificially fill historical gaps.
The architecture is organized around a hostile institutional source: the portal publishes useful content, but it does not expose it as a stable open-data API.
    Senate portal
    senado.gob.mx/66/
        │
        ├── /66/votacion/{id}
        │       │
        │       └── POST AJAX viewTableVot.php
        │
        └── /66/senador/{id}
        │
        ▼
    Scrapy + scrapy-impersonate
        │
        ▼
    temporal / vote / profile parsing
        │
        ▼
    local SQLite: senado.db
Layers
| Layer | Role | Evidence produced |
|---|---|---|
| Institutional source | HTML pages and AJAX view under /66/. | Vote HTML, AJAX fragments, profile pages. |
| Scrapy client | Traverses vote or profile IDs and preserves request context. | Responses associated with vote_id or senador_id. |
| Anti-WAF mitigation | Uses scrapy-impersonate / curl_cffi to present a browser TLS fingerprint. | Requests with browser impersonation. |
| Parsing | Extracts temporal metadata, roll-call votes, and available profiles. | VotacionItem, VotoNominalItem, SenadorItem items. |
| Persistence | Inserts/upserts into local SQLite. | Tables for voting events, roll-call votes, and senators. |
| Validation | Reads the database and counts anomalies without rewriting it. | Auditable metrics and warnings. |
Vote flow
For each vote, the spider starts from the HTML page:
https://www.senado.gob.mx/66/votacion/{id}
That first HTML response is used to recover navigation context, cookies, and temporal metadata when present. The spider does not rely on the initial HTML to contain the complete roll-call table: the current code always makes a second request to the AJAX endpoint to obtain the roll-call votes.
The current operational endpoint is:
POST https://www.senado.gob.mx/66/app/votaciones/functions/viewTableVot.php
with an application/x-www-form-urlencoded body equivalent to:
action=ajax&cell=1&order=DESC&votacion={id}&q=
and relevant headers:
Content-Type: application/x-www-form-urlencoded
X-Requested-With: XMLHttpRequest
Referer: <url of /66/votacion/{id}>
This sets an important boundary: in the version documented here, viewTableVot.php must not be described as a GET endpoint. Historical references to GET describe earlier exploration or older implementations, not the current operational contract.
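The request described above can be sketched as plain data, using only the standard library. This is an illustration, not the spider's actual code: the function name build_ajax_request and the vote_id meta key are hypothetical, and the resulting dictionary would still need to be handed to whatever request class the spider uses.

```python
from urllib.parse import urlencode

BASE = "https://www.senado.gob.mx/66"
AJAX_URL = f"{BASE}/app/votaciones/functions/viewTableVot.php"


def build_ajax_request(vote_id: int) -> dict:
    """Describe the second (POST) request for one vote as plain data.

    Hypothetical helper: it only assembles the method, URL, headers, and
    body listed in this section; it does not send anything.
    """
    page_url = f"{BASE}/votacion/{vote_id}"
    body = urlencode(
        {
            "action": "ajax",
            "cell": "1",
            "order": "DESC",
            "votacion": vote_id,
            "q": "",  # empty search term, still sent as "q="
        }
    )
    return {
        "method": "POST",
        "url": AJAX_URL,
        "headers": {
            "Content-Type": "application/x-www-form-urlencoded",
            "X-Requested-With": "XMLHttpRequest",
            "Referer": page_url,  # the HTML page the spider started from
        },
        "body": body,
        # "impersonate" comes from this document; "vote_id" is an assumed key
        # for carrying request context through to the parser.
        "meta": {"impersonate": "chrome131", "vote_id": vote_id},
    }
```

Keeping the request as data makes the POST contract easy to assert in a unit test without touching the network.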
Profile flow
Profiles are obtained from IDs detected in roll-call votes:
https://www.senado.gob.mx/66/senador/{id}
The profile spider does not traverse an invented universal catalog. It reads the senador_id values present in votos_nominales that are still absent from the senadores table, tries to open the corresponding page, and stores only profiles with a valid information section.
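The selection of pending profiles can be sketched as an anti-join between the two tables. The table names and the senador_id column in votos_nominales come from this document; the assumption that senadores is also keyed by a senador_id column, and the function name pending_profile_urls, are illustrative only.

```python
import sqlite3

# Senators that appear in roll-call votes but have no stored profile yet.
# Assumes senadores.senador_id exists; the real schema may differ.
PENDING_SQL = """
SELECT DISTINCT v.senador_id
FROM votos_nominales AS v
LEFT JOIN senadores AS s ON s.senador_id = v.senador_id
WHERE v.senador_id IS NOT NULL
  AND s.senador_id IS NULL
ORDER BY v.senador_id
"""


def pending_profile_urls(conn: sqlite3.Connection) -> list[str]:
    """Return profile URLs for senador_id values still missing a profile."""
    rows = conn.execute(PENDING_SQL).fetchall()
    return [f"https://www.senado.gob.mx/66/senador/{sid}" for (sid,) in rows]
```

Because the list is derived from observed votes, an ID that never appears in votos_nominales is never requested.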
Anti-WAF friction
The portal operates behind Incapsula. For that reason the project uses Scrapy with scrapy-impersonate, which integrates curl_cffi and enables requests with a browser TLS fingerprint.
The current configuration includes:
- scrapy_impersonate.ImpersonateDownloadHandler download handlers for HTTP and HTTPS;
- scrapy_impersonate.RandomBrowserMiddleware middleware;
- meta={"impersonate": "chrome131"} on vote and AJAX requests;
- enabled cookies, retries, and manual throttling.
Anti-WAF mitigation does not make the source stable. It only makes access sufficiently consistent for extraction, persistence, and auditing.
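A settings fragment along these lines would match the configuration listed above. It is a sketch, not the repository's settings file: the handler and middleware paths come from this document, while the middleware priority, the retry count, the delay, and the concurrency limit are assumed values, and reading "manual throttling" as a fixed delay with AutoThrottle off is an interpretation.

```python
# settings.py (illustrative sketch)

# Route HTTP and HTTPS downloads through scrapy-impersonate.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}

# asyncio reactor, which the scrapy-impersonate handler expects.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOADER_MIDDLEWARES = {
    # Priority 1000 is an assumption, not taken from the repository.
    "scrapy_impersonate.RandomBrowserMiddleware": 1000,
}

COOKIES_ENABLED = True  # keep the session cookies set by the first HTML page
RETRY_ENABLED = True
RETRY_TIMES = 3  # assumed value

# "Manual throttling" interpreted as a fixed delay instead of AutoThrottle.
DOWNLOAD_DELAY = 2.0  # seconds, assumed value
AUTOTHROTTLE_ENABLED = False
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # assumed value
```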
Output contract
The extractor output is operational: senado.db. Reading it correctly requires preserving the limits of the source:
- IDs with no content are not filled artificially.
- Missing profiles are not invented.
- Empty values are preserved as extraction evidence.
- The Popolo layer remains outside this repository and consumes the database as a later input.
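A consumer or validator that respects these limits can open senado.db read-only and count gaps instead of repairing them. The sketch below assumes the same hypothetical schema as the rest of this document's examples (a senador_id column in both votos_nominales and senadores); the function names and metric names are illustrative.

```python
import sqlite3
from pathlib import Path


def open_read_only(path: str) -> sqlite3.Connection:
    """Open the database so that any write attempt fails."""
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def count_anomalies(conn: sqlite3.Connection) -> dict[str, int]:
    """Count gaps without modifying the database (assumed column names)."""
    votes_without_id = conn.execute(
        "SELECT COUNT(*) FROM votos_nominales WHERE senador_id IS NULL"
    ).fetchone()[0]
    senators_without_profile = conn.execute(
        """
        SELECT COUNT(DISTINCT v.senador_id)
        FROM votos_nominales AS v
        LEFT JOIN senadores AS s ON s.senador_id = v.senador_id
        WHERE v.senador_id IS NOT NULL AND s.senador_id IS NULL
        """
    ).fetchone()[0]
    return {
        "votes_without_senador_id": votes_without_id,
        "senators_without_profile": senators_without_profile,
    }
```

Opening with mode=ro means a bug in the validator surfaces as an sqlite3.OperationalError rather than as silently rewritten evidence.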