Extraction pipeline
The camara-senadores-mex pipeline runs two main traversals: roll-call voting events and senator profiles. Both use Scrapy, but they have different sources and acceptance criteria.
Vote IDs
│
▼
/66/votacion/{id}
│
├── temporal parsing
│
└── POST viewTableVot.php
│
▼
roll-call rows
│
▼
SQLite: votaciones + votos_nominales
│
▼
observed senator IDs
│
▼
/66/senador/{id}
│
▼
SQLite: available senators
1. ID selection
The vote spider accepts two input modes:
| Mode | Behavior |
|---|---|
| max_id | Iterates range(1, max_id + 1). |
| ids | Uses an explicit comma-separated list. |
This supports both large crawls and selective recrawls of specific cases.
Operational examples documented by the spider itself:
scrapy crawl votaciones
scrapy crawl votaciones -a max_id=50
scrapy crawl votaciones -a ids=347,891,2103,2789,3671,4256,4890
The final dataset came from the authorized traversal of the ID range [1, 5000]. The pipeline documentation describes no additional filters by legislature, party, or vote type.
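The two input modes can be sketched as a small helper. This is a minimal illustration, not the spider's code: the name resolve_vote_ids is hypothetical, and giving ids precedence over max_id is an assumption. The default used when neither argument is passed is not documented here, so the sketch refuses to guess it.

```python
from typing import List, Optional


def resolve_vote_ids(max_id: Optional[str] = None, ids: Optional[str] = None) -> List[int]:
    """Return the vote IDs to crawl.

    Scrapy hands -a arguments to the spider as strings, so both inputs
    are parsed here. Assumption: an explicit list wins over max_id.
    """
    if ids:
        # Tolerate spaces and stray commas such as "347, 891,,2103".
        return [int(token) for token in ids.split(",") if token.strip()]
    if max_id is not None:
        return list(range(1, int(max_id) + 1))
    # The spider's real default range is not reproduced in this sketch.
    raise ValueError("pass max_id or ids")
```

With max_id="50" the helper returns the IDs 1 through 50; with ids="347,891,2103" it returns exactly those three IDs in the given order.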
2. Vote HTML request
For each ID, Scrapy requests:
https://www.senado.gob.mx/66/votacion/{id}
The request includes browser impersonation in meta to mitigate the WAF:
meta={"vote_id": vote_id, "impersonate": "chrome131"}
On that response, the parser attempts to extract temporal metadata:
- legislature;
- year of exercise;
- period;
- date.
The parser checks direct <strong> text and fragments separated by <br>. If the legislature is not explicit, it may be inferred from the date using known ranges from LX to LXVI.
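The date-based fallback could look like the sketch below. The boundaries are an assumption: they follow the constitutional calendar, in which each legislature opens on 1 September and lasts three years, and the spider's own table may differ. The function name infer_legislature is hypothetical.

```python
from datetime import date
from typing import Optional

# Assumed start dates (inclusive), from LX to LXVI.
LEGISLATURE_STARTS = [
    ("LX", date(2006, 9, 1)),
    ("LXI", date(2009, 9, 1)),
    ("LXII", date(2012, 9, 1)),
    ("LXIII", date(2015, 9, 1)),
    ("LXIV", date(2018, 9, 1)),
    ("LXV", date(2021, 9, 1)),
    ("LXVI", date(2024, 9, 1)),
]
# Exclusive end of the last known range.
LAST_END = date(2027, 9, 1)


def infer_legislature(day: date) -> Optional[str]:
    """Map a vote date to a legislature, or None when it is out of range."""
    if day < LEGISLATURE_STARTS[0][1] or day >= LAST_END:
        return None
    label = None
    for name, start in LEGISLATURE_STARTS:
        if day >= start:
            label = name  # keep the latest legislature that already started
    return label
```

Returning None for out-of-range dates keeps the inference consistent with the rule that values are not filled in without evidence.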
3. Current AJAX request
After the initial HTML, the spider always queries the roll-call AJAX view:
POST https://www.senado.gob.mx/66/app/votaciones/functions/viewTableVot.php
The request uses a form-urlencoded body:
action=ajax&cell=1&order=DESC&votacion={id}&q=
and headers:
Content-Type: application/x-www-form-urlencoded
X-Requested-With: XMLHttpRequest
Referer: <vote page URL>
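As an illustration, the body and headers listed above can be assembled with the standard library. The helper name build_ajax_request is hypothetical; the returned keys (url, method, body, headers) are chosen to match the keyword arguments of scrapy.Request, but this sketch does not import Scrapy or send anything.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.senado.gob.mx/66"
AJAX_URL = f"{BASE_URL}/app/votaciones/functions/viewTableVot.php"


def build_ajax_request(vote_id: int) -> dict:
    """Assemble the POST described in this section for one vote ID."""
    body = urlencode(
        {"action": "ajax", "cell": 1, "order": "DESC", "votacion": vote_id, "q": ""}
    )
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "X-Requested-With": "XMLHttpRequest",
        # The Referer points back at the vote page requested in step 2.
        "Referer": f"{BASE_URL}/votacion/{vote_id}",
    }
    return {"url": AJAX_URL, "method": "POST", "body": body, "headers": headers}
```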
The decision on whether a voting event has data is made in parse_votes(), with the AJAX response as the main evidence. If there are no votes and the legislature could not be resolved either, the case is silently discarded; if there are votes or legislative metadata, the voting event is emitted.
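The emit-or-discard rule reduces to a single boolean condition. The sketch below restates it in isolation; should_emit is a hypothetical name, and the real check lives inside parse_votes().

```python
from typing import Optional, Sequence


def should_emit(votes: Sequence, legislature: Optional[str]) -> bool:
    """Emit the voting event when there is any evidence for it.

    Discard only when the AJAX response yielded no rows AND the
    legislature could not be resolved.
    """
    return bool(votes) or bool(legislature)
```

An event with resolved legislative metadata but no roll-call rows is therefore still emitted, which is one way empty vote values can reach the database.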
4. Roll-call row parsing
The parser traverses <tr> rows and requires at least four cells. For each useful row:
- takes the name from the link in the second cell;
- cleans prefixes such as Sen./Senador/Senadora;
- reorders comma-form names (Surname, Name → Name Surname);
- extracts the party from the third cell;
- joins all text nodes in the fourth cell to preserve multi-node vote values;
- extracts senador_id from the name or party href;
- emits VotoNominalItem only when both senador_id and name exist.
Internal whitespace normalization is part of the pipeline so that fragmented HTML details do not contaminate the database. Values are not invented when the portal does not deliver them.
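The name clean-up and ID extraction steps can be sketched with plain string handling. All helper names are hypothetical, and the href pattern is an assumption based on the /66/senador/{id} profile URL; the actual parser may use different expressions.

```python
import re
from typing import Optional

# Honorific prefixes named in the list above, matched at the start only.
_PREFIX = re.compile(r"^(?:Senadora\b|Senador\b|Sen\.)\s*", re.IGNORECASE)
# Assumption: links embed the ID as .../senador/<digits>.
_SENADOR_HREF = re.compile(r"/senador/(\d+)")


def normalize_ws(text: str) -> str:
    """Collapse runs of whitespace (including newlines) to single spaces."""
    return " ".join(text.split())


def clean_name(raw: str) -> str:
    """Strip the prefix and turn 'Surname, Name' into 'Name Surname'."""
    name = _PREFIX.sub("", normalize_ws(raw))
    if "," in name:
        surname, _, given = name.partition(",")
        name = f"{given.strip()} {surname.strip()}"
    return name.strip()


def extract_senador_id(*hrefs: Optional[str]) -> Optional[int]:
    """Return the first ID found in the given hrefs, or None."""
    for href in hrefs:
        match = _SENADOR_HREF.search(href or "")
        if match:
            return int(match.group(1))
    return None
```

A row would then be emitted only when both clean_name() returns a non-empty string and extract_senador_id() returns an ID, matching the last rule in the list.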
5. Conceptual persistence
Persistence uses a SQLite pipeline:
| Item | Conceptual operation |
|---|---|
| VotacionItem | INSERT OR REPLACE into votaciones. |
| VotoNominalItem | INSERT OR IGNORE into votos_nominales, backed by a uniqueness constraint. |
| SenadorItem | INSERT OR REPLACE into senadores. |
The database is written in batches with periodic commits. This enables selective recrawls without duplicating roll-call votes when the uniqueness constraint applies.
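To make the idempotence argument concrete, here is a minimal sketch using the standard sqlite3 module. Only the table name votos_nominales comes from this document; the column names, the (votacion_id, senador_id) uniqueness key, and the batch size are assumptions for illustration.

```python
import sqlite3
from typing import Iterable, Tuple


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create a connection with a hypothetical minimal schema."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS votos_nominales (
            votacion_id INTEGER NOT NULL,
            senador_id  INTEGER NOT NULL,
            voto        TEXT,
            UNIQUE (votacion_id, senador_id)  -- assumed uniqueness key
        );
        """
    )
    return conn


def save_votes(conn: sqlite3.Connection,
               rows: Iterable[Tuple[int, int, str]],
               batch_size: int = 500) -> None:
    """Insert rows in batches; duplicates are skipped by OR IGNORE."""
    pending = 0
    for row in rows:
        conn.execute(
            "INSERT OR IGNORE INTO votos_nominales "
            "(votacion_id, senador_id, voto) VALUES (?, ?, ?)",
            row,
        )
        pending += 1
        if pending >= batch_size:
            conn.commit()  # periodic commit
            pending = 0
    conn.commit()  # flush the final partial batch
```

Calling save_votes() twice with the same rows leaves the row count unchanged, which is the property a selective recrawl relies on.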
6. Profile pipeline
The senator spider starts from the already populated database:
SELECT DISTINCT vn.senador_id
FROM votos_nominales vn
LEFT JOIN senadores s ON vn.senador_id = s.id
WHERE s.id IS NULL
ORDER BY vn.senador_id
With that list it visits:
https://www.senado.gob.mx/66/senador/{id}
If the page does not contain the expected information section, the profile is skipped. If the section exists, the spider stores the name, the sex inferred from the prefix, the election type, the state, and the URL.
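The query and URL pattern above combine into a short start-URL generator. This is a sketch with the standard sqlite3 module; the function name pending_profile_urls is hypothetical, and it assumes only the two columns the query itself references.

```python
import sqlite3
from typing import List

PENDING_SQL = """
SELECT DISTINCT vn.senador_id
FROM votos_nominales vn
LEFT JOIN senadores s ON vn.senador_id = s.id
WHERE s.id IS NULL
ORDER BY vn.senador_id
"""

PROFILE_URL = "https://www.senado.gob.mx/66/senador/{id}"


def pending_profile_urls(conn: sqlite3.Connection) -> List[str]:
    """Build profile URLs for senator IDs seen in votes but not yet stored."""
    return [PROFILE_URL.format(id=sid) for (sid,) in conn.execute(PENDING_SQL)]
```

Because the LEFT JOIN filters out IDs already present in senadores, rerunning the profile spider only revisits profiles that are still missing.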
7. Validation and limits
The pipeline does not end with a “perfect” database; it ends with an auditable database.
Known limits are part of the result:
- there are empty vote IDs inside the traversed range;
- not every ID present in votes has an available profile;
- empty parties or vote values may exist;
- the WAF may introduce access variability;
- the extractor must not fill missing values without evidence from the portal.
Later validation reads, counts, and flags. It must not erase extraction history or hide anomalies that are relevant for understanding the source.
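A read-only audit in that spirit might look like the following sketch. It only runs SELECT statements, so nothing is erased or corrected. The column names partido and voto are hypothetical, as are the function name and the labels of the checks.

```python
import sqlite3
from typing import Dict

# Each check counts one known limit; none of them modifies the data.
CHECKS = {
    "votes_without_party": (
        "SELECT COUNT(*) FROM votos_nominales "
        "WHERE partido IS NULL OR TRIM(partido) = ''"
    ),
    "votes_without_value": (
        "SELECT COUNT(*) FROM votos_nominales "
        "WHERE voto IS NULL OR TRIM(voto) = ''"
    ),
    "senator_ids_without_profile": (
        "SELECT COUNT(DISTINCT vn.senador_id) FROM votos_nominales vn "
        "LEFT JOIN senadores s ON vn.senador_id = s.id WHERE s.id IS NULL"
    ),
}


def audit(conn: sqlite3.Connection) -> Dict[str, int]:
    """Return a count per check so anomalies are flagged, not hidden."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}
```

The resulting dictionary can be logged or stored next to the database as an audit record.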