Extraction pipeline

The camara-senadores-mex pipeline runs two main traversals: roll-call voting events and senator profiles. Both use Scrapy, but they have different sources and acceptance criteria.

Vote IDs
   │
   ▼
/66/votacion/{id}
   │
   ├── temporal parsing
   │
   └── POST viewTableVot.php
          │
          ▼
       roll-call rows
          │
          ▼
       SQLite: votaciones + votos_nominales
          │
          ▼
observed senator IDs
   │
   ▼
/66/senador/{id}
   │
   ▼
SQLite: available senators

1. ID selection

The vote spider accepts two input modes:

Mode      Behavior
max_id    Iterates range(1, max_id + 1).
ids       Uses an explicit comma-separated list.

This supports both large crawls and selective recrawls of specific cases.

Operational examples documented by the spider itself:

scrapy crawl votaciones
scrapy crawl votaciones -a max_id=50
scrapy crawl votaciones -a ids=347,891,2103,2789,3671,4256,4890

The final dataset comes from the authorized traversal of the [1, 5000] range. The pipeline documentation describes no additional filters by legislature, party, or vote type.
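The two input modes can be pictured with a short sketch. It is not the spider's code: the function name resolve_vote_ids, the rule that ids takes precedence over max_id, and the default_max_id value (chosen here only to mirror the documented [1, 5000] range) are assumptions made for illustration.

```python
def resolve_vote_ids(max_id=None, ids=None, default_max_id=5000):
    """Return the list of vote IDs to request (hypothetical helper)."""
    if ids:
        # Explicit list mode: "347,891,2103" -> [347, 891, 2103].
        # Empty fragments (e.g. a trailing comma) are skipped.
        return [int(part) for part in str(ids).split(",") if part.strip()]
    # Range mode: Scrapy passes -a arguments as strings, so coerce to int.
    # default_max_id is an assumption, not the spider's documented default.
    upper = int(max_id) if max_id is not None else default_max_id
    return list(range(1, upper + 1))


selective = resolve_vote_ids(ids="347, 891,2103")
small_range = resolve_vote_ids(max_id="3")
full_range = resolve_vote_ids()
```

Keeping the selection in one pure function makes it easy to check which IDs a run would touch before any request is sent.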

2. Vote HTML request

For each ID, Scrapy requests:

https://www.senado.gob.mx/66/votacion/{id}

The request passes a browser-impersonation hint in meta to reduce blocking by the site's web application firewall (WAF):

meta={"vote_id": vote_id, "impersonate": "chrome131"}

On that response, the parser attempts to extract temporal metadata:

  • legislature;
  • year of exercise;
  • period;
  • date.

The parser checks direct <strong> text and fragments separated by <br>. If the legislature is not explicit, it may be inferred from the date using known ranges from LX to LXVI.
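The date-based fallback could look like the sketch below. The ranges assume the constitutional calendar, in which each legislature starts on 1 September and lasts three years; the spider's own table may use different boundaries, and the names LEGISLATURES and infer_legislature are hypothetical.

```python
from datetime import date

# Assumed ranges: each legislature runs from 1 September to 31 August,
# three years later. The spider's actual table may differ.
LEGISLATURES = [
    ("LX", date(2006, 9, 1), date(2009, 8, 31)),
    ("LXI", date(2009, 9, 1), date(2012, 8, 31)),
    ("LXII", date(2012, 9, 1), date(2015, 8, 31)),
    ("LXIII", date(2015, 9, 1), date(2018, 8, 31)),
    ("LXIV", date(2018, 9, 1), date(2021, 8, 31)),
    ("LXV", date(2021, 9, 1), date(2024, 8, 31)),
    ("LXVI", date(2024, 9, 1), date(2027, 8, 31)),
]


def infer_legislature(day):
    """Return the legislature whose range contains `day`, or None."""
    for name, start, end in LEGISLATURES:
        if start <= day <= end:
            return name
    # Outside the known ranges: return nothing rather than guess.
    return None
```

Returning None for dates outside the known ranges keeps the fallback consistent with the rule that values are not filled in without evidence.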

3. Current AJAX request

After the initial HTML, the spider always queries the roll-call AJAX view:

POST https://www.senado.gob.mx/66/app/votaciones/functions/viewTableVot.php

The request uses a form-urlencoded body:

action=ajax&cell=1&order=DESC&votacion={id}&q=

and headers:

Content-Type: application/x-www-form-urlencoded
X-Requested-With: XMLHttpRequest
Referer: <vote page URL>
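As an illustration, the body and headers above can be assembled with the standard library alone. The helper name build_ajax_request is hypothetical; the returned keys (url, method, headers, body) are chosen to match keyword arguments that scrapy.Request accepts, but how the spider actually builds its request is not shown in this document.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.senado.gob.mx/66"
AJAX_URL = f"{BASE_URL}/app/votaciones/functions/viewTableVot.php"


def build_ajax_request(vote_id):
    """Return request parameters for the roll-call AJAX view (sketch)."""
    # Field order follows the documented body; q is sent empty.
    body = urlencode({
        "action": "ajax",
        "cell": "1",
        "order": "DESC",
        "votacion": vote_id,
        "q": "",
    })
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "X-Requested-With": "XMLHttpRequest",
        # The Referer is the vote page that was requested first.
        "Referer": f"{BASE_URL}/votacion/{vote_id}",
    }
    return {"url": AJAX_URL, "method": "POST", "headers": headers, "body": body}


request_kwargs = build_ajax_request(347)
```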

The decision on whether a voting event has data is made in parse_votes(), with the AJAX response as the main evidence. If there are no votes and the legislature could not be resolved either, the case is silently discarded; if there are votes or legislative metadata, the voting event is emitted.
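The acceptance rule reduces to a single boolean condition. The sketch below restates it; should_emit is a hypothetical name, and the real parse_votes() may check more than these two inputs.

```python
def should_emit(vote_rows, legislature):
    """Emit the voting event if there are votes or a resolved legislature."""
    has_votes = bool(vote_rows)
    has_legislature = bool(legislature)
    # Only the case with neither votes nor legislature is discarded.
    return has_votes or has_legislature


decisions = {
    "votes and legislature": should_emit([{"senador_id": 1}], "LXVI"),
    "votes only": should_emit([{"senador_id": 1}], None),
    "legislature only": should_emit([], "LXVI"),
    "neither": should_emit([], None),
}
```

Only the last case is dropped, which is why empty vote IDs inside the traversed range leave no row behind.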

4. Roll-call row parsing

The parser traverses <tr> rows and requires at least four cells. For each useful row:

  1. takes the name from the link in the second cell;
  2. cleans prefixes such as Sen. / Senador / Senadora;
  3. reorders comma-form names (Surname, Name → Name Surname);
  4. extracts the party from the third cell;
  5. joins all text nodes in the fourth cell to preserve multi-node vote values;
  6. extracts senador_id from the name or party href;
  7. emits VotoNominalItem only when both senador_id and name exist.

Internal whitespace normalization is part of the pipeline so that fragmented HTML details do not contaminate the database. Values are not invented when the portal does not deliver them.
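The name cleanup and ID extraction steps can be sketched with regular expressions. Everything here is illustrative: the helper names, the exact prefix pattern, and the assumption that profile links contain /senador/{id} are not confirmed by the spider's source.

```python
import re

# "Senadora" is listed before "Senador" so the longer prefix wins.
_PREFIX = re.compile(r"^(?:Sen\.|Senadora|Senador)\s+", re.IGNORECASE)
# Assumed href shape: .../senador/<digits>
_SENADOR_HREF = re.compile(r"/senador/(\d+)")


def normalize_ws(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def clean_name(raw):
    """Strip the title prefix and reorder 'Surname, Name' to 'Name Surname'."""
    name = _PREFIX.sub("", normalize_ws(raw))
    if "," in name:
        surname, given = (part.strip() for part in name.split(",", 1))
        name = f"{given} {surname}".strip()
    return name


def extract_senador_id(*hrefs):
    """Return the first senator ID found in the given hrefs, or None."""
    for href in hrefs:
        match = _SENADOR_HREF.search(href or "")
        if match:
            return int(match.group(1))
    return None
```

A row would then be emitted only when both clean_name() returns a non-empty string and extract_senador_id() returns a value, matching the last step of the list above.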

5. Conceptual persistence

Persistence uses a SQLite pipeline:

Item               Conceptual operation
VotacionItem       INSERT OR REPLACE into votaciones.
VotoNominalItem    INSERT OR IGNORE into votos_nominales, backed by a uniqueness constraint.
SenadorItem        INSERT OR REPLACE into senadores.

The database is written in batches with periodic commits. This enables selective recrawls without duplicating roll-call votes when the uniqueness constraint applies.
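The effect of the two conflict strategies can be demonstrated against an in-memory database. The schema below is an assumption made for the example: the column names and the (votacion_id, senador_id) uniqueness key are hypothetical, and the sample values are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the real tables may have other columns and keys.
conn.executescript("""
CREATE TABLE votaciones (id INTEGER PRIMARY KEY, legislatura TEXT, fecha TEXT);
CREATE TABLE votos_nominales (
    votacion_id INTEGER NOT NULL,
    senador_id  INTEGER NOT NULL,
    partido     TEXT,
    voto        TEXT,
    UNIQUE (votacion_id, senador_id)
);
""")


def save_votacion(conn, item):
    # REPLACE: a recrawl overwrites the stored metadata for the same ID.
    conn.execute(
        "INSERT OR REPLACE INTO votaciones (id, legislatura, fecha) "
        "VALUES (:id, :legislatura, :fecha)", item)


def save_voto(conn, item):
    # IGNORE: a row that violates the uniqueness constraint is skipped.
    conn.execute(
        "INSERT OR IGNORE INTO votos_nominales "
        "(votacion_id, senador_id, partido, voto) "
        "VALUES (:votacion_id, :senador_id, :partido, :voto)", item)


# Simulate crawling the same voting event twice.
for _ in range(2):
    save_votacion(conn, {"id": 347, "legislatura": "LXVI", "fecha": "2024-10-01"})
    save_voto(conn, {"votacion_id": 347, "senador_id": 10,
                     "partido": "EJEMPLO", "voto": "PRO"})
conn.commit()

votacion_count = conn.execute("SELECT COUNT(*) FROM votaciones").fetchone()[0]
voto_count = conn.execute("SELECT COUNT(*) FROM votos_nominales").fetchone()[0]
```

After the second pass both counts are still 1, which is the property that makes selective recrawls safe.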

6. Profile pipeline

The senator spider starts from the already populated database:

SELECT DISTINCT vn.senador_id
FROM votos_nominales vn
LEFT JOIN senadores s ON vn.senador_id = s.id
WHERE s.id IS NULL
ORDER BY vn.senador_id

With that list it visits:

https://www.senado.gob.mx/66/senador/{id}

If the page does not contain the expected information section, the profile is omitted. If the section exists, the spider stores the name, the sex inferred from the name prefix, the election type, the state, and the URL.
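The query is an anti-join: it keeps only senator IDs seen in votes that have no row yet in senadores. The sketch below runs it against a toy in-memory database; the reduced schema, the sample rows, and the helper name pending_profile_urls are assumptions for the example.

```python
import sqlite3

PENDING_SQL = """
SELECT DISTINCT vn.senador_id
FROM votos_nominales vn
LEFT JOIN senadores s ON vn.senador_id = s.id
WHERE s.id IS NULL
ORDER BY vn.senador_id
"""


def pending_profile_urls(conn):
    """Return profile URLs for senator IDs that have votes but no profile."""
    rows = conn.execute(PENDING_SQL).fetchall()
    return [f"https://www.senado.gob.mx/66/senador/{sid}" for (sid,) in rows]


conn = sqlite3.connect(":memory:")
# Toy data: senators 10, 20, 30 appear in votes; only 10 has a profile.
conn.executescript("""
CREATE TABLE votos_nominales (votacion_id INTEGER, senador_id INTEGER);
CREATE TABLE senadores (id INTEGER PRIMARY KEY, nombre TEXT);
INSERT INTO votos_nominales VALUES (1, 10), (1, 20), (2, 10), (2, 30);
INSERT INTO senadores VALUES (10, 'Example Name');
""")

urls = pending_profile_urls(conn)
```

Because stored profiles drop out of the result, rerunning the senator spider only revisits IDs that are still missing.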

7. Validation and limits

The pipeline does not end with a “perfect” database; it ends with an auditable database.

Known limits are part of the result:

  • there are empty vote IDs inside the traversed range;
  • not every ID present in votes has an available profile;
  • empty parties or vote values may exist;
  • the WAF may introduce access variability;
  • the extractor must not fill missing values without evidence from the portal.

Later validation reads, counts, and flags. It must not erase extraction history or hide anomalies that are relevant for understanding the source.
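A read-only audit in that spirit might look like the following sketch. It only counts and lists; it never updates or deletes. The function name audit, the column names, and the sample rows are hypothetical.

```python
import sqlite3


def audit(conn, first_id, last_id):
    """Count known gaps without modifying the database (SELECT only)."""
    present = {row[0] for row in conn.execute(
        "SELECT id FROM votaciones WHERE id BETWEEN ? AND ?",
        (first_id, last_id))}

    def count(sql):
        return conn.execute(sql).fetchone()[0]

    return {
        "missing_vote_ids": [i for i in range(first_id, last_id + 1)
                             if i not in present],
        "votes_without_party": count(
            "SELECT COUNT(*) FROM votos_nominales "
            "WHERE partido IS NULL OR TRIM(partido) = ''"),
        "votes_without_value": count(
            "SELECT COUNT(*) FROM votos_nominales "
            "WHERE voto IS NULL OR TRIM(voto) = ''"),
        "senators_without_profile": count(
            "SELECT COUNT(DISTINCT vn.senador_id) FROM votos_nominales vn "
            "LEFT JOIN senadores s ON vn.senador_id = s.id "
            "WHERE s.id IS NULL"),
    }


conn = sqlite3.connect(":memory:")
# Toy data with one instance of each known limit.
conn.executescript("""
CREATE TABLE votaciones (id INTEGER PRIMARY KEY);
CREATE TABLE votos_nominales (votacion_id INTEGER, senador_id INTEGER,
                              partido TEXT, voto TEXT);
CREATE TABLE senadores (id INTEGER PRIMARY KEY);
INSERT INTO votaciones VALUES (1), (3);
INSERT INTO votos_nominales VALUES
    (1, 10, 'A', 'PRO'), (1, 20, '', 'PRO'), (3, 10, 'A', NULL);
INSERT INTO senadores VALUES (10);
""")

report = audit(conn, 1, 3)
```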