Chamber of Deputies: reconstructing roll-call votes from SITL
The Chamber of Deputies publishes enough information to reconstruct an important part of its parliamentary activity, but it does not provide it as a stable dataset.
The source exists. It is public. It has institutional value. But it does not behave like a database.
The work of camara-diputados-mex begins from that friction: turning the SITL/INFOPAL system into a traceable roll-call voting database. Not through blind trust in the portal, but through reconstruction: discovering periods, locating voting events, following parties, normalizing roll-call records, persisting them in SQLite, and auditing which parts of the result can stand against the original source.
This page is not a repository usage manual. It is a reconstructive reading of the process: how an institutional publication designed for human navigation becomes processable legislative evidence.
| Source | Process | Persistence | Audit | Status |
|---|---|---|---|---|
| SITL/INFOPAL | standalone Scrapy | SQLite | checked against SITL | partial / under post-fix validation |
The problem: an official source is not the same as a dataset
SITL/INFOPAL exposes pages, tables, links, and parameters. But an official publication does not automatically equal a dataset.
A dataset needs stability, recognizable fields, identity rules, traceability, and validation criteria. The institutional source, by contrast, appears as a combination of static HTML, historical routes, legislature-specific parameters, party count tables, roll-call listings, and conventions that shift across generations of the portal.
The task, then, is not to “download data.” It is to reconstruct a source.
That means distinguishing substantive content from navigation, separating direct evidence from inference, counting absences, and preventing data cleaning from erasing important signals about the fragility of the original source.
SITL/INFOPAL: static HTML as primary source
Unlike legislative portals that depend on AJAX views or fragmented dynamic responses, the Deputies case works mainly through static HTML.
That relative stability does not remove the problem. It moves it.
The challenge is not crossing a hostile real-time interface, but recognizing historical patterns inside pages that were not designed as an API: period indexes, voting lists, party count tables, and roll-call pages linked through parameters.
The scraper in camara-diputados-mex is implemented as a standalone Scrapy project. Its flow reconstructs the historical series from Legislature LX to Legislature LXVI using the structure published by SITL/INFOPAL itself.
SITL/INFOPAL
│
├── periods
├── voting events
├── parties
└── roll-call votes
│
▼
audited SQLite
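The flow above can be sketched as a pure-Python walk over a miniature in-memory stand-in for the source. `FAKE_SITL` and its layout are illustrative assumptions, not the real portal structure; the point is the top-down discovery order (legislature, then period, then voting event) and the fact that `votaciont` numbers repeat across legislatures.

```python
# Hypothetical stand-in for SITL/INFOPAL pages: each legislature maps to
# periods, each period to votaciont numbers. Purely illustrative data.
FAKE_SITL = {
    "LXV": {"periods": {"2022-1": [1, 2]}},
    "LXVI": {"periods": {"2024-1": [1]}},
}

def discover(source):
    """Walk the source top-down: legislature -> period -> voting event."""
    for legislatura, leg in source.items():
        for period, votes in leg["periods"].items():
            for votaciont in votes:
                # The composite identity (legislatura, votaciont) is what the
                # P0 fix requires; votaciont alone repeats across legislatures.
                yield (legislatura, period, votaciont)

events = list(discover(FAKE_SITL))
```

Even in this toy source, `votaciont = 1` appears in two legislatures, which is exactly the collision the composite key resolves.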
Dynamic discovery
One of the central decisions in the project is not to treat periods, voting events, or parties as fixed lists.
The scraper dynamically discovers:
- available legislative periods;
- voting events inside each period;
- parties present in the source tables;
- roll-call links associated with each vote and party.
This matters because the portal does not behave like a single historical table. Structure changes across legislatures, routes do not always follow one pattern, and identifiers only become meaningful inside their context.
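A minimal sketch of the link-discovery step, assuming, hypothetically, that roll-call detail links carry a `votaciont` query parameter. The markup, file names, and parameter names below are illustrative; the real SITL pages differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

class RollCallLinkExtractor(HTMLParser):
    """Collects votaciont values from anchors in a static HTML page."""

    def __init__(self):
        super().__init__()
        self.votes = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        qs = parse_qs(urlparse(href).query)
        if "votaciont" in qs:
            self.votes.append(int(qs["votaciont"][0]))

# Illustrative static HTML; not the real SITL markup.
page = """
<table>
  <tr><td><a href="votacion.php?votaciont=12&pt=MORENA">detail</a></td></tr>
  <tr><td><a href="votacion.php?votaciont=13&pt=PAN">detail</a></td></tr>
  <tr><td><a href="index.php">home</a></td></tr>
</table>
"""
extractor = RollCallLinkExtractor()
extractor.feed(page)
```

Navigation links without the parameter (like `index.php`) are ignored, which is the practical meaning of "distinguishing substantive content from navigation."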
In a long-running institutional source, every stable pattern matters. But no pattern should be promoted to global truth without validation.
The historical P0: votaciont was not a global key
The most important finding in the project was an identity problem.
The votaciont parameter looked like a sufficient identifier for a voting event. But the historical series showed otherwise: votaciont is sequential per legislature.
That means the same number can appear in more than one legislature without referring to the same legislative event. Treating it as a global key mixes distinct historical entities.
The consequence is strong: the previous historical DB is invalid for analysis.
This is not a cosmetic error. It affects the identity of the voting events themselves. If the key does not distinguish legislature, the dataset may overwrite, mix, or attribute roll-call records to the wrong legislative context.
The correction required composite keys with legislatura.
Before: `votaciont`

After: `legislatura + votaciont`
This rule changes how the dataset must be read. A vote is not only a number; it is a number situated inside a legislature.
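The corrected identity rule can be sketched directly in SQLite. The table and column names follow the text; the extra `fecha` column and the sample rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE votacion (
        legislatura TEXT NOT NULL,
        votaciont   INTEGER NOT NULL,
        fecha       TEXT,
        -- The P0 fix: the number only identifies an event inside a legislature.
        PRIMARY KEY (legislatura, votaciont)
    )
""")
# The same votaciont in two legislatures is two distinct events:
conn.execute("INSERT INTO votacion VALUES ('LXV', 1, '2022-02-01')")
conn.execute("INSERT INTO votacion VALUES ('LXVI', 1, '2024-09-05')")
rows = conn.execute("SELECT COUNT(*) FROM votacion").fetchone()[0]
```

With `votaciont` alone as the key, the second insert would have collided with the first; with the composite key, both events coexist, and only a true duplicate within one legislature is rejected.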
Three database states
To read the project correctly, it helps to separate three states.
Invalid historical DB
The historical DB is the database state before the P0 correction. Because it treats votaciont as if it were globally unique, it does not preserve the separation between legislatures.
It should not be used as a valid source for historical analysis.
Audited clean DB
The audited clean DB currently available is:
data/diputados_clean_20260429_143417.db
On this cut, a clean reconstruction was consolidated and audited against SITL.
| Metric | Value in audited clean DB |
|---|---|
| Voting events | 4,386 |
| Party count rows | 36,220 |
| Roll-call vote records | 2,173,969 |
| Deputies | 4,402 |
| Voting events audited against SITL | 35/35 PASS |
These numbers describe the state of the audited clean DB. They should not be read as total closure of the project or as an automatic guarantee for every derived dimension.
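The per-event audit can be sketched as an exact comparison of party totals stored in the DB against totals re-read from the source. The function, the party counts, and the sample below are invented for illustration; they are not real SITL figures or the project's actual audit code.

```python
def audit_event(db_counts, source_counts):
    """PASS only if every per-party total matches the source exactly."""
    return db_counts == source_counts

# Illustrative sample: (counts persisted in SQLite, counts re-read from SITL).
sample = [
    ({"MORENA": 198, "PAN": 114}, {"MORENA": 198, "PAN": 114}),
    ({"MORENA": 201, "PAN": 110}, {"MORENA": 201, "PAN": 110}),
]
results = [audit_event(db, src) for db, src in sample]
passed = sum(results)
verdict = f"{passed}/{len(results)} PASS"
```

An exact-match rule is deliberately strict: a single off-by-one count fails the event, which keeps the audit from papering over small reconstruction errors.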
Post-fix derived DB in progress
After the audit, engineering corrected the handling of diputados.entidad in code. That fix exists and was committed locally.
But the post-fix derived DB is still being materialized and validated. Any claim that depends on that database must therefore remain conditioned.
The correct publication status is: partial / under post-fix validation.
Audit: validating without erasing the boundary
The audited clean DB was checked against the original SITL source. In the audited voting sample, the result was 35/35 PASS for voting events, party counts, and roll-call votes.
That result supports the corrected voting reconstruction in the audited sample. But it does not automatically make every field in the project final.
The distinction matters. Domain-by-domain validation avoids two opposite errors. On one side, it avoids discarding a useful base because one field is still under correction. On the other, it avoids declaring total victory simply because the main voting metrics passed.
In this cut, the strongest part is the reconstruction of voting events, party counts, and roll-call votes in the audited clean DB. The conditioned part is the post-fix materialization of territorial fields associated with deputies.
Territory, substitutes, and diputados.entidad
The diputados.entidad field opened a different boundary from the voting reconstruction.
The current methodological rule is that, for substitutes, territorial validation must be performed against the principal deputy. The represented territory is not inferred only from the name that appears in a record, but from the institutional relationship with the seat.
This rule avoids confusing personal origin, curriculum text, or nominal appearance with territorial representation.
The code fix is meant to prevent future data from storing textual blocks, circumscription values, or inappropriate nulls when an interpretable territorial representation exists. But until the post-fix derived DB completes validation, that result should not be presented as already materialized in a final validated database.
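The substitute rule can be sketched as a small resolution function, assuming a hypothetical registry keyed by deputy id, where `suplente_de` points at the principal deputy holding the seat. All field names here are illustrative, not the project's schema.

```python
def resolve_entidad(deputy, registry):
    """Territorial validation for a substitute goes through the principal
    deputy who holds the seat, never the substitute's own record."""
    if deputy.get("suplente_de"):
        principal = registry[deputy["suplente_de"]]
        return principal["entidad"]
    return deputy["entidad"]

# Illustrative registry: one principal and a substitute tied to that seat.
registry = {
    "p-001": {"entidad": "Jalisco", "suplente_de": None},
}
substitute = {"entidad": None, "suplente_de": "p-001"}
```

Note that the substitute's own `entidad` is `None` and is never consulted: the seat, not the name on the record, determines the represented territory.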
Schema decision: no circunscripcion column
The current schema does not add a separate circunscripcion column.
When the source expresses a circumscription rather than an ordinary district, the temporary representation is:
distrito = "Circ. N"
This decision preserves compatibility with the current model and keeps the source signal without introducing a column whose historical semantics would require a broader schema decision.
It does not mean that a circumscription is equivalent to a district. It means that, for now, the dataset preserves that information inside the available field while future modeling is decided explicitly.
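The temporary encoding can be captured in two small helpers. These are a sketch of the convention described above, not the project's actual code, and the function names are invented.

```python
def encode_distrito(distrito=None, circunscripcion=None):
    """Store a circumscription inside the existing distrito field,
    using the temporary 'Circ. N' convention."""
    if circunscripcion is not None:
        return f"Circ. {circunscripcion}"
    return str(distrito)

def is_circunscripcion(value):
    """True when a distrito value actually encodes a circumscription."""
    return value.startswith("Circ. ")
```

Because the marker prefix is unambiguous, a later schema migration can split the field back out without information loss, which is the point of preserving the signal instead of discarding it.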
Scope: roll-call voting
The focus of this cut is the historical reconstruction of voting events, party counts, and roll-call votes.
comisiones are out of scope by Nolan’s decision. They are not treated as dataset debt or as an omission waiting for repair.
Expanding the scope toward committees would imply another question, another model, and another validation process.
Guarantees and limits
Current guarantees
- The project is a standalone Scrapy scraper for the Chamber of Deputies.
- The primary source is SITL/INFOPAL.
- The historical scope covered so far runs from Legislature LX to Legislature LXVI.
- Extraction is based on static HTML and dynamic discovery of periods, voting events, and parties.
- The historical P0 around `votaciont` was corrected through composite keys with `legislatura`.
- The previous historical DB is invalid for analysis.
- The audited clean DB `data/diputados_clean_20260429_143417.db` contains 4,386 voting events, 36,220 party count rows, 2,173,969 roll-call vote records, and 4,402 deputies.
- In the audited sample, 35/35 voting events passed against SITL.
Current limits
- This page describes a partial state, not final closure.
- The post-fix derived DB is still being materialized and validated.
- The `diputados.entidad` fix exists in code, but should not be presented as already materialized in a final validated DB.
- Territorial claims must respect the substitute rule: validation against the principal deputy.
- The schema does not include a `circunscripcion` column; it temporarily uses `distrito = "Circ. N"`.
- `comisiones` remain outside the current scope by Nolan's decision.
The operating thesis is simple: an institutional source becomes a dataset when every identifier is situated in context, every transformation leaves a trace, and every validation boundary is stated without overstating it.