Chamber of Senators: from hostile source to validated dataset

The Chamber of Senators publishes enough information to reconstruct an important part of its parliamentary life, but it does not provide it as a stable dataset.

The source exists. It is public. It has institutional value. But it does not behave like a database.

The work of camara-senadores-mex begins from that friction: turning a hostile institutional source into a validated senatorial dataset. Not through blind trust in the portal, but through reconstruction: locating the signal, extracting it under constraints, persisting it in a queryable structure, and auditing the gaps left by the source itself.

This page is not a repository usage manual. It is a reconstructive reading of the process: how a publication designed for human navigation becomes processable legislative evidence.

Source → Process → Persistence → Audit → Contract
institutional portal → extraction → SQLite → read-only validation → senatorial contract

The problem: an official source is not the same as a dataset

The Senate portal offers pages, tables, profiles, and documents. But an official publication does not automatically equal a dataset.

A dataset needs stability, recognizable fields, membership rules, traceability, and validation criteria. The institutional source, by contrast, appears as a combination of HTML, numeric routes, partial views, incomplete metadata, and responses that are not always homogeneous.

The task, then, is not to “download data.” It is to reconstruct a source.

That means distinguishing substantive content from navigation, separating direct evidence from inference, counting absences, and preventing data cleaning from erasing important signals about the fragility of the original source.

/66/: the first structural clue

Inside the institutional site, the /66/ route works as an entry point to the LXVI Legislature.

It is not an API. Nor is it, by itself, a guarantee of completeness. But it does offer a structural clue: the site organizes part of its publication around legislature identifiers.

In a hostile source, every stable pattern matters. A number in the URL, a repeated hierarchy, a navigation block, or a link convention can become anchors for reconstructing the documentary universe.

/66/ works this way: less as a final destination and more as a starting point for mapping which legislative contents are grouped under that convention.

                 /66/
          ┌───────┼────────┐
          │       │        │
  legislature   votes   related pages
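The mapping above can be sketched as URL construction anchored on the legislature identifier. This is a minimal sketch, assuming the portal's base domain; the section names passed in are illustrative, not confirmed routes of the real site.

```python
from urllib.parse import urljoin

# Assumed base domain of the institutional portal.
BASE = "https://www.senado.gob.mx/"

def legislature_urls(legislature: int, sections: list[str]) -> list[str]:
    """Build candidate entry points under a /<legislature>/ route.

    The numeric identifier in the path (e.g. /66/) is the only stable
    anchor relied on here; the section names are assumptions.
    """
    root = urljoin(BASE, f"{legislature}/")
    return [urljoin(root, section) for section in sections]

# "" yields the root itself; the other sections are hypothetical.
urls = legislature_urls(66, ["", "votaciones", "senadores"])
```

The point of the helper is that every candidate URL is derived from one convention (`/66/`), so the crawl surface stays enumerable instead of depending on manual navigation.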

Why HTML is not enough

Institutional HTML is the first evidence, but it is not enough as a final dataset.

It may contain names, dates, offices, tables, links, and useful texts. It may also contain navigation, styles, repeated blocks, side information, incomplete fragments, or markup whose semantics depend on visual context.

Reading the HTML makes it possible to detect signals. Validating it requires more: identifying which field is declared by the source, which field is inferred, which content is repeated by template, and which pieces cannot be audited with the same level of confidence.

That is why the project does not treat HTML as the final base. It treats it as a primary source.

The dataset appears afterward: when each extraction is normalized, each gap is counted, and each transformation rule can be reviewed.
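One way to keep that distinction operational is to tag every field with its provenance at normalization time, so declared evidence is never confused with a missing value. A minimal sketch, with illustrative field names that are not the project's real schema:

```python
# Each normalized field keeps its provenance so later audits can
# separate declared evidence from absence. Field names are illustrative.

def normalize(raw: dict) -> dict:
    record = {}
    gaps = []
    for field in ("name", "party", "vote"):
        value = raw.get(field)
        if value is None or not str(value).strip():
            gaps.append(field)  # count the absence, do not invent a value
            record[field] = {"value": None, "source": "missing"}
        else:
            record[field] = {"value": str(value).strip(), "source": "declared"}
    record["_gaps"] = gaps
    return record

r = normalize({"name": "  María Example ", "party": "", "vote": "In favor"})
```

An empty party is recorded as a counted gap rather than silently dropped, which is exactly the behavior the audit later depends on.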

The hidden table: viewTableVot.php

The practical entry point for roll-call votes is an AJAX view: viewTableVot.php.

That view returns HTML fragments with voting records. It is not an open, stable, self-contained download. It is an interface piece: a partial response designed to feed a table inside the site.

A fundamental part of the signal is there: legislators, vote direction, party, and voting context. But it appears encapsulated in a form that forces reconstruction.

The institutional source publishes data, but does not by itself guarantee the extraction, traceability, or completeness conditions required by a reproducible dataset.

<tr>
  <td>Sen.</td>
  <td>María Example</td>
  <td>Parliamentary Group A</td>
  <td>In favor</td>
</tr>
<tr>
  <td>Sen.</td>
  <td>José Example</td>
  <td>Parliamentary Group B</td>
  <td>Against</td>
</tr>
<tr>
  <td>Dip.</td>
  <td>Non-senatorial record</td>
  <td>Parliamentary Group C</td>
  <td>Abstention</td>
</tr>
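A fragment like the one above can be reconstructed into rows with the standard-library parser alone. This sketch parses the same example data; it deliberately keeps every row, including the Dip. record, because filtering is a separate, explicit step:

```python
from html.parser import HTMLParser

# The same illustrative fragment shown above, compacted.
FRAGMENT = """
<tr><td>Sen.</td><td>María Example</td><td>Parliamentary Group A</td><td>In favor</td></tr>
<tr><td>Sen.</td><td>José Example</td><td>Parliamentary Group B</td><td>Against</td></tr>
<tr><td>Dip.</td><td>Non-senatorial record</td><td>Parliamentary Group C</td><td>Abstention</td></tr>
"""

class RowParser(HTMLParser):
    """Collects each <tr> as a list of its <td> cell texts."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], None, False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True
    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False
    def handle_data(self, data):
        if self._in_td and self._row is not None:
            self._row.append(data.strip())

parser = RowParser()
parser.feed(FRAGMENT)
```

Nothing here decides senatorial membership yet: the parser's only job is to turn an interface-shaped response into auditable rows.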

Extraction under friction: Scrapy and the Incapsula WAF

Extraction happens under friction.

The portal sits behind the Incapsula WAF, a layer that may introduce blocks, variable responses, unstable sessions, or incomplete access. That friction forces every response to be treated as evidence, not as final truth.

An accessible page does not automatically equal a correct page. A received fragment does not guarantee that the source is complete. An absence may mean there is no data, that the HTML does not expose it, that the response was partial, or that the metadata is not auditable from that route.

Extraction produces an operational base. Later validation decides which part of that base can stand as a senatorial dataset.
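That stance can be made concrete by classifying every response before it is allowed into the operational base. A sketch under stated assumptions: the marker strings are hypothetical fingerprints, and real Incapsula challenge pages would need their own, empirically confirmed markers.

```python
# Assumed fingerprints of a WAF challenge page; placeholders, not
# verified markers of the real portal's responses.
BLOCK_MARKERS = ("Incapsula incident", "_Incapsula_Resource")

def classify_response(status: int, body: str) -> str:
    """Label a response as evidence before it enters the dataset."""
    if status != 200:
        return "retry"        # transport-level problem, try again later
    if any(marker in body for marker in BLOCK_MARKERS):
        return "blocked"      # WAF challenge received, not content
    if "<tr" not in body:
        return "empty"        # reached the view, but no table signal
    return "candidate"        # plausible content; still needs validation

labels = [
    classify_response(200, "<tr><td>Sen.</td></tr>"),
    classify_response(200, "<html>_Incapsula_Resource</html>"),
    classify_response(503, ""),
    classify_response(200, "<html><body></body></html>"),
]
```

The four labels separate "no access", "fake access", "access without signal", and "possible signal", which is the distinction the paragraph above demands.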

Persistence: from fragments to SQLite

The next step turns scattered fragments into a queryable SQLite database.

There, information stops depending on institutional navigation and becomes organized as related entities: voting events, roll-call votes, senators identified in votes, available profiles, parties, vote directions, and detectable anomalies.
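A minimal sketch of that persistence layer, using illustrative table and column definitions (only `votos_nominales` is a name confirmed by the project; the rest are assumptions):

```python
import sqlite3

# Illustrative schema; the repository's real tables and columns may differ.
SCHEMA = """
CREATE TABLE votaciones (
    id INTEGER PRIMARY KEY,
    fecha TEXT,
    descripcion TEXT
);
CREATE TABLE perfiles (
    senador_id INTEGER PRIMARY KEY,
    nombre TEXT
);
CREATE TABLE votos_nominales (
    votacion_id INTEGER REFERENCES votaciones(id),
    senador_id INTEGER,
    partido TEXT,
    sentido TEXT
);
"""

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real file
conn.executescript(SCHEMA)
conn.execute("INSERT INTO votaciones VALUES (1, '2024-09-01', 'Example event')")
conn.execute("INSERT INTO votos_nominales VALUES (1, 42, 'Group A', 'In favor')")
(count,) = conn.execute("SELECT COUNT(*) FROM votos_nominales").fetchone()
```

Once the fragments live in related tables, every count in the table below becomes a single query rather than a manual tally.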

The confirmed result contains:

Metric                                     Confirmed value
Voting events                              4,993
Senatorial roll-call votes                 454,094
Senator IDs present in votes               737
Available profiles                         700
IDs in votes without associated profile    37
Voting events with no vote records         7
Empty votes                                14
Empty parties                              88

The difference between 737 IDs present in votes and 700 available profiles is not hidden. It is part of the dataset’s real state: 37 senator IDs appear in votes but have no associated profile in the available layer.
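That gap is a single LEFT JOIN away. A self-contained sketch with a toy database (table and column names are illustrative stand-ins for the real schema), where one voter has no profile:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE votos_nominales (senador_id INTEGER, sentido TEXT);
CREATE TABLE perfiles (senador_id INTEGER PRIMARY KEY, nombre TEXT);
INSERT INTO votos_nominales VALUES (1, 'In favor'), (2, 'Against'), (3, 'Abstention');
INSERT INTO perfiles VALUES (1, 'María Example'), (2, 'José Example');
""")

# IDs that appear in votes but have no associated profile
# (the shape of the real dataset's 37 unmatched IDs).
missing = [row[0] for row in conn.execute("""
    SELECT DISTINCT v.senador_id
    FROM votos_nominales AS v
    LEFT JOIN perfiles AS p ON p.senador_id = v.senador_id
    WHERE p.senador_id IS NULL
    ORDER BY v.senador_id
""")]
```

Because the query only reads, the 37 unmatched IDs stay visible as a property of the data instead of being patched over.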

Read-only validation

Validation does not modify the database.

Its function is to read, count, cross-check, and flag. That decision matters: when an institutional source is incomplete or irregular, silently correcting it may destroy evidence about the original source.

The read-only validator checks internal consistency without rewriting the data. It marks anomalies, confirms counts, and separates real problems from acceptable warnings.

Confirmed findings include seven voting events with no vote records:

408, 621, 801, 848, 3697, 3698, 3848

It also records 14 empty votes and 88 empty parties.

These cases do not automatically invalidate the complete dataset. They work as boundary marks: they show where coverage drops, where the source does not provide everything, or where extraction must not invent information.
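The validator itself can be sketched as a set of counting queries over a connection that never writes. In real usage the database would be opened with `sqlite3.connect("file:db.sqlite?mode=ro", uri=True)` to enforce read-only access at the driver level; here an in-memory toy database stands in, and the table names are illustrative:

```python
import sqlite3

# Toy database: two voting events, one of which has no vote records,
# plus one vote with an empty direction.
src = sqlite3.connect(":memory:")
src.executescript("""
CREATE TABLE votaciones (id INTEGER PRIMARY KEY);
CREATE TABLE votos_nominales (votacion_id INTEGER, senador_id INTEGER, sentido TEXT);
INSERT INTO votaciones VALUES (1), (2);
INSERT INTO votos_nominales VALUES (1, 42, 'In favor'), (1, 43, '');
""")

def validate(conn) -> dict:
    """Read, count, and flag; never issue INSERT/UPDATE/DELETE."""
    findings = {}
    findings["events_without_votes"] = [r[0] for r in conn.execute(
        "SELECT id FROM votaciones WHERE id NOT IN "
        "(SELECT DISTINCT votacion_id FROM votos_nominales)")]
    (findings["empty_votes"],) = conn.execute(
        "SELECT COUNT(*) FROM votos_nominales "
        "WHERE sentido IS NULL OR sentido = ''").fetchone()
    return findings

report = validate(src)
```

The output is a report, not a repair: the anomaly stays in the database, and only the finding travels forward.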

Audit: warnings are not automatic failures

The audit separates critical errors from accepted warnings.

A warning does not automatically equal a failure. It may indicate a real absence in the source, an unavailable profile, incomplete metadata, a non-auditable HTML piece, or an inconsistency inherited from the portal.

The difference is central: the goal is not to manufacture a perfect database, but to state precisely what is validated, what is incomplete, and what must be read with caution.

Level              Operational reading
Validated          Count, cross-check, or rule confirmed by the read-only audit.
Accepted warning   Known anomaly that keeps context and does not invalidate the complete set.
Requires review    Case that needs additional inspection before being used as strong evidence.

The operating thesis is simple: a hostile institutional source becomes a dataset when every transformation leaves a trace, every gap is counted, and every warning keeps its context.

The senatorial contract: Sen. is not Dip.

The dataset does not try to copy the institutional source literally. It translates it into a verifiable senatorial contract.

The central rule is this: votos_nominales contains the nominal subset of senators, identified by Sen. records. It is not a replica of the raw AJAX response when the institutional response mixes other offices.

This matters because some responses may include Dip. records. Those records may exist in the institutional raw source, but they do not belong to this dataset’s senatorial contract.

Therefore:

Institutional input   Rule                                         Result in votos_nominales
Sen. record           Belongs to the senatorial subset             Kept
Dip. record           Does not belong to the senatorial contract   Excluded
Mixed response        Filtered by senatorial membership            Not fully replicated

The audited cases 891, 3450, and 4890 set this boundary with concrete examples. The dataset’s validity does not depend on assuming that every AJAX response is senatorial, but on applying an explicit membership rule.
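The membership rule reduces to a single explicit predicate over extracted rows. A minimal sketch, assuming rows shaped like the parsed table fragments (office marker in the first cell):

```python
# The senatorial contract as code: only rows whose office marker is
# "Sen." enter votos_nominales; "Dip." rows from mixed responses are
# excluded, not relabeled and not silently replicated.

def senatorial_subset(rows: list[list[str]]) -> list[list[str]]:
    """Apply the explicit senatorial membership rule."""
    return [row for row in rows if row and row[0] == "Sen."]

mixed = [
    ["Sen.", "María Example", "Parliamentary Group A", "In favor"],
    ["Dip.", "Non-senatorial record", "Parliamentary Group C", "Abstention"],
]
kept = senatorial_subset(mixed)
```

Keeping the rule as one named function makes the contract auditable: any row in the dataset can be traced back to a predicate that either held or did not.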

Popolo boundary

Popolo appears here as an external boundary.

camara-senadores-mex validates senatorial information extracted from the institutional portal. That information may later interact with a Popolo representation, but this document does not enter the internals of popolo-senadores-mex.

The separation avoids mixing two responsibilities:

  • here: reconstructing and validating senatorial data from a hostile institutional source;
  • outside: representing people, offices, or relationships under a Popolo structure.

Popolo works as an external output or comparison box. It is not the primary proof of truth for the senatorial scraping.

Guarantees and limits

Guarantees

  • The dataset contains 4,993 voting events and 454,094 confirmed senatorial roll-call votes.
  • votos_nominales represents the nominal subset of senators (Sen.).
  • Mixed AJAX responses are not fully replicated if they include Dip. records.
  • Cases 891, 3450, and 4890 were used as audit points to set the Sen./Dip. boundary.
  • Read-only validation counts anomalies without silently modifying the database.
  • Known gaps remain visible: 37 IDs without profile, 7 voting events with no vote records, 14 empty votes, and 88 empty parties.

Limits

  • The institutional source is protected by the Incapsula WAF, which introduces friction and access variability.
  • HTML is not always sufficient as final evidence.
  • Metadata and profiles may be incomplete or not always auditable.
  • An accepted warning does not automatically equal a failure.
  • The dataset does not promise absolute completeness of the entire institutional publication.
  • Popolo remains outside this validation; it is treated as an external boundary.