High-Level Overview
The Observatorio del Congreso is a quantitative analysis platform for Mexico’s legislative branch (Cámara de Diputados + Senado de la República). It uses a unified Popolo-Graph schema stored in SQLite to model legislators, parties, votes, and informal power networks across seven legislatures (LX through LXVI, 2006-2027). The dataset covers approximately 3.5 million individual votes, 9,437 vote events, and 4,840 persons. The codebase has 302 passing tests and runs on Python 3.12.
Pipeline
┌─────────────────────────────────────────────────────────────────────┐
│ DATA COLLECTION │
│ │
│ ┌──────────────────────┐ ┌──────────────────────────────────┐ │
│ │ Senado Scraper │ │ Diputados Scraper │ │
│ │ curl_cffi + TLS │ │ httpx + BeautifulSoup │ │
│ │ fingerprint │ │ │ │
│ │ (Anti-WAF: │ │ SITL / INFOPAL open portal │ │
│ │ Incapsula bypass) │ │ + datos.abiertos API │ │
│ └──────────┬───────────┘ └──────────────┬───────────────────┘ │
│ │ │ │
└─────────────┼──────────────────────────────────┼────────────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────────────────────┐
│ PARSE & LOAD │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Transformers → Loaders (deduplication via source_id) │ │
│ └──────────────────────────────┬───────────────────────────────┘ │
│ │
└─────────────────────────────────┼───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STORAGE LAYER │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ SQLite (WAL mode) — congreso.db │ │
│ │ Popolo-Graph Schema: 12 tables │ │
│ │ area · organization · person · membership · post │ │
│ │ motion · vote_event · vote · count │ │
│ │ actor_externo · relacion_poder · evento_politico │ │
│ └──────────────────────────────┬───────────────────────────────┘ │
│ │
└─────────────────────────────────┼───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ ANALYSIS LAYER │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────────┐ │
│ │ W-NOMINATE │ │ Co-voting │ │ Community Detection │ │
│ │ (scipy, │ │ Matrix │ │ (nx.community, │ │
│ │ numpy) │ │ & Graph │ │ built-in Louvain) │ │
│ └──────┬───────┘ └──────┬───────┘ └───────────┬──────────────┘ │
│ │ │ │ │
│ ┌──────┴───────┐ ┌──────┴───────┐ ┌───────────┴──────────────┐ │
│ │ Centrality │ │ Power │ │ Empirical Power │ │
│ │ (degree, │ │ Indices │ │ (from real voting │ │
│ │ betweenness)│ │ (Shapley- │ │ coalitions) │ │
│ │ │ │ Shubik, │ │ │ │
│ │ │ │ Banzhaf) │ │ │ │
│ └──────┬───────┘ └──────┬───────┘ └───────────┬──────────────┘ │
│ │ │ │ │
└─────────┼──────────────────┼───────────────────────┼─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────┐
│ EXPORT LAYER │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ JSON files → public/data/observatorio/ │ │
│ │ Pre-aggregated, static, no server-side computation │ │
│ └──────────────────────────────┬───────────────────────────────┘ │
│ │
└─────────────────────────────────┼───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ VISUALIZATION LAYER │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ CachorroSpace (Astro + Starlight) │ │
│ │ ECharts 6 via React islands │ │
│ │ Interactive charts: NOMINATE maps, co-voting graphs, │ │
│ │ power indices, community structures │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Scraping (Senado) | curl_cffi + TLS fingerprint impersonation | Anti-WAF evasion for Incapsula-protected portal |
| Scraping (Diputados) | httpx + BeautifulSoup | Open data portal scraping (SITL / INFOPAL) |
| Database | SQLite (WAL mode) | Unified Popolo-Graph storage |
| Build system | hatchling | Installable package via pyproject.toml |
| Analysis — NOMINATE | scipy, numpy, matplotlib | Ideal point estimation (W-NOMINATE algorithm) |
| Analysis — Networks | networkx (built-in Louvain) | Co-voting graphs, community detection |
| Analysis — Power | numpy, scipy | Shapley-Shubik O(n²W) DP, Banzhaf indices |
| Exports | JSON (static) | Pre-aggregated data for visualizations |
| Visualizations | ECharts 6 (React islands) | Interactive charts on CachorroSpace |
| Logging | Python logging (centralized) | Structured logging via runner_utils.setup_logging() |
:::tip
All analysis runs offline against the SQLite database. There is no server-side computation at visualization time — JSON exports are pre-computed and served as static files.
:::
Data Sources
| Source | URL | Chamber | Data |
|---|---|---|---|
| Cámara de Diputados | datos_abiertos / SITL / INFOPAL | Diputados | Voting records, legislator profiles, composition |
| Senado de la República | senado.gob.mx/66/ | Senado | Voting records, senator profiles, directorio |
:::note
The Senado portal is protected by Incapsula WAF. The scraper uses curl_cffi with impersonate="chrome" to bypass TLS fingerprint detection. The Diputados portal is open-access and uses standard HTTP requests via httpx.
:::
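The anti-WAF request pattern can be sketched in a few lines, assuming curl_cffi is installed. The URL handling, timeout, and error policy below are illustrative, not the project's actual client.py configuration:

```python
# Sketch of the Incapsula-bypass request pattern; curl_cffi exposes a
# requests-like API with TLS fingerprint impersonation.
try:
    from curl_cffi import requests
except ImportError:
    requests = None  # curl_cffi may be absent outside the scraper environment

def fetch_senado_page(url: str) -> str:
    if requests is None:
        raise RuntimeError("curl_cffi is required for Senado scraping")
    # impersonate="chrome" presents a Chrome TLS fingerprint, which is what
    # lets the request past Incapsula's TLS-based bot detection.
    response = requests.get(url, impersonate="chrome", timeout=30)
    response.raise_for_status()
    return response.text
```

By contrast, the Diputados portal needs no impersonation and is served by a plain httpx client.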
Data Flow
The pipeline processes data in four stages:
1. Scrape and Parse
Each chamber has a dedicated scraper with its own HTTP client, parser, and transformer modules:
- Senado:
- Senado: a curl_cffi session with TLS impersonation retrieves voting pages. Parsers extract vote data from HTML. Transformers normalize data into Popolo-Graph format.
- Diputados: an httpx client with file-based caching and rate limiting queries the SITL/INFOPAL systems. Parsers handle both XML and HTML responses.
2. Load into SQLite
Data flows through loaders that insert records into congreso.db with deduplication via the source_id column on the vote_event table. The id_generator module produces human-readable IDs with prefixes (P01, O01, VE01, etc.).
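The deduplication pattern can be sketched with sqlite3 directly. The vote_event columns below are simplified, not the full Popolo-Graph schema; what matters is the UNIQUE constraint on source_id plus an ON CONFLICT clause, which makes re-running a scrape idempotent:

```python
import sqlite3

# Simplified vote_event table: a UNIQUE source_id backs the dedup logic.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE vote_event (
        id TEXT PRIMARY KEY,
        source_id TEXT UNIQUE NOT NULL,
        fecha TEXT
    )
""")

def load_vote_event(conn, ve_id, source_id, fecha):
    # ON CONFLICT DO NOTHING skips rows whose source_id was already loaded,
    # so re-scraping the same page inserts nothing the second time.
    conn.execute(
        "INSERT INTO vote_event (id, source_id, fecha) VALUES (?, ?, ?) "
        "ON CONFLICT(source_id) DO NOTHING",
        (ve_id, source_id, fecha),
    )

load_vote_event(conn, "VE01", "senado-66-2024-001", "2024-09-01")
load_vote_event(conn, "VE02", "senado-66-2024-001", "2024-09-01")  # duplicate
count = conn.execute("SELECT COUNT(*) FROM vote_event").fetchone()[0]
# count == 1: the duplicate was skipped
```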
3. Analysis
Analysis scripts read from SQLite and compute:
- W-NOMINATE: Ideal point estimation placing legislators on a 2D ideological map
- Co-voting matrix: Pairwise agreement rates between legislators, exported as weighted graphs
- Community detection: Louvain algorithm (via nx.community) identifies voting blocs within co-voting networks
- Centrality: Degree and betweenness centrality measures on co-voting graphs
- Power indices: Shapley-Shubik and Banzhaf indices based on seat distributions
- Empirical power: Measured from real voting coalition data, not just seat counts
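The Shapley-Shubik step above can be sketched as a counting dynamic program rather than an enumeration of all n! orderings. This is a minimal sketch, not the project's poder_partidos.py code: it recomputes the size-by-weight table for each party, which is a factor of n slower than the shared-table O(n²W) version the project describes, but the counting logic is the same:

```python
from math import factorial

def shapley_shubik(weights, quota):
    # For each party i, count the coalitions of the remaining parties,
    # by size s and total weight w, that are losing without i (w < quota)
    # but winning with i (w + w_i >= quota). Each such coalition of size s
    # corresponds to s! * (n-1-s)! orderings in which i is pivotal.
    n = len(weights)
    result = []
    for i, wi in enumerate(weights):
        # f[s][w] = number of subsets of the other parties with size s and
        # total weight w; weights >= quota collapse into the last bucket.
        f = [[0] * (quota + 1) for _ in range(n)]
        f[0][0] = 1
        for j, wj in enumerate(weights):
            if j == i:
                continue
            for s in range(n - 2, -1, -1):
                for w in range(quota, -1, -1):
                    if f[s][w]:
                        f[s + 1][min(w + wj, quota)] += f[s][w]
        power = 0.0
        for s in range(n):
            swings = sum(f[s][w] for w in range(max(0, quota - wi), quota))
            power += swings * factorial(s) * factorial(n - 1 - s)
        result.append(power / factorial(n))
    return result

# Example: one party with 2 seats, two with 1 seat each, quota 3.
powers = shapley_shubik([2, 1, 1], quota=3)
# -> approximately [2/3, 1/6, 1/6]
```

The indices always sum to 1, which is a convenient sanity check on any implementation.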
4. Export and Visualize
The export_observatorio_json.py script reads analysis CSV outputs and produces static JSON files consumed by ECharts 6 visualizations embedded as React islands in CachorroSpace.
analysis/output/*.csv
│
▼
export_observatorio_json.py
│
▼
public/data/observatorio/*.json
│
▼
React ECharts islands (CachorroSpace)
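The CSV-to-JSON step can be sketched with the standard library alone. File names and CSV columns here are illustrative; the real export_observatorio_json.py does chart-specific shaping on top of this basic pattern:

```python
import csv
import json
from pathlib import Path

def export_csv_to_json(csv_path: Path, json_path: Path) -> None:
    # Read an analysis CSV and emit it as a static JSON array of records,
    # matching the "pre-aggregated, no server-side computation" design.
    with open(csv_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    json_path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps accents and ñ readable in the output files.
    json_path.write_text(json.dumps(rows, ensure_ascii=False), encoding="utf-8")
```

Because the output is plain static JSON, the visualization layer only ever issues GET requests for files under public/data/observatorio/.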
Project Structure
observatorio-congreso/
├── pyproject.toml # hatchling build-system, deps, ruff config
│
├── scraper_congreso/ # Installable package (pip install -e .)
│ ├── __init__.py
│ ├── diputados/ # Chamber of Deputies scraper
│ │ ├── __init__.py
│ │ ├── __main__.py # python -m scraper_congreso.diputados
│ │ ├── client.py # httpx HTTP client with SHA256 cache
│ │ ├── config.py # Legislatures + party mappings
│ │ ├── models.py # Pydantic data models
│ │ ├── pipeline.py # Main scraping pipeline
│ │ ├── loader.py # SQLite loader (dedup via source_id)
│ │ ├── legislatura.py # Legislature range logic
│ │ ├── transformers.py # SITL → Popolo-Graph normalization
│ │ └── parsers/
│ │ ├── votaciones.py # Vote event parser
│ │ ├── nominal.py # Roll-call vote parser
│ │ ├── desglose.py # Vote breakdown parser
│ │ ├── diputado.py # Legislator profile parser
│ │ └── composicion.py # Chamber composition parser
│ │
│ ├── senadores/ # Senate scraper
│ │ ├── __init__.py
│ │ ├── client.py # Anti-WAF client (curl_cffi, 6 fingerprints)
│ │ ├── config.py # Scraper configuration
│ │ ├── models.py # Shared data models
│ │ ├── votaciones/ # Voting records scraper
│ │ │ ├── __init__.py
│ │ │ ├── __main__.py # python -m scraper_congreso.senadores.votaciones
│ │ │ ├── cli.py # CLI entry point
│ │ │ ├── loader.py # SQLite loader
│ │ │ ├── transformers.py # Data normalization
│ │ │ └── parsers/
│ │ │ └── lxvi_portal.py # Portal /66/ parser (GET + POST AJAX)
│ │ └── perfiles/ # Senator profiles scraper
│ │ ├── __init__.py
│ │ ├── __main__.py # python -m scraper_congreso.senadores.perfiles
│ │ ├── scraper.py # Profile scraper logic
│ │ └── parsers/
│ │ └── perfil_parser.py
│ │
│ └── utils/ # Shared utilities
│ ├── __init__.py
│ ├── base_loader.py # BaseLoader (shared SQLite patterns)
│ ├── db_helpers.py # DB helper functions
│ ├── db_utils.py # DB utility functions
│ ├── id_generator.py # Human-readable IDs (P01, O01, VE01...)
│ ├── text_utils.py # Text normalization
│ ├── config.py # Shared config
│ └── logging_config.py # Logging configuration
│
├── analysis/ # 28 modules (~13.8K lines)
│ ├── constants.py # PARTY_COLORS, ORG_TO_SHORT, PARTY_ORDER, COLORES_WEB
│ ├── config.py # 8 tuneable parameters (thresholds, seeds, IDs)
│ ├── db.py # Data access layer (get_connection + 5 parametrized queries)
│ ├── runner_utils.py # Shared logging, argparse, run_for_cameras
│ ├── nominate.py # W-NOMINATE implementation
│ ├── covotacion.py # Co-voting matrix and graph
│ ├── covotacion_dinamica.py # Dynamic time-windowed co-voting (829 lines)
│ ├── comunidades.py # Louvain via nx.community (seed=42)
│ ├── centralidad.py # Degree and betweenness centrality
│ ├── poder_partidos.py # Shapley-Shubik O(n²W) DP + Banzhaf
│ ├── poder_empirico.py # Empirical power from real votes
│ ├── evolucion_partidos.py # Party evolution analysis
│ ├── efecto_genero.py # Gender effect analysis
│ ├── efecto_curul_tipo.py # Seat type effect analysis
│ ├── trayectorias.py # Individual legislator trajectories
│ ├── visualizacion.py # General visualization exports
│ ├── visualizacion_nominate.py
│ ├── visualizacion_dinamica.py
│ ├── visualizacion_poder.py
│ ├── visualizacion_articulo.py
│ ├── run_analysis.py # Run all analyses
│ ├── run_nominate.py # Run NOMINATE only
│ ├── run_covotacion_dinamica.py
│ ├── run_evolucion_partidos.py
│ ├── run_efecto_genero.py
│ ├── run_efecto_curul_tipo.py
│ └── run_trayectorias.py
│
├── db/
│ ├── schema.sql # Synchronized schema (18 indexes, 14 FKs, corrected CHECKs)
│ ├── init_db.py # PRAGMA FK ON + seed data
│ ├── constants.py # LEGISLATURAS_ORDERED, CAMARA_IDS, party mappings
│ ├── congreso.db # SQLite database (~337MB)
│ ├── migrations/ # 25 documented migrations (all applied, idempotent)
│ │ └── README.md # Migration docs
│ └── archived/ # Obsolete files (senado_schema.sql, legacy helpers)
│
├── tests/ # 302 tests (passing)
│
├── scripts/
│ ├── mantener.sh # Project maintenance script
│ ├── backup_db.sh # Database backup
│ └── clean_cache.sh # Cache cleanup
│
└── cache/ # HTTP response cache
Database Configuration
SQLite is configured for safe concurrent access and data integrity:
PRAGMA foreign_keys = ON;
PRAGMA encoding = "UTF-8";
PRAGMA journal_mode = WAL;
PRAGMA busy_timeout = 5000;
PRAGMA foreign_keys = ON is enforced in both db/init_db.py and analysis/db.py, ensuring referential integrity regardless of entry point.
| Setting | Value | Purpose |
|---|---|---|
| journal_mode | WAL | Concurrent reads without blocking writes |
| foreign_keys | ON | Enforce referential integrity between tables |
| busy_timeout | 5000ms | Wait up to 5 seconds if database is locked |
| encoding | UTF-8 | Correct handling of Spanish characters (accents, ñ) |
The schema defines 14 foreign keys with explicit ON DELETE / ON UPDATE actions: 3 use CASCADE (for dependent records that should propagate deletions) and 11 use RESTRICT (to prevent orphaned references).
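A minimal sketch of a connection helper applying these settings. The real helpers live in db/init_db.py and analysis/db.py; this signature is illustrative. Note that foreign_keys is a per-connection setting in SQLite and must be re-enabled on every connection, while journal_mode=WAL persists in the database file:

```python
import sqlite3

def get_connection(db_path: str = "db/congreso.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path, timeout=5.0)
    # Per-connection setting: silently OFF by default, so always re-enable.
    conn.execute("PRAGMA foreign_keys = ON")
    # Persistent setting: re-issuing it on an already-WAL database is harmless.
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA busy_timeout = 5000")
    return conn
```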
Schema Overview
The Popolo-Graph schema contains 12 tables with 18 indexes, 14 foreign keys, and 5 corrected CHECK constraints. It is organized into four groups:
Core Popolo entities (legislative data standard):
| Table | Purpose |
|---|---|
| area | Geographic divisions (states, districts, constituencies) |
| organization | Political parties, blocs, coalitions, institutions |
| person | Legislators and political actors |
| membership | Person-to-organization relationships with roles and dates |
| post | Legislative positions within organizations and areas |
| motion | Bills and legislative initiatives |
| vote_event | Specific voting instances (chamber + date) |
| vote | Individual legislator votes per event |
| count | Aggregated vote counts per group per event |
Power network extensions (beyond standard Popolo):
| Table | Purpose |
|---|---|
| actor_externo | External actors (governors, party leaders, judges) |
| relacion_poder | Informal power relationships (loyalty, pressure, alliances) |
| evento_politico | Political events that affect power dynamics |
:::note
All tables use human-readable IDs with prefixes (P01 for person, O01 for organization, VE01 for vote event, etc.). This makes debugging and manual queries significantly easier than opaque integer primary keys.
:::
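The prefix scheme can be sketched in one line. This is a hypothetical reconstruction; the real id_generator module may pad widths or track sequence counters differently:

```python
def make_id(prefix: str, n: int) -> str:
    # Hypothetical sketch of the prefixed-ID scheme (P01, O01, VE01, ...):
    # entity prefix plus a zero-padded sequence number.
    return f"{prefix}{n:02d}"

# make_id("P", 1) -> "P01"; make_id("VE", 12) -> "VE12"
```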
Schema Maintenance
The db/migrations/ directory contains 25 documented migration scripts, all applied and idempotent. Obsolete schema files (such as the former senado_schema.sql and legacy helper scripts) are preserved in db/archived/ for reference.
Indexes
The schema includes 18 indexes covering the most common query patterns:
- membership: queries by person and by organization
- vote_event: lookups by motion and by source_id (deduplication)
- vote: queries by voter and by event
- count: queries by event and by group
- relacion_poder: queries by source, target, and type
- person: filtering by internal faction (corriente_interna)
Integrity Constraints
Date validation CHECK constraints ensure end_date >= start_date on person and membership tables for both inserts and updates. These constraints enforce data integrity at the SQLite level regardless of which loader writes the data.
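A self-contained illustration of the pattern, using a simplified person table rather than the real schema; the CHECK clause behaves the same way regardless of the other columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        id TEXT PRIMARY KEY,
        start_date TEXT,
        end_date TEXT,
        -- NULL dates are allowed; when both are present, the range must be valid.
        CHECK (end_date IS NULL OR start_date IS NULL OR end_date >= start_date)
    )
""")
conn.execute("INSERT INTO person VALUES ('P01', '2018-09-01', '2021-08-31')")  # ok
try:
    conn.execute("INSERT INTO person VALUES ('P02', '2021-09-01', '2018-08-31')")
except sqlite3.IntegrityError:
    pass  # rejected at the SQLite level: end_date precedes start_date
```

Because the constraint lives in the schema, it applies no matter which loader (or manual query) writes the row.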
Data Volumes
| Metric | Value |
|---|---|
| Individual votes | ~3,510,053 |
| Vote events | ~9,437 |
| Persons | ~4,840 |
| Organizations | ~20+ |
| Legislatures | 7 (LX through LXVI, 2006-2027) |
| Tests | 302 passing |
| Migration scripts | 25 (all applied) |
Build System
The project uses pyproject.toml with hatchling as its build backend, making the scraper installable as a package (pip install -e .).
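A typical hatchling configuration has the shape below. This is the standard form of such a file, not the project's actual pyproject.toml, which also carries dependencies and ruff configuration:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "scraper-congreso"   # illustrative name
version = "0.0.0"           # illustrative version
requires-python = ">=3.12"
```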
Entry Points
python -m scraper_congreso.diputados # Scrape Diputados
python -m scraper_congreso.senadores.votaciones # Scrape Senate votes
python -m scraper_congreso.senadores.perfiles # Scrape Senate profiles
Dependencies
Core (scraper): curl_cffi, httpx, beautifulsoup4, lxml, pydantic
Dev: pytest, ruff
Analysis: numpy, pandas, scipy, networkx, matplotlib, polars