High-Level Overview
The Observatorio del Congreso is a quantitative analysis platform for Mexico’s legislative branch (Cámara de Diputados + Senado de la República). It uses a unified Popolo-Graph schema stored in SQLite to model legislators, parties, votes, and informal power networks across seven legislatures (LX through LXVI, 2006-2027). The dataset covers approximately 3.5 million individual votes, 9,437 vote events, and 4,840 persons. The codebase has 302 passing tests and runs on Python 3.12.
Pipeline
┌─────────────────────────────────────────────────────────────────────┐
│ DATA COLLECTION │
│ │
│ ┌──────────────────────┐ ┌──────────────────────────────────┐ │
│ │ Senado Scraper │ │ Diputados Scraper │ │
│ │ curl_cffi + TLS │ │ httpx + BeautifulSoup │ │
│ │ fingerprint │ │ │ │
│ │ (Anti-WAF: │ │ SITL / INFOPAL open portal │ │
│ │ Incapsula bypass) │ │ + datos.abiertos API │ │
│ └──────────┬───────────┘ └──────────────┬───────────────────┘ │
│ │ │ │
└─────────────┼──────────────────────────────────┼────────────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────────────────────┐
│ PARSE & LOAD │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Transformers → Loaders (deduplication via source_id) │ │
│ └──────────────────────────────┬───────────────────────────────┘ │
│ │
└─────────────────────────────────┼───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STORAGE LAYER │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ SQLite (WAL mode) — congreso.db │ │
│ │ Popolo-Graph Schema: 12 tables │ │
│ │ area · organization · person · membership · post │ │
│ │ motion · vote_event · vote · count │ │
│ │ actor_externo · relacion_poder · evento_politico │ │
│ └──────────────────────────────┬───────────────────────────────┘ │
│ │
└─────────────────────────────────┼───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ ANALYSIS LAYER │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────────┐ │
│ │ W-NOMINATE │ │ Co-voting │ │ Community Detection │ │
│ │ (scipy, │ │ Matrix │ │ (nx.community, │ │
│ │ numpy) │ │ & Graph │ │ built-in Louvain) │ │
│ └──────┬───────┘ └──────┬───────┘ └───────────┬──────────────┘ │
│ │ │ │ │
│ ┌──────┴───────┐ ┌──────┴───────┐ ┌───────────┴──────────────┐ │
│ │ Centrality │ │ Power │ │ Empirical Power │ │
│ │ (degree, │ │ Indices │ │ (from real voting │ │
│ │ betweenness)│ │ (Shapley- │ │ coalitions) │ │
│ │ │ │ Shubik, │ │ │ │
│ │ │ │ Banzhaf) │ │ │ │
│ └──────┬───────┘ └──────┬───────┘ └───────────┬──────────────┘ │
│ │ │ │ │
└─────────┼──────────────────┼───────────────────────┼─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────┐
│ EXPORT LAYER │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ JSON files → public/data/observatorio/ │ │
│ │ Pre-aggregated, static, no server-side computation │ │
│ └──────────────────────────────┬───────────────────────────────┘ │
│ │
└─────────────────────────────────┼───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ VISUALIZATION LAYER │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ CachorroSpace (Astro + Starlight) │ │
│ │ ECharts 6 via React islands │ │
│ │ Interactive charts: NOMINATE maps, co-voting graphs, │ │
│ │ power indices, community structures │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Scraping (Senado) | curl_cffi + TLS fingerprint impersonation | Anti-WAF evasion for Incapsula-protected portal |
| Scraping (Diputados) | httpx + BeautifulSoup | Open data portal scraping (SITL / INFOPAL) |
| Database | SQLite (WAL mode) | Unified Popolo-Graph storage |
| Build system | hatchling | Installable package via pyproject.toml |
| Analysis — NOMINATE | scipy, numpy, matplotlib | Ideal point estimation (W-NOMINATE algorithm) |
| Analysis — Networks | networkx (built-in Louvain) | Co-voting graphs, community detection |
| Analysis — Power | numpy, scipy | Shapley-Shubik O(n²W) DP, Banzhaf indices |
| Exports | JSON (static) | Pre-aggregated data for visualizations |
| Visualizations | ECharts 6 (React islands) | Interactive charts on CachorroSpace |
| Logging | Python logging (centralized) | Structured logging via runner_utils.setup_logging() |
:::tip
All analysis runs offline against the SQLite database. There is no server-side computation at visualization time — JSON exports are pre-computed and served as static files.
:::
Data Sources
| Source | URL | Chamber | Data |
|---|---|---|---|
| Cámara de Diputados | datos_abiertos / SITL / INFOPAL | Diputados | Voting records, legislator profiles, composition |
| Senado de la República | senado.gob.mx/66/ | Senado | Voting records, senator profiles, directorio |
:::note
The Senado portal is protected by Incapsula WAF. The scraper uses curl_cffi with impersonate="chrome" to bypass TLS fingerprint detection. The Diputados portal is open-access and uses standard HTTP requests via httpx.
:::
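The anti-WAF request pattern can be sketched in a few lines, assuming curl_cffi is installed. The URL handling, timeout, and error policy below are illustrative, not the project's actual client.py configuration:

```python
# Sketch of the Incapsula-bypass request pattern; curl_cffi exposes a
# requests-like API with TLS fingerprint impersonation.
try:
    from curl_cffi import requests
except ImportError:
    requests = None  # curl_cffi may be absent outside the scraper environment

def fetch_senado_page(url: str) -> str:
    if requests is None:
        raise RuntimeError("curl_cffi is required for Senado scraping")
    # impersonate="chrome" presents a Chrome TLS fingerprint, which is what
    # lets the request past Incapsula's TLS-based bot detection.
    response = requests.get(url, impersonate="chrome", timeout=30)
    response.raise_for_status()
    return response.text
```

By contrast, the Diputados portal needs no impersonation and is served by a plain httpx client.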
Data Flow
The pipeline processes data in four stages:
1. Scrape and Parse
Each chamber has a dedicated scraper with its own HTTP client, parser, and transformer modules:
- Senado:
- Senado: a curl_cffi session with TLS impersonation retrieves voting pages. Parsers extract vote data from HTML. Transformers normalize data into Popolo-Graph format.
- Diputados: an httpx client with file-based caching and rate limiting queries the SITL/INFOPAL systems. Parsers handle both XML and HTML responses.
2. Load into SQLite
Data flows through loaders that insert records into congreso.db with deduplication via the source_id column on the vote_event table. The id_generator module produces human-readable IDs with prefixes (P01, O01, VE01, etc.).
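The deduplication pattern can be sketched with sqlite3 directly. The vote_event columns below are simplified, not the full Popolo-Graph schema; what matters is the UNIQUE constraint on source_id plus an ON CONFLICT clause, which makes re-running a scrape idempotent:

```python
import sqlite3

# Simplified vote_event table: a UNIQUE source_id backs the dedup logic.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE vote_event (
        id TEXT PRIMARY KEY,
        source_id TEXT UNIQUE NOT NULL,
        fecha TEXT
    )
""")

def load_vote_event(conn, ve_id, source_id, fecha):
    # ON CONFLICT DO NOTHING skips rows whose source_id was already loaded,
    # so re-scraping the same page inserts nothing the second time.
    conn.execute(
        "INSERT INTO vote_event (id, source_id, fecha) VALUES (?, ?, ?) "
        "ON CONFLICT(source_id) DO NOTHING",
        (ve_id, source_id, fecha),
    )

load_vote_event(conn, "VE01", "senado-66-2024-001", "2024-09-01")
load_vote_event(conn, "VE02", "senado-66-2024-001", "2024-09-01")  # duplicate
count = conn.execute("SELECT COUNT(*) FROM vote_event").fetchone()[0]
# count == 1: the duplicate was skipped
```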
3. Analysis
Analysis scripts read from SQLite and compute:
- W-NOMINATE: Ideal point estimation placing legislators on a 2D ideological map
- Co-voting matrix: Pairwise agreement rates between legislators, exported as weighted graphs
- Community detection: Louvain algorithm (via nx.community) identifies voting blocs within co-voting networks
- Centrality: Degree and betweenness centrality measures on co-voting graphs
- Power indices: Shapley-Shubik and Banzhaf indices based on seat distributions
- Empirical power: Measured from real voting coalition data, not just seat counts
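The Shapley-Shubik step above can be sketched as a counting dynamic program rather than an enumeration of all n! orderings. This is a minimal sketch, not the project's poder_partidos.py code: it recomputes the size-by-weight table for each party, which is a factor of n slower than the shared-table O(n²W) version the project describes, but the counting logic is the same:

```python
from math import factorial

def shapley_shubik(weights, quota):
    # For each party i, count the coalitions of the remaining parties,
    # by size s and total weight w, that are losing without i (w < quota)
    # but winning with i (w + w_i >= quota). Each such coalition of size s
    # corresponds to s! * (n-1-s)! orderings in which i is pivotal.
    n = len(weights)
    result = []
    for i, wi in enumerate(weights):
        # f[s][w] = number of subsets of the other parties with size s and
        # total weight w; weights >= quota collapse into the last bucket.
        f = [[0] * (quota + 1) for _ in range(n)]
        f[0][0] = 1
        for j, wj in enumerate(weights):
            if j == i:
                continue
            for s in range(n - 2, -1, -1):
                for w in range(quota, -1, -1):
                    if f[s][w]:
                        f[s + 1][min(w + wj, quota)] += f[s][w]
        power = 0.0
        for s in range(n):
            swings = sum(f[s][w] for w in range(max(0, quota - wi), quota))
            power += swings * factorial(s) * factorial(n - 1 - s)
        result.append(power / factorial(n))
    return result

# Example: one party with 2 seats, two with 1 seat each, quota 3.
powers = shapley_shubik([2, 1, 1], quota=3)
# -> approximately [2/3, 1/6, 1/6]
```

The indices always sum to 1, which is a convenient sanity check on any implementation.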
4. Export and Visualize
The export_observatorio_json.py script reads analysis CSV outputs and produces static JSON files consumed by ECharts 6 visualizations embedded as React islands in CachorroSpace.
analysis/output/*.csv
│
▼
export_observatorio_json.py
│
▼
public/data/observatorio/*.json
│
▼
React ECharts islands (CachorroSpace)
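The CSV-to-JSON step can be sketched with the standard library alone. File names and CSV columns here are illustrative; the real export_observatorio_json.py does chart-specific shaping on top of this basic pattern:

```python
import csv
import json
from pathlib import Path

def export_csv_to_json(csv_path: Path, json_path: Path) -> None:
    # Read an analysis CSV and emit it as a static JSON array of records,
    # matching the "pre-aggregated, no server-side computation" design.
    with open(csv_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    json_path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps accents and ñ readable in the output files.
    json_path.write_text(json.dumps(rows, ensure_ascii=False), encoding="utf-8")
```

Because the output is plain static JSON, the visualization layer only ever issues GET requests for files under public/data/observatorio/.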
Project Structure
observatorio-congreso/
├── pyproject.toml # hatchling build-system, deps, ruff config
│
├── scraper_congreso/ # Installable package (pip install -e .)
│ ├── __init__.py
│ ├── diputados/ # Chamber of Deputies scraper
│ │ ├── __init__.py
│ │ ├── __main__.py # python -m scraper_congreso.diputados
│ │ ├── client.py # httpx HTTP client with SHA256 cache
│ │ ├── config.py # Legislatures + party mappings
│ │ ├── models.py # Pydantic data models
│ │ ├── pipeline.py # Main scraping pipeline
│ │ ├── loader.py # SQLite loader (dedup via source_id)
│ │ ├── legislatura.py # Legislature range logic
│ │ ├── transformers.py # SITL → Popolo-Graph normalization
│ │ └── parsers/
│ │ ├── votaciones.py # Vote event parser
│ │ ├── nominal.py # Roll-call vote parser
│ │ ├── desglose.py # Vote breakdown parser
│ │ ├── diputado.py # Legislator profile parser
│ │ └── composicion.py # Chamber composition parser
│ │
│ ├── senadores/ # Senate scraper
│ │ ├── __init__.py
│ │ ├── client.py # Anti-WAF client (curl_cffi, 6 fingerprints)
│ │ ├── config.py # Scraper configuration
│ │ ├── models.py # Shared data models
│ │ ├── votaciones/ # Voting records scraper
│ │ │ ├── __init__.py
│ │ │ ├── __main__.py # python -m scraper_congreso.senadores.votaciones
│ │ │ ├── cli.py # CLI entry point
│ │ │ ├── loader.py # SQLite loader
│ │ │ ├── transformers.py # Data normalization
│ │ │ └── parsers/
│ │ │ └── lxvi_portal.py # Portal /66/ parser (GET + POST AJAX)
│ │ └── perfiles/ # Senator profiles scraper
│ │ ├── __init__.py
│ │ ├── __main__.py # python -m scraper_congreso.senadores.perfiles
│ │ ├── scraper.py # Profile scraper logic
│ │ └── parsers/
│ │ └── perfil_parser.py
│ │
│ └── utils/ # Shared utilities
│ ├── __init__.py
│ ├── base_loader.py # BaseLoader (shared SQLite patterns)
│ ├── db_helpers.py # DB helper functions
│ ├── db_utils.py # DB utility functions
│ ├── id_generator.py # Human-readable IDs (P01, O01, VE01...)
│ ├── text_utils.py # Text normalization
│ ├── config.py # Shared config
│ └── logging_config.py # Logging configuration
│
├── analysis/ # 28 modules (~13.8K lines)
│ ├── constants.py # PARTY_COLORS, ORG_TO_SHORT, PARTY_ORDER, COLORES_WEB
│ ├── config.py # 8 tuneable parameters (thresholds, seeds, IDs)
│ ├── db.py # Data access layer (get_connection + 5 parametrized queries)
│ ├── runner_utils.py # Shared logging, argparse, run_for_cameras
│ ├── nominate.py # W-NOMINATE implementation
│ ├── covotacion.py # Co-voting matrix and graph
│ ├── covotacion_dinamica.py # Dynamic time-windowed co-voting (829 lines)
│ ├── comunidades.py # Louvain via nx.community (seed=42)
│ ├── centralidad.py # Degree and betweenness centrality
│ ├── poder_partidos.py # Shapley-Shubik O(n²W) DP + Banzhaf
│ ├── poder_empirico.py # Empirical power from real votes
│ ├── evolucion_partidos.py # Party evolution analysis
│ ├── efecto_genero.py # Gender effect analysis
│ ├── efecto_curul_tipo.py # Seat type effect analysis
│ ├── trayectorias.py # Individual legislator trajectories
│ ├── visualizacion.py # General visualization exports
│ ├── visualizacion_nominate.py
│ ├── visualizacion_dinamica.py
│ ├── visualizacion_poder.py
│ ├── visualizacion_articulo.py
│ ├── run_analysis.py # Run all analyses
│ ├── run_nominate.py # Run NOMINATE only
│ ├── run_covotacion_dinamica.py
│ ├── run_evolucion_partidos.py
│ ├── run_efecto_genero.py
│ ├── run_efecto_curul_tipo.py
│ └── run_trayectorias.py
│
├── db/
│ ├── schema.sql # Synchronized schema (18 indexes, 14 FKs, corrected CHECKs)
│ ├── init_db.py # PRAGMA FK ON + seed data
│ ├── constants.py # LEGISLATURAS_ORDERED, CAMARA_IDS, party mappings
│ ├── congreso.db # SQLite database (~337MB)
│ ├── migrations/ # 25 documented migrations (all applied, idempotent)
│ │ └── README.md # Migration docs
│ └── archived/ # Obsolete files (senado_schema.sql, legacy helpers)
│
├── tests/ # 302 tests (passing)
│
├── scripts/
│ ├── mantener.sh # Project maintenance script
│ ├── backup_db.sh # Database backup
│ └── clean_cache.sh # Cache cleanup
│
└── cache/ # HTTP response cache
Database Configuration
SQLite is configured for safe concurrent access and data integrity:
PRAGMA foreign_keys = ON;
PRAGMA encoding = "UTF-8";
PRAGMA journal_mode = WAL;
PRAGMA busy_timeout = 5000;
PRAGMA foreign_keys = ON is enforced in both db/init_db.py and analysis/db.py, ensuring referential integrity regardless of entry point.
| Setting | Value | Purpose |
|---|---|---|
| journal_mode | WAL | Concurrent reads without blocking writes |
| foreign_keys | ON | Enforce referential integrity between tables |
| busy_timeout | 5000ms | Wait up to 5 seconds if database is locked |
| encoding | UTF-8 | Correct handling of Spanish characters (accents, ñ) |
The schema defines 14 foreign keys with explicit ON DELETE / ON UPDATE actions: 3 use CASCADE (for dependent records that should propagate deletions) and 11 use RESTRICT (to prevent orphaned references).
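A minimal sketch of a connection helper applying these settings. The real helpers live in db/init_db.py and analysis/db.py; this signature is illustrative. Note that foreign_keys is a per-connection setting in SQLite and must be re-enabled on every connection, while journal_mode=WAL persists in the database file:

```python
import sqlite3

def get_connection(db_path: str = "db/congreso.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path, timeout=5.0)
    # Per-connection setting: silently OFF by default, so always re-enable.
    conn.execute("PRAGMA foreign_keys = ON")
    # Persistent setting: re-issuing it on an already-WAL database is harmless.
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA busy_timeout = 5000")
    return conn
```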
Schema Overview
The Popolo-Graph schema contains 12 tables with 18 indexes, 14 foreign keys, and 5 corrected CHECK constraints. It is organized into four groups:
Core Popolo entities (legislative data standard):
| Table | Purpose |
|---|---|
| area | Geographic divisions (states, districts, constituencies) |
| organization | Political parties, blocs, coalitions, institutions |
| person | Legislators and political actors |
| membership | Person-to-organization relationships with roles and dates |
| post | Legislative positions within organizations and areas |
| motion | Bills and legislative initiatives |
| vote_event | Specific voting instances (chamber + date) |
| vote | Individual legislator votes per event |
| count | Aggregated vote counts per group per event |
Power network extensions (beyond standard Popolo):
| Table | Purpose |
|---|---|
| actor_externo | External actors (governors, party leaders, judges) |
| relacion_poder | Informal power relationships (loyalty, pressure, alliances) |
| evento_politico | Political events that affect power dynamics |
:::note
All tables use human-readable IDs with prefixes (P01 for person, O01 for organization, VE01 for vote event, etc.). This makes debugging and manual queries significantly easier than opaque integer primary keys.
:::
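The prefix scheme can be sketched in one line. This is a hypothetical reconstruction; the real id_generator module may pad widths or track sequence counters differently:

```python
def make_id(prefix: str, n: int) -> str:
    # Hypothetical sketch of the prefixed-ID scheme (P01, O01, VE01, ...):
    # entity prefix plus a zero-padded sequence number.
    return f"{prefix}{n:02d}"

# make_id("P", 1) -> "P01"; make_id("VE", 12) -> "VE12"
```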
Schema Maintenance
The db/migrations/ directory contains 25 documented migration scripts, all applied and idempotent. Obsolete schema files (such as the former senado_schema.sql and legacy helper scripts) are preserved in db/archived/ for reference.
Indexes
The schema includes 18 indexes covering the most common query patterns:
- membership: queries by person and by organization
- vote_event: lookups by motion and by source_id (deduplication)
- vote: queries by voter and by event
- count: queries by event and by group
- relacion_poder: queries by source, target, and type
- person: filtering by internal faction (corriente_interna)
Integrity Constraints
Date validation CHECK constraints ensure end_date >= start_date on person and membership tables for both inserts and updates. These constraints enforce data integrity at the SQLite level regardless of which loader writes the data.
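A self-contained illustration of the pattern, using a simplified person table rather than the real schema; the CHECK clause behaves the same way regardless of the other columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        id TEXT PRIMARY KEY,
        start_date TEXT,
        end_date TEXT,
        -- NULL dates are allowed; when both are present, the range must be valid.
        CHECK (end_date IS NULL OR start_date IS NULL OR end_date >= start_date)
    )
""")
conn.execute("INSERT INTO person VALUES ('P01', '2018-09-01', '2021-08-31')")  # ok
try:
    conn.execute("INSERT INTO person VALUES ('P02', '2021-09-01', '2018-08-31')")
except sqlite3.IntegrityError:
    pass  # rejected at the SQLite level: end_date precedes start_date
```

Because the constraint lives in the schema, it applies no matter which loader (or manual query) writes the row.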
Data Volumes
| Metric | Value |
|---|---|
| Individual votes | ~3,510,053 |
| Vote events | ~9,437 |
| Persons | ~4,840 |
| Organizations | ~20+ |
| Legislatures | 7 (LX through LXVI, 2006-2027) |
| Tests | 302 passing |
| Migration scripts | 25 (all applied) |
Build System
The project uses pyproject.toml with hatchling as its build backend, making the scraper installable as a package (pip install -e .).
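A typical hatchling configuration has the shape below. This is the standard form of such a file, not the project's actual pyproject.toml, which also carries dependencies and ruff configuration:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "scraper-congreso"   # illustrative name
version = "0.0.0"           # illustrative version
requires-python = ">=3.12"
```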
Entry Points
python -m scraper_congreso.diputados # Scrape Diputados
python -m scraper_congreso.senadores.votaciones # Scrape Senate votes
python -m scraper_congreso.senadores.perfiles # Scrape Senate profiles
Dependencies
Core (scraper): curl_cffi, httpx, beautifulsoup4, lxml, pydantic
Dev: pytest, ruff
Analysis: numpy, pandas, scipy, networkx, matplotlib, polars