# Architecture

## High-Level Overview

mcp-memory follows a layered architecture with three main components: the MCP server (transport + tools), the storage layer (SQLite), and the embedding engine (ONNX). A fourth layer — the Limbic Scoring engine — sits above storage and orchestrates dynamic re-ranking.
```mermaid
graph TD
    Client["MCP Client<br/>(Claude Desktop, OpenCode, Cursor)"]
    Client -->|stdio JSON-RPC| FastMCP

    subgraph FastMCP["FastMCP Server"]
        Tools["19 MCP Tools<br/>(8 Anthropic + 11 extended)"]
        Pydantic["Pydantic Validation"]
        Engine["EmbeddingEngine<br/>(ONNX, lazy load)"]
        Tools --> Pydantic
        Tools --> Engine
    end

    subgraph Storage["Storage Layer"]
        MemoryStore["MemoryStore<br/>(SQLite + sqlite-vec + FTS5)"]
        SQLite["memory.db<br/>(WAL mode)"]
    end

    subgraph Scoring["Scoring Layer"]
        Limbic["scoring.py (Limbic)<br/>Salience · Decay · Co-oc · Temporal Co-oc Decay"]
        Routing["Query Router<br/>(COSINE/LIMBIC/HYBRID)"]
    end

    subgraph EntitySplit["Entity Splitting"]
        Splitter["entity_splitter.py<br/>(TF-IDF + Thresholds)"]
    end

    subgraph Scripts["Scripts"]
        AutoTuner["auto_tuner.py<br/>(Grid Search + NDCG@K)"]
        GridSearch["grid_search.py<br/>(Offline GAMMA/BETA_SAL)"]
        ABMetrics["ab_metrics.py<br/>(Shadow Mode Metrics)"]
    end

    Pydantic --> MemoryStore
    Engine --> MemoryStore
    MemoryStore --> SQLite
    MemoryStore --> Limbic
    Limbic --> Routing
```

The server starts as a stdio process that listens for JSON-RPC on stdin and responds via stdout. Logs go to stderr to avoid interfering with the MCP protocol.
Default storage path: `~/.config/opencode/mcp-memory/memory.db`
## Technology Stack

| Component | Technology | Version | Purpose |
|---|---|---|---|
| Runtime | Python | >= 3.12 | Minimum language version |
| MCP Server | FastMCP | >= 3.0 | Framework for tool registration and stdio transport |
| Database | SQLite | stdlib | Persistent storage with WAL journaling |
| Vector search | sqlite-vec | >= 0.1.6 | KNN with cosine distance on virtual table vec0 |
| Embeddings | ONNX Runtime | >= 1.17 | CPU inference for sentence embeddings |
| Tokenization | HuggingFace Tokenizers | >= 0.19 | Fast tokenization (Rust implementation) |
| Numerics | NumPy | >= 1.26 | Vector operations and linear algebra |
| Validation | Pydantic | >= 2.0 | Input/output model validation |
| Model download | HuggingFace Hub | >= 0.20 | Fetch models from HF Hub |
| Build system | hatchling | — | Python packaging |
## Transport

mcp-memory communicates over stdio using JSON-RPC:
- Input: The server reads JSON-RPC requests from stdin
- Output: Responses are written to stdout
- Logs: All diagnostic output goes to stderr — never stdout — to avoid polluting the MCP protocol channel
The entry point is registered as `mcp-memory`, which resolves to `mcp_memory.server:main`. This makes it compatible with any MCP client that supports stdio transport:
```json
{
  "mcpServers": {
    "memory": {
      "command": ["uvx", "--from", "git+https://github.com/Yarlan1503/mcp-memory", "mcp-memory"]
    }
  }
}
```

## Main Components

The codebase is organized into five core modules, plus supporting scripts and helpers:
### server.py — FastMCP Server

The entry point and tool registry — only 97 lines. Initializes the FastMCP server, creates the MemoryStore instance, and registers 19 tools by importing from four tool modules (`tools/core.py`, `tools/search.py`, `tools/entity_mgmt.py`, `tools/reflections.py`, `tools/relations.py`). Each tool has Pydantic-validated inputs and outputs. The store is shared via a runtime lookup pattern — tool functions access it through `_server_mod.store` rather than closure or global.
### storage.py — MemoryStore

The persistence layer — organized as a package with 7 mixins. `MemoryStore` in `storage/__init__.py` is a facade class that inherits from `CoreMixin`, `SchemaMixin`, `RelationsMixin`, `SearchMixin`, `AccessMixin`, `ReflectionsMixin`, and `ConsolidationMixin`. Each mixin handles a specific domain (entity CRUD, schema migrations, relations, search, access tracking, reflections, consolidation). All share a single SQLite connection (`self.db`) via Python’s MRO (Method Resolution Order).
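The facade pattern described above can be sketched as follows. Only the `MemoryStore` name, the mixin names, and the shared `self.db` connection come from the source; the table layout and method bodies are simplified illustrations, and five of the seven mixins are omitted for brevity.

```python
import sqlite3


class SchemaMixin:
    """Schema creation/migration (simplified: real schema has more tables)."""

    def init_schema(self) -> None:
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS entities ("
            "id INTEGER PRIMARY KEY, name TEXT UNIQUE, entity_type TEXT)"
        )


class CoreMixin:
    """Entity CRUD; relies on the facade providing self.db."""

    def upsert_entity(self, name: str, entity_type: str) -> int:
        cur = self.db.execute(
            "INSERT INTO entities(name, entity_type) VALUES (?, ?) "
            "ON CONFLICT(name) DO UPDATE SET entity_type = excluded.entity_type",
            (name, entity_type),
        )
        self.db.commit()
        return cur.lastrowid


class MemoryStore(CoreMixin, SchemaMixin):
    """Facade: every mixin method sees the same connection via Python's MRO."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.init_schema()
```

Because the mixins never open their own connections, a single `MemoryStore(path)` call is enough to wire the whole storage layer to one database file.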
### embeddings.py — EmbeddingEngine

Encapsulates all embedding inference logic using an ONNX model. Implements a singleton pattern with lazy loading — the model is only loaded into memory on first use. Provides the encoding pipeline: prefix prepending → tokenization → ONNX forward pass → mean pooling → L2 normalization.
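The last two steps of that pipeline, mean pooling and L2 normalization, can be sketched with NumPy. The array shapes and function name here are illustrative assumptions; only the pipeline order and the 384-dim output come from the source.

```python
import numpy as np


def mean_pool_and_normalize(token_embeddings: np.ndarray,
                            attention_mask: np.ndarray) -> np.ndarray:
    """Collapse per-token vectors into one sentence vector, then L2-normalize.

    token_embeddings: (seq_len, dim) output of the ONNX forward pass
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(np.float32)   # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)      # ignore padding rows
    pooled = summed / np.clip(mask.sum(), 1e-9, None)   # mean over real tokens
    return pooled / np.linalg.norm(pooled)              # unit length
```

Normalizing to unit length is what later allows the dot product to stand in for cosine similarity during KNN search.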
### scoring.py — Limbic Scoring

The dynamic ranking engine. Computes composite scores from three signals: salience (access frequency + graph degree), temporal decay (exponential with configurable half-life), and co-occurrence (how often entities appear together in results). Also implements Reciprocal Rank Fusion for merging KNN and FTS5 results. See Limbic System for details.
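Reciprocal Rank Fusion is simple enough to show in full. This is a generic sketch of the technique, not the module's exact code; the `k = 60` default matches the value used in the search pipeline described below.

```python
def rrf_merge(knn_ids: list[int], fts_ids: list[int], k: int = 60) -> list[int]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).

    Documents appearing in both rankings accumulate score from each,
    which is what gives them the combined boost.
    """
    scores: dict[int, float] = {}
    for ranking in (knn_ids, fts_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `rrf_merge([1, 2, 3], [2, 3, 4])` ranks entity 2 first because it places highly in both lists, even though neither list ranks it first.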
### entity_splitter.py — Entity Splitting

Automatically splits entities that exceed observation thresholds into focused sub-entities. Uses TF-IDF to group observations by topic and creates `contiene`/`parte_de` relations to preserve the knowledge graph structure.
- Thresholds: `Sesion=15`, `Proyecto=25`, `DEFAULT=20`
- Topic extraction: TF-IDF with Spanish stop words, min word length 4
- Split proposal: Returns suggested new entities + relations to create
- Atomic execution: All-or-nothing transaction via SQLite context manager
### scripts/auto_tuner.py — Auto-tuning

Offline optimization script for GAMMA and BETA_SAL hyperparameters:
- Grid search: Explores GAMMA × BETA_SAL combinations
- NDCG@K metrics: Uses implicit feedback from shadow mode A/B testing
- Smooth apply: Exponential moving average to avoid sudden changes
- CLI: `--analyze`, `--tune`, `--set-gamma`, `--set-beta`
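The "smooth apply" step is a standard exponential moving average: each tuning run moves the live constant only a fraction of the way toward the newly found optimum. A sketch (the function name is illustrative; the `blend_factor=0.1` default matches the value quoted in the auto-tuning workflow):

```python
def smooth_apply(current: float, tuned: float, blend_factor: float = 0.1) -> float:
    """Exponential moving average: blend the tuned value into the current one.

    With blend_factor=0.1, each run moves 10% of the way toward the
    grid-search optimum, avoiding sudden ranking changes.
    """
    return (1.0 - blend_factor) * current + blend_factor * tuned
```

So if GAMMA is currently 0.5 and grid search suggests 1.0, one tuning pass yields 0.55 rather than jumping straight to 1.0.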
### scripts/ab_metrics.py — A/B Testing Metrics

Calculates quality metrics from shadow mode data:
- NDCG@K: Normalized Discounted Cumulative Gain
- Lift@K: Proportion of relevant items in top-K vs overall
- Baseline comparison: Treatment vs cosine-only ranking
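NDCG@K itself is a standard metric; a minimal reference implementation for orientation (function name illustrative, not the script's actual code):

```python
import math


def ndcg_at_k(relevances: list[float], k: int) -> float:
    """NDCG@K: DCG of the observed ranking over DCG of the ideal ranking.

    relevances[i] is the (implicit) relevance of the item shown at rank i+1,
    e.g. derived from re-access events in shadow mode logs.
    """
    def dcg(rels: list[float]) -> float:
        # Gains are discounted logarithmically by rank position.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A ranking that already places the most relevant items first scores 1.0; placing a relevant item lower in the list reduces the score toward 0.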
### migrate.py — JSONL Import

Imports data from Anthropic’s JSONL format into SQLite. Processes the file line-by-line, tolerates corrupt entries, and generates embeddings in batch at the end. Fully idempotent — running it multiple times produces the same result. See Migration Guide for the full process.
### config.py — Input Limits

Constants for input validation across tools. Defines `MAX_OBS=100`, `MAX_ENTITIES=50`, `MAX_OBS_LEN=2000`, and `MAX_QUERY_LEN=500` to prevent abuse and maintain performance. Also contains A/B testing configuration (`USE_AB_TESTING`, `BASELINE_PROBABILITY`).
### _helpers.py — Shared Helpers

Three utility functions shared across tool modules: `_get_engine()` (lazy-loads the EmbeddingEngine singleton), `_format_output()` (standardizes entity output formatting), and `_get_store()` (retrieves the MemoryStore instance from the server module). Centralizes the runtime lookup pattern used by all tools.
### retry.py — Concurrency Handling

The `retry_on_locked` decorator provides exponential backoff with jitter for SQLite write operations under multi-client concurrency. Applied to all 21 write methods in the storage layer. Parameters: `max_retries=5`, `base_delay=0.1s`, `max_delay=2.0s`, with 10% jitter to prevent thundering herd. Essential when multiple MCP clients (e.g., two opencode sessions) write to the same database simultaneously. Since v2.2, each retry automatically performs a `rollback()` before re-attempting the operation, ensuring a clean transaction state.
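A sketch of such a decorator, matching the parameters described above; the real implementation may differ in detail:

```python
import functools
import random
import sqlite3
import time


def retry_on_locked(max_retries: int = 5, base_delay: float = 0.1,
                    max_delay: float = 2.0, jitter: float = 0.1):
    """Retry a storage write on 'database is locked', with backoff + jitter."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(self, *args, **kwargs)
                except sqlite3.OperationalError as exc:
                    if "locked" not in str(exc) or attempt == max_retries:
                        raise
                    self.db.rollback()  # clean transaction state (v2.2 behavior)
                    # Exponential backoff capped at max_delay, +/-10% jitter
                    delay = min(base_delay * 2 ** attempt, max_delay)
                    time.sleep(delay * (1 + random.uniform(-jitter, jitter)))
        return wrapper
    return decorator
```

The jitter matters: without it, two clients that collide on a lock would back off in lockstep and collide again on every retry.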
## Data Flow: Write (create_entities)

When a client invokes `create_entities`, data flows through four stages before persisting with its semantic embedding:
```mermaid
sequenceDiagram
    participant C as Client
    participant F as FastMCP
    participant P as Pydantic
    participant S as MemoryStore
    participant E as EmbeddingEngine
    participant V as sqlite-vec

    C->>F: create_entities([{name, entityType, observations}])
    F->>P: EntityInput.model_validate(dict)
    Note over P: Validates name (non-empty),<br/>entityType, observations
    P->>S: upsert_entity(name, type)
    S->>S: add_observations(entity_id, obs)<br/>[dedup by content]
    S->>E: prepare_entity_text(name, type, obs)
    Note over E: Head+Tail+Diversity<br/>selection, 480 token budget
    E->>E: encode([text]) → float[384]
    E->>V: INSERT OR REPLACE<br/>(rowid, embedding)
    V-->>C: Entity created with embedding
```

Step by step:
- Client → FastMCP: The client sends a JSON-RPC request with a list of entity dicts
- FastMCP → Pydantic: Each dict is validated against `EntityInput` — name must be non-empty, observations default to `[]`
- Pydantic → MemoryStore: The entity is upserted (`INSERT ... ON CONFLICT(name) DO UPDATE`). Observations are added with deduplication by exact content
- MemoryStore → EmbeddingEngine: Entity text is prepared using Head+Tail+Diversity selection (480-token budget) and encoded into a 384-dim vector
- EmbeddingEngine → sqlite-vec: The vector (1,536 bytes as float32) is stored with `INSERT OR REPLACE` using the entity’s ID as rowid
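The upsert and dedup steps above rely on SQLite's conflict clauses. A self-contained sketch with a deliberately simplified schema (the real tables carry more columns and constraints):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT UNIQUE, "
           "entity_type TEXT)")
db.execute("CREATE TABLE observations (entity_id INTEGER, content TEXT, "
           "UNIQUE(entity_id, content))")  # enables dedup by exact content

# Upsert: a second call with the same name updates instead of duplicating
for _ in range(2):
    db.execute("INSERT INTO entities(name, entity_type) VALUES (?, ?) "
               "ON CONFLICT(name) DO UPDATE SET entity_type = excluded.entity_type",
               ("mcp-memory", "Proyecto"))

# Observation dedup: INSERT OR IGNORE silently drops exact duplicates
for content in ("uses SQLite", "uses SQLite", "ships 19 tools"):
    db.execute("INSERT OR IGNORE INTO observations VALUES (1, ?)", (content,))

assert db.execute("SELECT COUNT(*) FROM entities").fetchone()[0] == 1
assert db.execute("SELECT COUNT(*) FROM observations").fetchone()[0] == 2
```

The same `ON CONFLICT` idiom is why re-running an import is idempotent: repeated writes converge to one row per entity.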
## Data Flow: Search

Semantic search uses a hybrid pipeline: KNN vector search runs in parallel with FTS5 full-text search, results are merged via Reciprocal Rank Fusion, then re-ranked by the Limbic Scoring engine.
```mermaid
sequenceDiagram
    participant C as Client
    participant F as FastMCP
    participant KNN as KNN (sqlite-vec)
    participant FTS as FTS5 (BM25)
    participant RRF as RRF Merge
    participant L as Limbic Scoring
    participant S as MemoryStore

    C->>F: search_semantic("project memory", limit=10)
    F->>F: Check engine.available

    par Parallel Search
        F->>KNN: encode(query) → KNN 3× limit
        F->>FTS: BM25 on name/type/obs
    end

    KNN-->>RRF: [{id, distance}]
    FTS-->>RRF: [{id, rank}]
    Note over RRF: score(d) = Σ 1/(k + rank_i)<br/>k = 60
    RRF->>L: Fused candidates + scores
    L->>L: Fetch access, degree,<br/>co-occurrence data
    L->>L: Compute limbic_score<br/>per candidate
    L->>S: Top-K entity IDs (hydrate)
    S->>S: get_entity_by_id() +<br/>get_observations()
    Note over S: record_access() +<br/>record_co_occurrences()<br/>(post-response, best-effort)
    S-->>C: Results with limbic_score<br/>+ scoring breakdown
```

Step by step:
- Client → FastMCP: The query string and optional limit arrive via JSON-RPC
- Availability check: If the ONNX model isn’t loaded, the server returns a clear error with download instructions
- Parallel search: Two branches execute simultaneously:
  - KNN: The query is encoded with the `"query: "` prefix and compared against stored vectors via sqlite-vec (retrieves 3× limit candidates)
  - FTS5: BM25 ranking over entity names, types, and observation text
- RRF Merge: Results from both branches are fused using Reciprocal Rank Fusion (`k=60`). Entities appearing in both rankings receive a combined score boost
- Limbic Re-rank: The merged candidates are scored by the Limbic System, which applies salience, temporal decay, and co-occurrence boosts. Query routing (`detect_query_type()`) determines the strategy (COSINE_HEAVY/LIMBIC_HEAVY/HYBRID_BALANCED) based on linguistic features and k_limit
- Hydration: Top-K entity IDs are hydrated with full entity data (name, type, observations) from SQLite
- Post-response tracking: Access events and co-occurrences are recorded for future ranking improvements — this is best-effort and doesn’t block the response
For details on how the Limbic Scoring formula works, see Limbic System.
## Lazy Loading

The EmbeddingEngine uses a singleton + lazy load pattern to keep startup fast:

```mermaid
graph TD
    A["Server starts<br/>EmbeddingEngine._instance = None"] --> B["First call to get_instance()"]
    B --> C{Model files exist<br/>in ~/.cache/mcp-memory-v2/models/?}
    C -->|Yes| D["Load ONNX + Tokenizer<br/>_available = True"]
    C -->|No| E["_available = False<br/>Server continues without embeddings"]
    D --> F["Ready for search_semantic<br/>and embedding generation"]
```

The server starts without loading the model. The 8 Anthropic-compatible tools work immediately using only SQLite, as do most extended tools. The embedding engine initializes on demand — the first time a tool needs it.
Two-level lazy loading:

- Import level: `mcp_memory.embeddings` is imported inside `_get_engine()`, not at module scope in `server.py`
- Instance level: `EmbeddingEngine.get_instance()` creates the singleton on first call
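The singleton + lazy-load shape can be sketched like this. The method bodies are placeholders: the real engine loads the ONNX model and tokenizer on first use and sets availability based on whether the model files exist on disk.

```python
class EmbeddingEngine:
    """Singleton with lazy loading: nothing heavy happens at import time."""

    _instance = None

    def __init__(self):
        self._model = None      # ONNX session, loaded on first use
        self.available = False

    @classmethod
    def get_instance(cls):
        # Instance-level laziness: created on first call, then reused
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def _ensure_loaded(self):
        if self._model is None:
            # Placeholder: real code loads ONNX + tokenizer from the model
            # cache directory and sets availability from file existence.
            self._model = object()
            self.available = True

    def encode(self, texts):
        self._ensure_loaded()
        return [[0.0] * 384 for _ in texts]  # placeholder 384-dim vectors
```

Constructing the class costs nothing; the expensive load is deferred until the first `encode()` call, which is why server startup stays fast.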
| Scenario | Behavior |
|---|---|
| Model downloaded | First search_semantic takes ~3-5 seconds (loading). Subsequent calls: milliseconds |
| Model not downloaded | search_semantic returns a clear error. All other tools work normally |
| sqlite-vec unavailable | Server continues without vector search. CRUD operations unaffected |
## Database

### Default Path

```
~/.config/opencode/mcp-memory/memory.db
```

The directory is created automatically if it doesn’t exist. A single file holds all data — entities, observations, relations, embeddings, and scoring metadata.
### WAL Mode and Concurrency

SQLite is configured with Write-Ahead Logging (WAL) for safe concurrent access:

```sql
PRAGMA journal_mode = WAL;    -- Concurrent reads without blocking writes
PRAGMA busy_timeout = 10000;  -- Wait up to 10s if locked
PRAGMA synchronous = NORMAL;  -- Balance between safety and speed
PRAGMA cache_size = -64000;   -- 64 MB page cache
PRAGMA temp_store = MEMORY;   -- Temporary tables in RAM
PRAGMA foreign_keys = ON;     -- Enforce referential integrity
```

| Operation | Behavior |
|---|---|
| Concurrent reads | Allowed — WAL supports multiple simultaneous readers |
| Writes | Sequential — single writer, but readers aren’t blocked |
| Lock contention | Writers wait up to 10 seconds (busy_timeout) for a lock |
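In Python's `sqlite3`, these pragmas are applied per connection at open time. A sketch; note that `journal_mode = WAL` only takes effect on a file-backed database (`:memory:` databases ignore it):

```python
import os
import sqlite3
import tempfile

# WAL is a persistent, file-backed setting, so use a real file
path = os.path.join(tempfile.mkdtemp(), "memory.db")
db = sqlite3.connect(path)

# The journal_mode pragma returns the mode actually in effect
mode = db.execute("PRAGMA journal_mode = WAL").fetchone()[0]

db.execute("PRAGMA busy_timeout = 10000")   # wait up to 10s on lock
db.execute("PRAGMA synchronous = NORMAL")   # safe with WAL, faster than FULL
db.execute("PRAGMA cache_size = -64000")    # negative value => size in KiB
db.execute("PRAGMA temp_store = MEMORY")
db.execute("PRAGMA foreign_keys = ON")      # off by default in SQLite
```

Checking the return value of the `journal_mode` pragma is worthwhile: it silently stays on the old mode when WAL cannot be enabled.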
Starting from v2.2, write operations use `retry_on_locked` — exponential backoff with jitter that handles `database is locked` errors transparently. Each retry performs an automatic `rollback()` before re-attempting, and long-running write operations (like `add_observations`) use `BEGIN IMMEDIATE` to acquire the write lock upfront. This enables safe multi-client access (e.g., two opencode sessions writing concurrently) without manual retry logic.
### Schema Overview

| Table | Type | Purpose |
|---|---|---|
| `entities` | Regular | Graph nodes (id, name, entity_type, timestamps) |
| `observations` | Regular | Facts attached to entities (entity_id FK, content) |
| `relations` | Regular | Graph edges (from_entity, to_entity, relation_type) |
| `db_metadata` | Regular | System key-value metadata |
| `entity_embeddings` | Virtual (vec0) | 384-dim vectors with cosine distance |
| `entity_fts` | Virtual (FTS5) | Full-text search with BM25 ranking |
| `entity_access` | Regular | Access tracking for Limbic Scoring |
| `co_occurrences` | Regular | Co-occurrence tracking for Limbic Scoring |
For the complete schema DDL, index definitions, and Pydantic model details, see API Reference.
## A/B Testing: Shadow Mode

MCP Memory v2 includes a shadow-mode A/B testing system that compares limbic scoring against a cosine-only baseline without affecting user experience.
### How It Works

| Aspect | Description |
|---|---|
| Shadow mode | Every search_semantic call runs both baseline and limbic ranking |
| Assignment | Hash-based deterministic (query text → bucket) or random (10% baseline) |
| Logging | search_events and search_results tables store raw rankings |
| Metrics | ab_metrics.py computes NDCG@K, Lift@K from logged data |
| No user impact | Baseline results are logged but never returned to users |
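Deterministic hash-based assignment can be sketched as below. The source only specifies "query text → bucket"; the choice of SHA-256 and the bucketing arithmetic here are assumptions for illustration.

```python
import hashlib


def assign_treatment(query: str, baseline_probability: float = 0.1) -> str:
    """Hash-based deterministic assignment: the same query always lands
    in the same bucket, so repeated searches get consistent treatment."""
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    # Map the first 8 bytes to a uniform value in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "baseline" if bucket < baseline_probability else "limbic"
```

Determinism is the point of hashing rather than calling a random generator: re-running the same query never flips its treatment, which keeps logged comparisons consistent.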
### Configuration

```python
USE_AB_TESTING = True
BASELINE_PROBABILITY = 0.1  # 10% of queries are baseline
```

### Database Tables

| Table | Purpose |
|---|---|
| `search_events` | Query metadata: text, treatment, k_limit, timestamp, duration |
| `search_results` | Per-entity ranking data: entity_id, rank, limbic_score, cosine_sim |
| `implicit_feedback` | Re-access events for NDCG calculation |
### Auto-Tuning Workflow

- Collect shadow mode data via normal `search_semantic` usage
- Run `python scripts/auto_tuner.py --tune` when enough data accumulates
- Script finds optimal GAMMA × BETA_SAL via grid search
- Applies smoothly via exponential moving average (`blend_factor=0.1`)
- Updates both `db_metadata` and `scoring.py` constants
## Design Conventions

Several architectural decisions distinguish mcp-memory from the original Anthropic server and similar solutions:
### SQLite over JSONL

The original Anthropic server rewrites the entire knowledge graph to a JSONL file on every operation, with no locking. This causes data corruption under concurrent access. SQLite with WAL mode provides ACID transactions, indexed queries (O(log n) vs linear scan), and safe concurrency — without requiring a separate database server.
### ONNX over Cloud APIs

Embeddings run locally via ONNX Runtime on CPU. No API keys, no network latency, no rate limits, no vendor lock-in. The tradeoff is ~465 MB of disk space for the model and ~5ms per encoding — acceptable for a local tool.
### Cosine Distance, 384 Dimensions

The intfloat/multilingual-e5-small model produces 384-dim vectors. This is a deliberate balance between quality and footprint:

- 384 dims × 4 bytes = 1,536 bytes per embedding — small enough for efficient storage and fast KNN search
- Cosine distance (`d = 1 - cos(A, B)`) matches how the e5 model was trained
- Vectors are L2-normalized before storage, enabling dot product as a proxy for cosine similarity
### Over-Retrieval + Re-Ranking

Search retrieves 3× the requested limit from KNN (e.g., 30 candidates for limit=10), then re-ranks with Limbic Scoring to produce the final top-K. This gives the scoring engine a larger pool to work with, improving result quality without significant overhead.
### Asymmetric Prefixes

The e5 model requires task-specific prefixes for optimal retrieval:

- Queries use the `"query: "` prefix
- Entities (passages) use the `"passage: "` prefix
This is a requirement of the model’s training methodology — using the wrong prefix significantly degrades search quality.
### Non-Incremental Embeddings

Embeddings are regenerated from scratch whenever an entity’s content changes. While this costs a full encoding pass each time, it guarantees that the vector always reflects the current state — no stale partial updates, no accumulation artifacts.
### Auto-Tuning via Grid Search

The Limbic Scoring constants (GAMMA, BETA_SAL) are tunable via offline grid search:

- Data source: Shadow mode A/B testing logs (no manual labeling required)
- Metric: NDCG@K (Normalized Discounted Cumulative Gain at K)
- Process: `auto_tuner.py --tune` explores combinations, applies smoothly
- Persistence: Both the `db_metadata` table and `scoring.py` module constants
This enables continuous improvement without code changes or service restarts.