
Architecture

mcp-memory follows a layered architecture with three main components: the MCP server (transport + tools), the storage layer (SQLite), and the embedding engine (ONNX). A fourth layer — the Limbic Scoring engine — sits above storage and orchestrates dynamic re-ranking.

```mermaid
graph TD
Client["MCP Client<br/>(Claude Desktop, OpenCode, Cursor)"]
Client -->|stdio JSON-RPC| FastMCP
subgraph FastMCP["FastMCP Server"]
Tools["19 MCP Tools<br/>(8 Anthropic + 11 extended)"]
Pydantic["Pydantic Validation"]
Engine["EmbeddingEngine<br/>(ONNX, lazy load)"]
Tools --> Pydantic
Tools --> Engine
end
subgraph Storage["Storage Layer"]
MemoryStore["MemoryStore<br/>(SQLite + sqlite-vec + FTS5)"]
SQLite["memory.db<br/>(WAL mode)"]
end
subgraph Scoring["Scoring Layer"]
Limbic["scoring.py (Limbic)<br/>Salience · Decay · Co-oc · Temporal Co-oc Decay"]
Routing["Query Router<br/>(COSINE/LIMBIC/HYBRID)"]
end
subgraph EntitySplit["Entity Splitting"]
Splitter["entity_splitter.py<br/>(TF-IDF + Thresholds)"]
end
subgraph Scripts["Scripts"]
AutoTuner["auto_tuner.py<br/>(Grid Search + NDCG@K)"]
GridSearch["grid_search.py<br/>(Offline GAMMA/BETA_SAL)"]
ABMetrics["ab_metrics.py<br/>(Shadow Mode Metrics)"]
end
Pydantic --> MemoryStore
Engine --> MemoryStore
MemoryStore --> SQLite
MemoryStore --> Limbic
Limbic --> Routing
```

The server starts as a stdio process that listens for JSON-RPC on stdin and responds via stdout. Logs go to stderr to avoid interfering with the MCP protocol.

Default storage path: ~/.config/opencode/mcp-memory/memory.db

| Component | Technology | Version | Purpose |
|---|---|---|---|
| Runtime | Python | >= 3.12 | Minimum language version |
| MCP Server | FastMCP | >= 3.0 | Framework for tool registration and stdio transport |
| Database | SQLite | stdlib | Persistent storage with WAL journaling |
| Vector search | sqlite-vec | >= 0.1.6 | KNN with cosine distance on virtual table vec0 |
| Embeddings | ONNX Runtime | >= 1.17 | CPU inference for sentence embeddings |
| Tokenization | HuggingFace Tokenizers | >= 0.19 | Fast tokenization (Rust implementation) |
| Numerics | NumPy | >= 1.26 | Vector operations and linear algebra |
| Validation | Pydantic | >= 2.0 | Input/output model validation |
| Model download | HuggingFace Hub | >= 0.20 | Fetch models from HF Hub |
| Build system | hatchling | | Python packaging |

mcp-memory communicates over stdio using JSON-RPC:

  • Input: The server reads JSON-RPC requests from stdin
  • Output: Responses are written to stdout
  • Logs: All diagnostic output goes to stderr — never stdout — to avoid polluting the MCP protocol channel

The entry point is registered as mcp-memory, which resolves to mcp_memory.server:main. This makes it compatible with any MCP client that supports stdio transport:

```json
{
  "mcpServers": {
    "memory": {
      "command": ["uvx", "--from", "git+https://github.com/Yarlan1503/mcp-memory", "mcp-memory"]
    }
  }
}
```

The codebase is organized into five core modules:

The entry point and tool registry — only 97 lines. Initializes the FastMCP server, creates the MemoryStore instance, and registers 19 tools by importing from five tool modules (tools/core.py, tools/search.py, tools/entity_mgmt.py, tools/reflections.py, tools/relations.py). Each tool has Pydantic-validated inputs and outputs. The store is shared via a runtime lookup pattern — tool functions access it through _server_mod.store rather than a closure or a global.

The persistence layer — organized as a package with 7 mixins. MemoryStore in storage/__init__.py is a facade class that inherits from CoreMixin, SchemaMixin, RelationsMixin, SearchMixin, AccessMixin, ReflectionsMixin, and ConsolidationMixin. Each mixin handles a specific domain (entity CRUD, schema migrations, relations, search, access tracking, reflections, consolidation). All share a single SQLite connection (self.db) via Python’s MRO (Method Resolution Order).
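
A minimal sketch of the facade-over-mixins idea, with two toy mixins instead of seven (the real mixins carry far more methods):

```python
# Each mixin covers one domain; all of them reach the shared SQLite
# connection through self.db thanks to Python's MRO.
import sqlite3

class CoreMixin:
    def upsert_entity(self, name):
        self.db.execute(
            "INSERT INTO entities(name) VALUES (?) "
            "ON CONFLICT(name) DO NOTHING", (name,))

class SearchMixin:
    def count_entities(self):
        return self.db.execute("SELECT COUNT(*) FROM entities").fetchone()[0]

class MemoryStore(CoreMixin, SearchMixin):
    # Facade: the only class that owns the connection.
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS entities(name TEXT UNIQUE)")

store = MemoryStore()
store.upsert_entity("alpha")
store.upsert_entity("alpha")  # duplicate is ignored by ON CONFLICT
```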

Encapsulates all embedding inference logic using an ONNX model. Implements a singleton pattern with lazy loading — the model is only loaded into memory on first use. Provides the encoding pipeline: prefix prepending → tokenization → ONNX forward pass → mean pooling → L2 normalization.
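
The last two pipeline stages (mean pooling and L2 normalization) can be sketched in NumPy; tokenization and the ONNX forward pass are outside this snippet:

```python
# Mean-pool the token embeddings over real (non-padding) tokens,
# then L2-normalize so cosine similarity reduces to a dot product.
import numpy as np

def pool_and_normalize(token_embeddings, attention_mask):
    mask = attention_mask[..., None].astype(float)   # (seq, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    pooled = summed / mask.sum()
    return pooled / np.linalg.norm(pooled)
```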

The dynamic ranking engine. Computes composite scores from three signals: salience (access frequency + graph degree), temporal decay (exponential with configurable half-life), and co-occurrence (how often entities appear together in results). Also implements Reciprocal Rank Fusion for merging KNN and FTS5 results. See Limbic System for details.
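
As an illustration only — the real formula and weights live in scoring.py and the Limbic System docs — the three signals might combine like this:

```python
# Hypothetical blend of the three signals; weights and structure are
# assumptions, not the project's actual formula.
import math

def temporal_decay(age_days, half_life_days=30.0):
    # Exponential decay with a configurable half-life.
    return 0.5 ** (age_days / half_life_days)

def salience(access_count, degree):
    # Log-damped access frequency plus graph degree.
    return math.log1p(access_count) + math.log1p(degree)

def limbic_score(cosine, age_days, access_count, degree, cooc_boost,
                 beta_sal=0.3, gamma=0.2):
    # Cosine base, salience and co-occurrence boosts, modulated by recency.
    return (cosine + beta_sal * salience(access_count, degree)
            + gamma * cooc_boost) * temporal_decay(age_days)
```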

Automatically splits entities that exceed observation thresholds into focused sub-entities. Uses TF-IDF to group observations by topic and creates contiene/parte_de relations to preserve the knowledge graph structure.

  • Thresholds: Sesion=15, Proyecto=25, DEFAULT=20
  • Topic extraction: TF-IDF with Spanish stop words, min word length 4
  • Split proposal: Returns suggested new entities + relations to create
  • Atomic execution: All-or-nothing transaction via SQLite context manager
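
The all-or-nothing execution maps naturally onto sqlite3's connection context manager, which commits on success and rolls back if an exception escapes; a minimal sketch:

```python
# Atomic batch insert: either every new entity lands, or none do.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entities(name TEXT UNIQUE)")

def execute_split(db, new_entities):
    with db:  # BEGIN ... COMMIT, or ROLLBACK if an exception escapes
        for name in new_entities:
            db.execute("INSERT INTO entities(name) VALUES (?)", (name,))

execute_split(db, ["proj/api", "proj/db"])
try:
    # Second item collides, so neither row of this batch survives.
    execute_split(db, ["proj/ui", "proj/api"])
except sqlite3.IntegrityError:
    pass
```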

Offline optimization script for GAMMA and BETA_SAL hyperparameters:

  • Grid search: Explores GAMMA × BETA_SAL combinations
  • NDCG@K metrics: Uses implicit feedback from shadow mode A/B testing
  • Smooth apply: Exponential moving average to avoid sudden changes
  • CLI: --analyze, --tune, --set-gamma, --set-beta
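
The smooth-apply step is a plain exponential moving average; a sketch with the blend factor described above:

```python
# Blend the freshly tuned value into the current constant so the
# hyperparameter moves gradually rather than jumping.
def smooth_apply(current, tuned, blend_factor=0.1):
    return (1 - blend_factor) * current + blend_factor * tuned
```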

scripts/ab_metrics.py — A/B Testing Metrics

Calculates quality metrics from shadow mode data:

  • NDCG@K: Normalized Discounted Cumulative Gain
  • Lift@K: Proportion of relevant items in top-K vs overall
  • Baseline comparison: Treatment vs cosine-only ranking
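
Hedged sketches of the two metrics; the script's exact definitions (in particular how graded gains are derived from implicit feedback) may differ:

```python
import math

def ndcg_at_k(relevances, k):
    # relevances: graded relevance of results in ranked order.
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

def lift_at_k(relevant_flags, k):
    # Proportion relevant in top-K divided by the overall proportion.
    top = sum(relevant_flags[:k]) / k
    overall = sum(relevant_flags) / len(relevant_flags)
    return top / overall if overall > 0 else 0.0
```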

Imports data from Anthropic’s JSONL format into SQLite. Processes the file line-by-line, tolerates corrupt entries, and generates embeddings in batch at the end. Fully idempotent — running it multiple times produces the same result. See Migration Guide for the full process.

Constants for input validation across tools. Defines MAX_OBS=100, MAX_ENTITIES=50, MAX_OBS_LEN=2000, and MAX_QUERY_LEN=500 to prevent abuse and maintain performance. Also contains A/B testing configuration (USE_AB_TESTING, BASELINE_PROBABILITY).

Three utility functions shared across tool modules: _get_engine() (lazy-loads the EmbeddingEngine singleton), _format_output() (standardizes entity output formatting), and _get_store() (retrieves the MemoryStore instance from the server module). Centralizes the runtime lookup pattern used by all tools.

The retry_on_locked decorator provides exponential backoff with jitter for SQLite write operations under multi-client concurrency. Applied to all 21 write methods in the storage layer. Parameters: max_retries=5, base_delay=0.1s, max_delay=2.0s, with 10% jitter to prevent thundering herd. Essential when multiple MCP clients (e.g., two opencode sessions) write to the same database simultaneously. Since v2.2, each retry automatically performs a rollback() before re-attempting the operation, ensuring a clean transaction state.
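
An illustrative reconstruction of such a decorator with the parameters above; the project's actual implementation may differ in its details:

```python
# Exponential backoff with jitter for "database is locked" errors,
# rolling back before each retry to restore a clean transaction state.
import functools
import random
import sqlite3
import time

def retry_on_locked(max_retries=5, base_delay=0.1, max_delay=2.0, jitter=0.1):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(self, *args, **kwargs)
                except sqlite3.OperationalError as exc:
                    if "locked" not in str(exc) or attempt == max_retries:
                        raise
                    self.db.rollback()  # clean transaction state (v2.2)
                    delay = min(base_delay * 2 ** attempt, max_delay)
                    time.sleep(delay * (1 + random.uniform(-jitter, jitter)))
        return wrapper
    return decorator
```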

When a client invokes create_entities, data flows through four stages before persisting with its semantic embedding:

```mermaid
sequenceDiagram
participant C as Client
participant F as FastMCP
participant P as Pydantic
participant S as MemoryStore
participant E as EmbeddingEngine
participant V as sqlite-vec
C->>F: create_entities([{name, entityType, observations}])
F->>P: EntityInput.model_validate(dict)
Note over P: Validates name (non-empty),<br/>entityType, observations
P->>S: upsert_entity(name, type)
S->>S: add_observations(entity_id, obs)<br/>[dedup by content]
S->>E: prepare_entity_text(name, type, obs)
Note over E: Head+Tail+Diversity<br/>selection, 480 token budget
E->>E: encode([text]) → float[384]
E->>V: INSERT OR REPLACE<br/>(rowid, embedding)
V-->>C: Entity created with embedding
```

Step by step:

  1. Client → FastMCP: The client sends a JSON-RPC request with a list of entity dicts
  2. FastMCP → Pydantic: Each dict is validated against EntityInput — name must be non-empty, observations default to []
  3. Pydantic → MemoryStore: The entity is upserted (INSERT ... ON CONFLICT(name) DO UPDATE). Observations are added with deduplication by exact content
  4. MemoryStore → EmbeddingEngine: Entity text is prepared using Head+Tail+Diversity selection (480-token budget) and encoded into a 384-dim vector
  5. EmbeddingEngine → sqlite-vec: The vector (1,536 bytes as float32) is stored with INSERT OR REPLACE using the entity’s ID as rowid
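
Step 3's upsert can be sketched with SQLite's ON CONFLICT clause (column names here are illustrative):

```python
# Upsert: insert a new entity, or update the existing row in place
# when the UNIQUE name already exists.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE entities(
    id INTEGER PRIMARY KEY, name TEXT UNIQUE, entity_type TEXT)""")

def upsert_entity(name, entity_type):
    db.execute(
        "INSERT INTO entities(name, entity_type) VALUES (?, ?) "
        "ON CONFLICT(name) DO UPDATE SET entity_type = excluded.entity_type",
        (name, entity_type))
    return db.execute(
        "SELECT id FROM entities WHERE name = ?", (name,)).fetchone()[0]

first = upsert_entity("mcp-memory", "Proyecto")
second = upsert_entity("mcp-memory", "Sesion")  # updates the same row
```

Because the row id is stable across upserts, it can safely double as the rowid for the entity's embedding in the vector table.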

Semantic search uses a hybrid pipeline: KNN vector search runs in parallel with FTS5 full-text search, results are merged via Reciprocal Rank Fusion, then re-ranked by the Limbic Scoring engine.

```mermaid
sequenceDiagram
participant C as Client
participant F as FastMCP
participant KNN as KNN (sqlite-vec)
participant FTS as FTS5 (BM25)
participant RRF as RRF Merge
participant L as Limbic Scoring
participant S as MemoryStore
C->>F: search_semantic("project memory", limit=10)
F->>F: Check engine.available
par Parallel Search
F->>KNN: encode(query) → KNN 3× limit
F->>FTS: BM25 on name/type/obs
end
KNN-->>RRF: [{id, distance}]
FTS-->>RRF: [{id, rank}]
Note over RRF: score(d) = Σ 1/(k + rank_i)<br/>k = 60
RRF->>L: Fused candidates + scores
L->>L: Fetch access, degree,<br/>co-occurrence data
L->>L: Compute limbic_score<br/>per candidate
L->>S: Top-K entity IDs (hydrate)
S->>S: get_entity_by_id() +<br/>get_observations()
Note over S: record_access() +<br/>record_co_occurrences()<br/>(post-response, best-effort)
S-->>C: Results with limbic_score<br/>+ scoring breakdown
```

Step by step:

  1. Client → FastMCP: The query string and optional limit arrive via JSON-RPC
  2. Availability check: If the ONNX model isn’t loaded, the server returns a clear error with download instructions
  3. Parallel search: Two branches execute simultaneously:
    • KNN: The query is encoded with the "query: " prefix and compared against stored vectors via sqlite-vec (retrieves 3× limit candidates)
    • FTS5: BM25 ranking over entity names, types, and observation text
  4. RRF Merge: Results from both branches are fused using Reciprocal Rank Fusion (k=60). Entities appearing in both rankings receive a combined score boost
  5. Limbic Re-rank: The merged candidates are scored by the Limbic System, which applies salience, temporal decay, and co-occurrence boosts. Query routing (detect_query_type()) determines the strategy (COSINE_HEAVY/LIMBIC_HEAVY/HYBRID_BALANCED) based on linguistic features and k_limit
  6. Hydration: Top-K entity IDs are hydrated with full entity data (name, type, observations) from SQLite
  7. Post-response tracking: Access events and co-occurrences are recorded for future ranking improvements — this is best-effort and doesn’t block the response
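
Step 4's fusion follows the formula in the diagram, score(d) = Σ 1/(k + rank_i) with k = 60; a compact sketch:

```python
# Reciprocal Rank Fusion: documents appearing in both rankings
# accumulate contributions from each, boosting their fused score.
def rrf_merge(knn_ids, fts_ids, k=60):
    scores = {}
    for ranking in (knn_ids, fts_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note how entity 2 below wins despite being ranked second in the KNN list, because it also appears in the FTS5 ranking.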

For details on how the Limbic Scoring formula works, see Limbic System.

The EmbeddingEngine uses a singleton + lazy load pattern to keep startup fast:

```mermaid
graph TD
A["Server starts<br/>EmbeddingEngine._instance = None"] --> B["First call to get_instance()"]
B --> C{Model files exist<br/>in ~/.cache/mcp-memory-v2/models/?}
C -->|Yes| D["Load ONNX + Tokenizer<br/>_available = True"]
C -->|No| E["_available = False<br/>Server continues without embeddings"]
D --> F["Ready for search_semantic<br/>and embedding generation"]
```

The server starts without loading the model. The 8 Anthropic-compatible tools work immediately using only SQLite, as do most extended tools. The embedding engine initializes on demand — the first time a tool needs it.

Two-level lazy loading:

  1. Import level: mcp_memory.embeddings is imported inside _get_engine(), not at module scope in server.py
  2. Instance level: EmbeddingEngine.get_instance() creates the singleton on first call

| Scenario | Behavior |
|---|---|
| Model downloaded | First search_semantic call takes ~3-5 seconds (loading); subsequent calls take milliseconds |
| Model not downloaded | search_semantic returns a clear error; the remaining tools work normally |
| sqlite-vec unavailable | Server continues without vector search; CRUD operations unaffected |
~/.config/opencode/mcp-memory/memory.db

The directory is created automatically if it doesn’t exist. A single file holds all data — entities, observations, relations, embeddings, and scoring metadata.

SQLite is configured with Write-Ahead Logging (WAL) for safe concurrent access:

```sql
PRAGMA journal_mode = WAL;     -- Concurrent reads without blocking writes
PRAGMA busy_timeout = 10000;   -- Wait up to 10 s if locked
PRAGMA synchronous = NORMAL;   -- Balance between safety and speed
PRAGMA cache_size = -64000;    -- 64 MB page cache
PRAGMA temp_store = MEMORY;    -- Temporary tables in RAM
PRAGMA foreign_keys = ON;      -- Enforce referential integrity
```
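
Applied from Python, the setup might look like this (a sketch; the storage layer configures its own connection):

```python
# Open a connection and apply the pragmas listed above.
import sqlite3

def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("PRAGMA busy_timeout = 10000")
    db.execute("PRAGMA synchronous = NORMAL")
    db.execute("PRAGMA cache_size = -64000")
    db.execute("PRAGMA temp_store = MEMORY")
    db.execute("PRAGMA foreign_keys = ON")
    db.execute("PRAGMA journal_mode = WAL")  # in-memory dbs report 'memory'
    return db
```
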
| Operation | Behavior |
|---|---|
| Concurrent reads | Allowed — WAL supports multiple simultaneous readers |
| Writes | Sequential — single writer, but readers aren't blocked |
| Lock contention | Writers wait up to 10 seconds (busy_timeout) for a lock |

Starting from v2.2, write operations use retry_on_locked — exponential backoff with jitter that handles database is locked errors transparently. Each retry performs an automatic rollback() before re-attempting, and long-running write operations (like add_observations) use BEGIN IMMEDIATE to acquire the write lock upfront. This enables safe multi-client access (e.g., two opencode sessions writing concurrently) without manual retry logic.
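
A minimal sketch of the BEGIN IMMEDIATE idiom (the toy `add_observations` below is far simpler than the real method):

```python
# BEGIN IMMEDIATE acquires the write lock up front, so a long write
# cannot be surprised by lock contention mid-transaction.
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# leaving transaction control fully explicit.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE observations(content TEXT)")

def add_observations(rows):
    db.execute("BEGIN IMMEDIATE")  # take the write lock now
    try:
        db.executemany("INSERT INTO observations VALUES (?)",
                       [(r,) for r in rows])
        db.execute("COMMIT")
    except sqlite3.OperationalError:
        db.execute("ROLLBACK")
        raise
```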

| Table | Type | Purpose |
|---|---|---|
| entities | Regular | Graph nodes (id, name, entity_type, timestamps) |
| observations | Regular | Facts attached to entities (entity_id FK, content) |
| relations | Regular | Graph edges (from_entity, to_entity, relation_type) |
| db_metadata | Regular | System key-value metadata |
| entity_embeddings | Virtual (vec0) | 384-dim vectors with cosine distance |
| entity_fts | Virtual (FTS5) | Full-text search with BM25 ranking |
| entity_access | Regular | Access tracking for Limbic Scoring |
| co_occurrences | Regular | Co-occurrence tracking for Limbic Scoring |

For the complete schema DDL, index definitions, and Pydantic model details, see API Reference.

MCP Memory v2 includes a shadow-mode A/B testing system that compares limbic scoring against a cosine-only baseline without affecting user experience.

| Aspect | Description |
|---|---|
| Shadow mode | Every search_semantic call runs both baseline and limbic ranking |
| Assignment | Hash-based deterministic (query text → bucket) or random (10% baseline) |
| Logging | search_events and search_results tables store raw rankings |
| Metrics | ab_metrics.py computes NDCG@K, Lift@K from logged data |
| No user impact | Baseline results are logged but never returned to users |

```python
USE_AB_TESTING = True
BASELINE_PROBABILITY = 0.1  # 10% of queries are baseline
```
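
The deterministic branch of the assignment might be sketched like this; the project's exact hashing scheme is an assumption here:

```python
# Deterministic bucket assignment: the same query text always gets
# the same treatment, across runs and processes.
import hashlib

BASELINE_PROBABILITY = 0.1

def assign_treatment(query: str) -> str:
    # sha256 is stable across processes, unlike built-in hash().
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "baseline" if bucket < BASELINE_PROBABILITY else "limbic"
```
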
| Table | Purpose |
|---|---|
| search_events | Query metadata: text, treatment, k_limit, timestamp, duration |
| search_results | Per-entity ranking data: entity_id, rank, limbic_score, cosine_sim |
| implicit_feedback | Re-access events for NDCG calculation |

  1. Collect shadow mode data via normal search_semantic usage
  2. Run python scripts/auto_tuner.py --tune when enough data accumulates
  3. Script finds optimal GAMMA × BETA_SAL via grid search
  4. Applies smoothly via exponential moving average (blend_factor=0.1)
  5. Updates both db_metadata and scoring.py constants

Several architectural decisions distinguish mcp-memory from the original Anthropic server and similar solutions:

The original Anthropic server rewrites the entire knowledge graph to a JSONL file on every operation, with no locking. This causes data corruption under concurrent access. SQLite with WAL mode provides ACID transactions, indexed queries (O(log n) vs linear scan), and safe concurrency — without requiring a separate database server.

Embeddings run locally via ONNX Runtime on CPU. No API keys, no network latency, no rate limits, no vendor lock-in. The tradeoff is ~465 MB of disk space for the model and ~5ms per encoding — acceptable for a local tool.

The intfloat/multilingual-e5-small model produces 384-dim vectors. This is a deliberate balance between quality and footprint:

  • 384 dims × 4 bytes = 1,536 bytes per embedding — small enough for efficient storage and fast KNN search
  • Cosine distance (d = 1 - cos(A, B)) matches how the e5 model was trained
  • Vectors are L2-normalized before storage, enabling dot product as a proxy for cosine similarity
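
A quick NumPy check of the last point: once vectors are L2-normalized, the dot product equals the cosine, and cosine distance is 1 minus it:

```python
import numpy as np

def normalize(v):
    # L2-normalize, as done before storing vectors.
    return v / np.linalg.norm(v)

a = normalize(np.array([1.0, 2.0, 2.0]))
b = normalize(np.array([2.0, 1.0, 2.0]))
cosine_sim = float(a @ b)        # equals cos(A, B) once normalized
cosine_dist = 1.0 - cosine_sim   # d = 1 - cos(A, B)
```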

Search retrieves 3× the requested limit from KNN (e.g., 30 candidates for limit=10), then re-ranks with Limbic Scoring to produce the final top-K. This gives the scoring engine a larger pool to work with, improving result quality without significant overhead.

The e5 model requires task-specific prefixes for optimal retrieval:

  • Queries use "query: " prefix
  • Entities (passages) use "passage: " prefix

This is a requirement of the model’s training methodology — using the wrong prefix significantly degrades search quality.

Embeddings are regenerated from scratch whenever an entity’s content changes. While this costs a full encoding pass each time, it guarantees that the vector always reflects the current state — no stale partial updates, no accumulation artifacts.

The Limbic Scoring constants (GAMMA, BETA_SAL) are tunable via offline grid search:

  • Data source: Shadow mode A/B testing logs (no manual labeling required)
  • Metric: NDCG@K (Normalized Discounted Cumulative Gain at K)
  • Process: auto_tuner.py --tune explores combinations, applies smoothly
  • Persistence: Both db_metadata table and scoring.py module constants

This enables continuous improvement without code changes or service restarts.