
Architecture

mcp-memory follows a layered architecture with three main components: the MCP server (transport + tools), the storage layer (SQLite), and the embedding engine (ONNX). A fourth layer — the Limbic Scoring engine — sits above storage and orchestrates dynamic re-ranking.

```mermaid
graph TD
Client["MCP Client<br/>(Claude Desktop, OpenCode, Cursor)"]
Client -->|stdio JSON-RPC| FastMCP
subgraph FastMCP["FastMCP Server"]
Tools["19 MCP Tools<br/>(8 Anthropic + 11 extended)"]
Pydantic["Pydantic Validation"]
Engine["EmbeddingEngine<br/>(ONNX, lazy load)"]
Tools --> Pydantic
Tools --> Engine
end
subgraph Storage["Storage Layer"]
MemoryStore["MemoryStore<br/>(SQLite + sqlite-vec + FTS5)"]
SQLite["memory.db<br/>(WAL mode)"]
end
subgraph Scoring["Scoring Layer"]
Limbic["scoring.py (Limbic)<br/>Salience · Decay · Co-oc · Temporal Co-oc Decay"]
Routing["Query Router<br/>(COSINE/LIMBIC/HYBRID)"]
end
subgraph EntitySplit["Entity Splitting"]
Splitter["entity_splitter.py<br/>(TF-IDF + Thresholds)"]
end
subgraph Scripts["Scripts"]
AutoTuner["auto_tuner.py<br/>(Grid Search + NDCG@K)"]
GridSearch["grid_search.py<br/>(Offline GAMMA/BETA_SAL)"]
ABMetrics["ab_metrics.py<br/>(Shadow Mode Metrics)"]
end
Pydantic --> MemoryStore
Engine --> MemoryStore
MemoryStore --> SQLite
MemoryStore --> Limbic
Limbic --> Routing
```

The server starts as a stdio process that listens for JSON-RPC on stdin and responds via stdout. Logs go to stderr to avoid interfering with the MCP protocol.

Default storage path: ~/.config/opencode/mcp-memory/memory.db

| Component | Technology | Version | Purpose |
|---|---|---|---|
| Runtime | Python | >= 3.12 | Minimum language version |
| MCP Server | FastMCP | >= 3.0 | Framework for tool registration and stdio transport |
| Database | SQLite | stdlib | Persistent storage with WAL journaling |
| Vector search | sqlite-vec | >= 0.1.6 | KNN with cosine distance on virtual table vec0 |
| Embeddings | ONNX Runtime | >= 1.17 | CPU inference for sentence embeddings |
| Tokenization | HuggingFace Tokenizers | >= 0.19 | Fast tokenization (Rust implementation) |
| Numerics | NumPy | >= 1.26 | Vector operations and linear algebra |
| Validation | Pydantic | >= 2.0 | Input/output model validation |
| Model download | HuggingFace Hub | >= 0.20 | Fetch models from HF Hub |
| Build system | hatchling | | Python packaging |

mcp-memory communicates over stdio using JSON-RPC:

  • Input: The server reads JSON-RPC requests from stdin
  • Output: Responses are written to stdout
  • Logs: All diagnostic output goes to stderr — never stdout — to avoid polluting the MCP protocol channel

The entry point is registered as mcp-memory, which resolves to mcp_memory.server:main. This makes it compatible with any MCP client that supports stdio transport:

```json
{
  "mcpServers": {
    "memory": {
      "command": ["uvx", "--from", "git+https://github.com/Yarlan1503/mcp-memory", "mcp-memory"]
    }
  }
}
```

The codebase is organized into five core modules:

The entry point and tool registry — only 97 lines. Initializes the FastMCP server, creates the MemoryStore instance, and registers 19 tools by importing from five tool modules (tools/core.py, tools/search.py, tools/entity_mgmt.py, tools/reflections.py, tools/relations.py). Each tool has Pydantic-validated inputs and outputs. The store is shared via a runtime lookup pattern — tool functions access it through _server_mod.store rather than a closure or a global.

The persistence layer — organized as a package with 7 mixins. MemoryStore in storage/__init__.py is a facade class that inherits from CoreMixin, SchemaMixin, RelationsMixin, SearchMixin, AccessMixin, ReflectionsMixin, and ConsolidationMixin. Each mixin handles a specific domain (entity CRUD, schema migrations, relations, search, access tracking, reflections, consolidation). All share a single SQLite connection (self.db) via Python’s MRO (Method Resolution Order).
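
A minimal sketch of the facade-over-mixins idea, with two toy mixins instead of seven (the real mixins carry far more methods):

```python
# Each mixin covers one domain; all of them reach the shared SQLite
# connection through self.db thanks to Python's MRO.
import sqlite3

class CoreMixin:
    def upsert_entity(self, name):
        self.db.execute(
            "INSERT INTO entities(name) VALUES (?) "
            "ON CONFLICT(name) DO NOTHING", (name,))

class SearchMixin:
    def count_entities(self):
        return self.db.execute("SELECT COUNT(*) FROM entities").fetchone()[0]

class MemoryStore(CoreMixin, SearchMixin):
    # Facade: the only class that owns the connection.
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS entities(name TEXT UNIQUE)")

store = MemoryStore()
store.upsert_entity("alpha")
store.upsert_entity("alpha")  # duplicate is ignored by ON CONFLICT
```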

Encapsulates all embedding inference logic using an ONNX model. Implements a singleton pattern with lazy loading — the model is only loaded into memory on first use. Provides the encoding pipeline: prefix prepending → tokenization → ONNX forward pass → mean pooling → L2 normalization.
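
The last two pipeline stages (mean pooling and L2 normalization) can be sketched in NumPy; tokenization and the ONNX forward pass are outside this snippet:

```python
# Mean-pool the token embeddings over real (non-padding) tokens,
# then L2-normalize so cosine similarity reduces to a dot product.
import numpy as np

def pool_and_normalize(token_embeddings, attention_mask):
    mask = attention_mask[..., None].astype(float)   # (seq, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    pooled = summed / mask.sum()
    return pooled / np.linalg.norm(pooled)
```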

The dynamic ranking engine. Computes composite scores from three signals: salience (access frequency + graph degree), temporal decay (exponential with configurable half-life), and co-occurrence (how often entities appear together in results). Also implements Reciprocal Rank Fusion for merging KNN and FTS5 results. See Limbic System for details.
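
As an illustration only — the real formula and weights live in scoring.py and the Limbic System docs — the three signals might combine like this:

```python
# Hypothetical blend of the three signals; weights and structure are
# assumptions, not the project's actual formula.
import math

def temporal_decay(age_days, half_life_days=30.0):
    # Exponential decay with a configurable half-life.
    return 0.5 ** (age_days / half_life_days)

def salience(access_count, degree):
    # Log-damped access frequency plus graph degree.
    return math.log1p(access_count) + math.log1p(degree)

def limbic_score(cosine, age_days, access_count, degree, cooc_boost,
                 beta_sal=0.3, gamma=0.2):
    # Cosine base, salience and co-occurrence boosts, modulated by recency.
    return (cosine + beta_sal * salience(access_count, degree)
            + gamma * cooc_boost) * temporal_decay(age_days)
```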

Automatically splits entities that exceed observation thresholds into focused sub-entities. Uses TF-IDF to group observations by topic and creates contiene/parte_de relations to preserve the knowledge graph structure.

  • Thresholds: Sesion=15, Proyecto=25, DEFAULT=20
  • Topic extraction: TF-IDF with Spanish stop words, min word length 4
  • Split proposal: Returns suggested new entities + relations to create
  • Atomic execution: All-or-nothing transaction via SQLite context manager
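
The all-or-nothing execution maps naturally onto sqlite3's connection context manager, which commits on success and rolls back if an exception escapes; a minimal sketch:

```python
# Atomic batch insert: either every new entity lands, or none do.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entities(name TEXT UNIQUE)")

def execute_split(db, new_entities):
    with db:  # BEGIN ... COMMIT, or ROLLBACK if an exception escapes
        for name in new_entities:
            db.execute("INSERT INTO entities(name) VALUES (?)", (name,))

execute_split(db, ["proj/api", "proj/db"])
try:
    # Second item collides, so neither row of this batch survives.
    execute_split(db, ["proj/ui", "proj/api"])
except sqlite3.IntegrityError:
    pass
```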

Offline optimization script for GAMMA and BETA_SAL hyperparameters:

  • Grid search: Explores GAMMA × BETA_SAL combinations
  • NDCG@K metrics: Uses implicit feedback from shadow mode A/B testing
  • Smooth apply: Exponential moving average to avoid sudden changes
  • CLI: --analyze, --tune, --set-gamma, --set-beta
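
The smooth-apply step is a plain exponential moving average; a sketch with the blend factor described above:

```python
# Blend the freshly tuned value into the current constant so the
# hyperparameter moves gradually rather than jumping.
def smooth_apply(current, tuned, blend_factor=0.1):
    return (1 - blend_factor) * current + blend_factor * tuned
```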

scripts/ab_metrics.py — A/B Testing Metrics

Calculates quality metrics from shadow mode data:

  • NDCG@K: Normalized Discounted Cumulative Gain
  • Lift@K: Proportion of relevant items in top-K vs overall
  • Baseline comparison: Treatment vs cosine-only ranking
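
Hedged sketches of the two metrics; the script's exact definitions (in particular how graded gains are derived from implicit feedback) may differ:

```python
import math

def ndcg_at_k(relevances, k):
    # relevances: graded relevance of results in ranked order.
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

def lift_at_k(relevant_flags, k):
    # Proportion relevant in top-K divided by the overall proportion.
    top = sum(relevant_flags[:k]) / k
    overall = sum(relevant_flags) / len(relevant_flags)
    return top / overall if overall > 0 else 0.0
```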

Imports data from Anthropic’s JSONL format into SQLite. Processes the file line-by-line, tolerates corrupt entries, and generates embeddings in batch at the end. Fully idempotent — running it multiple times produces the same result. See Migration Guide for the full process.

Constants for input validation across tools. Defines MAX_OBS=100, MAX_ENTITIES=50, MAX_OBS_LEN=2000, and MAX_QUERY_LEN=500 to prevent abuse and maintain performance. Also contains A/B testing configuration (USE_AB_TESTING, BASELINE_PROBABILITY).

Three utility functions shared across tool modules: _get_engine() (lazy-loads the EmbeddingEngine singleton), _format_output() (standardizes entity output formatting), and _get_store() (retrieves the MemoryStore instance from the server module). Centralizes the runtime lookup pattern used by all tools.

The retry_on_locked decorator provides exponential backoff with jitter for SQLite write operations under multi-client concurrency. Applied to all 21 write methods in the storage layer. Parameters: max_retries=5, base_delay=0.1s, max_delay=2.0s, with 10% jitter to prevent thundering herd. Essential when multiple MCP clients (e.g., two opencode sessions) write to the same database simultaneously. Since v2.2, each retry automatically performs a rollback() before re-attempting the operation, ensuring a clean transaction state.
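
An illustrative reconstruction of such a decorator with the parameters above; the project's actual implementation may differ in its details:

```python
# Exponential backoff with jitter for "database is locked" errors,
# rolling back before each retry to restore a clean transaction state.
import functools
import random
import sqlite3
import time

def retry_on_locked(max_retries=5, base_delay=0.1, max_delay=2.0, jitter=0.1):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(self, *args, **kwargs)
                except sqlite3.OperationalError as exc:
                    if "locked" not in str(exc) or attempt == max_retries:
                        raise
                    self.db.rollback()  # clean transaction state (v2.2)
                    delay = min(base_delay * 2 ** attempt, max_delay)
                    time.sleep(delay * (1 + random.uniform(-jitter, jitter)))
        return wrapper
    return decorator
```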

When a client invokes create_entities, data flows through four stages before persisting with its semantic embedding:

```mermaid
sequenceDiagram
participant C as Client
participant F as FastMCP
participant P as Pydantic
participant S as MemoryStore
participant E as EmbeddingEngine
participant V as sqlite-vec
C->>F: create_entities([{name, entityType, observations}])
F->>P: EntityInput.model_validate(dict)
Note over P: Validates name (non-empty),<br/>entityType, observations
P->>S: upsert_entity(name, type)
S->>S: add_observations(entity_id, obs)<br/>[dedup by content]
S->>E: prepare_entity_text(name, type, obs)
Note over E: Head+Tail+Diversity<br/>selection, 480 token budget
E->>E: encode([text]) → float[384]
E->>V: INSERT OR REPLACE<br/>(rowid, embedding)
V-->>C: Entity created with embedding
```

Step by step:

  1. Client → FastMCP: The client sends a JSON-RPC request with a list of entity dicts
  2. FastMCP → Pydantic: Each dict is validated against EntityInput — name must be non-empty, observations default to []
  3. Pydantic → MemoryStore: The entity is upserted (INSERT ... ON CONFLICT(name) DO UPDATE). Observations are added with deduplication by exact content
  4. MemoryStore → EmbeddingEngine: Entity text is prepared using Head+Tail+Diversity selection (480-token budget) and encoded into a 384-dim vector
  5. EmbeddingEngine → sqlite-vec: The vector (1,536 bytes as float32) is stored with INSERT OR REPLACE using the entity’s ID as rowid
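
Step 3's upsert can be sketched with SQLite's ON CONFLICT clause (column names here are illustrative):

```python
# Upsert: insert a new entity, or update the existing row in place
# when the UNIQUE name already exists.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE entities(
    id INTEGER PRIMARY KEY, name TEXT UNIQUE, entity_type TEXT)""")

def upsert_entity(name, entity_type):
    db.execute(
        "INSERT INTO entities(name, entity_type) VALUES (?, ?) "
        "ON CONFLICT(name) DO UPDATE SET entity_type = excluded.entity_type",
        (name, entity_type))
    return db.execute(
        "SELECT id FROM entities WHERE name = ?", (name,)).fetchone()[0]

first = upsert_entity("mcp-memory", "Proyecto")
second = upsert_entity("mcp-memory", "Sesion")  # updates the same row
```

Because the row id is stable across upserts, it can safely double as the rowid for the entity's embedding in the vector table.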

Semantic search uses a hybrid pipeline: KNN vector search runs in parallel with FTS5 full-text search, results are merged via Reciprocal Rank Fusion, then re-ranked by the Limbic Scoring engine.

```mermaid
sequenceDiagram
participant C as Client
participant F as FastMCP
participant KNN as KNN (sqlite-vec)
participant FTS as FTS5 (BM25)
participant RRF as RRF Merge
participant L as Limbic Scoring
participant S as MemoryStore
C->>F: search_semantic("project memory", limit=10)
F->>F: Check engine.available
par Parallel Search
F->>KNN: encode(query) → KNN 3× limit
F->>FTS: BM25 on name/type/obs
end
KNN-->>RRF: [{id, distance}]
FTS-->>RRF: [{id, rank}]
Note over RRF: score(d) = Σ 1/(k + rank_i)<br/>k = 60
RRF->>L: Fused candidates + scores
L->>L: Fetch access, degree,<br/>co-occurrence data
L->>L: Compute limbic_score<br/>per candidate
L->>S: Top-K entity IDs (hydrate)
S->>S: get_entity_by_id() +<br/>get_observations()
Note over S: record_access() +<br/>record_co_occurrences()<br/>(post-response, best-effort)
S-->>C: Results with limbic_score<br/>+ scoring breakdown
```

Step by step:

  1. Client → FastMCP: The query string and optional limit arrive via JSON-RPC
  2. Availability check: If the ONNX model isn’t loaded, the server returns a clear error with download instructions
  3. Parallel search: Two branches execute simultaneously:
    • KNN: The query is encoded with the "query: " prefix and compared against stored vectors via sqlite-vec (retrieves 3× limit candidates)
    • FTS5: BM25 ranking over entity names, types, and observation text
  4. RRF Merge: Results from both branches are fused using Reciprocal Rank Fusion (k=60). Entities appearing in both rankings receive a combined score boost
  5. Limbic Re-rank: The merged candidates are scored by the Limbic System, which applies salience, temporal decay, and co-occurrence boosts. Query routing (detect_query_type()) determines the strategy (COSINE_HEAVY/LIMBIC_HEAVY/HYBRID_BALANCED) based on linguistic features and k_limit
  6. Hydration: Top-K entity IDs are hydrated with full entity data (name, type, observations) from SQLite
  7. Post-response tracking: Access events and co-occurrences are recorded for future ranking improvements — this is best-effort and doesn’t block the response
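
Step 4's fusion follows the formula in the diagram, score(d) = Σ 1/(k + rank_i) with k = 60; a compact sketch:

```python
# Reciprocal Rank Fusion: documents appearing in both rankings
# accumulate contributions from each, boosting their fused score.
def rrf_merge(knn_ids, fts_ids, k=60):
    scores = {}
    for ranking in (knn_ids, fts_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note how entity 2 below wins despite being ranked second in the KNN list, because it also appears in the FTS5 ranking.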

For details on how the Limbic Scoring formula works, see Limbic System.

The EmbeddingEngine uses a singleton + lazy load pattern to keep startup fast:

```mermaid
graph TD
A["Server starts<br/>EmbeddingEngine._instance = None"] --> B["First call to get_instance()"]
B --> C{Model files exist<br/>in ~/.cache/mcp-memory-v2/models/?}
C -->|Yes| D["Load ONNX + Tokenizer<br/>_available = True"]
C -->|No| E["_available = False<br/>Server continues without embeddings"]
D --> F["Ready for search_semantic<br/>and embedding generation"]
```

The server starts without loading the model. The 8 Anthropic-compatible tools work immediately using only SQLite, as do most extended tools. The embedding engine initializes on demand — the first time a tool needs it.

Two-level lazy loading:

  1. Import level: mcp_memory.embeddings is imported inside _get_engine(), not at module scope in server.py
  2. Instance level: EmbeddingEngine.get_instance() creates the singleton on first call

| Scenario | Behavior |
|---|---|
| Model downloaded | First search_semantic call takes ~3-5 seconds (loading); subsequent calls take milliseconds |
| Model not downloaded | search_semantic returns a clear error; the remaining tools work normally |
| sqlite-vec unavailable | Server continues without vector search; CRUD operations unaffected |
~/.config/opencode/mcp-memory/memory.db

The directory is created automatically if it doesn’t exist. A single file holds all data — entities, observations, relations, embeddings, and scoring metadata.

SQLite is configured with Write-Ahead Logging (WAL) for safe concurrent access:

```sql
PRAGMA journal_mode = WAL;     -- Concurrent reads without blocking writes
PRAGMA busy_timeout = 10000;   -- Wait up to 10 s if locked
PRAGMA synchronous = NORMAL;   -- Balance between safety and speed
PRAGMA cache_size = -64000;    -- 64 MB page cache
PRAGMA temp_store = MEMORY;    -- Temporary tables in RAM
PRAGMA foreign_keys = ON;      -- Enforce referential integrity
```
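
Applied from Python, the setup might look like this (a sketch; the storage layer configures its own connection):

```python
# Open a connection and apply the pragmas listed above.
import sqlite3

def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("PRAGMA busy_timeout = 10000")
    db.execute("PRAGMA synchronous = NORMAL")
    db.execute("PRAGMA cache_size = -64000")
    db.execute("PRAGMA temp_store = MEMORY")
    db.execute("PRAGMA foreign_keys = ON")
    db.execute("PRAGMA journal_mode = WAL")  # in-memory dbs report 'memory'
    return db
```
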
| Operation | Behavior |
|---|---|
| Concurrent reads | Allowed — WAL supports multiple simultaneous readers |
| Writes | Sequential — single writer, but readers aren't blocked |
| Lock contention | Writers wait up to 10 seconds (busy_timeout) for a lock |

Starting from v2.2, write operations use retry_on_locked — exponential backoff with jitter that handles database is locked errors transparently. Each retry performs an automatic rollback() before re-attempting, and long-running write operations (like add_observations) use BEGIN IMMEDIATE to acquire the write lock upfront. This enables safe multi-client access (e.g., two opencode sessions writing concurrently) without manual retry logic.
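
A minimal sketch of the BEGIN IMMEDIATE idiom (the toy `add_observations` below is far simpler than the real method):

```python
# BEGIN IMMEDIATE acquires the write lock up front, so a long write
# cannot be surprised by lock contention mid-transaction.
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# leaving transaction control fully explicit.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE observations(content TEXT)")

def add_observations(rows):
    db.execute("BEGIN IMMEDIATE")  # take the write lock now
    try:
        db.executemany("INSERT INTO observations VALUES (?)",
                       [(r,) for r in rows])
        db.execute("COMMIT")
    except sqlite3.OperationalError:
        db.execute("ROLLBACK")
        raise
```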

| Table | Type | Purpose |
|---|---|---|
| entities | Regular | Graph nodes (id, name, entity_type, timestamps) |
| observations | Regular | Facts attached to entities (entity_id FK, content) |
| relations | Regular | Graph edges (from_entity, to_entity, relation_type) |
| db_metadata | Regular | System key-value metadata |
| entity_embeddings | Virtual (vec0) | 384-dim vectors with cosine distance |
| entity_fts | Virtual (FTS5) | Full-text search with BM25 ranking |
| entity_access | Regular | Access tracking for Limbic Scoring |
| co_occurrences | Regular | Co-occurrence tracking for Limbic Scoring |

For the complete schema DDL, index definitions, and Pydantic model details, see API Reference.

MCP Memory v2 includes a shadow-mode A/B testing system that compares limbic scoring against a cosine-only baseline without affecting user experience.

| Aspect | Description |
|---|---|
| Shadow mode | Every search_semantic call runs both baseline and limbic ranking |
| Assignment | Hash-based deterministic (query text → bucket) or random (10% baseline) |
| Logging | search_events and search_results tables store raw rankings |
| Metrics | ab_metrics.py computes NDCG@K, Lift@K from logged data |
| No user impact | Baseline results are logged but never returned to users |

```python
USE_AB_TESTING = True
BASELINE_PROBABILITY = 0.1  # 10% of queries are baseline
```
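
The deterministic branch of the assignment might be sketched like this; the project's exact hashing scheme is an assumption here:

```python
# Deterministic bucket assignment: the same query text always gets
# the same treatment, across runs and processes.
import hashlib

BASELINE_PROBABILITY = 0.1

def assign_treatment(query: str) -> str:
    # sha256 is stable across processes, unlike built-in hash().
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "baseline" if bucket < BASELINE_PROBABILITY else "limbic"
```
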
| Table | Purpose |
|---|---|
| search_events | Query metadata: text, treatment, k_limit, timestamp, duration |
| search_results | Per-entity ranking data: entity_id, rank, limbic_score, cosine_sim |
| implicit_feedback | Re-access events for NDCG calculation |

  1. Collect shadow mode data via normal search_semantic usage
  2. Run python scripts/auto_tuner.py --tune when enough data accumulates
  3. Script finds optimal GAMMA × BETA_SAL via grid search
  4. Applies smoothly via exponential moving average (blend_factor=0.1)
  5. Updates both db_metadata and scoring.py constants

Several architectural decisions distinguish mcp-memory from the original Anthropic server and similar solutions:

The original Anthropic server rewrites the entire knowledge graph to a JSONL file on every operation, with no locking. This causes data corruption under concurrent access. SQLite with WAL mode provides ACID transactions, indexed queries (O(log n) vs linear scan), and safe concurrency — without requiring a separate database server.

Embeddings run locally via ONNX Runtime on CPU. No API keys, no network latency, no rate limits, no vendor lock-in. The tradeoff is ~465 MB of disk space for the model and ~5ms per encoding — acceptable for a local tool.

The intfloat/multilingual-e5-small model produces 384-dim vectors. This is a deliberate balance between quality and footprint:

  • 384 dims × 4 bytes = 1,536 bytes per embedding — small enough for efficient storage and fast KNN search
  • Cosine distance (d = 1 - cos(A, B)) matches how the e5 model was trained
  • Vectors are L2-normalized before storage, enabling dot product as a proxy for cosine similarity
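
A quick NumPy check of the last point: once vectors are L2-normalized, the dot product equals the cosine, and cosine distance is 1 minus it:

```python
import numpy as np

def normalize(v):
    # L2-normalize, as done before storing vectors.
    return v / np.linalg.norm(v)

a = normalize(np.array([1.0, 2.0, 2.0]))
b = normalize(np.array([2.0, 1.0, 2.0]))
cosine_sim = float(a @ b)        # equals cos(A, B) once normalized
cosine_dist = 1.0 - cosine_sim   # d = 1 - cos(A, B)
```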

Search retrieves 3× the requested limit from KNN (e.g., 30 candidates for limit=10), then re-ranks with Limbic Scoring to produce the final top-K. This gives the scoring engine a larger pool to work with, improving result quality without significant overhead.

The e5 model requires task-specific prefixes for optimal retrieval:

  • Queries use "query: " prefix
  • Entities (passages) use "passage: " prefix

This is a requirement of the model’s training methodology — using the wrong prefix significantly degrades search quality.

Embeddings are regenerated from scratch whenever an entity’s content changes. While this costs a full encoding pass each time, it guarantees that the vector always reflects the current state — no stale partial updates, no accumulation artifacts.

The Limbic Scoring constants (GAMMA, BETA_SAL) are tunable via offline grid search:

  • Data source: Shadow mode A/B testing logs (no manual labeling required)
  • Metric: NDCG@K (Normalized Discounted Cumulative Gain at K)
  • Process: auto_tuner.py --tune explores combinations, applies smoothly
  • Persistence: Both db_metadata table and scoring.py module constants

This enables continuous improvement without code changes or service restarts.