How It Works
Semantic search converts each entity in the knowledge graph into a 384-dimensional numeric vector and stores it in a sqlite-vec virtual table. When you call search_semantic, the engine encodes your query into the same vector space, finds the k nearest neighbors (KNN), and returns the most relevant entities.
The full pipeline — from raw text to ranked results — looks like this:
graph TD
Input["Entity text or Query"] --> Prefix["No prefix needed<br/>(MiniLM unified encoding)"]
Prefix --> Tok["Tokenization<br/>HuggingFace fast tokenizer<br/>trunc=512, dynamic pad (×8)"]
Tok --> ONNX["ONNX Forward Pass<br/>CPUExecutionProvider<br/>→ (batch, seq_len, 384)"]
ONNX --> Pool["Mean Pooling<br/>Mask [PAD] via attention_mask<br/>→ (batch, 384)"]
Pool --> Norm["L2 Normalization<br/>||v|| = 1<br/>→ float32[384]"]
Norm --> Store{Entity or Query?}
Store -->|Entity| Vec["Serialize to bytes<br/>384 x 4 = 1,536 bytes<br/>INSERT OR REPLACE into vec0"]
Store -->|Query| KNN["sqlite-vec KNN<br/>WHERE embedding MATCH ?<br/>ORDER BY distance"]
KNN --> Results["[{entity_id, distance}]"]
Results --> Limbic["Limbic Re-rank<br/>salience · temporal · cooc"]
Limbic --> Output["Top-K results with scoring"]
For the complete fusion pipeline that combines KNN with full-text search, see Hybrid Search (FTS5 + RRF).
Embedding Model
The semantic engine uses sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 from HuggingFace (MiniLM family, optimized for sentence embeddings):
| Property | Value |
|---|---|
| Model ID | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 |
| Dimensions | 384 (float32, ONNX FP32) |
| Languages | 94+ — Spanish, English, French, German, Chinese, Japanese, etc. |
| Runtime | CPU only (CPUExecutionProvider, no GPU required) |
| Size on disk | ~465 MB (model ONNX + tokenizer) |
| Distance metric | Cosine distance: d = 1 - cos(A, B), range [0, 2] |
| Type | Unified encoding — no task-specific prefixes required |
| Cache path | ~/.cache/mcp-memory-v2/models/ |
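A minimal usage sketch (engine class and import path as described under Lazy Load Behavior below; the input text is illustrative). Each output row is a unit-length float32 vector:
import numpy as np
from mcp_memory.embeddings import EmbeddingEngine
engine = EmbeddingEngine.get_instance()   # singleton; loads the ONNX model on first use
vectors = engine.encode(["MCP Memory v2 stores a knowledge graph of entities and relations"])
print(vectors.shape)               # (1, 384)
print(np.linalg.norm(vectors[0]))  # ≈ 1.0, L2-normalized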
Encoding Pipeline
The EmbeddingEngine.encode() method transforms raw text into L2-normalized vectors in five steps:
def encode(self, texts: list[str], task: str = "passage") -> np.ndarray:
"""Encode texts to embeddings.
MiniLM uses unified encoding — task parameter is optional/legacy.
"""
prefix = self.QUERY_PREFIX if task == "query" else self.PASSAGE_PREFIX
prefixed = [f"{prefix}{t}" for t in texts]
    # Steps 2-5: tokenization → ONNX → mean pooling → L2 normalize
Step 1 — Prepend Prefix
MiniLM uses unified encoding — the same model processes both queries and passages without task-specific prefixes. The optional task parameter remains in the API for backward compatibility but has no semantic effect on the output.
Step 2 — Tokenization
The HuggingFace fast tokenizer (Rust implementation) runs with two fixed settings:
- enable_truncation(max_length=512) — truncate sequences longer than 512 tokens
- enable_padding(pad_to_multiple_of=8) — dynamic padding to the nearest multiple of 8 tokens per batch
encoded = self._tokenizer.encode_batch(texts)
input_ids = np.array(
[e.ids for e in encoded],
dtype=np.int64,
)
attention_mask = np.array(
[e.attention_mask for e in encoded],
dtype=np.int64,
)
All sequences leave the tokenizer as (batch, max_len) int64 arrays where max_len is the longest sequence in the batch rounded up to a multiple of 8 — input_ids for the model and attention_mask to distinguish real tokens from padding.
:::tip
Dynamic padding reduces inference time 8×–50× compared to fixed 512-token padding, especially for short queries and entities.
:::
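The two settings map onto the tokenizers library roughly like this (a sketch; the tokenizer.json path comes from the cache layout shown under Configuration, and the engine's actual initialization may differ in detail):
from pathlib import Path
from tokenizers import Tokenizer
cache = Path.home() / ".cache/mcp-memory-v2/models"
tokenizer = Tokenizer.from_file(str(cache / "tokenizer.json"))
tokenizer.enable_truncation(max_length=512)     # cut sequences longer than 512 tokens
tokenizer.enable_padding(pad_to_multiple_of=8)  # pad each batch up to the nearest multiple of 8
encoded = tokenizer.encode_batch(["short query", "a somewhat longer entity description"])
print([len(e.ids) for e in encoded])            # equal lengths, a multiple of 8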
Step 3 — ONNX Forward Pass
Input names are discovered dynamically from the ONNX graph (self._session.get_inputs()), making the code robust against minor model export variations:
feed: dict[str, np.ndarray] = {}
for name in self._input_names:
if name == "input_ids":
feed[name] = input_ids
elif name == "attention_mask":
feed[name] = attention_mask
elif name == "token_type_ids":
feed[name] = np.zeros_like(input_ids)
else:
feed[name] = np.zeros_like(input_ids)
outputs = self._session.run(None, feed)
token_embeddings = outputs[0] # (batch, seq_len, 384)
The session runs on CPUExecutionProvider with graph_optimization_level = ORT_ENABLE_ALL.
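A sketch of the corresponding onnxruntime setup (paths and option names as documented here; the engine's real initialization may differ):
import onnxruntime as ort
from pathlib import Path
model_path = Path.home() / ".cache/mcp-memory-v2/models/model.onnx"
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession(
    str(model_path),
    sess_options=options,
    providers=["CPUExecutionProvider"],   # CPU only, no GPU required
)
input_names = [i.name for i in session.get_inputs()]   # discovered dynamically, as described above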
Step 4 — Mean Pooling
Average the embeddings of all real tokens (not [PAD]). The attention mask zeroes out padding contributions:
mask_expanded = attention_mask[:, :, np.newaxis].astype(np.float32)
sum_embeddings = np.sum(token_embeddings * mask_expanded, axis=1)
sum_mask = np.clip(mask_expanded.sum(axis=1), a_min=1e-9, a_max=None)
mean_embeddings = sum_embeddings / sum_mask
The mask expands to 3D (batch, seq_len, 1) for element-wise multiplication against token_embeddings. Tokens where attention_mask == 0 (padding) contribute nothing to the average.
Step 5 — L2 Normalization
Convert each vector to a unit vector (norm = 1). This lets the dot product act as a direct proxy for cosine similarity, which is what sqlite-vec's distance_metric=cosine relies on:
norms = np.linalg.norm(mean_embeddings, axis=1, keepdims=True)
norms = np.clip(norms, a_min=1e-9, a_max=None)
normalized = mean_embeddings / norms
return normalized.astype(np.float32) # (batch, 384)
:::tip
After L2 normalization, cosine similarity between two vectors equals their dot product: cos(A, B) = A · B when ||A|| = ||B|| = 1. This is why sqlite-vec can use an efficient dot-product kernel internally while reporting results as cosine distance.
:::
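A quick NumPy check of that identity (illustrative only):
import numpy as np
rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # L2-normalize, as the engine does
cosine = float(np.dot(a, b)) / float(np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(np.dot(a, b), cosine))   # True: dot product equals cosine similarity
print(1.0 - float(np.dot(a, b)))          # the cosine distance sqlite-vec would report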
Entity Text Preparation
Before an entity reaches the encoding pipeline, it must be converted from structured data into a single text string. The system uses a Head+Tail+Diversity strategy to fit the most informative content within the model’s practical token budget.
Format
"{name} ({entity_type}) | {obs1} | {obs2} | ... | Rel: type → target; ..."
Example:
MCP Memory v2 (Project) | 8 tasks: T1 → T2 → T3 | Pipeline: Architect → Builder → Auditor | Rel: uses → FastMCP; uses → SQLite
Head+Tail+Diversity Strategy
The budget is 480 tokens (MAX_TOKENS = 480), with " | " as the separator between observations:
| Segment | Content | Rationale |
|---|---|---|
| Head | First observations | Most important/stable content — typically the entity’s core description |
| Tail | Last observations | Most recent content — latest updates, status changes |
| Diversity | Selected intermediate observations | Maximizes semantic variety — prevents the embedding from overfitting to a narrow topic |
Relations are appended at the end when they exist, formatted as Rel: type → target; ....
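An illustrative sketch of the selection (simplified: the exact head/tail sizes, the diversity heuristic, and token-level trimming are assumptions, and build_entity_text is a hypothetical name):
def build_entity_text(name, entity_type, observations, relations, max_tokens=480):
    """Hypothetical Head+Tail+Diversity selection within a 480-token budget."""
    head = observations[:2]                                     # Head: first, most stable observations
    tail = observations[-2:] if len(observations) > 4 else []   # Tail: most recent updates
    middle = observations[2:-2] if len(observations) > 4 else []
    diversity = middle[:: max(1, len(middle) // 3)] if middle else []  # spread picks across the middle
    parts = [f"{name} ({entity_type})", *head, *diversity, *tail]
    text = " | ".join(parts)
    if relations:
        text += " | Rel: " + "; ".join(f"{r['type']} → {r['target']}" for r in relations)
    return text   # the real implementation trims to max_tokens using the tokenizer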
:::caution
The text is generated from a selected snapshot of observations — not necessarily all of them. Each time an entity’s observations change, the embedding is completely regenerated from scratch. There is no incremental update; the entire entity text is rebuilt and re-encoded.
:::
KNN Search with sqlite-vec
Vectors are stored in the entity_embeddings sqlite-vec virtual table:
CREATE VIRTUAL TABLE IF NOT EXISTS entity_embeddings
USING vec0(embedding float[384] distance_metric=cosine);
Storage Details
| Property | Value |
|---|---|
| Identifier | rowid (implicit, maps to entities.id) |
| Distance metric | cosine — angular distance, range [0, 2] |
| Vector size | 384 × 4 bytes = 1,536 bytes per embedding |
| Upsert strategy | INSERT OR REPLACE — no duplicate versions |
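Setting this up from Python with the sqlite-vec package could look like this (a sketch; the database path is illustrative, and sqlite_vec.load() registers the extension on the connection):
import sqlite3
import sqlite_vec
db = sqlite3.connect("memory.db")
db.enable_load_extension(True)
sqlite_vec.load(db)                  # registers the vec0 virtual table module
db.enable_load_extension(False)
db.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS entity_embeddings "
    "USING vec0(embedding float[384] distance_metric=cosine)"
)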
Serialization
import struct
import numpy as np
def serialize_f32(vector: np.ndarray) -> bytes:
    """Pack a float32 vector into raw bytes for sqlite-vec.
    A 384-dim vector → 1,536 bytes."""
    return struct.pack(f"{len(vector)}f", *vector.flatten())
def deserialize_f32(data: bytes, dim: int = 384) -> np.ndarray:
    """Unpack raw bytes from sqlite-vec back into a float32 vector."""
    return np.frombuffer(data, dtype=np.float32).reshape(dim)
Each 384-dim float32 vector serializes to exactly 1,536 bytes of raw data. Storing with INSERT OR REPLACE means updating an entity’s embedding overwrites the old value — no stale versions accumulate.
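An upsert then looks like this (sketch; entity_id and vec are placeholders, and db is the connection from the previous sketch):
# vec: float32[384] from EmbeddingEngine.encode(); entity_id: the entities.id primary key
db.execute(
    "INSERT OR REPLACE INTO entity_embeddings (rowid, embedding) VALUES (?, ?)",
    (entity_id, serialize_f32(vec)),
)
db.commit()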
KNN Query
SELECT rowid, distance
FROM entity_embeddings
WHERE embedding MATCH ?
ORDER BY distance
LIMIT ?
The ? placeholder receives the serialized query vector (1,536 bytes). Results come back ordered by ascending distance — most similar first. The default limit is 10, configurable via the limit parameter.
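Running the same query from Python (sketch, reusing db and serialize_f32 from the sketches above):
query_vec = EmbeddingEngine.get_instance().encode(["how does the pipeline work?"])[0]
rows = db.execute(
    "SELECT rowid, distance FROM entity_embeddings "
    "WHERE embedding MATCH ? ORDER BY distance LIMIT ?",
    (serialize_f32(query_vec), 10),
).fetchall()
# rows: [(entity_id, distance), ...], most similar first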
Cosine Distance
The distance metric is cosine distance, defined as:
d(A, B) = 1 - cos(A, B) = 1 - (A · B) / (||A|| × ||B||)
Since vectors are L2-normalized (||A|| = ||B|| = 1), this simplifies to d = 1 - A · B.
| Distance | Meaning |
|---|---|
| 0.0 | Identical vectors |
| < 0.3 | Very similar |
| ~ 1.0 | Unrelated |
| 2.0 | Opposite vectors |
After KNN retrieval, results pass through the Limbic System for dynamic re-ranking based on salience, temporal decay, and co-occurrence patterns.
Configuration
Download the Model
Run the download script to fetch and export the ONNX model:
uv run python scripts/download_model.py
This downloads four files to ~/.cache/mcp-memory-v2/models/:
~/.cache/mcp-memory-v2/models/
├── model.onnx # Exported ONNX model (~465 MB)
├── tokenizer.json # HuggingFace fast tokenizer
├── tokenizer_config.json # Tokenizer configuration
└── special_tokens_map.json # Special tokens mapping
:::tip
The download is a one-time operation. After the files are in place, the engine loads them from disk on every startup. No network access is needed at runtime.
:::
Lazy Load Behavior
The EmbeddingEngine uses a singleton pattern with two-level lazy loading:
class EmbeddingEngine:
_instance: "EmbeddingEngine | None" = None
@classmethod
def get_instance(cls) -> "EmbeddingEngine":
if cls._instance is None:
cls._instance = cls()
return cls._instance
- Import level: mcp_memory.embeddings is imported inside _get_engine(), not at module scope (see the sketch below)
- Instance level: get_instance() creates the singleton on first call
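A sketch of what the deferred-import helper can look like (only the name _get_engine() and the import-inside-the-function behavior are documented; the body below is an assumption):
def _get_engine() -> "EmbeddingEngine | None":
    # Nothing heavy happens at server startup; the import is deferred to the first call
    try:
        from mcp_memory.embeddings import EmbeddingEngine
    except Exception:
        return None   # model or runtime missing: degrade gracefully
    return EmbeddingEngine.get_instance()   # singleton created on first use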
Practical consequence:
| Moment | Behavior |
|---|---|
| Server startup | Always fast — the model is not loaded |
| First search_semantic call | ~3–5 seconds extra while loading ONNX + tokenizer into memory |
| Subsequent calls | Milliseconds — engine is already in memory |
Running Without the Model
The server is designed to degrade gracefully:
| Scenario | Behavior |
|---|---|
| Model downloaded | Semantic search and embedding generation work normally |
| Model not downloaded | search_semantic returns a clear error; all other tools work |
| sqlite-vec unavailable | Server continues without semantic search; CRUD tools function normally |
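The sqlite-vec fallback can be sketched as a guarded import at startup (illustrative; the server's actual error handling may differ):
try:
    import sqlite_vec
    sqlite_vec.load(db)                # db: the already-open SQLite connection
    semantic_search_available = True
except Exception:
    semantic_search_available = False  # CRUD tools keep working; search_semantic reports an error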
Troubleshooting
“Embedding model not available”
The model files haven’t been downloaded yet. Run:
uv run python scripts/download_model.py
Then restart the MCP server (or make another search_semantic call — the engine retries loading on demand).
First search is slow (3–5 seconds)
This is expected. The ONNX model (~465 MB) and tokenizer are loaded into memory on the first call. Subsequent calls return in milliseconds. There is no way to pre-warm the engine besides making an initial search.
sqlite-vec not available
If the sqlite-vec extension fails to load, the server starts without vector search capability. All 10 tools (create_entities, search_nodes, open_nodes, etc.) continue working. Only search_semantic is affected.
Embeddings seem wrong after updating observations
Embeddings are regenerated completely when observations change — not updated incrementally. If you’ve added many observations and the results feel off, the entity may have hit the 480-token budget and older observations were dropped from the snapshot. Consider whether critical identifying information is in the entity’s first few observations (the “Head” segment).
Related
- Hybrid Search (FTS5 + RRF) — combining KNN with BM25 full-text search via Reciprocal Rank Fusion
- Limbic System — dynamic re-ranking with salience, temporal decay, and co-occurrence
- Architecture — full system architecture and component overview
- Tools Reference — search_semantic tool parameters and return format