Semantic Search
How It Works
Semantic search converts each entity in the knowledge graph into a 384-dimensional numeric vector and stores it in a sqlite-vec virtual table. When you call search_semantic, the engine encodes your query into the same vector space, finds the k nearest neighbors (KNN), and returns the most relevant entities.
The full pipeline — from raw text to ranked results — looks like this:
```mermaid
graph TD
    Input["Entity text or Query"] --> Prefix["Prepend prefix<br/>'query: ' or 'passage: '"]
    Prefix --> Tok["Tokenization<br/>HuggingFace fast tokenizer<br/>trunc=512, pad=512"]
    Tok --> ONNX["ONNX Forward Pass<br/>CPUExecutionProvider<br/>→ (batch, seq_len, 384)"]
    ONNX --> Pool["Mean Pooling<br/>Mask [PAD] via attention_mask<br/>→ (batch, 384)"]
    Pool --> Norm["L2 Normalization<br/>||v|| = 1<br/>→ float32[384]"]
    Norm --> Store{Entity or Query?}
    Store -->|Entity| Vec["Serialize to bytes<br/>384 x 4 = 1,536 bytes<br/>INSERT OR REPLACE into vec0"]
    Store -->|Query| KNN["sqlite-vec KNN<br/>WHERE embedding MATCH ?<br/>ORDER BY distance"]
    KNN --> Results["[{entity_id, distance}]"]
    Results --> Limbic["Limbic Re-rank<br/>salience · temporal · cooc"]
    Limbic --> Output["Top-K results with scoring"]
```

For the complete fusion pipeline that combines KNN with full-text search, see Hybrid Search (FTS5 + RRF).
Embedding Model
The semantic engine uses intfloat/multilingual-e5-small, a model from the multilingual E5 family (published under the intfloat namespace on Hugging Face):
| Property | Value |
|---|---|
| Model ID | intfloat/multilingual-e5-small |
| Dimensions | 384 (float32, ONNX FP32) |
| Languages | 94+ — Spanish, English, French, German, Chinese, Japanese, etc. |
| Runtime | CPU only (CPUExecutionProvider, no GPU required) |
| Size on disk | ~465 MB (model ONNX + tokenizer) |
| Distance metric | Cosine distance: d = 1 - cos(A, B), range [0, 2] |
| Type | Asymmetric retrieval — requires "query: " and "passage: " prefixes |
| Cache path | ~/.cache/mcp-memory-v2/models/ |
Encoding Pipeline
The EmbeddingEngine.encode() method transforms raw text into L2-normalized vectors in five steps:

```python
def encode(self, texts: list[str], task: str = "passage") -> np.ndarray:
    """Encode texts to embeddings.

    task: "query" prepends "query: " prefix, "passage" prepends "passage: ".
    """
    prefix = self.QUERY_PREFIX if task == "query" else self.PASSAGE_PREFIX
    prefixed = [f"{prefix}{t}" for t in texts]
    # Steps 2-5: tokenization → ONNX → mean pooling → L2 normalize
```

Step 1 — Prepend Prefix
Each input text gets a task-specific prefix based on the E5 model’s training:

- `task="query"` prepends `"query: "` — used when encoding search queries
- `task="passage"` prepends `"passage: "` — used when encoding entity text (default)
Step 2 — Tokenization
The HuggingFace fast tokenizer (Rust implementation) runs with two fixed configurations:

- `enable_truncation(max_length=512)` — truncate sequences longer than 512 tokens
- `enable_padding(length=512)` — pad shorter sequences with `[PAD]` to exactly 512 tokens
```python
encoded = self._tokenizer.encode_batch(texts)

input_ids = np.array(
    [e.ids for e in encoded],
    dtype=np.int64,
)
attention_mask = np.array(
    [e.attention_mask for e in encoded],
    dtype=np.int64,
)
```

All sequences leave the tokenizer as uniform (batch, 512) int64 arrays — input_ids for the model and attention_mask to distinguish real tokens from padding.
Step 3 — ONNX Forward Pass
Input names are discovered dynamically from the ONNX graph (self._session.get_inputs()), making the code robust against minor model export variations:

```python
feed: dict[str, np.ndarray] = {}
for name in self._input_names:
    if name == "input_ids":
        feed[name] = input_ids
    elif name == "attention_mask":
        feed[name] = attention_mask
    elif name == "token_type_ids":
        feed[name] = np.zeros_like(input_ids)
    else:
        feed[name] = np.zeros_like(input_ids)

outputs = self._session.run(None, feed)
token_embeddings = outputs[0]  # (batch, seq_len, 384)
```

The session runs on CPUExecutionProvider with graph_optimization_level = ORT_ENABLE_ALL.
Step 4 — Mean Pooling
Average the embeddings of all real tokens (not [PAD]). The attention mask zeroes out padding contributions:

```python
mask_expanded = attention_mask[:, :, np.newaxis].astype(np.float32)
sum_embeddings = np.sum(token_embeddings * mask_expanded, axis=1)
sum_mask = np.clip(mask_expanded.sum(axis=1), a_min=1e-9, a_max=None)
mean_embeddings = sum_embeddings / sum_mask
```

The mask expands to 3D (batch, 512, 1) for element-wise multiplication against token_embeddings. Tokens where attention_mask == 0 (padding) contribute nothing to the average.
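As a sanity check, the same pooling arithmetic can be run on a toy batch (the values are invented, and seq_len is shortened from 512 to 4 for readability):

```python
import numpy as np

# Toy batch: 1 sequence, seq_len 4, dim 2; the last two positions are [PAD].
token_embeddings = np.array([[[1.0, 2.0],
                              [3.0, 4.0],
                              [9.0, 9.0],   # padding junk, must be ignored
                              [9.0, 9.0]]])
attention_mask = np.array([[1, 1, 0, 0]])

# Same three lines as the real pipeline.
mask_expanded = attention_mask[:, :, np.newaxis].astype(np.float32)
sum_embeddings = np.sum(token_embeddings * mask_expanded, axis=1)
sum_mask = np.clip(mask_expanded.sum(axis=1), a_min=1e-9, a_max=None)
mean_embeddings = sum_embeddings / sum_mask

print(mean_embeddings)  # [[2. 3.]] (average of the two real tokens only)
```

If padding were not masked, the 9.0 junk values would drag the average to [[5.5, 6.0]]; the mask keeps them out entirely.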
Step 5 — L2 Normalization
Convert each vector to a unit vector (norm = 1). This allows dot product to serve as a direct proxy for cosine similarity — the distance_metric=cosine that sqlite-vec expects:

```python
norms = np.linalg.norm(mean_embeddings, axis=1, keepdims=True)
norms = np.clip(norms, a_min=1e-9, a_max=None)
normalized = mean_embeddings / norms

return normalized.astype(np.float32)  # (batch, 384)
```

Entity Text Preparation
Before an entity reaches the encoding pipeline, it must be converted from structured data into a single text string. The system uses a Head+Tail+Diversity strategy to fit the most informative content within the model’s practical token budget.
Format
```
"{name} ({entity_type}) | {obs1} | {obs2} | ... | Rel: type → target; ..."
```

Example:

```
MCP Memory v2 (Project) | 8 tasks: T1 → T2 → T3 | Pipeline: Architect → Builder → Auditor | Rel: uses → FastMCP; uses → SQLite
```

Head+Tail+Diversity Strategy
The budget is 480 tokens (MAX_TOKENS = 480), with " | " as the separator between observations:
| Segment | Content | Rationale |
|---|---|---|
| Head | First observations | Most important/stable content — typically the entity’s core description |
| Tail | Last observations | Most recent content — latest updates, status changes |
| Diversity | Selected intermediate observations | Maximizes semantic variety — prevents the embedding from overfitting to a narrow topic |
Relations are appended at the end when they exist, formatted as Rel: type → target; ....
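The selection logic itself is not shown on this page; the following is only a sketch of the idea. The function name, segment sizes, and word-based budget check are all illustrative — the real implementation counts tokens with the tokenizer, not words:

```python
def snapshot_observations(obs: list[str], head: int = 3, tail: int = 3,
                          diversity: int = 2, max_tokens: int = 480) -> str:
    """Illustrative Head+Tail+Diversity selection (names and sizes are made up).

    Keeps the first `head` and last `tail` observations, plus `diversity`
    evenly spaced ones from the middle, then trims to a rough word budget.
    """
    if len(obs) <= head + tail + diversity:
        selected = obs  # everything fits; no selection needed
    else:
        middle = obs[head:-tail]
        step = max(1, len(middle) // (diversity + 1))
        picked = middle[step::step][:diversity]  # spread across the middle
        selected = obs[:head] + picked + obs[-tail:]
    text = " | ".join(selected)
    # Crude stand-in for the 480-token budget: ~1 token per word.
    return " ".join(text.split()[:max_tokens])

print(snapshot_observations([f"obs{i}" for i in range(20)]))
```

With 20 observations, the output keeps obs0–obs2 (head), two spread-out middle observations, and obs17–obs19 (tail), matching the table above.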
KNN Search with sqlite-vec
Vectors are stored in the entity_embeddings sqlite-vec virtual table:

```sql
CREATE VIRTUAL TABLE IF NOT EXISTS entity_embeddings
USING vec0(embedding float[384] distance_metric=cosine);
```

Storage Details
| Property | Value |
|---|---|
| Identifier | rowid (implicit, maps to entities.id) |
| Distance metric | cosine — angular distance, range [0, 2] |
| Vector size | 384 × 4 bytes = 1,536 bytes per embedding |
| Upsert strategy | INSERT OR REPLACE — no duplicate versions |
Serialization
Section titled “Serialization”def serialize_f32(vector: np.ndarray) -> bytes: """Pack a float32 vector into raw bytes for sqlite-vec. A 384-dim vector → 1,536 bytes.""" return struct.pack(f"{len(vector)}f", *vector.flatten())
def deserialize_f32(data: bytes, dim: int = 384) -> np.ndarray: """Unpack raw bytes from sqlite-vec back into a float32 vector.""" return np.frombuffer(data, dtype=np.float32).reshape(dim)Each 384-dim float32 vector serializes to exactly 1,536 bytes of raw data. Storing with INSERT OR REPLACE means updating an entity’s embedding overwrites the old value — no stale versions accumulate.
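A quick round trip with the stdlib struct module (no numpy required) confirms the byte math:

```python
import struct

DIM = 384
vector = [0.0] * DIM
vector[0] = 1.0  # a trivial unit vector along the first axis

blob = struct.pack(f"{DIM}f", *vector)  # float32 packing, as in serialize_f32
print(len(blob))  # 1536

restored = struct.unpack(f"{DIM}f", blob)
print(restored[0], restored[1])  # 1.0 0.0
```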
KNN Query
Section titled “KNN Query”SELECT rowid, distanceFROM entity_embeddingsWHERE embedding MATCH ?ORDER BY distanceLIMIT ?The ? placeholder receives the serialized query vector (1,536 bytes). Results come back ordered by ascending distance — most similar first. The default limit is 10, configurable via the limit parameter.
Cosine Distance
The distance metric is cosine distance, defined as:

```
d(A, B) = 1 - cos(A, B) = 1 - (A · B) / (||A|| × ||B||)
```

Since vectors are L2-normalized (||A|| = ||B|| = 1), this simplifies to d = 1 - A · B.
| Distance | Meaning |
|---|---|
| 0.0 | Identical vectors |
| < 0.3 | Very similar |
| ~ 1.0 | Unrelated |
| 2.0 | Opposite vectors |
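The reference points at the extremes can be checked directly with a small plain-Python helper on toy 2-D vectors:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    # Full formula, without assuming unit vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

u = [1.0, 0.0]
print(cosine_distance(u, [1.0, 0.0]))   # 0.0  (identical vectors)
print(cosine_distance(u, [0.0, 1.0]))   # 1.0  (orthogonal, i.e. unrelated)
print(cosine_distance(u, [-1.0, 0.0]))  # 2.0  (opposite vectors)
```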
After KNN retrieval, results pass through the Limbic System for dynamic re-ranking based on salience, temporal decay, and co-occurrence patterns.
Configuration
Download the Model
Run the download script to fetch and export the ONNX model:
```shell
uv run python scripts/download_model.py
```

This downloads four files to ~/.cache/mcp-memory-v2/models/:

```
~/.cache/mcp-memory-v2/models/
├── model.onnx                # Exported ONNX model (~465 MB)
├── tokenizer.json            # HuggingFace fast tokenizer
├── tokenizer_config.json     # Tokenizer configuration
└── special_tokens_map.json   # Special tokens mapping
```

Lazy Load Behavior
The EmbeddingEngine uses a singleton pattern with two-level lazy loading:
```python
class EmbeddingEngine:
    _instance: "EmbeddingEngine | None" = None

    @classmethod
    def get_instance(cls) -> "EmbeddingEngine":
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance
```

- Import level: mcp_memory.embeddings is imported inside _get_engine(), not at module scope
- Instance level: get_instance() creates the singleton on first call
Practical consequence:
| Moment | Behavior |
|---|---|
| Server startup | Always fast — the model is not loaded |
| First search_semantic call | ~3–5 seconds extra while loading ONNX + tokenizer into memory |
| Subsequent calls | Milliseconds — engine is already in memory |
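Both lazy-load levels can be condensed into a runnable sketch. The sleep stands in for the real 3–5 second model load, and the class body is simplified:

```python
import time

class EmbeddingEngine:
    """Simplified stand-in: __init__ plays the role of the expensive model load."""
    _instance = None

    def __init__(self):
        time.sleep(0.01)  # stands in for loading ONNX model + tokenizer

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            cls._instance = cls()  # pay the load cost exactly once
        return cls._instance

def _get_engine():
    # Import level: the real server imports mcp_memory.embeddings inside this
    # function rather than at module scope, so startup never pays the cost.
    return EmbeddingEngine.get_instance()

first = _get_engine()   # slow path: constructs the singleton
second = _get_engine()  # fast path: returns the same object
print(first is second)  # True
```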
Running Without the Model
The server is designed to degrade gracefully:
| Scenario | Behavior |
|---|---|
| Model downloaded | Semantic search and embedding generation work normally |
| Model not downloaded | search_semantic returns a clear error; all other tools work |
| sqlite-vec unavailable | Server continues without semantic search; CRUD tools function normally |
Troubleshooting
“Embedding model not available”
The model files haven’t been downloaded yet. Run:
```shell
uv run python scripts/download_model.py
```

Then restart the MCP server (or make another search_semantic call — the engine retries loading on demand).
First search is slow (3–5 seconds)
This is expected. The ONNX model (~465 MB) and tokenizer are loaded into memory on the first call. Subsequent calls return in milliseconds. There is no way to pre-warm the engine besides making an initial search.
sqlite-vec not available
If the sqlite-vec extension fails to load, the server starts without vector search capability. All 10 tools (create_entities, search_nodes, open_nodes, etc.) continue working. Only search_semantic is affected.
Embeddings seem wrong after updating observations
Embeddings are regenerated completely when observations change — not updated incrementally. If you’ve added many observations and the results feel off, the entity may have hit the 480-token budget and older observations were dropped from the snapshot. Consider whether critical identifying information is in the entity’s first few observations (the “Head” segment).
Related
- Hybrid Search (FTS5 + RRF) — combining KNN with BM25 full-text search via Reciprocal Rank Fusion
- Limbic System — dynamic re-ranking with salience, temporal decay, and co-occurrence
- Architecture — full system architecture and component overview
- Tools Reference — search_semantic tool parameters and return format