
Semantic Search

Semantic search converts each entity in the knowledge graph into a 384-dimensional numeric vector and stores it in a sqlite-vec virtual table. When you call search_semantic, the engine encodes your query into the same vector space, finds the k nearest neighbors (KNN), and returns the most relevant entities.

The full pipeline — from raw text to ranked results — looks like this:

graph TD
Input["Entity text or Query"] --> Prefix["Prepend prefix<br/>'query: ' or 'passage: '"]
Prefix --> Tok["Tokenization<br/>HuggingFace fast tokenizer<br/>trunc=512, pad=512"]
Tok --> ONNX["ONNX Forward Pass<br/>CPUExecutionProvider<br/>→ (batch, seq_len, 384)"]
ONNX --> Pool["Mean Pooling<br/>Mask [PAD] via attention_mask<br/>→ (batch, 384)"]
Pool --> Norm["L2 Normalization<br/>||v|| = 1<br/>→ float32[384]"]
Norm --> Store{Entity or Query?}
Store -->|Entity| Vec["Serialize to bytes<br/>384 x 4 = 1,536 bytes<br/>INSERT OR REPLACE into vec0"]
Store -->|Query| KNN["sqlite-vec KNN<br/>WHERE embedding MATCH ?<br/>ORDER BY distance"]
KNN --> Results["[{entity_id, distance}]"]
Results --> Limbic["Limbic Re-rank<br/>salience · temporal · cooc"]
Limbic --> Output["Top-K results with scoring"]

For the complete fusion pipeline that combines KNN with full-text search, see Hybrid Search (FTS5 + RRF).

The semantic engine uses intfloat/multilingual-e5-small, a model from the multilingual E5 family (developed by Microsoft Research and published under the intfloat handle on Hugging Face):

| Property | Value |
| --- | --- |
| Model ID | intfloat/multilingual-e5-small |
| Dimensions | 384 (float32, ONNX FP32) |
| Languages | 94+ — Spanish, English, French, German, Chinese, Japanese, etc. |
| Runtime | CPU only (CPUExecutionProvider, no GPU required) |
| Size on disk | ~465 MB (ONNX model + tokenizer) |
| Distance metric | Cosine distance: d = 1 - cos(A, B), range [0, 2] |
| Type | Asymmetric retrieval — requires "query: " and "passage: " prefixes |
| Cache path | ~/.cache/mcp-memory-v2/models/ |

The EmbeddingEngine.encode() method transforms raw text into L2-normalized vectors in five steps:

def encode(self, texts: list[str], task: str = "passage") -> np.ndarray:
    """Encode texts to embeddings.

    task: "query" prepends "query: " prefix, "passage" prepends "passage: ".
    """
    # Step 1: prepend the task-specific E5 prefix
    prefix = self.QUERY_PREFIX if task == "query" else self.PASSAGE_PREFIX
    prefixed = [f"{prefix}{t}" for t in texts]
    # Steps 2-5: tokenization → ONNX → mean pooling → L2 normalize

Each input text gets a task-specific prefix based on the E5 model’s training:

  • task="query": prepends "query: " — used when encoding search queries
  • task="passage": prepends "passage: " — used when encoding entity text (default)
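The prefixing step is simple enough to sketch standalone (a minimal sketch; `QUERY_PREFIX` and `PASSAGE_PREFIX` mirror the class constants referenced above):

```python
# Minimal sketch of E5 task prefixing; constants mirror EmbeddingEngine's.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "

def apply_prefix(texts: list[str], task: str = "passage") -> list[str]:
    """Prepend the E5 task prefix to each input text."""
    prefix = QUERY_PREFIX if task == "query" else PASSAGE_PREFIX
    return [f"{prefix}{t}" for t in texts]

prefixed = apply_prefix(["vector databases"], task="query")
# → ["query: vector databases"]
```

Skipping the prefix degrades retrieval quality noticeably, because the E5 models were trained with these markers to distinguish the two sides of the asymmetric retrieval task.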

The HuggingFace fast tokenizer (Rust implementation) is configured with two fixed settings:

  • enable_truncation(max_length=512) — truncate sequences longer than 512 tokens
  • enable_padding(length=512) — pad shorter sequences with [PAD] to exactly 512 tokens
encoded = self._tokenizer.encode_batch(texts)
input_ids = np.array([e.ids for e in encoded], dtype=np.int64)
attention_mask = np.array([e.attention_mask for e in encoded], dtype=np.int64)

All sequences leave the tokenizer as uniform (batch, 512) int64 arrays — input_ids for the model and attention_mask to distinguish real tokens from padding.
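A hand-rolled sketch of what the truncation/padding settings produce (illustrative only; the real work is done by the HuggingFace tokenizer, and the `PAD_ID` of 0 is an assumption):

```python
import numpy as np

MAX_LEN = 512
PAD_ID = 0  # assumed [PAD] token id, for illustration

def pad_batch(sequences: list[list[int]]) -> tuple[np.ndarray, np.ndarray]:
    """Pad/truncate token-id sequences to MAX_LEN, returning
    uniform (batch, 512) int64 input_ids and attention_mask arrays."""
    input_ids = np.full((len(sequences), MAX_LEN), PAD_ID, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), MAX_LEN), dtype=np.int64)
    for i, seq in enumerate(sequences):
        seq = seq[:MAX_LEN]  # enable_truncation(max_length=512)
        input_ids[i, : len(seq)] = seq
        attention_mask[i, : len(seq)] = 1  # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 7592, 102], [101, 102]])
# ids.shape == mask.shape == (2, 512)
```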

Input names are discovered dynamically from the ONNX graph (self._session.get_inputs()), making the code robust against minor model export variations:

feed: dict[str, np.ndarray] = {}
for name in self._input_names:
    if name == "input_ids":
        feed[name] = input_ids
    elif name == "attention_mask":
        feed[name] = attention_mask
    else:
        # token_type_ids and any other unexpected inputs are zero-filled
        feed[name] = np.zeros_like(input_ids)
outputs = self._session.run(None, feed)
token_embeddings = outputs[0]  # (batch, seq_len, 384)

The session runs on CPUExecutionProvider with graph_optimization_level = ORT_ENABLE_ALL.

Average the embeddings of all real tokens (not [PAD]). The attention mask zeroes out padding contributions:

mask_expanded = attention_mask[:, :, np.newaxis].astype(np.float32)
sum_embeddings = np.sum(token_embeddings * mask_expanded, axis=1)
sum_mask = np.clip(mask_expanded.sum(axis=1), a_min=1e-9, a_max=None)
mean_embeddings = sum_embeddings / sum_mask

The mask expands to 3D (batch, 512, 1) for element-wise multiplication against token_embeddings. Tokens where attention_mask == 0 (padding) contribute nothing to the average.
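Running the same pooling arithmetic on a toy batch makes the masking concrete (batch of 1, sequence length 3, dimension 2, with the third token acting as padding):

```python
import numpy as np

# Toy batch: the third token is padding and must not affect the mean.
token_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]], dtype=np.float32)
attention_mask = np.array([[1, 1, 0]], dtype=np.int64)

mask_expanded = attention_mask[:, :, np.newaxis].astype(np.float32)  # (1, 3, 1)
sum_embeddings = np.sum(token_embeddings * mask_expanded, axis=1)    # (1, 2)
sum_mask = np.clip(mask_expanded.sum(axis=1), a_min=1e-9, a_max=None)
mean_embeddings = sum_embeddings / sum_mask
# mean_embeddings == [[2.0, 3.0]] — the padded [9, 9] token is ignored
```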

Convert each vector to a unit vector (norm = 1). This allows dot product to serve as a direct proxy for cosine similarity — the distance_metric=cosine that sqlite-vec expects:

norms = np.linalg.norm(mean_embeddings, axis=1, keepdims=True)
norms = np.clip(norms, a_min=1e-9, a_max=None)
normalized = mean_embeddings / norms
return normalized.astype(np.float32) # (batch, 384)
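A quick check that L2 normalization makes the dot product a direct stand-in for cosine similarity:

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, guarding against zero norms."""
    norms = np.clip(np.linalg.norm(v, axis=1, keepdims=True), 1e-9, None)
    return v / norms

a = l2_normalize(np.array([[3.0, 4.0]]))
b = l2_normalize(np.array([[6.0, 8.0]]))
# parallel vectors: dot product ≈ 1.0, so cosine distance 1 - A·B ≈ 0.0
distance = 1.0 - float(a @ b.T)
```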

Before an entity reaches the encoding pipeline, it must be converted from structured data into a single text string. The system uses a Head+Tail+Diversity strategy to fit the most informative content within the model’s practical token budget.

"{name} ({entity_type}) | {obs1} | {obs2} | ... | Rel: type → target; ..."

Example:

MCP Memory v2 (Project) | 8 tasks: T1 → T2 → T3 | Pipeline: Architect → Builder → Auditor | Rel: uses → FastMCP; uses → SQLite

The budget is 480 tokens (MAX_TOKENS = 480), with " | " as the separator between observations:

| Segment | Content | Rationale |
| --- | --- | --- |
| Head | First observations | Most important/stable content — typically the entity’s core description |
| Tail | Last observations | Most recent content — latest updates, status changes |
| Diversity | Selected intermediate observations | Maximizes semantic variety — prevents the embedding from overfitting to a narrow topic |

Relations are appended at the end when they exist, formatted as Rel: type → target; ....
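A hypothetical sketch of the snapshot strategy (the function name, head/tail sizes, and sampling rule are all illustrative; the real implementation budgets by tokens, not observation counts):

```python
# Illustrative Head+Tail+Diversity snapshot builder (not the actual code).
def build_snapshot(name, entity_type, observations, relations, head=2, tail=2):
    """Keep the first and last observations, fill with sampled intermediates,
    then append relations as 'Rel: type → target; ...'."""
    if len(observations) <= head + tail:
        kept = list(observations)
    else:
        middle = observations[head:-tail]
        # "Diversity": sample intermediates evenly instead of taking a run
        step = max(1, len(middle) // 2)
        kept = observations[:head] + middle[::step][:2] + observations[-tail:]
    text = f"{name} ({entity_type}) | " + " | ".join(kept)
    if relations:
        text += " | Rel: " + "; ".join(f"{t} → {tgt}" for t, tgt in relations)
    return text

snap = build_snapshot(
    "MCP Memory v2", "Project",
    ["obs1", "obs2", "obs3", "obs4", "obs5", "obs6"],
    [("uses", "SQLite")],
)
```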

Vectors are stored in the entity_embeddings sqlite-vec virtual table:

CREATE VIRTUAL TABLE IF NOT EXISTS entity_embeddings
USING vec0(embedding float[384] distance_metric=cosine);
| Property | Value |
| --- | --- |
| Identifier | rowid (implicit, maps to entities.id) |
| Distance metric | cosine — angular distance, range [0, 2] |
| Vector size | 384 × 4 bytes = 1,536 bytes per embedding |
| Upsert strategy | INSERT OR REPLACE — no duplicate versions |
import struct

import numpy as np

def serialize_f32(vector: np.ndarray) -> bytes:
    """Pack a float32 vector into raw bytes for sqlite-vec.

    A 384-dim vector → 1,536 bytes.
    """
    return struct.pack(f"{len(vector)}f", *vector.flatten())

def deserialize_f32(data: bytes, dim: int = 384) -> np.ndarray:
    """Unpack raw bytes from sqlite-vec back into a float32 vector."""
    return np.frombuffer(data, dtype=np.float32).reshape(dim)

Each 384-dim float32 vector serializes to exactly 1,536 bytes of raw data. Storing with INSERT OR REPLACE means updating an entity’s embedding overwrites the old value — no stale versions accumulate.
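As a quick sanity check, the helpers can be exercised in a round trip (redefined here so the snippet runs standalone):

```python
import struct

import numpy as np

def serialize_f32(vector: np.ndarray) -> bytes:
    """Pack a float32 vector into raw little-endian bytes."""
    return struct.pack(f"{len(vector)}f", *vector.flatten())

def deserialize_f32(data: bytes, dim: int = 384) -> np.ndarray:
    """Unpack raw bytes back into a float32 vector."""
    return np.frombuffer(data, dtype=np.float32).reshape(dim)

vec = np.random.default_rng(0).standard_normal(384).astype(np.float32)
blob = serialize_f32(vec)
assert len(blob) == 1536                            # 384 floats × 4 bytes
assert np.array_equal(deserialize_f32(blob), vec)   # lossless round trip
```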

SELECT rowid, distance
FROM entity_embeddings
WHERE embedding MATCH ?
ORDER BY distance
LIMIT ?

The ? placeholder receives the serialized query vector (1,536 bytes). Results come back ordered by ascending distance — most similar first. The default limit is 10, configurable via the limit parameter.
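The query above can be illustrated with a brute-force NumPy equivalent of what sqlite-vec computes internally (an illustration of the semantics, not the extension's actual code; it assumes all vectors, including the query, are already L2-normalized):

```python
import numpy as np

def knn_cosine(query: np.ndarray, vectors: np.ndarray, k: int = 10):
    """Brute-force KNN by cosine distance, ascending — the same ordering
    the sqlite-vec MATCH ... ORDER BY distance query produces."""
    distances = 1.0 - vectors @ query          # d = 1 - A·B for unit vectors
    order = np.argsort(distances)[:k]
    return [(int(i), float(distances[i])) for i in order]

rng = np.random.default_rng(42)
vecs = rng.standard_normal((100, 384)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # L2-normalize rows
results = knn_cosine(vecs[7], vecs, k=3)
# vecs[7] is its own nearest neighbor with distance ≈ 0.0
```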

The distance metric is cosine distance, defined as:

d(A, B) = 1 - cos(A, B) = 1 - (A · B) / (||A|| × ||B||)

Since vectors are L2-normalized (||A|| = ||B|| = 1), this simplifies to d = 1 - A · B.

| Distance | Meaning |
| --- | --- |
| 0.0 | Identical vectors |
| < 0.3 | Very similar |
| ~ 1.0 | Unrelated |
| 2.0 | Opposite vectors |
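A few spot checks of the formula in plain NumPy, reproducing the boundary values in the table:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """d = 1 - A·B / (||A|| × ||B||)."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
assert cosine_distance(a, a) == 0.0                       # identical
assert cosine_distance(a, np.array([0.0, 1.0])) == 1.0    # unrelated (orthogonal)
assert cosine_distance(a, -a) == 2.0                      # opposite
```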

After KNN retrieval, results pass through the Limbic System for dynamic re-ranking based on salience, temporal decay, and co-occurrence patterns.

Run the download script to fetch and export the ONNX model:

uv run python scripts/download_model.py

This downloads four files to ~/.cache/mcp-memory-v2/models/:

~/.cache/mcp-memory-v2/models/
├── model.onnx # Exported ONNX model (~465 MB)
├── tokenizer.json # HuggingFace fast tokenizer
├── tokenizer_config.json # Tokenizer configuration
└── special_tokens_map.json # Special tokens mapping

The EmbeddingEngine uses a singleton pattern with two-level lazy loading:

class EmbeddingEngine:
    _instance: "EmbeddingEngine | None" = None

    @classmethod
    def get_instance(cls) -> "EmbeddingEngine":
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance
  1. Import level: mcp_memory.embeddings is imported inside _get_engine(), not at module scope
  2. Instance level: get_instance() creates the singleton on first call
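The pattern can be demonstrated with a minimal stand-in class (hypothetical; the real initializer loads the ONNX session and tokenizer, which is the expensive step the singleton avoids repeating):

```python
class LazySingleton:
    """Minimal stand-in for EmbeddingEngine's singleton pattern."""
    _instance: "LazySingleton | None" = None
    loads = 0  # counts expensive initializations

    def __init__(self) -> None:
        LazySingleton.loads += 1  # stands in for ONNX + tokenizer loading

    @classmethod
    def get_instance(cls) -> "LazySingleton":
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

first = LazySingleton.get_instance()   # triggers the one-time "load"
second = LazySingleton.get_instance()  # reuses the cached instance
assert first is second and LazySingleton.loads == 1
```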

Practical consequence:

| Moment | Behavior |
| --- | --- |
| Server startup | Always fast — the model is not loaded |
| First search_semantic call | ~3–5 seconds extra while loading ONNX + tokenizer into memory |
| Subsequent calls | Milliseconds — engine is already in memory |

The server is designed to degrade gracefully:

| Scenario | Behavior |
| --- | --- |
| Model downloaded | Semantic search and embedding generation work normally |
| Model not downloaded | search_semantic returns a clear error; all other tools work |
| sqlite-vec unavailable | Server continues without semantic search; CRUD tools function normally |

The model files haven’t been downloaded yet. Run:

uv run python scripts/download_model.py

Then restart the MCP server (or make another search_semantic call — the engine retries loading on demand).

This is expected. The ONNX model (~465 MB) and tokenizer are loaded into memory on the first call. Subsequent calls return in milliseconds. There is no way to pre-warm the engine besides making an initial search.

If the sqlite-vec extension fails to load, the server starts without vector search capability. All 10 tools (create_entities, search_nodes, open_nodes, etc.) continue working. Only search_semantic is affected.

Embeddings seem wrong after updating observations


Embeddings are regenerated completely when observations change — not updated incrementally. If you’ve added many observations and the results feel off, the entity may have hit the 480-token budget and older observations were dropped from the snapshot. Consider whether critical identifying information is in the entity’s first few observations (the “Head” segment).