How It Works
Semantic search converts each entity in the knowledge graph into a 384-dimensional numeric vector and stores it in a sqlite-vec virtual table. When you call search_semantic, the engine encodes your query into the same vector space, finds the k nearest neighbors (KNN), and returns the most relevant entities.
The full pipeline — from raw text to ranked results — looks like this:
graph TD
Input["Entity text or Query"] --> Prefix["No prefix needed<br/>(MiniLM unified encoding)"]
Prefix --> Tok["Tokenization<br/>HuggingFace fast tokenizer<br/>trunc=512, dynamic pad (×8)"]
Tok --> ONNX["ONNX Forward Pass<br/>CPUExecutionProvider<br/>→ (batch, seq_len, 384)"]
ONNX --> Pool["Mean Pooling<br/>Mask [PAD] via attention_mask<br/>→ (batch, 384)"]
Pool --> Norm["L2 Normalization<br/>||v|| = 1<br/>→ float32[384]"]
Norm --> Store{Entity or Query?}
Store -->|Entity| Vec["Serialize to bytes<br/>384 x 4 = 1,536 bytes<br/>INSERT OR REPLACE into vec0"]
Store -->|Query| KNN["sqlite-vec KNN<br/>WHERE embedding MATCH ?<br/>ORDER BY distance"]
KNN --> Results["[{entity_id, distance}]"]
Results --> Limbic["Limbic Re-rank<br/>salience · temporal · cooc"]
Limbic --> Output["Top-K results with scoring"]
For the complete fusion pipeline that combines KNN with full-text search, see Hybrid Search (FTS5 + RRF).
Embedding Model
The semantic engine uses sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 from HuggingFace (MiniLM family, optimized for sentence embeddings):
| Property | Value |
|---|---|
| Model ID | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 |
| Dimensions | 384 (float32, ONNX FP32) |
| Languages | 94+ — Spanish, English, French, German, Chinese, Japanese, etc. |
| Runtime | CPU only (CPUExecutionProvider, no GPU required) |
| Size on disk | ~465 MB (model ONNX + tokenizer) |
| Distance metric | Cosine distance: d = 1 - cos(A, B), range [0, 2] |
| Type | Unified encoding — no task-specific prefixes required |
| Cache path | ~/.cache/mcp-memory-v2/models/ |
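A minimal usage sketch (engine class and import path as described under Lazy Load Behavior below; the input text is illustrative). Each output row is a unit-length float32 vector:
import numpy as np
from mcp_memory.embeddings import EmbeddingEngine
engine = EmbeddingEngine.get_instance()   # singleton; loads the ONNX model on first use
vectors = engine.encode(["MCP Memory v2 stores a knowledge graph of entities and relations"])
print(vectors.shape)               # (1, 384)
print(np.linalg.norm(vectors[0]))  # ≈ 1.0, L2-normalized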
Encoding Pipeline
The EmbeddingEngine.encode() method transforms raw text into L2-normalized vectors in five steps:
def encode(self, texts: list[str], task: str = "passage") -> np.ndarray:
"""Encode texts to embeddings.
MiniLM uses unified encoding — task parameter is optional/legacy.
"""
prefix = self.QUERY_PREFIX if task == "query" else self.PASSAGE_PREFIX
prefixed = [f"{prefix}{t}" for t in texts]
    # Steps 2-5: tokenization → ONNX → mean pooling → L2 normalize
Step 1 — Prepend Prefix
MiniLM uses unified encoding — the same model processes both queries and passages without task-specific prefixes. The optional task parameter remains in the API for backward compatibility but has no semantic effect on the output.
Step 2 — Tokenization
The HuggingFace fast tokenizer (Rust implementation) runs with two fixed settings:
- enable_truncation(max_length=512) — truncate sequences longer than 512 tokens
- enable_padding(pad_to_multiple_of=8) — dynamic padding to the nearest multiple of 8 tokens per batch
encoded = self._tokenizer.encode_batch(texts)
input_ids = np.array(
[e.ids for e in encoded],
dtype=np.int64,
)
attention_mask = np.array(
[e.attention_mask for e in encoded],
dtype=np.int64,
)
All sequences leave the tokenizer as (batch, max_len) int64 arrays where max_len is the longest sequence in the batch rounded up to a multiple of 8 — input_ids for the model and attention_mask to distinguish real tokens from padding.
:::tip
Dynamic padding reduces inference time 8×–50× compared to fixed 512-token padding, especially for short queries and entities.
:::
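The two settings map onto the tokenizers library roughly like this (a sketch; the tokenizer.json path comes from the cache layout shown under Configuration, and the engine's actual initialization may differ in detail):
from pathlib import Path
from tokenizers import Tokenizer
cache = Path.home() / ".cache/mcp-memory-v2/models"
tokenizer = Tokenizer.from_file(str(cache / "tokenizer.json"))
tokenizer.enable_truncation(max_length=512)     # cut sequences longer than 512 tokens
tokenizer.enable_padding(pad_to_multiple_of=8)  # pad each batch up to the nearest multiple of 8
encoded = tokenizer.encode_batch(["short query", "a somewhat longer entity description"])
print([len(e.ids) for e in encoded])            # equal lengths, a multiple of 8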
Step 3 — ONNX Forward Pass
Input names are discovered dynamically from the ONNX graph (self._session.get_inputs()), making the code robust against minor model export variations:
feed: dict[str, np.ndarray] = {}
for name in self._input_names:
if name == "input_ids":
feed[name] = input_ids
elif name == "attention_mask":
feed[name] = attention_mask
elif name == "token_type_ids":
feed[name] = np.zeros_like(input_ids)
else:
feed[name] = np.zeros_like(input_ids)
outputs = self._session.run(None, feed)
token_embeddings = outputs[0] # (batch, seq_len, 384)
The session runs on CPUExecutionProvider with graph_optimization_level = ORT_ENABLE_ALL.
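A sketch of the corresponding onnxruntime setup (paths and option names as documented here; the engine's real initialization may differ):
import onnxruntime as ort
from pathlib import Path
model_path = Path.home() / ".cache/mcp-memory-v2/models/model.onnx"
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession(
    str(model_path),
    sess_options=options,
    providers=["CPUExecutionProvider"],   # CPU only, no GPU required
)
input_names = [i.name for i in session.get_inputs()]   # discovered dynamically, as described above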
Step 4 — Mean Pooling
Average the embeddings of all real tokens (not [PAD]). The attention mask zeroes out padding contributions:
mask_expanded = attention_mask[:, :, np.newaxis].astype(np.float32)
sum_embeddings = np.sum(token_embeddings * mask_expanded, axis=1)
sum_mask = np.clip(mask_expanded.sum(axis=1), a_min=1e-9, a_max=None)
mean_embeddings = sum_embeddings / sum_mask
The mask expands to 3D (batch, seq_len, 1) for element-wise multiplication against token_embeddings. Tokens where attention_mask == 0 (padding) contribute nothing to the average.
Step 5 — L2 Normalization
Convert each vector to a unit vector (norm = 1). This lets the dot product act as a direct proxy for cosine similarity, which is what sqlite-vec's distance_metric=cosine relies on:
norms = np.linalg.norm(mean_embeddings, axis=1, keepdims=True)
norms = np.clip(norms, a_min=1e-9, a_max=None)
normalized = mean_embeddings / norms
return normalized.astype(np.float32) # (batch, 384)
:::tip
After L2 normalization, cosine similarity between two vectors equals their dot product: cos(A, B) = A · B when ||A|| = ||B|| = 1. This is why sqlite-vec can use an efficient dot-product kernel internally while reporting results as cosine distance.
:::
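A quick NumPy check of that identity (illustrative only):
import numpy as np
rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # L2-normalize, as the engine does
cosine = float(np.dot(a, b)) / float(np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(np.dot(a, b), cosine))   # True: dot product equals cosine similarity
print(1.0 - float(np.dot(a, b)))          # the cosine distance sqlite-vec would report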
Entity Text Preparation
Before an entity reaches the encoding pipeline, it must be converted from structured data into a single text string. The system uses a Head+Tail+Diversity strategy to fit the most informative content within the model’s practical token budget.
Format
"{name} ({entity_type}) | {obs1} | {obs2} | ... | Rel: type → target; ..."
Example:
MCP Memory v2 (Project) | 8 tasks: T1 → T2 → T3 | Pipeline: Architect → Builder → Auditor | Rel: uses → FastMCP; uses → SQLite
Head+Tail+Diversity Strategy
The budget is 480 tokens (MAX_TOKENS = 480), with " | " as the separator between observations:
| Segment | Content | Rationale |
|---|---|---|
| Head | First observations | Most important/stable content — typically the entity’s core description |
| Tail | Last observations | Most recent content — latest updates, status changes |
| Diversity | Selected intermediate observations | Maximizes semantic variety — prevents the embedding from overfitting to a narrow topic |
Relations are appended at the end when they exist, formatted as Rel: type → target; ....
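An illustrative sketch of the selection (simplified: the exact head/tail sizes, the diversity heuristic, and token-level trimming are assumptions, and build_entity_text is a hypothetical name):
def build_entity_text(name, entity_type, observations, relations, max_tokens=480):
    """Hypothetical Head+Tail+Diversity selection within a 480-token budget."""
    head = observations[:2]                                     # Head: first, most stable observations
    tail = observations[-2:] if len(observations) > 4 else []   # Tail: most recent updates
    middle = observations[2:-2] if len(observations) > 4 else []
    diversity = middle[:: max(1, len(middle) // 3)] if middle else []  # spread picks across the middle
    parts = [f"{name} ({entity_type})", *head, *diversity, *tail]
    text = " | ".join(parts)
    if relations:
        text += " | Rel: " + "; ".join(f"{r['type']} → {r['target']}" for r in relations)
    return text   # the real implementation trims to max_tokens using the tokenizer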
:::caution
The text is generated from a selected snapshot of observations — not necessarily all of them. Each time an entity’s observations change, the embedding is completely regenerated from scratch. There is no incremental update; the entire entity text is rebuilt and re-encoded.
:::
KNN Search with sqlite-vec
Vectors are stored in the entity_embeddings sqlite-vec virtual table:
CREATE VIRTUAL TABLE IF NOT EXISTS entity_embeddings
USING vec0(embedding float[384] distance_metric=cosine);
Storage Details
| Property | Value |
|---|---|
| Identifier | rowid (implicit, maps to entities.id) |
| Distance metric | cosine — angular distance, range [0, 2] |
| Vector size | 384 × 4 bytes = 1,536 bytes per embedding |
| Upsert strategy | INSERT OR REPLACE — no duplicate versions |
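Setting this up from Python with the sqlite-vec package could look like this (a sketch; the database path is illustrative, and sqlite_vec.load() registers the extension on the connection):
import sqlite3
import sqlite_vec
db = sqlite3.connect("memory.db")
db.enable_load_extension(True)
sqlite_vec.load(db)                  # registers the vec0 virtual table module
db.enable_load_extension(False)
db.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS entity_embeddings "
    "USING vec0(embedding float[384] distance_metric=cosine)"
)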
Serialization
import struct
import numpy as np
def serialize_f32(vector: np.ndarray) -> bytes:
    """Pack a float32 vector into raw bytes for sqlite-vec.
    A 384-dim vector → 1,536 bytes."""
    return struct.pack(f"{len(vector)}f", *vector.flatten())
def deserialize_f32(data: bytes, dim: int = 384) -> np.ndarray:
    """Unpack raw bytes from sqlite-vec back into a float32 vector."""
    return np.frombuffer(data, dtype=np.float32).reshape(dim)
Each 384-dim float32 vector serializes to exactly 1,536 bytes of raw data. Storing with INSERT OR REPLACE means updating an entity’s embedding overwrites the old value — no stale versions accumulate.
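An upsert then looks like this (sketch; entity_id and vec are placeholders, and db is the connection from the previous sketch):
# vec: float32[384] from EmbeddingEngine.encode(); entity_id: the entities.id primary key
db.execute(
    "INSERT OR REPLACE INTO entity_embeddings (rowid, embedding) VALUES (?, ?)",
    (entity_id, serialize_f32(vec)),
)
db.commit()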
KNN Query
SELECT rowid, distance
FROM entity_embeddings
WHERE embedding MATCH ?
ORDER BY distance
LIMIT ?
The ? placeholder receives the serialized query vector (1,536 bytes). Results come back ordered by ascending distance — most similar first. The default limit is 10, configurable via the limit parameter.
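Running the same query from Python (sketch, reusing db and serialize_f32 from the sketches above):
query_vec = EmbeddingEngine.get_instance().encode(["how does the pipeline work?"])[0]
rows = db.execute(
    "SELECT rowid, distance FROM entity_embeddings "
    "WHERE embedding MATCH ? ORDER BY distance LIMIT ?",
    (serialize_f32(query_vec), 10),
).fetchall()
# rows: [(entity_id, distance), ...], most similar first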
Cosine Distance
The distance metric is cosine distance, defined as:
d(A, B) = 1 - cos(A, B) = 1 - (A · B) / (||A|| × ||B||)
Since vectors are L2-normalized (||A|| = ||B|| = 1), this simplifies to d = 1 - A · B.
| Distance | Meaning |
|---|---|
| 0.0 | Identical vectors |
| < 0.3 | Very similar |
| ~ 1.0 | Unrelated |
| 2.0 | Opposite vectors |
After KNN retrieval, results pass through the Limbic System for dynamic re-ranking based on salience, temporal decay, and co-occurrence patterns.
Configuration
Download the Model
Run the download script to fetch and export the ONNX model:
uv run python scripts/download_model.py
This downloads four files to ~/.cache/mcp-memory-v2/models/:
~/.cache/mcp-memory-v2/models/
├── model.onnx # Exported ONNX model (~465 MB)
├── tokenizer.json # HuggingFace fast tokenizer
├── tokenizer_config.json # Tokenizer configuration
└── special_tokens_map.json # Special tokens mapping
:::tip
The download is a one-time operation. After the files are in place, the engine loads them from disk on every startup. No network access is needed at runtime.
:::
Lazy Load Behavior
The EmbeddingEngine uses a singleton pattern with two-level lazy loading:
class EmbeddingEngine:
_instance: "EmbeddingEngine | None" = None
@classmethod
def get_instance(cls) -> "EmbeddingEngine":
if cls._instance is None:
cls._instance = cls()
return cls._instance
- Import level: mcp_memory.embeddings is imported inside _get_engine(), not at module scope (see the sketch below)
- Instance level: get_instance() creates the singleton on first call
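A sketch of what the deferred-import helper can look like (only the name _get_engine() and the import-inside-the-function behavior are documented; the body below is an assumption):
def _get_engine() -> "EmbeddingEngine | None":
    # Nothing heavy happens at server startup; the import is deferred to the first call
    try:
        from mcp_memory.embeddings import EmbeddingEngine
    except Exception:
        return None   # model or runtime missing: degrade gracefully
    return EmbeddingEngine.get_instance()   # singleton created on first use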
Practical consequence:
| Moment | Behavior |
|---|---|
| Server startup | Always fast — the model is not loaded |
| First search_semantic call | ~3–5 seconds extra while loading ONNX + tokenizer into memory |
| Subsequent calls | Milliseconds — engine is already in memory |
Running Without the Model
The server is designed to degrade gracefully:
| Scenario | Behavior |
|---|---|
| Model downloaded | Semantic search and embedding generation work normally |
| Model not downloaded | search_semantic returns a clear error; all other tools work |
| sqlite-vec unavailable | Server continues without semantic search; CRUD tools function normally |
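The sqlite-vec fallback can be sketched as a guarded import at startup (illustrative; the server's actual error handling may differ):
try:
    import sqlite_vec
    sqlite_vec.load(db)                # db: the already-open SQLite connection
    semantic_search_available = True
except Exception:
    semantic_search_available = False  # CRUD tools keep working; search_semantic reports an error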
Troubleshooting
“Embedding model not available”
The model files haven’t been downloaded yet. Run:
uv run python scripts/download_model.py
Then restart the MCP server (or make another search_semantic call — the engine retries loading on demand).
First search is slow (3–5 seconds)
This is expected. The ONNX model (~465 MB) and tokenizer are loaded into memory on the first call. Subsequent calls return in milliseconds. There is no way to pre-warm the engine besides making an initial search.
sqlite-vec not available
If the sqlite-vec extension fails to load, the server starts without vector search capability. All 10 tools (create_entities, search_nodes, open_nodes, etc.) continue working. Only search_semantic is affected.
Embeddings seem wrong after updating observations
Embeddings are regenerated completely when observations change — not updated incrementally. If you’ve added many observations and the results feel off, the entity may have hit the 480-token budget and older observations were dropped from the snapshot. Consider whether critical identifying information is in the entity’s first few observations (the “Head” segment).
Related
- Hybrid Search (FTS5 + RRF) — combining KNN with BM25 full-text search via Reciprocal Rank Fusion
- Limbic System — dynamic re-ranking with salience, temporal decay, and co-occurrence
- Architecture — full system architecture and component overview
- Tools Reference — search_semantic tool parameters and return format