Overview
The knowledge graph is a living resource. Over time, observations accumulate, entities grow, and patterns shift. Without periodic maintenance, the graph degrades — duplicate observations inflate search results, oversized entities lose topical focus, and stale entries compete with fresh data for ranking relevance.
sofia acts as the gatekeeper for all maintenance operations. No tool in the maintenance toolkit modifies the graph without explicit approval. The philosophy is straightforward: tools flag issues, humans decide what to do.
The maintenance tools are read-only by design: they report, and any modification requires a separate, explicitly approved tool call. This section covers four operational concerns (semantic deduplication, entity splitting, the consolidation report, and recency decay) and wraps up with best practices for keeping the graph healthy.
Semantic Deduplication
When new observations are added to an entity, the system checks them against existing observations for that entity. If a new observation is semantically similar to one that already exists, it is not discarded — instead, it is flagged for later review.
How flagging works
The observations table includes a similarity_flag column (default 0). When add_observations() runs, it calculates cosine similarity between the incoming observation and every existing observation for that entity:
| Condition | Threshold | Action |
|---|---|---|
| Cosine similarity >= 0.85 | 0.85 | Set similarity_flag=1 on the new observation |
| Containment score >= 0.70 (asymmetric length) | 0.70 | Set similarity_flag=1 on the new observation |
| No match | — | similarity_flag stays 0 |
The combined similarity check handles asymmetric text length. When one text is at least 2x longer than the other (length ratio >= 2.0), cosine similarity alone is unreliable — a short observation will always look similar to a long one that contains it. In that case, the system computes a containment score (the fraction of the shorter text’s tokens present in the longer one). If containment >= 0.7, the observation is flagged even if cosine similarity is below 0.85.
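The combined check can be sketched in a few lines of pure Python. The thresholds come from the text above; the helper names, whitespace tokenization, and precomputed-embedding inputs are illustrative assumptions, not the real implementation:

```python
import math

COSINE_THRESHOLD = 0.85       # from the flagging table above
CONTAINMENT_THRESHOLD = 0.70
LENGTH_RATIO_CUTOFF = 2.0     # when one text is >= 2x longer than the other

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def containment(short_text, long_text):
    """Fraction of the shorter text's tokens present in the longer one."""
    short_tokens = set(short_text.lower().split())
    long_tokens = set(long_text.lower().split())
    if not short_tokens:
        return 0.0
    return len(short_tokens & long_tokens) / len(short_tokens)

def should_flag(new_text, new_vec, old_text, old_vec):
    """Decide whether the new observation gets similarity_flag=1."""
    if cosine(new_vec, old_vec) >= COSINE_THRESHOLD:
        return True
    shorter, longer = sorted([new_text, old_text], key=len)
    if len(shorter) and len(longer) / len(shorter) >= LENGTH_RATIO_CUTOFF:
        # Cosine is unreliable at this length ratio; fall back to containment.
        return containment(shorter, longer) >= CONTAINMENT_THRESHOLD
    return False
```

Note how the containment fallback only engages past the 2x length ratio; for comparably sized texts, cosine similarity alone decides.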
:::note
Flagged observations are not deleted and are not hidden from search. They remain in the graph and participate in embeddings and ranking. Flagging is a signal for the reviewer, not an automatic filter.
:::
Reviewing flagged observations
Use find_duplicate_observations to surface flagged pairs within an entity:
find_duplicate_observations(entity_name: str, threshold: float = 0.85, containment_threshold: float = 0.7)
The tool returns pairs of observations with their similarity and containment scores, so you can decide which to keep and which to remove.
Deduplication workflow
1. Detect — add_observations() automatically flags similar observations
2. Review — find_duplicate_observations() surfaces flagged pairs
3. Consolidate — delete_observations() removes the redundant ones
Step 3 is manual. After deciding which observation to keep, use delete_observations() from the Tools Reference to remove the duplicate. The remaining observation’s embedding is regenerated automatically.
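As one illustration of step 3, a reviewer might keep the longer observation of each flagged pair (it usually carries more information) and queue the shorter one for delete_observations(). Both the pair format and the keep-the-longer rule are assumptions for this sketch; the actual decision always rests with the reviewer:

```python
def plan_deletions(pairs):
    """Given (obs_a, obs_b, cosine, containment) pairs from a duplicate
    review, keep the longer observation and mark the shorter for deletion.
    A hypothetical helper, not part of the maintenance toolkit."""
    to_delete = set()
    for a, b, _cos, _cont in pairs:
        # Heuristic: the shorter text is usually the redundant one.
        to_delete.add(a if len(a) < len(b) else b)
    return sorted(to_delete)
```

The returned list would then be passed, after human confirmation, to delete_observations().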
Entity Splitting
Entities can grow beyond their useful scope. A session entity that started with task notes may accumulate architecture decisions, deployment records, and debugging logs. When that happens, the entity loses topical coherence — search results become noisy, and the embedding represents a blurred average of unrelated topics.
Entity splitting decomposes a large entity into focused sub-entities, each with a clear topic.
Thresholds
Different entity types have different thresholds, reflecting how quickly they tend to accumulate observations:
| Entity Type | Threshold | Rationale |
|---|---|---|
| Sesion | 15 | Sessions are single-day events; they accumulate observations fast |
| Proyecto | 25 | Projects span longer periods and naturally have more observations |
| All others | 20 | Default for custom entity types |
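The threshold lookup reduces to a small mapping with a default. The constant names below are illustrative; the values are the ones from the table:

```python
SPLIT_THRESHOLDS = {"Sesion": 15, "Proyecto": 25}
DEFAULT_SPLIT_THRESHOLD = 20  # all other entity types

def split_threshold(entity_type):
    """Observation-count threshold above which an entity is a split candidate."""
    return SPLIT_THRESHOLDS.get(entity_type, DEFAULT_SPLIT_THRESHOLD)
```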
Splitting workflow
The full pipeline has five stages:
find_split_candidates → analyze_entity_split → propose_entity_split_tool → sofia review → execute_entity_split_tool
| Stage | Tool | Output | Modifies graph? |
|---|---|---|---|
| 1. Scan | find_split_candidates() | List of all entities exceeding thresholds | No |
| 2. Analyze | analyze_entity_split(entity_name) | Observation count, threshold, topics, split score | No |
| 3. Propose | propose_entity_split_tool(entity_name) | Suggested sub-entities, observation assignments, relations | No |
| 4. Review | sofia (human) | Approved or modified split plan | No |
| 5. Execute | execute_entity_split_tool(entity_name, approved_splits) | New entities created, observations moved, relations established | Yes |
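The approval gate in stage 4 is the pipeline's key property: nothing writes until a human signs off. A minimal sketch of that control flow, with all the tools stubbed as hypothetical callables (the `split_score > 1.0` cutoff comes from the consolidation report criteria later in this section):

```python
def run_split_pipeline(entity_name, analyze, propose, ask_human, execute):
    """Sketch of stages 2-5 with the human approval gate.
    The four callables are stand-ins for the real tools."""
    analysis = analyze(entity_name)            # stage 2: read-only
    if analysis["split_score"] <= 1.0:
        return None                            # not enough topic diversity
    proposal = propose(entity_name)            # stage 3: read-only
    approved = ask_human(proposal)             # stage 4: sofia reviews/modifies
    if approved is None:
        return None                            # rejected: graph untouched
    return execute(entity_name, approved)      # stage 5: the only write
```

Only the final call modifies the graph, matching the "Modifies graph?" column above.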
Topic extraction
Stage 3 uses Agglomerative Clustering on embeddings with c-TF-IDF fallback for naming. The algorithm:
- Tokenizes all observations for the entity
- Clusters observations via Agglomerative Clustering on embeddings
- Falls back to c-TF-IDF for topic naming if clustering is ambiguous
- Assigns a topic label based on the dominant terms from the cluster
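The c-TF-IDF naming step can be illustrated with a simplified class-based TF-IDF: each cluster is treated as one document, and terms frequent within the cluster but rare across the entity score highest. This sketch uses whitespace tokenization and a simplified weighting, assumed for illustration rather than taken from the production algorithm:

```python
import math
from collections import Counter

def ctfidf_labels(clusters, top_n=2):
    """clusters: {cluster_id: [observation strings]}.
    Returns a short label per cluster from its top c-TF-IDF terms."""
    # Treat each cluster as a single "class document".
    docs = {cid: Counter(" ".join(obs).lower().split())
            for cid, obs in clusters.items()}
    total_freq = Counter()
    for tf in docs.values():
        total_freq.update(tf)
    avg_words = sum(total_freq.values()) / len(docs)
    labels = {}
    for cid, tf in docs.items():
        n_words = sum(tf.values())
        # Term frequency in the cluster, damped by corpus-wide frequency.
        scores = {t: (c / n_words) * math.log(1 + avg_words / total_freq[t])
                  for t, c in tf.items()}
        top = sorted(scores, key=scores.get, reverse=True)[:top_n]
        labels[cid] = " ".join(top)
    return labels
```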
Split mechanics
When a split is executed:
- New sub-entities are created with names like "Parent Entity - Topic Label"
- Specified observations are moved from the parent to each child entity
- Two relations are created per sub-entity: contiene — parent points to child; parte_de — child points back to parent
- The parent entity retains all observations not assigned to any sub-entity
- The entire operation runs inside a BEGIN IMMEDIATE/COMMIT/ROLLBACK atomic transaction with auto_commit=False in CRUD methods — if any step fails, nothing is committed
- Embeddings are regenerated for all new entities after the transaction completes
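The transaction pattern looks roughly like the following sqlite3 sketch. Table and column names here are illustrative, not the real schema; the point is that BEGIN IMMEDIATE takes the write lock up front, and any failure rolls the whole split back:

```python
import sqlite3

def execute_split_atomically(db_path, parent, splits):
    """splits: {child_entity_name: [observation ids to move]}.
    All-or-nothing: either every step commits, or none do."""
    conn = sqlite3.connect(db_path, isolation_level=None)  # manual txn control
    try:
        conn.execute("BEGIN IMMEDIATE")  # acquire the write lock immediately
        for child, obs_ids in splits.items():
            conn.execute("INSERT INTO entities(name) VALUES (?)", (child,))
            for oid in obs_ids:
                conn.execute("UPDATE observations SET entity = ? WHERE id = ?",
                             (child, oid))
            conn.execute("INSERT INTO relations(src, type, dst) "
                         "VALUES (?, 'contiene', ?)", (parent, child))
            conn.execute("INSERT INTO relations(src, type, dst) "
                         "VALUES (?, 'parte_de', ?)", (child, parent))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # any failure leaves the graph untouched
        raise
    finally:
        conn.close()
```

Setting `isolation_level=None` disables the driver's implicit transactions so the explicit BEGIN IMMEDIATE / COMMIT / ROLLBACK statements are in full control.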
:::tip
You don’t need to split all observations out of the parent. Leaving general or cross-cutting observations on the parent entity keeps it useful as a summary node. Only move observations that clearly belong to a single topic.
:::
Consolidation Report
The consolidation_report tool generates a read-only health check for the entire knowledge graph. Run it periodically to catch issues before they compound.
consolidation_report(stale_days: float = 90.0)
What the report covers
The report has four sections, each identifying a different class of maintenance issue:
| Section | What it finds | Criteria |
|---|---|---|
| Split candidates | Entities that should be split | Exceed type-specific observation threshold AND have sufficient topic diversity (split_score > 1.0) |
| Flagged observations | Potential duplicates | Observations with similarity_flag=1 across the entire graph |
| Stale entities | Entities that haven’t been accessed recently | No access in N days (default 90) AND low total access count |
| Large entities | Entities approaching or exceeding thresholds | Observation count relative to type-specific threshold, regardless of topic diversity |
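The stale-entity criteria can be expressed as a small predicate. `stale_days` matches the report's parameter; the low-access cutoff (`low_count`) is an illustrative assumption, since the exact value is not specified here:

```python
from datetime import datetime, timedelta

def is_stale(last_access, access_count,
             stale_days=90.0, low_count=5, now=None):
    """No access in stale_days AND low total access count.
    low_count is an assumed cutoff for illustration."""
    now = now or datetime.now()
    return ((now - last_access) > timedelta(days=stale_days)
            and access_count < low_count)
```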
Using the report
The report returns summary counts and detailed entity lists for each category. A typical workflow:
- Run consolidation_report() — review the summary counts
- For split candidates: run the full entity splitting workflow on the highest-priority candidates
- For flagged observations: run find_duplicate_observations() on the affected entities and consolidate duplicates
- For stale entities: evaluate whether they are still relevant — archive or delete if not
- For large entities: monitor — they may become split candidates once they cross the diversity threshold
:::caution
The report makes no modifications to the graph. It is purely diagnostic. All actions based on the report require explicit tool calls (splits, deletions, or observation removals) approved by sofia.
:::
Recency Decay
Every time an entity is accessed — via search_semantic or open_nodes — the event is recorded in the entity_access_log table. This log feeds into the recency decay signal used by the Limbic System to rank search results.
How it works
The compute_importance() function in scoring.py combines three signals into a single score:
importance = access_norm × (1 + BETA_DEG × degree_norm) × (1 + ALPHA_CONS × consolidation)
Where:
| Component | Computation | Description |
|---|---|---|
| access_norm | log₂(1 + access_count) / log₂(1 + max_access) | Normalized access frequency — how often this entity is accessed relative to the most-accessed entity |
| degree_norm | min(degree, D_MAX) / D_MAX | Normalized graph degree — how connected the entity is, capped at D_MAX |
| consolidation | log₂(1 + access_days) / log₂(1 + max_access_days) | Multi-day access pattern — rewards entities accessed on many different days |
| ALPHA_CONS | 0.2 | Weight of the consolidation factor |
| BETA_DEG | 0.15 | Weight of the graph degree factor |
The formula has three multiplicative factors:
- Access frequency (access_norm) — entities accessed more often score higher, with logarithmic scaling so the gap between 1 and 10 accesses matters more than between 100 and 110.
- Graph connectivity (1 + BETA_DEG × degree_norm) — well-connected entities get a moderate boost. The +1 ensures the factor never drops below 1.
- Consolidation (1 + ALPHA_CONS × consolidation) — entities accessed on multiple different days get a boost, distinguishing habitual references from one-off spikes.
Even old entities retain a minimum score — the Limbic System’s temporal floor (TEMPORAL_FLOOR = 0.1) ensures nothing is fully forgotten.
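Putting the pieces together, the scoring can be sketched as below. ALPHA_CONS and BETA_DEG come from the table above; the D_MAX value and the placement of the temporal floor inside this function are assumptions for illustration (the floor is described as belonging to the Limbic System):

```python
import math

ALPHA_CONS = 0.2
BETA_DEG = 0.15
D_MAX = 10            # illustrative cap; the real value lives in scoring.py
TEMPORAL_FLOOR = 0.1  # nothing is fully forgotten

def compute_importance(access_count, max_access, degree,
                       access_days, max_access_days):
    """Sketch of the importance formula from scoring.py."""
    access_norm = (math.log2(1 + access_count) / math.log2(1 + max_access)
                   if max_access else 0.0)
    degree_norm = min(degree, D_MAX) / D_MAX
    consolidation = (math.log2(1 + access_days) / math.log2(1 + max_access_days)
                     if max_access_days else 0.0)
    score = (access_norm
             * (1 + BETA_DEG * degree_norm)
             * (1 + ALPHA_CONS * consolidation))
    # Floor placement is an assumption: shown here for a self-contained example.
    return max(score, TEMPORAL_FLOOR)
```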
Operational use
The recency decay signal serves two maintenance purposes:
- Search ranking — feeds directly into search_semantic via the Limbic System, ensuring fresh and frequently-used entities surface first
- Archival candidates — entities with very low importance scores (high decay, low access count) are candidates for review in the consolidation report
Best Practices
Compact before persisting
Only store decisions, findings, and state changes in the knowledge graph. Each observation should carry information that would be difficult to reconstruct later.
Store:
- Architectural decisions and their rationale
- Configuration choices and why they were made
- Bug root causes and resolutions
- State transitions (e.g., “migrated from X to Y”)
Don’t store:
- Test logs or test output
- File listings or directory trees
- Implementation details already in source code
- Debug output or verbose error traces
- Intermediate calculations that are transient
:::tip
A good heuristic: if the information is already committed to a git repo or a config file, don’t duplicate it in the knowledge graph. Only persist the decision, not the details.
:::
Review flagged observations regularly
Run find_duplicate_observations() on your most-active entities every few sessions. Duplicate observations inflate entity size, dilute embeddings, and can confuse search ranking. Catching them early keeps consolidation effort low.
Run consolidation reports monthly
A monthly consolidation_report() pass catches stale entities before they accumulate, surfaces entities approaching split thresholds, and identifies duplicate clusters you may have missed during regular use.
Split before it hurts
Don’t wait for an entity to become unmanageable. When an entity approaches 80% of its type threshold and has clearly distinct topic clusters, run the splitting workflow. Proactive splitting keeps the graph clean, embeddings focused, and search results relevant.
Related
- Tools Reference — full API specification for all MCP tools
- Limbic System — how recency decay and access patterns affect search ranking
- Architecture — system overview including the scoring module and entity splitter