Maintenance & Operations
Overview
The knowledge graph is a living resource. Over time, observations accumulate, entities grow, and patterns shift. Without periodic maintenance, the graph degrades — duplicate observations inflate search results, oversized entities lose topical focus, and stale entries compete with fresh data for ranking relevance.
sofia acts as the gatekeeper for all maintenance operations. No tool in the maintenance toolkit modifies the graph without explicit approval. The philosophy is straightforward: tools flag issues, humans decide what to do.
The maintenance tools are read-only by design: they report findings, and only an explicitly approved action modifies anything. This section covers four operational concerns and wraps up with best practices for keeping the graph healthy.
Semantic Deduplication
When new observations are added to an entity, the system checks them against existing observations for that entity. If a new observation is semantically similar to one that already exists, it is not discarded — instead, it is flagged for later review.
How flagging works
The observations table includes a similarity_flag column (default 0). When add_observations() runs, it calculates cosine similarity between the incoming observation and every existing observation for that entity:
| Condition | Threshold | Action |
|---|---|---|
| Cosine similarity >= 0.85 | 0.85 | Set similarity_flag=1 on the new observation |
| Containment score >= 0.70 (asymmetric length) | 0.70 | Set similarity_flag=1 on the new observation |
| No match | — | similarity_flag stays 0 |
The combined similarity check handles asymmetric text length. When one text is at least 2x longer than the other (length ratio >= 2.0), cosine similarity alone is unreliable — a short observation will always look similar to a long one that contains it. In that case, the system computes a containment score (the fraction of the shorter text’s tokens present in the longer one). If containment >= 0.7, the observation is flagged even if cosine similarity is below 0.85.
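The flagging logic above can be sketched in pure Python. This is an illustrative reimplementation, not the actual add_observations() code: the real system compares embedding vectors, whereas token counts stand in for them here. The thresholds (0.85, 0.7, 2.0) match the table above.

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity between two bag-of-words vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def containment(short_tokens, long_tokens):
    """Fraction of the shorter text's distinct tokens present in the longer one."""
    short_set, long_set = set(short_tokens), set(long_tokens)
    return len(short_set & long_set) / len(short_set) if short_set else 0.0

def should_flag(new_text, existing_text,
                cos_thresh=0.85, cont_thresh=0.7, length_ratio=2.0):
    """Decide whether a new observation should get similarity_flag=1."""
    a, b = new_text.lower().split(), existing_text.lower().split()
    if cosine_similarity(a, b) >= cos_thresh:
        return True
    # Asymmetric lengths: cosine is unreliable, so fall back to containment.
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if len(long_) >= length_ratio * len(short) and containment(short, long_) >= cont_thresh:
        return True
    return False
```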
Reviewing flagged observations
Use find_duplicate_observations to surface flagged pairs within an entity:
find_duplicate_observations(entity_name: str, threshold: float = 0.85, containment_threshold: float = 0.7)

The tool returns pairs of observations with their similarity and containment scores, so you can decide which to keep and which to remove.
Deduplication workflow
1. Detect — add_observations() automatically flags similar observations
2. Review — find_duplicate_observations() surfaces flagged pairs
3. Consolidate — delete_observations() removes the redundant ones

Step 3 is manual. After deciding which observation to keep, use delete_observations() from the Tools Reference to remove the duplicate. The remaining observation’s embedding is regenerated automatically.
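The keep-or-delete choice in step 3 is always the reviewer's call. One possible heuristic (an assumption, not a documented rule) is to keep the longer, presumably more informative observation of each flagged pair:

```python
def choose_deletions(flagged_pairs):
    """Pick which observation of each flagged (a, b) pair to delete,
    keeping the longer text; ties keep the first."""
    return [b if len(a) >= len(b) else a for a, b in flagged_pairs]
```

The returned texts would then be passed to delete_observations() by hand.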
Entity Splitting
Entities can grow beyond their useful scope. A session entity that started with task notes may accumulate architecture decisions, deployment records, and debugging logs. When that happens, the entity loses topical coherence — search results become noisy, and the embedding represents a blurred average of unrelated topics.
Entity splitting decomposes a large entity into focused sub-entities, each with a clear topic.
Thresholds
Different entity types have different thresholds, reflecting how quickly they tend to accumulate observations:
| Entity Type | Threshold | Rationale |
|---|---|---|
| Sesion | 15 | Sessions are single-day events; they accumulate observations fast |
| Proyecto | 25 | Projects span longer periods and naturally have more observations |
| All others | 20 | Default for custom entity types |
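The lookup implied by the table can be sketched as a small helper; treating "exceeds" as a strict inequality is an assumption on my part:

```python
# Per-type observation thresholds from the table above.
SPLIT_THRESHOLDS = {"Sesion": 15, "Proyecto": 25}
DEFAULT_THRESHOLD = 20  # default for custom entity types

def split_threshold(entity_type):
    return SPLIT_THRESHOLDS.get(entity_type, DEFAULT_THRESHOLD)

def exceeds_threshold(entity_type, observation_count):
    # Strict inequality is assumed; the source only says "exceeding".
    return observation_count > split_threshold(entity_type)
```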
Splitting workflow
The full pipeline has five stages:
find_split_candidates → analyze_entity_split → propose_entity_split_tool → sofia review → execute_entity_split_tool

| Stage | Tool | Output | Modifies graph? |
|---|---|---|---|
| 1. Scan | find_split_candidates() | List of all entities exceeding thresholds | No |
| 2. Analyze | analyze_entity_split(entity_name) | Observation count, threshold, topics, split score | No |
| 3. Propose | propose_entity_split_tool(entity_name) | Suggested sub-entities, observation assignments, relations | No |
| 4. Review | sofia (human) | Approved or modified split plan | No |
| 5. Execute | execute_entity_split_tool(entity_name, approved_splits) | New entities created, observations moved, relations established | Yes |
Topic extraction
Stage 3 uses TF-IDF to group observations into coherent clusters. The algorithm:
- Tokenizes all observations for the entity
- Computes TF-IDF weights (with Spanish stop words, minimum word length of 4 characters)
- Groups observations by their highest-weighted terms
- Assigns a topic label based on the dominant terms in each group
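The steps above can be sketched with a toy TF-IDF. This is an assumption-laden illustration, not the actual splitter: the real stop-word list is larger than the subset shown, and the weighting details may differ.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative subset; the real Spanish stop-word list is larger.
SPANISH_STOP_WORDS = {"para", "como", "este", "esta", "pero", "donde", "cuando"}
MIN_WORD_LEN = 4  # minimum word length from the description above

def tokenize(text):
    return [w for w in re.findall(r"[a-záéíóúñü]+", text.lower())
            if len(w) >= MIN_WORD_LEN and w not in SPANISH_STOP_WORDS]

def group_by_topic(observations):
    """Group observations by their highest-weighted TF-IDF term;
    that dominant term doubles as the topic label."""
    docs = [Counter(tokenize(o)) for o in observations]
    n = len(docs)
    df = Counter(t for d in docs for t in d)  # document frequency
    groups = defaultdict(list)
    for obs, d in zip(observations, docs):
        if not d:
            continue
        # Smoothed IDF; the term with the highest tf-idf labels the group.
        top = max(d, key=lambda t: d[t] * math.log((1 + n) / (1 + df[t])))
        groups[top].append(obs)
    return dict(groups)
```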
Split mechanics
When a split is executed:
- New sub-entities are created with names like "Parent Entity - Topic Label"
- Specified observations are moved from the parent to each child entity
- Two relations are created per sub-entity:
  - contiene — parent points to child
  - parte_de — child points back to parent
- The parent entity retains all observations not assigned to any sub-entity
- The entire operation runs inside a single SQLite transaction — if any step fails, nothing is committed
- Embeddings are regenerated for all new entities after the transaction completes
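The transactional core can be sketched with sqlite3's connection context manager, which commits on success and rolls back everything on any exception. Table and column names here (entities, observations, relations) are illustrative assumptions, and embedding regeneration is omitted:

```python
import sqlite3

def execute_split(db, parent, splits):
    """Move observations into new sub-entities inside one transaction.
    `splits` maps a topic label to the observation ids assigned to it."""
    with db:  # commits on success, rolls back on any exception
        for topic, obs_ids in splits.items():
            child = f"{parent} - {topic}"
            db.execute("INSERT INTO entities(name) VALUES (?)", (child,))
            for oid in obs_ids:
                db.execute("UPDATE observations SET entity = ? WHERE id = ?",
                           (child, oid))
            # Bidirectional parent/child relations, as described above.
            db.execute("INSERT INTO relations(from_e, rel, to_e) VALUES (?, 'contiene', ?)",
                       (parent, child))
            db.execute("INSERT INTO relations(from_e, rel, to_e) VALUES (?, 'parte_de', ?)",
                       (child, parent))
```

Observations not listed in any split simply keep their original entity value, which is how the parent retains them.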
Consolidation Report
The consolidation_report tool generates a read-only health check for the entire knowledge graph. Run it periodically to catch issues before they compound.
consolidation_report(stale_days: float = 90.0)

What the report covers
The report has four sections, each identifying a different class of maintenance issue:
| Section | What it finds | Criteria |
|---|---|---|
| Split candidates | Entities that should be split | Exceed type-specific observation threshold AND have sufficient topic diversity (split_score > 1.0) |
| Flagged observations | Potential duplicates | Observations with similarity_flag=1 across the entire graph |
| Stale entities | Entities that haven’t been accessed recently | No access in N days (default 90) AND low total access count |
| Large entities | Entities approaching or exceeding thresholds | Observation count relative to type-specific threshold, regardless of topic diversity |
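Three of the four sections can be sketched as per-entity predicates (flagged observations are a direct similarity_flag query instead). The low-access cutoff and the "approaching threshold" margin below are assumed values; the document does not state them:

```python
def report_sections(obs_count, threshold, split_score,
                    days_since_access, total_accesses,
                    stale_days=90.0, low_access=5, approach_margin=0.8):
    """Classify one entity into the report sections it belongs to.
    `low_access` and `approach_margin` are assumptions for illustration."""
    sections = []
    if obs_count > threshold and split_score > 1.0:
        sections.append("split_candidate")
    if days_since_access > stale_days and total_accesses < low_access:
        sections.append("stale")
    if obs_count >= approach_margin * threshold:
        sections.append("large")
    return sections
```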
Using the report
The report returns summary counts and detailed entity lists for each category. A typical workflow:
- Run consolidation_report() — review the summary counts
- For split candidates: run the full entity splitting workflow on the highest-priority candidates
- For flagged observations: run find_duplicate_observations() on the affected entities and consolidate duplicates
- For stale entities: evaluate whether they are still relevant — archive or delete if not
- For large entities: monitor — they may become split candidates once they cross the diversity threshold
Recency Decay
Every time an entity is accessed — via search_semantic or open_nodes — the event is recorded in the entity_access_log table. This log feeds into the recency decay signal used by the Limbic System to rank search results.
How it works
The compute_importance() function in scoring.py uses the ALPHA_CONS constant (0.2) as the consolidation signal for multi-day decay:
importance = ALPHA_CONS × log(1 + total_accesses) × (ALPHA_DECAY ^ days_since_access)

| Factor | Effect |
|---|---|
log(1 + total_accesses) | Logarithmic scaling — the difference between 1 and 10 accesses matters more than between 100 and 110 |
ALPHA_DECAY ^ days_since_access | Exponential decay — entities untouched for longer periods fade progressively |
ALPHA_CONS | Controls the overall weight of the consolidation signal relative to other scoring factors |
Entities accessed frequently rank higher in search results. Entities that haven’t been touched in weeks or months gradually fade, but are never fully forgotten — the Limbic System’s temporal floor (TEMPORAL_FLOOR = 0.1) ensures even old entities retain a minimum score.
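A sketch of the scoring, assuming a per-day ALPHA_DECAY of 0.95 (the real value is not given here) and applying TEMPORAL_FLOOR to the decay term (exactly where the floor is applied is also an assumption):

```python
import math

ALPHA_CONS = 0.2      # consolidation weight (given in this document)
ALPHA_DECAY = 0.95    # assumed per-day decay base
TEMPORAL_FLOOR = 0.1  # minimum enforced by the Limbic System

def compute_importance(total_accesses, days_since_access):
    # Decay fades with age but never drops below the temporal floor.
    decay = max(ALPHA_DECAY ** days_since_access, TEMPORAL_FLOOR)
    return ALPHA_CONS * math.log(1 + total_accesses) * decay
```

Under these assumptions, an entity's score plateaus at one tenth of its fresh value once the floor is reached, however long it goes untouched.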
Operational use
The recency decay signal serves two maintenance purposes:
- Search ranking — feeds directly into search_semantic via the Limbic System, ensuring fresh and frequently-used entities surface first
- Archival candidates — entities with very low importance scores (high decay, low access count) are candidates for review in the consolidation report
Best Practices
Section titled “Best Practices”Compact before persisting
Only store decisions, findings, and state changes in the knowledge graph. Each observation should carry information that would be difficult to reconstruct later.
Store:
- Architectural decisions and their rationale
- Configuration choices and why they were made
- Bug root causes and resolutions
- State transitions (e.g., “migrated from X to Y”)
Don’t store:
- Test logs or test output
- File listings or directory trees
- Implementation details already in source code
- Debug output or verbose error traces
- Intermediate calculations that are transient
Review flagged observations regularly
Run find_duplicate_observations() on your most-active entities every few sessions. Duplicate observations inflate entity size, dilute embeddings, and can confuse search ranking. Catching them early keeps consolidation effort low.
Run consolidation reports monthly
A monthly consolidation_report() pass catches stale entities before they accumulate, surfaces entities approaching split thresholds, and identifies duplicate clusters you may have missed during regular use.
Split before it hurts
Don’t wait for an entity to become unmanageable. When an entity approaches 80% of its type threshold and has clearly distinct topic clusters, run the splitting workflow. Proactive splitting keeps the graph clean, embeddings focused, and search results relevant.
Related
- Tools Reference — full API specification for all MCP tools
- Limbic System — how recency decay and access patterns affect search ranking
- Architecture — system overview including the scoring module and entity splitter