Concepts
Cross-doc entity merge.
The pitch “one unified graph across sources” only delivers if entities actually merge across sources. This page explains how mantle decides that two mentions in two documents refer to the same real-world thing — the four signals, the safety floors, and how operators review the rest.
The problem
The same person, four names, four records.
Pull the same person across Drive, Gmail, Slack, and a CRM and you get a different surface form every time — Watson, Dr. John Watson, jwatson@…, JWatson. Naive extraction creates four entities. Your agent now thinks one person is four, splits the relationship history between them, and confidently tells you Dr. Watson has never spoken to anyone at Baker Street — because the four-record version of him hasn't.
Cross-doc merge fixes this. After extraction, mantle scores every plausible entity pair on four signals, applies a set of safety floors so we don't over-merge, auto-merges the high-confidence matches, and surfaces the rest to a human operator.
Figure: before the merge, 4 separate entities; after the merge, 1 entity (ent_watson_john, John H. Watson) with 4 aliases. 4 sources merged, same_as edges created.
How it works
Four signals. One score.
Every candidate pair is scored on four independent signals. No single signal can force a merge on its own — the scorer combines them and the safety floors below decide whether the result clears the bar.
S1
Name similarity
Normalized edit distance between surface forms, with alias expansion (honorifics, initials, common nicknames). Tells us 'J. Smith' and 'John Smith' look alike.
S2
Semantic embedding
Cosine similarity between context-window embeddings of the two mentions, stored in a pgvector HNSW index. Catches cases where surface forms diverge but the surrounding language is clearly about the same thing.
S3
Shared-neighbor overlap
Overlap coefficient over the two entities' graph neighborhoods. Two 'Watsons' who both link to 'Sherlock Holmes' and '221B Baker Street' are almost certainly one Watson.
S4
Type compatibility
Entity-type and property-shape agreement. A 'Watson' that's typed as person can't merge with one typed as company, regardless of how high the other signals score.
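As a rough illustration of the four signals, here is a minimal Python sketch. It is not mantle's implementation: the function names, the tiny honorific table, and the reduction of S4 to exact type equality are all assumptions made for the example, and alias expansion here covers honorifics only (no initials or nicknames).

```python
import math
import re

# Hypothetical, deliberately tiny honorific table (an assumption for this sketch).
HONORIFICS = {"dr", "mr", "mrs", "ms", "prof"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop honorific tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in HONORIFICS)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """S1 sketch: 1 minus the normalized edit distance of the normalized names."""
    na, nb = normalize(a), normalize(b)
    longest = max(len(na), len(nb))
    if longest == 0:
        return 0.0
    return 1.0 - edit_distance(na, nb) / longest


def cosine_similarity(u, v) -> float:
    """S2 sketch: cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def neighbor_overlap(a: set, b: set) -> float:
    """S3 sketch: overlap coefficient, |A & B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def types_compatible(type_a: str, type_b: str) -> bool:
    """S4 sketch: simplified to exact type equality."""
    return type_a == type_b


# Usage: the honorific is dropped, so these two forms match exactly.
print(name_similarity("Dr. John Watson", "John Watson"))  # 1.0
```

In this sketch, two Watsons whose neighbor sets both contain 'Sherlock Holmes' and '221B Baker Street' score 1.0 on overlap even if one of them has extra neighbors, which is why the overlap coefficient suits entities with uneven amounts of context.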
Safety
The merge is conservative by design.
Over-merging is worse than under-merging — once two entities collapse, the only fix is a manual split. The scorer carries hard floors that prevent merges no matter how high the composite score reaches.
- MIN_NAME_SIM_FOR_MERGE = 0.40: name similarity floor. Two entities whose surface forms aren't at least 40% similar can never auto-merge, regardless of how high the embedding or neighbor signals score.
- MIN_SHARED_NEIGHBORS_FOR_MERGE = 3: structural-signal floor. The default branch requires at least three shared graph neighbors to auto-merge.
- strong_embedding branch: when S2 (semantic embedding) alone is high enough to consider a merge without structural support, we still require shared_neighbor_count ≥ 2. Phase 1.5e introduced this gate to cut false positives like “Echo Brickell” ↔ “830 Brickell” being merged as a single building class.
- Type compatibility: S4 acts as a hard filter, not a weight. If types disagree, the candidate is never even considered.
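The way these floors gate a merge can be sketched in Python as follows. The two documented constants and the strong_embedding neighbor requirement come from the list above; the `Candidate` shape, the `STRONG_EMBEDDING_THRESHOLD` and `AUTO_MERGE_THRESHOLD` values, and the function name are assumptions for illustration, since this page does not state them.

```python
from dataclasses import dataclass

# Documented floors.
MIN_NAME_SIM_FOR_MERGE = 0.40
MIN_SHARED_NEIGHBORS_FOR_MERGE = 3      # default branch
MIN_SHARED_NEIGHBORS_STRONG_EMB = 2     # strong_embedding branch

# Assumed values, not documented on this page.
STRONG_EMBEDDING_THRESHOLD = 0.90
AUTO_MERGE_THRESHOLD = 0.85


@dataclass
class Candidate:
    """Hypothetical shape for one scored entity pair."""
    name_sim: float
    embedding_sim: float
    shared_neighbor_count: int
    types_compatible: bool
    composite_score: float


def can_auto_merge(c: Candidate) -> bool:
    """Apply the hard floors first, then the composite threshold."""
    # S4 is a hard filter: incompatible types are never considered.
    if not c.types_compatible:
        return False
    # Name floor: no auto-merge below 40% surface similarity.
    if c.name_sim < MIN_NAME_SIM_FOR_MERGE:
        return False
    # A strong embedding lowers the neighbor requirement but never removes it.
    if c.embedding_sim >= STRONG_EMBEDDING_THRESHOLD:
        required_neighbors = MIN_SHARED_NEIGHBORS_STRONG_EMB
    else:
        required_neighbors = MIN_SHARED_NEIGHBORS_FOR_MERGE
    if c.shared_neighbor_count < required_neighbors:
        return False
    return c.composite_score >= AUTO_MERGE_THRESHOLD


# Illustrative numbers only: similar names and context, but one shared neighbor.
lookalike = Candidate(0.60, 0.95, 1, True, 0.93)
print(can_auto_merge(lookalike))  # False
```

Checking the floors before the composite score keeps the conservative behavior explicit: a pair that fails any floor is rejected for auto-merge no matter how high its combined score is.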
Output
What the merge produces.
High-confidence matches are auto-merged: a same_as edge is created, every matching source record (four in the Watson example) is linked under one canonical entity_id, and downstream queries follow that edge transparently. search_entities, get_entity_context, and traverse_graph all return the unified view without the caller having to know a merge happened.
Borderline matches — high-enough score to suspect a match but below the auto-merge threshold — are recorded as same_as candidates with status pending_review. They are not yet acted on; instead, they wait for an operator to confirm or reject them.
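One way to picture how a query can follow same_as edges to a single canonical id is the small Python sketch below. It is a conceptual model only: the dictionary representation, the `canonical_id` helper, and the source-record ids are hypothetical, and only confirmed edges are stored, so pending_review candidates do not affect resolution.

```python
def canonical_id(entity_id: str, same_as: dict) -> str:
    """Follow same_as pointers until reaching an id with no outgoing pointer.

    The `seen` set guards against accidental cycles in the edge map.
    """
    seen = set()
    while entity_id in same_as and entity_id not in seen:
        seen.add(entity_id)
        entity_id = same_as[entity_id]
    return entity_id


# Hypothetical source-record ids pointing at one canonical entity.
same_as = {
    "ent_drive_watson": "ent_watson_john",
    "ent_gmail_jwatson": "ent_watson_john",
    "ent_slack_jwatson": "ent_gmail_jwatson",  # chained edge
}

print(canonical_id("ent_slack_jwatson", same_as))  # ent_watson_john
print(canonical_id("ent_crm_unmerged", same_as))   # ent_crm_unmerged
```

An id with no confirmed edge resolves to itself, which mirrors the behavior described above: borderline matches stay separate until an operator approves them.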
Operator workflow
Reviewing pending candidates.
Two MCP tools drive the operator-review flow. Both are tenant-admin-scoped — surface them in operator UIs, not in the default agent toolkit.
    # 1. List the pending candidates an operator needs to review.
    mantle.list_entity_link_candidates(min_score=0.7, limit=20)

    # 2. Approve a match: creates a same_as edge.
    mantle.confirm_entity_link(candidate_id="lnk_8f2a", decision="approve")

    # 3. Reject: records a negative example for the scorer.
    mantle.confirm_entity_link(candidate_id="lnk_3e91", decision="reject")
Approving a candidate creates the same_as edge and merges the two entities under one canonical id. Rejecting it records a negative example so the scorer learns from it, and repeated near-misses on the same pattern won't keep resurfacing.
Audit endpoint
The consolidation report.
The operator endpoint POST /v1/admin/maintenance/consolidate-entities/{tenant_id} returns a read-only preview of every candidate the scorer found, with per-signal scores and short evidence text. It is safe to run against production because it doesn't write anything.
    # Operator-only: preview cross-doc merge candidates for a tenant.
    # Read-only: no writes, safe to run against production.
    # -X POST matches the documented method; curl sends GET by default.
    curl -s -X POST "https://api.mantleai.dev/v1/admin/maintenance/consolidate-entities/$TENANT_ID" \
      -H "Authorization: Bearer mk_..." | jq '.candidates[:5]'
Run this before promoting embedding-driven merge to a new tenant — you'll see what the scorer wants to merge before any merge happens.
Further reading
- Why entity resolution is the missing primitive: the broader argument for cross-source identity in agent context.
- MCP tool catalog: full signatures for list_entity_link_candidates and confirm_entity_link.
- Changelog: Phase 1.5a/b/c/e entries describe each signal's release.