Concepts

Cross-doc entity merge.

The pitch “one unified graph across sources” only delivers if entities actually merge across sources. This page explains how mantle decides that two mentions in two documents refer to the same real-world thing — the four signals, the safety floors, and how operators review the rest.

The problem

The same person, four names, four records.

Pull the same person across Drive, Gmail, Slack, and a CRM and you get a different surface form every time — Watson, Dr. John Watson, jwatson@…, JWatson. Naive extraction creates four entities. Your agent now thinks one person is four, splits the relationship history between them, and confidently tells you Dr. Watson has never spoken to anyone at Baker Street — because the four-record version of him hasn't.

Cross-doc merge fixes this. After extraction, mantle scores every plausible entity pair on four signals, applies a set of safety floors so we don't over-merge, auto-merges the high-confidence matches, and surfaces the rest to a human operator.

Before — 4 entities

drive: Watson
gmail: Dr. John Watson
slack: JWatson
crm: John H. Watson

After — 1 entity, 4 aliases

ent_watson_john

John H. Watson

Watson · Dr. John Watson · JWatson · John H. Watson

4 sources merged · same_as edges created

How it works

Four signals. One score.

Every candidate pair is scored on four independent signals. No single signal can force a merge on its own — the scorer combines them and the safety floors below decide whether the result clears the bar.

S1

Name similarity

Normalized edit distance between surface forms, with alias expansion (honorifics, initials, common nicknames). Tells us 'J. Smith' and 'John Smith' look alike.
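As a rough illustration of S1, the sketch below normalizes two surface forms, expands initials, and converts Levenshtein distance into a 0 to 1 similarity. The honorific and nickname tables, the function names, and the exact normalization rules are assumptions made for this example, not mantle's actual implementation.

```python
import re

# Hypothetical tables for illustration; the real alias lists are not documented here.
HONORIFICS = {"dr", "mr", "mrs", "ms", "prof"}
NICKNAMES = {"jon": "john", "johnny": "john", "bill": "william"}


def normalize(name: str) -> list[str]:
    """Lowercase, split into alphanumeric tokens, drop honorifics, map nicknames."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return [NICKNAMES.get(t, t) for t in tokens if t not in HONORIFICS]


def expand_initials(a: list[str], b: list[str]) -> tuple[list[str], list[str]]:
    """Expand a one-letter token to the other side's token at the same position."""
    if len(a) != len(b):
        return a, b
    out_a, out_b = [], []
    for x, y in zip(a, b):
        if len(x) == 1 and y.startswith(x):
            x = y
        elif len(y) == 1 and x.startswith(y):
            y = x
        out_a.append(x)
        out_b.append(y)
    return out_a, out_b


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """1.0 means identical after normalization; 0.0 means nothing in common."""
    ta, tb = expand_initials(normalize(a), normalize(b))
    sa, sb = " ".join(ta), " ".join(tb)
    longest = max(len(sa), len(sb))
    if longest == 0:
        return 0.0
    return 1.0 - levenshtein(sa, sb) / longest
```

With these assumed rules, 'J. Smith' and 'John Smith' normalize to the same string, while 'Watson' and 'Dr. John Watson' score lower but still above a 0.40 floor.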

S2

Semantic embedding

Cosine similarity between context-window embeddings of the two mentions, stored in a pgvector HNSW index. Catches cases where surface forms diverge but the surrounding language is clearly about the same thing.
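For reference, cosine similarity itself is simple to compute. The sketch below is a plain-Python version with toy vectors; in a pgvector-backed setup the comparison would normally run inside the database, where the `<=>` operator returns cosine distance (1 minus cosine similarity). The function name and the zero-vector handling are choices made for this example.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as "no evidence".
        return 0.0
    return dot / (norm_u * norm_v)
```

Vectors that point the same way score close to 1.0 regardless of magnitude, and orthogonal vectors score 0.0.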

S3

Shared-neighbor overlap

Overlap coefficient over the two entities' graph neighborhoods. Two 'Watsons' who both link to 'Sherlock Holmes' and '221B Baker Street' are almost certainly one Watson.
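The overlap coefficient is the size of the intersection divided by the size of the smaller set. A minimal sketch, assuming each entity's neighborhood is available as a set of neighbor identifiers (the function names are hypothetical):

```python
def shared_neighbors(a: set[str], b: set[str]) -> set[str]:
    """Neighbors that both entities link to."""
    return a & b


def overlap_coefficient(a: set[str], b: set[str]) -> float:
    """|A ∩ B| / min(|A|, |B|); defined as 0.0 when either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

Because the denominator is the smaller neighborhood, two sparsely connected entities can reach a coefficient of 1.0 on a single shared neighbor, which is one reason an absolute shared-neighbor count is useful alongside the ratio.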

S4

Type compatibility

Entity-type and property-shape agreement. A 'Watson' that's typed as person can't merge with one typed as company, regardless of how high the other signals score.
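One way to picture a type filter is as a gate on candidate generation: pairs with different types never reach the scorer. The sketch below compares only a `type` label on plain dictionaries. The record layout and function names are assumptions for this example, and the property-shape check is left out because its rules are not described here.

```python
from itertools import combinations


def types_compatible(a: dict, b: dict) -> bool:
    """Simplest possible check: the two type labels must match exactly."""
    return a["type"] == b["type"]


def candidate_pairs(entities: list[dict]) -> list[tuple[str, str]]:
    """Yield every unordered pair of entities that share a type."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(entities, 2)
        if types_compatible(a, b)
    ]
```

Given a person named Watson from Drive, a person named Watson from the CRM, and a company named Watson, only the two person records form a candidate pair.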

Safety

The merge is conservative by design.

Over-merging is worse than under-merging — once two entities collapse, the only fix is a manual split. The scorer carries hard floors that prevent merges no matter how high the composite score reaches.

  • MIN_NAME_SIM_FOR_MERGE = 0.40: name similarity floor. Two entities whose surface forms aren't at least 40% similar can never auto-merge, regardless of how high the embedding or neighbor signals score.
  • MIN_SHARED_NEIGHBORS_FOR_MERGE = 3: structural-signal floor. The default branch requires at least three shared graph neighbors to auto-merge.
  • strong_embedding branch: when S2 (semantic embedding) alone is high enough to consider a merge without structural support, we still require shared_neighbor_count ≥ 2. Phase 1.5e introduced this gate to cut false positives like “Echo Brickell” ↔ “830 Brickell” being merged as a single building class.
  • Type compatibility: S4 acts as a hard filter, not a weight. If types disagree, the candidate is never even considered.
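The floors above can be read as a single gate that runs alongside the composite score. The sketch below encodes them in that spirit. The two documented constants come from the list above; `STRONG_EMBEDDING_THRESHOLD`, `MIN_SHARED_NEIGHBORS_STRONG_EMBEDDING`, and the function name are hypothetical, since the value at which an embedding counts as strong is not stated on this page.

```python
MIN_NAME_SIM_FOR_MERGE = 0.40
MIN_SHARED_NEIGHBORS_FOR_MERGE = 3

# Hypothetical names and cutoff: the page gives the ">= 2" requirement
# but not the embedding score that triggers the strong_embedding branch.
MIN_SHARED_NEIGHBORS_STRONG_EMBEDDING = 2
STRONG_EMBEDDING_THRESHOLD = 0.90


def passes_safety_floors(
    name_sim: float,
    embedding_sim: float,
    shared_neighbor_count: int,
    types_compatible: bool,
) -> bool:
    """True if no hard floor blocks an auto-merge for this pair."""
    if not types_compatible:
        return False  # S4 is a filter, not a weight
    if name_sim < MIN_NAME_SIM_FOR_MERGE:
        return False  # name floor applies on every branch
    if embedding_sim >= STRONG_EMBEDDING_THRESHOLD:
        return shared_neighbor_count >= MIN_SHARED_NEIGHBORS_STRONG_EMBEDDING
    return shared_neighbor_count >= MIN_SHARED_NEIGHBORS_FOR_MERGE
```

Passing this gate is necessary but not sufficient: a pair would still need a composite score above the auto-merge threshold before anything is merged.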

Output

What the merge produces.

High-confidence matches are auto-merged: a same_as edge is created, the four (or however many) source records are linked under one canonical entity_id, and downstream queries follow that edge transparently. search_entities, get_entity_context, and traverse_graph all return the unified view without the caller having to know a merge happened.
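To make "follow that edge transparently" concrete, here is a toy resolver that maps any source record to its canonical entity using a union-find structure. This is only an illustration of the idea, not a description of mantle's storage. Apart from `ent_watson_john`, the record ids and the class name are made up for the example.

```python
class SameAsIndex:
    """Resolves entity ids to a canonical id by following same_as links."""

    def __init__(self) -> None:
        self._parent: dict[str, str] = {}

    def find(self, entity_id: str) -> str:
        """Return the canonical id for entity_id (itself if never merged)."""
        self._parent.setdefault(entity_id, entity_id)
        root = entity_id
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the walk straight at the root.
        while self._parent[entity_id] != root:
            next_id = self._parent[entity_id]
            self._parent[entity_id] = root
            entity_id = next_id
        return root

    def add_same_as(self, canonical_id: str, alias_id: str) -> None:
        """Record a same_as edge; the canonical side's root stays the root."""
        root_canonical = self.find(canonical_id)
        root_alias = self.find(alias_id)
        if root_canonical != root_alias:
            self._parent[root_alias] = root_canonical


index = SameAsIndex()
for source_record in ("ent_drive_watson", "ent_gmail_watson", "ent_slack_jwatson"):
    index.add_same_as("ent_watson_john", source_record)
```

A lookup on any of the three source records now resolves to `ent_watson_john`, while an entity that was never merged resolves to itself.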

Borderline matches — high-enough score to suspect a match but below the auto-merge threshold — are recorded as same_as candidates with status pending_review. They are not yet acted on; instead, they wait for an operator to confirm or reject them.

Operator workflow

Reviewing pending candidates.

Two MCP tools drive the operator-review flow. Both are tenant-admin-scoped — surface them in operator UIs, not in the default agent toolkit.

# 1. List the pending candidates an operator needs to review.
mantle.list_entity_link_candidates(min_score=0.7, limit=20)

# 2. Approve a match — creates a same_as edge.
mantle.confirm_entity_link(candidate_id="lnk_8f2a", decision="approve")

# 3. Reject — records a negative example for the scorer.
mantle.confirm_entity_link(candidate_id="lnk_3e91", decision="reject")

Approving a candidate creates the same_as edge and merges the two entities under one canonical id. Rejecting it records a negative example so the scorer learns from it, and repeated near-misses on the same pattern won't keep resurfacing.
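An operator UI might wrap the two tools in a small review loop. The sketch below assumes that `list_entity_link_candidates` returns a list of dictionaries with `candidate_id` and `score` keys. That return shape is a guess, as are the `FakeClient` stand-in and the `review_candidates` helper; only the tool names and their parameters come from the calls shown above.

```python
from typing import Callable, Optional


class FakeClient:
    """In-memory stand-in for an MCP client; the candidate shape is assumed."""

    def __init__(self, candidates: list[dict]) -> None:
        self._candidates = candidates
        self.calls: list[tuple[str, str]] = []

    def list_entity_link_candidates(self, min_score: float, limit: int) -> list[dict]:
        hits = [c for c in self._candidates if c["score"] >= min_score]
        return hits[:limit]

    def confirm_entity_link(self, candidate_id: str, decision: str) -> None:
        self.calls.append((candidate_id, decision))


def review_candidates(
    client,
    decide: Callable[[dict], Optional[str]],
    min_score: float = 0.7,
    limit: int = 20,
) -> dict[str, str]:
    """Ask decide() about each pending candidate and submit its verdict.

    decide() returns "approve", "reject", or None to leave the candidate pending.
    """
    outcomes: dict[str, str] = {}
    for candidate in client.list_entity_link_candidates(min_score=min_score, limit=limit):
        decision = decide(candidate)
        if decision in ("approve", "reject"):
            client.confirm_entity_link(
                candidate_id=candidate["candidate_id"], decision=decision
            )
            outcomes[candidate["candidate_id"]] = decision
    return outcomes
```

In a real UI, `decide` would be the operator's click rather than a function; keeping it as a callback makes the loop easy to exercise against a fake client first.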

Audit endpoint

The consolidation report.

The operator endpoint POST /v1/admin/maintenance/consolidate-entities/{tenant_id} returns a read-only preview of every candidate the scorer found, with per-signal scores and short evidence text. It is safe to run against production because it doesn't write anything.

# Operator-only: preview cross-doc merge candidates for a tenant.
# Read-only — no writes, safe to run against production.
curl -s -X POST "https://api.mantleai.dev/v1/admin/maintenance/consolidate-entities/$TENANT_ID" \
  -H "Authorization: Bearer mk_..." | jq '.candidates[:5]'

Run this before promoting embedding-driven merge to a new tenant — you'll see what the scorer wants to merge before any merge happens.

Further reading