Entity resolution¶
Entity resolution is the task of linking multiple mentions of the same real person across documents - "is Adesola Okonkwo in our customer database the same person as A. Okonkwo in this signup form?"
arche-core ships entity resolution in the base install:
Two callables, two use cases:
resolve_entities(entities, ...)deduplicates a flat list of extractedEntityobjects (typical post-Pipeline.process()output).resolve_identity_records(evidence, threshold=..., jurisdiction=...)is the v2 evidence-driven path that producesIdentityRecordobjects for DPI hand-off.
Under the hood, two backends:
- Fuzzy (always available) - rapidfuzz token-sort + union-find clustering + Yoruba/Hausa/Swahili name equivalence. Suitable up to ~100K records on a laptop.
- Splink (via
pip install arche-core[resolve]) - Splink 4.x + DuckDB Fellegi-Sunter probabilistic matching with EM parameter estimation. Used for larger workloads when the extra is installed; falls back to fuzzy on import error.
Worked example - Nigerian fintech dedup¶
Three customer signup records arrive from three different channels (web form, USSD callback, branch tablet). All three are the same person, but the names are spelled differently and only one has a NIN.
from arche import Pipeline
from arche.resolve import resolve_entities
texts = [
"Customer Adesola Okonkwo, NIN 12345678901, phone 0803 555 7890.",
"A. Okonkwo, BVN 22156789012, phone +234 803 555 7890.",
"Adesola Okonkwo, phone +2348035557890, email aokonkwo@example.com.",
]
pipeline = Pipeline(jurisdiction="NG")
all_entities = []
for t in texts:
result = pipeline.process(t)
all_entities.extend(result.entities) # populated in v0.2.0a2; for v0.2.0a1, see result.detections
# Resolve into canonical records
clusters = resolve_entities(all_entities, use_splink=True)
for c in clusters:
print(c.canonical_name, "<->", [m.text for m in c.mentions])
# Adesola Okonkwo <-> ['Adesola Okonkwo', 'A. Okonkwo', 'Adesola Okonkwo']
The resolver succeeded for three reasons:
- Phone normalization -
0803 555 7890,+234 803 555 7890, and+2348035557890all normalize to+2348035557890via libphonenumber. That's the strongest match key. - African-name equivalence -
arche.detect._names.lexiconrecognises spelling variants and initial-based matches. - NIN / BVN cross-linking - record 1 has a NIN, record 2 has a BVN, record 3 has neither. The resolver still groups them because of (1) and (2). If a fourth record came in with the same NIN but a totally different name, it would merge by exact-match NIN - the typical bank-fraud-vs-typo signal.
NIN-based blocking - the production pattern¶
For 1M+ record dedup, blocking is the cost-controlling step. Splink's default blocking on entity_type + name-prefix is too loose for arche workloads. Override:
from arche.resolve import resolve_identity_records
records = resolve_identity_records(
evidence,
threshold=0.85,
jurisdiction="NG",
)
The Splink pipeline runs roughly:
1. Block on NIN exact match and phone prefix
2. Compare:
- name: JaroWinkler thresholds [0.9, 0.7]
- phone: ExactMatch
- NIN: ExactMatch
- BVN: ExactMatch
3. EM-estimate parameters
4. Predict pairwise probabilities
5. Cluster at threshold (default 0.5)
For the v0.2.0a3 release, this pipeline runs internally when arche-core[resolve] is installed.
Picking your backend¶
| Records | Backend | How to engage |
|---|---|---|
| <100 | Fuzzy | Default - no extra needed |
| 100 - 100K | Fuzzy or Splink | Default fuzzy; pip install arche-core[resolve] for Splink |
| 100K - 1M | Splink | pip install arche-core[resolve] - auto-engages |
| 1M+ | Splink + DuckDB-backed staging | Same install |
Resolution outputs ResolvedEntity / IdentityRecord Python objects that you can persist however you like.