arche-core
The identity data engine for Africa.
Arche finds identifying data, helps protect it according to the right jurisdiction, and prepares it for privacy-preserving resolution.
The public arche-core package focuses on three connected jobs:
- Detect identifying data in African text and documents.
- Protect it with jurisdiction-aware masking, tokenization, generalization, dropping, retention, and audit actions.
- Resolve more safely by producing normalized, policy-aware signals such as tokenized IDs, names, phones, and addresses.
Today, Detect and Protect are the lead product surface. Resolution support is intentionally narrow: name matching, tokenized identifiers, and optional Splink-backed workflows for larger linkage tasks.
Why use this library
- Simple. One `Pipeline.process(...)` call runs detection, policy, redaction, and audit output.
- African-first. Launch support covers Nigeria, Kenya, South Africa, and Ghana, with wider African identifier, phone, name, and address support.
- Statute-aware. Detections are grounded in NDPA-2023, POPIA, Kenya DPA, or Ghana DPA policy files.
- Lightweight by default. Heavy ML, Presidio, Splink, and document parsing dependencies are opt-in extras.
- Useful for review workflows. You can scan text, PDFs, DOCX files, invoices, DSAR responses, leaked documents, KYC records, and review extracts without building a separate compliance layer first.
Installation
Install the `arche-core` package. For document parsing, also install the `arche-core[doc]` extra.
Optional extras include:
| Extra | Adds |
|---|---|
| `arche-core[doc]` | docling-backed PDF, DOCX, PPTX, XLSX, and HTML parsing |
| `arche-core[doc-ocr]` | OCR support for scanned documents |
| `arche-core[detect]` | GLiNER2-PII soft-PII detection |
| `arche-core[presidio]` | Microsoft Presidio integration |
| `arche-core[resolve]` | Splink and DuckDB entity resolution support |
What does it do?
Given text or a supported document file, arche returns:
- the detected PII spans
- their taxonomy category
- a sensitivity tier
- the statute citation used by the loaded jurisdiction
- the policy action applied
- redacted text
- audit records suitable for later review
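To make the relationship between detected spans and redacted text concrete, here is a minimal sketch of span-based redaction. It is not arche's implementation: the `Span` class, the `redact` function, and the bracketed placeholder format are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a detected PII span (not an arche type)."""
    start: int      # inclusive character offset
    end: int        # exclusive character offset
    category: str   # taxonomy label, e.g. "PII-2-NIN"


def redact(text: str, spans: list[Span]) -> str:
    """Replace each span with a bracketed category placeholder (assumed format)."""
    # Work from the last span to the first so that earlier offsets stay
    # valid after each replacement changes the length of the string.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.category}]" + text[span.end:]
    return text


sample = "NIN 12345678901 on file."
redacted = redact(sample, [Span(start=4, end=15, category="PII-2-NIN")])
print(redacted)
```

Replacing from right to left is the key design choice: a left-to-right pass would shift every later offset as soon as the first placeholder changed the string length.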
These outputs are useful for redaction today and for safer record linkage later: tokenized IDs, normalized phones, detected names, and parsed address fragments can become privacy-preserving join signals.
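One common way to turn an identifier into a privacy-preserving join signal is keyed hashing (HMAC). The sketch below uses only the Python standard library and illustrates the general idea, not how arche tokenizes: the `tokenize` function, its normalization rule, and the example key are assumptions.

```python
import hashlib
import hmac


def tokenize(identifier: str, key: bytes) -> str:
    """Return a deterministic keyed token for an identifier (illustrative only)."""
    # Assumed normalization: drop separators and upper-case, so that
    # "221 0098 7654" and "22100987654" produce the same token.
    normalized = "".join(ch for ch in identifier if ch.isalnum()).upper()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


key = b"example-key-do-not-use-in-production"
token_a = tokenize("22100987654", key)
token_b = tokenize("221 0098 7654", key)
print(token_a == token_b)
```

Two datasets tokenized with the same secret key can be joined on the token without exchanging the raw identifier, while a party without the key cannot recompute tokens from guessed values.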
Supported launch jurisdictions:
| Jurisdiction | Policy loaded |
|---|---|
| `NG` | NDPA-2023 |
| `ZA` | POPIA |
| `KE` | Kenya DPA |
| `GH` | Ghana DPA |
Example: detect PII in text
from arche import Pipeline
pipeline = Pipeline(jurisdiction="NG")
result = pipeline.process(
    "Fatima Abdullahi, NIN 12345678901, BVN 22100987654."
)
print(result.redacted_text)
Example output:
You can inspect the detections directly:
for detection in result.detections:
    print(
        detection.category,
        detection.sensitivity_tier.value,
        detection.regulatory_citation,
    )
Example output:
| Category | Tier | Citation |
|---|---|---|
| `PII-2-NIN` | high | NDPA-2023 s.30, NIMC Act s.27 |
| `PII-2-BVN` | high | NDPA-2023 s.30, CBN BVN policy 2014 |
| `PII-1-NAME` | moderate | NDPA-2023 s.30 |
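A typical follow-up is to filter detections by tier, for example to route high-sensitivity findings to manual review. The sketch below runs without arche installed by using stand-in objects that carry the three attributes printed in the loop above. `Tier`, `FakeDetection`, and `categories_at_tier` are hypothetical names, and the real detection objects may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Hypothetical stand-in for the tier enum; only .value is relied on."""
    MODERATE = "moderate"
    HIGH = "high"


@dataclass
class FakeDetection:
    """Stand-in carrying the three attributes used in the example loop."""
    category: str
    sensitivity_tier: Tier
    regulatory_citation: str


def categories_at_tier(detections, tier_value: str) -> list[str]:
    """Collect the categories whose tier value equals tier_value."""
    return [d.category for d in detections
            if d.sensitivity_tier.value == tier_value]


# Values copied from the example output table.
detections = [
    FakeDetection("PII-2-NIN", Tier.HIGH, "NDPA-2023 s.30, NIMC Act s.27"),
    FakeDetection("PII-2-BVN", Tier.HIGH, "NDPA-2023 s.30, CBN BVN policy 2014"),
    FakeDetection("PII-1-NAME", Tier.MODERATE, "NDPA-2023 s.30"),
]
high = categories_at_tier(detections, "high")
print(high)
```

Because the helper only reads `category` and `sensitivity_tier.value`, it should work on any objects exposing those attributes, assuming the attribute names shown above.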
Example: scan a document
With arche-core[doc] installed, use the same pipeline on files:
from arche import Pipeline
pipeline = Pipeline(jurisdiction="ZA")
result = pipeline.process_file("dsar_response.pdf")
print(result.summary())
print(result.redacted_text)
process_file(...) delegates parsing to the document substrate, then sends the
extracted text through the same detection and policy pipeline.
What can it detect?
| Area | Current coverage |
|---|---|
| Government IDs | NG NIN, BVN, TIN, RC, PVC, driver's licence; KE National ID, Huduma, KRA PIN, NHIF; ZA ID, tax, passport; GH Ghana Card, SSNIT, TIN; plus wider African ID patterns |
| Names and local NER | African name lexicon and equivalence data, with optional GLiNER soft-PII detection |
| Phones | libphonenumber-backed E.164 normalization across African networks |
| Addresses | Nigeria and South Africa parser MVP |
| Digital identifiers | DIDs, Bitcoin addresses, Ethereum addresses |
| Network identifiers | IPv4 and IPv6 detection with private and special-range flags |
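As an illustration of what private and special-range flags on an IP address can look like, the following sketch uses Python's standard `ipaddress` module. It shows the general technique only; the `classify_ip` function and its output keys are assumptions, not arche's detector or its output schema.

```python
import ipaddress


def classify_ip(text: str) -> dict:
    """Parse an IPv4 or IPv6 literal and report a few range flags."""
    addr = ipaddress.ip_address(text)  # raises ValueError if text is not an IP
    return {
        "version": addr.version,
        "private": addr.is_private,
        "loopback": addr.is_loopback,
        "link_local": addr.is_link_local,
        "multicast": addr.is_multicast,
    }


for candidate in ["192.168.1.10", "8.8.8.8", "::1", "fe80::1"]:
    print(candidate, classify_ip(candidate))
```

Flags like these help a reviewer separate internal infrastructure addresses from public ones, which often carry different disclosure risk.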
Matching names
from arche.match import match
score = match("Mamadou Diallo", "Muhammad Jallow", jurisdiction="NG")
print(score.decision, score.score)
Use this when you need culturally aware name matching before or after PII detection.
Detect, Protect, Resolve
| Step | What Arche does today |
|---|---|
| Detect | Finds PII and identity signals in text and supported document files |
| Protect | Applies jurisdiction-aware policy actions and emits audit-ready output |
| Resolve | Prepares normalized, protected signals for matching and linkage workflows |
Next steps
- Getting started
- Match African names
- Extract from an invoice
- Nigerian fintech KYC cookbook
- Introducing arche v0.2
- Pipeline API reference
Licence
The framework is Apache-2.0. Dataset licensing is documented separately in the dataset cards and repository licensing files.