arche-core

The identity data engine for Africa.

Arche finds identifying data, helps protect it according to the right jurisdiction, and prepares it for privacy-preserving resolution.

The public arche-core package focuses on three connected jobs:

  • Detect identifying data in African text and documents.
  • Protect it with jurisdiction-aware masking, tokenization, generalization, dropping, retention, and audit actions.
  • Resolve more safely by producing normalized, policy-aware signals such as tokenized IDs, names, phones, and addresses.
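To make the Protect actions above concrete, here is a minimal standard-library sketch of three of them: masking, generalization, and dropping. The function names, the `ACTIONS` table, and the `protect` helper are hypothetical illustrations, not part of the arche-core API.

```python
from datetime import date


def mask(value: str, keep_last: int = 2) -> str:
    """Hide all but the last `keep_last` characters of a value."""
    visible = value[-keep_last:] if keep_last > 0 else ""
    return "*" * (len(value) - len(visible)) + visible


def generalize_date(iso_date: str) -> str:
    """Reduce a full ISO date (YYYY-MM-DD) to its year only."""
    return str(date.fromisoformat(iso_date).year)


def drop(_value: str) -> str:
    """Remove the value entirely."""
    return ""


# Hypothetical action table; a real policy file would decide which action
# applies to which category in which jurisdiction.
ACTIONS = {"mask": mask, "generalize_date": generalize_date, "drop": drop}


def protect(action: str, value: str) -> str:
    """Apply the named action to a single value."""
    return ACTIONS[action](value)


print(protect("mask", "12345678901"))
print(protect("generalize_date", "1990-04-17"))
```

Keeping the actions as plain functions in a lookup table is only one possible design; it makes it easy to see that a policy decision reduces to choosing an action name per detection.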

Detect and Protect are the primary product surfaces today. Resolution support is intentionally narrow: name matching, tokenized identifiers, and optional Splink-backed workflows for larger linkage tasks.

Why use this library

  • Simple. One Pipeline.process(...) call runs detection, policy, redaction, and audit output.
  • African-first. Launch support covers Nigeria, Kenya, South Africa, and Ghana, with wider African identifier, phone, name, and address support.
  • Statute-aware. Detections are grounded in NDPA-2023, POPIA, Kenya DPA, or Ghana DPA policy files.
  • Lightweight by default. Heavy ML, Presidio, Splink, and document parsing dependencies are opt-in extras.
  • Useful for review workflows. You can scan text, PDFs, DOCX files, invoices, DSAR responses, leaked documents, KYC records, and review extracts without building a separate compliance layer first.

Installation

pip install arche-core

For document parsing:

pip install "arche-core[doc]"

Optional extras include:

Extra                  Adds
arche-core[doc]        Docling-backed PDF, DOCX, PPTX, XLSX, and HTML parsing
arche-core[doc-ocr]    OCR support for scanned documents
arche-core[detect]     GLiNER2-PII soft-PII detection
arche-core[presidio]   Microsoft Presidio integration
arche-core[resolve]    Splink and DuckDB entity resolution support

What does it do?

Given text or a supported document file, arche returns:

  • the detected PII spans
  • their taxonomy category
  • a sensitivity tier
  • the statute citation used by the loaded jurisdiction
  • the policy action applied
  • redacted text
  • audit records suitable for later review

These outputs are useful for redaction today and for safer record linkage later: tokenized IDs, normalized phones, detected names, and parsed address fragments can become privacy-preserving join signals.
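As an illustration of how a tokenized ID can act as a privacy-preserving join signal, the sketch below normalizes an identifier and derives a keyed token with HMAC-SHA256 from the Python standard library. This is an assumption about the general technique, not a description of how arche-core tokenizes; `normalize_id`, `tokenize`, the demo key, and the sample records are all hypothetical.

```python
import hashlib
import hmac


def normalize_id(raw: str) -> str:
    """Keep digits only, so differently formatted copies compare equal."""
    return "".join(ch for ch in raw if ch.isdigit())


def tokenize(raw: str, key: bytes, label: str) -> str:
    """Derive a stable keyed token; the raw value is not recoverable without the key."""
    digest = hmac.new(key, normalize_id(raw).encode("utf-8"), hashlib.sha256)
    return f"{label}_{digest.hexdigest()[:16]}"


KEY = b"demo-key-do-not-use"  # hypothetical; a real key belongs in a secret store

# Two made-up datasets that hold the same ID in different formats.
crm = [{"customer": "A-17", "nin": "12345678901"}]
kyc = [
    {"case": "K-204", "nin": "1234 567 8901"},
    {"case": "K-205", "nin": "98765432109"},
]

# Join on tokens instead of raw identifiers.
crm_index = {tokenize(r["nin"], KEY, "NIN"): r["customer"] for r in crm}
matches = []
for record in kyc:
    token = tokenize(record["nin"], KEY, "NIN")
    if token in crm_index:
        matches.append((crm_index[token], record["case"]))

print(matches)
```

Both sides must use the same key and the same normalization, otherwise equal identifiers produce different tokens and the join silently misses them.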

Supported launch jurisdictions:

Jurisdiction   Policy loaded
NG             NDPA-2023
ZA             POPIA
KE             Kenya DPA
GH             Ghana DPA

Example: detect PII in text

from arche import Pipeline

pipeline = Pipeline(jurisdiction="NG")
result = pipeline.process(
    "Fatima Abdullahi, NIN 12345678901, BVN 22100987654."
)

print(result.redacted_text)

Example output:

NAME_... NAME_..., NIN [NIN], BVN [BVN].

You can inspect the detections directly:

for detection in result.detections:
    print(
        detection.category,
        detection.sensitivity_tier.value,
        detection.regulatory_citation,
    )

Example output:

Category     Tier       Citation
PII-2-NIN    high       NDPA-2023 s.30, NIMC Act s.27
PII-2-BVN    high       NDPA-2023 s.30, CBN BVN policy 2014
PII-1-NAME   moderate   NDPA-2023 s.30
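A common follow-up is to group detections by sensitivity tier, for example to route high-tier findings to a reviewer. The sketch below relies only on the attributes used in the loop above (`category`, `sensitivity_tier.value`, `regulatory_citation`). So that it runs without arche-core installed, it uses stand-in objects built with `SimpleNamespace`; `categories_by_tier` and `stand_in` are hypothetical helpers.

```python
from collections import defaultdict
from types import SimpleNamespace


def categories_by_tier(detections):
    """Map each sensitivity tier to the categories detected at that tier."""
    grouped = defaultdict(list)
    for detection in detections:
        grouped[detection.sensitivity_tier.value].append(detection.category)
    return dict(grouped)


def stand_in(category, tier, citation):
    """Build a stand-in with the same attribute names as the loop above."""
    return SimpleNamespace(
        category=category,
        sensitivity_tier=SimpleNamespace(value=tier),
        regulatory_citation=citation,
    )


# Stand-in data copied from the example output table.
sample = [
    stand_in("PII-2-NIN", "high", "NDPA-2023 s.30, NIMC Act s.27"),
    stand_in("PII-2-BVN", "high", "NDPA-2023 s.30, CBN BVN policy 2014"),
    stand_in("PII-1-NAME", "moderate", "NDPA-2023 s.30"),
]

grouped = categories_by_tier(sample)
print(grouped)
```

With the real library, passing `result.detections` in place of `sample` should work as long as the objects expose the same attributes.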

Example: scan a document

With arche-core[doc] installed, use the same pipeline on files:

from arche import Pipeline

pipeline = Pipeline(jurisdiction="ZA")
result = pipeline.process_file("dsar_response.pdf")

print(result.summary())
print(result.redacted_text)

process_file(...) delegates parsing to the document substrate, then sends the extracted text through the same detection and policy pipeline.
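When scanning a folder rather than a single file, it can help to collect candidate files first. The following sketch uses only `pathlib`; the suffix set is an assumption derived from the formats listed for the `doc` extra (PDF, DOCX, PPTX, XLSX, HTML), and `find_scannable` is a hypothetical helper, not part of arche-core.

```python
from pathlib import Path

# Assumed from the formats listed for arche-core[doc]; adjust as needed.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx", ".html"}


def find_scannable(root):
    """Return files under `root` whose suffix looks parseable, in sorted order."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Each returned path could then be passed to `pipeline.process_file(...)` as in the example above.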

What can it detect?

Area                  Current coverage
Government IDs        NG NIN, BVN, TIN, RC, PVC, driver's licence; KE National ID, Huduma, KRA PIN, NHIF; ZA ID, tax, passport; GH Ghana Card, SSNIT, TIN; plus wider African ID patterns
Names and local NER   African name lexicon and equivalence data, with optional GLiNER soft-PII detection
Phones                libphonenumber-backed E.164 normalization across African networks
Addresses             Nigeria and South Africa parser MVP
Digital identifiers   DIDs, Bitcoin addresses, Ethereum addresses
Network identifiers   IPv4 and IPv6 detection with private and special-range flags
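To show what private and special-range flags for network identifiers can look like, here is a small sketch built on Python's standard `ipaddress` module. The `ip_flags` helper and the exact set of flags are assumptions for illustration; the flags arche-core actually reports may differ.

```python
import ipaddress


def ip_flags(text: str) -> dict:
    """Parse an IPv4 or IPv6 address and report its range flags.

    Raises ValueError if `text` is not a valid IP address.
    """
    addr = ipaddress.ip_address(text)
    return {
        "version": addr.version,
        "private": addr.is_private,
        "loopback": addr.is_loopback,
        "link_local": addr.is_link_local,
        "multicast": addr.is_multicast,
        "reserved": addr.is_reserved,
    }


for candidate in ("192.168.1.10", "8.8.8.8", "::1", "fe80::1"):
    print(candidate, ip_flags(candidate))
```

Flags like these matter for review: a private or loopback address in a log is usually less identifying than a public one, so a policy may treat them differently.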

Matching names

from arche.match import match

score = match("Mamadou Diallo", "Muhammad Jallow", jurisdiction="NG")
print(score.decision, score.score)

Use this when you need culturally aware name matching before or after PII detection.
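For intuition about why plain string comparison falls short here, the sketch below folds names to lowercase ASCII, maps a few known spelling variants to one canonical form, and then compares with `difflib`. This is a simplified stand-in and not Arche's matching algorithm; the `VARIANTS` table is a tiny hypothetical sample, and `canonical` and `similarity` are illustrative names.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny variant table. A real lexicon would be
# far larger and curated per language and region.
VARIANTS = {
    "mamadou": "muhammad",
    "mohammed": "muhammad",
    "jallow": "diallo",
    "jalloh": "diallo",
}


def canonical(name: str) -> str:
    """Strip accents, lowercase, and map known variants to one spelling."""
    folded = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
        .lower()
    )
    return " ".join(VARIANTS.get(token, token) for token in folded.split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the canonical forms of two names."""
    return SequenceMatcher(None, canonical(a), canonical(b)).ratio()


print(similarity("Mamadou Diallo", "Muhammad Jallow"))
print(similarity("Mamadou Diallo", "Ngozi Okafor"))
```

The first pair scores as identical only because both names reduce to the same canonical form; without the variant table, character-level similarity alone would rate them much lower.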

Detect, Protect, Resolve

Step      What Arche does today
Detect    Finds PII and identity signals in text and supported document files
Protect   Applies jurisdiction-aware policy actions and emits audit-ready output
Resolve   Prepares normalized, protected signals for matching and linkage workflows

Licence

The framework is Apache-2.0. Dataset licensing is documented separately in the dataset cards and repository licensing files.