Pipeline¶
Pipeline is the v0.2 framework primitive. A single Pipeline.process(text) call composes detection + jurisdiction-aware policy + audit and returns a typed Result.
from arche import Pipeline
pipeline = Pipeline(jurisdiction="NG")
result = pipeline.process(
"Customer Adesola Okonkwo, NIN 12345678901, phone 0803 555 7890."
)
print(result.redacted_text)
# Customer NAME_..., NIN [NIN], phone PHONE_...
Pipeline¶
class Pipeline:
def __init__(
self,
jurisdiction: str | None = None,
statute: str | None = None,
audit_log: AuditLog | None = None,
) -> None: ...
def process(self, text: str) -> Result: ...
def process_file(self, path: str | Path) -> Result: ...
Constructor parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
jurisdiction |
str \| None |
None |
ISO-3166-1 alpha-2 country code ("NG", "KE", "ZA", "GH"). Auto-loads the matching statute. |
statute |
str \| None |
None |
Explicit statute YAML name ("NDPA-2023", "POPIA", "KENYA-DPA", "GHANA-DPA"). Overrides jurisdiction. |
audit_log |
AuditLog \| None |
None |
Optional arche.graph.audit.AuditLog instance to record every detection. If None, audit entries are still produced on result.audit_entries but not persisted. |
At least one of jurisdiction or statute must be supplied.
Methods¶
process(text: str) -> Result¶
Run the substrate chain on a string. Returns a Result with detections, policy outcomes, redacted text, and audit entries.
process_file(path: str | Path) -> Result¶
Convenience: parse a file via arche.doc.parse (PDF/DOCX/PPTX/XLSX via docling - requires arche-core[doc]) then run process() on the extracted text.
Result¶
@dataclass
class Result:
text: str # Original input
redacted_text: str # After applying policy
detections: list[Detection] # Every category match (pre-policy)
policy_outcomes: list[PolicyOutcome] # Action + statute citation per detection
audit_entries: list[AuditEvent] # PII-free audit rows
statute: Statute # The loaded statute YAML
Methods¶
| Method | Returns | Notes |
|---|---|---|
to_dict() |
dict |
Plain-Python representation |
to_json(indent=2) |
str |
JSON string |
summary() |
dict |
Counts per category and per action |
Detection¶
@dataclass
class Detection:
category: str # Pan-African PII Taxonomy label, e.g. "PII-2-NIN"
value_redacted: str # Placeholder like "[NIN]"
start: int # Character offset
end: int
confidence: float # 1.0 for structurally validated IDs
country: str | None # ISO-3166-1 alpha-2 when known
source: str # "regex" / "validator" / "gliner" / "_africa" / ...
Detection.value_redacted is the placeholder used in result.redacted_text. The original PII value is not retained on the Detection - start/end index into result.text if the caller still has it.
Examples¶
Basic NDPA-2023 pipeline¶
from arche import Pipeline
pipeline = Pipeline(jurisdiction="NG")
result = pipeline.process("NIN 12345678901, BVN 22156789012.")
print([d.category for d in result.detections])
# ['PII-2-NIN', 'PII-2-BVN']
print(result.redacted_text)
# NIN [NIN], BVN [BVN].
Persisted audit log + signed regulator export¶
from arche import Pipeline
from arche.graph.audit import AuditLog
from arche.sign import generate_keypair
audit = AuditLog("./compliance.sqlite")
pipeline = Pipeline(jurisdiction="NG", audit_log=audit)
for text in batch_of_documents:
pipeline.process(text)
officer_key = generate_keypair()
report = audit.export_signed(key=officer_key, purpose="ndpc_quarterly_audit")
# `report` is a JWS-signed bundle the regulator can verify offline.
Pipeline + docling file ingest¶
# requires: pip install arche-core[doc]
from arche import Pipeline
pipeline = Pipeline(jurisdiction="ZA")
result = pipeline.process_file("dsar_response.pdf")
print(result.summary())