ADR-0014 — Locale-pluggable matcher comparators: content-addressed profiles that travel with the data¶
- Status: Accepted
- Date: 2026-06-16
Context¶
Former open question §11.7 — define the matcher's comparator extension point (comparator API, weight
configuration, per-deployment evaluation harness) — is the last of the original §11 items. The matching
pipeline (§5.2)
already states the requirement — "phonetic encodings, name structures, DOB precision handling, address
semantics are deployment configuration, not hardcoded" — and the matcher is advisory and external
(Python; §9.4): it
only ever proposes link candidates, which become ordinary assert/link events through the closed
identity algebra (§5.7).
This makes §11.7 structurally low-stakes in a way §11.6/§11.4 were not — it touches nothing irreversible: no envelope reserve, no day-one commitment, no new event stream. The blast radius is already contained twice over: (1) a comparator feeds only additive advisory evidence into a conservative, human-backstopped, coherence-checked decision (§5.2), and (2) "unmerge is always possible and clean" (§5.1), so even a wrong auto-link is reversible with no data loss. The architecture already absorbed "the matcher will sometimes be wrong."
Two forces, surfaced in case-mining with a clinician working across the Australian Top End, Kimberley, and the east/south coasts, make it richer than a pure configuration question:
- The right comparator is a property of the data's cultural origin, which travels with the patient — but comparator code cannot travel the clinical plane. Indigenous Top End / Cape York naming and birthdate-uncertainty norms are the rule there and completely different from Melbourne's; people relocate, and forcing one region's comparators onto another region's records on a merge is catastrophic (a foreign comparator that over-trusts weak evidence manufactures false merges — the asymmetric error). Yet a comparator is executable code, which must never sync over the clinical mesh (principle 8, ADR-0012), and a central registry is undesirable (a point of capture, against the mission).
- The silent false split is the one error the live matcher cannot see. Generous surfacing of suspected matches passes paper-parity (paper identity-matching is equally cumbersome), but a duplicate the matcher confidently rejected is never shown to anyone — fragmented history, hidden allergies on the other chart. Inline metrics cannot measure confident-rejects (humans only adjudicate what the matcher surfaced).
The deeper framing the case sharpened: hardcoding one culture's name/date/address model is cultural capture — the same anti-capture instinct behind vendor-independence (principle 7), applied to the demographic model. A matcher that assumes given+family order, Soundex, and a reliable Gregorian DOB fails the Kimberley clinic, the refugee camp, and the Indonesian mononym. Locale-pluggable, locally-evaluable comparators are paper-parity for the registrar in any culture.
Decision¶
§11.7 dissolves into existing primitives composed — no new founding principle, no envelope reserve, one small additive data-model field. Canonical home: identity §5.13; the assertion-level profile tag is demographics §4.1; the matcher's advisory/registered-actor status is §5.2 / ADR-0011.
-
Pluggable, never hardcoded — anti-capture applied to the demographic model. Cairn ships the comparator mechanism, never a privileged culture's model. Phonetic encodings, name structures, nickname/diminutive and transliteration lexicons, DOB-precision handling, and address semantics are all plugins selected by deployment, not baked into the core (principle 7 / principle 9).
-
The comparator API contract. A comparator is a pure, field-typed function
compare(value_a, value_b, context) → graded agreement, returning not a boolean but an agreement level (exact / nickname- or transliteration-equivalent / phonetic / edit-distance / none), because Fellegi–Sunter weighs each level differently. Three contract properties are principle-bearing: - Uncertainty-aware (§3.7):
no-data is never disagreement. A missing/unknown field contributes zero evidence, never a penalty —
fabricating "disagree" from an absent DOB is the §3.7 sin and actively misleads the matcher (the
generalization of §4.2's down-weighting of default
01-01birthdays). Precision-tagged values yield partial agreement (year-only DOB, estimated age). - Provenance-aware (§4.2). Agreement/disagreement weight scales with provenance — a verified-DOB clash is strong evidence against a link; an imported/unknown-DOB clash is weak.
-
Operates over the multi-valued name history set, not the current display value (§4.2). Maiden/married switching, changed family names, and discarded aliases match because the append-only name set retained them (match if any historical name agrees). Name comparison is token-based, role-tolerant, and order-tolerant — given-name order, given/family swaps, and hyphenated-surname order compare as bags of role-tagged tokens, not positionally.
-
Comparator identity travels with the data; comparator code travels the distribution plane; a missing comparator degrades honestly to human. The split that resolves "comparators must travel" without syncing code and without a central registry:
- A comparator-profile tag rides each demographic assertion as declarative, non-executable
provenance ("this value was asserted under naming-convention profile
namespace@content-hash") — ordinary append-only data on the clinical plane. It defaults silently from the registering node's locale, with a registrar-visible override (a one-tap convention selector) for the relocation and visitor cases (a tourist injured in Cape York must not be silently tagged with the local Indigenous convention, and vice-versa). It is per-assertion, so one patient may carry anangloname and acapeyorkname, each tagged. - The comparator code/weights travel the distribution plane (signed, per-node, verified before load, sneakernet-capable), never the clinical mesh.
- Content-addressed identity = no central registry. The tag is
namespace@content-hash— the human-readable namespace for selection/display, the hash for unambiguous global identity and integrity (the ADR-0013 content-addressing payoff applied to comparator identity: two nodes referring to the same hash mean the same comparator with no coordinator, and the tag still resolves against any mirror or sneakernet copy). -
Honest degradation, never the wrong comparator — the §3.13 legibility-ladder / §6.2 honest-assembly pattern applied to matching: when a node lacks a record's tagged comparator, or matches across two different profiles, it does not force its local comparator — it surfaces the pair to a human. Safety-preserving by construction: uncertainty about which comparator applies can only ever withhold an auto-link (push to human), never manufacture one, so it sits on the safe side of the false-merge ≫ false-split asymmetry automatically. The "foreign comparator forced on a relocated record" catastrophe becomes structurally unreachable.
-
Weight configuration is the locale parameter set; the matcher is a registered actor. The m/u probabilities per field per agreement-level are the deployment's locale tuning (a surname match means far more in a high-diversity population). Cairn ships defaults plus a way to learn them from local data. A comparator+weight bundle is the matcher's version-pinned standing configuration under ADR-0011 (the §3.12 inference-config analogue), so "which links did matcher-config v3 propose?" is recall-traceable and a bad rollout is recalled via the §5.5 contamination-cascade primitive, exactly like an AI model recall.
-
The evaluation harness and the duplicate-sweep miss-detector. The human-adjudication outcomes accumulating in the reconciliation queue (confirmed/rejected links, disputes) are labeled evaluation data produced by normal operation; the harness computes precision/recall with the false-merge rate as its own safety-asymmetric metric and gates config changes behind it. The confident-reject blind spot is closed by a periodic, low-priority, aggressive (low-threshold) background re-match sweep at the hub tier (fractal topology — the hub has the compute and sees the population; the Pi does live single-record matching). The sweep never auto-acts; it emits a ranked possible-duplicate worklist for a records officer, runs preemptibly at lowest priority and can never starve clinical work (the ADR-0013 byte-tier discipline), and is additive / advisory (safe un-owned, ADR-0010). Its yield is the miss-rate and drift metric the inline view cannot produce (the ADR-0010 atrophy-signal pattern). Two existing legs complete it: opportunistic re-match on every new assertion (§5.2/§5.4 — a confident-reject flips as a shared phone/ID/refined-DOB lands; monotonic refinement, §3.7), and a cheap point-of-care "this might be a duplicate — search & link" affordance (paper-parity gain: the patient who says "I have another file here" is evidence the matcher never had; finding the other folder on paper is harder).
-
The safety floor pluggability may not relax. Regardless of which comparators/weights are plugged in: auto-link still requires a conservative threshold, the wide middle band still goes to humans, and the coherence check (§5.2) still demotes contradictory links. A small closed set of hard vetoes (same-system identifier mismatch; verified DOB clash; verified sex-at-birth clash; deceased-status conflict — §4.2) forces a human decision — never an auto-link, and never an auto-reject (an auto-reject is itself a silent false split). "Err on the side of caution and prompt the user."
-
Distribution is federated, not central — GitHub doubles as the registry. Cairn-official, vetted, signed comparator packs in a Cairn repository plus community packs in theirs; signed and content-addressed, so trust is in the signature/hash, not the host. Git is mirrorable and sneakernet-cloneable, so GitHub is convenience, never a dependency — no single point of capture (mission-aligned).
-
Blast radius (§9). Fit-for-purpose (Python, advisory): every comparator, the weight-learning, the evaluation harness, and the duplicate sweep — a defect yields a bad proposal a human reviews, never a silent record corruption. Safety-critical (in-DB): the conservative auto-link threshold, the hard-veto set, the coherence check, and the proposal → identity-algebra apply seam (the one path where an advisory proposal becomes an authoritative event) — the recurring seam motif.
Consequences¶
- Easier: the matcher fits any culture without core changes; a relocated patient is matched safely because the comparator-profile travels as data and a node that lacks it degrades to human rather than guessing wrong; comparator identity is global with no central registry (content hash + git); the confident-reject blind spot becomes a measurable, standing, advisory background signal; and recall/rollout of a comparator config is free off the actor registry + contamination cascade.
- Harder / new surface: the comparator-profile tag is new (small, additive) provenance on assertions and a registration UI affordance (silent default + override); the hub duplicate-sweep is a new background job (preemptible, blocked/indexed for tractability); cross-profile matching needs an explicit "different conventions → human" path; and weight-learning per deployment is real ML/ops work (mitigated by shipped defaults).
- The bet: that an advisory, content-addressed, honestly-degrading comparator framework — backstopped by a conservative threshold, hard vetoes-to-human, the coherence check, and §5.1 reversibility — keeps the catastrophic false-merge rate near zero across wildly different populations while letting each tune its own locale, and that the low-priority hub sweep surfaces the silent false splits without ever compromising care. We would know it is wrong if cross-profile matching floods humans to the point of alert-fatigue (§5.12), or if the sweep proves intractable at population scale even with blocking.
- Policy-neutral (principle 9): Cairn ships the API, the harness, the sweep, the conservative-threshold-and-veto floor, and a starter comparator set; which comparators and weights a deployment runs, which hard vetoes it configures (above the floor), the sweep cadence, and whether registration auto-defaults or forces an explicit convention choice, are policy.