Cohort 3: Real-World Performance Study
Cohort 3 is the largest and most diverse Real-World Performance cohort to date under HEL-RWP-FRAMEWORK-v1.0. It comprises 57 scored real-world clinical cases (56 per-variant scoreable; 3 awaiting-comparator cases excluded; one pharmacogenetic re-analysis segregated), all processed under a single locked classifier version (v3.39.1) against multi-source ground truth and the CellGenetics peer comparator. Every platform-versus-comparator discordance was independently re-grounded in a per-variant forensic re-adjudication. This is not a binary pass/fail study; it provides Phase 1 foundational evidence at the current organizational stage.
Study specification
Layer 1 - Analytical performance
Two in-scope absences attributed under Category 6 (Case 38 P4HA1; Case 44 SERAC1); 11 further absences are out-of-scope SV/CNV events excluded from the detection denominator per framework Section 2.1. Benchmark: CLSI MM09-A2 at least 99%.
Namespace-aware (Ensembl ENST vs RefSeq NM reconciled per variant)
Identical class on re-resolution within session
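The namespace-aware comparison noted above can be illustrated with a minimal Python sketch. It assumes a caller-supplied cross-reference table from Ensembl to RefSeq transcript accessions. The function names, the table, and the choice to ignore version suffixes are all hypothetical and are not taken from the platform.

```python
import re

# Accession patterns for the two transcript namespaces.
_ENSEMBL = re.compile(r"^ENST\d+(\.\d+)?$")
_REFSEQ = re.compile(r"^[NX][MR]_\d+(\.\d+)?$")


def namespace(accession: str) -> str:
    """Return 'ensembl', 'refseq', or 'unknown' for a transcript accession."""
    if _ENSEMBL.match(accession):
        return "ensembl"
    if _REFSEQ.match(accession):
        return "refseq"
    return "unknown"


def strip_version(accession: str) -> str:
    """Drop the '.N' version suffix so version drift does not block a match."""
    return accession.split(".", 1)[0]


def same_transcript(a: str, b: str, xref: dict) -> bool:
    """Compare two accessions after translating Ensembl IDs to RefSeq.

    xref maps unversioned ENST accessions to unversioned RefSeq accessions.
    It is a hypothetical, caller-supplied table (assumption).
    """
    def to_refseq(acc):
        base = strip_version(acc)
        kind = namespace(acc)
        if kind == "ensembl":
            return xref.get(base)  # None when no cross-reference is known
        if kind == "refseq":
            return base
        return None

    ra, rb = to_refseq(a), to_refseq(b)
    return ra is not None and ra == rb
```

A real pipeline would likely also need to decide whether a version mismatch is acceptable per variant; this sketch treats versions as interchangeable purely for illustration.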
Layer 2 - Classification performance
Within the published 60-75% inter-laboratory benchmark range (Amendola 2016; Harrison 2017; Bergquist 2025).
Clinically equivalent agreement per ClinGen SVI; within the published inter-laboratory range.
Every PARTIAL is an LP-vs-VUS or LB-vs-VUS adjacent-category divergence -- the most frequent locus of inter-laboratory disagreement (Bergquist 2025). No opposite-direction PARTIAL.
Low-penetrance / risk-allele boundary calls (e.g. HFE, CHEK2, SERPINA1 Z, WARS2, AMPD1). None is an opposite-direction (P/LP-vs-B/LB) error; on two of them the evidence favours Helena.
11 out-of-scope SV/CNV scope-boundary absences; 2 in-scope variant-calling gaps (Cases 38, 44). Excluded from the concordance fractions per framework Section 2.1.
The LP/VUS discordance set was independently re-grounded from the session databases and traced through the full classifier path; each resolves to a validated guard, a correct eligibility or combining outcome, or a comparator over-call on a single carrier allele. No Category 1 defect confirmed.
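The distinction drawn in this layer between full agreement, adjacent-category divergence, and opposite-direction error can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's scoring code: FULL and PARTIAL are the report's labels, while `CLINICALLY_EQUIVALENT`, `NON_ADJACENT`, and `OPPOSITE_DIRECTION` are hypothetical label names, and treating P-vs-LP and B-vs-LB as clinically equivalent is an assumption about how the report applies the term.

```python
from enum import IntEnum


class AcmgClass(IntEnum):
    """ACMG/AMP 2015 five-tier classes, ordered benign to pathogenic."""
    B = 0
    LB = 1
    VUS = 2
    LP = 3
    P = 4


def direction(c: AcmgClass) -> int:
    """Return -1 for B/LB, 0 for VUS, +1 for LP/P."""
    if c <= AcmgClass.LB:
        return -1
    if c >= AcmgClass.LP:
        return 1
    return 0


def compare(platform: AcmgClass, comparator: AcmgClass) -> str:
    """Label the relationship between two classifications of one variant."""
    if platform == comparator:
        return "FULL"
    dp, dc = direction(platform), direction(comparator)
    if dp == dc:
        # Same side of VUS (P vs LP, or B vs LB); assumed equivalent.
        return "CLINICALLY_EQUIVALENT"
    if dp * dc == -1:
        # P/LP on one side, B/LB on the other.
        return "OPPOSITE_DIRECTION"
    if abs(platform - comparator) == 1:
        # LP vs VUS or LB vs VUS: adjacent-category divergence.
        return "PARTIAL"
    # P vs VUS or B vs VUS: one side is VUS but the tiers are not adjacent.
    return "NON_ADJACENT"
```

For example, `compare(AcmgClass.LP, AcmgClass.VUS)` yields `"PARTIAL"`, whereas `compare(AcmgClass.P, AcmgClass.LB)` yields `"OPPOSITE_DIRECTION"`.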
Layer 3 - Clinical workflow performance
N/A on genotype-only carrier screens with no phenotype input.
For example, NPR2 85.4%, USP9Y 94.3%, PKD1 100%.
Three material items were escalated to the AI service; the remainder are polish items.
Clean cohort-wide; the earlier-cohort hallucination pattern did not recur.
Six-category methodological-disposition taxonomy
Every non-FULL outcome in Cohort 3 is resolved into one of the six categories below, and every LP/VUS discordance additionally received an independent per-variant forensic re-adjudication. The result is methodological characterization rather than binary pass/fail disposition.
Classifier defect
None confirmed
Output inconsistent with ACMG/AMP 2015 or ClinGen SVI. This category is reserved for the human clinical-scientific gate and is never auto-assigned. The re-adjudicated LP/VUS discordance set was traced end to end and none was a classifier-logic defect.
Disposition: No Category 1 remediation owed by this cohort. Two engineering findings the study surfaced (PM5 re-enablement; comp-het splice-partner detection) were nonetheless fixed and are live.
Validated design behavior
Dominant for LoF-null discordances
A documented conservative guard operating per design -- most often the autosomal-recessive carrier guard correctly holding a single heterozygous null allele at carrier VUS, and the PM3 comp-het partner-validation guard withholding auto-fire where no ClinVar-anchored partner exists in trans.
Disposition: No remediation. Change Request registry updated to validated with the cohort cases as evidence anchor.
Feature-gap accumulation at LP/VUS boundary
Roadmap-tracked
Documented, roadmap-tracked feature gaps or both-defensible low-penetrance boundaries prevent reaching LP. Distinct from a defect: these are documented limitations, including the low-penetrance risk-allele frequency-convention boundary.
Disposition: Roadmap registration; no specification breach.
Manual-criteria-dependent automation limitation
Inherent scope limit
The comparator classification rests primarily on evidence unavailable from VCF input alone -- de-novo trio, functional, segregation, or literature evidence. An inherent limit of automated VCF-only classification, not a defect.
Disposition: Long-term automation roadmap only.
Tier 1 ground-truth concordance
Anchors the 102 FULL outcomes
Helena and the comparator reach the same class, anchored in Tier 1 ground truth (VCEP curations such as ENIGMA BRCA1/2, ClinGen Hearing Loss GJB2, RASopathy, LGMD, Platelet, Monogenic Diabetes, Hemoglobinopathy, Lysosomal), often via different criteria.
Disposition: No action. System operates correctly.
Input data layer upstream pipeline failure
13 NOT EVALUABLE
Target absent from the input VCF -- 11 out-of-scope SV/CNV scope-boundary events and 2 in-scope variant-calling gaps. Platform integrity was verified separately in every case, and the AI never fabricated an in-scope variant to fill a gap.
Disposition: Out-of-scope absences require no platform remediation; the two in-scope gaps trigger the upstream-attribution audit.
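The taxonomy above can be represented as a small data structure. The following Python sketch is illustrative only. The report confirms the numbering only for Category 1 (classifier defect) and Category 6 (input-layer absences); numbers 2 to 5 are assumed from the listing order, and all identifiers (`Category`, `assign`, the counter keys) are hypothetical. The only figures used are the absence counts stated in the report.

```python
from collections import Counter
from enum import Enum


class Category(Enum):
    """Six methodological-disposition categories (2-5 numbered by listing order)."""
    CLASSIFIER_DEFECT = 1
    VALIDATED_DESIGN_BEHAVIOR = 2
    FEATURE_GAP_LP_VUS_BOUNDARY = 3
    MANUAL_CRITERIA_LIMITATION = 4
    TIER1_CONCORDANCE = 5
    INPUT_DATA_UPSTREAM_FAILURE = 6


def assign(category: Category, confirmed_by_human_gate: bool = False) -> Category:
    """Assign a category, refusing to auto-assign a classifier defect.

    Category 1 is reserved for the human clinical-scientific gate, so an
    automated caller must not produce it without explicit confirmation.
    """
    if category is Category.CLASSIFIER_DEFECT and not confirmed_by_human_gate:
        raise ValueError("Category 1 requires the human clinical-scientific gate")
    return category


# Category 6 absences as reported for Cohort 3.
absences = Counter({"out_of_scope_sv_cnv": 11, "in_scope_calling_gap": 2})

# All absences are NOT EVALUABLE for concordance purposes.
not_evaluable = sum(absences.values())

# Only in-scope gaps stay in the detection denominator; out-of-scope
# SV/CNV events are excluded per framework Section 2.1.
counted_against_detection = absences["in_scope_calling_gap"]
```

Keeping the human-gate check inside `assign` is one possible way to make the "never auto-assigned" rule hard to bypass; other designs (for example, a separate review queue) would serve equally well.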