Helena

Cohort 3: Real-World Performance Study

The largest and most diverse Real-World Performance cohort to date under HEL-RWP-FRAMEWORK-v1.0. Of 60 manifest cases, 57 were scored (3 awaiting-comparator cases excluded) and 56 were per-variant scoreable (one pharmacogenetic re-analysis segregated). All cases were processed under a single locked classifier version (v3.39.1) against multi-source ground truth and the CellGenetics peer comparator, and every platform-versus-comparator discordance was independently re-grounded in a per-variant forensic re-adjudication. This is not a binary pass/fail study; it provides Phase 1 foundational evidence at the current organizational stage.

Study specification

Document: HEL-RWP-C3-2026-001 v5_0 (DRAFT, forensic-adjudicated)
Study type: Real-World Performance Study under HEL-RWP-FRAMEWORK-v1.0
Period: Concordance validation under v3.39.1; cohort closed June 2026
Classifier version: v3.39.1 (final at cohort closure)
Reference sources: Multi-source ground truth: VCEP curations, ClinVar 2-star+ aggregated submissions, calibrated computational predictors, and the CellGenetics peer comparator (comparator only, not ground truth)
Cohort size: 60 manifest cases; 57 scored (3 awaiting-comparator excluded); 56 per-variant scoreable (one TPMT-only pharmacogenetic re-analysis segregated); 184 shared-variant denominator
Clinical domains: 15+ (cardiology, nephrology, neurology, neuromuscular, skeletal dysplasia, ophthalmology, oncology, metabolic, hematology, endocrinology, RASopathy, deafness, reproductive, thrombophilia, carrier screening)
Variant types: Full coding range: missense, nonsense, frameshift/indel, canonical and non-canonical splice, synonymous with RNA effect, in-frame deletion, start-loss, 5-prime UTR, complex delins (out-of-scope SV/CNV surfaced by the comparator are reported separately)
Inheritance patterns: 8+ (autosomal dominant, autosomal recessive incl. compound-het and carrier states, X-linked recessive and dominant, Y-linked, dual AD/AR mechanism, low-penetrance risk alleles)
Performance model: Three-layer (analytical, classification, clinical) with an added per-variant forensic re-adjudication of every LP/VUS discordance
Disposition methodology: Methodological characterization (not binary pass/fail), six-category disposition taxonomy

Layer 1 - Analytical performance

Variant detection (P1): 184/186 in-scope shared variants (98.9%, Wilson 95% CI 96.2-99.7%)

Two in-scope absences are attributed under Category 6 (Case 38 P4HA1; Case 44 SERAC1); 11 further absences are out-of-scope SV/CNV events, excluded from the detection denominator per framework Section 2.1. Benchmark: CLSI MM09-A2, at least 99%.
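The Wilson intervals quoted in this study can be recomputed from the raw counts. The following is a minimal sketch, assuming the standard Wilson score interval without continuity correction and z = 1.96; the study does not state which variant of the formula it used, and the function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval (no continuity correction) for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    z2 = z * z
    denom = 1 + z2 / total
    # The centre is shifted from p toward 0.5; the half-width shrinks as total grows.
    centre = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return centre - half, centre + half


# Counts taken from this study summary.
for label, k, n in [
    ("Variant detection", 184, 186),
    ("FULL", 102, 184),
    ("FULL + CLINICAL", 121, 184),
]:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%}, 95% CI {lo:.1%}-{hi:.1%}")
```

Under these assumptions the three intervals agree with the figures reported in Layers 1 and 2 to one decimal place.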

Gene / HGVS / consequence / zygosity (P2, S1-S3): Concordant on all detected shared variants

The comparison is namespace-aware: Ensembl ENST and RefSeq NM transcript identifiers are reconciled per variant.

Repeatability / reproducibility: Deterministic where examined

Re-resolution within a session returned the identical class.

Layer 2 - Classification performance

FULL concordance vs peer: 102/184 (55.4%, Wilson 95% CI 48.2-62.4%)

The point estimate falls below the published 60-75% inter-laboratory benchmark range (Amendola 2016; Harrison 2017; Bergquist 2025); the upper bound of the confidence interval (62.4%) reaches into it.

FULL + CLINICAL (clinical-equivalent): 121/184 (65.8%, Wilson 95% CI 58.6-72.2%)

Clinically equivalent agreement per ClinGen SVI; within the published inter-laboratory range.

PARTIAL components: 56 across 34 cases

Every PARTIAL is an LP-vs-VUS or LB-vs-VUS adjacent-category divergence, the most frequent locus of inter-laboratory disagreement (Bergquist 2025). No PARTIAL is opposite-direction.

DISCORDANT outcomes: 7 (all both-defensible)

Low-penetrance / risk-allele boundary calls (e.g. HFE, CHEK2, SERPINA1 Z, WARS2, AMPD1). None is an opposite-direction (P/LP-vs-B/LB) error; on two of them the evidence favours Helena.
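The distinction between an adjacent-category divergence and an opposite-direction error can be made concrete on the five-tier ACMG/AMP scale. The sketch below is illustrative only: the `TIERS` ordering and the `divergence` helper are assumptions for this example, not the platform's concordance logic, and the study's FULL / CLINICAL / PARTIAL / DISCORDANT assignment involves considerations (such as low-penetrance conventions) that this helper does not model.

```python
# Five-tier ACMG/AMP classes in ordinal order, benign to pathogenic.
TIERS = ["B", "LB", "VUS", "LP", "P"]
PATHOGENIC = {"LP", "P"}
BENIGN = {"B", "LB"}


def divergence(platform: str, comparator: str) -> str:
    """Describe how two five-tier calls relate to each other (hypothetical helper)."""
    i, j = TIERS.index(platform), TIERS.index(comparator)
    if i == j:
        return "same"
    # One call on the pathogenic side and the other on the benign side.
    if (platform in PATHOGENIC and comparator in BENIGN) or (
        platform in BENIGN and comparator in PATHOGENIC
    ):
        return "opposite-direction"
    # Exactly one step apart on the ordinal scale, e.g. LP vs VUS or LB vs VUS.
    if abs(i - j) == 1:
        return "adjacent"
    return "non-adjacent"


print(divergence("VUS", "LP"))  # adjacent
print(divergence("LB", "VUS"))  # adjacent
print(divergence("LP", "LB"))   # opposite-direction
```

In these terms, the study reports that all PARTIAL components are adjacent and that no PARTIAL or DISCORDANT outcome is opposite-direction.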

NOT EVALUABLE: 13 (reported separately)

11 out-of-scope SV/CNV scope-boundary absences; 2 in-scope variant-calling gaps (Cases 38, 44). Excluded from the concordance fractions per framework Section 2.1.
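The Layer 2 counts can be cross-checked against the denominators used elsewhere in the study. This is a small consistency sketch using only figures from this summary; the CLINICAL-only count of 19 is inferred here as 121 - 102 and is not stated directly, and the variable names are illustrative.

```python
# Outcome counts over the shared-variant denominator.
outcomes = {
    "FULL": 102,
    "CLINICAL": 121 - 102,  # inferred: (FULL + CLINICAL) minus FULL
    "PARTIAL": 56,
    "DISCORDANT": 7,
}

# NOT EVALUABLE components, excluded from the concordance fractions.
not_evaluable = {
    "out_of_scope_sv_cnv": 11,
    "in_scope_calling_gap": 2,
}

shared_denominator = sum(outcomes.values())
# Only the in-scope gaps count toward the Layer 1 detection denominator.
detection_denominator = shared_denominator + not_evaluable["in_scope_calling_gap"]

print(shared_denominator)           # 184
print(detection_denominator)        # 186
print(sum(not_evaluable.values()))  # 13
```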

Forensic re-adjudication: 0 confirmed classifier-logic defects

The LP/VUS discordance set was independently re-grounded from the session databases and traced through the full classifier path; each resolves to a validated guard, a correct eligibility or combining outcome, or a comparator over-call on a single carrier allele. No Category 1 defect confirmed.

Layer 3 - Clinical workflow performance

Q1 - Tier placement: PASS on detected, P/LP, phenotype-matched diagnostic anchors

N/A on genotype-only carrier screens with no phenotype input.

Q2 - Phenotype match: HIGH on matched diagnostic cases

For example NPR2 85.4%, USP9Y 94.3%, PKD1 100%.

Q4 - AI report quality: AI interpretation in 49/57 cases

Three material items were escalated to the AI service; the remainder are polish items.

AI faithfulness: 49/49 AI-bearing cases free of gene-disease name hallucination

Clean cohort-wide; the earlier-cohort hallucination pattern did not recur.

Six-category methodological-disposition taxonomy

Every non-FULL outcome in Cohort 3 is resolved into one of the six categories below, and every LP/VUS discordance additionally received an independent per-variant forensic re-adjudication. The result is methodological characterization rather than binary pass/fail disposition.

Classifier defect

None confirmed

Output inconsistent with ACMG/AMP 2015 or ClinGen SVI. This category is reserved for the human clinical-scientific gate and is never auto-assigned. The re-adjudicated LP/VUS discordance set was traced end to end and none was a classifier-logic defect.

Disposition: No Category 1 remediation owed by this cohort. Two engineering findings the study surfaced (PM5 re-enablement; comp-het splice-partner detection) were nonetheless fixed and are live.

Validated design behavior

Dominant for LoF-null discordances

A documented conservative guard operating per design. Most often this is the autosomal-recessive carrier guard correctly holding a single heterozygous null allele at carrier VUS, and the PM3 comp-het partner-validation guard withholding auto-fire where no ClinVar-anchored partner exists in trans.

Disposition: No remediation. Change Request registry updated to validated with the cohort cases as evidence anchor.

Feature-gap accumulation at LP/VUS boundary

Roadmap-tracked

Documented, roadmap-tracked feature gaps or both-defensible low-penetrance boundaries prevent reaching LP. Distinct from a defect: these are documented limitations, including the low-penetrance risk-allele frequency-convention boundary.

Disposition: Roadmap registration; no specification breach.

Manual-criteria-dependent automation limitation

Inherent scope limit

The comparator classification rests primarily on evidence unavailable from VCF input alone -- de-novo trio, functional, segregation, or literature evidence. An inherent limit of automated VCF-only classification, not a defect.

Disposition: Long-term automation roadmap only.

Tier 1 ground-truth concordance

Anchors the 102 FULL outcomes

Helena and the comparator reach the same class, anchored in Tier 1 ground truth (VCEP curations such as ENIGMA BRCA1/2, ClinGen Hearing Loss GJB2, RASopathy, LGMD, Platelet, Monogenic Diabetes, Hemoglobinopathy, Lysosomal), often via different criteria.

Disposition: No action. System operates correctly.

Input data layer upstream pipeline failure

13 NOT EVALUABLE

Target absent from the input VCF: 11 out-of-scope SV/CNV scope-boundary events and 2 in-scope variant-calling gaps. Platform integrity was verified separately in every case, and the AI never fabricated an in-scope variant to fill a gap.

Disposition: Out-of-scope absences require no platform remediation; the two in-scope gaps trigger the upstream-attribution audit.
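The taxonomy above can be represented as a small enumeration with a guard for the rule that the classifier-defect category is never auto-assigned. This is a hypothetical sketch: the study identifies the classifier defect as Category 1 and the input-data failure as Category 6, while the numbering of the four categories in between is assumed from their order of presentation. The names `Disposition` and `assign` are illustrative, not part of the platform.

```python
from enum import Enum


class Disposition(Enum):
    """Six-category disposition taxonomy; numbers 2-5 are assumed from listing order."""

    CLASSIFIER_DEFECT = 1
    VALIDATED_DESIGN_BEHAVIOR = 2
    FEATURE_GAP_AT_LP_VUS_BOUNDARY = 3
    MANUAL_CRITERIA_DEPENDENT_LIMITATION = 4
    TIER1_GROUND_TRUTH_CONCORDANCE = 5
    INPUT_DATA_UPSTREAM_FAILURE = 6


def assign(category: Disposition, *, by_human_gate: bool = False) -> Disposition:
    """Return the category, refusing automated assignment of a classifier defect."""
    if category is Disposition.CLASSIFIER_DEFECT and not by_human_gate:
        raise PermissionError(
            "Category 1 is reserved for the human clinical-scientific gate"
        )
    return category


# Automated triage may assign any category except Category 1.
print(assign(Disposition.VALIDATED_DESIGN_BEHAVIOR).name)
# A human reviewer may assign Category 1 explicitly.
print(assign(Disposition.CLASSIFIER_DEFECT, by_human_gate=True).name)
```

Encoding the restriction in the assignment path, rather than in documentation alone, makes an accidental automated Category 1 label fail loudly.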