Real-World Performance Study

Cohort 2: Multi-Source Performance Characterization

Real-World Performance Study conducted under HEL-RWP-FRAMEWORK-v1.0. Twenty cases were evaluated under a three-layer performance model with multi-source ground-truth construction, and every divergence was methodologically characterized. This is not a binary pass/fail study; it provides foundational scientific evidence at the current organizational stage.

Study specification

Document: HEL-RWP-C2-2026-001 v1.0

Study type: Real-World Performance Study under HEL-RWP-FRAMEWORK-v1.0

Period: March - April 2026

Classifier version: v3.28.0 (final at cohort closure)

Reference sources: Multi-source ground truth (VCEP curations where available, ClinVar 2-star+ aggregated submissions, peer reference comparator)

Sample size: 20 cases

Clinical domains: 13 (neurology, autoinflammatory, nephrology X-linked, RASopathy, neuro-metabolic, endocrine, neuromuscular, lysosomal storage, skeletal dysplasia, connective tissue, respiratory, craniofacial, ciliopathy, Y-linked reproductive, mitochondrial)

Variant types: 7 (missense, nonsense, frameshift, splice region, polypyrimidine tract, complex indel, large recurrent indel)

Inheritance patterns: 8 (AD reduced penetrance, AD haploinsufficient, AR, AR carrier, AR compound heterozygous, X-linked recessive, Y-linked LoF, dual AD/AR mechanism)

Performance model: Three-layer (analytical, classification, clinical workflow)

Disposition methodology: Methodological characterization (not binary pass/fail), six-category disposition taxonomy

Layer 1 - Analytical performance

Variant detection (P1): 19/20 (95%)

The single non-detection was attributed to an upstream sequencing pipeline failure through a 7-step audit trail. Helena platform integrity was verified by 21:21 input-to-database record correspondence, so attribution to Helena is CLEARED.
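The record-correspondence check described above can be illustrated with a minimal sketch. It assumes that each record is identified by chromosome, position, reference allele, and alternate allele; the `VariantKey` type, the `record_correspondence` function, and the toy data are hypothetical and are not Helena's actual audit procedure.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class VariantKey(NamedTuple):
    """Hypothetical record identity: one key per VCF data line."""
    chrom: str
    pos: int
    ref: str
    alt: str


def record_correspondence(
    vcf_keys: Iterable[VariantKey], db_keys: Iterable[VariantKey]
) -> Tuple[int, Set[VariantKey], Set[VariantKey]]:
    """Return (matched count, keys missing from the database, keys only in the database)."""
    vcf_set, db_set = set(vcf_keys), set(db_keys)
    return len(vcf_set & db_set), vcf_set - db_set, db_set - vcf_set


# Toy data (invented for illustration): 21 input records, all stored unchanged.
vcf_records = [VariantKey("1", 1000 + i, "A", "G") for i in range(21)]
db_records = list(vcf_records)

matched, missing_from_db, only_in_db = record_correspondence(vcf_records, db_records)
print(f"{len(vcf_records)}:{matched} input-to-database correspondence")
```

When every input record has a stored counterpart and nothing extra appears in the database, a variant that is absent from the database must also be absent from the input file, which is the basis for attributing the non-detection upstream.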

Gene assignment (P2): 19/19 (100% where evaluable)

HGVS / consequence / zygosity (S1-S3): All PASS where evaluable

Layer 2 - Classification performance

FULL concordance: 13 components

ACMG class identical to that of the peer reference comparator. Anchored variously in ClinVar 2-3 star, ClinGen Hearing Loss VCEP v1.0, KDIGO 2025 ADPKD guideline, or independent multi-criterion convergence.

PARTIAL components: 6 (across 4 cases)

Methodologically characterized as follows: 2 components of validated design behavior (computational-only LP guard and AR carrier guard operating per design), 3 of feature-gap accumulation at the LP/VUS boundary (PP3 tool divergence, PM1 coverage, PP2 implementation), and 1 of manual-criteria-dependent inherent automation limitation.

DISCORDANT outcomes: 0

No opposite-direction classifications observed

NOT EVALUABLE: 1

Upstream variant calling pipeline failure (verified non-Helena attribution)

Inter-laboratory benchmark: Within published range

Published inter-laboratory FULL concordance benchmarks: Amendola 2016 (66%), Harrison 2017 (72-76%), Bergquist 2025 (70-75%).
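The comparison with the published benchmarks can be reproduced as a short calculation. This is a sketch under two assumptions that the study does not state explicitly: the FULL concordance rate is computed at component level over evaluable components only (FULL + PARTIAL + DISCORDANT), and "within published range" means between the lowest and highest of the cited benchmark values.

```python
# Layer 2 counts as reported above.
full, partial, discordant = 13, 6, 0

# Assumption: the NOT EVALUABLE case is excluded from the denominator.
evaluable = full + partial + discordant
full_rate = full / evaluable

# Cited inter-laboratory FULL concordance benchmarks as (low, high) fractions.
benchmarks = {
    "Amendola 2016": (0.66, 0.66),
    "Harrison 2017": (0.72, 0.76),
    "Bergquist 2025": (0.70, 0.75),
}

# Assumption: the "published range" spans the minimum to the maximum cited value.
range_low = min(low for low, _ in benchmarks.values())
range_high = max(high for _, high in benchmarks.values())
within_range = range_low <= full_rate <= range_high

print(f"FULL concordance: {full}/{evaluable} = {full_rate:.1%}")
print(f"Published range: {range_low:.0%}-{range_high:.0%}; within range: {within_range}")
```

Under these assumptions the rate is 13/19, which falls inside the 66-76% span of the cited benchmarks. A different denominator would give a different figure.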

Layer 3 - Clinical workflow performance

Q1 - Tier placement: PASS in diagnostic indication cases

Target P/LP variants placed in Tier 1 of phenotype matching output

Q2 - Phenotype match score: HIGH in cases with appropriate HPO input

Match score of at least 50%

Q4 - AI report quality: GOOD in majority

Variant mentioned, interpretation correct, overall quality good

AI Faithfulness: 14/20 (70%) PASS

Seven consecutive PASS results at cohort closure (Cases 14-20). Improvement trajectory documented across the cohort.

Change Request validations: 10 deployed CRs validated

Trigger-configuration validation of conservative-guard architecture across diverse real-world clinical scenarios

Six-category methodological-disposition taxonomy

Every observed concordance pattern in Cohort 2 is resolved into one of the categories below, providing comprehensive methodological characterization rather than binary pass/fail disposition. The taxonomy is the principal scientific contribution of Cohort 2 and is intended to inform future cohort design.

Tier 1 ground-truth concordance

13 components

Helena and the peer comparator both reach the same classification, anchored in VCEP curation, ClinVar 2-star or higher, or established clinical practice guidelines. Multiple ACMG criteria converge on the same class.

Disposition: No action. System operates correctly. Multiple validation milestones documented.

Validated design behavior

2 components

Helena classification reflects a documented classifier guard operating per ClinGen-aligned conservative design principles. Disagreement with the peer comparator does not reflect platform error; the guard is the explicit intended output for genotype-context-aware classification.

Disposition: No remediation. Update Change Request registry status to validated with cohort cases as evidence anchor.

Feature-gap accumulation at LP/VUS boundary

3 components

Multiple documented feature gaps in Helena automated classification scope collectively prevent reaching the LP combining-rule threshold. Three identified gaps: PP3 computational predictor tool divergence, PM1 critical-domain coverage scope, and PP2 not yet implemented in the automated classifier. Each gap is independently characterized and roadmap-tracked. Distinct from defect: the gaps are documented limitations, not errors.

Disposition: Roadmap-tracked. PM1 domain coverage expansion in flight (HELIX-CR-2026-082, deployed April 2026 with UniProt residue-level evidence integration). PP2 implementation roadmap formalization. PP3 tool-choice divergence: no classifier change recommended (defensible methodological choice per ClinGen SVI).
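To illustrate how missing criteria can hold a variant below the LP threshold, here is a minimal sketch of the pathogenic-side combining rules from the 2015 ACMG/AMP guideline (Richards et al.). It is not Helena's classifier: the function names and the example evidence sets are hypothetical, benign evidence is assumed absent, and strength-modified criteria (for example a moderate criterion downgraded to supporting) are not handled.

```python
from collections import Counter
from typing import Iterable

# Default strength by criterion prefix in the 2015 ACMG/AMP framework.
STRENGTH_BY_PREFIX = {"PVS": "very_strong", "PS": "strong", "PM": "moderate", "PP": "supporting"}


def strength_of(criterion: str) -> str:
    """Map a criterion code such as 'PM1' to its default evidence strength."""
    for prefix, strength in STRENGTH_BY_PREFIX.items():
        if criterion.startswith(prefix):
            return strength
    raise ValueError(f"Not a pathogenic criterion: {criterion}")


def classify(criteria: Iterable[str]) -> str:
    """Apply the pathogenic-side combining rules; assumes no benign evidence."""
    counts = Counter(strength_of(c) for c in criteria)
    vs, s = counts["very_strong"], counts["strong"]
    m, p = counts["moderate"], counts["supporting"]

    pathogenic = (
        (vs >= 1 and (s >= 1 or m >= 2 or (m >= 1 and p >= 1) or p >= 2))
        or s >= 2
        or (s >= 1 and (m >= 3 or (m >= 2 and p >= 2) or (m >= 1 and p >= 4)))
    )
    if pathogenic:
        return "Pathogenic"

    likely_pathogenic = (
        (vs >= 1 and m >= 1)
        or (s >= 1 and m >= 1)
        or (s >= 1 and p >= 2)
        or m >= 3
        or (m >= 2 and p >= 2)
        or (m >= 1 and p >= 4)
    )
    return "Likely pathogenic" if likely_pathogenic else "VUS"


# Hypothetical evidence sets, not taken from any Cohort 2 case.
all_criteria_applied = ["PM2", "PM1", "PP3", "PP2"]  # 2 moderate + 2 supporting
with_feature_gaps = ["PM2"]  # PM1 outside coverage, PP3 not met by the chosen tool, PP2 not implemented

print(classify(all_criteria_applied))
print(classify(with_feature_gaps))
```

In this invented example no single gap is an error, but together they remove the evidence needed for the "2 moderate + 2 supporting" rule, so the result stays at VUS rather than LP.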

Manual-criteria-dependent inherent automation limitation

1 component

Classification depends primarily on manual-curation evidence (case-control / case-series literature, family cosegregation, individual ClinVar submitter evaluation) that is not available from VCF input. Distinct from feature gap: not a classifier roadmap item, but an inherent scope limitation of automated VCF-only classification.

Disposition: Long-term automation roadmap items only: literature mining for PS4 case-control automation, external trio data ingestion for PP1 segregation, gene-aware PP2 thresholds for small genes. Not classifier defect remediation.

Input data layer upstream pipeline failure

1 case (NOT EVALUABLE)

Target variants were absent from the input VCF because of a failure in the upstream sequencing or variant calling pipeline. Helena platform integrity was verified via 21:21 input-to-database record correspondence at the correct genomic target coordinates. The failure is upstream, not in Helena.

Disposition: No Helena remediation required (attribution cleared). Upstream pipeline investigation pending sequencing facility action.
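The category counts listed in this section can be cross-checked against the Layer 2 totals with a small tally. This is a sketch only: the `Category` enum and the mapping of each category to a Layer 2 outcome are illustrative names, and only the categories itemized above are included.

```python
from enum import Enum


class Category(Enum):
    """Disposition categories itemized in this section (names are illustrative)."""
    TIER1_CONCORDANCE = "Tier 1 ground-truth concordance"
    VALIDATED_DESIGN = "Validated design behavior"
    FEATURE_GAP = "Feature-gap accumulation at LP/VUS boundary"
    MANUAL_CRITERIA = "Manual-criteria-dependent inherent automation limitation"
    UPSTREAM_FAILURE = "Input data layer upstream pipeline failure"


# (Layer 2 outcome, count) per category, as reported in the study.
tally = {
    Category.TIER1_CONCORDANCE: ("FULL", 13),
    Category.VALIDATED_DESIGN: ("PARTIAL", 2),
    Category.FEATURE_GAP: ("PARTIAL", 3),
    Category.MANUAL_CRITERIA: ("PARTIAL", 1),
    Category.UPSTREAM_FAILURE: ("NOT EVALUABLE", 1),  # counted as a case, not a component
}


def total_for(outcome: str) -> int:
    """Sum the counts of all categories mapped to the given Layer 2 outcome."""
    return sum(count for mapped, count in tally.values() if mapped == outcome)


for outcome in ("FULL", "PARTIAL", "NOT EVALUABLE"):
    print(f"{outcome}: {total_for(outcome)}")
```

The PARTIAL total of 2 + 3 + 1 matches the 6 PARTIAL components reported under Layer 2, and the FULL and NOT EVALUABLE totals match the reported 13 and 1.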
