How Helena Works

Helena follows a simple principle: variant classification must be rule-based and auditable. AI assists in evidence synthesis but never makes classification decisions. Every result is traceable to its source data, published criteria, and referenced literature.

Quality Control → Annotation → Classification → Phenotype → Literature → Screening → Interpretation

Under 15 minutes: full genome processing and clinical interpretation

The Analysis Pipeline

Seven stages transform a raw VCF file into a clinician-ready interpretation report. Each stage produces traceable, auditable output.

VCF Processing & Quality Control

Standard VCF parsing accepts files from any sequencing platform and assay type: whole genome, whole exome, or targeted panels. Quality metrics are assessed per variant, with configurable filters for read depth, genotype quality, and allelic balance.

Variants with documented clinical significance in ClinVar are preserved regardless of quality score. This maximum-sensitivity approach ensures that no clinically relevant variant is discarded because of quality metrics alone. It is a deliberate design decision for clinical safety.

VCF standard format · Quality filtering · ClinVar protection · Maximum sensitivity

Output: Quality-filtered variant set with clinically significant variants preserved
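
The filter-with-protection logic described above can be sketched as follows. This is a minimal illustration, not Helena's implementation: the `Variant` fields, the default thresholds, and the set of ClinVar labels treated as protected are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: which ClinVar labels count as "documented clinical significance".
PROTECTED_SIGNIFICANCE = {
    "Pathogenic",
    "Likely pathogenic",
    "Pathogenic/Likely pathogenic",
}


@dataclass
class Variant:
    depth: int                # read depth at the site
    genotype_quality: int     # genotype quality score
    alt_fraction: float       # fraction of reads supporting the alternate allele
    is_het: bool              # True for heterozygous calls
    clinvar_significance: Optional[str] = None


def passes_quality(v, min_depth=10, min_gq=20, het_balance=(0.25, 0.75)):
    """Apply the configurable quality filters (threshold values are hypothetical)."""
    if v.depth < min_depth or v.genotype_quality < min_gq:
        return False
    if v.is_het:
        low, high = het_balance
        return low <= v.alt_fraction <= high
    return True


def keep_variant(v, **thresholds):
    """ClinVar-protected variants bypass the quality filters entirely."""
    if v.clinvar_significance in PROTECTED_SIGNIFICANCE:
        return True
    return passes_quality(v, **thresholds)


low_quality_pathogenic = Variant(4, 8, 0.10, True, "Pathogenic")
low_quality_unknown = Variant(4, 8, 0.10, True)
print(keep_variant(low_quality_pathogenic), keep_variant(low_quality_unknown))
```

In this sketch the protection check runs before any quality test, so a poorly covered variant with a protected ClinVar label is kept while an otherwise identical unlabeled variant is dropped.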

Variant Annotation

Each variant is annotated through Ensembl VEP (Variant Effect Predictor) for consequence prediction, protein impact, and functional domain mapping. Parallel processing enables efficient annotation of millions of variants per genome.

Multi-source database enrichment adds population frequencies from gnomAD (global and population-specific allele frequencies); clinical significance from ClinVar; functional impact and conservation predictions from 12+ computational tools, including SIFT, PolyPhen-2, CADD, REVEL, AlphaMissense, DANN, MetaSVM, GERP++, PhyloP, and PhastCons; gene constraint metrics (pLI, LOEUF, o/e loss-of-function); and gene-disease associations from ClinGen.

Ensembl VEP · gnomAD · ClinVar · dbNSFP · ClinGen · 12+ predictors · 60+ annotations per variant

Output: Fully annotated variants with population, functional, conservation, and clinical data

ACMG/AMP Classification

Variant classification follows the 2015 ACMG/AMP guidelines (Richards et al., Genetics in Medicine), the international standard for clinical variant interpretation. All 28 evidence criteria are systematically evaluated: PVS1, PS1–4, PM1–6, PP1–5, BA1, BS1–4, and BP1–7.

Classification is strictly rule-based. No AI model determines variant pathogenicity. Each variant receives one of five standard classifications (Pathogenic, Likely Pathogenic, Variant of Uncertain Significance (VUS), Likely Benign, or Benign), with an explicit listing of every criterion applied.

ACMG/AMP 2015 guidelines · Rule-based · 28 evidence criteria · 5-tier classification · Explicit criteria listing

Output: Classified variants with complete ACMG criteria audit trail
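
To illustrate what rule-based classification means in practice, the sketch below encodes the combining rules from the 2015 ACMG/AMP guidelines as plain boolean logic. It is a simplified example, not Helena's classifier: the function name is hypothetical, each criterion is counted at its default strength (no strength modifications), and deciding which criteria apply to a variant is assumed to have happened upstream.

```python
from collections import Counter

KNOWN_PREFIXES = {"PVS", "PS", "PM", "PP", "BA", "BS", "BP"}


def combine_criteria(criteria):
    """Map a list of met criteria (e.g. ["PVS1", "PM2"]) to a five-tier class."""
    counts = Counter()
    for code in set(criteria):             # ignore accidental duplicates
        prefix = code.rstrip("0123456789")  # "PVS1" -> "PVS"
        if prefix not in KNOWN_PREFIXES:
            raise ValueError(f"Unknown criterion: {code}")
        counts[prefix] += 1
    pvs, ps, pm, pp = counts["PVS"], counts["PS"], counts["PM"], counts["PP"]
    ba, bs, bp = counts["BA"], counts["BS"], counts["BP"]

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm == 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Conflicting pathogenic and benign evidence defaults to uncertain significance.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "Variant of Uncertain Significance"
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely Pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely Benign"
    return "Variant of Uncertain Significance"


print(combine_criteria(["PVS1", "PS3"]))  # Pathogenic
print(combine_criteria(["PS1", "PM2"]))   # Likely Pathogenic
print(combine_criteria(["PM2"]))          # Variant of Uncertain Significance
```

Because the result depends only on the list of criteria, storing that list next to the label is enough to reproduce and audit the classification.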

Phenotype-Genotype Correlation

Patient phenotype, described using Human Phenotype Ontology (HPO) terms, is systematically compared against the known phenotypic profiles of genes carrying candidate variants. The HPO ontology hierarchy enables semantic similarity analysis that accounts for term specificity and information content, not just exact matches.

Each gene receives a normalized relevance score (0–100) with tiered clinical classification. For example, a Pathogenic BRCA1 variant is not flagged as clinically relevant when the patient was referred for epilepsy. Phenotype matching connects technical classification to clinical relevance for the specific patient.

HPO ontology · Semantic similarity · Information content · Normalized scoring · Tiered relevance

Output: Ranked gene list prioritized by phenotype match strength for this patient
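
One common way to combine ontology hierarchy and information content is Resnik similarity with a best-match average, sketched below on a toy ontology. The source does not say which similarity measure Helena uses, so treat this as one plausible instance. The term names, the parent links, and the annotation frequencies are invented for the example and are not real HPO identifiers or statistics.

```python
import math

# Toy ontology: term -> list of parent terms (hypothetical, not real HPO IDs).
PARENTS = {
    "root": [],
    "neuro": ["root"],
    "seizure": ["neuro"],
    "focal_seizure": ["seizure"],
    "neoplasm": ["root"],
    "breast_neoplasm": ["neoplasm"],
}

# Hypothetical fraction of annotated genes carrying each term or a descendant.
FREQUENCY = {
    "root": 1.0,
    "neuro": 0.4,
    "seizure": 0.1,
    "focal_seizure": 0.02,
    "neoplasm": 0.3,
    "breast_neoplasm": 0.05,
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """Rarer (more specific) terms carry more information."""
    return -math.log(FREQUENCY[term])


def resnik(a, b):
    """Information content of the most informative common ancestor."""
    return max(information_content(t) for t in ancestors(a) & ancestors(b))


def relevance_score(patient_terms, gene_terms):
    """Best-match average, normalized to 0-100 against a perfect self-match."""
    if not patient_terms or not gene_terms:
        return 0.0
    best = sum(max(resnik(p, g) for g in gene_terms) for p in patient_terms)
    ceiling = sum(information_content(p) for p in patient_terms)
    return 100.0 * best / ceiling if ceiling > 0 else 0.0


print(round(relevance_score(["focal_seizure"], ["seizure"]), 1))
print(round(relevance_score(["focal_seizure"], ["breast_neoplasm"]), 1))
```

A gene annotated with the broader term "seizure" still earns partial credit for a patient with "focal_seizure", while a gene annotated only with an unrelated branch scores zero because the only shared ancestor is the uninformative root.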

Literature Evidence

A locally maintained, genetics-filtered database of biomedical literature provides sub-second clinical queries across millions of PubMed publications. Publications are pre-processed with extracted gene mentions, variant mentions, and phenotype associations, enabling instant, targeted evidence retrieval for any variant or gene in the analysis.

Multi-component relevance scoring ranks publications by clinical utility for the specific case. Every literature citation includes its PubMed identifier (PMID), DOI, and extracted evidence context, fully traceable back to the original publication.

Local PubMed database · Pre-extracted entities · Relevance scoring · PMID/DOI tracking · Sub-second queries

Output: Ranked literature evidence with traceable citations per gene and variant

Clinical Screening & Prioritization

After classification, annotation, phenotype matching, and literature review, a multi-dimensional prioritization algorithm ranks variants by overall clinical relevance. Scoring adapts to the clinical context: patient age, sex, family history, and indication for genetic testing.

The system supports multiple screening strategies including neonatal intensive care, pediatric genetics, adult diagnostic workup, proactive screening, and carrier testing. The output is a tiered shortlist: Tier 1 (actionable findings requiring immediate clinical attention), Tier 2 (potentially actionable, warranting further review), with incidental findings identified and flagged separately.

Context-aware scoring · Age/sex adaptation · Neonatal · Pediatric · Adult diagnostic · Carrier screening · Tiered output

Output: Focused shortlist of clinically actionable variants from hundreds of candidates

Family and Trio Analysis (Optional)

When family members are sequenced alongside the proband, Helena adds inheritance-aware evidence on top of the upstream classification. Three algorithms run sequentially on pre-classified data: de novo detection with confidence tiers, compound heterozygous phasing with parental origin determination, and segregation scoring per the ClinGen SVI 2021 framework. Sample QC via PLINK identity-by-descent runs first to detect sample-swap, consanguinity, and duplicate-sample issues before any inheritance call.

Variants are not re-called. The service consumes the existing classified DuckDB files and joins them on chromosome, position, and allele. A typical WGS trio completes the full inheritance analysis in approximately 30 to 90 seconds, with explicit feasibility flags recording which analysis phases were feasible and why.

Trio, duo, sibling support · De novo detection · Compound het phasing · ClinGen SVI 2021 segregation · PLINK sample QC · Feasibility flags

Output: Inheritance-annotated variants with de novo, compound het, and segregation evidence
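
The core genotype logic behind de novo detection and compound heterozygous phasing can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Helena's algorithms: genotypes are represented as alternate-allele counts keyed by (chromosome, position, ref, alt), a variant missing from a parent is treated as homozygous reference, and confidence tiers and segregation scoring are left out. All function names are hypothetical.

```python
def is_de_novo(key, proband, mother, father):
    """Proband carries the allele and neither parent does.

    Assumption: a key absent from a parent's calls means homozygous reference.
    A real pipeline would also need to distinguish no-calls and low coverage.
    """
    return (
        proband.get(key, 0) >= 1
        and mother.get(key, 0) == 0
        and father.get(key, 0) == 0
    )


def parental_origin(key, mother, father):
    """Return 'maternal' or 'paternal' when exactly one parent carries the allele."""
    in_mother = mother.get(key, 0) > 0
    in_father = father.get(key, 0) > 0
    if in_mother and not in_father:
        return "maternal"
    if in_father and not in_mother:
        return "paternal"
    return None  # de novo or carried by both parents: origin not determined


def compound_het_pairs(gene_variants, proband, mother, father):
    """Pairs of heterozygous variants in one gene inherited from different parents."""
    hets = [k for k in gene_variants if proband.get(k, 0) == 1]
    maternal = [k for k in hets if parental_origin(k, mother, father) == "maternal"]
    paternal = [k for k in hets if parental_origin(k, mother, father) == "paternal"]
    return [(m, p) for m in maternal for p in paternal]


# Three variants in the same gene, keyed by (chromosome, position, ref, alt).
v1, v2, v3 = ("1", 100, "A", "G"), ("1", 250, "C", "T"), ("1", 400, "G", "A")
proband = {v1: 1, v2: 1, v3: 1}
mother = {v1: 1}
father = {v2: 1}

print(is_de_novo(v3, proband, mother, father))
print(compound_het_pairs([v1, v2, v3], proband, mother, father))
```

Keying every sample on the same tuple mirrors the join on chromosome, position, and allele described above, which is what lets the analysis run on already classified data.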

Cohort Analytics (Optional)

For research-grade population-level work, Helena aggregates classified samples into a unified cohort matrix with a deduplicated variant catalog and sparse genotype storage. Six statistical analyses share the matrix: gene-level burden testing with Fisher, CMC, and SKAT-O methods plus FDR correction; pathway enrichment with proper background correction; pLoF analysis; cohort versus gnomAD frequency analysis; GWAS signal replication; and polygenic risk scoring with PGS Catalog weight files.

A weighted candidate gene nomination engine integrates evidence across all six analyses and produces a ranked list with per-component breakdowns and human-readable evidence summaries. Power analysis is reported per gene so non-significant results in underpowered genes are interpretable.

Cohort matrix · Burden testing · Pathway enrichment · pLoF analysis · GWAS replication · Polygenic scores · Candidate ranking

Output: Ranked candidate genes with statistical evidence and per-component scoring
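
As an illustration of the simplest burden test named above, the sketch below runs a one-sided Fisher exact test on per-gene carrier counts and then applies Benjamini-Hochberg FDR correction. It uses only the standard library, covers neither CMC nor SKAT-O, and assumes carrier counts per gene have already been extracted from the cohort matrix. The gene names and counts are invented.

```python
from math import comb


def fisher_one_sided(case_carriers, case_total, control_carriers, control_total):
    """P(at least this many carriers among cases) under the hypergeometric null."""
    carriers = case_carriers + control_carriers
    total = case_total + control_total
    upper = min(carriers, case_total)
    favourable = sum(
        comb(carriers, k) * comb(total - carriers, case_total - k)
        for k in range(case_carriers, upper + 1)
    )
    return favourable / comb(total, case_total)


def benjamini_hochberg(pvalues):
    """Return FDR-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: (case carriers, cases, control carriers, controls).
gene_counts = {
    "GENE_A": (12, 200, 3, 400),
    "GENE_B": (5, 200, 9, 400),
    "GENE_C": (2, 200, 4, 400),
}
genes = list(gene_counts)
pvalues = [fisher_one_sided(*gene_counts[g]) for g in genes]
for gene, p, q in zip(genes, pvalues, benjamini_hochberg(pvalues)):
    print(f"{gene}: p={p:.4g} q={q:.4g}")
```

The test is one-sided because a burden analysis usually asks whether cases carry more qualifying variants than controls; a two-sided variant would sum over both tails.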

Mitochondrial DNA Analysis

Mitochondrial DNA biology differs fundamentally from the nuclear genome. Maternal inheritance, heteroplasmy with tissue-specific threshold effects, lack of recombination, and haplogroup structure all change how variants should be classified. Helena routes mitochondrial variants through a dedicated classifier following the McCormick 2020 specifications produced by the ClinGen Mitochondrial Disease Variant Curation Expert Panel.

Twenty ACMG criteria are applied with mtDNA-specific thresholds and tools, including APOGEE2 for protein-coding genes, MitoTIP and HmtVAR for tRNA, and MITOMAP and HmtDB for population frequency. Seven criteria are explicitly excluded with verbatim biological rationale. Haplogroup-aware BA1 and NUMT pseudogene detection prevent the most common false-positive patterns.

MMDWG 2020 · ClinGen Expert Panel · Heteroplasmy aware · Haplogroup aware · NUMT detection · APOGEE2 / MitoTIP / HmtVAR

Output: mtDNA variants classified per the ClinGen Mitochondrial Expert Panel framework

AI-Powered Clinical Interpretation

An AI model synthesizes all upstream evidence (ACMG classifications, phenotype correlations, literature findings, and screening results) into a structured clinical narrative. The AI does not classify variants; classification is rule-based and happens in Step 3. The AI integrates, summarizes, and presents evidence in a format ready for clinical review.

Interpretation depth adapts dynamically to the available data: from a basic variant summary (classification only) to a comprehensive diagnostic synthesis (classification, phenotype, literature, and screening combined). Reports are generated in PDF and DOCX formats with structured sections and complete evidence attribution. All AI inference runs on dedicated EU infrastructure; no data is sent to external AI services.

Evidence synthesis · Adaptive depth · PDF/DOCX reports · On-premise AI · EU data residency

Output: Downloadable clinical interpretation report with structured evidence and recommendations

Built for Clinical Trust

Every design decision in Helena prioritizes transparency, auditability, and geneticist authority over black-box convenience.

Rule-Based Classification

Variant pathogenicity is determined by ACMG/AMP criteria applied through systematic rules, not by AI prediction. The AI assists with evidence gathering and presentation, never with classification decisions.

Complete Evidence Trail

Every classification links to the specific ACMG criteria applied, every literature reference to its PMID, every phenotype score to its HPO terms. Nothing is a black box.

Reproducible Results

The same VCF input with the same clinical profile produces the same classification output. Rule-based processing ensures deterministic, auditable results across runs.

Geneticist Authority

Helena is a clinical decision support tool. It gathers evidence, applies guidelines, and presents findings. The geneticist reviews, validates, and makes the clinical decision.

See the Pipeline in Action

Request a demo to see how Helena processes a real genome, from VCF upload to clinical report.
