Authors
Iman Egab1, Peter Orchard2, Jennifer E. Posey3, Richard A. Gibbs3,4, Claudia M.B. Carvalho3,5, James R. Lupski3,4,5, Chad A. Shaw3,6, Stephen (Song) Yi7, Luisa Mestroni8,9, Matthew Taylor8,9, Eric Boerwinkle10, Alexander P. Reiner11, Paul de Vries1, Alanna C. Morrison1, Sujatha Jagannathan12,13, Zeynep Coban Akdemir1,3
NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium
Abstract Text
Protein-truncating variants (PTVs) introduce premature termination codons (PTCs) into mRNA (PTC-variants). Typically, PTC-variant-bearing transcripts are degraded by nonsense-mediated decay (NMD), resulting in loss-of-function (LoF) alleles. However, some PTC-variant-bearing transcripts escape NMD and may produce truncated or altered proteins with potential LoF, gain-of-function (GoF), or neutral effects. Accurately predicting whether a PTC-variant will undergo NMD remains challenging. Existing models, such as the canonical rule based on the exon junction complex (EJC), explain only about 50% of NMD outcomes for PTC-variants, highlighting the need for additional rules, possibly incorporating alternative signals such as the EJC-independent model of NMD, to improve prediction accuracy.
We propose identifying novel NMD prediction rules by analyzing large-scale RNA sequencing (RNA-Seq) and whole genome sequencing (WGS) datasets. We extracted ~13K high-quality heterozygous germline PTC-variants from 7,617 samples available from the NHLBI Trans-Omics for Precision Medicine (TOPMed) freeze 1.1 RNA-Seq dataset coupled with WGS and mapped them to the GENCODE v26 canonical transcript set. For NMD-triggering variants, the expected proportion of wild-type allele expression to total expression is ~1, while for NMD-escape variants, it should be ~0.5. We quantified the NMD efficiency of these variants based on their extracted allele-specific expression (ASE) values and analyzed these values against known and potential novel NMD rules. These rules included the canonical rule of NMD; 3'UTR length; PTC distance to the start codon, the stop codon, and the nearest exon junction; and RNA-binding motifs near PTC-variants, near potential EJCs, and within 3'UTR sequences.
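The ASE reasoning above can be illustrated with a minimal sketch. The function names and the linear rescaling of the wild-type fraction from [0.5, 1] to [0, 1] are assumptions made for illustration; the abstract does not specify how NMD efficiency was computed from ASE values.

```python
def wild_type_fraction(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a heterozygous site that carry the wild-type allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return ref_reads / total


def nmd_efficiency(ref_reads: int, alt_reads: int) -> float:
    """Hypothetical efficiency score (an assumption, not the authors' metric).

    Maps a wild-type fraction of 0.5 (NMD escape) to 0.0 and a fraction
    of 1.0 (complete NMD) to 1.0, clipping values outside that range.
    """
    frac = wild_type_fraction(ref_reads, alt_reads)
    return min(1.0, max(0.0, 2.0 * frac - 1.0))
```

Under this sketch, a site with equal wild-type and PTC-allele read counts scores 0 (escape-like), and a site where only wild-type reads are observed scores 1 (NMD-triggering).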
Our analysis confirmed the canonical rule, showing higher NMD efficiency for PTC-variants upstream of the last EJC (p-value=4.4e-12). Other significant rules included increased NMD efficiency for PTC-variants with longer 3'UTRs (p-value=1.9e-3) and decreased efficiency for PTC-variants near the start codon (first 200 bp) (p-value=1.6e-4) and within the first 31% of a transcript (p-value=8.7e-5). PTC-variants with three downstream exon-exon junctions had increased NMD efficiency (p-value=7e-4), whereas those farther from the nearest exon-exon junction (≥162 bp) had decreased efficiency (p-value=7.2e-3). Using these rules, we built a random forest classification model to predict NMD outcomes for PTC-variants, trained on high-quality germline PTC-variants from ~80% of the samples (N=6,094) in the dataset. We tested the model on the remaining samples (N=1,523), achieving ~71% prediction accuracy. With this model, we classified ~42.6% of the 165.3K high-quality PTC-variants available from the 160.8K samples in the TOPMed freeze 9 dataset as NMD-escape and the remaining 57.4% as NMD-triggering.
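As a rough sketch of how the rules above might be encoded as classifier inputs, the snippet below turns per-variant annotations into features using the thresholds reported here (200 bp, 31%, three junctions, 162 bp). The `PtcVariant` fields, the feature names, and the use of "at least one downstream exon-exon junction" as a stand-in for "upstream of the last EJC" are all illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class PtcVariant:
    """Hypothetical per-variant annotations (field names are assumptions)."""
    dist_to_start_bp: int             # PTC distance from the start codon
    relative_position: float          # PTC position as a fraction of the transcript
    downstream_junctions: int         # exon-exon junctions downstream of the PTC
    dist_to_nearest_junction_bp: int  # distance to the nearest exon-exon junction
    utr3_length_bp: int               # length of the 3'UTR


def rule_features(v: PtcVariant) -> dict:
    """Encode the reported rules as features for a downstream classifier."""
    return {
        # Simplified proxy for the canonical rule (assumption).
        "upstream_of_last_junction": v.downstream_junctions >= 1,
        "near_start_codon": v.dist_to_start_bp <= 200,
        "in_first_31_percent": v.relative_position <= 0.31,
        "three_downstream_junctions": v.downstream_junctions == 3,
        "far_from_junction": v.dist_to_nearest_junction_bp >= 162,
        "utr3_length_bp": v.utr3_length_bp,
    }
```

A feature dictionary of this shape could then be passed to any standard random forest implementation; the model hyperparameters are not given in the abstract and are not assumed here.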
In sum, identifying novel NMD prediction rules and developing a more accurate NMD prediction model can better inform genetic and clinical research, potentially leading to more precise diagnosis and treatment strategies for diseases caused by such PTC-variants.