Hematology and Hemostasis | NHLBI Trans-Omics for Precision Medicine

Unveiling the Hidden Rules: Enhancing NMD Prediction for Protein-Truncating Variants

Submitted by	Iman Egab
Authors	Iman Egab1, Peter Orchard2, Jennifer E. Posey3, Richard A. Gibbs3,4, Claudia M.B. Carvalho3,5, James R. Lupski3,4,5, Chad A. Shaw3,6, Stephen (Song) Yi7, Luisa Mestroni8,9, Matthew Taylor8,9, Eric Boerwinkle10, Alexander P. Reiner11, Paul de Vries1, Alanna C. Morrison1, Sujatha Jagannathan12,13, Zeynep Coban Akdemir1,3, NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium
Name and Date of Professional Meeting	ASHG 2024 (November 5-9, 2024)
Associated paper proposal(s)	Systematic profiling and characterization of transcripts that are associated with quantitative blood disease traits through putative gain-of-function variants
Working Group(s)	Hematology and Hemostasis
Abstract Text	Protein-truncating variants (PTVs) introduce premature termination codons (PTCs) into mRNA (PTC-variants). Typically, these PTC-variant-bearing transcripts are degraded by nonsense-mediated decay (NMD), resulting in loss-of-function (LoF) alleles. However, some PTC-variant-bearing transcripts escape NMD. These transcripts may lead to truncated or altered proteins with potential LoF, gain-of-function (GoF), or neutral effects. Accurately predicting whether a PTC-variant will undergo NMD remains challenging. Existing models, such as the canonical rule based on the exon junction complex (EJC), explain only about 50% of NMD outcomes for PTC-variants, highlighting the need for additional rules, possibly incorporating alternative signals such as the EJC-independent model of NMD, to improve prediction accuracy. We propose identifying novel NMD prediction rules by analyzing large-scale RNA sequencing (RNA-Seq) and whole genome sequencing (WGS) datasets. We extracted ~13K high-quality heterozygous germline PTC-variants from 7,617 samples available from the NHLBI Trans-Omics for Precision Medicine (TOPMed) freeze 1.1 RNA-Seq dataset coupled with WGS and mapped them to the GENCODE v26 canonical transcript set. For NMD-triggering variants, the expected proportion of wild-type allele expression to total expression is ~1, while for NMD-escape variants it should be ~0.5. We quantified the NMD efficiency of these variants based on their extracted allele-specific expression (ASE) values and analyzed these values against known and potential novel NMD rules. These rules included the canonical rule of NMD, 3'UTR length, PTC distance to the start codon, stop codon, and nearest exon junction, and RNA-binding motifs near PTC-variants, near potential EJCs, and within 3'UTR sequences. Our analysis confirmed the canonical rule, showing higher NMD efficiency for PTC-variants upstream of the last EJC (p-value=4.4e-12).
Other significant rules included increased NMD efficiency for PTC-variants with longer 3'UTRs (p-value=1.9e-3) and decreased efficiency for PTC-variants near the start codon (first 200 bp) (p-value=1.6e-4) and within the first 31% of a transcript (p-value=8.7e-5). PTC-variants with three downstream exon-exon junctions (p-value=7e-4) had increased NMD efficiency, whereas those further from the nearest exon-exon junction (≥162 bp) had decreased efficiency (p-value=7.2e-3). Using these rules, we built a random forest classification model to predict NMD outcomes for PTC-variants, training it on high-quality germline PTC-variants from ~80% of the samples (N=6,094) in the dataset. We tested the model on the remaining samples (N=1,523), achieving ~71% prediction accuracy. With this model, we classified ~42.6% of the 165.3K high-quality PTC-variants available from the 160.8K samples in the TOPMed freeze 9 dataset as NMD-escape and the remaining 57.4% as NMD-triggering. In sum, identifying novel NMD prediction rules and developing a more accurate NMD prediction model can better inform genetic and clinical research, potentially leading to more precise diagnosis and treatment strategies for diseases caused by such PTC-variants.
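
As an illustrative sketch (not the authors' pipeline), the wild-type allele proportion and the positional rules described in this abstract could be computed roughly as follows. The function and field names, the use of transcript coordinates, and the treatment of every exon-exon junction as a potential EJC site are assumptions made for this example; only the thresholds (200 bp, 31%, 162 bp) come from the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List


def wild_type_fraction(ref_reads: int, alt_reads: int) -> float:
    """Proportion of wild-type (reference) allele reads among all reads.

    Per the abstract, values near 1 suggest the PTC allele was degraded
    (NMD-triggering) and values near 0.5 suggest NMD escape.
    """
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this variant")
    return ref_reads / total


@dataclass
class PtcContext:
    # All positions are hypothetical transcript coordinates in bp.
    ptc_pos: int            # position of the premature termination codon
    cds_start: int          # position of the start codon
    transcript_length: int  # total transcript length
    junctions: List[int]    # positions of exon-exon junctions


def rule_features(ctx: PtcContext) -> Dict[str, object]:
    """Encode the positional rules from the abstract as simple features."""
    downstream = [j for j in ctx.junctions if j > ctx.ptc_pos]
    distances = [abs(j - ctx.ptc_pos) for j in ctx.junctions]
    nearest = min(distances) if distances else None
    return {
        # Assumption: "upstream of the last EJC" is approximated as having
        # at least one exon-exon junction downstream of the PTC.
        "upstream_of_last_junction": len(downstream) > 0,
        # Assumption: "first 200 bp" is measured from the start codon.
        "near_start_codon": ctx.ptc_pos - ctx.cds_start < 200,
        # Assumption: "first 31%" is a fraction of transcript length.
        "in_first_31pct": ctx.ptc_pos / ctx.transcript_length < 0.31,
        "n_downstream_junctions": len(downstream),
        "far_from_nearest_junction": nearest is not None and nearest >= 162,
    }
```

Features of this kind could then be passed to any classifier; the abstract reports using a random forest, which is not reproduced here.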

One-hour monthly recurring call of the whole TOPMed Hematology and Hemostasis Working Group scheduled for 4th Thursdays, 10-11am Pacific / 12-1pm Central / 1-2pm Eastern.

Whole genome sequencing interaction study identifies loci with sex-specific effects underlying platelet aggregation traits

Submitted by	Ruczinski, Ingo
Authors	Julius S. Ngwa, Ming-Huei Chen, Lisa R. Yanek, Brady Gaynor, Kathy Ryan, Kanika Kanchan, Kai Kammers, Lew Becker, Margaret Taub, Joshua P. Lewis, Andrew D. Johnson, Nauder Faraday, Rasika Mathias, Ingo Ruczinski
Name and Date of Professional Meeting	ASHG 2023 (November 2, 2023)
Associated paper proposal(s)	Leveraging Whole Genome Sequencing data in TOPMed to refine prior GWAS loci and further identify new/novel loci that determine platelet aggregation.
Working Group(s)	Hematology and Hemostasis
Abstract Text	Platelets play a key role in both hemostasis and thrombosis, and in inflammation, atherogenesis, and cancer metastasis. Whole-genome sequencing (WGS) based association studies have successfully identified several single nucleotide polymorphisms (SNPs) associated with platelet aggregation phenotypes. However, despite strong differences in platelet aggregation between males and females, a comprehensive analysis to elucidate sex-specific genetic effects has not been conducted. In this study, we used sequencing data available through the NHLBI's Trans-Omics for Precision Medicine (TOPMed) program to analyze 19 harmonized platelet traits, measured in response to ADP, epinephrine, or collagen, in three family-based studies: the AMISH cohort (n=255 participants), GeneSTAR (n=857 participants of European and n=683 participants of African ancestry) and the Framingham Heart Study (n=1476 participants). Linear mixed effects models with an interaction term for bi-allelic SNPs and sex, as implemented in the GENESIS Bioconductor package, were fitted separately for each study using inverse-normalized, age- and sex-adjusted residuals of the platelet traits as dependent variables, followed by a fixed-effects inverse-variance weighted meta-analysis using METAL. In the primary analysis, using a 1 degree of freedom test for the SNP-by-sex interaction, we identified ten SNPs on chromosome 10 below the conventional genome-wide significance threshold of 5×10⁻⁸ for two epinephrine aggregation phenotypes. The chromosome 10 peak SNP (rs116725046, p=5.2×10⁻⁹), located in the long intergenic non-protein coding RNA gene LINC00702, had a mean minor allele frequency (MAF) of 1.5% across all studies. An association of the closest protein-coding gene, KLF6, with platelet phenotypes was previously reported in the GWAS Catalog; however, an in silico functional analysis based on RNA expression and chromatin accessibility implicates LINC00702 as the regulatory element.
In addition, several loci on chromosomes 2, 3, 7 and 11 also showed significant interactions with sex in the genome-wide meta-analysis for epinephrine and collagen. A more targeted analysis also showed sex-specific effects for a variant (rs17081713) in the Estrogen Receptor 1 (ESR1) gene on chromosome 6 for ADP aggregation traits.
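
For readers unfamiliar with the fixed-effects inverse-variance weighted meta-analysis mentioned in this abstract, the following is a minimal sketch of the standard calculation for one SNP-by-sex interaction term. It is not METAL itself; the function name and the example numbers are hypothetical, and the p-value assumes a normally distributed test statistic.

```python
import math
from typing import Sequence, Tuple


def ivw_meta(betas: Sequence[float], ses: Sequence[float]) -> Tuple[float, float, float]:
    """Fixed-effects inverse-variance weighted meta-analysis.

    betas: per-study effect estimates (e.g. SNP-by-sex interaction effects)
    ses:   per-study standard errors of those estimates
    Returns (pooled_beta, pooled_se, two_sided_p).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, p


# Hypothetical per-study estimates for a single variant (not real results).
beta, se, p = ivw_meta([0.30, 0.22, 0.41], [0.12, 0.10, 0.15])
```

Studies with smaller standard errors contribute more to the pooled estimate, which is why this scheme suits cohorts of quite different sizes.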

Rare variants affecting telomere length and disease identified through multi-omic modeling

Submitted by	Keener, Rebecca
Authors	Rebecca Keener, Taibo Li, François Aguet, Kristin Ardlie, Jerome Rotter, Steven Rich, and Alexis Battle on behalf of the NHLBI TOPMed Consortium *authors contributed equally to this work
Name and Date of Professional Meeting	ASHG Conference (November 5-9, 2023)
Associated paper proposal(s)	Assessing genetic determinants of telomere length leveraging whole genome sequence data in the NHLBI Trans-Omics for Precision Medicine Program – breakout proposal.
Working Group(s)	Hematology and Hemostasis
Abstract Text	Telomeres protect the ends of linear chromosomes and telomere length (TL) decreases as humans age. Individuals with extremely short telomeres present with Short Telomere Syndromes (STS). To gain insight into the genetic regulation of TL, prior work from our group and others leveraged genome-wide association studies to examine the role of common genetic variation in TL genetics. This strategy successfully identified novel genes involved in TL regulation, some of which we experimentally validated in cell culture. However, this approach ignores the effects of rare genetic variation, which can have larger effect sizes and impact genes under strong constraint. Studies of rare variant effects on TL have improved our understanding of TL biology, but have largely required laborious single-pedigree studies of STS patients. We sought to leverage TL estimates and rare variant data from the Trans-Omics for Precision Medicine (TOPMed) Program to broadly examine the impact of rare variation on TL. Previously we developed Watershed, a Bayesian hierarchical model that uses paired multi-omic data (whole genome sequencing, expression, splicing, methylation, and/or protein levels) to prioritize rare variants causing significant disruption of at least one of the molecular signals. We used multi-omic data and TL estimates from the MESA cohort to train the Watershed model and observed that, in individuals with extreme TL, it prioritized a rare variant affecting expression of TPP1, a known TL regulation gene. We will expand our analysis across TOPMed and incorporate multi-omic data where available. Examination of highly weighted variants in individuals with extreme TL (extremely long or extremely short) relative to average TL will potentially identify novel genes involved in TL regulation.
TOPMed has TL estimates for people aged 0-98 years; we will leverage the scale of TOPMed to examine the interplay between TL genetic regulation and multi-omic signals over age. Further, we will apply our model to multi-omic data from STS patients to improve their genetic diagnosis. Together, this work has utility in improving diagnosis of individuals with disease caused by extreme TL and furthering our understanding of the molecular mechanisms governing TL regulation.

Multi-ancestry genome-wide analysis of circulating D-dimer

Submitted by	Nicholas, Jayna
Authors	Jayna C. Nicholas, Michael Brown, Jennifer E. Huffman, Maria Sabater-Lleal, Paul S. de Vries, Nicholas L. Smith, Alanna C. Morrison, Laura M. Raffield on behalf of the CHARGE Hemostasis Working Group
Name and Date of Professional Meeting	Spring 2023 CHARGE Meeting (May 9-11)
Associated paper proposal(s)	(Revised) Whole genome sequence analysis of D-dimer across TOPMed studies
Working Group(s)	Hematology and Hemostasis
Abstract Text	D-dimer is a peptide product of fibrinolysis and a clinical biomarker of diseases involving activated coagulation, including venous thromboembolism and ischemic cardiovascular disease. Circulating D-dimer levels are heritable and heterogeneous across populations, with, for example, higher levels observed in African (AFR) ancestry relative to European (EUR) ancestry populations. To date, our understanding of genetic contributors to D-dimer variation has been limited to European ancestry populations. Here, we performed the largest, most diverse genome-wide analysis of circulating D-dimer [total n=46,031: 36,688 EUR, 7,397 AFR, 1,321 Hispanic, 654 East Asian, 7 American Indian/Alaska Native, 54 unknown]. We performed both single variant and gene-based aggregate rare variant tests in whole genome sequences (WGS) from 14,334 participants within the Trans-Omics for Precision Medicine (TOPMed) program. For single variant tests, we then meta-analyzed TOPMed-imputed genotype results from 31,697 participants within the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) consortium using a fixed-effects, p-value based method. Single-variant analysis results revealed 6 known genetic loci associated (P<5E-9) with D-dimer, including the AFR-driven HBB sickle-cell locus, and 1 new signal on chromosome 20. Notably, the lead variant at this signal is a common (EAF=10%) intergenic variant in near-perfect correlation (r²>0.99) with rs867186, a missense variant in PROC, and with several lead GWAS variants for coagulation factors and thrombotic disease. Gene-based aggregate tests implicate FGL1 missense variants (P=7.46E-06) in D-dimer regulation. Together, these loci provide new targets for functional work to disentangle mechanisms regulating fibrin production and fibrinolysis.
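
The abstract does not specify which fixed-effects, p-value based method was used. One common choice is the sample-size weighted Z-score (Stouffer) combination, sketched below under that assumption; the function name and all input values are hypothetical and are not results from this study.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def sample_size_weighted_z(
    pvalues: Sequence[float],
    effect_signs: Sequence[int],
    sample_sizes: Sequence[int],
) -> Tuple[float, float]:
    """Combine two-sided p-values across studies with sqrt(N) weights.

    pvalues:      per-study two-sided p-values (0 < p <= 1)
    effect_signs: +1 or -1, the direction of effect for a common allele
    sample_sizes: per-study sample sizes
    Returns (combined_z, combined_two_sided_p).
    """
    if not (len(pvalues) == len(effect_signs) == len(sample_sizes)) or not pvalues:
        raise ValueError("inputs must be non-empty and of equal length")
    normal = NormalDist()
    # Convert each two-sided p-value to a signed Z-score.
    z_scores = [-sign * normal.inv_cdf(p / 2.0) for p, sign in zip(pvalues, effect_signs)]
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    combined_z = numerator / denominator
    combined_p = math.erfc(abs(combined_z) / math.sqrt(2.0))
    return combined_z, combined_p


# Hypothetical inputs for one variant in two cohorts (illustration only).
z, p = sample_size_weighted_z([1e-4, 3e-3], [+1, +1], [10000, 20000])
```

Because only p-values, effect directions, and sample sizes are needed, this approach can combine studies whose effect estimates are not on a shared scale.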