
Analysis

StocSum: a reference-panel-free summary statistics framework for diverse populations

Authors
Han Chen, Nannan Wang, Bing Yu, Goo Jun, Qibin Qi, Ramon A. Durazo-Arvizu, Sara Lindstrom, Alanna C. Morrison, Robert C. Kaplan, Eric Boerwinkle
Name and Date of Professional Meeting
IGES 33rd Annual Meeting (November 3-4, 2024)
Associated paper proposal(s)
Working Group(s)
Abstract Text
Genomic summary statistics have been widely used to address various scientific questions in genetic and genomic research. Applications that involve multiple genetic variants, such as conditional analysis, variant set and gene-based tests, heritability and genetic correlation estimation, also require correlation or linkage disequilibrium (LD) information between genetic variants, often obtained from an external reference panel. While these methods usually have good performance for common variants in populations of only European ancestry, in practice, it is usually difficult to find external reference panels that accurately represent the LD structure for isolated, underrepresented or admixed populations, as well as rare genetic variants from whole genome sequencing (WGS) studies, limiting their applications to European populations. To maximize the applicability of summary statistics-based methods and make them equally beneficial to all human populations, we have developed StocSum, a novel reference-panel-free statistical framework for generating, managing, and analyzing stochastic summary statistics using random matrix algorithms. Using two cohorts from the Trans-Omics for Precision Medicine Program, we demonstrate the accuracy of StocSum-based LD measures as compared to those directly computed from individual-level genotype data, in European-, African-, and Hispanic/Latino-Americans. We also show that for admixed populations such as African- and Hispanic/Latino-Americans, LD measures computed from external reference panels perform much worse, even if all ancestry populations are included in those reference panels. As a reference-panel-free framework, StocSum will facilitate sharing and utilization of genomic summary statistics from WGS studies, especially for isolated, underrepresented and admixed populations.

Genetic and phenotypic association analyses of cardiometabolic traits in diverse African samples with whole-genome sequencing data

Authors
Daniel Hui*, Matt Hansen*, Daniel Harris, Michael McQuillan, Dan Ju, Alexander Platt, William Beggs, Sunungouko Wata Mpoloka, Gaonyadiwe George Mokone, Gurja Belay, Thomas Nyambo, Stephen Chanock, Meredith Yeager, TOPMed Consortium, Giorgio Sirugo, Marylyn D. Ritchie, Scott Williams, Sarah A. Tishkoff
Name and Date of Professional Meeting
American Society of Human Genetics, November 2023
Associated paper proposal(s)
Working Group(s)
Abstract Text
African populations demonstrate exceptional genetic and phenotypic diversity, due in part to their varied environments, lifestyles, and demographic history. We conducted genetic and phenotypic association analyses in 6,965 geographically and ethnically diverse Sub-Saharan African individuals (6,280 with whole-genome sequences from the NIH TOPMed consortium and 685 with genotypes from Illumina arrays), using 15 cardiometabolic phenotypes (range 686-6,854 individuals/trait). Each phenotype had at least one ethnicity with significantly differing mean values compared to the remaining cohort, such as short stature in the Baka rainforest hunter-gatherers of Cameroon, and high adiposity in the Herero pastoralists of Botswana. An analysis of ethnicity-sex interactions revealed several ethnic groups with significant sexual dimorphism for at least one cardiometabolic phenotype, such as Herero women having markedly higher body mass index than men. Comparison between the African cohort and African ancestry UK Biobank (UKBB) individuals showed the latter have higher mean values than any of the 53 African ethnic groups for multiple cardiometabolic measurements, including low density lipoprotein cholesterol (LDL), body fat percentage (BFP), and systolic blood pressure. We also found that phenotype-phenotype correlations differ between the UKBB and African cohort, as well as between African ethnicities. For example, BFP and LDL had low correlation in the UKBB (R = 0.04) but showed a range of correlation among African groups, from R = 0.00 in the Maasai pastoralists of eastern Africa to R = 0.43 in the Agaw agriculturalists of Ethiopia. Genome-wide association analyses identified 76 significantly associated loci (p < 5.0 × 10^-8), with 14 passing a more stringent empirical threshold (p < 3.0 × 10^-9), including APOE and APOC1 loci for various blood lipids, PCSK9 for LDL, and CETP for high density lipoprotein cholesterol (HDL), as well as novel loci.
Set-based rare variant analyses for loss-of-function variants found 12 gene-phenotype associations, replicating known associations of PCSK9 and APOE with LDL and total cholesterol and uncovering several novel gene-trait associations for adiposity traits and HDL. Ongoing analyses include phenotype associations with subsistence and genetically inferred ancestry, replication of genetic associations, and gene-set enrichment. In total, these results offer insights into the genetic and phenotypic landscape of cardiometabolic traits in African populations. This work was supported by ADA grant 1-19-VSN-02; NIH grants 1R35GM134957, R01DK104339, and R01AR076241; and 1X01HL139409-01.

MagicalRsq-X: A cross-cohort transferable genotype imputation quality metric

Authors
Quan Sun, Yingxi Yang, Jiawen Chen, Jia Wen, Michael R. Knowles, Charles Kooperberg, Alex Reiner, Laura M. Raffield, April Carson, Stephen Rich, Jerome Rotter, Ruth Loos, Eimear Kenny, Byron C. Jaeger, Yuan-I Min, Christian Fuchsberger, Yun Li
Name and Date of Professional Meeting
IGES 2023 Annual Meeting (Nov. 5-8, 2023)
Associated paper proposal(s)
Working Group(s)
Abstract Text
Since genotype imputation was introduced, researchers have relied on the imputation quality estimated by imputation software to perform post-imputation quality control (QC). However, this quality estimate (denoted as Rsq) performs less well for lower frequency variants. We recently published MagicalRsq, a machine-learning-based imputation quality calibration metric, which leverages additional typed markers from the same cohort and outperforms Rsq as a QC metric. In this work, we extended the original MagicalRsq to allow cross-cohort model training, and named the extension MagicalRsq-X. We removed the cohort-specific estimated minor allele frequency and additionally included LD scores and recombination rates as variant-level features. Leveraging whole genome sequencing data from TOPMed, specifically participants in the BioMe, JHS, WHI and MESA studies, we performed comprehensive cross-cohort evaluations for individuals of European and African ancestry, grouped by their inferred global ancestry with the 1000 Genomes and HGDP data as reference. Our results suggest MagicalRsq-X outperforms Rsq in almost every setting, with 7.3-14.4% improvement in squared Pearson correlation with true R^2, corresponding to 85-218K variant gains. We further developed a metric to quantify the genetic distance of a target cohort relative to a reference cohort and showed that this metric could largely explain the performance of MagicalRsq-X models. Finally, in one of the largest GWAS of blood cell traits, we found that MagicalRsq-X rescued 9-53 GWAS variants that would have been missed using the original Rsq for QC. In conclusion, MagicalRsq-X shows clear superiority for post-imputation QC and can greatly benefit genetic studies by rescuing well-imputed low frequency and rare variants.
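The evaluation criterion named in this abstract, the squared Pearson correlation between a quality metric and true R2 (R squared), can be illustrated with a small simulation. The sketch below is not the MagicalRsq-X code. It assumes that true R2 for a variant is the squared Pearson correlation between imputed dosages and sequenced genotypes, and it uses simulated data with two hypothetical quality metrics (`metric_a`, `metric_b`) that are simply noisy copies of true R2. All function and variable names are illustrative.

```python
import numpy as np


def squared_pearson(x, y):
    """Squared Pearson correlation between two 1-D arrays."""
    r = np.corrcoef(x, y)[0, 1]
    return float(r ** 2)


def true_r2_per_variant(dosages, genotypes):
    """Per-variant squared correlation between imputed dosages and
    sequenced genotypes (rows = individuals, columns = variants)."""
    return np.array([
        squared_pearson(dosages[:, j], genotypes[:, j])
        for j in range(genotypes.shape[1])
    ])


rng = np.random.default_rng(1)
n_samples, n_variants = 400, 50

# Simulated "sequenced" genotypes with variant-specific allele frequencies.
maf = rng.uniform(0.05, 0.5, size=n_variants)
genotypes = rng.binomial(2, maf, size=(n_samples, n_variants)).astype(float)

# Simulated "imputed" dosages: genotypes plus variant-specific noise,
# so that some variants are imputed well and others poorly.
noise_sd = rng.uniform(0.1, 2.0, size=n_variants)
dosages = genotypes + rng.standard_normal((n_samples, n_variants)) * noise_sd

true_r2 = true_r2_per_variant(dosages, genotypes)

# Two hypothetical quality metrics: one tracks true R2 closely, one loosely.
metric_a = true_r2 + rng.normal(0.0, 0.02, size=n_variants)
metric_b = true_r2 + rng.normal(0.0, 0.30, size=n_variants)

# Evaluate each metric by its squared Pearson correlation with true R2.
score_a = squared_pearson(metric_a, true_r2)
score_b = squared_pearson(metric_b, true_r2)
print(f"metric_a: {score_a:.3f}  metric_b: {score_b:.3f}")
```

A metric that scores higher on this criterion ranks variants more faithfully by their actual imputation accuracy, which is what matters when a QC threshold decides which variants are kept.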

StocSum: stochastic summary statistics for whole genome sequencing studies

Authors
Nannan Wang, Bing Yu, Goo Jun, Qibin Qi, Ramon A. Durazo-Arvizu, Sara Lindstrom, Alanna C. Morrison, Robert C. Kaplan, Eric Boerwinkle, Han Chen
Name and Date of Professional Meeting
ASHG Meeting (November 1-5, 2023)
Associated paper proposal(s)
Working Group(s)
Abstract Text
Genomic summary statistics, usually defined as single-variant test results from genome-wide association studies, have been widely used to address different scientific questions in genetic and genomic research, such as meta-analysis, heritability estimation, conditional analysis, variant set and gene-based tests, multiple phenotype analysis, genetic correlation or co-heritability estimation. Applications that involve multiple genetic variants also require their correlations or linkage disequilibrium (LD) information, often obtained from an external reference panel. While these methods usually have good performance for common variants in populations of European ancestry, in practice, it is usually difficult to find suitable external reference panels that represent the LD structure for isolated, underrepresented and admixed populations, or rare genetic variants from whole genome sequencing (WGS) studies, limiting the scope of applications for genomic summary statistics. We have developed StocSum, a novel reference-panel-free statistical framework for generating, managing, and analyzing stochastic summary statistics using random vector algorithms. Regardless of the complex sample correlation structure, StocSum always scales linearly with both the sample size and the number of genetic variants in computing stochastic summary statistics from individual-level data. We develop various downstream applications using StocSum including single-variant tests, conditional association tests, gene-environment interaction tests, variant set tests, as well as meta-analysis and LD score regression tools. The complexity of all these downstream applications does not depend on the sample size. We demonstrate the accuracy and computational efficiency of StocSum using two cohorts from the Trans-Omics for Precision Medicine Program. 
Specifically, we show that StocSum can be used to perform long-range variant set tests, expanding the aggregation units beyond genes or genomic regions in close proximity. We also show that for admixed populations, LD scores estimated by StocSum are much more accurate compared to those from external reference panels, even if all ancestry populations are included in those reference panels. In summary, as a reference-panel-free framework, StocSum will facilitate sharing and utilization of genomic summary statistics from WGS studies, especially for isolated, underrepresented and admixed populations.
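The idea of recovering LD from random-vector summaries rather than a reference panel can be illustrated with a generic random-projection sketch. This is not the StocSum implementation. It assumes unrelated samples, standardized genotypes, and standard normal random vectors, whereas the abstract describes a framework that also handles complex sample correlation structure. All names (`stochastic_summary`, `ld_from_summary`, and so on) are hypothetical.

```python
import numpy as np


def standardize(genotypes):
    """Center each variant (column) and scale it to unit variance."""
    centered = genotypes - genotypes.mean(axis=0)
    sd = centered.std(axis=0)
    sd[sd == 0] = 1.0  # guard against monomorphic variants
    return centered / sd


def stochastic_summary(std_genotypes, n_vectors, rng):
    """Project standardized genotypes onto random normal vectors.

    Returns a (variants x n_vectors) matrix U with E[U U^T / n_vectors]
    equal to the sample LD matrix G^T G / n_samples.
    """
    n_samples = std_genotypes.shape[0]
    random_vectors = rng.standard_normal((n_samples, n_vectors))
    return std_genotypes.T @ random_vectors / np.sqrt(n_samples)


def ld_from_summary(summary):
    """Estimate the LD (correlation) matrix from the summaries alone."""
    return summary @ summary.T / summary.shape[1]


rng = np.random.default_rng(0)
n_samples, n_variants, n_vectors = 500, 20, 5000

# Simulated genotypes; copying part of the previous variant induces LD.
genotypes = rng.binomial(2, 0.3, size=(n_samples, n_variants)).astype(float)
for j in range(1, n_variants):
    copy_mask = rng.random(n_samples) < 0.5
    genotypes[copy_mask, j] = genotypes[copy_mask, j - 1]

std_genotypes = standardize(genotypes)

# Exact LD from individual-level data, used only as the benchmark.
ld_exact = std_genotypes.T @ std_genotypes / n_samples

# LD recovered from the stochastic summaries, without individual-level data.
summary = stochastic_summary(std_genotypes, n_vectors, rng)
ld_hat = ld_from_summary(summary)

max_abs_error = np.max(np.abs(ld_hat - ld_exact))
print(f"max |LD_hat - LD_exact| = {max_abs_error:.3f}")
```

For a fixed number of random vectors, the projection costs time proportional to the number of samples times the number of variants, and the later LD computation touches only the summaries. The toy sizes here are chosen to give a tight accuracy check, not to demonstrate savings: the number of random vectors trades accuracy against storage and would be far smaller than the sample size in a realistic setting.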

StocSum: stochastic summary statistics for whole genome sequencing studies

Authors
Nannan Wang, Bing Yu, Goo Jun, Qibin Qi, Ramon A. Durazo-Arvizu, Sara Lindstrom, Alanna C. Morrison, Robert C. Kaplan, Eric Boerwinkle, Han Chen
Name and Date of Professional Meeting
Joint Statistical Meetings (August 5-10, 2023)
Associated paper proposal(s)
Working Group(s)
Abstract Text
Genomic summary statistics, usually defined as single-variant test results from genome-wide association studies, have been widely used to advance the genetics field in a wide range of applications. Applications that involve multiple genetic variants also require their correlations or linkage disequilibrium (LD) information, often obtained from an external reference panel. In practice, it is usually difficult to find suitable external reference panels that represent the LD structure for underrepresented and admixed populations, or rare variants from whole genome sequencing (WGS) studies, limiting the scope of applications for genomic summary statistics. We develop StocSum, a novel reference-panel-free statistical framework for generating, managing, and analyzing stochastic summary statistics using random vector algorithms. Regardless of the complex sample correlation structure, StocSum always scales linearly with both the sample size and the number of genetic variants in computing stochastic summary statistics from individual-level data. We demonstrate the accuracy and computational efficiency of StocSum using two cohorts from the Trans-Omics for Precision Medicine WGS studies.