Abstract Text |
Genomic summary statistics, usually defined as single-variant test results from genome-wide association studies, are widely used in genetic and genomic research for applications such as meta-analysis, heritability estimation, conditional analysis, variant set and gene-based tests, multiple phenotype analysis, and genetic correlation or co-heritability estimation. Applications that involve multiple genetic variants also require the correlations among those variants, known as linkage disequilibrium (LD) information, which is often obtained from an external reference panel. These methods usually perform well for common variants in populations of European ancestry. In practice, however, it is often difficult to find suitable external reference panels that represent the LD structure for isolated, underrepresented and admixed populations, or for rare genetic variants from whole genome sequencing (WGS) studies, which limits the scope of applications for genomic summary statistics. We have developed StocSum, a novel reference-panel-free statistical framework for generating, managing, and analyzing stochastic summary statistics using random vector algorithms. Regardless of the complexity of the sample correlation structure, StocSum always scales linearly with both the sample size and the number of genetic variants when computing stochastic summary statistics from individual-level data. We develop various downstream applications using StocSum, including single-variant tests, conditional association tests, gene-environment interaction tests, variant set tests, as well as meta-analysis and LD score regression tools. The computational complexity of all these downstream applications does not depend on the sample size. We demonstrate the accuracy and computational efficiency of StocSum using two cohorts from the Trans-Omics for Precision Medicine Program.
Specifically, we show that StocSum can be used to perform long-range variant set tests, expanding the aggregation units beyond genes or genomic regions in close proximity. We also show that for admixed populations, LD scores estimated by StocSum are much more accurate than those from external reference panels, even if all ancestral populations are included in those reference panels. In summary, as a reference-panel-free framework, StocSum will facilitate sharing and utilization of genomic summary statistics from WGS studies, especially for isolated, underrepresented and admixed populations.
|
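To make the idea of stochastic summary statistics from random vectors concrete, the following is a minimal NumPy sketch of one way such an estimator can work. It is an illustration under simplifying assumptions, not the StocSum implementation: it assumes unrelated samples and a linear model with fixed-effect covariates only, so the relevant sample covariance is the projection matrix P = I - X(X'X)^{-1}X'. The function name `stochastic_summary_statistics`, the parameter `n_random`, and the simulated data are all hypothetical. The sketch draws random vectors whose covariance is P, multiplies them by the genotype matrix G, and checks that the resulting matrix U reproduces the variant covariance G'PG without an external reference panel.

```python
import numpy as np


def stochastic_summary_statistics(G, X, n_random=1000, seed=0):
    """Return U (variants x n_random) such that U @ U.T approximates G.T @ P @ G.

    Hypothetical helper for illustration only. Assumes unrelated samples,
    where P = I - X (X'X)^{-1} X' projects out the covariates X.
    """
    rng = np.random.default_rng(seed)
    n_samples = G.shape[0]
    # Independent standard normal vectors, one column per random vector.
    Z = rng.standard_normal((n_samples, n_random))
    # Apply P to each column without forming the N x N matrix:
    # R = Z - X @ (least-squares coefficients of Z on X), so Cov(R column) = P.
    coef, *_ = np.linalg.lstsq(X, Z, rcond=None)
    R = Z - X @ coef
    # Cost is linear in both the number of samples and the number of variants.
    return G.T @ R / np.sqrt(n_random)


# Simulated data (assumed values, not from the study).
rng = np.random.default_rng(1)
n_samples, n_variants = 200, 5
X = np.column_stack([np.ones(n_samples), rng.standard_normal(n_samples)])
G = rng.binomial(2, 0.3, size=(n_samples, n_variants)).astype(float)

U = stochastic_summary_statistics(G, X, n_random=5000)
V_hat = U @ U.T  # estimate built from the stochastic summary statistics only

# Exact covariance of the score statistics, computed here only for comparison.
P = np.eye(n_samples) - X @ np.linalg.solve(X.T @ X, X.T)
V_exact = G.T @ P @ G

rel_err = np.linalg.norm(V_hat - V_exact) / np.linalg.norm(V_exact)
print(f"relative Frobenius error: {rel_err:.3f}")
```

Once U has been computed, the downstream step (`U @ U.T`) involves only the number of variants and the number of random vectors, which is consistent with the statement above that downstream applications do not depend on the sample size. Handling related samples would require random vectors drawn with a different covariance, which this sketch does not attempt.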