Analysis

Sparse Empirical Kinship Matrices Enable Computationally Efficient and Accurate Association Tests in Large Samples

Authors
Matthew P Conomos
Tianyu Zhang
Stephanie M Gogarten
Deepti Jain
Caitlin P McHugh
Yao Hu
Alexander P Reiner
Kenneth M Rice
Name and Date of Professional Meeting
ASHG October 15-19, 2019
Associated paper proposal(s)
Working Group(s)
Abstract Text
Mixed models for genetic association testing have traditionally accounted for structure among samples by using an empirical genetic relationship matrix (GRM) that measures genetic covariance, genome-wide, from both ancestry and relatedness. However, fitting mixed models in samples with tens or hundreds of thousands of individuals can be a prohibitive computational burden. Here, we address this problem by using a sparse empirical kinship matrix (KM) and ancestry principal components in place of a GRM.

Standard forms of empirical GRMs and KMs estimated from genotype data are dense; i.e., they have no entries equal to zero. To exploit the computational speedups that sparse matrices enable, we make an empirical KM sparse by clustering samples based on their pairwise kinship estimates and setting all inter-cluster estimates to zero; this can also be thought of as approximating low levels of relatedness as 'unrelated'. In today’s large-scale population studies, where those in pedigrees are a small proportion of the overall sample, this approximation can be expected to be highly accurate, and the computational speedup substantial.
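
The sparsification step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that clusters are the connected components of a graph linking any two samples whose kinship estimate exceeds a threshold, and the function name `sparsify_kinship` and the toy matrix are hypothetical. It builds a dense mask, so it suits only small examples.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def sparsify_kinship(kinship, threshold):
    """Zero every kinship entry between samples in different clusters.

    Assumption: clusters are connected components of the graph that links
    two samples whenever their kinship estimate exceeds `threshold`.
    """
    kinship = np.asarray(kinship, dtype=float)
    # Adjacency: an edge wherever the pairwise estimate exceeds the threshold.
    adjacency = csr_matrix((kinship > threshold).astype(np.int8))
    _, labels = connected_components(adjacency, directed=False)
    # Keep an entry only when both samples carry the same cluster label.
    same_cluster = labels[:, None] == labels[None, :]
    return csr_matrix(np.where(same_cluster, kinship, 0.0)), labels


# Toy data (hypothetical): samples 0, 1 and 3 are linked; sample 2 is not.
toy_kinship = np.array([
    [0.500, 0.250, 0.010, 0.005],
    [0.250, 0.500, 0.004, 0.030],
    [0.010, 0.004, 0.500, 0.002],
    [0.005, 0.030, 0.002, 0.500],
])
sparse_km, labels = sparsify_kinship(toy_kinship, threshold=0.022)
```

Under this assumption, samples 0 and 3 keep their small estimate (0.005) because sample 1 links them into one cluster, while every estimate involving sample 2 other than its own diagonal entry is set to zero.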

To illustrate the computational advantage and statistical impact of using sparse empirical KMs, we performed genetic association analyses using seven red blood cell traits and WGS data from TOPMed freeze 6. Between 17,469 and 48,858 samples were available for these traits. Using a 4th degree relatedness threshold (i.e. kinship > 0.022) and our proposed algorithm, 98.3% to 99.5% of entries in the sparse KM were set to zero, and the largest cluster ranged from 1,667 to 2,459 samples. Compared to using a GRM, using a sparse KM significantly improved computational performance; e.g. fitting the null models for these traits took just 0.5-6.2% of the CPU time and required 1.4-6.7% of the memory. Furthermore, differences in association p-values between the two approaches were small. For these traits, over 99.99% of tests differed in -log10(p) by less than 0.5; i.e. by an amount very unlikely to change the practical interpretation of results. With the level of sparsity attainable in population studies such as TOPMed, we also find that our approach performs favorably compared to SAIGE, another mixed model method designed for analysis of large samples. The use of sparse KMs is a promising and flexible approach to improve the computational efficiency of association testing in large population studies, without sacrificing accuracy.

Carriers-only tests of association of a rare genetic variant with a binary outcome

Authors
In the abstract I will only list authors who worked on the methodology development, and the sleep working group.
Tamar Sofer, Jiwon Lee, Elizabeth Schifano, and the TOPMed Sleep working group.

The manuscript will have a different authors list.

Name and Date of Professional Meeting
ASHG, October 15, 2019.
Associated paper proposal(s)
Working Group(s)
Abstract Text
Background: Rare-variant association analyses with binary outcomes that use likelihood-based tests, such as the Score and Wald tests, suffer from inflated type 1 error. The problem is exacerbated when pooling together individual-level data from diverse parent studies, with potentially different proportions of affected individuals and different frequencies of genetic variants. For example, the analysis of short sleep (< 6 hours/night) considers ~22,000 individuals from TOPMed of European, African, Hispanic, and Asian descent, and the proportion of short sleepers is highest among African Americans. Here, the Score test detected many spurious associations.
 
Methods: We propose a class of semi-parametric tests for association of rare variants with binary traits. Given a vector of probabilities of affection status among a set of (possibly correlated) carriers of a rare variant, we test whether the number of affected individuals among these carriers is consistent with their probability vector. Our carriers-only tests use the Poisson-Binomial (BinomiRare test) or Conway-Maxwell-Poisson (CMP test) distributions.
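
To make the Poisson-Binomial idea concrete, the sketch below computes the distribution of the number of affected carriers from a probability vector and compares the observed count against it. It is an illustration under stated assumptions, not the BinomiRare implementation: carriers are treated as independent, the p-value is taken as the total probability of all counts no more likely than the observed one, and the function names are hypothetical.

```python
import numpy as np


def poisson_binomial_pmf(probs):
    """PMF of the number of affected among independent Bernoulli(p_i) carriers."""
    pmf = np.array([1.0])
    for p in probs:
        # Convolve with the two-point distribution of one more carrier.
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf


def carriers_only_pvalue(probs, n_affected):
    """Assumed p-value rule: sum the probabilities of all counts that are
    at most as probable as the observed count."""
    pmf = poisson_binomial_pmf(probs)
    observed = pmf[n_affected]
    # Small relative tolerance so ties survive floating-point rounding.
    return float(pmf[pmf <= observed * (1 + 1e-12)].sum())


# Hypothetical example: three carriers, each with probability 0.1 of being
# affected, and all three observed to be affected.
example_pvalue = carriers_only_pvalue([0.1, 0.1, 0.1], n_affected=3)
```

In practice the probability vector would come from a fitted null model for the binary trait, and only carriers of the variant enter the calculation, which is why the cost depends on the number of carriers and not on the full sample size.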
 
Results: We performed genetic analysis of short sleep in TOPMed. The Score test did not control the type 1 error. The SAIGE test improved upon the Score test, but also showed high inflation when there were fewer than 30 carriers of the rare variant, and was slower than the carriers-only approach. When there were at least 30 carriers, the SAIGE and carriers-only tests had similar performance (correlation between p-values = 0.96), with SAIGE often being slightly more powerful. Still, in the single region that passed the genome-wide significance threshold, the most significant associations were detected by BinomiRare. A variant chr2:33765596:A:G (chr2 p22.3; 65 carriers; CADD score = 15.3) had BinomiRare p-value = 7.4×10^-9 and SAIGE p-value = 2.6×10^-8.

Conclusions: Carriers-only tests offer a computationally efficient (in terms of both time and memory) alternative to other tests of rare variant associations. They control the type 1 error for an arbitrary number of carriers of the rare variant.