
Population Genetics

Approach to determine germline epigenetic state and mutator alleles based on local mutation rate

Authors
Vladimir Seplyarskiy*, Ruslan Soldatov*, Jacob Goldman, Wendy Wong, Christian Glissen, TOPMED consortia, Peter Kharchenko, Shamil Sunyaev
Name and Date of Professional Meeting
ASHG
Associated paper proposal(s)
Working Group(s)
Abstract Text
Stereotypic mutational processes operating in the human germline are the source of genetic diversity and the cause of hereditary diseases. Patterns of germline mutations vary on different scales, including single-nucleotide hotspots, clusters of multiple mutations at scales of ten kilobases, and megabase-scale variation. The observed variation is a consequence of exposure to a combination of unknown mutational processes. However, the etiology, intensities and spectra of mutational processes in the germline are almost unexplored.
Current understanding of mutational processes in human cells is based primarily on cancer data. Inference of mutational signatures from cancers relies on the fact that individual tumors vary dramatically in their exposure to a given mutational process.
Here we make use of the strong variability of mutational patterns along the genome to infer the underlying mutational processes. We assume that the variability of mutational spectra between loci is driven by differences in the relative contributions of an unknown but fixed set of stereotypic mutational processes.
To formally extract mutational signatures, the genome was binned into non-overlapping windows of fixed size (e.g. 20 or 100 kilobases) and spectra were compared between windows. Very rare polymorphisms (allele frequency below 10^-4) from the gnomAD and TOPMed projects served as input. Inference of signatures was formulated as a matrix decomposition problem. Using independent component analysis to perform the matrix decomposition, we discovered seven major mutational processes, including signatures created by transcription-coupled nucleotide excision repair and by error-prone asymmetric bypass of DNA damage during replication, a signature associated with replication timing, a signature associated with repeat expansion, and an oocyte-specific signature. These signatures were confirmed with de novo germline mutations. The oocyte-specific signature is localized in regions with disproportionately high fractions of mutations of maternal origin, which were recently discovered in studies of de novo mutations in human trios. This signature is active in several genomic regions comprising about 5% of the genome, and shows the highest intensity on the non-transcribed strand of long genes (WWOX1, CSMD1). We also show that replication-associated signatures predict replication fork polarity and inter-origin distance in the germline, opening up an avenue for mutation-based inference of molecular features in human cells.
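The binning step described above can be illustrated with a minimal sketch. It assumes 0-based positions on a single chromosome and mutation types already encoded as integer indices; the function names, the toy data and the three-type encoding are hypothetical and are not taken from the abstract.

```python
import numpy as np

def window_spectra(positions, mut_types, n_types, window_size, chrom_length):
    """Count mutations of each type in non-overlapping fixed-size windows.

    positions: 0-based coordinates of rare variants (assumed encoding)
    mut_types: integer index of the mutation type of each variant (assumed encoding)
    Returns an (n_windows, n_types) matrix of counts.
    """
    positions = np.asarray(positions)
    mut_types = np.asarray(mut_types)
    n_windows = -(-chrom_length // window_size)  # ceiling division
    counts = np.zeros((n_windows, n_types), dtype=np.int64)
    # np.add.at accumulates correctly when the same (window, type) pair repeats
    np.add.at(counts, (positions // window_size, mut_types), 1)
    return counts

def normalize_spectra(counts):
    """Convert per-window counts to per-window fractions; empty windows stay zero."""
    totals = counts.sum(axis=1, keepdims=True)
    out = np.zeros(counts.shape, dtype=float)
    return np.divide(counts, totals, out=out, where=totals > 0)

# Toy example: five variants, three mutation types, 20 kb windows
counts = window_spectra(
    positions=[5, 15_000, 25_000, 26_000, 99_999],
    mut_types=[0, 1, 0, 0, 2],
    n_types=3,
    window_size=20_000,
    chrom_length=100_000,
)
spectra = normalize_spectra(counts)
```

A windows-by-types matrix of this kind is the sort of input a matrix decomposition would take. The abstract does not say which independent component analysis implementation was used; scikit-learn's `FastICA` would be one possible choice.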

Using ~20,000 public whole genomes to build reference panels for fine-mapping HLA effects in multi-ethnic cohorts

Authors
Y. Luo, M. Kanai, M. Gutierrez-Arcelus, J.G. Wilson, S. Kathiresan, J.I. Rotter, S.S. Rich, M.H. Cho, W.S. Choi, B. Han, Y. Okada, A. Metspalu, T. Esko, P.J. McLaren, S. Raychaudhuri, NHLBI TOPMed Consortium
Name and Date of Professional Meeting
ASHG (Oct 16-20, 2018)
Associated paper proposal(s)
Working Group(s)
Abstract Text
The human leukocyte antigen (HLA) region harbors genes that are crucial to many human diseases. However, it remains a challenge to pinpoint the causal variants for these associations due to the extreme complexity of the region. We constructed the largest multi-ethnic HLA haplotype panel to date to better understand immune related adaptive evolution, and to facilitate fine-mapping studies from genome-wide association studies (GWAS).

First, we inferred HLA types at G-group resolution using whole-genome sequences (WGS) from 20,209 individuals of different ancestries (10,699 Europeans, 7,644 African Americans (AA), 1,016 Hispanics and 850 East Asians) using population reference graphs (Dilthey et al. 2016). We evaluated inferred HLA types against sequencing based typing (SBT) among 295 Japanese samples sequenced at 15x coverage. The accuracies for HLA-A, B, C, DQA1, DQB1 and DRB1 were 95.4%, 97.9%, 98.5%, 99.3%, 98.2% and 97.2% respectively. We observed high levels of differentiation in population allele frequencies among inferred HLA types (P = 8.7e-267). In many cases these differences are likely to be related to adaptive selection such as enrichment for the B*53:01:01G and C*04:01:01G alleles in AA samples that have been previously associated with malaria protection.
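As an illustration of how inferred HLA types might be scored against SBT, the sketch below computes an allele-level concordance for one gene. The abstract does not define its accuracy metric, so the definition here (fraction of reference alleles matched, ignoring the order of the two alleles) is an assumption, and the function name and sample data are hypothetical.

```python
from collections import Counter

def allele_concordance(inferred, truth):
    """Fraction of reference alleles that are matched by the inferred alleles.

    inferred, truth: dicts mapping a sample ID to a pair of allele names
    for a single HLA gene. Matching ignores allele order within a pair.
    Samples missing from `inferred` count as fully discordant.
    """
    matched = 0
    total = 0
    for sample, true_pair in truth.items():
        pred_pair = inferred.get(sample, ())
        # Multiset intersection handles homozygous calls correctly
        overlap = Counter(true_pair) & Counter(pred_pair)
        matched += sum(overlap.values())
        total += len(true_pair)
    return matched / total if total else float("nan")

# Toy example for HLA-A: s1 matches on both alleles, s2 on one of two
truth = {"s1": ("A*02:01", "A*24:02"), "s2": ("A*11:01", "A*11:01")}
inferred = {"s1": ("A*24:02", "A*02:01"), "s2": ("A*11:01", "A*33:03")}
accuracy = allele_concordance(inferred, truth)  # 3 of 4 alleles match: 0.75
```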

We next built a multi-ethnic HLA reference panel based on inferred HLA types and genetic variation in 5,376 multi-ethnic samples. To evaluate the imputation accuracy of the multi-ethnic panel, we compared imputed HLA haplotypes against SBT among 1,067 AA subjects. The average accuracy among the six classical HLA genes was 96.3%, compared to 77.4% when using a European panel of 5,225 samples alone. To illustrate the fine-mapping advantages of increased ancestral diversity in the reference panel, we meta-analyzed published HIV-1 viral load GWAS in a total of 6,315 European and 2,924 AA subjects. The most significantly associated allele was an amino acid at HLA-B position 97 in both populations. However, conditional analysis identified different secondary associations: an amino acid at HLA-B position 67 in the European subjects and B*08:01:01G in the AA subjects.

These results highlight the benefits of a multi-ethnic reference panel for the discovery and characterization of HLA-disease associations. In the next phase, we will build an HLA panel using >20,000 WGS. This resource will open an exciting opportunity to understand immune-related genetic architecture across populations of diverse ancestries.

Analysis of densely imputed UK Biobank genetic data reveals disease-associated rare loss of function variation

Authors
SA Gagliano, W Zhou, D Taliun, J Nielsen, J LeFaive, R Dey, S Das, GR Abecasis
Name and Date of Professional Meeting
ASHG Annual Meeting (October 2018)
Associated paper proposal(s)
Working Group(s)
Abstract Text
Loss of function (LoF) variants, such as those that introduce a premature stop codon or shift the reading frame for the translation machinery, alter protein structure to the extent of eliminating or greatly diminishing protein action. This biological interpretability makes LoF variants a particularly important subset of genetic variation to study.

To pinpoint disease-associated LoF variants, we imputed additional variants into the UK Biobank cohort of half a million genotyped individuals using 60,039 deeply whole-genome-sequenced individuals from the multi-ethnic TOPMed project. This allowed us to expand the number of genotyped or imputed variants characterized in the UK Biobank from 39,131,578 to 177,895,992. The vast majority (94%; 167,502,731) of the imputed variants are rare (alternate allele frequency < 0.5%), of which 0.03% (49,892) are high-confidence predicted LoF. This dramatic increase in the number of variants in one of the largest health-based cohorts to date provides an ideal setting to assess the impact of LoF variation. To identify disease-associated LoF variants, we conducted single-variant and gene-burden association tests for >1,400 binary traits constructed from health record billing codes.
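The idea behind a gene-burden test can be sketched as follows: rare LoF variants in a gene are collapsed into a single carrier indicator per individual, which is then tested against case status. The abstract does not state which burden test was used; the one-sided Fisher's exact test below is a stand-in chosen for simplicity, and all function names and data are hypothetical. Biobank-scale analyses typically use regression-based methods that adjust for covariates and case-control imbalance, which this toy example does not do.

```python
from math import comb

def collapse_carriers(genotypes):
    """Mark an individual as a carrier if any LoF variant in the gene is non-reference.

    genotypes: one row per individual, one alternate-allele count per variant.
    """
    return [any(g > 0 for g in row) for row in genotypes]

def fisher_one_sided(a, b, c, d):
    """P(carrier cases >= a) under the hypergeometric null.

    a: carrier cases, b: carrier controls,
    c: non-carrier cases, d: non-carrier controls.
    """
    n_carriers = a + b
    n_cases = a + c
    n = a + b + c + d
    upper = min(n_carriers, n_cases)
    numerator = sum(
        comb(n_carriers, k) * comb(n - n_carriers, n_cases - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(n, n_cases)

def burden_test(genotypes, is_case):
    """Collapse variants to carrier status and test for enrichment among cases."""
    carriers = collapse_carriers(genotypes)
    a = sum(1 for c, y in zip(carriers, is_case) if c and y)
    b = sum(1 for c, y in zip(carriers, is_case) if c and not y)
    c_ = sum(1 for c, y in zip(carriers, is_case) if not c and y)
    d = sum(1 for c, y in zip(carriers, is_case) if not c and not y)
    return fisher_one_sided(a, b, c_, d)

# Toy example: two LoF variants in one gene, three cases and three controls
genotypes = [[1, 0], [0, 1], [0, 2], [0, 0], [0, 0], [0, 0]]
is_case = [True, True, True, False, False, False]
p_value = burden_test(genotypes, is_case)  # 1 / C(6, 3) = 0.05
```

Collapsing lets several individually rare variants contribute to one test, which is why a gene can show a burden signal even when no single variant reaches significance.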

In the single-variant analyses, we identified five rare LoF variants (not found in the 39M dataset) to be associated with disease, including two variants associated with breast cancer: a frameshift indel in CHEK2 (chr22:28695868:AG:A; build38; p=7.0E-22) and a SNP resulting in a gained stop in PALB2 (chr16:23621362:C:T; p=6.9E-14). Both are present in the ClinVar database as potentially pathogenic for familial breast cancer, but this is the first time these variants have been identified by GWAS.

Furthermore, we found significant burden signals in genes previously implicated in familial disease, but for which no LoF variants were significantly associated in the single-variant analyses. Examples include USH2A (a known gene for Usher’s syndrome, for which retinitis pigmentosa is a primary symptom) for hereditary retinal dystrophies with only 83 cases, and IFT140 (a known gene for Mainzer-Saldino syndrome, which is characterized by kidney disease) for kidney cyst with 1,257 cases.

We demonstrate that association studies in large-scale biobanks, even with a relatively small number of cases, are capable of yielding pathogenic findings that previously were only detected in clinical cases or difficult-to-collect family cohorts, in which it can be challenging to obtain robust associations.