Abstract Text
Large-scale whole-genome sequencing (WGS) studies enable the evaluation of the functional impact of structural variations (SVs) on multiple human phenotypes and conditions. SVs are genomic alterations of 50 bp or larger and have been shown to be the main driver of genomic diversity, with clear roles in human diseases and other phenotypes. There remain substantial technical challenges in the discovery and genotyping of SVs from short-read sequence data, especially at scale. A plethora of SV detection methods have been developed over the last decade, and these methods differ in their strengths at identifying particular SV types and sizes. We developed Parliament2, an SV discovery and merging pipeline that harmonizes calls from multiple SV detection tools into a single SV callset that shows better specificity and sensitivity than any single method. We also developed an efficient joint-genotyping method, muCNV, that reduces the false discoveries that accumulate across methods and large sample sizes. The Parliament2-muCNV pipeline has been employed to generate a comprehensive catalog of SVs for genetic association analyses across 138,134 multi-ethnic TOPMed WGS samples. We identified a total of 466,800 SVs, including 231,817 deletions, 197,412 duplications/CNVs, and 37,571 inversions. As expected, the majority of SVs were rare, with almost 46% being singletons. On average, an individual carries 3,303 deletions, 3,570 duplications, and 185 inversions. To estimate genotyping accuracy, we evaluated non-reference Mendelian inconsistency rates using 11,580 trios. The estimated error rates were 0.29% for deletions, 0.83% for biallelic duplications, and 3.1% for inversions. De novo heterozygote rates were similar: 0.45%, 0.66%, and 4.2% for deletions, duplications, and inversions, respectively, demonstrating that the callset is suitable for genetic association analyses.
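As an illustration of the trio-based accuracy check described above, the following is a minimal sketch of how a non-reference Mendelian inconsistency rate could be computed for biallelic variants. It is not the muCNV implementation. The function names, the 0/1/2 alternate-allele-count encoding, and the choice to exclude trios in which all three genotypes are homozygous reference are assumptions made for this example; the exact definition used for the TOPMed callset may differ.

```python
from itertools import product

# Alleles a parent can transmit, keyed by that parent's alternate-allele count
# (0 = hom-ref, 1 = het, 2 = hom-alt). Assumed biallelic, autosomal encoding.
TRANSMITTABLE = {0: (0,), 1: (0, 1), 2: (1,)}


def is_mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can arise from one allele per parent."""
    return any(
        a + b == child
        for a, b in product(TRANSMITTABLE[father], TRANSMITTABLE[mother])
    )


def non_reference_inconsistency_rate(trio_genotypes):
    """Fraction of informative trio genotypes that violate Mendelian inheritance.

    trio_genotypes: iterable of (father, mother, child) alternate-allele counts,
    with None marking a missing genotype. Trios with any missing genotype, or
    with all three members homozygous reference, are skipped (an assumption of
    this sketch), so the denominator counts only non-reference trio genotypes.
    """
    informative = 0
    inconsistent = 0
    for father, mother, child in trio_genotypes:
        if None in (father, mother, child):
            continue  # missing call: not evaluable
        if father == mother == child == 0:
            continue  # all-reference trio: carries no information here
        informative += 1
        if not is_mendelian_consistent(father, mother, child):
            inconsistent += 1
    return inconsistent / informative if informative else float("nan")


# Toy data (hypothetical genotypes, not from the TOPMed callset).
toy_trios = [
    (0, 0, 0),  # skipped: all reference
    (0, 1, 1),  # consistent
    (1, 1, 2),  # consistent
    (0, 0, 1),  # inconsistent: apparent de novo heterozygote
    (2, 2, 2),  # consistent
    (2, 0, 0),  # inconsistent: child must carry the father's alternate allele
]
rate = non_reference_inconsistency_rate(toy_trios)
```

Restricting the denominator to non-reference trio genotypes keeps the rate from being diluted by the many all-reference genotypes at rare variant sites, which would otherwise make almost any callset look accurate.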
The TOPMed program is ongoing and growing in sample size, and we expect to have an even larger, highly accurate SV callset at the time of presentation. These SV data are publicly available for download from dbGaP to researchers with an approved TOPMed manuscript proposal. Assessing the contributions of SVs to common and complex human traits and risk factors will further advance our understanding of the genetics and biology of heart, lung, blood, and sleep disorders, especially when combined with transcriptomic, epigenetic, and metabolomic resources from the TOPMed program.