Abstract Text
It is well known that whole-genome sequencing (WGS) with short-read technology is prone to missing sequence not found in the human reference genome. A recent analysis [1] identified a large number of non-repetitive, non-reference sequences in an Icelandic cohort and determined that a large fraction (>95%) of those longer than 200 bp could be aligned to the chimpanzee genome, and that nearly two-thirds of them were in linkage disequilibrium with GWAS catalogue markers. Here we present a complementary pipeline for discovering and analyzing ancestral hominid sequences that are absent from the human reference genome and that likely constitute polymorphic deletions in the human population. Starting from extensively filtered unmapped reads and their mates, de novo sequences were assembled using multiple k-mer sizes and then concurrently mapped against several hominid reference genomes to select non-reference human contigs. In an analysis of ~18K whole-genome sequences from the NHLBI TOPMed study, the total amount of non-reference hominid contig sequence varied from ~150 kb to ~500 kb per individual (N50 ~1 kb). These contigs were merged and clustered into a set of non-redundant, localized “undel” events by leveraging their locations in other hominid genomes. Undels range from high-frequency events to events that are private to a single individual in the studied cohort. Given that a significant number of these sequences overlap genic and promoter regions, some may account for phenotypic differences or even contribute to increased disease risk.
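Two steps named above can be illustrated with a minimal Python sketch: selecting unmapped reads and their mates, and computing the N50 of a contig set. The abstract does not specify how either step is implemented, so this is an assumption, not the pipeline's code. The helper names `is_candidate_read` and `n50` are hypothetical; the flag bits are those defined in the SAM format specification, and the pipeline's additional filtering is not modeled.

```python
from typing import Iterable

# SAM FLAG bits as defined in the SAM format specification.
FLAG_PAIRED = 0x1         # read is part of a pair
FLAG_UNMAPPED = 0x4       # read itself is unmapped
FLAG_MATE_UNMAPPED = 0x8  # mate of the read is unmapped


def is_candidate_read(flag: int) -> bool:
    """Return True for an unmapped read, or a paired read whose mate is unmapped.

    Hypothetical selection rule; the real pipeline applies further filters.
    """
    if flag & FLAG_UNMAPPED:
        return True
    return bool(flag & FLAG_PAIRED) and bool(flag & FLAG_MATE_UNMAPPED)


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (0 for an empty set).

    N50 is the length of the shortest contig among the longest contigs
    that together cover at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return 0
```

For example, contigs of 500, 400, 300, 200, and 100 bp total 1,500 bp; the two longest cover 900 bp, which exceeds half of the total, so `n50` returns 400.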
Our presentation will 1) detail the individual steps of the pipeline; 2) compare our results with those of similar analyses [1,2,3]; and 3) include preliminary results from the analysis of two large, ethnically diverse WGS collections from the NHLBI TOPMed program and the NHGRI Centers for Common Disease program.
1. Kehr, B. et al., Nature Genetics 49: 588-593 (2017)
2. Telenti, A. et al., PNAS 113(42): 11901-11906 (2016)
3. Mallick, S. et al., Nature 538: 201-206 (2016)