Overview
TOPMed generates scientific resources to enhance understanding of fundamental biological processes that underlie heart, lung, blood and sleep disorders. It is part of efforts to harness data science to drive precision medicine, which aims to provide disease treatments tailored to an individual's unique genes and environment. TOPMed integrates -omics data with molecular, behavioral, imaging, environmental, and clinical data from diverse participants in NHLBI's population and epidemiology studies. Integrating these data helps researchers expand their analyses, identify factors that increase or decrease the risk of disease, identify subtypes of disease, and develop more targeted and personalized treatments.
Currently, TOPMed's Freeze 5b includes more than 70 different studies with approximately 145,000 samples with whole genome sequencing (WGS) completed or in progress. These studies encompass several experimental designs (e.g. cohort, case-control, family) and many different clinical trait areas (e.g. asthma, COPD, atrial fibrillation, atherosclerosis, sleep). See study descriptions on the Parent Studies Descriptions & Statements page.
TOPMed WGS genotype call sets (called "Freezes") are being released on dbGaP periodically (at approximately 6-12 month intervals). WGS data for samples from Phase 1 studies, with reads mapped to human genome build GRCh37, were released in 2016 (Freeze 3) and 2017 (Freeze 4). The Freeze 5b genotype call set, with samples from Phase 1 and 2 studies and reads mapped to genome build GRCh38, is being released in 2018. A summary of the dbGaP accessions for these studies, including their approximate sample numbers, is provided in Table 1.
Some TOPMed studies have previously released genotypic and phenotypic data on dbGaP in "parent" accessions (see Table 1). For those studies, the TOPMed WGS accession contains only WGS-derived data and, therefore, genotype-phenotype analysis requires data from both the parent and TOPMed WGS accessions. For the studies in Table 1 without a specific parent accession number, the TOPMed WGS accession contains both genotype and phenotype data.
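The rule above amounts to a simple lookup. The following is a minimal sketch, not part of any TOPMed or dbGaP tooling: the function name `accessions_needed` and the dictionary `PARENT_ACCESSION` are illustrative, and the dictionary holds only three rows copied from Table 1.

```python
from typing import Dict, List, Optional

# Illustrative subset of Table 1:
# TOPMed WGS accession -> parent accession (None when Table 1 lists no parent).
PARENT_ACCESSION: Dict[str, Optional[str]] = {
    "phs000974": "phs000007",  # FHS
    "phs000964": "phs000286",  # JHS
    "phs000956": None,         # Amish
}


def accessions_needed(topmed_accession: str) -> List[str]:
    """Return the accessions required for a genotype-phenotype analysis."""
    parent = PARENT_ACCESSION[topmed_accession]
    if parent is None:
        # The TOPMed WGS accession holds both genotypes and phenotypes.
        return [topmed_accession]
    # Genotypes come from the TOPMed accession, phenotypes from the parent.
    return [topmed_accession, parent]
```

For example, `accessions_needed("phs000974")` lists both the FHS TOPMed accession and its parent, while `accessions_needed("phs000956")` lists only the Amish TOPMed accession.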
Table 1: Summary of TOPMed Study Accessions in Freeze 5b
| Project¹ | Study Accession | Study Name² | Study/Cohort Abbreviation | Study PI | Sample Size³ | Sequencing Center⁴ | Phase | Parent Study Accession |
|---|---|---|---|---|---|---|---|---|
| AA_CAC | phs001412 | NHLBI TOPMed: African American Coronary Artery Calcification (AA CAC) | DHS | Allred | 339 | BROAD | 2 | |
| AA_CAC, GeneSTAR | phs001218 | NHLBI TOPMed: GeneSTAR (Genetic Study of Atherosclerosis Risk) | GeneSTAR | Mathias | 1,639 | MACROGEN, BROAD⁵ | 2 | phs001074 |
| AA_CAC, HyperGEN_GENOA | phs001345 | NHLBI TOPMed: Genetic Epidemiology Network of Arteriopathy (GENOA) | GENOA | Peyser & Kardia | 1,143 | BROAD, UW | 2 | phs001238 |
| AA_CAC, MESA | phs001416 | NHLBI TOPMed: MESA and MESA Family AA-CAC | MESA | Rich & Rotter | 4,819 | BROAD | 2 | phs000209 |
| AFGen | phs000997 | NHLBI TOPMed: The Vanderbilt AF Ablation Registry | VAFAR | Shoemaker | 154 | BROAD | 1 | |
| AFGen | phs001024 | NHLBI TOPMed: Partners HealthCare Biobank | Partners | Lubitz | 111 | BROAD | 1 | |
| AFGen | phs001032 | NHLBI TOPMed: The Vanderbilt Atrial Fibrillation Registry | VU_AF | Darbar | 1,018 | BROAD | 1 | |
| AFGen | phs001040 | NHLBI TOPMed: Novel Risk Factors for the Development of Atrial Fibrillation in Women | WGHS | Albert & Chasman | 98 | BROAD | 1 | |
| AFGen | phs001062 | NHLBI TOPMed: MGH Atrial Fibrillation Study | MGH_AF | Lubitz | 918 | BROAD | 1 | phs001001 |
| AFGen | phs001189 | NHLBI TOPMed: Cleveland Clinic Atrial Fibrillation Study | CCAF | Chung & Barnard | 329 | BROAD | 1 | phs000820 |
| AFGen, FHS | phs000974 | NHLBI TOPMed: Whole Genome Sequencing and Related Phenotypes in the Framingham Heart Study | FHS | Ramachandran | 3,749 | BROAD | 1 | phs000007 |
| Amish | phs000956 | NHLBI TOPMed: Genetics of Cardiometabolic Health in the Amish | Amish | Mitchell | 1,028 | BROAD | 1 | |
| BAGS | phs001143 | NHLBI TOPMed: The Genetics and Epidemiology of Asthma in Barbados | BAGS | Barnes | 962 | ILLUMINA | 1 | |
| CFS | phs000954 | NHLBI TOPMed: The Cleveland Family Study (WGS) | CFS | Redline | 920 | UW | 1 | phs000284 |
| COPD | phs000946 | NHLBI TOPMed: Boston Early-Onset COPD Study in the TOPMed Program | EOCOPD | Silverman | 66 | UW | 1 | phs001161 |
| COPD | phs000951 | NHLBI TOPMed: Genetic Epidemiology of COPD (COPDGene) in the TOPMed Program | COPDGene | Silverman | 8,742 | BROAD, UW | 1, 2 | phs000179 |
| CRA_CAMP | phs000988 | NHLBI TOPMed: The Genetic Epidemiology of Asthma in Costa Rica | CRA | Weiss | 1,041 | UW | 1 | |
| GenSalt | phs001217 | NHLBI TOPMed: Genetic Epidemiology Network of Salt Sensitivity (GenSalt) | GenSalt | He | 1,695 | BAYLOR | 2 | phs000784 |
| GOLDN | phs001359 | NHLBI TOPMed: Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) | GOLDN | Arnett | 904 | UW | 2 | phs000741 |
| HyperGEN_GENOA | phs001293 | NHLBI TOPMed: HyperGEN - Genetics of Left Ventricular (LV) Hypertrophy | HyperGEN | Arnett | 1,776 | UW | 2 | |
| JHS | phs000964 | NHLBI TOPMed: The Jackson Heart Study | JHS | Correa | 3,128 | UW | 1 | phs000286 |
| PGX_Asthma | phs000920 | NHLBI TOPMed: Genes-environments and Admixture in Latino Asthmatics (GALA II) Study | GALAII | Burchard | 913 | NYGC⁵ | 1 | phs001180 |
| PGX_Asthma | phs000921 | NHLBI TOPMed: Study of African Americans, Asthma, Genes and Environment (SAGE) Study | SAGE | Burchard | 451 | NYGC⁵ | 1 | |
| SAFS | phs001215 | NHLBI TOPMed: San Antonio Family Heart Study (WGS) | SAFS | Blangero | 1,502 | ILLUMINA | 1 | |
| Sarcoidosis | phs001207 | NHLBI TOPMed: African American Sarcoidosis Genetics Resource | Sarcoidosis | Montgomery | 608 | BAYLOR | 2 | |
| SAS | phs000972 | NHLBI TOPMed: Genome-wide Association Study of Adiposity in Samoans | SAS | McGarvey | 1,208 | UW, NYGC | 1, 2 | phs000914 |
| THRV | phs001387 | Rare Variants for Hypertension in Taiwan Chinese (THRV) | THRV | Rao & Chen | 1,525 | BAYLOR | 2 | |
| VTE | phs001368 | NHLBI TOPMed: Cardiovascular Health Study | CHS | Heckbert | 69 | BAYLOR | 2 | phs000287 |
| VTE | phs001402 | NHLBI TOPMed: Whole Genome Sequencing of Venous Thromboembolism (WGS of VTE) | Mayo_VTE | de Andrade | 1,251 | BAYLOR | 2 | phs000289 |
| VTE, AFGen | phs000993 | NHLBI TOPMed: Heart and Vascular Health Study (HVH) | HVH | Heckbert & Smith | 614 | BROAD, BAYLOR | 1, 2 | phs001013 |
| VTE, AFGen | phs001211 | NHLBI TOPMed: Trans-Omics for Precision Medicine Whole Genome Sequencing Project: ARIC | ARIC | Boerwinkle | 3,612 | BAYLOR, BROAD | 1, 2 | phs000280 |
| WHI | phs001237 | NHLBI TOPMed: Women's Health Initiative (WHI) | WHI | Kooperberg | 10,047 | BROAD | 2 | phs000200 |
| | | | | TOTAL SAMPLES | 54,854 | | | |

¹ TOPMed Project. AFGen=Atrial Fibrillation Genetics Consortium; Amish=Genetics of Cardiometabolic Health in the Amish; BAGS=Barbados Asthma Genetics Study; CFS=Cleveland Family Study; COPD=Genetic Epidemiology of COPD; CRA_CAMP=The Genetic Epidemiology of Asthma in Costa Rica and the Childhood Asthma Management Program; FHS=Framingham Heart Study; JHS=Jackson Heart Study; PGX_Asthma=Pharmacogenomics of Bronchodilator Response in Minority Children with Asthma; SAS=Samoan Adiposity Study; VTE=Venous Thromboembolism; AA_CAC=African American Coronary Artery Calcification; GeneSTAR=Genetic Studies of Atherosclerosis Risk; GenSalt=Genetic Epidemiology Network of Salt Sensitivity; GOLDN=Genetics of Lipid Lowering Drugs and Diet Network; HyperGEN_GENOA=Hypertension Genetic Epidemiology Network and Genetic Epidemiology Network of Arteriopathy; MESA=Multi-Ethnic Study of Atherosclerosis; SAFS=San Antonio Family Studies; Sarcoidosis=Genetics of Sarcoidosis in African Americans; WHI=Women's Health Initiative. Project descriptions are available on the TOPMed website, https://topmed.nhlbi.nih.gov.
² Study name as it appears in dbGaP.
³ Approximate sample size for the Freeze 5b release.
⁴ NYGC = New York Genome Center; BROAD = Broad Institute of MIT and Harvard; UW = University of Washington Northwest Genomics Center; ILLUMINA = Illumina Genomic Services; MACROGEN = Macrogen Corp.; BAYLOR = Baylor Human Genome Sequencing Center.
⁵ ILLUMINA was an additional sequencing center for legacy data contributed by GALAII (n=6 samples), SAGE (n=10 samples) and GeneSTAR (n=283 samples).
Please note that most (but not all) samples in previous releases (genotype call sets for Freezes 3 and 4) are included in Freeze 5b (along with many new samples). Because some investigators are in the process of analyzing Freeze 4 data across multiple studies and because it includes some samples that are not in Freeze 5b, the Freeze 4 call set will also be included along with Freeze 5b in the 2018 dbGaP releases.
The following sections of this document describe methods of data acquisition, processing and quality control (QC) for TOPMed WGS data contained in the 2018 Freeze 5b call set. (A separate document describes methods for the Freeze 4 call set.) Briefly, approximately 30X whole genome sequencing was performed at several different Sequencing Centers (named in Table 1). In most cases, all samples for a given study were sequenced at the same center (see Table 1 for exceptions), except for a small number of control samples described below. The reads were aligned to human genome build GRCh37 or GRCh38 at each center using similar, but not identical, processing pipelines. The resulting sequence data files were transferred from all centers to the TOPMed Informatics Research Center (IRC), where they were re-aligned to build GRCh38, using a common pipeline to produce a set of 'harmonized' .cram files. The IRC performed joint genotype calling on all samples in the Freeze 5b release (as well as additional studies to be released at a later time). The resulting VCF files were split by study and consent group for distribution to approved dbGaP users. They can be reassembled easily for cross-study, pooled analysis since the files for all studies contain identical lists of variant sites. Quality control was performed at each stage of the process by the Sequencing Centers, the IRC and the TOPMed Data Coordinating Center (DCC). Only samples that passed QC are included in the call set, whereas all variants (whether passed or failed) are included.
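Because the files for every study list the same variant sites, a pooled data set can in principle be built by placing the per-sample genotype columns side by side. The sketch below illustrates that idea on in-memory VCF text. It assumes uncompressed input, the same site order and the same FORMAT field in every file, and it drops the `##` meta-information lines. The function name `merge_same_site_vcfs` is hypothetical; for real data an established tool such as `bcftools merge` is likely a better choice.

```python
from typing import Iterable, List


def merge_same_site_vcfs(vcf_texts: Iterable[str]) -> List[str]:
    """Column-wise merge of VCFs that share an identical, identically ordered site list.

    Returns the merged #CHROM header line followed by the merged records.
    Meta-information lines (starting with '##') are dropped in this sketch.
    """
    merged: List[List[str]] = []
    for file_index, text in enumerate(vcf_texts):
        rows = [line.split("\t") for line in text.splitlines()
                if line and not line.startswith("##")]
        if file_index == 0:
            merged = rows
            continue
        if len(rows) != len(merged):
            raise ValueError("files do not contain the same number of sites")
        for base, extra in zip(merged, rows):
            # CHROM, POS, ID, REF and ALT must agree row by row
            # (the #CHROM header row is checked the same way).
            if base[:5] != extra[:5]:
                raise ValueError(f"site mismatch at {extra[0]}:{extra[1]}")
            # Columns 10 onward hold sample names (header) or genotypes (records).
            base.extend(extra[9:])
    return ["\t".join(row) for row in merged]
```

The row-by-row comparison of the first five columns is the safeguard here: if two files ever disagree on a site, the merge stops rather than silently misaligning genotypes.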
Genotype call sets are provided in VCF format, with one file per chromosome. GRCh38 read alignments are not provided currently by dbGaP, but there is a plan to do so in the future.
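Since the call set retains variants that failed QC as well as those that passed, an analysis may want to restrict itself to passing records. The following minimal sketch assumes that passing variants carry the standard VCF FILTER value `PASS` in the seventh column; the labels used for failed variants in Freeze 5b are not described in this section, so the code does not depend on them. The function name `passing_records` is illustrative.

```python
from typing import Iterable, Iterator


def passing_records(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged, plus records whose FILTER field is exactly 'PASS'."""
    for line in vcf_lines:
        if line.startswith("#"):
            # Keep meta-information and the #CHROM header so the output stays a valid VCF.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the seventh tab-separated column (index 6).
        if len(fields) > 6 and fields[6] == "PASS":
            yield line
```

Because the function is a generator over lines, it can be applied to an open file handle for each per-chromosome file without loading the whole file into memory.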
TOPMed DNA sample/sequencing-instance identifiers
Each DNA sample processed by TOPMed is given a unique identifier consisting of "NWD" followed by six digits (e.g. NWD123456). These identifiers are unique across all TOPMed studies. Each NWD identifier is associated with a single study subject identifier used in other dbGaP files (such as phenotypes, pedigrees and consent files). A given subject identifier may link to multiple NWD identifiers if duplicate samples are sequenced from the same individual. Study investigators assign NWD IDs to subjects. Their biorepositories assign DNA samples and NWD IDs to specific bar-coded wells/tubes supplied by the Sequencing Center and record those assignments in a sample manifest, along with other metadata (e.g. sex, DNA extraction method). At each Sequencing Center, the NWD ID is propagated through all phases of the pipeline and is the primary identifier in all results files. Each NWD ID results in a single sequencing instance and is linked to a single subject identifier in the sample-subject mapping file for each accession. In contrast to the project-wide NWD identifiers, subject identifiers are study-specific and may not be unique across all TOPMed accessions.
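The identifier rules above can be checked mechanically. The sketch below validates the "NWD" plus six digits format and groups NWD IDs by subject, enforcing that each NWD ID maps to exactly one subject while a subject may have several NWD IDs. It takes plain (NWD ID, subject ID) pairs as input; the layout of the actual sample-subject mapping file is not assumed, and the names `is_valid_nwd` and `samples_by_subject` are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# "NWD" followed by exactly six ASCII digits, e.g. NWD123456.
NWD_PATTERN = re.compile(r"NWD[0-9]{6}")


def is_valid_nwd(sample_id: str) -> bool:
    """Return True if sample_id has the form NWD plus six digits."""
    return NWD_PATTERN.fullmatch(sample_id) is not None


def samples_by_subject(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group NWD IDs by subject ID, rejecting malformed or repeated NWD IDs."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    seen: Set[str] = set()
    for nwd_id, subject_id in pairs:
        if not is_valid_nwd(nwd_id):
            raise ValueError(f"malformed NWD ID: {nwd_id!r}")
        if nwd_id in seen:
            # Each NWD ID is linked to a single subject identifier.
            raise ValueError(f"NWD ID listed more than once: {nwd_id}")
        seen.add(nwd_id)
        grouped[subject_id].append(nwd_id)
    return dict(grouped)
```

Since subject identifiers are study-specific, the grouping is meaningful only within one accession; pooling pairs from several accessions would require qualifying each subject ID with its study first.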
Control Samples
In Phase 1, one parent-offspring trio from the Framingham Heart Study (FHS) was sequenced at each of four Sequencing Centers (family ID 746, subject IDs 13823, 15960 and 20156). All four WGS runs for each subject are provided in the TOPMed FHS accession (phs000974). In Phase 2, one 1000G Puerto Rican trio (HG01110, HG01111, HG01249) was sequenced once at each center. In addition, HapMap subjects NA12878 (CEU, Lot K6) and NA19238 (YRI, Lot E2) were sequenced at each of the Sequencing Centers in alternation, once approximately every 1000 study samples throughout both Phases 1 and 2. The 1000G and HapMap sequence data will be released publicly as a BioProject in the future.