
Guide for Using TOPMed Data

Orientation slides for new investigators.

This page describes how TOPMed investigators can identify, access, and use TOPMed data (information on TOPMed Data Access for the Scientific Community is available on the public website). TOPMed investigators analyze phenotypic, genotypic, and other omics data in the context of TOPMed Working Groups (WGs) and in accordance with the TOPMed Publications Policy and associated procedures for paper proposals and manuscripts. This page assumes some familiarity with TOPMed’s structure information. These are all summarized in the orientation material for new TOPMed investigators.

Guidelines

Nominations for Working Group membership go through the TOPMed Project/center PIs. Anyone who wishes to join a Working Group or nominate someone else should contact the PI of the TOPMed Project/center with which the nominee is affiliated. If the PI approves, the PI should contact the ACC (topmed-admin@westat.com), which will add the approved nominee to the TOPMed directory and inform the convener(s) of the relevant Working Group.

Investigators from individual TOPMed studies are an important source of information on the availability of phenotypes from these studies, and Working Groups are encouraged to discuss availability directly with those familiar with individual studies. To help, a list of designated phenotype liaisons from each TOPMed study is available.

To better understand the resources available in TOPMed, the ACC has also prepared:

  • A survey of phenotypes available in phase 1 TOPMed studies, returned by their PIs
  • A collection of links to study design papers, clinic forms/questionnaires, manuals/protocols, data documentation/dictionaries, all of which may contain further information
  • Phase 1 and Phase 2 summaries of PI guidance for which study-consent groups can be used for analyses in each working group domain

The following tools are useful for identifying source phenotypes on dbGaP:

  • Searches of dbGaP. A tutorial on these searches is available, and a list of TOPMed study accession numbers may also be helpful
  • Freeze-specific sequencing methods documents (see Genetic data) contain links between TOPMed study accessions and prior study accessions, as phenotypes may reside in these prior accessions
  • PhenoExplorer, a web tool for cross-study phenotype identification in dbGaP and elsewhere. Registration is required, but signup is free and simple

The DCC has built a system for phenotype harmonization across TOPMed studies, which uses only phenotypes available in released dbGaP accessions. In the future, an associated online system will provide inventories of source phenotypes available for harmonization and of harmonized phenotypes, a form to specify and request harmonization of phenotypes, and documentation of the harmonization process. The work done in WGs to identify source phenotypes and design harmonization algorithms will be used when implementing this system. The DCC's capacity for centralized harmonization is limited, so many Working Groups are performing their own phenotype harmonizations.

The DCC has compiled some points to consider when harmonizing phenotype variables across studies. WGs can use this information to document their phenotype harmonization work, as shown in this example for height.

Compared to the phenotype information, genotype information is relatively homogeneous across studies. Single nucleotide variant and short insertion/deletion genotype calls are generated by the IRC from TOPMed project sequencing data, jointly across subjects from all participating studies, in periodic data freezes. Information about each freeze is available on the genotypes page. The genotypes from each freeze are provided as single-chromosome .vcf files, compressed with bgzip and indexed with tabix. Each .vcf file has genotypes for individuals from all of the studies registered in dbGaP at the time of the freeze. In addition, study-specific .vcf files are provided to each study PI from a password-protected FTP site at the IRC.
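As an illustration of working with a bgzip-compressed .vcf file, the following minimal Python sketch lists the sample IDs found in the file's header. It relies on the fact that bgzip output is a series of standard gzip members, so Python's built-in gzip module can read it sequentially without a tabix index. The function name read_vcf_samples is hypothetical and is not part of any TOPMed tooling; the sketch assumes a standard VCF layout in which nine fixed columns (CHROM through FORMAT) precede the per-sample columns.

```python
import gzip


def read_vcf_samples(path):
    """Return the sample IDs from the #CHROM header line of a bgzipped VCF.

    Hypothetical helper for illustration only. A sites-only file has no
    FORMAT or sample columns, so an empty list is returned for it.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                fields = line.rstrip("\n").split("\t")
                # The first nine columns are the fixed VCF fields
                # (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT);
                # everything after them is one column per sample.
                return fields[9:]
            if not line.startswith("#"):
                # Reached the data records without seeing a header line.
                break
    raise ValueError("no #CHROM header line found in " + str(path))
```

Calling read_vcf_samples on a downloaded single-chromosome file returns its sample IDs in column order, which can then be compared against the sample annotation for the same freeze. Region-based queries over the tabix index would need a dedicated tool or library and are outside the scope of this sketch.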

The original sequence reads for phase 1 studies have been submitted to the NCBI Sequence Read Archive (SRA) and are available in .sra format. The IRC provides initial sequence QC information, with a public version and a more detailed password-protected version.

The IRC will provide:

  • Support vector machine (SVM) site level variant quality score
  • Component features used in the SVM score
  • Variant annotation from SnpEff 4.1 (only in the sites-only .vcf file to save space)
  • Sample-subject ID mapping NWD_ID to each study’s “submitted_subject_id” from dbGaP
  • Sample annotation – sex, if reported in dbGaP, and TOPMed study phs number

The ACC will provide:

  • Sample annotation, including mapping between sample ID (e.g., NWD_ID) and subject ID
  • WGSA annotation of variants
  • Kinship coefficients and principal components

The location of these resources in the common Exchange Area for any given freeze is specified under Genetic data.
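The sample-to-subject mapping described above can be applied with a simple lookup. The sketch below is a minimal Python example under stated assumptions: the mapping is assumed to be a tab-delimited text file with a header row, and the column names NWD_ID and submitted_subject_id are taken from the description above rather than from any confirmed file layout. The function names load_id_map and map_samples are hypothetical.

```python
import csv


def load_id_map(path, sample_col="NWD_ID", subject_col="submitted_subject_id"):
    """Read a tab-delimited mapping file into a {sample ID: subject ID} dict.

    The file layout and column names are assumptions for illustration;
    adjust them to match the actual annotation file for a given freeze.
    """
    id_map = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            sample_id = row[sample_col]
            if sample_id in id_map:
                # Each sample ID is expected to appear once in the mapping.
                raise ValueError("duplicate sample ID: " + sample_id)
            id_map[sample_id] = row[subject_col]
    return id_map


def map_samples(sample_ids, id_map):
    """Split sample IDs into those with a subject ID and those without."""
    mapped = {s: id_map[s] for s in sample_ids if s in id_map}
    missing = [s for s in sample_ids if s not in id_map]
    return mapped, missing
```

Returning the unmapped sample IDs separately, rather than silently dropping them, makes it easier to spot samples that are present in a genotype file but absent from the annotation before phenotype data are joined.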

Before commencing any analysis of TOPMed data, an investigator must engage with relevant WG(s) to develop and submit a paper proposal. Specifically, the work of TOPMed investigators generally falls within the scope of the TOPMed Publications Policy, which is intended to ensure transparency, enable productivity tracking, and promote synergy across the TOPMed projects and studies. See the Paper Proposal Instructions document and the Manuscript Writing and Submission Guidelines for full procedural details.

All relevant forms of data (aligned sequence data, genotype call sets, and phenotype data) may be stored in the TOPMed Exchange Area (EA), a temporary holding area at dbGaP that gives TOPMed investigators pre-release access to data (i.e., prior to release to the general scientific community). The EA consists of multiple components: a combined EA for cross-study genotype call sets and study-specific EAs for phenotypic and other data types. Within each study-specific EA there is a link to the aligned sequence data files.

TOPMed studies are also being released in dbGaP for access by the scientific community. Following release of a major data freeze by the IRC, the DCC works with study investigators to complete study-specific QC and pre-curation of dbGaP files to facilitate timely release on dbGaP. Search the dbGaP site for “TOPMed” to identify currently released accessions.

Phenotypes harmonized centrally by the DCC are added to the study-specific EAs as they are generated. In addition, many Working Groups are exchanging harmonized phenotype data through the study-specific EAs. Data in the EAs are accessible only to TOPMed investigators (as designated by TOPMed study PIs), and access is obtained by application to dbGaP; see TOPMed Data Sharing Policies for more information.

The ACC suggests two approaches to obtaining and sharing TOPMed data:

  • Approach 1: Files may be uploaded to a study-specific EA by each study that has completed dbGaP registration. Individuals designated as eligible by study PIs may apply for access to the study-specific EAs, which also provides access to the combined EA containing genotype call sets. These applications are reviewed by the NHLBI Data Access Committee. Approved applicants may download data from the Exchange Areas to their local institutional servers. This approach is outlined under Data sharing through the dbGaP Exchange Areas.
  • Approach 2: This is similar to approach 1, but the application process is coordinated among a group of investigators who intend to share data with one another, generally in a cloud computing environment. This approach is described under Sharing dbGaP Data in a Cloud Environment.

Investigators should note that:

  • Investigators named on Data Access Requests are responsible for assuring that all analyses performed with the data they download are compliant with data-use limitations
  • These procedures for sharing individual-level data have more administrative overhead than familiar GWAS results-sharing approaches, which rely on meta-analysis of study-specific results. However, this form of meta-analysis is not recommended for WGS data: it would require sharing impractically large files, and it severely limits the forms of statistical analysis that can be performed successfully. Once access to the data is obtained, performing the analysis in one step should also be much faster and less error-prone than creating, sharing, and meta-analyzing results files from individual studies.

Permission to use is separate from permission to access data. Permissions and mechanisms for access are discussed in the previous section. Permission to use TOPMed data is granted through the dataset selection process. In brief, the submitter (or their designate) of an approved paper proposal requests datasets (study-consent groups) via an online form. Dataset contacts review the requests and either approve or decline. No study-consent group may be used in a paper proposal or manuscript absent an approved dataset request. This process is described in more detail in the Paper Proposal Instructions.

Links to detailed descriptions of methods of data acquisition, data processing and quality assurance/quality control (QA/QC) can be found in the methods documents linked in the Genetic data page.

While the QA/QC checks performed by the sequencing centers, IRC and ACC are expected to be very useful, they are not exhaustive. In particular, WGs should carefully consider how issues of confounding (by ancestry, batch effects, or some other mechanism) affect their analysis.

Analysis of whole genome sequencing (WGS) data requires considerable computing resources; it is not expected that analyses will be performed on a single-user computer system. As with GWAS, various institutions are establishing their own “pipelines” that offer high-throughput association analyses for WGS data and corresponding annotation information; e.g. see TOPMed Cloud Pilots. NHLBI is also innovating cloud-based analysis platforms in DataSTAGE, with TOPMed serving as a key source for developing primary use cases.

Please provide regular updates on the progress of your paper proposal via the Progress Report form. When the proposal progresses to the manuscript stage, refer to TOPMed Manuscript Writing and Submission Guidelines. Manuscripts must be submitted to the TOPMed website, linked to their originating proposal, prior to submission to a journal.

Please reach out to TOPMed Key Contacts for topic-specific information.
