
Repositories

Researchers usually need to submit the human genomics data they generate to a central repository, both to obtain the accession required for publication (PLOS ONE, 2019; Cannon et al., 2021; CellPress, 2021; Springer Nature, 2022) and to meet funding requirements. Where to submit depends on both the data type and the permissions that govern data use.

Tools such as EMBL-EBI's data submission wizard, which finds the right EMBL-EBI repository for a dataset, can help a researcher decide where to submit their data.

Open Access ‘Omics Archives

The International Nucleotide Sequence Database Collaboration (INSDC) is the major global provider of infrastructure to archive and provide open access to genetic sequencing data (Arita et al., 2020).

Ownership of data remains with the original data providers, but the INSDC requires free and unrestricted access to all data records without use restrictions, licensing requirements or fees for use (Arita et al., 2020).

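The INSDC is composed of three nodes: the DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (ENA) at EMBL-EBI, and GenBank at NCBI.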

Data submitted to each node must adhere to the INSDC data and metadata standard, which is implemented as a set of XML schemas. This shared data model enables each archive to mirror all data submitted to any INSDC node. Archival accessions follow a consistent pattern across archives, with a node-specific letter indicating where the data were originally submitted: ‘E’ for ENA, ‘D’ for DDBJ and ‘S’ for GenBank (see the ENA Accession numbers documentation for more information).
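The lettering convention above can be sketched in a few lines of Python. This is a minimal illustration, not an official parser: it assumes read-archive style accessions made up of the node letter, an `R`, a record-type letter and a run of digits (for example the `ERR`, `DRX` or `SRP` prefixes), and the function name `originating_node` is hypothetical. Other accession families, such as BioProject or BioSample identifiers, use different prefixes and are deliberately not handled here.

```python
import re
from typing import Optional

# Assumed shape of a read-archive accession: node letter (E, D or S),
# the letter R, a record-type letter, then one or more digits.
ACCESSION_PATTERN = re.compile(r"^([EDS])R[RXSPAZ]\d+$")

# Node lettering as described in the text above.
NODE_BY_LETTER = {
    "E": "ENA",
    "D": "DDBJ",
    "S": "GenBank",
}


def originating_node(accession: str) -> Optional[str]:
    """Return the node an accession was first submitted to, or None
    if the accession does not match the assumed pattern."""
    match = ACCESSION_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    return NODE_BY_LETTER[match.group(1)]
```

For example, `originating_node("ERR123456")` (a made-up accession) returns `"ENA"`, while an identifier outside the assumed pattern returns `None` rather than a guess.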

While not part of the INSDC, the China National Center for Bioinformation (CNCB) and its National Genomics Data Center (NGDC) provide archives that follow the same themes and data standards as the INSDC archives (CNCB-NGDC, 2019). They also mirror the metadata of data held in INSDC archives, enabling global search across all four major genetic sequencing archives.

Table 1. Databases of the three INSDC nodes (adapted from Arita et al., 2020) and of the CNCB-NGDC. INSDC resources are coloured in blue in the original table.

| Institute | Annotated sequences | NGS Reads | Project Metadata | Sample Information | Functional Genomics | Processed functional genomics | Human Genomes (controlled) | Metabolomics | Proteomics |
|---|---|---|---|---|---|---|---|---|---|
| DDBJ | DDBJ | SRA | BioProject | BioSample | GEA | | JGA | MetaboBank | |
| EMBL-EBI | ENA | ENA | BioStudies | BioSamples | ArrayExpress | Expression Atlas | EGA | MetaboLights | PRIDE |
| NCBI | GenBank | SRA | BioProject | BioSample | GEO | | dbGaP | | |
| CNCB-NGDC | GWH | GSA | BioProject | BioSample | GEN | | GSA-Human | | |
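As a rough illustration of how the choice of archive depends on both the institute and the access permissions, the sketch below encodes two columns of Table 1 (NGS reads and controlled-access human genomes) as a lookup. The dictionary `READ_ARCHIVES` and the function `suggest_read_archive` are hypothetical names introduced for this example. The sketch covers only the table lookup and leaves out the consent, legal and funding considerations that apply to a real submission.

```python
# Archive names copied from Table 1: where sequencing reads go when the data
# are open access, and where controlled-access human data go, per institute.
READ_ARCHIVES = {
    "DDBJ": {"open": "SRA", "controlled": "JGA"},
    "EMBL-EBI": {"open": "ENA", "controlled": "EGA"},
    "NCBI": {"open": "SRA", "controlled": "dbGaP"},
    "CNCB-NGDC": {"open": "GSA", "controlled": "GSA-Human"},
}


def suggest_read_archive(institute: str, controlled_access: bool) -> str:
    """Look up the archive Table 1 lists for sequencing reads at an institute."""
    if institute not in READ_ARCHIVES:
        raise ValueError(f"Institute not listed in Table 1: {institute!r}")
    access = "controlled" if controlled_access else "open"
    return READ_ARCHIVES[institute][access]
```

Calling `suggest_read_archive("EMBL-EBI", controlled_access=True)` returns `"EGA"`, and the same call with `controlled_access=False` returns `"ENA"`.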

Controlled Access data repositories

When dealing with human data, it is often necessary to archive data in a controlled access repository in order to comply with consent, ethics and legal requirements. Controlled access means that anyone who wants to use the data must apply for access and agree to use the data in accordance with a data use policy. Repositories differ in who makes that decision: in some, the repository's own Data Access Committee (DAC) decides whether a data access request (DAR) complies with the policy and whether access is granted; in others, the data remain under the control of a dataset-specific DAC.

Within Australia, the need to keep data onshore and the challenges involved in archiving in major international repositories mean that the majority of Australian human genomics research data remains siloed and un-FAIR in unsearchable institutional repositories. However, of the major controlled access archives, the European Genome-phenome Archive (EGA) is the one most used by Australian researchers, as it is the only one available to researchers without the need for specific funding arrangements or explicit approval.

Data repositories summary table

For the full-size interactive table, see: https://marion-biocommons.shinyapps.io/field_guide_repo_table/

References

  1. Springer Nature. (2022). Mandated data types | Authors | Springer Nature. https://www.springernature.com/gp/authors/research-data-policy/repositories-socsci/19540364
  2. Cannon, M., Graf, C., McNeice, K., Chan, W. M., Callaghan, S., Carnevale, I., Cranston, I., Edmunds, S. C., Everitt, N., Ganley, E., Hrynaszkiewicz, I., Khodiyar, V. K., Leary, A., Lemberger, T., MacCallum, C. J., Murray, H., Sharples, K., Soares E Silva, M., Wright, G., … (Moderator) Sansone, S.-A. (2021). Repository Features to Help Researchers: An invitation to a dialogue. Zenodo. https://doi.org/10.5281/zenodo.4683794
  3. CellPress. (2021). Author’s guide: Standardized datatypes, datatype specific repositories, and general-purpose repositories recommended by Cell Press. https://www.cell.com/pb-assets/journals/research/cellpress/data/RecommendRepositories-1621989644133.pdf
  4. Arita, M., Karsch-Mizrachi, I., & Cochrane, G. (2020). The international nucleotide sequence database collaboration. Nucleic Acids Research, 49(D1), D121–D124. https://doi.org/10.1093/nar/gkaa967
  5. CNCB-NGDC. (2019). Genome Sequence Archive for Human - Policies. https://ngdc.cncb.ac.cn/gsa-human/policy/policy.jsp#responsibilitiesSubmitter
  6. PLOS ONE. (2019). Data Availability. https://journals.plos.org/plosone/s/data-availability

Relevant tools and resources

| Tool or resource | Description | Related pages / Registry |
|---|---|---|
| DNA Data Bank of Japan (DDBJ) | Japan-based nucleotide sequence archive database and accompanying database tools for sequence submission, entry retrieval and annotation analysis. | BioSample, Tool info, Publication |
| EMBL-EBI's data submission wizard | EMBL-EBI's wizard for finding the right EMBL-EBI repository for your data. | |
| European Nucleotide Archive (ENA) | A record of sequence information scaling from raw sequencing reads to assemblies and functional annotation. | Tool info, Standards/Databases, Training, Publication |
| GenBank | A database of genetic sequence information. GenBank may also refer to the data format used for storing information around genetic sequence data. | Tool info, Standards/Databases, Training |
| The European Genome-phenome Archive (EGA) | EGA is a service for permanent archiving and sharing of all types of personally identifiable genetic and phenotypic data resulting from biomedical research projects. | BioSample, EGA, Omics Discovery Index, Tool info, Standards/Databases, Training, Publication |