Researchers usually need to submit generated human genomics data to a central repository in order to obtain an accession that will allow publication (ONE, 2019; Cannon et al., 2021; CellPress, 2021; Nature, 2022) and to meet funding requirements. Where to submit depends on both the data type and the permissions governing data use.
Several tools exist to help a researcher decide where to submit their data:
- EMBL-EBI’s data submission wizard
- NCBI submission helper tool: https://submit.ncbi.nlm.nih.gov/
- DDBJ submission guide: https://www.ddbj.nig.ac.jp/documents/data-categories-e.html
- NIH list of Repositories for Sharing Scientific Data: https://sharing.nih.gov/data-management-and-sharing-policy/sharing-scientific-data/repositories-for-sharing-scientific-data
- FAIRsharing: https://fairsharing.org/
Open Access ‘Omics Archives
The International Nucleotide Sequence Database Collaboration (INSDC) is the major global provider of infrastructure to archive and provide open access to genetic sequencing data (Arita et al., 2020).
Ownership of data remains with the original data providers, but the INSDC requires free and unrestricted access to all data records without use restrictions, licensing requirements or fees for use (Arita et al., 2020).
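The INSDC is composed of three nodes: the DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (ENA) at EMBL-EBI, and GenBank at NCBI.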
Data submitted to each node must adhere to the INSDC data and metadata standard, which is implemented as a set of XML schemas. The shared data model enables each archive to mirror all data submitted to any INSDC node. Archival accessions provided by each archive follow a consistent pattern, with node-specific lettering that indicates where the data was originally submitted: ‘E’ for ENA, ‘D’ for DDBJ and ‘S’ for GenBank (see ENA Accession numbers documentation for more information).
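The node lettering described above can be sketched in a few lines of Python. This is a minimal illustration, not the full INSDC accession grammar: the regular expression and the function name `originating_node` are assumptions made for this example, and they cover only read-archive-style accessions of the form node letter + `R` + one type letter + digits (for example `ERR`, `DRX`, `SRP`). Other accession families use different prefixes, so the ENA accession documentation remains the authoritative reference.

```python
import re

# Assumed pattern (illustrative only): node letter E, D or S, then "R",
# then a one-letter object type, then at least six digits.
_ACCESSION = re.compile(r"^([EDS])R[A-Z]\d{6,}$")

# 'S' marks records first submitted to the NCBI node, which the text
# above refers to as GenBank.
_NODE_BY_LETTER = {"E": "ENA", "D": "DDBJ", "S": "NCBI"}


def originating_node(accession):
    """Return the node where a record was first submitted, or None
    if the accession does not match the assumed pattern."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        return None
    return _NODE_BY_LETTER[match.group(1)]
```

Because every node mirrors the others, the letter records only where a submission originated; it does not restrict which archive the record can be retrieved from.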
While not part of the INSDC, the China National Center for Bioinformation (CNCB) and its National Genomics Data Center (NGDC) provide archives that follow the same themes and data standards as the INSDC archives (CNCB-NGDC, 2019). They also mirror the metadata of data held in INSDC archives, enabling global search across all four major genetic sequencing archives.
Table 1. Databases of the three INSDC nodes (adapted from Arita et al., 2020) and of the CNCB-NGDC. INSDC resources are coloured in blue.
Institute | Annotated sequences | NGS reads | Project metadata | Sample information | Functional genomics | Processed functional genomics | Human genomes (controlled) | Metabolomics | Proteomics |
---|---|---|---|---|---|---|---|---|---|
DDBJ | DDBJ | SRA | BioProject | BioSample | GEA | | JGA | MetaboBank | |
EMBL-EBI | ENA | ENA | BioStudies | BioSamples | ArrayExpress | Expression Atlas | EGA | MetaboLights | PRIDE |
NCBI | GenBank | SRA | BioProject | BioSample | GEO | | dbGaP | | |
CNCB-NGDC | GWH | GSA | BioProject | BioSample | GEN | | GSA-Human | | |
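Table 1 can be read as a lookup from an institute and a data type to an archive. The sketch below encodes a subset of the table as a nested dictionary, purely as an illustration. The names `ARCHIVES` and `suggest_archive` are hypothetical, only five of the data-type columns are included, and listing ENA under NGS reads for EMBL-EBI is an assumption about the layout of the original table. It is not a substitute for the submission tools listed at the top of this page.

```python
# Subset of Table 1: institute -> data type -> archive.
# Hypothetical structure for illustration; only five columns are encoded.
ARCHIVES = {
    "DDBJ": {
        "annotated sequences": "DDBJ",
        "ngs reads": "SRA",
        "project metadata": "BioProject",
        "sample information": "BioSample",
        "human genomes (controlled)": "JGA",
    },
    "EMBL-EBI": {
        "annotated sequences": "ENA",
        "ngs reads": "ENA",  # assumption: ENA spans both sequence columns
        "project metadata": "BioStudies",
        "sample information": "BioSamples",
        "human genomes (controlled)": "EGA",
    },
    "NCBI": {
        "annotated sequences": "GenBank",
        "ngs reads": "SRA",
        "project metadata": "BioProject",
        "sample information": "BioSample",
        "human genomes (controlled)": "dbGaP",
    },
    "CNCB-NGDC": {
        "annotated sequences": "GWH",
        "ngs reads": "GSA",
        "project metadata": "BioProject",
        "sample information": "BioSample",
        "human genomes (controlled)": "GSA-Human",
    },
}


def suggest_archive(institute, data_type):
    """Return the archive named in Table 1 for this institute and data
    type, or None if the combination is not in the encoded subset."""
    return ARCHIVES.get(institute, {}).get(data_type.strip().lower())
```

For example, `suggest_archive("NCBI", "Human genomes (controlled)")` returns `"dbGaP"`, while an institute or data type outside the encoded subset returns `None`.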
Controlled Access data repositories
When dealing with human data, it is often necessary to archive data in a controlled access repository in order to comply with consent, ethics and legal regulations. Controlled access means that anyone who wants to use the data must apply for access and agree to use the data in accordance with a data use policy. Repositories differ in who decides whether a data access request (DAR) complies with the policy and whether access is granted: some have their own Data Access Committee (DAC), while in others the data remains under the control of a dataset-specific DAC.
Within Australia, the need to keep data onshore and the challenges involved in archiving in major international repositories mean that the majority of Australian human genomics research data remains siloed and un-FAIR in unsearchable institutional repositories. However, of the major controlled access archives, the European Genome-phenome Archive (EGA) is the one most used by Australian researchers, as it is the only one available to researchers without the need for specific funding arrangements or explicit approval.
Data repositories summary table
For full-size interactive table see here: https://marion-biocommons.shinyapps.io/field_guide_repo_table/
References
- Springer Nature. (2022). Mandated data types | Authors | Springer Nature. https://www.springernature.com/gp/authors/research-data-policy/repositories-socsci/19540364
- Cannon, M., Graf, C., McNeice, K., Chan, W. M., Callaghan, S., Carnevale, I., Cranston, I., Edmunds, S. C., Everitt, N., Ganley, E., Hrynaszkiewicz, I., Khodiyar, V. K., Leary, A., Lemberger, T., MacCallum, C. J., Murray, H., Sharples, K., Soares E Silva, M., Wright, G., … (Moderator) Sansone, S.-A. (2021). Repository Features to Help Researchers: An invitation to a dialogue. Zenodo. https://doi.org/10.5281/zenodo.4683794
- Arita, M., Karsch-Mizrachi, I., & Cochrane, G. (2020). The international nucleotide sequence database collaboration. Nucleic Acids Research, 49(D1), D121–D124. https://doi.org/10.1093/nar/gkaa967
- CNCB-NGDC. (2019). Genome Sequence Archive for Human - Policies. https://ngdc.cncb.ac.cn/gsa-human/policy/policy.jsp#responsibilitiesSubmitter
- PLOS ONE. (2019). Data Availability. In Data Availability | PLOS ONE. https://journals.plos.org/plosone/s/data-availability
Relevant tools and resources
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
DNA Data Bank of Japan (DDBJ) | Japan-based nucleotide sequence archive database and accompanying database tools for sequence submission, entry retrieval and annotation analysis. | BioSample | Tool info Publication |
EMBL-EBI's data submission wizard | EMBL-EBI's wizard for finding the right EMBL-EBI repository for your data. | | |
European Nucleotide Archive (ENA) | A record of sequence information scaling from raw sequencing reads to assemblies and functional annotation. | | Tool info Standards/Databases Training Publication |
GenBank | A database of genetic sequence information. GenBank may also refer to the data format used for storing information around genetic sequence data. | | Tool info Standards/Databases Training |
The European Genome-phenome Archive (EGA) | EGA is a service for permanent archiving and sharing of all types of personally identifiable genetic and phenotypic data resulting from biomedical research projects. | BioSample EGA Omics Discovery Index | Tool info Standards/Databases Training Publication |