Repositories

Researchers usually need to submit the human genomics data they generate to a central repository to obtain an accession, both because publishers require one for publication (PLOS ONE, 2019; Cannon et al., 2021; CellPress, 2021; Springer Nature, 2022) and to meet funding requirements. Where to submit depends on both the data type and the permissions governing data use.

Several tools exist to help a researcher decide where to submit their data:

EMBL-EBI’s data submission wizard
NCBI submission helper tool: https://submit.ncbi.nlm.nih.gov/
DDBJ submission guide: https://www.ddbj.nig.ac.jp/documents/data-categories-e.html
NIH list of Repositories for Sharing Scientific Data: https://sharing.nih.gov/data-management-and-sharing-policy/sharing-scientific-data/repositories-for-sharing-scientific-data
FAIRsharing: https://fairsharing.org/

Open Access ‘Omics Archives

The International Nucleotide Sequence Database Collaboration (INSDC) is the major global provider of infrastructure to archive and provide open access to genetic sequencing data (Arita et al., 2020).

Ownership of data remains with the original data providers, but the INSDC requires free and unrestricted access to all data records without use restrictions, licensing requirements or fees for use (Arita et al., 2020).

The INSDC is composed of three nodes:

The European Nucleotide Archive (ENA)
The DNA Data Bank of Japan (DDBJ)
GenBank

Data submitted to each node must adhere to the INSDC data and metadata standard, which is implemented as a set of XML schemas. This shared data model enables each archive to mirror all data submitted to any INSDC node. Archival accessions issued by each archive follow a consistent pattern, with a node-specific letter indicating where the data were originally submitted: ‘E’ for ENA, ‘D’ for DDBJ and ‘S’ for GenBank (see the ENA accession numbers documentation for more information).
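The node-specific lettering described above can be illustrated with a minimal sketch. It assumes the read-archive style accession pattern given in the public ENA accession documentation (a node letter, then ‘R’, then a type letter and at least six digits); the function name and the type labels are hypothetical, and other accession families, such as project or sample accessions, follow different patterns that are not covered here.

```python
import re

# Assumed pattern (from the public ENA accession documentation):
# node letter (E, D or S) + 'R' + object-type letter + six or more digits.
_ACCESSION_RE = re.compile(r"^([EDS])R([APRSXZ])(\d{6,})$")

# Node labels as used in the text above.
_NODE_BY_LETTER = {"E": "ENA", "D": "DDBJ", "S": "GenBank"}

# Illustrative labels for the object-type letter (an assumption of this sketch).
_TYPE_BY_LETTER = {
    "A": "submission",
    "P": "study",
    "S": "sample",
    "X": "experiment",
    "R": "run",
    "Z": "analysis",
}


def describe_accession(accession):
    """Return (node, object_type) for an accession, or None if unrecognised."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    node_letter, type_letter, _digits = match.groups()
    return _NODE_BY_LETTER[node_letter], _TYPE_BY_LETTER[type_letter]


if __name__ == "__main__":
    for example in ("ERR123456", "DRX000001", "SRP012345", "not-an-accession"):
        print(example, describe_accession(example))
```

Because every archive mirrors the others, the letter indicates only where a record was first submitted, not where it can be retrieved.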

While not part of the INSDC, the China National Center for Bioinformation (CNCB) and its National Genomics Data Center (NGDC) provide archives that follow the same themes and data standards as the INSDC archives (CNCB-NGDC, 2019). They also mirror the metadata of data held in INSDC archives, enabling global search across all four major genetic sequencing archives.

Table 1. Databases of the three INSDC nodes (adapted from Arita et al., 2020) and of the CNCB-NGDC. INSDC resources coloured in blue.

Institute	Annotated sequences	NGS Reads	Project Metadata	Sample Information	Functional Genomics	Processed functional genomics	Human Genomes (controlled)	Metabolomics	Proteomics
DDBJ	DDBJ	SRA	BioProject	BioSample	GEA		JGA	Metabobank
EMBL-EBI	ENA		BioStudies	BioSamples	ArrayExpress	Expression Atlas	EGA	Metabolights	PRIDE
NCBI	GenBank	SRA	BioProject	BioSample	GEO		dbGaP
CNCB-NGDC	GWH	GSA	BioProject	BioSample		GEN	GSA-Human
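As a rough illustration of how Table 1 could be used to shortlist archives by data type, the sketch below transcribes the table into a nested dictionary and looks up one column. The function name is hypothetical and empty cells are omitted. This is only a lookup over Table 1, not a submission recommendation: permissions and consent also determine where data can go.

```python
# Table 1 transcribed as {institute: {data type: repository}}.
# Empty cells in the table are omitted.
TABLE_1 = {
    "DDBJ": {
        "Annotated sequences": "DDBJ",
        "NGS Reads": "SRA",
        "Project Metadata": "BioProject",
        "Sample Information": "BioSample",
        "Functional Genomics": "GEA",
        "Human Genomes (controlled)": "JGA",
        "Metabolomics": "Metabobank",
    },
    "EMBL-EBI": {
        "Annotated sequences": "ENA",
        "Project Metadata": "BioStudies",
        "Sample Information": "BioSamples",
        "Functional Genomics": "ArrayExpress",
        "Processed functional genomics": "Expression Atlas",
        "Human Genomes (controlled)": "EGA",
        "Metabolomics": "Metabolights",
        "Proteomics": "PRIDE",
    },
    "NCBI": {
        "Annotated sequences": "GenBank",
        "NGS Reads": "SRA",
        "Project Metadata": "BioProject",
        "Sample Information": "BioSample",
        "Functional Genomics": "GEO",
        "Human Genomes (controlled)": "dbGaP",
    },
    "CNCB-NGDC": {
        "Annotated sequences": "GWH",
        "NGS Reads": "GSA",
        "Project Metadata": "BioProject",
        "Sample Information": "BioSample",
        "Processed functional genomics": "GEN",
        "Human Genomes (controlled)": "GSA-Human",
    },
}


def repositories_for(data_type):
    """Return {institute: repository} for one column of Table 1."""
    return {
        institute: columns[data_type]
        for institute, columns in TABLE_1.items()
        if data_type in columns
    }


if __name__ == "__main__":
    print(repositories_for("Human Genomes (controlled)"))
```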

Controlled Access data repositories

When dealing with human data, it is often necessary to archive data in a controlled access repository in order to comply with consent, ethics and legal regulations. Controlled access means that anyone who wants to use the data must apply for access and agree to use the data in accordance with a data use policy. Repositories differ in who decides whether a data access request (DAR) complies with the policy and should be granted: some have their own Data Access Committee (DAC), while in others the data remain under the control of a dataset-specific DAC.

Within Australia, the need to keep data onshore and the challenges involved in archiving in major international repositories mean that the majority of Australian human genomics research data remains siloed in unsearchable institutional repositories and is not FAIR. Of the major controlled access archives, the European Genome-phenome Archive (EGA) is the one most used by Australian researchers, as it is the only one available to researchers without specific funding arrangements or explicit approval.

Data repositories summary table

A full-size interactive version of the table is available at https://marion-biocommons.shinyapps.io/field_guide_repo_table/

References

Springer Nature. (2022). Mandated data types | Authors | Springer Nature. https://www.springernature.com/gp/authors/research-data-policy/repositories-socsci/19540364
Cannon, M., Graf, C., McNeice, K., Chan, W. M., Callaghan, S., Carnevale, I., Cranston, I., Edmunds, S. C., Everitt, N., Ganley, E., Hrynaszkiewicz, I., Khodiyar, V. K., Leary, A., Lemberger, T., MacCallum, C. J., Murray, H., Sharples, K., Soares E Silva, M., Wright, G., … (Moderator) Sansone, S.-A. (2021). Repository Features to Help Researchers: An invitation to a dialogue. Zenodo. https://doi.org/10.5281/zenodo.4683794
CellPress. (2021). Author’s guide: Standardized datatypes, datatype specific repositories, and general-purpose repositories recommended by Cell Press. https://www.cell.com/pb-assets/journals/research/cellpress/data/RecommendRepositories-1621989644133.pdf
Arita, M., Karsch-Mizrachi, I., & Cochrane, G. (2020). The international nucleotide sequence database collaboration. Nucleic Acids Research, 49(D1), D121–D124. https://doi.org/10.1093/nar/gkaa967
CNCB-NGDC. (2019). Genome Sequence Archive for Human - Policies. https://ngdc.cncb.ac.cn/gsa-human/policy/policy.jsp#responsibilitiesSubmitter
PLOS ONE. (2019). Data Availability. https://journals.plos.org/plosone/s/data-availability

Relevant tools and resources

Tool or resource	Description	Related pages	Registry
DNA Data Bank of Japan (DDBJ)	Japan-based Nucleotide sequence archive database and accompanying database tools for sequence submission, entry retrieval and annotation analysis.	BioSample	Tool info Publication
EMBL-EBI's data submission wizard	EMBL-EBI's wizard for finding the right EMBL-EBI repository for your data.
European Nucleotide Archive (ENA)	A record of sequence information scaling from raw sequencing reads to assemblies and functional annotation.		Tool info Standards/Databases Training Publication
GenBank	A database of genetic sequence information. GenBank may also refer to the data format used for storing information around genetic sequence data.		Tool info Standards/Databases Training
The European Genome-phenome Archive (EGA)	EGA is a service for permanent archiving and sharing of all types of personally identifiable genetic and phenotypic data resulting from biomedical research projects	BioSample EGA Omics Discovery Index	Tool info Standards/Databases Training Publication
