Globus, developed and maintained by the University of Chicago, is a transformative research data management platform designed to address the challenges of moving, sharing, and discovering large-scale scientific data across distributed systems. By abstracting the complexities of data transfer protocols, storage systems, and security frameworks, Globus empowers researchers to focus on their scientific objectives rather than infrastructure logistics. This article explores Globus’ core functionalities, use cases, limitations, applications in human genomics, and operational workflows, supplemented by a case study demonstrating its real-world impact.
Understanding Globus: Purpose and Architecture
Defining Globus
Globus is a cloud-based service that provides secure, reliable, and scalable solutions for research data management. It operates as a not-for-profit platform, offering both free and subscription-based tiers to academic institutions, national laboratories, and commercial entities (1, 6). The system’s architecture integrates federated identity management (via Globus Auth), high-performance data transfer (via GridFTP and HTTPS), and programmable interfaces (REST APIs, Python SDK) to create a unified ecosystem for data handling (1, 6).
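As a concrete, hedged illustration of this layered design, the sketch below calls the Transfer service’s REST API directly over HTTPS to search for endpoints by name. The bearer token is a placeholder that Globus Auth would issue, and the search string is purely illustrative.

```python
# Minimal sketch, assuming a valid Globus Auth bearer token for the
# Transfer service; the search string is illustrative.
import requests

TOKEN = "YOUR-GLOBUS-AUTH-ACCESS-TOKEN"  # placeholder, issued by Globus Auth

resp = requests.get(
    "https://transfer.api.globus.org/v0.10/endpoint_search",
    params={"filter_fulltext": "genomics", "limit": 5},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
for ep in resp.json()["DATA"]:  # endpoint documents matching the query
    print(ep["id"], ep["display_name"])
```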
Core Functionalities
- Data Transfer: Globus enables seamless movement of data ranging from kilobytes to petabytes between endpoints, including supercomputers, cloud storage, and personal devices (1, 6).
- Secure Sharing: Researchers can share data with collaborators using email addresses or institutional identities, with fine-grained access controls for protected datasets (6, 8).
- Automation: Tools like Globus Flows allow users to orchestrate multi-step data workflows, such as periodic transfers between storage systems or integration with high-performance computing (HPC) resources (1, 7).
- Discovery: The platform supports metadata cataloguing and search APIs, enabling the creation of domain-specific data portals (1, 2).
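To make the discovery functionality concrete, the minimal sketch below queries a Globus Search index with the official Python SDK (globus-sdk). The index UUID and query string are placeholders, and it assumes a publicly searchable index (protected indexes require an authorizer).

```python
# Minimal sketch, assuming a public Globus Search index; the index UUID
# and query string are placeholders.
import globus_sdk

INDEX_ID = "YOUR-SEARCH-INDEX-UUID"  # placeholder

sc = globus_sdk.SearchClient()  # anonymous client suffices for public indexes
results = sc.search(INDEX_ID, q="whole-genome sequencing GRCh38", limit=5)
for hit in results["gmeta"]:
    print(hit["subject"])  # each subject identifies one catalogued record
```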
Use Cases Across Scientific Domains
Genomics and Biomedical Research
Globus has become indispensable in genomics, where next-generation sequencing (NGS) generates terabytes of raw data per experiment. For example, the University of Michigan’s Advanced Genomics Core processes 6 TB of sequencing data per run, using Globus to automate transfers between sequencers, HPC clusters, and long-term archives (4, 12). Its support for handling protected health information (PHI) in HIPAA-compliant configurations makes it suitable for clinical genomics collaborations (8, 13).
Multi-Institutional Collaborations
Projects like the Natural Hazards Engineering Research Infrastructure (NHERI) rely on Globus to distribute earthquake simulation datasets across 16 institutions. Similarly, the CyberShake project transferred over 700 TB of seismic hazard models between Oak Ridge National Laboratory and the San Diego Supercomputer Center using Globus’ fault-tolerant transfer protocols (2, 7).
Cloud and Hybrid Infrastructure Integration
Organisations such as Rice University leverage Globus connectors to bridge on-premises HPC resources with cloud storage (e.g., Google Drive, AWS S3). This hybrid approach enables cost-effective archiving while maintaining low-latency access to active datasets (2, 10).
Data Publication and DOI Assignment
Oak Ridge National Laboratory’s Digital Object Identifier (DOI) service uses Globus shared endpoints to streamline data publication. Researchers assemble datasets across multiple storage systems, which Globus then packages with metadata for DOI minting via DataCite (2, 13).
Technical Limitations and Considerations
Data Size and Throughput Constraints
While Globus excels at bulk transfers, some of its services impose hard limits:
- Globus Compute: 10 MB maximum payload per task submission (3)
- Google Drive connector: 750 GB daily upload quota (10)
- Google Shared Drives: 400,000-item limit per drive (10)
These constraints necessitate alternative strategies for exascale workflows, such as combining Globus transfers with parallel filesystem mounts for in-situ processing (3, 7).
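One hedged sketch of such a strategy: keep large files on storage the compute endpoint already mounts and send only their paths, so task payloads stay far below the 10 MB cap. The snippet below uses the globus-compute-sdk Executor; the endpoint UUID and VCF path are placeholders.

```python
# Minimal sketch: stay under the 10 MB Globus Compute payload limit by
# sending a file *path* (a short string) instead of the file contents.
# The endpoint UUID and path are placeholders.
from globus_compute_sdk import Executor

COMPUTE_ENDPOINT = "YOUR-COMPUTE-ENDPOINT-UUID"  # placeholder

def count_variant_records(vcf_path: str) -> int:
    # Executes remotely: the large VCF is read from the endpoint's own
    # filesystem, so only the path string travels in the task payload.
    with open(vcf_path) as fh:
        return sum(1 for line in fh if not line.startswith("#"))

with Executor(endpoint_id=COMPUTE_ENDPOINT) as gce:
    future = gce.submit(count_variant_records, "/scratch/cohort1/sample42.vcf")
    print("variant records:", future.result())
```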
Permission Management Challenges
Globus currently lacks native support for propagating custom permissions when transferring data to Google Drive. All uploaded files inherit parent folder permissions, complicating multi-user collaborations (10).
Task Lifetime Limitations
Globus Compute tasks left unretrieved for more than two weeks are automatically purged, so users must fetch results promptly or implement checkpointing for long-running analyses (3).
Globus in Human Genomics: A Paradigm Shift
End-to-End Workflow Automation
The Globus Genomics platform integrates with the Galaxy workflow engine to automate NGS analysis pipelines. Researchers at the University of Washington reduced exome processing time from 24 hours to 2.5 hours by combining Globus transfers with AWS spot instances (7, 12). Key features include:
- Automated retrieval of raw FASTQ files from sequencing cores (4, 14)
- Elastic provisioning of cloud resources for alignment (BWA) and variant calling (GATK) (7, 12)
- Secure delivery of VCF results to collaborators via encrypted guest collections (8, 13)
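As a hedged sketch of what launching such a pipeline can look like through the Globus Flows API, the snippet below starts a run of a hypothetical NGS flow via the Python SDK. The flow UUID, input fields, and token are invented for illustration rather than taken from Globus Genomics itself; see the Operational Guide sketch later in this article for a full login flow.

```python
# Minimal sketch, assuming a deployed flow whose input schema accepts the
# fields below; FLOW_ID, the token, and the body are hypothetical.
import globus_sdk

FLOW_ID = "YOUR-FLOW-UUID"  # placeholder

# A real script would obtain a token scoped to this specific flow.
authorizer = globus_sdk.AccessTokenAuthorizer("FLOW-SCOPED-ACCESS-TOKEN")  # placeholder

sfc = globus_sdk.SpecificFlowClient(FLOW_ID, authorizer=authorizer)
run = sfc.run_flow(
    body={
        "source_path": "/seq/run_0421/fastq/",      # raw FASTQ staging area
        "destination_path": "/results/run_0421/",   # where VCFs are delivered
        "reference_genome": "GRCh38",
    },
    label="exome run 0421",
)
print("run id:", run["run_id"])
```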
Case Study: University of Michigan Advanced Genomics Core
Challenge
With sequencing output doubling every 20 months, the Core faced unsustainable data growth—6 TB per instrument run, plus 6.5 TB of processed data (4). Traditional methods (FTP, physical drives) caused delays and risked PHI exposure.
Solution Implementation
- Push-Pull Architecture: completed runs are pushed from sequencing instruments to a staging endpoint, from which HPC pipelines and end users pull data on demand rather than receiving ad hoc copies
- Tiered Archiving:
  - Active datasets stored on high-performance Lustre filesystems
  - Cold data migrated to tape archives after a 30-day retention period (4)
- Compliance Framework: PHI-bearing datasets confined to access-controlled endpoints, with identity-based permissions and audit logging to meet HIPAA requirements
Outcomes
- 90% reduction in data delivery time (weeks → hours) (4)
- Zero unplanned PHI breaches over three years (13)
- Enabled cross-border collaborations with Plant & Food Research New Zealand, transferring 3 TB of genomic data in three hours (11)
Operational Guide: Using Globus for Data Management
Step-by-Step Workflow
- Endpoint Configuration: install Globus Connect Personal on a laptop or workstation, or register storage with an institutional Globus Connect Server, so that each system becomes an addressable endpoint
- Initiating Transfers: select source and destination collections in the Globus web app, or submit transfers via the CLI or Python SDK; Globus handles retries, checksum verification, and completion notifications automatically
- Sharing Data: create a guest collection on a folder and grant read or write access to collaborators’ Globus identities or email addresses
- Automating Pipelines: schedule recurring transfers with Globus Timers, or chain transfer, compute, and sharing steps using Globus Flows and the SDK (a minimal end-to-end sketch follows this list)
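The sketch below strings the first three steps together with the official Python SDK (globus-sdk); automation is illustrated by the Flows sketch earlier in this article. It is a minimal illustration rather than a production script: the client ID, collection UUIDs, paths, and collaborator identity are placeholders, and the sharing step assumes the destination is a guest (shared) collection.

```python
# Minimal sketch, assuming a native app registered at developers.globus.org;
# the client ID, collection UUIDs, paths, and collaborator identity are
# all placeholders.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"  # placeholder
SRC = "SOURCE-COLLECTION-UUID"           # placeholder
DST = "DESTINATION-COLLECTION-UUID"      # placeholder: a guest (shared) collection

# Step 1: authenticate with Globus Auth (native-app flow) and build a client
auth = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth.oauth2_start_flow(requested_scopes=globus_sdk.TransferClient.scopes.all)
print("Log in at:", auth.oauth2_get_authorize_url())
tokens = auth.oauth2_exchange_code_for_tokens(input("Auth code: ").strip())
access_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(access_token)
)

# Step 2: submit a recursive transfer with checksum verification;
# Globus retries failed files automatically
tdata = globus_sdk.TransferData(
    tc, SRC, DST, label="NGS run delivery", sync_level="checksum"
)
tdata.add_item("/runs/2024-06-01/", "/archive/2024-06-01/", recursive=True)
task_id = tc.submit_transfer(tdata)["task_id"]
tc.task_wait(task_id, timeout=3600, polling_interval=30)

# Step 3: share the delivered folder read-only with a collaborator's
# Globus identity (ACL rules apply to guest collections)
tc.add_endpoint_acl_rule(DST, {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": "COLLABORATOR-IDENTITY-UUID",  # placeholder
    "path": "/archive/2024-06-01/",
    "permissions": "r",
})
```

Note that sync_level="checksum" makes repeated submissions idempotent: files already present with matching checksums are skipped, which suits recurring, automated deliveries.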
Future Directions and Community Impact
Globus continues evolving through partnerships with initiatives like the Open Storage Network (OSN), which leverages its Auth and transfer services to create a 1.5 PB federated storage fabric (2). Emerging features include enhanced metadata search via GraphQL and integration with FAIR data principles for improved reproducibility.
For genomics specifically, ongoing developments focus on:
- Containerised Workflows: Packaging alignment/variant calling tools as Docker images executable via Globus Flows (7, 12)
- AI/ML Integration: Direct data streaming from Globus endpoints to TensorFlow/PyTorch clusters (2, 9)
- Global Metadata Catalogue: Cross-institutional indexing of genomic variants with GA4GH standards (13)
Conclusion
Globus has redefined research data management by providing a unified platform that transcends institutional and disciplinary boundaries. Its impact on human genomics exemplifies how robust cyberinfrastructure can accelerate discovery—from enabling multi-continental collaborations to ensuring compliance with stringent data regulations. While limitations in task scalability and permission granularity persist, ongoing development and community-driven enhancements position Globus as an enduring cornerstone of modern scientific workflows. As data volumes continue their exponential growth, platforms like Globus will remain critical for realising the full potential of collaborative, data-intensive science.