
Technology standards: Globus

Globus, developed and maintained by the University of Chicago, is a transformative research data management platform designed to address the challenges of moving, sharing, and discovering large-scale scientific data across distributed systems. By abstracting the complexities of data transfer protocols, storage systems, and security frameworks, Globus empowers researchers to focus on their scientific objectives rather than infrastructure logistics. This article explores Globus’ core functionalities, use cases, limitations, applications in human genomics, and operational workflows, supplemented by a case study demonstrating its real-world impact.


Understanding Globus: Purpose and Architecture

Defining Globus

Globus is a cloud-based service that provides secure, reliable, and scalable solutions for research data management. It operates as a not-for-profit platform, offering both free and subscription-based tiers to academic institutions, national laboratories, and commercial entities (1, 6). The system’s architecture integrates federated identity management (via Globus Auth), high-performance data transfer (via GridFTP and HTTPS), and programmable interfaces (REST APIs, Python SDK) to create a unified ecosystem for data handling (1, 6).
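A minimal sketch of this programmable interface, using the Globus Python SDK (globus-sdk). The native-app client ID below is a hypothetical placeholder; a real application would register its own with Globus Auth.

```python
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"  # hypothetical; register your own app with Globus Auth

# Federated login via Globus Auth (OAuth2 native-app flow)
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow(requested_scopes=globus_sdk.scopes.TransferScopes.all)
print("Log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Auth code: "))

# Build a Transfer client from the resulting access token
access_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(access_token)
)

# Sanity check: list endpoints the logged-in identity administers
for ep in tc.endpoint_search(filter_scope="my-endpoints"):
    print(ep["display_name"], ep["id"])
```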

Core Functionalities

  1. Data Transfer: Globus enables seamless movement of data ranging from kilobytes to petabytes between endpoints, including supercomputers, cloud storage, and personal devices (1, 6) (see the sketch after this list).
  2. Secure Sharing: Researchers can share data with collaborators using email addresses or institutional identities, with fine-grained access controls for protected datasets (6, 8).
  3. Automation: Tools like Globus Flows allow users to orchestrate multi-step data workflows, such as periodic transfers between storage systems or integration with high-performance computing (HPC) resources (1, 7).
  4. Discovery: The platform supports metadata cataloguing and search APIs, enabling the creation of domain-specific data portals (1, 2).
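As a concrete illustration of functionality 1, a minimal transfer submission with the Python SDK might look like the following; `tc` is the authorized TransferClient from the sketch above, and the collection UUIDs and paths are hypothetical placeholders.

```python
import globus_sdk

SOURCE = "SOURCE-COLLECTION-UUID"       # hypothetical, e.g. a supercomputer
DEST = "DESTINATION-COLLECTION-UUID"    # hypothetical, e.g. cloud storage

tdata = globus_sdk.TransferData(tc, SOURCE, DEST, label="example transfer")
tdata.add_item("/data/run42/", "/archive/run42/", recursive=True)  # whole directory
task = tc.submit_transfer(tdata)
print("Submitted transfer task:", task["task_id"])
```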

Use Cases Across Scientific Domains

Genomics and Biomedical Research

Globus has become indispensable in genomics, where next-generation sequencing (NGS) generates terabytes of raw data per experiment. For example, the University of Michigan’s Advanced Genomics Core processes 6 TB of sequencing data per run, using Globus to automate transfers between sequencers, HPC clusters, and long-term archives (4, 12). The system’s ability to handle protected health information (PHI) under HIPAA compliance makes it suitable for clinical genomics collaborations (8, 13).

Multi-Institutional Collaborations

Projects like the Natural Hazards Engineering Research Infrastructure (NHERI) rely on Globus to distribute earthquake simulation datasets across 16 institutions. The CyberShake project transferred over 700 TB of seismic hazard models between Oak Ridge National Laboratory and the San Diego Supercomputer Center using Globus’ fault-tolerant transfer protocols (2, 7).

Cloud and Hybrid Infrastructure Integration

Organisations such as Rice University leverage Globus connectors to bridge on-premises HPC resources with cloud storage (e.g., Google Drive, AWS S3). This hybrid approach enables cost-effective archiving while maintaining low-latency access to active datasets (2, 10).

Data Publication and DOI Assignment

Oak Ridge National Laboratory’s Digital Object Identifier (DOI) service uses Globus shared endpoints to streamline data publication. Researchers assemble datasets that span multiple storage systems, and Globus packages them with metadata for DOI minting via DataCite (2, 13).


Technical Limitations and Considerations

Data Size and Throughput Constraints

While Globus excels at bulk transfers, its Compute service and storage connectors impose strict limits:

  • 10 MB maximum payload per task submission (3)
  • 750 GB daily upload quota when using Google Drive connectors (10)
  • 400,000 item limit per Google Shared Drive (10)

These constraints necessitate alternative strategies for exascale workflows, such as combining Globus transfers with parallel filesystem mounts for in-situ processing (3, 7).
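One such strategy, sketched below under stated assumptions, is to keep Globus Compute task payloads small by submitting file paths rather than data: the bulk data is first staged onto the endpoint's shared filesystem by a Globus transfer, so each task payload carries only a short path string. The endpoint UUID and chunk paths are hypothetical.

```python
from globus_compute_sdk import Executor

ENDPOINT_ID = "COMPUTE-ENDPOINT-UUID"  # hypothetical Globus Compute endpoint

def process_chunk(path):
    # Executes on the remote endpoint; reads the pre-staged chunk from the
    # shared filesystem, so only the path travels in the size-capped payload.
    with open(path) as f:
        return sum(1 for _ in f)  # e.g. count records per chunk

chunk_paths = [f"/fs/scratch/project/chunks/part-{i:04d}" for i in range(8)]

with Executor(endpoint_id=ENDPOINT_ID) as ex:
    futures = [ex.submit(process_chunk, p) for p in chunk_paths]
    print([f.result() for f in futures])
```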

Permission Management Challenges

Globus currently lacks native support for propagating custom permissions when transferring data to Google Drive. All uploaded files inherit parent folder permissions, complicating multi-user collaborations (10).

Task Lifetime Limitations

Tasks abandoned for over two weeks are automatically purged, requiring users to implement checkpointing for long-running analyses (3).
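A hedged sketch of such checkpointing: persist the task ID at submission time, and on each run consult the Transfer service before resubmitting. Here `tc` is an authorized TransferClient and `make_transfer()` is a hypothetical factory returning a prepared TransferData.

```python
import json
import pathlib

import globus_sdk

STATE = pathlib.Path("transfer_state.json")  # hypothetical checkpoint file

def run_with_checkpoint(tc, make_transfer):
    # Reuse a previously submitted task if it is still alive or finished
    if STATE.exists():
        task_id = json.loads(STATE.read_text())["task_id"]
        try:
            status = tc.get_task(task_id)["status"]
        except globus_sdk.TransferAPIError:
            status = None  # task record purged or otherwise unavailable
        if status == "SUCCEEDED":
            return task_id
        if status == "ACTIVE":
            tc.task_wait(task_id, timeout=3600)  # block up to an hour
            return task_id
    # Otherwise (purged, failed, or first run): submit afresh and checkpoint
    task_id = tc.submit_transfer(make_transfer())["task_id"]
    STATE.write_text(json.dumps({"task_id": task_id}))
    return task_id
```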


Globus in Human Genomics: A Paradigm Shift

End-to-End Workflow Automation

The Globus Genomics platform integrates with the Galaxy workflow engine to automate NGS analysis pipelines. Researchers at the University of Washington reduced exome processing time from 24 hours to 2.5 hours by combining Globus transfers with AWS spot instances (7, 12). Key features include:

  • Automated retrieval of raw FASTQ files from sequencing cores (4, 14)
  • Elastic provisioning of cloud resources for alignment (BWA) and variant calling (GATK) (7, 12)
  • Secure delivery of VCF results to collaborators via encrypted guest collections (8, 13)
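For the secure-delivery step, a minimal sketch of granting a collaborator read-only access to a guest collection holding VCF results might look like this; the collection UUID and collaborator identity are hypothetical, and `tc` is the authorized TransferClient from the earlier sketches.

```python
rule = {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": "COLLABORATOR-IDENTITY-UUID",  # hypothetical Globus identity
    "path": "/results/vcf/",
    "permissions": "r",  # read-only access to the VCF deliverables
}
tc.add_endpoint_acl_rule("GUEST-COLLECTION-UUID", rule)
```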

Case Study: University of Michigan Advanced Genomics Core

Challenge

With sequencing output doubling every 20 months, the Core faced unsustainable data growth: 6 TB per instrument run, plus 6.5 TB of processed data (4). Traditional methods (FTP, physical drives) caused delays and risked PHI exposure.

Solution Implementation

  1. Push-Pull Architecture:
    • Sequencers write directly to Globus-mounted network storage
    • Post-processed data pushed to researcher endpoints via automated transfers (4)
    • Email notifications trigger collaborator-initiated “pull” transfers (4)
  2. Tiered Archiving:
    • Active datasets stored on high-performance Lustre filesystems
    • Cold data migrated to tape archives after 30-day retention period (4)
  3. Compliance Framework:
    • Globus Auth integrates with institutional Active Directory for PHI access control
    • All transfers encrypted with AES-256, audit logs retained for HIPAA reporting (8, 13)
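In SDK terms, a hedged sketch of these settings: TransferData accepts an encrypt_data flag that forces encryption of the data channel for the whole task. The endpoint UUIDs and paths are hypothetical, and `tc` is an authorized TransferClient as before.

```python
import globus_sdk

tdata = globus_sdk.TransferData(
    tc, "PHI-SOURCE-UUID", "PHI-DESTINATION-UUID",  # hypothetical collections
    label="PHI delivery",
    encrypt_data=True,     # force encryption of the data channel
    verify_checksum=True,  # integrity verification, useful for audit trails
)
tdata.add_item("/secure/results/", "/incoming/results/", recursive=True)
tc.submit_transfer(tdata)
```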

Outcomes

  • 90% reduction in data delivery time (weeks → hours) (4)
  • Zero unplanned PHI breaches over three years (13)
  • Enabled cross-border collaborations with Plant & Food Research New Zealand, transferring 3 TB of genomic data in three hours (11)

Operational Guide: Using Globus for Data Management

Step-by-Step Workflow

  1. Endpoint Configuration
    • Install Globus Connect Personal on local machines or use institutional endpoints (e.g., OSC’s /fs/scratch) (6)
    • For cloud storage, authenticate via OAuth2 to services like Google Drive (10)
  2. Initiating Transfers
    • Log in via institutional credentials (e.g., Ohio Supercomputer Center) (6)
    • In the two-panel interface, select source (e.g., sequencer output directory) and destination (e.g., HPC cluster) (6)
    • Enable checksum verification for integrity validation (6)
  3. Sharing Data
    • Right-click folder → “Share” → enter collaborator emails
    • Set permissions (read/write/administrator) (6)
    • For PHI, restrict sharing to vetted identities via Globus Groups (8)
  4. Automating Pipelines
    • Define recurring transfers using cron-like schedules
    • Integrate with HPC schedulers (Slurm, PBS) via REST API triggers (7, 12)
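A minimal sketch of step 4: a script that a cron job (or an HPC scheduler hook) invokes daily to mirror new sequencer output to scratch storage. `tc` is an authorized TransferClient as in the earlier sketches; the collection UUIDs and paths are hypothetical.

```python
import globus_sdk

def nightly_sync(tc):
    tdata = globus_sdk.TransferData(
        tc, "SEQUENCER-COLLECTION-UUID", "HPC-COLLECTION-UUID",  # hypothetical
        label="nightly sequencer sync",
        sync_level="checksum",  # copy only files that are new or changed
        verify_checksum=True,   # integrity validation, as in step 2
    )
    tdata.add_item("/instrument/output/", "/fs/scratch/lab/incoming/", recursive=True)
    return tc.submit_transfer(tdata)["task_id"]

# Hypothetical crontab entry running the sync at 02:00 daily:
#   0 2 * * * /usr/bin/python3 /opt/lab/nightly_sync.py
```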

Future Directions and Community Impact

Globus continues evolving through partnerships with initiatives like the Open Storage Network (OSN), which leverages its Auth and transfer services to create a 1.5 PB federated storage fabric (2). Emerging features include enhanced metadata search via GraphQL and integration with FAIR data principles for improved reproducibility.

For genomics specifically, ongoing developments focus on:

  • Containerised Workflows: Packaging alignment/variant calling tools as Docker images executable via Globus Flows (7, 12)
  • AI/ML Integration: Direct data streaming from Globus endpoints to TensorFlow/PyTorch clusters (2, 9)
  • Global Metadata Catalogue: Cross-institutional indexing of genomic variants with GA4GH standards (13)

Conclusion

Globus has redefined research data management by providing a unified platform that transcends institutional and disciplinary boundaries. Its impact on human genomics exemplifies how robust cyberinfrastructure can accelerate discovery, from enabling multi-continental collaborations to ensuring compliance with stringent data regulations. While limitations in task scalability and permission granularity persist, ongoing development and community-driven enhancements position Globus as an enduring cornerstone of modern scientific workflows. As data volumes continue their exponential growth, platforms like Globus will remain critical for realising the full potential of collaborative, data-intensive science.
