Setup nf-core/proteinfold on Setonix

Overview

Teaching: 5 min
Exercises: 5 min
Questions
Objectives
  • Locate the resources available to support running AlphaFold2 on Setonix.

  • Prepare a Nextflow configuration file to utilise available resources.

Setup

  1. Before commencing the exercise, navigate to the relevant working directory:

     cd $MYSCRATCH/2025-ABACBS-workshop/exercises/exercise2/
     ls
    
  2. You should see the following files, which will be used in this exercise:

     abacbs_profile.config  examples  fasta  samplesheet.csv
    

nf-core/proteinfold

nf-core/proteinfold is a Nextflow pipeline that supports numerous models for molecular structure prediction.


Today, we will use proteinfold to predict the structure of our uncharacterised protein using the AlphaFold2 model.

We will use a development branch (commit: 53a1008) to access some of the latest features that are not yet available in the current release.

  1. Load a specific Nextflow version and pull the required proteinfold commit:

     module load nextflow/25.04.6 
     nextflow pull nf-core/proteinfold -r 53a1008
    
  2. View the directory tree:

     tree $NXF_HOME/assets/nf-core/proteinfold/ -L 2 --filelimit=20
    

Result

├── assets
│   ├── adaptivecard.json
│   ├── comparison_template.html
│   ├── dummy_db
│   ├── dummy_db_dir
│   ├── email_template.html
│   ├── email_template.txt
│   ├── methods_description_template.yml
│   ├── multiqc_config.yml
│   ├── nf-core-proteinfold_logo_light.png
│   ├── NO_FILE
│   ├── NO_FILE_PAE
│   ├── report_template.html
│   ├── samplesheet.csv
│   ├── schema_input.json
│   ├── sendmail_template.txt
│   └── slackreport.json
├── bin
│   ├── extract_metrics.py
│   ├── fasta_to_alphafold3_json.py
│   ├── fix_obsolete.py
│   ├── generate_comparison_report.py
│   ├── generate_report.py
│   ├── mmcif_to_pdb.py
│   ├── msa_manager.py
│   ├── __pycache__
│   └── utils.py
├── CHANGELOG.md
├── CITATIONS.md
├── CODE_OF_CONDUCT.md
├── conf [32 entries exceeds filelimit, not opening dir]
├── docs
│   ├── images
│   ├── output.md
│   ├── README.md
│   └── usage.md
├── LICENSE
├── main.nf
├── modules
│   ├── local
│   └── nf-core
├── modules.json
├── nextflow.config
├── nextflow_schema.json
├── nf-test.config
├── README.md
├── ro-crate-metadata.json
├── subworkflows
│   ├── local
│   └── nf-core
├── tests
│   ├── alphafold2_download.nf.test
│   ├── alphafold2_download.nf.test.snap
│   ├── alphafold2_split.nf.test
│   ├── alphafold2_split.nf.test.snap
│   ├── alphafold3.nf.test
│   ├── alphafold3.nf.test.snap
│   ├── colabfold_download.nf.test
│   ├── colabfold_download.nf.test.snap
│   ├── colabfold_local.nf.test
│   ├── colabfold_local.nf.test.snap
│   ├── colabfold_webserver.nf.test
│   ├── colabfold_webserver.nf.test.snap
│   ├── default.nf.test
│   ├── default.nf.test.snap
│   ├── esmfold.nf.test
│   ├── esmfold.nf.test.snap
│   ├── nextflow.config
│   ├── split_fasta.nf.test
│   └── split_fasta.nf.test.snap
├── tower.yml
└── workflows
    ├── alphafold2.nf
    ├── alphafold3.nf
    ├── boltz.nf
    ├── colabfold.nf
    ├── esmfold.nf
    ├── helixfold3.nf
    └── rosettafold_all_atom.nf
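
If you want to confirm which revision of the pipeline was pulled, the nextflow info command lists the local copy of the project and its available revisions (an optional check; the exact output will vary):

     # List the locally pulled pipeline and its revisions
     nextflow info nf-core/proteinfold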

Setup environment

Containers are one of the most effective ways to manage software in workflow development, especially with Nextflow.

Background: Why containers?

A container is a lightweight, portable environment that bundles an application together with everything it needs to run, such as libraries, dependencies, and system tools. It is a simple and reliable alternative to installing software directly on your system or HPC environment (which often requires managing dependencies manually). On HPC systems, this means isolation from other environments, reproducibility across platforms, and simplified maintenance without manual installs or dependency troubleshooting.

It is recommended to pull containers from trusted sources like BioContainers, quay.io or Seqera. During execution, Nextflow automatically pulls required images, often from these repositories, and stores them in the work directory.

In this exercise, we will be using Singularity containers.
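
If you ever need to fetch an image manually, outside of a workflow run, you can pull it from one of these registries with singularity pull. The example below is illustrative only and is not required for this exercise; the multiqc tag mirrors one of the images listed further down:

     # Illustrative only - pull a BioContainers image from quay.io
     # and save it locally as a .sif file
     singularity pull multiqc.sif docker://quay.io/biocontainers/multiqc:1.29--pyhdfd78af_0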

  1. Load the corresponding module on Setonix.

     module load singularity/3.11.4-nompi
    
  2. Before executing the workflow, we will define a number of environment variables. These variables tell Nextflow and Singularity where to find and store container images so you don’t waste time and space downloading them repeatedly.
    • These environment variables need to be set each time you log in to the HPC system, or
    • included in your job submission script before running Nextflow (a sketch of such a script is shown after the background note below).
     mkdir -p $MYSCRATCH/containers
     export SINGULARITY_CACHEDIR=$MYSCRATCH/containers
     export SINGULARITY_LIBRARYDIR=/scratch/references/abacbs2025/containers
     export NXF_SINGULARITY_CACHEDIR=$MYSCRATCH/containers
     export NXF_SINGULARITY_LIBRARYDIR=/scratch/references/abacbs2025/containers
    
  3. Confirm that several images are visible to Nextflow by listing the shared library directory:

     ls $NXF_SINGULARITY_LIBRARYDIR
    

    Output:

     alphafold2_pred-single.sif
     alphafold2-single.sif
     community-cr-prod.seqera.io-docker-registry-v2-blobs-sha256-24-241f0746484727a3633f544c3747bfb77932e1c8c252e769640bd163232d9112-data.img
     community-cr-prod.seqera.io-docker-registry-v2-blobs-sha256-ef-eff0eafe78d5f3b65a6639265a16b89fdca88d06d18894f90fcdb50142004329-data.img
     depot.galaxyproject.org-singularity-multiqc-1.27--pyhdfd78af_0.img
     depot.galaxyproject.org-singularity-multiqc-1.29--pyhdfd78af_0.img
     depot.galaxyproject.org-singularity-python-3.8.3.img
     quay.io-nf-core-proteinfold_alphafold2_msa-dev.img
    

If you execute a Nextflow workflow that requires a container not present in the shared $NXF_SINGULARITY_LIBRARYDIR, the pipeline will attempt to pull it from a hosted repository and store the image in your personal $NXF_SINGULARITY_CACHEDIR.
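
At this stage your personal cache is likely to be empty; after a run that pulls a new image, you should see it appear there:

     # Any images pulled by Nextflow during a run will be stored here
     ls $NXF_SINGULARITY_CACHEDIR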

Background: Why Environment Variables?

When using containers on HPC systems, Nextflow needs to know where to store and retrieve container images. By default, it downloads containers into the workflow’s work/ directory, which can be inefficient and waste storage if you run multiple workflows.

Setting environment variables allows you to:

  • Cache container images in a shared location → Avoid repeated downloads and speed up execution.
  • Control storage paths → Prevent filling up your home directory or job scratch space.
  • Ensure reproducibility → Use the same cached image across multiple runs and workflows.
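
As an example, a job submission script on Setonix might set these variables before invoking Nextflow. The sketch below is a minimal illustration only: the SLURM directives are placeholders to replace with values appropriate for your own project, and the final nextflow run command is omitted here:

     #!/bin/bash -l
     #SBATCH --job-name=proteinfold    # placeholder SLURM directives - adjust
     #SBATCH --account=<your-project>  # the account, time and other options
     #SBATCH --time=12:00:00           # to suit your own project and queue

     # Load the same modules used interactively in this exercise
     module load nextflow/25.04.6
     module load singularity/3.11.4-nompi

     # Container cache locations, as defined above
     export SINGULARITY_CACHEDIR=$MYSCRATCH/containers
     export SINGULARITY_LIBRARYDIR=/scratch/references/abacbs2025/containers
     export NXF_SINGULARITY_CACHEDIR=$MYSCRATCH/containers
     export NXF_SINGULARITY_LIBRARYDIR=/scratch/references/abacbs2025/containers

     # ... nextflow run command goes here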

AMD-compatible images

Note

  • Normally AlphaFold2 runs 5 different models and picks the best result.
  • Today we are using a modified, AMD-compatible build of AlphaFold2 (alphafold2-single.sif) that only runs a single model to reduce execution time.
  1. Check the abacbs_profile.config file to confirm that the workflow processes are configured to use the non-standard images available on Setonix:

     grep -w RUN_ALPHAFOLD2 abacbs_profile.config -A3
    

    Output:

     withName: 'RUN_ALPHAFOLD2' {
         container = '/scratch/references/abacbs2025/containers/alphafold2-single.sif'
         time = { 12.h }
     }
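
Process selectors like this take effect when the configuration file is passed to Nextflow with the -c option. The command below is a sketch only; the parameters shown (such as --mode and --input) are typical nf-core/proteinfold options rather than the exact command used in this workshop:

     # Sketch only - supply the custom configuration with -c so that
     # RUN_ALPHAFOLD2 uses the custom image defined above
     nextflow run nf-core/proteinfold -r 53a1008 \
         -profile singularity \
         -c abacbs_profile.config \
         --mode alphafold2 \
         --input samplesheet.csv \
         --outdir results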
    

Reference data

Recall that structure prediction relies on collecting homologous proteins in a multiple sequence alignment (MSA) to identify coevolutionary information.

These homologs are identified from enormous reference sequence databases (>1TB).

  1. Check that these databases are available on Setonix:

     tree /scratch/references/abacbs2025/databases/ -L 1
    
  2. You should see that the required AlphaFold2 databases and model parameters are available here:

     databases/
         ├── mgnify
         ├── pdb70
         ├── pdb_mmcif
         ├── pdb_seqres
         ├── small_bfd
         ├── uniprot
         └── uniref90 
    

    Note

    • Today, we are using miniature versions of the databases to reduce execution time for the purpose of the workshop.
    • These databases will NOT generate high-quality predictions for other protein targets.
    • Full size databases are available at /scratch/references/alphafold_feb2024/databases/.
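
    If you are curious, you can list the full-size databases for comparison (optional; not needed for the exercise):

     # Optional: list the full-size AlphaFold2 databases for comparison
     ls /scratch/references/alphafold_feb2024/databases/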

Key Points

  • Software designed for GPU execution is often compiled for NVIDIA GPUs.

  • Custom images can be built to support execution using AMD GPUs.

  • Container images (e.g. Singularity) can be cached in a shared directory to avoid duplication and speed up workflows.