Galaxy Australia supports de novo genome assembly from PacBio high-fidelity (HiFi) reads, which are generated by circular consensus sequencing (CCS).
This How-to-Guide will describe the steps required to assemble your genome on the Galaxy Australia platform, using multiple workflows (see Fig 1) developed in consultations between the Bioplatforms Australia Threatened Species Initiative, Galaxy Australia, and the Australian BioCommons.
Quick start guide
- Log in to Galaxy Australia
- Create a new history
- Upload your HiFi ccs.bam data files to your Galaxy history
- Load and execute workflows (links included below), using required options
  - FILE CONVERSION workflow: BAM to FASTQ + QC v1.0 (optional)
  - ASSEMBLY workflow: PacBio HiFi genome assembly workflow v2.1
  - PURGE DUPLICATES workflow: Purge duplicates from hifiasm assembly v1.0 (optional)
- Review workflow report and perform additional QC as needed
- Re-run workflows, or individual tools, as needed
How to cite the workflows
Price, G. (2022). BAM to FASTQ + QC v1.0. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.220.2
Price, G., & Farquharson, K. (2022). PacBio HiFi genome assembly using hifiasm v2.1. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.221.3
Price, G. (2022). Purge duplicates from hifiasm assembly v1.0. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.237.2
The overall workflow
The different elements of this assembly approach are summarised below:
Process name | Workflow name | Description | Inputs | Outputs |
---|---|---|---|---|
UPLOAD FILES | Not applicable | See the different upload options. | BAM (start at step 3 in Fig 1), or FASTQ (start at step 4 in Fig 1) | Uploaded data! |
FILE CONVERSION | BAM to FASTQ + QC | Conversion of files from BAM to FASTQ, including a FastQC quality control (QC) step. | HiFi reads in BAM format (ccs.bam) | FASTQ file, FastQC report HTML file |
ASSEMBLY | PacBio HiFi genome assembly using hifiasm | Assembly using the hifiasm tool, including Bandage visualisation and QC. | HiFi reads in FASTQ format | Assembly file in FASTA format, FASTA metrics, Assembly report file |
PURGE DUPLICATES | Purge duplicates from hifiasm assembly | Optional workflow to purge duplicates from the contig assembly. | Raw reads (FASTQ), hifiasm primary contig assembly (FASTA) | Purged primary sequences from draft assembly (FASTA), Purged haplotig sequences from draft assembly (FASTA) |
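
The table implies a simple decision: BAM input goes through file conversion first, FASTQ input can go straight to assembly, and purging duplicates is optional. A minimal sketch of that logic is shown here. The helper `plan_workflows` and its extension-based format detection are hypothetical and are not part of Galaxy Australia or the workflows themselves.

```python
from typing import List

# Workflow names as they appear in the table above.
CONVERSION = "BAM to FASTQ + QC"
ASSEMBLY = "PacBio HiFi genome assembly using hifiasm"
PURGE = "Purge duplicates from hifiasm assembly"

# Assumed set of file extensions treated as FASTQ (illustrative only).
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def plan_workflows(file_name: str, purge_duplicates: bool = False) -> List[str]:
    """Return the workflows to run, in order, for one uploaded reads file.

    Hypothetical helper: the format is guessed from the file extension only.
    """
    name = file_name.lower()
    if name.endswith(".bam"):
        steps = [CONVERSION, ASSEMBLY]  # BAM must be converted to FASTQ first
    elif name.endswith(FASTQ_SUFFIXES):
        steps = [ASSEMBLY]  # FASTQ can go straight to assembly
    else:
        raise ValueError(f"Unrecognised reads format: {file_name}")
    if purge_duplicates:
        steps.append(PURGE)  # optional step after assembly
    return steps


if __name__ == "__main__":
    for step in plan_workflows("sample.ccs.bam", purge_duplicates=True):
        print(step)
```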
The WorkflowHub entries are all available in this collection.
In-depth workflow guide
Register and login
- To register for Galaxy Australia, visit the login page.
- Click the Register here link, as shown in Fig 2.
- Complete the registration wizard and click Create.
- Log in to your account!
Upload data file(s)
- Select your data using the Bioplatforms Data Portal
  - The files you need for the assembly are in .ccs.bam format
  - Fig 3 shows a HiFi data set selected in the data portal browser interface.
- Click Access and select Copy Download URL in the drop-down menu (see Fig 3).
  - Note: This will copy a download URL to your clipboard.
  - The URL is time sensitive and will expire after 10 minutes.
  - Please note you only need to start the download within this 10-minute window. The import itself can take longer than 10 minutes.
- In Galaxy Australia, create a new history
- Now, perform the steps outlined in Fig 4
  - Step 1: Click on Upload Data
  - Step 2: Select Paste/Fetch data
  - Step 3: Paste the URL you obtained from the data portal into the content box.
  - Step 4: Select Start
Other data transfer options are also available
Self-managed (download & upload)
- Download required HiFi data to your personal computer, then
- Upload / transfer to Galaxy Australia (see Fig 5)
- Note: the following Galaxy Training Network tutorial provides guidance on how to upload files via URL. The same mechanism can be used to upload local files, by selecting Choose local files (see Fig 5).
Supported
- Contact the Galaxy Australia Support team for data chaperoning.
OPTIONAL STEP: convert BAM files to FASTQ
This workflow is not needed if files are already in FASTQ format
You must do this step if your files are in ccs.bam format
You will need to complete this workflow for each BAM file
- Make sure you are logged into Galaxy Australia
- Visit this link to:
  - retrieve the workflow for file conversion,
  - add it to your Galaxy Australia workflows list, and
  - open your workflows list (which can also be reached by clicking the Workflow tab, highlighted by a red box in Fig 6, in the Galaxy interface).
- Once you have reached the workflow screen, select the play button (highlighted by a red box in Fig 7) for the BAM to FASTQ + QC workflow.
- The workflow invocation window will open. Select the BAM file that you previously loaded into your Galaxy history using the drop-down menu (step 1 in Fig 8).
- Click Run workflow (step 2 in Fig 8).
- The workflow will produce
  - A FASTQ file that will be the input for the assembly workflow covered in the next section (Fig 9a), and
  - A FastQC report which you can view in the Galaxy user interface (Fig 9b).
- If you only have a single BAM file, stop here! If you have multiple BAM files, repeat this entire section of the tutorial for each BAM file, then join the resulting FASTQ files with the Concatenate datasets tail-to-head tool in Galaxy Australia (see below).
OPTIONAL STEP: joining multiple FASTQ files (e.g. from multiple flow cells)
Concatenate datasets tail-to-head (Galaxy Version 0.1.1) can be used for this purpose.
Note that you can insert multiple data sets (i.e. FASTQs), and should only concatenate files with identical formats.
- Search for the Concatenate datasets tail-to-head tool in the Galaxy Australia browser interface (step 1 in Fig 10).
- Select the tool from the search results (step 2 in Fig 10).
- In the tool menu, select the first FASTQ data set (step 3 in Fig 11).
- Insert additional FASTQ data sets by selecting the + Insert Dataset button (step 4 in Fig 11).
- When all data sets are selected, press Execute (step 5 in Fig 11).
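
Outside Galaxy, the same tail-to-head join can be sketched in a few lines of Python. This is an illustration of the idea, not the implementation of the Galaxy tool. It assumes uncompressed FASTQ files with four lines per read, and the function names `count_fastq_reads` and `concatenate_fastq` are hypothetical. The sketch refuses to mix file extensions and checks that no reads were lost in the join.

```python
import shutil
from pathlib import Path
from typing import Iterable


def count_fastq_reads(path: Path) -> int:
    """Count reads in an uncompressed FASTQ file with four lines per record."""
    with path.open("rb") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path} is not a whole number of 4-line FASTQ records")
    return n_lines // 4


def concatenate_fastq(inputs: Iterable[Path], output: Path) -> int:
    """Append the input files to `output` in order and return the read count."""
    inputs = list(inputs)
    # Only join files that share one extension (a proxy for identical formats).
    suffixes = {p.suffix.lower() for p in inputs}
    if len(suffixes) > 1:
        raise ValueError(f"Mixed file formats: {sorted(suffixes)}")
    expected = sum(count_fastq_reads(p) for p in inputs)
    with output.open("wb") as out:
        for p in inputs:
            with p.open("rb") as src:
                shutil.copyfileobj(src, out)  # tail-to-head byte copy
    observed = count_fastq_reads(output)
    if observed != expected:
        raise RuntimeError(f"Expected {expected} reads, found {observed}")
    return observed
```

Comparing the read count of the joined file with the sum of the inputs is a cheap way to confirm that every flow cell made it into the file passed to the assembly workflow.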
Assembly workflow
HiFiAdapterFilt identifies .ccs reads containing adapter sequences using the same method as NCBI and removes the entire read prior to genome assembly, to avoid the misassemblies that adapter-containing reads can cause.
- Make sure you are logged into Galaxy Australia
- Visit this link to:
  - retrieve v2.1 of the assembly workflow,
  - add it to your Galaxy Australia workflows list, and
  - open your workflows list (which can also be reached by clicking the Workflow tab [highlighted by a red box in Fig 6] in the Galaxy interface)
- Select the play button (highlighted by a red box in Fig 7) for the PacBio HiFi genome assembly using hifiasm v2.1 workflow (the workflow is shown in Fig 12).
- The workflow invocation window will open.
- Select the FASTQ file that was produced by the BAM to FASTQ + QC workflow using the drop-down menu.
- Select correct workflow input parameters
- Click the Run workflow button (as in step 2 of Fig 8).
- The PacBio HiFi genome assembly using hifiasm workflow produces the following outputs (Fig 13):
  - Bandage info and images for:
    - Primary assembly contig graph
    - Alternate assembly contig graph
    - Processed unitig graph
    - Haplotype resolved raw unitig graph
  - FASTA file for the primary assembly contig GFA file
  - Fasta statistics for primary assembly contig FASTA file
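
The workflow converts the primary contig GFA file to FASTA for you. For readers curious about what that conversion involves, the sketch below extracts the segment (S) lines of a GFA file, whose second and third tab-separated fields hold the contig name and sequence, and writes them as FASTA records. It illustrates the idea only and is not the tool used by the workflow. The function names and the contig names in the usage example are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def gfa_segments(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) for every segment (S) line that has a sequence."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # "*" marks a segment whose sequence is not stored in the GFA file.
        if len(fields) >= 3 and fields[0] == "S" and fields[2] != "*":
            yield fields[1], fields[2]


def gfa_to_fasta(lines: Iterable[str], width: int = 60) -> str:
    """Render the segments of a GFA file as FASTA text, wrapped at `width`."""
    records = []
    for name, seq in gfa_segments(lines):
        wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
        records.append(f">{name}\n{wrapped}\n")
    return "".join(records)


if __name__ == "__main__":
    example = [
        "S\tcontig_1\tACGTACGT\tLN:i:8\n",  # illustrative segment line
        "A\tcontig_1\t0\t+\tread_1\t0\t8\n",  # non-segment lines are skipped
        "S\tcontig_2\tGG\n",
    ]
    print(gfa_to_fasta(example, width=4), end="")
```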
OPTIONAL STEP: Purge duplicates from hifiasm assembly
- Make sure you are logged into Galaxy Australia
- Visit this link to:
  - retrieve the purge duplicates workflow,
  - add it to your Galaxy Australia workflows list, and
  - open your workflows list (which can also be reached by clicking the Workflow tab [highlighted by a red box in Fig 6] in the Galaxy interface)
- Select the play button (highlighted by a red box in Fig 7) for the Purge duplicates from hifiasm assembly workflow (the workflow is shown in Fig 14).
- The workflow invocation window will open.
- Select the raw reads in FASTQ format, and the hifiasm primary contig assembly file (FASTA format) produced by the PacBio HiFi genome assembly using hifiasm workflow, using the drop-down menu.
- Select correct workflow input parameters
- Click the Run workflow button (as in step 2 of Fig 8).
- The Purge duplicates from hifiasm assembly workflow produces the following outputs:
  - Purged primary sequences from draft assembly (FASTA)
  - Purged haplotig sequences from draft assembly (FASTA)
Review Workflow Report
- Review quality control contents of the workflow report (Fig 15).
- Inspect the Bandage image (Fig 16).
- Review fasta statistics (Fig 17).
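
When reading assembly statistics, it helps to know how the common summary metrics are derived. The sketch below computes contig count, total length, longest contig and N50 from FASTA text. Whether the Fasta statistics output reports exactly these metrics is an assumption, and the function names are hypothetical. N50 is taken here as the length of the contig at which the running total, counted from the longest contig down, first reaches half of the assembly length.

```python
from typing import Dict, Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the sequence length of each record in FASTA text."""
    lengths: List[int] = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)  # close the previous record
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)  # close the final record
    return lengths


def assembly_stats(lengths: List[int]) -> Dict[str, int]:
    """Summarise contig lengths: count, total length, longest contig and N50."""
    if not lengths:
        raise ValueError("No contigs found")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:  # reached half of the assembly length
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total_length": total,
        "longest": ordered[0],
        "n50": n50,
    }
```

Running the same function on the primary assembly before and after the optional purge duplicates step gives a quick view of how much sequence was removed.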
OPTIONAL STEP: Post-assembly quality control workflow
A genome assembly quality control workflow guide is available.
Appendices
Revealing hidden files
- In your history panel, click on hidden files (as shown by the red box in Fig S1).
- If you would like to unhide the data set, click on Unhide it in the expanded history panel that appears following step 1 (as shown by the red box in Fig S2).