global             <- config::get(config = "default")

here::i_am("SinglePlex_ReadProcessing_RemotelyHCC.Rmd")
source(here::here(global$setup))

theme_set(theme_classic())
thematic_rmd()
thematic_on(accent = "#8785B2FF", fg = "black")

opts_chunk$set(message = FALSE,
               warning = FALSE,
               echo    = FALSE,
               include = TRUE,
               eval    = TRUE,
               comment = "")

Before you begin

You should use this script if you are processing raw sequencing files remotely through the Swan system on the Holland Computing Center cluster. If you plan to run these scripts locally on the lab’s laptop, you should switch to the script version that ends in _Locally.Rmd.

Inspect Run Report

In the data directory that MinKNOW created for your run, open the html report file and examine the graphs to see where a reasonable threshold for a quality score cutoff would be. Once you decide on a threshold, update it in the config file under quality_parms$minQC so that the parameter automatically applies in the scripts that follow.
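As a sketch of where that parameter lives, assuming your config.yml follows the layout read by config::get(config = "default") in the setup chunk above (the nesting shown here is illustrative):

```yaml
# config.yml -- illustrative fragment; only quality_parms$minQC is named in
# this document, the surrounding structure is an assumption
default:
  quality_parms:
    minQC: 10   # minimum q-score threshold chosen from the run report
```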

If this is your first time

Start with the script read_processing/RemoteHCC_FirstUse.Rmd. It will guide you through the steps to transfer your data and create the virtual environments you will need to run the jobs below. Once you have gone through that process the first time, you can skip to the steps in this script.

This Workflow

This is the current recommended pipeline for processing raw reads obtained from the MinION Sequencer. ONT sequencers measure electrical signals as strands of DNA pass through each nanopore. The MinKNOW software uses an algorithm to convert each signal to its estimated sequence of A’s/G’s/C’s/T’s in real time, but those on-the-fly basecalls are relatively error-prone. It is standard practice to take the original signal data after the run has completed and re-basecall it with a slower, more precise algorithm for greater accuracy.

Transfer sequencing data

Sync Results to the Google Drive

You should do this first step from the sequencing laptop immediately after the sequencing run. Your data will automatically back up to the lab’s Google Drive, at which point you can switch to any computer to move the files into the HCC and process them remotely.

You could do this directly from the command line, but sometimes I just prefer doing things by icons. Go to the directory icon and enter the following path in the box:

/var/lib/minknow/data

From there, find the parent directory specific to your most recent sequencing run. Right-click that directory, then select Copy to to copy the entire directory to the following path:

/Home/GitRepos/read_processing/seqRuns

Now you should rename the data directory for your sequencing run with the less cumbersome seqrun name that matches the syntax you used in the params of this document.
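If you would rather do the copy and rename from the command line, the GUI steps above amount to a single cp. The sketch below uses mock /tmp paths so it is self-contained; in practice the source would sit under /var/lib/minknow/data and the destination under your seqRuns directory.

```shell
# Sketch of the copy-and-rename above; paths are mock stand-ins, substitute
# your real MinKNOW run directory and seqRuns path.
run_dir="/tmp/demo_minknow/20240101_1200_MN00000_demo_run"  # stand-in for the run directory
dest="/tmp/demo_seqRuns"                                    # stand-in for .../read_processing/seqRuns
seqrun="salci1"                                             # the short name from this document's params

mkdir -p "$run_dir/pod5"           # mock run contents, for illustration only
mkdir -p "$dest"
cp -r "$run_dir" "$dest/$seqrun"   # copy the whole run and rename it in one step
ls "$dest/$seqrun"                 # -> pod5
```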

Sync Results to HCC Work Directory

Now sync this working directory to your cloned directory on the HCC so that your updated dataset is available there.

Read Processing

Basecalling

This step can be the most glitchy and time-consuming because of the more intensive memory requirements of the Dorado basecaller (see here). We can’t use just any of the nodes on the Swan server - we have to use a GPU node, and so we must also load a specific version of the dorado package provided by the HCC team.

Include DNA Modification

If you did not include any amplification steps during library prep, you can use dorado’s basecalling algorithm to also detect DNA modification patterns. A modified nucleotide base is a nucleotide in which the canonical base (ACGTU) has undergone some chemical change. Modified nucleotide bases play crucial roles in various biological processes, including gene expression regulation, DNA repair, and the immune response.

Depending on the basecalling model implemented, dorado can detect the following modifications:

Mod    Name                       CHEBI
4mC    N(4)-methylcytosine        CHEBI:21839
5mC    5-Methylcytosine           CHEBI:27551
5hmC   5-Hydroxymethylcytosine    CHEBI:76792
6mA    6-Methyladenine            CHEBI:28871

Here are the available calls for base modification detections in dorado:

Modified model arguments:
  --modified-bases            A **space separated** list of modified base codes. Choose from:
                                pseU, 5mCG_5hmCG, 5mC, 6mA, 5mCG, m6A_DRACH, m6A, 5mC_5hmC, 4mC_5mC.
                                // More modified base models may be available

  --modified-bases-models     A **comma separated** list of modified base model paths.
  --modified-bases-threshold  The minimum predicted methylation probability for a modified base
                                to be emitted in an all-context model, [0, 1].

In many bacterial genomes, the two modifications you’re most likely to encounter and that are functionally significant are:

  • 6mA (6-Methyladenine):
    • This is one of the most common modifications in bacteria. It plays roles in gene regulation, DNA replication, and is often part of restriction-modification systems, helping the cell distinguish its own DNA from foreign sequences.
  • 4mC (N(4)-methylcytosine):
    • Also prevalent in bacteria, 4mC is similarly involved in restriction-modification systems and can influence gene expression. Its presence is well-documented in various bacterial species.

In contrast, while 5mC (5-Methylcytosine) is a major player in eukaryotic epigenetics, its occurrence in bacteria is less frequent, and 5hmC (5-Hydroxymethylcytosine) is rarely observed in prokaryotes.

Therefore, if you’re prioritizing which modifications to detect for our bacterial isolates, we should focus on 6mA and 4mC. These modifications are most relevant to bacterial physiology and are more likely to yield insights into aspects like restriction-modification systems and potential regulatory mechanisms that could be linked to stress responses, including lithium tolerance.

To use a script that detects 6mA and 4mC, run the chunk below and then copy and paste the text that it creates to replace the text in the file batch_scripts/basecall_singleplex_mod_remote.sh

#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --job-name=basecall
#SBATCH --error=logs/basecall_%A_%a.err
#SBATCH --output=logs/basecall_%A_%a.out
#SBATCH --partition=gpu,guest_gpu
#SBATCH --constraint='gpu_v100|gpu_t4'
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=200GB
#SBATCH --gres=gpu:1

module purge
module load dorado-gpu/0.7

seqrun=salci1
sampleset=isolates
qscore=10

cd /work/richlab/aliciarich/read_processing

mkdir -p reads/filtered/$sampleset

dorado basecaller sup,6mA,4mC_5mC \
    "seqRuns/$seqrun/pod5" \
    --recursive \
    --min-qscore $qscore \
    > "reads/filtered/$sampleset/$seqrun.bam" && \
dorado summary "reads/filtered/$sampleset/$seqrun.bam" > "logs/${seqrun}_basecall_summary.tsv"
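One bash detail worth flagging when building output filenames like the summary path above: inside double quotes, an underscore is a valid character in a variable name, so braces are needed to mark where the variable ends. A minimal demonstration:

```shell
# Why ${seqrun} needs braces when an underscore follows it.
seqrun=salci1
echo "logs/$seqrun_basecall_summary.tsv"    # bash expands $seqrun_basecall_summary (unset) -> logs/.tsv
echo "logs/${seqrun}_basecall_summary.tsv"  # braces delimit the name -> logs/salci1_basecall_summary.tsv
```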

Once you ensure this script has transferred to the read_processing/batch_scripts path on your HCC directory, run the code below to submit the job.

cd read_processing
sbatch batch_scripts/basecall_singleplex_mod_remote.sh

Without DNA Modification

If you are running the basecalling without including any DNA modification detection (the only option if you included an amplification step like PCR during library prep), you should run the script below.

Run the chunk below and then copy and paste the text that it creates to replace the text in the file batch_scripts/basecall_singleplex_remote.sh

#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --job-name=basecall
#SBATCH --error=logs/basecall_%A_%a.err
#SBATCH --output=logs/basecall_%A_%a.out
#SBATCH --partition=gpu,guest_gpu
#SBATCH --constraint='gpu_v100|gpu_t4'
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=200GB
#SBATCH --gres=gpu:1

module purge
module load dorado-gpu/0.7

seqrun=salci1
sampleset=isolates
qscore=10

cd /work/richlab/aliciarich/read_processing

mkdir -p reads/filtered/$sampleset

dorado basecaller sup \
    "seqRuns/$seqrun/pod5" \
    --recursive \
    --min-qscore $qscore \
    > "reads/filtered/$sampleset/$seqrun.bam" && \
dorado summary "reads/filtered/$sampleset/$seqrun.bam" > "logs/${seqrun}_basecall_summary.tsv"

Once you ensure this script has transferred to the read_processing/batch_scripts path on your HCC directory, run the code below to submit the job.

cd read_processing
sbatch batch_scripts/basecall_singleplex_remote.sh

Compiling Base Modification Result Files

Modkit is ONT’s program for working with modified bases. We will use it to convert modBAM files to bedMethyl files following best practices, to manipulate modBAM files, and to generate overall summary statistics.
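For orientation, bedMethyl output is tab-separated. In the sketch below the record is fabricated for illustration, and the column layout (mod code in column 4, percent modified in column 11) is my reading of modkit's output format, so verify it against your modkit version before relying on it:

```shell
# Illustrative (made-up) bedMethyl record: "a" is the code modkit uses for
# 6mA; column 11 is the percent of reads calling the base modified.
line=$(printf 'contig_1\t100\t101\ta\t25\t+\t100\t101\t255,0,0\t25\t84.00\t21\t4\t0\t0\t1\t0\t0')
echo "$line" | awk -F'\t' '{print $4, $11"%"}'   # -> a 84.00%
```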

Again, run the chunk below and then copy and paste this into a script that you will run. Name this script modkit_remote.sh.

#!/bin/bash
#SBATCH --time=03:00:00
#SBATCH --job-name=modkit
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/modkit_%A_%a.err
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/modkit_%A_%a.out
#SBATCH --partition=gpu,guest_gpu
#SBATCH --constraint='gpu_v100|gpu_t4'
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=200GB
#SBATCH --gres=gpu:1

module purge
module load anaconda

conda activate modkit

cd /work/richlab/aliciarich/read_processing

seqrun=salci1
sampleset=isolates

mkdir -p processed/$sampleset/modkit

modkit pileup --log-filepath "processed/$sampleset/modkit/${seqrun}_pileup.log" \
  "reads/filtered/$sampleset/$seqrun.bam" \
  "processed/$sampleset/modkit/${seqrun}_pileup.bed"

conda deactivate

Once you ensure this script has transferred to the read_processing/batch_scripts path on your HCC directory, run the code below to submit the job.

cd read_processing
sbatch batch_scripts/modkit_remote.sh

Alignment, Assembly, Annotation

Now we will combine all three of these steps by running the epi2me-labs/wf-bacterial-genomes workflow with Nextflow.

Again, run the chunk below and then copy and paste this into a script that you will run. Name this script wfbacgen_remote.sh.

#!/bin/bash
#SBATCH --time=03:00:00
#SBATCH --job-name=wfbacgen
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/wfbacgen_%A_%a.err
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/wfbacgen_%A_%a.out
#SBATCH --partition=guest
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=350GB

module purge
module load nextflow

cd /work/richlab/aliciarich/read_processing

seqrun=salci1
sampleset=isolates

mkdir -p processed/$sampleset/$seqrun

nextflow run epi2me-labs/wf-bacterial-genomes \
    -profile singularity \
    --bam "reads/filtered/$sampleset/$seqrun.bam" \
    --isolates \
    --sample "$seqrun" \
    --out_dir "processed/$sampleset/$seqrun" \
    --threads 32

Once you ensure this script has transferred to the read_processing/batch_scripts path on your HCC directory, run the code below to submit the job.

cd read_processing
sbatch batch_scripts/wfbacgen_remote.sh

Next Steps

Once this run completes, make sure you transfer the processed directory to your local read_processing directory and push everything to the GitHub repository. You will then move on to analysis scripts that draw from files located in read_processing/processed to create new files in bioinformatics_stats.
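The commit-and-push step can be sketched as below. To keep the example self-contained it builds a throwaway repository under /tmp; in practice you would run the git commands inside your local read_processing clone and finish with git push (omitted here since there is no remote).

```shell
# Sketch of committing the processed outputs; repo path, file, and commit
# message are all illustrative.
repo="/tmp/demo_read_processing"
rm -rf "$repo" && mkdir -p "$repo/processed" && cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo "assembly stats" > processed/salci1_summary.txt
git add processed
git commit -q -m "Add processed outputs for salci1"
git log --oneline        # shows the single commit recording the outputs
# git push               # in the real clone, push to the lab's GitHub remote
```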