global <- config::get(config = "default")
here::i_am("SinglePlex_ReadProcessing_RemotelyHCC.Rmd")
source(here::here(global$setup))
theme_set(theme_classic())
thematic_rmd()
thematic_on(accent = "#8785B2FF", fg = "black")
opts_chunk$set(message = FALSE,
warning = FALSE,
echo = FALSE,
include = TRUE,
eval = TRUE,
comment = "")
You should use this script if you are processing raw sequencing files remotely through the Swan system on the Holland Computing Center cluster. If you plan to run these scripts locally on the lab’s laptop, you should switch to the script version that ends in _Locally.Rmd.
In the data directory that MinKNOW created for your run, open the
html report file and examine the graphs to see where a reasonable
threshold for a quality score cutoff would be. Once you decide on a
threshold, update it in the config file under
quality_parms$minQC so that the parameter automatically
applies in the scripts that follow.
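For reference, here is a minimal sketch of what that entry might look like in the config file. The surrounding structure (the `default` block and key names) is an assumption based on how the `config` package is typically laid out; match it to your actual config file:

```yaml
default:
  quality_parms:
    minQC: 10   # example threshold; set this from the MinKNOW report graphs
```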
Start with the script
read_processing/RemoteHCC_FirstUse.Rmd. This will
guide you through the steps to transfer your data and create virtual
environments you will need to run the jobs below. Once you go through
that process the first time, you can skip to the steps in this
script.
This is the current recommended pipeline for processing raw reads obtained from the MinION sequencer. ONT sequencers measure electrical signals as strands of DNA pass through each nanopore. The MinKNOW software converts each signal to its estimated sequence of A’s/G’s/C’s/T’s in real time, but those on-the-fly basecalls are relatively error-prone. It is standard practice to take the original signal data after the run has completed and re-basecall it with a slower, more precise algorithm.
You should do this first step from the sequencing laptop immediately after the sequencing run. Your data will automatically back up to the lab’s Google Drive, at which point you can switch to any computer for moving the files into the HCC and remotely processing them.
You could do this directly from the command line, but sometimes I just prefer doing things with icons. Go to the directory icon and enter the following path in the box:
/var/lib/minknow/data
From there, find the parent directory specific to your most recent
sequencing run. Right click the directory icon, and then select
Copy_to to copy the entire directory to the following
path:
/Home/GitRepos/read_processing/seqRuns
Now you should rename the data directory for your sequencing run to
the less cumbersome seqrun name that matches the syntax you
used in the params of this document.
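If you would rather do the copy and rename from the command line, a sketch like the following works. The run directory name here is a made-up example; substitute your actual MinKNOW run directory and the seqrun name from your params:

```shell
# Example paths only -- substitute your actual MinKNOW run directory
run_dir="/var/lib/minknow/data/20240501_1200_MN12345_example_run"
dest="$HOME/GitRepos/read_processing/seqRuns"
seqrun="salci1"   # the short name from the params of this document

mkdir -p "$dest"
cp -r "$run_dir" "$dest/"                          # copy the whole run directory
mv "$dest/$(basename "$run_dir")" "$dest/$seqrun"  # rename to the short seqrun name
```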
Now sync this working directory to your cloned directory on the HCC so that your updated dataset is available there.
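One way to do that sync is rsync over ssh. The username, host, and remote path below are examples only; substitute your own HCC account details:

```shell
# Push the new run directory up to the cluster copy of the repo.
# Username, host, and remote path are examples -- adjust to your account.
rsync -avz --progress \
  "$HOME/GitRepos/read_processing/seqRuns/salci1" \
  "yourusername@swan.unl.edu:/work/richlab/yourusername/read_processing/seqRuns/"
```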
This step can be the glitchiest and most time-consuming because of the more intensive memory requirements of the Dorado basecaller (see here). We can’t use just any of the nodes on the Swan server: we have to use a GPU node, so we must also load the specific version of the dorado module provided by the HCC team.
If you did not include any amplification steps during library prep, you can use dorado’s basecalling algorithm to also detect DNA modification patterns. A modified nucleotide base is a nucleotide in which the canonical base (ACGTU) has undergone some chemical change. Modified nucleotide bases play crucial roles in various biological processes, including gene expression regulation, DNA repair, and the immune response.
Depending on the basecalling model implemented, dorado can detect the following modifications:
| Mod | Name | CHEBI |
|---|---|---|
| 4mC | N(4)-methylcytosine | CHEBI:21839 |
| 5mC | 5-Methylcytosine | CHEBI:27551 |
| 5hmC | 5-Hydroxymethylcytosine | CHEBI:76792 |
| 6mA | 6-Methyladenine | CHEBI:28871 |
Here are the available calls for base modification detections in dorado:
Modified model arguments:
--modified-bases A **space separated** list of modified base codes. Choose from:
pseU, 5mCG_5hmCG, 5mC, 6mA, 5mCG, m6A_DRACH, m6A, 5mC_5hmC, 4mC_5mC.
// More modified base models may be available
--modified-bases-models A **comma separated** list of modified base model paths.
--modified-bases-threshold The minimum predicted methylation probability for a modified base
to be emitted in an all-context model, [0, 1].
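As the help output shows, dorado accepts modifications either through the --modified-bases flag or through the comma-joined model shorthand used in the scripts below. A minimal sketch of the two equivalent forms (input and output paths are examples):

```shell
# Two equivalent ways to request 6mA + 4mC_5mC detection (example paths):
dorado basecaller sup,6mA,4mC_5mC pod5_dir/ > calls.bam                   # shorthand form
dorado basecaller sup pod5_dir/ --modified-bases 6mA 4mC_5mC > calls.bam  # flag form
```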
In many bacterial genomes, the two modifications you’re most likely to encounter, and that are functionally significant, are 6mA (6-Methyladenine) and 4mC (N(4)-methylcytosine).
In contrast, while 5mC (5-Methylcytosine) is a major player in eukaryotic epigenetics, its occurrence in bacteria is less frequent, and 5hmC (5-Hydroxymethylcytosine) is rarely observed in prokaryotes.
Therefore, if you’re prioritizing which modifications to detect for our bacterial isolates, we should focus on 6mA and 4mC. These modifications are most relevant to bacterial physiology and are more likely to yield insights into aspects like restriction-modification systems and potential regulatory mechanisms that could be linked to stress responses, including lithium tolerance.
To use a script that detects 6mA and 4mC, run the chunk below and
then copy and paste the text that it creates to replace the text in the
file batch_scripts/basecall_singleplex_mod_remote.sh
#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --job-name=basecall
#SBATCH --error=logs/basecall_%A_%a.err
#SBATCH --output=logs/basecall_%A_%a.out
#SBATCH --partition=gpu,guest_gpu
#SBATCH --constraint='gpu_v100|gpu_t4'
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=200GB
#SBATCH --gres=gpu:1
module purge
module load dorado-gpu/0.7
seqrun=salci1
sampleset=isolates
qscore=10
cd /work/richlab/aliciarich/read_processing
mkdir -p reads/filtered/$sampleset
dorado basecaller sup,6mA,4mC_5mC \
"seqRuns/$seqrun/pod5" \
--recursive \
--min-qscore $qscore \
> "reads/filtered/$sampleset/$seqrun.bam" && \
dorado summary "reads/filtered/$sampleset/$seqrun.bam" > "logs/${seqrun}_basecall_summary.tsv"
Once you ensure this script has transferred to the
read_processing/batch_scripts path on your HCC directory,
run the code below to submit the job.
cd read_processing
sbatch batch_scripts/basecall_singleplex_mod_remote.sh
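After submitting, you can keep an eye on the job with the usual SLURM commands. The job ID below is a placeholder; use the ID that sbatch prints:

```shell
squeue -u "$USER"                                     # is the job pending or running?
sacct -j 1234567 --format=JobID,State,Elapsed,MaxRSS  # placeholder job ID
tail -n 20 logs/basecall_1234567_*.err                # check the error log
```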
If you are running the basecalling without including any DNA modification detection (the only option if you included an amplification step like PCR during library prep), you should run the script below.
Run the chunk below and then copy and paste the text that it creates
to replace the text in the file
batch_scripts/basecall_singleplex_remote.sh
#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --job-name=basecall
#SBATCH --error=logs/basecall_%A_%a.err
#SBATCH --output=logs/basecall_%A_%a.out
#SBATCH --partition=gpu,guest_gpu
#SBATCH --constraint='gpu_v100|gpu_t4'
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=200GB
#SBATCH --gres=gpu:1
module purge
module load dorado-gpu/0.7
seqrun=salci1
sampleset=isolates
qscore=10
cd /work/richlab/aliciarich/read_processing
mkdir -p reads/filtered/$sampleset
dorado basecaller sup \
"seqRuns/$seqrun/pod5" \
--recursive \
--min-qscore $qscore \
> "reads/filtered/$sampleset/$seqrun.bam" && \
dorado summary "reads/filtered/$sampleset/$seqrun.bam" > "logs/${seqrun}_basecall_summary.tsv"
Once you ensure this script has transferred to the
read_processing/batch_scripts path on your HCC directory,
run the code below to submit the job.
cd read_processing
sbatch batch_scripts/basecall_singleplex_remote.sh
Modkit is
ONT’s program for working with modified bases. We will use it to
convert modBAM to bedMethyl files following best practices, to manipulate
modBAM files, and to generate overall summary statistics.
Again, run the chunk below and then copy and paste this into a script
that you will run. Name this script modkit_remote.sh.
#!/bin/bash
#SBATCH --time=03:00:00
#SBATCH --job-name=modkit
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/modkit_%A_%a.err
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/modkit_%A_%a.out
#SBATCH --partition=gpu,guest_gpu
#SBATCH --constraint='gpu_v100|gpu_t4'
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=200GB
#SBATCH --gres=gpu:1
module purge
module load anaconda
conda activate modkit
cd /work/richlab/aliciarich/read_processing
seqrun=salci1
sampleset=isolates
mkdir -p processed/$sampleset/modkit
modkit pileup --log-filepath "processed/$sampleset/modkit/${seqrun}_pileup.log" \
reads/filtered/$sampleset/$seqrun.bam \
processed/$sampleset/modkit/${seqrun}_pileup.bed
conda deactivate
Once you ensure this script has transferred to the
read_processing/batch_scripts path on your HCC directory,
run the code below to submit the job.
cd read_processing
sbatch batch_scripts/modkit_remote.sh
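Once the pileup finishes, you can sanity-check the bedMethyl output from the command line. In modkit's bedMethyl, column 4 holds the modification code (single-letter codes like a, or ChEBI numbers); the file path below assumes the seqrun and sampleset values from the script above:

```shell
# Quick sanity check of the bedMethyl pileup (path assumes seqrun=salci1, sampleset=isolates)
bed="processed/isolates/modkit/salci1_pileup.bed"
head -n 5 "$bed"
# Count sites per modification code (column 4), e.g. how many 6mA vs 4mC sites
cut -f4 "$bed" | sort | uniq -c
```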
Now we will combine all three of these steps by implementing wf-bacterial-genomes
with Nextflow.
Again, run the chunk below and then copy and paste this into a script
that you will run. Name this script wfbacgen_remote.sh.
#!/bin/bash
#SBATCH --time=03:00:00
#SBATCH --job-name=wfbacgen
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/wfbacgen_%A_%a.err
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/wfbacgen_%A_%a.out
#SBATCH --partition=guest
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=350GB
module purge
module load nextflow
cd /work/richlab/aliciarich/read_processing
seqrun=salci1
sampleset=isolates
mkdir -p processed/$sampleset/$seqrun
nextflow run epi2me-labs/wf-bacterial-genomes \
-profile singularity \
--bam "reads/filtered/$sampleset/$seqrun.bam" \
--isolates \
--sample "$seqrun" \
--out_dir "processed/$sampleset/$seqrun" \
--threads 32
Once you ensure this script has transferred to the
read_processing/batch_scripts path on your HCC directory,
run the code below to submit the job.
cd read_processing
sbatch batch_scripts/wfbacgen_remote.sh
Once this run completes, make sure you transfer the directory
processed to your local read_processing
directory and push everything to the GitHub repository. You will move
on to analysis scripts that draw from files located in
read_processing/processed to create new files in
bioinformatics_stats.
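A sketch of that final transfer and push; the username, host, and paths are examples again, so substitute your own:

```shell
# Pull the processed outputs back from the cluster (example username/host/paths)
rsync -avz \
  "yourusername@swan.unl.edu:/work/richlab/yourusername/read_processing/processed/" \
  "$HOME/GitRepos/read_processing/processed/"

# Then commit and push from the local clone
cd "$HOME/GitRepos/read_processing" &&
  git add processed &&
  git commit -m "Add processed outputs for salci1" &&
  git push
```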