# Load project-level settings from config.yml and anchor the project root
global <- config::get(config = "default")
here::i_am("16s_ReadProcessing_RemotelyHCC.Rmd")
source(here::here(global$setup))

# Plot styling for rendered output
theme_set(theme_classic())
thematic_rmd()
thematic_on(accent = "#8785B2FF", fg = "black")

# Default knitr chunk options for this document
opts_chunk$set(message = FALSE,
               warning = FALSE,
               echo    = FALSE,
               include = TRUE,
               eval    = TRUE,
               comment = "")
You should use this script if you are processing raw, multiplexed sequencing files remotely through the Swan system on the Holland Computing Center cluster. If you plan to run these scripts locally on the lab’s laptop, you should switch to the version of this script that ends in `_Locally.Rmd`.
Start with the script `read_processing/RemoteHCC_FirstUse.Rmd`. It will guide you through the steps to transfer your data and create the virtual environments you will need to run the jobs below. Once you have gone through that process the first time, you can skip straight to the steps in this script.
You should do this first step from the sequencing laptop immediately after the sequencing run. Your data will automatically back up to the lab’s Google Drive, at which point you can switch to any computer for moving the files onto the HCC and processing them remotely.
You could do this directly from the command line, but sometimes I just prefer doing things by icons. Go to the directory icon and enter the following path in the box: `/var/lib/minknow/data`. From there, find the parent directory specific to your most recent sequencing run. Right-click the directory icon, then select `Copy to` to copy the entire directory to the following path: `/Home/GitRepos/read_processing/seqRuns`.
Now you should rename the data directory for your sequencing run with the less cumbersome `seqrun` name that matches the syntax you used in the params of this document.
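If you would rather do this copy-and-rename from the command line instead of by icons, a minimal sketch is below. The run directory name is a hypothetical placeholder for whatever MinKNOW named your run; substitute your own `seqrun` label as well.

```bash
# Sketch of the command-line equivalent of the icon-based copy and rename above.
# "<your_run_directory>" is a placeholder for the directory MinKNOW created.
seqrun=NWR1
cp -r "/var/lib/minknow/data/<your_run_directory>" \
      "/Home/GitRepos/read_processing/seqRuns/$seqrun"
```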
Now sync your cloned directory on the HCC with this working directory so that your updated dataset is available there.
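The sync can happen however you normally move files to the cluster; a minimal `rsync` sketch is below. The username variable, the `swan.unl.edu` login host, and the remote clone location are assumptions here, so adjust all three to match your own setup.

```bash
# Sketch only: push the new seqrun directory from this machine up to the HCC clone.
# HCC_USER, the login host, and the remote path are assumptions; edit to match yours.
HCC_USER=your_hcc_username
seqrun=NWR1
rsync -avz --progress \
  "/Home/GitRepos/read_processing/seqRuns/$seqrun/" \
  "$HCC_USER@swan.unl.edu:/work/richlab/$HCC_USER/read_processing/seqRuns/$seqrun/"
```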
This step can be the most glitchy and time-consuming because of the more intensive memory requirements of the Dorado basecaller (see here). We can’t use just any node on the Swan server - we have to use a GPU node, and we must also load the specific version of the dorado module provided by the HCC team.
Run the chunk below, then copy and paste the text that it creates to replace the text in the file `batch_scripts/basecall_multiplex_remote.sh`.
#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --job-name=basecall
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/basecall_%A_%a.err
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/basecall_%A_%a.out
#SBATCH --partition=gpu,guest_gpu
#SBATCH --constraint='gpu_v100|gpu_t4'
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=200GB
#SBATCH --gres=gpu:1
module purge
module load dorado-gpu/0.7
seqrun=NWR1
cd /work/richlab/aliciarich/read_processing
mkdir -p reads/basecalled
dorado basecaller sup \
--models-directory "dorado_models" \
"seqRuns/$seqrun/pod5" \
--recursive \
--no-trim \
> "reads/basecalled/$seqrun.bam" && \
dorado summary "reads/basecalled/$seqrun.bam" > "logs/${seqrun}_basecall_summary.tsv"
Once you ensure this script has transferred to the `read_processing/batch_scripts` path in your HCC directory, run the code below to submit the job.
cd read_processing
sbatch batch_scripts/basecall_multiplex_remote.sh
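If you want to keep an eye on the job while it runs, the standard Slurm tools work from the `read_processing` directory (a convenience, not part of the workflow itself):

```bash
# Convenience only: check the job's state and follow its error log while it runs.
squeue -u $USER
tail -f logs/basecall_*.err
```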
Now we need to trim away the adapters and barcodes that we attached to our libraries before sequencing. The code below will also demultiplex your reads, but keep in mind that this is only necessary if you have pooled multiple barcoded samples into a single sequencing run (multiplexing). Dorado knows the barcode sequences for every ONT kit, including the ones we attached to our libraries before pooling, and it will use that information to assign a SampleID to every read.
You do not need the GPU nodes for this or any of the other jobs in this workflow (though it is fine if you already are on one). It is faster and easier to grab any available node and use the CPUs instead.
Again, run the chunk below, then copy and paste the text that it creates to replace the text in the file `batch_scripts/demux_trim_remote.sh`.
#!/bin/bash
#SBATCH --time=00:45:00
#SBATCH --job-name=demux_trim
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/demux_trim_%A_%a.err
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/demux_trim_%A_%a.out
#SBATCH --partition=guest
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=100GB
seqrun=NWR1
samplesheet=samples/sample_sheets/bats/nwr1_sample_sheet.csv
cd /work/richlab/aliciarich/read_processing
mkdir -p "reads/trimmed/$seqrun"
module load dorado
dorado demux "reads/basecalled/$seqrun.bam" \
--output-dir "reads/trimmed/$seqrun" \
--kit-name "SQK-16S114-24" \
--sample-sheet "$samplesheet" \
--emit-fastq --emit-summary
Once you ensure this script has transferred to the `read_processing/batch_scripts` path in your HCC directory, run the code below to submit the job.
cd read_processing
sbatch batch_scripts/demux_trim_remote.sh
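Once the demultiplexing job finishes, a quick look at the output directory (a convenience check, not part of the workflow itself) confirms that one fastq landed there per barcoded sample, alongside dorado's demux summary:

```bash
# Convenience check: one fastq per barcoded sample plus dorado's demux summary.
seqrun=NWR1
ls -lh reads/trimmed/$seqrun/
```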
Now we need to trim away the messy ends of our reads (they are kind of like the first pancake - never as precise as the middle) and filter out any reads with a quality score below our threshold. Again, these parameters will be important to note in the methods of any results you publish or present, so I keep a record of my working parameters in the yaml header or my config file, which you will see below.
Type the following into the HCC terminal to open an interactive job.
srun --partition=guest --nodes=1 --ntasks-per-node=1 --job-name=qc_multiplex --mem=100GB --time=0:45:00 --pty $SHELL
Copy and paste the following code.
module load anaconda
conda activate filter
cd /work/richlab/aliciarich/read_processing
seqrun=NWR1
trimmed=/work/richlab/aliciarich/read_processing/reads/trimmed/$seqrun
filtered=/work/richlab/aliciarich/read_processing/reads/filtered/$seqrun
mkdir -p $filtered
cd $trimmed
for file in ${trimmed}/*.fastq; do
if [ -f "$file" ]; then
base_filename=$(basename "$file")
chopper --maxlength 2000 --minlength 1000 --quality 7 --input "$file" > "$filtered/$base_filename"
echo "Processed $file"
else
echo "Error: File $file does not exist or is not a regular file."
fi
done
This part just reorganizes the files into the directory structure that the EPI2ME workflow we use next expects, so that it can easily locate them.
cd $filtered
for file in "$filtered"/*.fastq; do
if [ -f "$file" ]; then
base_filename=$(basename "$file" .fastq)
mkdir -p "$filtered/$base_filename"
mv "$file" "$filtered/$base_filename/${base_filename}.fastq"
echo "Organized $file"
else
echo "Error: File $file does not exist or is not a regular file."
fi
done
conda deactivate
cd /work/richlab/aliciarich/read_processing
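Before moving on, it is worth a quick sanity check (not part of the workflow itself) that the reorganization produced one subdirectory per demultiplexed sample, each holding its single fastq file; the exact names will depend on your sample sheet aliases.

```bash
# Quick sanity check: expect reads/filtered/<seqrun>/<sample>/<sample>.fastq
ls -R "reads/filtered/$seqrun" | head -n 20
```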
We will use the wf-16s pipeline by EPI2ME to do this. You should have a look at all the options and steps described here if you use this workflow. Work out which options I have selected and which I have left at their defaults in the script provided, because you will need to mention those details in the methods of anything you publish or present.
Again, run the chunk below and then copy and paste the text it creates into a script that you will run. Name this script `wf16s_remote.sh`.
Notice that this is a more memory- and time-intensive job than the others; you will need to scale up the requested time for more samples, more reads, or longer reads.
#!/bin/bash
#SBATCH --time=03:00:00
#SBATCH --job-name=wf16s
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/wf16s_%A_%a.err
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/wf16s_%A_%a.out
#SBATCH --partition=guest
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=350GB
module purge
module load nextflow
cd /work/richlab/aliciarich/read_processing
seqrun=NWR1
mkdir -p processed/$seqrun
nextflow run epi2me-labs/wf-16s \
-profile singularity \
--fastq "reads/filtered/$seqrun" \
--taxonomic_rank S \
--keep_bam \
--minimap2_by_reference \
--include_read_assignments \
--out_dir "processed/$seqrun" \
--min_len 1000 \
--max_len 2000 \
--abundance_threshold 0 \
--min_read_qual 7 \
--min_percent_identity 85 \
--min_ref_coverage 80 \
--n_taxa_barplot 12 \
--threads 32
Once you ensure this script has transferred to the `read_processing/batch_scripts` path in your HCC directory, run the code below to submit the job.
cd read_processing
sbatch batch_scripts/wf16s_remote.sh
Once this run completes, make sure you transfer the `processed` directory to your local `read_processing` directory and push everything to the GitHub repository. You will move on to the analysis scripts, which draw from files located in `read_processing/processed` to create new files in `bioinformatics_stats`.
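As with the upload earlier, a minimal `rsync` sketch for pulling the results back down is below, run from your local machine; the username, login host, and paths are the same assumptions as before, so adjust them to your own setup.

```bash
# Sketch only: pull the processed results from the HCC back to the local clone
# before committing and pushing. Username, host, and paths are assumptions.
HCC_USER=your_hcc_username
seqrun=NWR1
rsync -avz --progress \
  "$HCC_USER@swan.unl.edu:/work/richlab/$HCC_USER/read_processing/processed/$seqrun/" \
  "/Home/GitRepos/read_processing/processed/$seqrun/"
```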