global             <- config::get(config = "default")

here::i_am("16s_ReadProcessing_RemotelyHCC.Rmd")
source(here::here(global$setup))

theme_set(theme_classic())
thematic_rmd()
thematic_on(accent = "#8785B2FF", fg = "black")

opts_chunk$set(message = FALSE,
               warning = FALSE,
               echo    = FALSE,
               include = TRUE,
               eval    = TRUE,
               comment = "")

Before you begin

You should use this script if you are processing raw, multiplexed sequencing files remotely through the Swan System on the Holland Computing Center Cluster. If you plan to run these scripts locally on the lab’s laptop, you should switch to the script version that ends in _Locally.Rmd.

If this is your first time

Start with the script read_processing/RemoteHCC_FirstUse.Rmd. It will guide you through the steps to transfer your data and create the virtual environments you will need to run the jobs below. Once you have gone through that process the first time, you can skip to the steps in this script.

Transfer sequencing data

Sync Results to the Google Drive

You should do this first step from the sequencing laptop immediately after the sequencing run. Your data will automatically back up to the lab's Google Drive, at which point you can switch to any computer to move the files onto the HCC and process them remotely.

You could do this directly from the command line (a sketch of that alternative follows the rename step below), but sometimes I just prefer doing things by icons. Go to the directory icon and enter the following path in the box:

/var/lib/minknow/data

From there, find the parent directory specific to your most recent sequencing run. Right-click the directory icon, then select Copy to and copy the entire directory to the following path:

/Home/GitRepos/read_processing/seqRuns

Now you should rename the data directory for your sequencing run with the less cumbersome seqrun name that matches the syntax you used in the params of this document.
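
One command-line equivalent of the steps above is sketched here; the run directory name is a placeholder for your most recent run, NWR1 stands in for your seqrun name, and I am assuming the file manager's /Home corresponds to $HOME in the shell.

# copy the raw run off the sequencing laptop and rename it to the short seqrun name in one step
# <run_directory> and NWR1 are placeholders for your own run
cp -r "/var/lib/minknow/data/<run_directory>" "$HOME/GitRepos/read_processing/seqRuns/NWR1"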

Sync Results to HCC Work Directory

Now sync this working directory to your cloned directory on the HCC so that your updated dataset is available there.
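
One way to do the sync from a local terminal is rsync over SSH; this is a minimal sketch that assumes the Swan login host swan.unl.edu, a <hcc_username> placeholder for your HCC account, and the /work/richlab/aliciarich/read_processing clone used in the batch scripts below.

# push the renamed run directory from the local working directory to the HCC clone
# swan.unl.edu and <hcc_username> are assumptions; substitute your own login host and username
rsync -avP "$HOME/GitRepos/read_processing/seqRuns/NWR1" \
    <hcc_username>@swan.unl.edu:/work/richlab/aliciarich/read_processing/seqRuns/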

Read Processing

Basecalling

This step can be the most glitchy and time-consuming because of the more intensive memory requirements of the Dorado basecaller (see here). We can't use just any of the nodes on the Swan server: we have to use a GPU node, and we must also load the specific version of the dorado package provided by the HCC team.

Run the chunk below and then copy and paste the text that it creates to replace the text in the file batch_scripts/basecall_multiplex_remote.sh

#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --job-name=basecall
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/basecall_%A_%a.err
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/basecall_%A_%a.out
#SBATCH --partition=gpu,guest_gpu
#SBATCH --constraint='gpu_v100|gpu_t4'
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=200GB
#SBATCH --gres=gpu:1

module purge
module load dorado-gpu/0.7

seqrun=NWR1

cd /work/richlab/aliciarich/read_processing

mkdir -p reads/basecalled

dorado basecaller sup \
    --models-directory "dorado_models" \
    "seqRuns/$seqrun/pod5" \
    --recursive \
    --no-trim \
    > "reads/basecalled/$seqrun.bam" && \
dorado summary "reads/basecalled/$seqrun.bam" > "logs/${seqrun}_basecall_summary.tsv"

Once you have confirmed that this script has been transferred to the read_processing/batch_scripts path in your HCC directory, run the code below to submit the job.

cd read_processing
sbatch batch_scripts/basecall_multiplex_remote.sh
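
While the job runs, you can keep an eye on it with standard SLURM commands; this is a quick sketch that assumes you are still in the read_processing directory, so the relative logs/ path matches the paths in the script above.

# check the queue for your jobs, then follow the basecall log as it is written
squeue -u $USER
tail -f logs/basecall_*.err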

Demultiplex and Trim

Now we need to trim away the adapters and barcodes that we attached to our libraries before sequencing. The code below will also demultiplex your reads, but keep in mind that this is only necessary if you pooled multiple barcoded samples into a single sequencing run (multiplexing). Dorado knows the barcode sequences for every ONT kit, so it can use the kit name and sample sheet to assign a SampleID to every read.

You do not need to use the GPU nodes for this or any of the other jobs in this workflow (though it is fine if you are already on one). It is faster and easier to grab any available node and use the CPUs instead.

Again, run the chunk below and then copy and paste the text that it creates to replace the text in the file batch_scripts/demux_trim_remote.sh

#!/bin/bash
#SBATCH --time=00:45:00
#SBATCH --job-name=demux_trim
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/demux_trim_%A_%a.err
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/demux_trim_%A_%a.out
#SBATCH --partition=guest
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=100GB

seqrun=NWR1
samplesheet=samples/sample_sheets/bats/nwr1_sample_sheet.csv

cd /work/richlab/aliciarich/read_processing

mkdir -p "reads/trimmed/$seqrun"

module load dorado

dorado demux "reads/basecalled/$seqrun.bam" \
    --output-dir "reads/trimmed/$seqrun" \
    --kit-name "SQK-16S114-24" \
    --sample-sheet "$samplesheet" \
    --emit-fastq --emit-summary

Once you have confirmed that this script has been transferred to the read_processing/batch_scripts path in your HCC directory, run the code below to submit the job.

cd read_processing
sbatch batch_scripts/demux_trim_remote.sh
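
Once the job finishes, a quick sanity check is to count the per-sample FASTQ files Dorado wrote; this is a sketch that assumes the NWR1 seqrun name from the script above, and the exact file names will depend on your sample sheet.

# count and preview the demultiplexed FASTQ files
ls reads/trimmed/NWR1/*.fastq | wc -l
ls reads/trimmed/NWR1 | head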

Quality Control

Now we need to trim away the messy ends of our reads (they are kind of like the first pancake: never as precise as the middle regions) and filter out any reads with a quality score below our threshold. Again, these parameters will be important to note in the methods of any results you publish or present, so I keep a record of my working parameters in the YAML header or my config file, which you will see below.

Open Interactive Job

Type the following into the HCC terminal to open an interactive job.

srun --partition=guest --nodes=1 --ntasks-per-node=1 --job-name=qc_multiplex --mem=100GB --time=0:45:00 --pty $SHELL

Part 1

Copy and paste the following code.

module load anaconda

conda activate filter

cd /work/richlab/aliciarich/read_processing

seqrun=NWR1
trimmed=/work/richlab/aliciarich/read_processing/reads/trimmed/$seqrun
filtered=/work/richlab/aliciarich/read_processing/reads/filtered/$seqrun

mkdir -p "$filtered"

cd "$trimmed"

for file in ${trimmed}/*.fastq; do

    if [ -f "$file" ]; then
       
        base_filename=$(basename "$file")

        chopper --maxlength 2000 --minlength 1000 --quality 7 --input "$file" > "$filtered/$base_filename"

        echo "Processed $file"
    else
        echo "Error: File $file does not exist or is not a regular file."
    fi
done

Part 2

This part just reorganizes the files into the directory structure that the EPI2ME workflow we use next expects, so it can locate them easily (a quick check of the resulting layout is sketched after this chunk).

cd $filtered

for file in "$filtered"/*.fastq; do
    if [ -f "$file" ]; then
        base_filename=$(basename "$file" .fastq)
        
        mkdir -p "$filtered/$base_filename"
        
        mv "$file" "$filtered/$base_filename/${base_filename}.fastq"
        
        echo "Organized $file"
    else
        echo "Error: File $file does not exist or is not a regular file."
    fi
done

conda deactivate

cd /work/richlab/aliciarich/read_processing
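
To confirm the reorganization worked, you can list the per-sample subdirectories; wf-16s expects each subdirectory under the --fastq path to hold the reads for one sample. This is a minimal check that assumes $filtered is still set from Part 1.

# each sample should now be a subdirectory containing a single FASTQ of the same name
find "$filtered" -mindepth 2 -name "*.fastq" | head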

Alignment, Taxonomy, Abundance

We will use the wf-16s pipeline by EPI2ME to do this. You should have a look at all the options and steps described here if you use this workflow. Work out which options I have selected and which I have left at their defaults in the script provided, because you will need to mention those details in the methods of anything you publish or present.

Again, run the chunk below and then copy and paste the text it creates into a script that you will run. Name this script wf16s_remote.sh and place it in batch_scripts/ with the others.

Notice that this is a more memory- and time-intensive job than the others. You will need to scale up time for more samples/reads or longer reads.

#!/bin/bash
#SBATCH --time=03:00:00
#SBATCH --job-name=wf16s
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/wf16s_%A_%a.err
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/wf16s_%A_%a.out
#SBATCH --partition=guest
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=350GB

module purge
module load nextflow

cd /work/richlab/aliciarich/read_processing

seqrun=NWR1

mkdir -p "processed/$seqrun"

nextflow run epi2me-labs/wf-16s \
    -profile singularity \
    --fastq "reads/filtered/$seqrun" \
    --taxonomic_rank S \
    --keep_bam \
    --minimap2_by_reference \
    --include_read_assignments \
    --out_dir "processed/$seqrun" \
    --min_len 1000 \
    --max_len 2000 \
    --abundance_threshold 0 \
    --min_read_qual 7 \
    --min_percent_identity 85 \
    --min_ref_coverage 80 \
    --n_taxa_barplot 12 \
    --threads 32

Once you have confirmed that this script has been transferred to the read_processing/batch_scripts path in your HCC directory, run the code below to submit the job.

cd read_processing
sbatch batch_scripts/wf16s_remote.sh

Next Steps

Once this run completes, make sure you transfer the processed directory to your local read_processing directory and push everything to the GitHub repository. You will move on to analysis scripts that draw from the files in read_processing/processed to create new files in bioinformatics_stats.
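
One way to pull the results back down is rsync in the other direction; this is a minimal sketch run from your local read_processing directory, again assuming the swan.unl.edu login host and a <hcc_username> placeholder.

# pull the processed results for this seqrun down from the HCC into the local processed/ directory
rsync -avP <hcc_username>@swan.unl.edu:/work/richlab/aliciarich/read_processing/processed/NWR1 processed/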