---
title: "First-Use Setup: Microbiome Read Processing via PRONAME"
author: "Alicia M. Rich, Ph.D."
date: "`r Sys.Date()`"
output:
  html_document:
    theme:
      bslib: true
    toc: true
    toc_depth: 3
    toc_float: true
    df_print: paged
    css: journal.css
    code_download: true
  
---

```{r setup, message=FALSE, comment=""}
library(conflicted)
library(tidyverse)
library(downloadthis)
library(fs)

source("setup/conflicted.R")
source("setup/knit_engines_simple.R")

knitr::opts_chunk$set(message = FALSE,
                      warning = FALSE,
                      echo    = TRUE,
                      include = TRUE,
                      eval    = TRUE,
                      comment = "")


```


# Background

Read more about this workflow here: [*PRONAME: a user-friendly pipeline to process long-read nanopore metabarcoding data by generating high-quality consensus sequences*](https://www.frontiersin.org/journals/bioinformatics/articles/10.3389/fbinf.2024.1483255/full).


## Integrating Dorado for trimming & demux

We can continue to use the Dorado basecalling → demux steps, then hand off the demultiplexed FASTQ files to PRONAME. Here’s how it works:

1.	Dorado does the heavy lifting
	- GPU-accelerated SUP base-calling
	- Barcode demultiplexing (with your ONT 16S kit’s barcodes)
	  - Outputs clean, per-sample FASTQs.
	  
2.	PRONAME's 4-step Pipeline:

A.  `proname_import`
  
  - We will skip the optional trimming, which we already let dorado handle automatically.
  - We use the `--duplex 'yes'` argument to optimize for V14 kit chemistry.
  - PRONAME provides length-vs-quality scatterplots that we can use to inform our QC parameters.
  
B.  `proname_filter`
  
  - Set the optimal filtering thresholds based on results in the previous step and visualize the filtering impact on results.
    
C. `proname_refine`
  
  - Now PRONAME polishes reads via `medaka`, performs read clustering, removes chimeric sequences, and generates error-corrected consensus sequences.
  - Files may be exported directly to `QIIME2` to adapt standard Illumina-type workflows.
  
D.  `proname_taxonomy`
  
  - The files generated while gathering the consensus sequences and the standard reference databases are used to perform a taxonomic analysis.
  - This step produces a taxonomy file and a taxa barplot.
    
3.  Post-PRONAME

  - After this, we can use standard `QIIME2` pipelines that easily integrate into R packages like phyloseq.
    - For example, we can use `qiime phylogeny` to produce a phylogenetic diversity analysis (*No more need to fetch references directly from GenBank ourselves!*).
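
Before handing Dorado's per-sample FASTQs to PRONAME, it can help to confirm that every sample actually received reads. The sketch below is not part of either tool: `count_reads` is a hypothetical helper, and it assumes standard 4-line FASTQ records in files ending in `.fastq` or `.fastq.gz`.

```bash
# Hypothetical helper: print "<file><TAB><read count>" for each FASTQ in a directory.
# Assumes 4-line FASTQ records (no wrapped sequence lines).
count_reads() {
  local dir="$1"
  local f lines
  for f in "$dir"/*.fastq "$dir"/*.fastq.gz; do
    [ -e "$f" ] || continue                    # skip patterns that matched nothing
    case "$f" in
      *.gz) lines=$(gzip -dc "$f" | wc -l) ;;  # decompress to stdout, count lines
      *)    lines=$(wc -l < "$f") ;;
    esac
    printf '%s\t%d\n' "$(basename "$f")" "$((lines / 4))"
  done
}
```

Calling `count_reads path/to/demuxed_fastqs` (a placeholder path) lists one line per sample, which makes empty or unexpectedly small barcodes easy to spot before import.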
    

# Setting up your Swan Workspace

## Dorado Usage

You can skip these steps if you already have `dorado` up and running on Swan, including the most up-to-date SUP basecalling model.

### SUP Model Download

Before you begin, look over [this page](https://github.com/nanoporetech/dorado#dna-models) for a basic understanding of which models Dorado provides for basecalling ONT reads. I use the `config` package or parameters in the YAML header to track and source the models that I am using. You will need to report details like this in the methods section of any paper produced from your results.  
  
>**We will almost always choose the newest SUP model available on the HCC with the 10.4.1 kit chemistry.**  
  
For some reason, dorado's automatic sourcing and use of models does not seem to work from the GPU nodes on the HCC, so we will download a stable copy of our current model options ourselves. *The download needs to land in the path used below (`dorado_models` within your working directory) so that the code I have written in other scripts can locate it automatically.*  
You should at least download the newest `sup` model available, but you may also download the newest `hac` and `fast` models if you would like. The code in other scripts will search this directory for whichever of these three models you specify at that time.
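
As a rough illustration of what such a directory search could look like (not the code used in the other scripts), the sketch below picks the newest downloaded model for one tier. `find_model` is a hypothetical helper. It assumes each model sits in its own subdirectory under its official name, for example `dorado_models/dna_r10.4.1_e8.2_400bps_sup@v5.2.0`, and that GNU `sort -V` is available.

```bash
# Hypothetical helper: print the path of the newest downloaded model for one tier.
# Assumes model directories keep their official names inside dorado_models/.
find_model() {
  local tier="$1"                        # one of: sup, hac, fast
  local model_dir="${2:-dorado_models}"
  local newest
  # Match simplex models only; names with a second "@" are modification models.
  newest=$(find "$model_dir" -mindepth 1 -maxdepth 1 -type d \
             -name "dna_r10.4.1_*_${tier}@v*" ! -name '*@*@*' 2>/dev/null |
             sort -V | tail -n 1)        # version sort, keep the highest
  if [ -z "$newest" ]; then
    echo "No ${tier} model found in ${model_dir}" >&2
    return 1
  fi
  echo "$newest"
}
```

A downstream script could then call `model=$(find_model sup) || exit 1` and stop early when the requested tier has not been downloaded yet.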

Run the chunk below after replacing the model name with your model of choice (and the directory paths with your own), then transfer the file `batch_scripts/dorado_setup.sh` to your repo mirror on Swan work.

```{cat, engine.opts=list(file='batch_scripts/dorado_setup.sh')}
#!/bin/bash
#SBATCH --job-name=dorado_model
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/dorado_model.%j.out
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/dorado_model.%j.err
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --constraint='gpu_v100|gpu_t4'
#SBATCH --partition=gpu,guest_gpu
#SBATCH --gres=gpu:2
#SBATCH --mem=20GB
```


```{cat, engine.opts=list(file='batch_scripts/dorado_setup.sh', append=TRUE)}
module load apptainer

cd /work/richlab/aliciarich/read_processing

apptainer exec docker://nanoporetech/dorado:latest dorado download --model dna_r10.4.1_e8.2_400bps_sup@v5.2.0 --directory dorado_models

```

```{r, echo=FALSE}
download_file(
  path = "batch_scripts/dorado_setup.sh",
  output_name    = "dorado_setup",
  button_label   = "Download Script",
  button_type    = "danger",
  has_icon       = TRUE,
  icon           = "fa fa-save",
  self_contained = TRUE
  
)
cat(
  "# batch_scripts/dorado_setup.sh\n",
  read_lines("batch_scripts/dorado_setup.sh"), sep = "\n")
```


Once you have confirmed that this script has transferred to the `read_processing/batch_scripts` path in your HCC directory, run the code below to submit the job.

```{terminal, echo=FALSE}
cd read_processing
sbatch batch_scripts/dorado_setup.sh
```
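
If you want to follow the job after submitting, one option is to capture the job ID and work out which log files the `#SBATCH --output` and `--error` lines will write. This is only a sketch: `log_paths` is a hypothetical helper, and the default log directory is copied from the script above, so swap in your own. It uses `sbatch --parsable` in place of the plain `sbatch` call, so run one or the other, not both.

```bash
# Hypothetical helper: print the log paths that %j expands to for a given job ID.
log_paths() {
  local job_id="$1"
  local log_dir="${2:-/work/richlab/aliciarich/read_processing/logs}"
  echo "${log_dir}/dorado_model.${job_id}.out"
  echo "${log_dir}/dorado_model.${job_id}.err"
}

# Only submit where Slurm is available (i.e., on Swan, not on your laptop).
if command -v sbatch >/dev/null 2>&1; then
  job_id=$(sbatch --parsable batch_scripts/dorado_setup.sh)
  job_id="${job_id%%;*}"     # --parsable may print "jobid;cluster"
  squeue -j "$job_id"        # show the job's current state
  log_paths "$job_id"        # where to look once it starts running
fi
```

Once the job finishes, the `.err` file is the first place to check if `dorado_models` is still empty.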

## Compression Tools

```{terminal, echo=FALSE}
module load anaconda

conda create -n pigzip pigz=2.8
```
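
`pigz` is a parallel implementation of `gzip`. Assuming the environment is meant for compressing FASTQ files between steps (this document does not say so explicitly), a minimal usage sketch might look like the following. `compress_fastqs` is a hypothetical helper, and the fallback to plain `gzip` is only there so that it also runs where `pigz` is not installed.

```bash
# Hypothetical helper: gzip every .fastq file in a directory, in parallel when possible.
compress_fastqs() {
  local dir="$1"
  local threads="${2:-4}"
  local f
  for f in "$dir"/*.fastq; do
    [ -e "$f" ] || continue            # skip when the pattern matched nothing
    if command -v pigz >/dev/null 2>&1; then
      pigz -p "$threads" "$f"          # replaces the file with <file>.gz
    else
      gzip "$f"                        # single-threaded fallback
    fi
  done
}
```

On Swan this would typically follow `conda activate pigzip`, for example `compress_fastqs path/to/fastqs 8` with a placeholder directory.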

## PRONAME Usage

### Local Docker Image

```{cat, engine.opts=list(file='batch_scripts/proname_setup.sh')}
#!/bin/bash
#SBATCH --job-name=proname_setup
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/proname_setup.%j.out
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/proname_setup.%j.err
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --partition=guest
#SBATCH --mem=100GB

cd /work/richlab/aliciarich/read_processing

mkdir -p containers
cd containers

module load apptainer

apptainer pull docker://benn888/proname:v2.0.1-amd64

apptainer inspect proname_v2.0.1-amd64.sif

```
```{r, echo=FALSE}
download_file(
  path = "batch_scripts/proname_setup.sh",
  output_name    = "proname_setup",
  button_label   = "Download Script",
  button_type    = "danger",
  has_icon       = TRUE,
  icon           = "fa fa-save",
  self_contained = TRUE
  
)
cat(
  "# batch_scripts/proname_setup.sh\n",
  read_lines("batch_scripts/proname_setup.sh"), sep = "\n")
```
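
After the setup job finishes, a quick check that the image exists and exposes the PRONAME commands can save a failed run later. The sketch below assumes the `.sif` file landed in `containers/` under the name used above and that `bash` is available inside the container. `require_sif` is a hypothetical helper, not part of Apptainer or PRONAME.

```bash
# Hypothetical helper: fail early when the pulled image is missing or empty.
require_sif() {
  local sif="$1"
  if [ ! -s "$sif" ]; then
    echo "Container image not found or empty: $sif" >&2
    return 1
  fi
  echo "Found container image: $sif"
}

sif="containers/proname_v2.0.1-amd64.sif"

# Only probe the image where apptainer is available and the pull succeeded.
if command -v apptainer >/dev/null 2>&1 && require_sif "$sif"; then
  # Confirm the first pipeline command is on the PATH inside the container.
  apptainer exec "$sif" bash -c 'command -v proname_import'
fi
```

Run this from the `read_processing` directory so the relative path to `containers/` resolves.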



