---
title: "First-Use Setup: Microbiome Read Processing via PRONAME"
author: "Alicia M. Rich, Ph.D."
date: "`r Sys.Date()`"
output:
  html_document:
    theme:
      bslib: true
    toc: true
    toc_depth: 3
    toc_float: true
    df_print: paged
    css: journal.css
    code_download: true
  
---

```{r setup, message=FALSE, comment=""}
library(conflicted)
library(tidyverse)
library(downloadthis)
library(fs)

source("setup/conflicted.R")
source("setup/knit_engines_simple.R")

knitr::opts_chunk$set(message = FALSE,
                      warning = FALSE,
                      echo    = TRUE,
                      include = TRUE,
                      eval    = TRUE,
                      comment = "")


```


# Background

Read more about this workflow here: [*PRONAME: a user-friendly pipeline to process long-read nanopore metabarcoding data by generating high-quality consensus sequences*](https://www.frontiersin.org/journals/bioinformatics/articles/10.3389/fbinf.2024.1483255/full).


## Integrating Dorado for trimming & demux

We can continue to use the Dorado basecalling → demux steps, then hand off the demultiplexed FASTQ files to PRONAME. Here’s how it works:

1.  Dorado does the heavy lifting
    - GPU-accelerated SUP basecalling
    - Barcode demultiplexing (with your ONT 16S kit’s barcodes)
      - Outputs clean, per-sample FASTQs.

2.  PRONAME's 4-step pipeline:

    A.  `proname_import`
        - We will skip the optional trimming, which we already let dorado handle automatically.
        - We use the `--duplex 'yes'` argument to optimize for V14 kit chemistry.
        - PRONAME provides length-vs-quality scatterplots that we can use to inform our QC parameters.

    B.  `proname_filter`
        - Set the filtering thresholds based on the results of the previous step and visualize the impact of filtering on the results.

    C.  `proname_refine`
        - Now PRONAME polishes reads via `medaka`, performs read clustering, removes chimeric sequences, and generates error-corrected consensus sequences.
        - Files may be exported directly to `QIIME2` to adapt standard Illumina-type workflows.

    D.  `proname_taxonomy`
        - The files generated while gathering the consensus sequences and the standard reference databases are used to perform a taxonomic analysis.
        - This step produces a taxonomy file and a taxa barplot.

3.  Post-PRONAME
    - After this, we can use standard `QIIME2` pipelines that easily integrate into R packages like phyloseq.
      - For example, we can use `qiime phylogeny` to produce a phylogenetic diversity analysis (*No more need to fetch references directly from GenBank ourselves!*).
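
Before handing the demultiplexed files to PRONAME, it can help to confirm that every sample actually received reads. The sketch below is one possible sanity check, not part of Dorado or PRONAME: the function names and the `demuxed` directory are hypothetical, and it assumes standard 4-line FASTQ records.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count reads in a FASTQ file (plain or gzip-compressed).
# Assumes standard 4-line FASTQ records.
count_fastq_reads() {
  local fq="$1" lines
  case "$fq" in
    *.gz) lines=$(gzip -dc "$fq" | wc -l) ;;
    *)    lines=$(wc -l < "$fq") ;;
  esac
  echo $(( lines / 4 ))
}

# Hypothetical helper: print "<file name><TAB><read count>" for each FASTQ
# found directly inside the given directory.
summarize_demux_dir() {
  local dir="$1" fq
  for fq in "$dir"/*.fastq "$dir"/*.fastq.gz; do
    [ -e "$fq" ] || continue   # skip glob patterns that matched nothing
    printf '%s\t%s\n' "$(basename "$fq")" "$(count_fastq_reads "$fq")"
  done
}

# Example (the directory name is an assumption):
# summarize_demux_dir demuxed
```

Samples with very low counts are worth checking before they reach `proname_import`.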
    

# Setting up your Swan Workspace

## Dorado Usage

You can skip these steps if you already have `dorado` up and running on Swan, including the most up-to-date SUP basecalling model.

### SUP Model Download

Before you start, look over [this page](https://github.com/nanoporetech/dorado#dna-models) to get a basic understanding of which models Dorado provides for basecalling ONT reads. I use the `config` package or parameters in the YAML header to track and source the models that I am using. You will need to report details like this in the methods section of any paper produced from your results.  
  
>**We will almost always choose the newest SUP model available on the HCC for the R10.4.1 chemistry.**  
  
For some reason, dorado's automatic sourcing of models does not seem to work from the GPU nodes on the HCC, so we will download a stable local copy of the models we use. *The models need to be saved to the path used below (`dorado_models` within your working directory) so that the code I have written in other scripts can locate them automatically.*  
You should at least download the newest `sup` model available, but you may also download the newest `hac` and `fast` models if you would like. The code in other scripts will search this directory for whichever of these three models you specify at that time.
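
As a rough illustration of what such a directory search could look like, the sketch below picks the newest downloaded model for a given tier. It is not the lookup code from the other scripts: the function name is hypothetical, and it assumes the models sit in `dorado_models` as directories named like `dna_r10.4.1_e8.2_400bps_sup@v5.2.0` and that GNU `sort -V` is available.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not the lookup code used in the other scripts):
# print the path of the newest model for a tier ("sup", "hac", or "fast").
# Assumes model directories named like dna_r10.4.1_e8.2_400bps_sup@v5.2.0.
find_newest_model() {
  local tier="$1" dir="${2:-dorado_models}"
  # sort -V compares version numbers numerically, so v5.10.0 sorts after v5.2.0
  find "$dir" -mindepth 1 -maxdepth 1 -type d \
       -name "dna_r10.4.1_e8.2_400bps_${tier}@v*" \
    | sort -V | tail -n 1
}

# Example:
# find_newest_model sup dorado_models
```

The function prints nothing when no model of that tier has been downloaded, which a calling script can treat as an error.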

Run the chunk below after replacing the model name with your model of choice (and the directory paths with your own), then transfer the file `batch_scripts/dorado_setup.sh` to your repo mirror on Swan work.

```{cat, engine.opts=list(file='batch_scripts/dorado_setup.sh')}
#!/bin/bash
#SBATCH --job-name=dorado_model
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/dorado_model.%j.out
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/dorado_model.%j.err
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --constraint='gpu_v100|gpu_t4'
#SBATCH --partition=gpu,guest_gpu
#SBATCH --gres=gpu:2
#SBATCH --mem=20GB
```


```{cat, engine.opts=list(file='batch_scripts/dorado_setup.sh', append=TRUE)}
module load apptainer

cd /work/richlab/aliciarich/read_processing

apptainer exec docker://nanoporetech/dorado:latest dorado download --model dna_r10.4.1_e8.2_400bps_sup@v5.2.0 --directory dorado_models

```

```{r, echo=FALSE}
download_file(
  path = "batch_scripts/dorado_setup.sh",
  output_name    = "dorado_setup",
  button_label   = "Download Script",
  button_type    = "danger",
  has_icon       = TRUE,
  icon           = "fa fa-save",
  self_contained = TRUE
  
)
cat(
  "# batch_scripts/dorado_setup.sh\n",
  read_lines("batch_scripts/dorado_setup.sh"), sep = "\n")
```


Once you have confirmed that this script has been transferred to the `read_processing/batch_scripts` path in your HCC directory, run the code below to submit the job.

```{terminal, echo=FALSE}
cd read_processing
sbatch batch_scripts/dorado_setup.sh
```
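
Slurm generally does not create the directory named in `--output` and `--error`, so the `logs` folder should exist before you submit. The sketch below shows one way to prepare the folders and to work out which log file a job will write. The function names are hypothetical, and the commented example assumes the paths from the script above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create the folders that the batch script expects.
prepare_workspace() {
  local root="${1:-.}"
  mkdir -p "$root/logs" "$root/dorado_models" "$root/batch_scripts"
}

# Hypothetical helper: fill in the %j placeholder (the job ID) used in the
# #SBATCH --output and --error patterns.
job_log_path() {
  local pattern="$1" jobid="$2"
  printf '%s\n' "$pattern" | sed "s/%j/${jobid}/g"
}

# Example (commented out because it needs Slurm):
# prepare_workspace /work/richlab/aliciarich/read_processing
# jobid=$(sbatch --parsable batch_scripts/dorado_setup.sh)
# tail -f "$(job_log_path 'logs/dorado_model.%j.out' "$jobid")"
```

`sbatch --parsable` prints only the job ID, which makes it easy to capture in a variable.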

## Compression Tools

```{terminal, echo=FALSE}
module load anaconda

conda create -n pigzip pigz=2.8
```
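
`pigz` is a parallel implementation of gzip, so the `pigzip` environment is presumably meant for compressing large FASTQ files quickly. The sketch below shows one way it might be used. The function name, the `demuxed` directory, and the thread count are assumptions, and it falls back to plain `gzip` when `pigz` is not on the PATH.

```bash
#!/usr/bin/env bash
# Hypothetical helper: gzip-compress every .fastq file in a directory.
# Uses pigz when it is available and falls back to gzip otherwise.
compress_fastqs() {
  local dir="$1" threads="${2:-4}" fq
  for fq in "$dir"/*.fastq; do
    [ -e "$fq" ] || continue          # skip when no .fastq files are present
    if command -v pigz >/dev/null 2>&1; then
      pigz -p "$threads" "$fq"        # -p sets the number of compression threads
    else
      gzip "$fq"
    fi
  done
}

# Example (on Swan, after `conda activate pigzip`):
# compress_fastqs demuxed 8
```

Both tools replace each `.fastq` file with a `.fastq.gz` file in place.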

## PRONAME Usage

### Local Docker Image

```{cat, engine.opts=list(file='batch_scripts/proname_setup.sh')}
#!/bin/bash
#SBATCH --job-name=proname_setup
#SBATCH --output=/work/richlab/aliciarich/read_processing/logs/proname_setup.%j.out
#SBATCH --error=/work/richlab/aliciarich/read_processing/logs/proname_setup.%j.err
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --partition=guest
#SBATCH --mem=100GB

cd /work/richlab/aliciarich/read_processing

mkdir -p containers
cd containers

module load apptainer

apptainer pull docker://benn888/proname:v2.0.1-amd64

apptainer inspect proname_v2.0.1-amd64.sif

```
```{r, echo=FALSE}
download_file(
  path = "batch_scripts/proname_setup.sh",
  output_name    = "proname_setup",
  button_label   = "Download Script",
  button_type    = "danger",
  has_icon       = TRUE,
  icon           = "fa fa-save",
  self_contained = TRUE
  
)
cat(
  "# batch_scripts/proname_setup.sh\n",
  read_lines("batch_scripts/proname_setup.sh"), sep = "\n")
```
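
The script above inspects `proname_v2.0.1-amd64.sif` because `apptainer pull` names its output after the image and tag (`<image>_<tag>.sif`) when no filename is given. The sketch below reproduces that naming rule so other scripts can locate the container. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the default .sif filename that `apptainer pull`
# writes for a docker:// URI, i.e. <image name>_<tag>.sif.
sif_name_from_uri() {
  local uri="$1" image
  image="${uri##*/}"                       # drop everything up to the last "/"
  case "$image" in
    *:*) ;;                                # a tag is already present
    *)   image="${image}:latest" ;;        # no tag means "latest"
  esac
  printf '%s.sif\n' "$(printf '%s' "$image" | tr ':' '_')"
}

# Example (commented out because it needs apptainer and the pulled image):
# apptainer inspect "containers/$(sif_name_from_uri docker://benn888/proname:v2.0.1-amd64)"
```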



