global             <- config::get(config = "default")

here::i_am("RemoteHCC_FirstUse.Rmd")
source(here::here(global$setup))

theme_set(theme_classic())
thematic_rmd()
thematic_on(accent = "#8785B2FF", fg = "black")

opts_chunk$set(message = FALSE,
               warning = FALSE,
               echo    = FALSE,
               include = TRUE,
               eval    = TRUE,
               comment = "")

Background

MinION Data

ONT sequencers measure electrical signals as strands of DNA pass through each nanopore. The MinKNOW software uses an algorithm to convert each signal to its estimated sequence of A’s/G’s/C’s/T’s in real time, but those on-the-fly basecalls are quite messy. It is standard practice to take the original signal data after the run has completed and re-basecall it with a slower, more precise algorithm.

Files to Use

After a sequencing run, MinKNOW will drop many different files and subdirectories into a parent directory labeled with the name of your run. Here is what to do with the main files you need to carry forward:

**/pod5/*.pod5
  Description: Raw reads (not yet basecalled) that passed the initial quality-control parameters in the local MinKNOW software.
  Use: Your input for the basecalling step you take first.
  Destination: Working directory on HCC/Swan that you will call in your script.

barcode_alignment_*.tsv
  Description: Output statistics for multiplexed/barcoded sequencing runs, summarized by barcode.
  Use: We rename the file with your library code (e.g., hdz1) as a prefix and remove the long string after 'alignment_' before folding the file into our SampleInventory script for preparing the demultiplexing sample sheets.
  Destination: Properly renamed copies go in the bioinformatics-stats repository under the subdirectory dataframes/barcodes/.

**/dataframes/sample_sheet/*/*_sample_sheet.csv
  Description: Dorado-formatted sample sheets created in the SampleInventory workflow using the barcode_alignment tables.
  Use: After the SampleInventory workflow you should have one sample sheet per sequencing run (with the run name/code as the file prefix) that Dorado can use to match sample/sequence names to barcodes (only necessary if you are working with multiplexed/barcoded libraries).
  Destination: Working directory on HCC/Swan that you will call in your script.

Syntax of this Workflow

If you see a step involving the command “sbatch”, I am referencing a separate file with the extension .sh, a complete batch script that runs through the HCC’s SLURM scheduler. You should transfer your version of that script to your working directory on the cluster before running the sbatch command. Sometimes I also run shorter tasks as interactive jobs. You can read more about that in the HCC manual if you want to try it.
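
For example, after transferring the script, submitting it from your working directory on Swan looks like the following (the script name is a placeholder for whichever .sh file the step references):

# hand the batch script to the SLURM scheduler
sbatch your_batch_script.sh

SLURM queues the job and, by default, writes its output to a slurm-<jobid>.out file in the same directory.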

I also use a custom language engine in this script that I named “terminal.” If you see a chunk of code with {terminal, warning = FALSE} written where you would usually see {r} at the top of the chunk, then running the chunk only prints that code as a text string in this document. This just makes it easier for me to copy and paste the code directly into the terminal pane of my RStudio window when running code through a remote server instead of my local R console. There are ways to set RStudio up to run code through multiple servers, but I find this the simplest way to switch back and forth while still keeping a record of the code I have used and the changes I have made to it.
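
For example, a chunk headed {terminal, warning = FALSE} whose body is the two lines below would, when knit, simply print them for copying into the terminal pane rather than evaluating them in R (the commands are illustrations drawn from later steps):

# printed, not run, when the document knits; paste into the RStudio terminal connected to the remote server
module load dorado
dorado download --list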

Methods Parameters

I keep track of my workflow parameters using the config package and a config.yml file. Below is my full-length config.yml file. Scroll down to “methods_16s” to see the parameters I use for 16S microbiome sequences generated with ONT’s rapid 16S kit. Keeping parameters in one central location that all other scripts reference avoids inconsistencies across bioinformatic stages and gives us one place to check when reporting the methods for manuscripts or presentations.

default:
  setup: "setup/setup.R"
  conflicts: "setup/conflicts.R"
  functions: "setup/functions.R"
  packages: "setup/packages.R"
  inputs: "setup/inputs.R"
  knit_engines: "setup/knit_engines.R"
  fonts: "setup/fonts/FA6_Free-Solid-900.otf"
  tmp_tsv: "tmp/tmp_table.tsv"
  tmp_downloads: "tmp/downloads/"
  tmp_fetch: "tmp/fetch_references.txt"
  tmp_fasta3: "tmp/tmp3.fasta"
  tmp_fasta4: "tmp/tmp4.fasta"
  
scripts:
  local:
    basecall: "batch_scripts/basecall_local.sh"
    trimWG: "batch_scripts/trimWG_local.sh"
  
isolates:
  samplesets: "salci" 
  minQC: 10

bats:
  inventories:
    all_stages: "samples/inventories/bats/compilation_bats.tsv"
    collection: "samples/inventories/bats/samples_bats.csv"
    extraction: "samples/inventories/bats/extracts_bats.csv"
    libraries: "samples/inventories/bats/libraries_bats.csv"
    seqruns: "samples/inventories/bats/seqruns_bats.csv"

loris:
  day1: "2023-10-26"
  last: "2024-10-25"
  sequencing:
    coverage: "visuals/loris_depth_summary.html"
    depth_plot: "visuals/loris_depth_hist.html"
  metadata: 
    scripts: !expr list("metadata/loris/colors.R", "metadata/loris/metadata_key.R", "metadata/loris/nutrition.R", "metadata/loris/hdz_loris_log.R", "metadata/loris/diet_tables.R")
    bristol: "metadata/loris/bristols.tsv"
    studbook: "metadata/loris/subjects_loris.csv"
    summary: "metadata/loris/samples_metadata.tsv"
    key: "metadata/loris/metadata_key.R"
    factors: "metadata/loris/factors.R"
    foods: "metadata/loris/foods.tsv"
    proteins: "metadata/loris/proteins.tsv"
    fats: "metadata/loris/fats.tsv"
    CHOs: "metadata/loris/CHOs.tsv"
    Ash: "metadata/loris/Ash.tsv"
    vitamins: "metadata/loris/vitamins.tsv"
    reactable: "metadata/loris/loris_metadata_summary.html"
    sample_table: 
      identifiers: "metadata/loris/identifier_key.tsv"
      main: "metadata/loris/sample_table.tsv"
      merged: "metadata/loris/sample_table_merged.tsv"
  inventories:
    all_stages: "dataframes/sample_inventories/compilation_loris.tsv"
    collection: "dataframes/sample_inventories/samples_loris.csv"
    extraction: "dataframes/sample_inventories/extracts_loris.csv"
    libraries: "dataframes/sample_inventories/libraries_loris.csv"
    seqruns: "dataframes/sample_inventories/seqruns_loris.csv"
  outputs_wf16s: "data/loris/outputs_wf16s/"
  barcodes_output: "dataframes/barcodes/loris/"
  read_alignments: "data/loris/read_alignments"
  taxa_reps:
    aligned: "data/loris/taxonomy/refseqs_aligned.fasta"
    tree: "data/loris/taxonomy/refseqs_tree.newick"
    table: "data/loris/taxonomy/tax_table.tsv"
  abundance_wf16s: "data/loris/wf16s_abundance/"
  microeco: 
    dataset:
      main:
        keg: "microeco/loris/datasets/main/keg"
        njc: "microeco/loris/datasets/main/njc"
        fpt: "microeco/loris/datasets/main/fpt"
        tax: "microeco/loris/datasets/main"
      culi:
        keg: "microeco/loris/datasets/culi/keg"
        njc: "microeco/loris/datasets/culi/njc"
        fpt: "microeco/loris/datasets/culi/fpt"
        tax: "microeco/loris/datasets/culi"
      warb:
        keg: "microeco/loris/datasets/warble/keg"
        njc: "microeco/loris/datasets/warble/njc"
        fpt: "microeco/loris/datasets/warble/fpt"
        tax: "microeco/loris/datasets/warble"
    abund:
      main:
        keg: "microeco/loris/abundance/main/keg"
        fpt: "microeco/loris/abundance/main/fpt"
        njc: "microeco/loris/abundance/main/njc"
        tax: "microeco/loris/abundance/main"
      culi:
        keg: "microeco/loris/abundance/culi/keg"
        fpt: "microeco/loris/abundance/culi/fpt"
        njc: "microeco/loris/abundance/culi/njc"
        tax: "microeco/loris/abundance/culi"
      warb:
        keg: "microeco/loris/abundance/warble/keg"
        fpt: "microeco/loris/abundance/warble/fpt"
        njc: "microeco/loris/abundance/warble/njc"
        tax: "microeco/loris/abundance/warble"
    alpha:
      main: "microeco/loris/alphadiversity/main"
      culi: "microeco/loris/alphadiversity/culi"
      warb: "microeco/loris/alphadiversity/warble"
    beta:
      main:
        kegg: "microeco/loris/betadiversity/main/keg"
        fpt: "microeco/loris/betadiversity/main/fpt"
        njc: "microeco/loris/betadiversity/main/njc"
        tax: "microeco/loris/betadiversity/main"
      culi:
        kegg: "microeco/loris/betadiversity/culi/keg"
        fpt:  "microeco/loris/betadiversity/culi/fpt"
        njc:  "microeco/loris/betadiversity/culi/njc"
        tax: "microeco/loris/betadiversity/culi"
      warb:
        kegg: "microeco/loris/betadiversity/warble/keg"
        fpt:  "microeco/loris/betadiversity/warble/fpt"
        njc:  "microeco/loris/betadiversity/warble/njc"
        tax: "microeco/loris/betadiversity/warble"
    data:
      main:
        feature: "microeco/loris/datasets/main/feature_table.tsv"
        tree:    "microeco/loris/datasets/main/phylo_tree.tre"
        fasta:   "microeco/loris/datasets/main/rep_fasta.fasta"
        samples: "microeco/loris/datasets/main/sample_table.tsv"
        taxa:    "microeco/loris/datasets/main/tax_table.tsv"
      culi: 
        feature: "microeco/loris/datasets/culi/feature_table.tsv"
        tree:    "microeco/loris/datasets/culi/phylo_tree.tre"
        fasta:   "microeco/loris/datasets/culi/rep_fasta.fasta"
        samples: "microeco/loris/datasets/culi/sample_table.tsv"
        taxa:    "microeco/loris/datasets/culi/tax_table.tsv"
      warb:
        feature: "microeco/loris/datasets/warb/feature_table.tsv"
        tree:    "microeco/loris/datasets/warb/phylo_tree.tre"
        fasta:   "microeco/loris/datasets/warb/rep_fasta.fasta"
        samples: "microeco/loris/datasets/warb/sample_table.tsv"
        taxa:    "microeco/loris/datasets/warb/tax_table.tsv"


sample_sheets:
  compilations:
    bats:    "samples/sample_sheets/bats/nwr_combined_sample_sheet.csv"
    loris:    "dataframes/sample_sheet/loris/hdz_combined_sample_sheet.csv"
    marmoset: "dataframes/sample_sheet/marmoset/cm_combined_sample_sheet.csv"
  nwr1: "samples/sample_sheets/bats/nwr1_sample_sheet.csv"
  hdz1:  "samples/sample_sheets/loris/hdz1_sample_sheet.csv"
  hdz2:  "samples/sample_sheets/loris/hdz2_sample_sheet.csv"
  hdz3:  "samples/sample_sheets/loris/hdz3_sample_sheet.csv"
  hdz4:  "samples/sample_sheets/loris/hdz4_sample_sheet.csv"
  hdz5:  "samples/sample_sheets/loris/hdz5_sample_sheet.csv"
  hdz6:  "samples/sample_sheets/loris/hdz6_sample_sheet.csv"
  hdz7:  "samples/sample_sheets/loris/hdz7_sample_sheet.csv"
  hdz8:  "samples/sample_sheets/loris/hdz8_sample_sheet.csv"
  hdz9:  "samples/sample_sheets/loris/hdz9_sample_sheet.csv"
  hdz10: "samples/sample_sheets/loris/hdz10_sample_sheet.csv"
  hdz11: "samples/sample_sheets/loris/hdz11_sample_sheet.csv"
  hdz12: "samples/sample_sheets/loris/hdz12_sample_sheet.csv"
  hdz13: "samples/sample_sheets/loris/hdz13_sample_sheet.csv"
  hdz14: "samples/sample_sheets/loris/hdz14_sample_sheet.csv"
  hdz15: "samples/sample_sheets/loris/hdz15_sample_sheet.csv"
  hdz16: "samples/sample_sheets/loris/hdz16_sample_sheet.csv"
  hdz17: "samples/sample_sheets/loris/hdz17_sample_sheet.csv"
  hdz18: "samples/sample_sheets/loris/hdz18_sample_sheet.csv"
  cm001: "dataframes/sample_sheet/marmoset/cm001_sample_sheet.csv"
  cm002: "dataframes/sample_sheet/marmoset/cm002_sample_sheet.csv"
  cm003: "dataframes/sample_sheet/marmoset/cm003_sample_sheet.csv"
  cm004: "dataframes/sample_sheet/marmoset/cm004_sample_sheet.csv"
  cm005: "dataframes/sample_sheet/marmoset/cm005_sample_sheet.csv"
  cm006: "dataframes/sample_sheet/marmoset/cm006_sample_sheet.csv"
  cm007: "dataframes/sample_sheet/marmoset/cm007_sample_sheet.csv"
  cm008: "dataframes/sample_sheet/marmoset/cm008_sample_sheet.csv"
  cm009: "dataframes/sample_sheet/marmoset/cm009_sample_sheet.csv"
  cm010: "dataframes/sample_sheet/marmoset/cm010_sample_sheet.csv"
  cm011: "dataframes/sample_sheet/marmoset/cm011_sample_sheet.csv"

barcode_alignments:
  compilations:
    loris:    "samples/barcode_alignments/loris/hdz_combined_barcode_alignment.tsv"
    bats:    "samples/barcode_alignments/bats/nwr_combined_barcode_alignment.tsv"
    marmoset: "../labwork/minion_data/barcode_alignments/marmoset/cm_combined_barcode_alignment.tsv"
  nwr1: "samples/barcode_alignments/bats/nwr1_barcode_alignment.tsv"
  hdz1:  "samples/barcode_alignments/loris/hdz1_barcode_alignment.tsv"
  hdz2:  "samples/barcode_alignments/loris/hdz2_barcode_alignment.tsv"
  hdz3:  "samples/barcode_alignments/loris/hdz3_barcode_alignment.tsv"
  hdz4:  "samples/barcode_alignments/loris/hdz4_barcode_alignment.tsv"
  hdz5:  "samples/barcode_alignments/loris/hdz5_barcode_alignment.tsv"
  hdz6:  "samples/barcode_alignments/loris/hdz6_barcode_alignment.tsv"
  hdz7:  "samples/barcode_alignments/loris/hdz7_barcode_alignment.tsv"
  hdz8:  "samples/barcode_alignments/loris/hdz8_barcode_alignment.tsv"
  hdz9:  "samples/barcode_alignments/loris/hdz9_barcode_alignment.tsv"
  hdz10: "samples/barcode_alignments/loris/hdz10_barcode_alignment.tsv"
  hdz11: "samples/barcode_alignments/loris/hdz11_barcode_alignment.tsv"
  hdz12: "samples/barcode_alignments/loris/hdz12_barcode_alignment.tsv"
  hdz13: "samples/barcode_alignments/loris/hdz13_barcode_alignment.tsv"
  hdz14: "samples/barcode_alignments/loris/hdz14_barcode_alignment.tsv"
  hdz15: "samples/barcode_alignments/loris/hdz15_barcode_alignment.tsv"
  hdz16: "samples/barcode_alignments/loris/hdz16_barcode_alignment.tsv"
  hdz17: "samples/barcode_alignments/loris/hdz17_barcode_alignment.tsv"
  hdz18: "samples/barcode_alignments/loris/hdz18_barcode_alignment.tsv"
  cm001: "../labwork/minion_data/barcode_alignments/marmoset/cm001_barcode_alignment.tsv"
  cm002: "../labwork/minion_data/barcode_alignments/marmoset/cm002_barcode_alignment.tsv"
  cm003: "../labwork/minion_data/barcode_alignments/marmoset/cm003_barcode_alignment.tsv"
  cm004: "../labwork/minion_data/barcode_alignments/marmoset/cm004_barcode_alignment.tsv"
  cm005: "../labwork/minion_data/barcode_alignments/marmoset/cm005_barcode_alignment.tsv"
  cm006: "../labwork/minion_data/barcode_alignments/marmoset/cm006_barcode_alignment.tsv"
  cm007: "../labwork/minion_data/barcode_alignments/marmoset/cm007_barcode_alignment.tsv"
  cm008: "../labwork/minion_data/barcode_alignments/marmoset/cm008_barcode_alignment.tsv"
  cm009: "../labwork/minion_data/barcode_alignments/marmoset/cm009_barcode_alignment.tsv"
  cm010: "../labwork/minion_data/barcode_alignments/marmoset/cm010_barcode_alignment.tsv"
  cm011: "../labwork/minion_data/barcode_alignments/marmoset/cm011_barcode_alignment.tsv"

abund_wf16s_files:
  hdz1:  "data/loris/wf16s_abundance/hdz1_abundance_table_species.tsv"
  hdz2:  "data/loris/wf16s_abundance/hdz2_abundance_table_species.tsv"
  hdz3:  "data/loris/wf16s_abundance/hdz3_abundance_table_species.tsv"
  hdz4:  "data/loris/wf16s_abundance/hdz4_abundance_table_species.tsv"
  hdz5:  "data/loris/wf16s_abundance/hdz5_abundance_table_species.tsv"
  hdz6:  "data/loris/wf16s_abundance/hdz6_abundance_table_species.tsv"
  hdz7:  "data/loris/wf16s_abundance/hdz7_abundance_table_species.tsv"
  hdz8:  "data/loris/wf16s_abundance/hdz8_abundance_table_species.tsv"
  hdz9:  "data/loris/wf16s_abundance/hdz9_abundance_table_species.tsv"
  hdz10: "data/loris/wf16s_abundance/hdz10_abundance_table_species.tsv"
  hdz11: "data/loris/wf16s_abundance/hdz11_abundance_table_species.tsv"
  hdz12: "data/loris/wf16s_abundance/hdz12_abundance_table_species.tsv"
  hdz13: "data/loris/wf16s_abundance/hdz13_abundance_table_species.tsv"
  hdz14: "data/loris/wf16s_abundance/hdz14_abundance_table_species.tsv"
  hdz15: "data/loris/wf16s_abundance/hdz15_abundance_table_species.tsv"
  hdz16: "data/loris/wf16s_abundance/hdz16_abundance_table_species.tsv"
  hdz17: "data/loris/wf16s_abundance/hdz17_abundance_table_species.tsv"
  hdz18: "data/loris/wf16s_abundance/hdz18_abundance_table_species.tsv"
  cm001: "data/marmoset/wf16s_abundance/cm001_abundance_table_species.tsv"
  cm002: "data/marmoset/wf16s_abundance/cm002_abundance_table_species.tsv"
  cm003: "data/marmoset/wf16s_abundance/cm003_abundance_table_species.tsv"
  cm004: "data/marmoset/wf16s_abundance/cm004_abundance_table_species.tsv"
  cm005: "data/marmoset/wf16s_abundance/cm005_abundance_table_species.tsv"
  cm006: "data/marmoset/wf16s_abundance/cm006_abundance_table_species.tsv"
  cm007: "data/marmoset/wf16s_abundance/cm007_abundance_table_species.tsv"
  cm008: "data/marmoset/wf16s_abundance/cm008_abundance_table_species.tsv"
  cm009: "data/marmoset/wf16s_abundance/cm009_abundance_table_species.tsv"
  cm010: "data/marmoset/wf16s_abundance/cm010_abundance_table_species.tsv"
  cm011: "data/marmoset/wf16s_abundance/cm011_abundance_table_species.tsv"

methods_16s:
  libprep_workflow: "'rapid16s'"
  dorado_model: "'dna_r10.4.1_e8.2_400bps_sup@v5.0.0'"
  min_length: 1000
  max_length: 2000
  min_qual: 7
  min_id: 85
  min_cov: 80
  kit_name: "'SQK-16S114-24'"
  tax_rank: "S"
  n_taxa_barplot: 12
  abund_threshold: 0
  loris:
    rarefy: 4500
    norm: "SRS"
    min_abund: 0.00001
    min_freq: 1
    include_lowest: TRUE
    unifrac: TRUE
    betadiv: "aitchison"
    alpha_pd: TRUE
    tax4fun_db: "Ref99NR"
    loris_rarefy: 4500
    keg_minID: 97
    

Intro to High Performance Computing (HPC)

The basecalling step is the most memory-intensive stage of our bioinformatic pipelines. Most local machines cannot handle the task, especially for the maximum-precision algorithm we use. High-Performance Computing (HPC) systems and remote clusters (like the Holland Computing Center’s (HCC) cluster known as Swan) are powerful computers designed to handle big tasks that are too demanding for personal computers. These systems allow us to connect remotely and use their resources to process large amounts of data quickly and efficiently. For steps like ONT basecalling, which require a lot of memory and processing power, HPC systems are essential to get the job done without overwhelming our own computers.

Getting Started with the Holland Computing Center (HCC)

Any UNO/UNMC/UNL student affiliated with an established HCC research group can sign up for a free account to access the system. Follow the instructions here and enter “richlab” as your group. I will receive an email to approve your membership, and then your account will become active. You should look through the rest of the HCC’s manual to learn some of the basics before you begin using it. Begin by working through the following:

  1. Connecting to the Clusters
  • My preferred method for step 2 (Open a terminal or SSH client) is to use the terminal window inside RStudio. That makes it easier to move smoothly between code languages and servers within a single R Markdown script like this one.
  2. Handling Data
  • The most important storage guidelines are those explaining when and how to use your home vs. work directories, plus the overall limits that we share as a group across every member’s account.
  • Data transfer can trip people up at first, and you have several options. I used to use Cyberduck for uploading/downloading files, but I have since switched to Globus Connect.
  3. Submitting Jobs
  • This workflow references modular batch scripts that you can upload to your working directory and submit according to the instructions on the HCC site. You do not need to worry as much about finding, selecting, and using the different apps/modules on the server (I have taken care of that for our pipelines), but you do need to know how to take the scripts I have written, edit a few of the parameters, and tell the HCC when and how to run them (see the sketch after this list).

Set up your Directories

Load Reads to HCC Directory for Processing

You can use any number of file-transfer tools to move directories and files from a local hard drive to your working directory on the HCC. I use Globus Connect Personal. Follow the installation and use instructions to transfer your local copy of this entire repository to a location within your personal work directory on the HCC. Then you can simply use the sync option each time you begin working to ensure all your relative paths and directories are available in both locations.

If you set your working directory on Swan to match the working directory of this R project, then all the paths used here will also work in any scripts you run there.
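
For instance, if you keep the repository at the same relative location on both machines, starting each Swan session by changing into the synced copy keeps every relative path valid (the $WORK shortcut and the directory layout below are assumptions; adjust them to wherever you synced the repository):

# $WORK typically expands to /work/<group>/<user> on HCC clusters; the repository path is an example
cd $WORK/richlab/$USER/bioinformatics-stats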

First Time Steps: Setting Up your Working Swan Environment

Download Dorado Model

You should have a basic understanding of which models Dorado provides for basecalling ONT reads by looking over this page. I use the config package or parameters in the YAML header to track and source the models that I am using. You will need to report details like this in the methods section of any paper produced from your results.

We will almost always choose the newest SUP model available on the HCC with the 10.4.1 kit chemistry.

For some reason Dorado’s automatic sourcing of models does not seem to work from the GPU nodes on the HCC, so we will download a stable copy of our current model options. The model must sit at the path below within your working directory so that the code I have written in other scripts can locate it automatically.

You should at least download the newest sup model available, but you may also download the newest hac and fast models if you would like. The code in other scripts will search this directory for whichever of these three models you specify at that time.

cd read_processing

module load dorado

dorado download --list

This will give you a list of the models available for download. Select the model with sup in the name and the highest number after @v. At the time of writing, that is dna_r10.4.1_e8.2_400bps_sup@v5.0.0. Copy and paste that name below.

model="<pastehere>"
dorado download --model $model --models-directory dorado_models

Download Nextflow Workflows

We will create a local copy of the scripts needed for our current Nextflow pipelines, but first we have to open an interactive job, because the HCC does not permit Nextflow commands from the login node.

srun --partition=guest --nodes=1 --ntasks-per-node=1 --job-name=nextflow_download --mem=150GB --time=00:30:00 --cpus-per-task=8 --pty $SHELL

Below: Replace wf-bacterial-genomes with the workflow name of your choice.

module load nextflow

nextflow run epi2me-labs/wf-bacterial-genomes --help
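
For our 16S pipelines the corresponding EPI2ME workflow is wf-16s, so the same pattern looks like this (assuming that workflow name; check the EPI2ME catalog if it has changed):

# pull the 16S workflow and print its options without launching a run
nextflow run epi2me-labs/wf-16s --help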

Create Conda Environments

We need to use some modules/packages that the HCC does not pre-install for us on Swan. We will install those packages into reusable environments with the package and environment manager Anaconda (using conda and mamba). Once you create an environment and load all the necessary packages, you can call and reopen that environment any time, and the same group of packages will be available to you. We will create two environments to reuse for our quality-control and data-filtering steps: pycoQC and filter. I named the former for its only package, but the latter actually uses the package chopper.

pycoQC

module load anaconda
conda create -n pycoQC python=3.6
conda activate pycoQC
mamba install pycoqc
conda deactivate
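
Later QC steps reactivate this environment and point pycoQC at a run’s sequencing summary; a minimal sketch, assuming placeholder file names:

module load anaconda
conda activate pycoQC
# build an interactive HTML QC report from MinKNOW's sequencing summary file
pycoQC -f sequencing_summary.txt -o pycoQC_report.html
conda deactivate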

filter

conda create -n filter
conda activate filter
mamba install chopper
conda deactivate
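
The filter environment is where the min_length, max_length, and min_qual values from methods_16s above come into play; a hedged sketch of how chopper applies them (file names are placeholders):

conda activate filter
# keep reads between 1000 and 2000 bp with mean quality of at least 7, per methods_16s
gunzip -c reads.fastq.gz | chopper -q 7 --minlength 1000 --maxlength 2000 | gzip > filtered.fastq.gz
conda deactivate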

Modkit

conda create -n modkit
conda activate modkit
conda install nanoporetech::modkit
conda deactivate
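
Modkit handles modified-base output from basecalled BAM files; a rough sketch of its basic pileup command (paths are placeholders, and the later scripts set the real ones):

conda activate modkit
# summarize per-site modified-base calls from a modBAM into a bedMethyl-style table
modkit pileup reads.bam pileup.bed --log-filepath pileup.log
conda deactivate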

Next Steps

Now that you have all your necessary directory structures and virtual environments in place, you should be able to begin your Remote Read Processing steps with one of the other remote processing scripts each time. Continue to one of those now to put the environments you just built to use.