global <- config::get(config = "default")
here::i_am("RemoteHCC_FirstUse.Rmd")
source(here::here(global$setup))
theme_set(theme_classic())
thematic_rmd()
thematic_on(accent = "#8785B2FF", fg = "black")
opts_chunk$set(message = FALSE,
warning = FALSE,
echo = FALSE,
include = TRUE,
eval = TRUE,
comment = "")
ONT sequencers measure changes in electrical current as strands of DNA pass through each nanopore. The MinKNOW software converts each signal into an estimated sequence of A/C/G/T bases in real time, but those on-the-fly basecalls are relatively error-prone. It is standard practice to take the raw signal data after the run has completed and re-basecall it with a slower, more accurate algorithm.
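For a sense of what that re-basecalling looks like, a minimal Dorado command is sketched below. The model name and paths are placeholders rather than the exact command from our batch scripts; the real run happens on the HCC in a later step.

```bash
# Minimal sketch: re-basecall raw pod5 signal data with a high-accuracy model.
# Model name and paths are placeholders; in our scripts the real command runs
# on the HCC and points at a model copy downloaded into dorado_models/ (see below).
dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v5.0.0 pod5/ > calls.bam
```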
After a sequencing run, MinKNOW will drop many different files and subdirectories into a parent directory named after your run. Here is what to do with the main files you need to carry forward:
Relative Paths | Description | Use | Destination |
---|---|---|---|
**/pod5/*.pod5 | Raw (not yet basecalled) reads that passed the initial quality-control parameters in the local MinKNOW software. | Input for the basecalling step you take first. | Working directory on HCC/Swan that you will call in your script. |
barcode_alignment_*.tsv | Output stats for multiplexed/barcoded sequencing runs, summarized by barcode. | We rename the file with your library code (e.g., hdz1) as a prefix and remove the long string after 'alignment_' (see the example below this table) before folding it into our SampleInventory script for preparing the demultiplexing sample sheets. | These files must be properly renamed and then placed in the bioinformatics-stats repository under the subdirectory dataframes/barcodes/. |
**/dataframes/sample_sheet/*/*_sample_sheet.csv | Dorado-formatted sample sheets created in the SampleInventory workflow from the barcode_alignment tables. | After the SampleInventory workflow, you should have one sample sheet per sequencing run (with the run name/code as the file prefix) that Dorado can use to match sample/sequence names to barcodes (only necessary for multiplexed/barcoded libraries). | Working directory on HCC/Swan that you will call in your script. |
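For example, the rename for run hdz1 might look like the following; the long suffix is a stand-in for whatever string MinKNOW generated for your run:

```bash
# Hypothetical example: prefix with the library code and drop the long run ID
# that MinKNOW appends after "alignment_".
mv barcode_alignment_<long_run_id>.tsv hdz1_barcode_alignment.tsv
```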
If you see a step that uses the command `sbatch`, I am referencing a separate file with the extension .sh: a complete batch script that runs through the HCC's SLURM scheduler. You should transfer your version of that script to the working directory before running the `sbatch` command. Sometimes I also run shorter jobs as interactive jobs; you can read more about that in the HCC manual if you want to try it.
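For reference, a batch script is just a shell script whose `#SBATCH` header lines request resources from SLURM. The values below are illustrative placeholders, not the settings from any of my actual scripts:

```bash
#!/bin/bash
#SBATCH --job-name=example_step    # placeholder job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=16GB                 # illustrative resource request
#SBATCH --time=01:00:00
#SBATCH --output=example_step.%j.out

module load dorado                 # load whatever modules the step needs
# ...commands for the step go here...
```

You submit it from the working directory with `sbatch example_step.sh` and can monitor it with `squeue -u <username>`.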
I also use a custom language engine in this script that I named "terminal." If you see a chunk with {terminal, warning = FALSE} where you would normally see {r} at the top, then running that chunk only prints the code as a text string in this document. This makes it easy to copy and paste the code directly into the terminal panel of my RStudio window when I am running code through a remote server instead of my local R console. There are ways to set up RStudio to run code through multiple servers, but I find this the simplest way to switch back and forth while still keeping a record of the code I have used and any changes I have made to it.
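For example, a chunk written like the one below (the `squeue` call is just a placeholder command) renders as copy-ready text rather than being evaluated:

````
```{terminal, warning = FALSE}
squeue -u <username>
```
````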
I keep track of workflow parameters with the config package and a config.yml file. Below is my full-length config.yml. Scroll down to "methods_16s" to see the parameters I use for 16S microbiome sequences generated with ONT's rapid 16S kit. This lets us modify parameters in one central location that all other scripts reference, avoiding inconsistencies across bioinformatic stages and giving us one place to check when reporting the methods for manuscripts or presentations.
```yaml
default:
  setup: "setup/setup.R"
  conflicts: "setup/conflicts.R"
  functions: "setup/functions.R"
  packages: "setup/packages.R"
  inputs: "setup/inputs.R"
  knit_engines: "setup/knit_engines.R"
  fonts: "setup/fonts/FA6_Free-Solid-900.otf"
  tmp_tsv: "tmp/tmp_table.tsv"
  tmp_downloads: "tmp/downloads/"
  tmp_fetch: "tmp/fetch_references.txt"
  tmp_fasta3: "tmp/tmp3.fasta"
  tmp_fasta4: "tmp/tmp4.fasta"
  scripts:
    local:
      basecall: "batch_scripts/basecall_local.sh"
      trimWG: "batch_scripts/trimWG_local.sh"
  isolates:
    samplesets: "salci"
    minQC: 10
  bats:
    inventories:
      all_stages: "samples/inventories/bats/compilation_bats.tsv"
      collection: "samples/inventories/bats/samples_bats.csv"
      extraction: "samples/inventories/bats/extracts_bats.csv"
      libraries: "samples/inventories/bats/libraries_bats.csv"
      seqruns: "samples/inventories/bats/seqruns_bats.csv"
  loris:
    day1: "2023-10-26"
    last: "2024-10-25"
    sequencing:
      coverage: "visuals/loris_depth_summary.html"
      depth_plot: "visuals/loris_depth_hist.html"
    metadata:
      scripts: !expr list("metadata/loris/colors.R", "metadata/loris/metadata_key.R", "metadata/loris/nutrition.R", "metadata/loris/hdz_loris_log.R", "metadata/loris/diet_tables.R")
      bristol: "metadata/loris/bristols.tsv"
      studbook: "metadata/loris/subjects_loris.csv"
      summary: "metadata/loris/samples_metadata.tsv"
      key: "metadata/loris/metadata_key.R"
      factors: "metadata/loris/factors.R"
      foods: "metadata/loris/foods.tsv"
      proteins: "metadata/loris/proteins.tsv"
      fats: "metadata/loris/fats.tsv"
      CHOs: "metadata/loris/CHOs.tsv"
      Ash: "metadata/loris/Ash.tsv"
      vitamins: "metadata/loris/vitamins.tsv"
      reactable: "metadata/loris/loris_metadata_summary.html"
      sample_table:
        identifiers: "metadata/loris/identifier_key.tsv"
        main: "metadata/loris/sample_table.tsv"
        merged: "metadata/loris/sample_table_merged.tsv"
    inventories:
      all_stages: "dataframes/sample_inventories/compilation_loris.tsv"
      collection: "dataframes/sample_inventories/samples_loris.csv"
      extraction: "dataframes/sample_inventories/extracts_loris.csv"
      libraries: "dataframes/sample_inventories/libraries_loris.csv"
      seqruns: "dataframes/sample_inventories/seqruns_loris.csv"
    outputs_wf16s: "data/loris/outputs_wf16s/"
    barcodes_output: "dataframes/barcodes/loris/"
    read_alignments: "data/loris/read_alignments"
    taxa_reps:
      aligned: "data/loris/taxonomy/refseqs_aligned.fasta"
      tree: "data/loris/taxonomy/refseqs_tree.newick"
      table: "data/loris/taxonomy/tax_table.tsv"
    abundance_wf16s: "data/loris/wf16s_abundance/"
    microeco:
      dataset:
        main:
          keg: "microeco/loris/datasets/main/keg"
          njc: "microeco/loris/datasets/main/njc"
          fpt: "microeco/loris/datasets/main/fpt"
          tax: "microeco/loris/datasets/main"
        culi:
          keg: "microeco/loris/datasets/culi/keg"
          njc: "microeco/loris/datasets/culi/njc"
          fpt: "microeco/loris/datasets/culi/fpt"
          tax: "microeco/loris/datasets/culi"
        warb:
          keg: "microeco/loris/datasets/warble/keg"
          njc: "microeco/loris/datasets/warble/njc"
          fpt: "microeco/loris/datasets/warble/fpt"
          tax: "microeco/loris/datasets/warble"
      abund:
        main:
          keg: "microeco/loris/abundance/main/keg"
          fpt: "microeco/loris/abundance/main/fpt"
          njc: "microeco/loris/abundance/main/njc"
          tax: "microeco/loris/abundance/main"
        culi:
          keg: "microeco/loris/abundance/culi/keg"
          fpt: "microeco/loris/abundance/culi/fpt"
          njc: "microeco/loris/abundance/culi/njc"
          tax: "microeco/loris/abundance/culi"
        warb:
          keg: "microeco/loris/abundance/warble/keg"
          fpt: "microeco/loris/abundance/warble/fpt"
          njc: "microeco/loris/abundance/warble/njc"
          tax: "microeco/loris/abundance/warble"
      alpha:
        main: "microeco/loris/alphadiversity/main"
        culi: "microeco/loris/alphadiversity/culi"
        warb: "microeco/loris/alphadiversity/warble"
      beta:
        main:
          kegg: "microeco/loris/betadiversity/main/keg"
          fpt: "microeco/loris/betadiversity/main/fpt"
          njc: "microeco/loris/betadiversity/main/njc"
          tax: "microeco/loris/betadiversity/main"
        culi:
          kegg: "microeco/loris/betadiversity/culi/keg"
          fpt: "microeco/loris/betadiversity/culi/fpt"
          njc: "microeco/loris/betadiversity/culi/njc"
          tax: "microeco/loris/betadiversity/culi"
        warb:
          kegg: "microeco/loris/betadiversity/warble/keg"
          fpt: "microeco/loris/betadiversity/warble/fpt"
          njc: "microeco/loris/betadiversity/warble/njc"
          tax: "microeco/loris/betadiversity/warble"
      data:
        main:
          feature: "microeco/loris/datasets/main/feature_table.tsv"
          tree: "microeco/loris/datasets/main/phylo_tree.tre"
          fasta: "microeco/loris/datasets/main/rep_fasta.fasta"
          samples: "microeco/loris/datasets/main/sample_table.tsv"
          taxa: "microeco/loris/datasets/main/tax_table.tsv"
        culi:
          feature: "microeco/loris/datasets/culi/feature_table.tsv"
          tree: "microeco/loris/datasets/culi/phylo_tree.tre"
          fasta: "microeco/loris/datasets/culi/rep_fasta.fasta"
          samples: "microeco/loris/datasets/culi/sample_table.tsv"
          taxa: "microeco/loris/datasets/culi/tax_table.tsv"
        warb:
          feature: "microeco/loris/datasets/warb/feature_table.tsv"
          tree: "microeco/loris/datasets/warb/phylo_tree.tre"
          fasta: "microeco/loris/datasets/warb/rep_fasta.fasta"
          samples: "microeco/loris/datasets/warb/sample_table.tsv"
          taxa: "microeco/loris/datasets/warb/tax_table.tsv"
  sample_sheets:
    compilations:
      bats: "samples/sample_sheets/bats/nwr_combined_sample_sheet.csv"
      loris: "dataframes/sample_sheet/loris/hdz_combined_sample_sheet.csv"
      marmoset: "dataframes/sample_sheet/marmoset/cm_combined_sample_sheet.csv"
    nwr1: "samples/sample_sheets/bats/nwr1_sample_sheet.csv"
    hdz1: "samples/sample_sheets/loris/hdz1_sample_sheet.csv"
    hdz2: "samples/sample_sheets/loris/hdz2_sample_sheet.csv"
    hdz3: "samples/sample_sheets/loris/hdz3_sample_sheet.csv"
    hdz4: "samples/sample_sheets/loris/hdz4_sample_sheet.csv"
    hdz5: "samples/sample_sheets/loris/hdz5_sample_sheet.csv"
    hdz6: "samples/sample_sheets/loris/hdz6_sample_sheet.csv"
    hdz7: "samples/sample_sheets/loris/hdz7_sample_sheet.csv"
    hdz8: "samples/sample_sheets/loris/hdz8_sample_sheet.csv"
    hdz9: "samples/sample_sheets/loris/hdz9_sample_sheet.csv"
    hdz10: "samples/sample_sheets/loris/hdz10_sample_sheet.csv"
    hdz11: "samples/sample_sheets/loris/hdz11_sample_sheet.csv"
    hdz12: "samples/sample_sheets/loris/hdz12_sample_sheet.csv"
    hdz13: "samples/sample_sheets/loris/hdz13_sample_sheet.csv"
    hdz14: "samples/sample_sheets/loris/hdz14_sample_sheet.csv"
    hdz15: "samples/sample_sheets/loris/hdz15_sample_sheet.csv"
    hdz16: "samples/sample_sheets/loris/hdz16_sample_sheet.csv"
    hdz17: "samples/sample_sheets/loris/hdz17_sample_sheet.csv"
    hdz18: "samples/sample_sheets/loris/hdz18_sample_sheet.csv"
    cm001: "dataframes/sample_sheet/marmoset/cm001_sample_sheet.csv"
    cm002: "dataframes/sample_sheet/marmoset/cm002_sample_sheet.csv"
    cm003: "dataframes/sample_sheet/marmoset/cm003_sample_sheet.csv"
    cm004: "dataframes/sample_sheet/marmoset/cm004_sample_sheet.csv"
    cm005: "dataframes/sample_sheet/marmoset/cm005_sample_sheet.csv"
    cm006: "dataframes/sample_sheet/marmoset/cm006_sample_sheet.csv"
    cm007: "dataframes/sample_sheet/marmoset/cm007_sample_sheet.csv"
    cm008: "dataframes/sample_sheet/marmoset/cm008_sample_sheet.csv"
    cm009: "dataframes/sample_sheet/marmoset/cm009_sample_sheet.csv"
    cm010: "dataframes/sample_sheet/marmoset/cm010_sample_sheet.csv"
    cm011: "dataframes/sample_sheet/marmoset/cm011_sample_sheet.csv"
  barcode_alignments:
    compilations:
      loris: "samples/barcode_alignments/loris/hdz_combined_barcode_alignment.tsv"
      bats: "samples/barcode_alignments/bats/nwr_combined_barcode_alignment.tsv"
      marmoset: "../labwork/minion_data/barcode_alignments/marmoset/cm_combined_barcode_alignment.tsv"
    nwr1: "samples/barcode_alignments/bats/nwr1_barcode_alignment.tsv"
    hdz1: "samples/barcode_alignments/loris/hdz1_barcode_alignment.tsv"
    hdz2: "samples/barcode_alignments/loris/hdz2_barcode_alignment.tsv"
    hdz3: "samples/barcode_alignments/loris/hdz3_barcode_alignment.tsv"
    hdz4: "samples/barcode_alignments/loris/hdz4_barcode_alignment.tsv"
    hdz5: "samples/barcode_alignments/loris/hdz5_barcode_alignment.tsv"
    hdz6: "samples/barcode_alignments/loris/hdz6_barcode_alignment.tsv"
    hdz7: "samples/barcode_alignments/loris/hdz7_barcode_alignment.tsv"
    hdz8: "samples/barcode_alignments/loris/hdz8_barcode_alignment.tsv"
    hdz9: "samples/barcode_alignments/loris/hdz9_barcode_alignment.tsv"
    hdz10: "samples/barcode_alignments/loris/hdz10_barcode_alignment.tsv"
    hdz11: "samples/barcode_alignments/loris/hdz11_barcode_alignment.tsv"
    hdz12: "samples/barcode_alignments/loris/hdz12_barcode_alignment.tsv"
    hdz13: "samples/barcode_alignments/loris/hdz13_barcode_alignment.tsv"
    hdz14: "samples/barcode_alignments/loris/hdz14_barcode_alignment.tsv"
    hdz15: "samples/barcode_alignments/loris/hdz15_barcode_alignment.tsv"
    hdz16: "samples/barcode_alignments/loris/hdz16_barcode_alignment.tsv"
    hdz17: "samples/barcode_alignments/loris/hdz17_barcode_alignment.tsv"
    hdz18: "samples/barcode_alignments/loris/hdz18_barcode_alignment.tsv"
    cm001: "../labwork/minion_data/barcode_alignments/marmoset/cm001_barcode_alignment.tsv"
    cm002: "../labwork/minion_data/barcode_alignments/marmoset/cm002_barcode_alignment.tsv"
    cm003: "../labwork/minion_data/barcode_alignments/marmoset/cm003_barcode_alignment.tsv"
    cm004: "../labwork/minion_data/barcode_alignments/marmoset/cm004_barcode_alignment.tsv"
    cm005: "../labwork/minion_data/barcode_alignments/marmoset/cm005_barcode_alignment.tsv"
    cm006: "../labwork/minion_data/barcode_alignments/marmoset/cm006_barcode_alignment.tsv"
    cm007: "../labwork/minion_data/barcode_alignments/marmoset/cm007_barcode_alignment.tsv"
    cm008: "../labwork/minion_data/barcode_alignments/marmoset/cm008_barcode_alignment.tsv"
    cm009: "../labwork/minion_data/barcode_alignments/marmoset/cm009_barcode_alignment.tsv"
    cm010: "../labwork/minion_data/barcode_alignments/marmoset/cm010_barcode_alignment.tsv"
    cm011: "../labwork/minion_data/barcode_alignments/marmoset/cm011_barcode_alignment.tsv"
  abund_wf16s_files:
    hdz1: "data/loris/wf16s_abundance/hdz1_abundance_table_species.tsv"
    hdz2: "data/loris/wf16s_abundance/hdz2_abundance_table_species.tsv"
    hdz3: "data/loris/wf16s_abundance/hdz3_abundance_table_species.tsv"
    hdz4: "data/loris/wf16s_abundance/hdz4_abundance_table_species.tsv"
    hdz5: "data/loris/wf16s_abundance/hdz5_abundance_table_species.tsv"
    hdz6: "data/loris/wf16s_abundance/hdz6_abundance_table_species.tsv"
    hdz7: "data/loris/wf16s_abundance/hdz7_abundance_table_species.tsv"
    hdz8: "data/loris/wf16s_abundance/hdz8_abundance_table_species.tsv"
    hdz9: "data/loris/wf16s_abundance/hdz9_abundance_table_species.tsv"
    hdz10: "data/loris/wf16s_abundance/hdz10_abundance_table_species.tsv"
    hdz11: "data/loris/wf16s_abundance/hdz11_abundance_table_species.tsv"
    hdz12: "data/loris/wf16s_abundance/hdz12_abundance_table_species.tsv"
    hdz13: "data/loris/wf16s_abundance/hdz13_abundance_table_species.tsv"
    hdz14: "data/loris/wf16s_abundance/hdz14_abundance_table_species.tsv"
    hdz15: "data/loris/wf16s_abundance/hdz15_abundance_table_species.tsv"
    hdz16: "data/loris/wf16s_abundance/hdz16_abundance_table_species.tsv"
    hdz17: "data/loris/wf16s_abundance/hdz17_abundance_table_species.tsv"
    hdz18: "data/loris/wf16s_abundance/hdz18_abundance_table_species.tsv"
    cm001: "data/marmoset/wf16s_abundance/cm001_abundance_table_species.tsv"
    cm002: "data/marmoset/wf16s_abundance/cm002_abundance_table_species.tsv"
    cm003: "data/marmoset/wf16s_abundance/cm003_abundance_table_species.tsv"
    cm004: "data/marmoset/wf16s_abundance/cm004_abundance_table_species.tsv"
    cm005: "data/marmoset/wf16s_abundance/cm005_abundance_table_species.tsv"
    cm006: "data/marmoset/wf16s_abundance/cm006_abundance_table_species.tsv"
    cm007: "data/marmoset/wf16s_abundance/cm007_abundance_table_species.tsv"
    cm008: "data/marmoset/wf16s_abundance/cm008_abundance_table_species.tsv"
    cm009: "data/marmoset/wf16s_abundance/cm009_abundance_table_species.tsv"
    cm010: "data/marmoset/wf16s_abundance/cm010_abundance_table_species.tsv"
    cm011: "data/marmoset/wf16s_abundance/cm011_abundance_table_species.tsv"
  methods_16s:
    libprep_workflow: "'rapid16s'"
    dorado_model: "'dna_r10.4.1_e8.2_400bps_sup@v5.0.0'"
    min_length: 1000
    max_length: 2000
    min_qual: 7
    min_id: 85
    min_cov: 80
    kit_name: "'SQK-16S114-24'"
    tax_rank: "S"
    n_taxa_barplot: 12
    abund_threshold: 0
    loris:
      rarefy: 4500
      norm: "SRS"
      min_abund: 0.00001
      min_freq: 1
      include_lowest: TRUE
      unifrac: TRUE
      betadiv: "aitchison"
      alpha_pd: TRUE
      tax4fun_db: "Ref99NR"
    loris_rarefy: 4500
    keg_minID: 97
```
The basecalling step is the most memory-intensive stage of our bioinformatic pipelines. Most local machines cannot handle it, especially with the maximum-accuracy algorithm we use. High-Performance Computing (HPC) systems and remote clusters, like the Holland Computing Center's (HCC) remote server Swan, are powerful computers designed to handle tasks too demanding for personal computers. These systems allow us to connect remotely and use their resources to process large amounts of data quickly and efficiently. For steps like ONT basecalling, which require a lot of memory and processing power, HPC systems are essential to get the job done without overwhelming our own computers.
Any UNO/UNMC/UNL student affiliated with an established HCC research group can sign up for a free account to access the system. Follow the instructions here and enter “richlab” as your group. I will receive an email to approve your membership, and then your account will become active. You should look through the rest of the HCC’s manual to learn some of the basics before you begin using it. Begin by working through the following:
You can use any number of file-transfer tools to move directories and files from a local hard drive to your working directory on the HCC. I use Globus Connect Personal. Follow the installation and use instructions to transfer your local copy of this entire repository to a location within your personal work directory on the HCC. Then you can simply use the sync option each time you begin working to ensure all your relative paths and directories are available in both locations.
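If you prefer a command-line transfer for quick syncs, rsync over SSH also works. This is only a sketch: the login host, the /work path, and the destination directory name are my assumptions about the HCC layout, so check them against the HCC docs before using it.

```bash
# Sketch only: mirror this repository to your HCC work directory over SSH.
# <username>, the host name, and the /work path are placeholders/assumptions.
rsync -avu --progress ./ <username>@swan.unl.edu:/work/richlab/<username>/<repository>/
```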
If you set your working directory on Swan to match the working directory of this R project, then all the paths used here will also work in any scripts you run there.
You should have a basic understanding of which models Dorado provides for basecalling ONT reads; look over this page. I use the config package or parameters in the YAML header to track and source the models I am using. You will need to report details like this in the methods section of any paper produced from your results.
We will almost always choose the newest SUP model available for the 10.4.1 kit chemistry.
For some reason Dorado's automatic downloading and use of models does not seem to work from the GPU nodes on the HCC, so we will download a stable copy of our current model options. The downloaded models need to sit at the path below within your working directory so that the code I have written in other scripts can locate them automatically.
You should at least download the newest `sup` model available, but you may also download the newest `hac` and `fast` models if you would like. The code in other scripts will search this directory for whichever of these three models you specify at that time.
cd read_processing
module load dorado
dorado download --list
This will give you a list of the models available for download. Select the model with `sup` in the name and the highest version number after `@v`. At the time I am writing this, that is `dna_r10.4.1_e8.2_400bps_sup@v5.0.0`. Copy and paste that name below.
model="<pastehere>"
dorado download --model $model --models-directory dorado_models
We will create a local copy of the scripts needed for our current Nextflow pipelines, but first we have to open an interactive job, because the HCC does not permit Nextflow commands from the login node.
srun --partition=guest --nodes=1 --ntasks-per-node=1 --job-name=nextflow_download --mem=150GB --time=00:30:00 --cpus-per-task=8 --pty $SHELL
Below, replace `wf-bacterial-genomes` with the workflow name of your choice.
module load nextflow
nextflow run epi2me-labs/wf-bacterial-genomes --help
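Running the workflow with `--help` also pulls a local copy of the pipeline code. If you want to see where that copy lives, Nextflow caches pulled pipelines under your home directory (the exact layout can vary by Nextflow version):

```bash
# Nextflow caches pulled pipelines under ~/.nextflow/assets/<org>/<repo>.
ls ~/.nextflow/assets/epi2me-labs/wf-bacterial-genomes
```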
We need to use some modules/packages that the HCC does not pre-install for us on Swan. We will mirror those packages in environments managed by the package manager Anaconda. Once you create an environment and install the necessary packages, you can reopen that environment at any time, and the same group of packages will be available to you. We will create three environments to reuse for our quality control and data filtering steps: pycoQC, filter, and modkit. pycoQC and modkit are each named for their only package; filter actually uses the package chopper.
pycoQC
module load anaconda
conda create -n pycoQC python=3.6
conda activate pycoQC
mamba install pycoqc
conda deactivate
filter
conda create -n filter
conda activate filter
mamba install chopper
conda deactivate
modkit
conda create -n modkit
conda activate modkit
conda install nanoporetech::modkit
conda deactivate
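In later sessions you only need to reactivate an environment. As a quick sanity check after the installs above, something like this should work; the `-f`/`-o` call in the comment mirrors pycoQC's documented usage, with placeholder file names:

```bash
# Reopen a saved conda environment in a new session and confirm the tool runs.
module load anaconda
conda activate pycoQC
pycoQC --version                  # confirms the package is on the PATH
# Typical use later on (placeholder file names):
# pycoQC -f sequencing_summary.txt -o pycoQC_report.html
conda deactivate
```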
Now that you have all the necessary directory structures and virtual environments, you should be able to start your Remote Read Processing steps with one of the other remote processing scripts each time. Continue to one of those now to make use of the environments you just built.