ONT sequencers measure electrical signals as strands of DNA pass through each nanopore. The MinKNOW software converts each signal into an estimated sequence of A's, G's, C's, and T's in real time, but those on-the-fly basecalls are relatively error-prone. It is standard practice, after the run has completed, to take the original signal data and re-basecall it with a slower, higher-accuracy algorithm.
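Purely as an illustration of what re-basecalling can look like (this is not necessarily the tool or the settings used in this workflow), here is a minimal sketch assuming ONT's Dorado basecaller and its highest-accuracy ("sup") model; the paths are placeholders.

```bash
# Illustrative sketch only: tool choice and paths are placeholders.
# "sup" selects Dorado's super-accuracy (slowest, most precise) model.
dorado basecaller sup /path/to/raw_signal_dir > rebasecalled_reads.bam
```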
If you see a step that uses the command `sbatch`, I am referencing a separate file with the extension .sh: a complete batch script that is submitted to the HCC's SLURM scheduler. You should transfer your version of that script to the local working directory before running the `sbatch` command. Sometimes I run shorter jobs interactively instead; you can read more about interactive jobs in the HCC manual if you want to try that.
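For reference, a batch script is just a shell script whose header lines request resources from SLURM. The job name, resource values, and contents below are illustrative placeholders, not the settings used by any script in this repository.

```bash
#!/bin/bash
#SBATCH --job-name=example_job         # name shown in the queue
#SBATCH --time=04:00:00                # wall-clock limit (hh:mm:ss)
#SBATCH --ntasks=1
#SBATCH --mem=16G                      # memory request
#SBATCH --output=example_job.%j.out    # %j is replaced with the job ID

# The commands for the job go below the #SBATCH header
# (any "module load ..." lines for required software would go here too).
echo "Running on $(hostname)"
```

You would submit a script like this from your working directory with `sbatch example_job.sh` and check on it with `squeue -u $USER`.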
I also use a custom language engine in these scripts that I named "terminal." If you see a code chunk with {terminal, warning = FALSE} written where you would usually see {r} at the top of the chunk, then running that chunk only prints the code as a text string in this document. This makes it easier for me to copy and paste the code directly into the Terminal pane of my RStudio window when I am running code through a remote server instead of my local R console. There are ways to set RStudio up to run code through multiple servers, but I find this the simplest way to switch back and forth while still keeping a record of the code I have used and any changes I have made to it.
I keep track of my workflow parameters and the paths to different directories and files using the config package and its configuration file. You will find my configuration file in the base directory of whichever repository I have written it for; it is always named config.yml. If you are struggling to find the path to a file that I reference in any script, every path listed in that config file gives its location relative to the base of this repository. Read more about how config works here.
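As a hedged illustration of how those entries get used, you can print any value from config.yml straight from the terminal with Rscript; the key name below is made up for the example and is not an actual entry in this repository's config.yml.

```bash
# Illustrative only: "example_reads_dir" is a made-up key name.
# Run this from the base of the repository so config::get() finds config.yml.
Rscript -e 'cat(config::get("example_reads_dir"), "\n")'
```

Inside an R script, the equivalent call is simply config::get("example_reads_dir").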
The basecalling step is the most resource-intensive stage of our bioinformatic pipelines, and most personal computers cannot handle it, especially with the highest-accuracy algorithm we use. High-Performance Computing (HPC) systems and remote clusters, like the Holland Computing Center's (HCC) cluster known as Swan, are powerful computers designed to handle tasks that are too demanding for personal machines. They allow us to connect remotely and use their resources to process large amounts of data quickly and efficiently. For steps like ONT basecalling, which require a lot of memory and processing power, HPC systems are essential to get the job done without overwhelming our own computers.
Any UNO/UNMC/UNL student affiliated with an established HCC research group can sign up for a free account to access the system. Follow the instructions here and enter "richlab" as your group. I will receive an email to approve your membership, and then your account will become active. Look through the rest of the HCC's manual to learn some of the basics before you start using the system. Begin by working through the following:
You can use any number of file transfer tools to move directories and files from a local hard drive to your working directory on the HCC. I use Globus Connect Personal. Follow the installation and use instructions to transfer your local copy of this entire repository to a location within your personal work directory on the HCC. Then you can simply use the sync option each time you begin working to ensure all your relative paths and directories are available in both locations.
If you set your working directory on Swan to match the working directory of this R project (the base of this repository), then all the paths used here will also work in any scripts you run there.
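For example, at the start of a session on Swan you can change into the synced copy of the repository before running anything. The path below is a placeholder for wherever you put it, and I am assuming $WORK points at your HCC work directory; if it does not, use the full /work/<group>/<username> path described in the HCC manual.

```bash
# Placeholder path: point this at wherever you synced the repository on Swan.
cd $WORK/path/to/this_repository
pwd   # confirm you are at the base of the repository
```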
The first time you process any reads on the HCC, you should work through the script linked below. You can download the R Markdown file to work from, or open the HTML version in a browser for a more readable tutorial. You only need to work through this script once; after that you should be able to start directly from one of the Read Processing scripts.