bioinformatics_stats

Bioinformatics and Statistical Analysis

This repository contains the scripts, tutorials, and templates for the Rich Lab’s mainstream bioinformatic workflows. Before you begin working with any data or samples, you should first make sure you understand the SampleInventory and MetadataSetup workflows.

To view the webpage version of this repository, click here: Rich Lab Bioinformatics/Stats
To view the base repository, click here: Rich Lab Bioinformatics/Stats

Overview

If you are working with any molecular samples in the lab, you should begin with the following workflows (in order):

  1. SampleInventory.Rmd & SampleInventory.html - a script to streamline and standardize the information we maintain about every sample’s process of purification and analysis for reliable downstream integration.
  2. MetadataSetup.Rmd & MetadataSetup.html - a tutorial that models some best practices for brainstorming, listing, and organizing key potential independent variables for a given study and matching a score for each variable to each sample in one dataframe.
  3. MinIONReadProcessing.Rmd & MinIONReadProcessing.html - a tutorial to use the HCC cluster to process the raw sequencing data (.pod5 files) that we get from the ONT MinION sequencer directly after any sequencing run.
    • Note: with the new Linux laptop we may be able to shift some of these scripts to local runs rather than using the high-performance nodes on the HCC. If this works, I will add a modified script to run the pipeline locally.
  4. Data_Notes.Rmd & Data_Notes.html - some notes, thoughts, and examples on the options we most often use for testing different hypotheses with the types of data we generate.

If you are working with microbiome data, then you should follow these tutorials with the MicroEco Setup:

  1. microbiome_new_data.Rmd & microbiome_new_data.html - tutorial for microbiome data cleaning and prep that uses the MicroEco package in R for downstream stats and visualization.
  2. ExploreResults_LorisMicrobiome.Rmd & ExploreResults_LorisMicrobiome.html - an in-progress tutorial on using microeco for exploratory analysis and summary stats with the loris microbiome data. Note: you must follow the MicroEcoDataPrep script to prepare datasets that will work with this pipeline.

The raw scripts, with chunks of code that you can run directly (assuming you download the entire repository with the necessary dependencies), are in .Rmd format. I also use the knitr package to create “prettier” (but read-only) .html versions of those files, which are easier to read and study on their own.

Once you look through these, you should work on your own MetadataSetup script for the hypotheses you are trying to test. You can download the tutorial as a template and edit it from there.

I will add more tutorials and guides later, including details on how to best use GitHub and RStudio to push and pull your own repositories to this site.

Dependencies

Description of files in parent directory (richlab_main/bioinformatics_stats/)

Main R Markdown (.Rmd) Files to Start From (each also available in knitted HTML format):

| R Markdown File | Knitted HTML Link | Purpose |
| --- | --- | --- |
| SampleInventory.Rmd | SampleInventory.html | Maintain records of all Rich Lab samples and export samplesheets for Dorado. |
| MetadataSetup.Rmd | MetadataSetup.html | Create predictor variables, code them, and then match to samples. |
| MinIONReadProcessing.Rmd | MinIONReadProcessing.html | Basecall, demultiplex, filter, clean, and organize raw ONT sequence data. |
| microbiome_new_data.Rmd | microbiome_new_data.html | Prepare aligned reads and other wf-16s outputs for analysis using MicroEco. |
| Data_Notes.Rmd | Data_Notes.html | Review basic statistical options available for some of our main datasets. |

Files not specific to R:

| Filename | Description |
| --- | --- |
| README.md | Text file (markdown format) description of the project. |
| config.yml | Directory paths and other parameters to ensure reproducible code pipelines. |
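
As a rough illustration of what a config.yml of this kind could look like, here is a minimal sketch. It is not the repository's actual file: every key name (`sampleset`, `paths`, and so on) is hypothetical, and only the directory and sampleset names are taken from the tables in this README. The top-level `default:` block follows the layout expected by the R config package, which is also an assumption about how the file is read.

```yaml
# Hypothetical sketch of config.yml; all key names are assumptions,
# not the contents of the real file in this repository.
default:
  # Active sampleset: loris, marmoset, bats, environmental, or isolates
  sampleset: "loris"
  # Paths relative to the repository root (directory names from this README)
  paths:
    data: "data/"
    dataframes: "dataframes/"
    metadata: "metadata/"
    microeco: "microeco/"
    setup: "setup/"

# Named configuration that overrides only the sampleset;
# with the R config package, everything else is inherited from "default".
marmoset:
  sampleset: "marmoset"
```

If the file were read with the R config package, `config::get("paths")` would return the list of paths for the active configuration, so scripts could build file paths from one shared source instead of hard-coding them.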

Subdirectories:

| Directory name | Purpose |
| --- | --- |
| data/ | Intermediate or raw-stage datasets in table form. Subdirectories organized by sampleset. |
| dataframes/ | Data tables produced and used by other Rmd scripts in this repository. |
| metadata/ | Data table files and R scripts to generate tibbles/dataframes with metadata. |
| microeco/ | Datasets and results produced and used directly by the microeco package. |
| setup/ | Modularized .R scripts with all parameters, functions, packages, and other dependencies. |

Samplesets Used to Organize and Categorize Data for Current Projects

Each of these samplesets may appear as a subdirectory within the directories listed above to organize the files for each set of projects. Below are the main samplesets in use.

| Sampleset shorthand | Description |
| --- | --- |
| loris | Pygmy loris genetic, microbial, and behavioral data collected with Henry Doorly Zoo. |
| marmoset | Gut microbiome data gathered from the UNO Research Colony for Shayda Azadmanesh’s thesis. |
| bats | Genetic and gut microbiome data gathered by collaborators at trapping sites across N. America. |
| environmental | Samples gathered from opportunistic environmental sources for Thomas Raad’s thesis. |
| isolates | Purified DNA from bacterial isolates grown in the Ayayeye lab for whole genome sequencing in the Rich Lab. |

Highlighted Summaries or Graphics

These links take you to full-page summary tables or graphics compiled for some of the in-progress analyses.

Helpful Manuals, Cheatsheets, and Tutorials

Here are some of the online resources for the packages or platforms that I use the most for different workflows and scripts in the lab.