This is the current recommended pipeline for processing full-length 16S reads and count data generated from the EPI2ME Labs wf-16s workflow.
Microbiome data analysis has rapidly evolved into a cornerstone of biological and ecological research, offering insights into how microbial communities influence everything from human health to environmental ecosystems. However, this type of analysis often involves multiple complex steps: data normalization, diversity calculations, community composition comparisons, and advanced visualizations.
For more information on some of the statistical tests I often use/recommend, see the tutorial in this directory called Data_Notes.
I am writing this script with a version of our lab’s pygmy loris microbiome data. If you are working on one of our other microbiome projects, it should be fairly simple to adapt the original .Rmd script you are reading to a different dataset, especially if you make use of the params settings in the YAML header at the top.
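For example, the YAML header declares its settings in a params block; a minimal sketch might look like the one below, where the seqrun value is a placeholder you would swap for the run you are analyzing (later code only assumes that params$seqrun names one of your library IDs):

params:
  seqrun: "hdz1"  # placeholder run/library ID; set to your own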
You should have already completed the SampleInventory and MetadataSetup workflows, which prepared formatted files that you can import here to begin connecting your outcome metrics to your independent variables.
You should also have used one of the Read Processing scripts to basecall, demultiplex, filter, and trim your 16S reads. The 16S script also walks you through use of the Nextflow wf-16s pipeline, which aligns those reads and assigns taxonomy using minimap2 and NCBI databases. While the pipeline produces summary statistics for those alignments, this script backtraces a bit to give us more flexibility with the raw data. Mainly, we want the ability to produce our own .fasta file of references gathered from GenBank for each of our OTUs, rather than taking the traditional route of ASVs generated by QIIME2 for short-read sequencing.
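As a preview of that GenBank step, here is a minimal sketch of how one might fetch reference sequences for a handful of accession numbers with the rentrez package. The accessions and output path below are hypothetical placeholders; in practice the accession list comes from the taxonomy assignments you will generate later in this workflow.

library(rentrez)  # NCBI E-utilities client; assumes it is installed

# Hypothetical placeholder accessions; in practice these come from the
# wf-16s taxonomy hits for each OTU.
accessions <- c("NR_074334.1", "NR_118997.1")

# Fetch the sequences in FASTA format from the NCBI nucleotide database
# and write them to a local reference file.
fasta_txt <- entrez_fetch(db = "nuccore", id = accessions, rettype = "fasta")
writeLines(fasta_txt, "refs/otu_references.fasta")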
source("metadata/loris/factors.R")
metadata <- read_tsv("metadata/loris/samples_metadata.tsv", show_col_types = FALSE) %>%
filter(steps_remaining == "sample extracted and sequenced") %>%
filter(!is.na(collection_date)) %>%
mutate(across(ends_with("date"), ~ymd(.)))
metadata
# Build a character vector of all sample identifiers, ordered by subject and
# collection date (samples with no assigned subject are dropped).
sample.list <- metadata %>%
  distinct(identifier, subject, collection_date) %>%
  filter(subject != "unknown") %>%
  arrange(subject, collection_date) %>%
  distinct(identifier) %>%
  map(\(x) as.list(x)) %>%          # one-column tibble -> list of identifiers
  list_flatten(name_spec = "")      # flatten and drop the column name

samples <- sample.list %>% unlist()
# The same identifier lists, restricted to each subject: first Culi...
samp.list.culi <- metadata %>%
  filter(subject == "culi") %>%
  distinct(identifier, collection_date) %>%
  arrange(collection_date) %>%
  distinct(identifier) %>%
  map(\(x) as.list(x)) %>%
  list_flatten(name_spec = "")

samples.culi <- samp.list.culi %>% unlist()

# ...and then Warble.
samp.list.warb <- metadata %>%
  filter(subject == "warble") %>%
  distinct(identifier, collection_date) %>%
  arrange(collection_date) %>%
  distinct(identifier) %>%
  map(\(x) as.list(x)) %>%
  list_flatten(name_spec = "")

samples.warb <- samp.list.warb %>% unlist()
# Group sample identifiers by sequencing library, in run-date order, so that
# each element of `libraries` is named by its library_id.
libraries <- metadata %>%
  arrange(seq_run_date) %>%
  select(library_id, identifier) %>%
  group_by(library_id) %>%
  group_map(~ {
    setNames(list(as.list(.x$identifier)), .y$library_id)
  }, .keep = TRUE) %>%
  flatten()

# Subset to the sequencing run(s) named by params$seqrun in the YAML header.
working_libraries <- libraries %>%
  keep_at(paste0(params$seqrun)) %>%
  list_c()
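If you want to confirm these objects came out as expected, a quick optional check (every name below was created in the chunks above):

length(samples)          # total number of sample identifiers
head(samples.culi)       # Culi's first identifiers, in collection-date order
head(samples.warb)       # the same for Warble
str(working_libraries)   # identifiers for the run named by params$seqrun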
You will need the following files from the output directory generated by running wf-16s at the end of the Read Processing script: