Intro

This Workflow

This is the current recommended pipeline for processing full-length 16S reads and count data generated from the EPI2ME Labs wf-16s workflow.

Some Background

Microbiome data analysis has rapidly evolved into a cornerstone of biological and ecological research, offering insights into how microbial communities influence everything from human health to environmental ecosystems. However, this type of analysis often involves multiple complex steps: data normalization, diversity calculations, community composition comparisons, and advanced visualizations.

For more information on some of the statistical tests I often use/recommend, see the tutorial in this directory called Data_Notes.

The Data Used Here

I am writing this script with a version of our lab’s pygmy loris microbiome data. If you are working on one of our other microbiome projects, it should be fairly simple to adapt the original .Rmd script you are reading to a different dataset, especially if you make use of the params settings in the yaml header at the top.
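For reference, the params block in that yaml header looks something like the sketch below. Only seqrun is referenced directly in this script; the title and the other field values are placeholders you would adapt to your own project.

---
title: "Microbiome Data Processing: 16S Alignments"
params:
  sampleset: "loris"   # placeholder: which of our datasets this render points at
  seqrun: "run01"      # placeholder: the sequencing run / library_id to process
output: html_document
---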

Previous Scripts

You should have already completed the SampleInventory and MetadataSetup workflows, which prepared formatted files that you can import here to begin connecting your outcome metrics to your independent variables.

You also should have used one of the Read Processing scripts to basecall, demultiplex, filter, and trim your 16S reads. The 16S script also walks you through the use of the Nextflow wf-16s pipeline, which aligns those reads and assigns taxonomy using minimap2 and NCBI databases. While the pipeline produces summary statistics for those alignments, this script backtracks a bit to give us more flexibility with the raw data. Mainly, we want to be able to produce our own .fasta file of references gathered from GenBank for each of our OTUs, rather than taking the traditional route of ASVs generated by QIIME2 for short-read sequencing.


Workflow

Load Metadata

source("metadata/loris/factors.R")
metadata <- read_tsv("metadata/loris/samples_metadata.tsv", show_col_types = FALSE)  %>%
  filter(steps_remaining == "sample extracted and sequenced") %>%
  filter(!is.na(collection_date)) %>%
  mutate(across(ends_with("date"), ~ymd(.)))

metadata

Ordered Sample Lists for Subsetting

sample.list <- metadata %>% 
  distinct(identifier, subject, collection_date) %>%
  filter(subject != "unknown") %>%
  arrange(subject, collection_date) %>%
  distinct(identifier) %>%
  map(\(x) as.list(x)) %>%
  list_flatten(name_spec = "")

samples <- sample.list %>% unlist()

samp.list.culi <- metadata %>% 
  filter(subject == "culi") %>%
  distinct(identifier, collection_date) %>%
  arrange(collection_date) %>%
  distinct(identifier) %>%
  map(\(x) as.list(x)) %>%
  list_flatten(name_spec = "")

samples.culi       <- samp.list.culi %>% unlist()



samp.list.warb <- metadata %>% 
  filter(subject == "warble") %>%
  distinct(identifier, collection_date) %>%
  arrange(collection_date) %>%
  distinct(identifier) %>%
  map(\(x) as.list(x)) %>%
  list_flatten(name_spec = "")

samples.warb <- samp.list.warb %>% unlist()

libraries <- metadata %>%
  arrange(seq_run_date) %>%
  select(library_id, identifier) %>%
  group_by(library_id) %>%
  group_map(~ {
    setNames(list(as.list(.x$identifier)), .y$library_id)
  }, .keep = TRUE) %>%
  flatten()

working_libraries <- libraries %>%
  keep_at(paste0(params$seqrun)) %>%
  list_c()

Load 16S Data

You will need the following files from the output directory generated by running wf-16s at the end of the Read Processing script:

Species-Level Abundance Table

We are going to read every current abundance table produced by wf-16s and bind them into a single table. We really only use this to inventory the sample names and other identifiers as wf-16s formats them; we will calculate abundances ourselves after we clean, filter, and manipulate the data a bit more.

For this step, make sure the subdirectory identified in your config.yml file contains all of your abundance tables and nothing else. Each file should be named with the sequencing run id (library_id) followed by _abundance_table_species.tsv.
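The chunk below relies on two objects that my config setup defines elsewhere: seqruns (the run ids) and abund_wf16s_files (a named list of abundance-table paths keyed by run id). If you need to build them yourself, a minimal sketch along these lines should work; path$abundance_wf16s is a placeholder for whatever your config.yml names that subdirectory.

abund.files       <- list.files(path$abundance_wf16s,                   # placeholder config path
                                pattern    = "_abundance_table_species\\.tsv$",
                                full.names = TRUE)
abund_wf16s_files <- set_names(abund.files,
                               str_remove(basename(abund.files),
                                          "_abundance_table_species\\.tsv$"))  # key each path by its run id
seqruns           <- names(abund_wf16s_files)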

abundance <- map(seqruns, ~ {
  seqrun <- .x
  path   <- abund_wf16s_files[[paste0(seqrun)]]
  
  df <- read_tsv(path) %>%
        rename_with(~str_replace_all(., "\\.", "-")) %>%
        select(-c(starts_with("total"))) %>%
        mutate(across(where(is.character), ~str_squish(str_remove_all(., "'|\\[|\\]"))))
  
  df.list <- list(df) %>% set_names(seqrun)
  
  return(df.list)
}) %>% list_flatten(name_spec = "{inner}")

seqids <- imap(abundance, ~ {
  seqrun <- .y
  df     <- .x
  seqids <- names(select(df, -tax))
  
  as.list(seqids)
})

Read Alignment Tables

This step is pretty annoying, but until the developers close this gap in the workflow outputs, I use the following workaround to import the raw alignment data for each sample and process it directly here in R.

First, we will use the list of IDs we just generated from the abundance tables, together with the metadata table and the seqrun id from the yaml header, to create a table for renaming the csv files we download from the wf-16s report.

Generate Filenames

filenames <- seqids %>% 
  keep_at(params$seqrun) %>% 
  unlist() %>%
  enframe() %>%
  select(alias = value) %>%
  mutate(file_append = (row_number() - 1)) %>%
  mutate(old_name = if_else(file_append == 0,
                            "wf-metagenomics-alignment.csv",
                            str_glue("wf-metagenomics-alignment ({file_append}).csv")),
         new_name = str_glue("{alias}_wf-metagenomics-alignment.csv"),
         .keep    = "none")

write.table(filenames,  "tmp/tmp_table.tsv",
              sep       = "\t",
              quote     = FALSE,
              row.names = FALSE,
              col.names = FALSE)

Download all Individual Alignment Tables

Now comes the somewhat messy workaround…

  1. Make sure you have a completely clean temporary directory available for downloading.
  • I generally just empty out my Downloads folder each time before I do this and download there first.
  2. Go to the html workflow report in your browser window and scroll down to the Alignment Statistics section.
  • Click Export CSV. Then use the dropdown box to select the next sample and repeat the Export CSV process.
  • Keep doing this for all 24 samples from your multiplexed run.
  • This part is important: you must download each table sequentially, in the order it appears in the dropdown.
    • The code chunk below uses an ordered list of those sample IDs to rename the files in your downloads folder based on the order in which you downloaded them (e.g., wf-metagenomics-alignment (2).csv and wf-metagenomics-alignment (3).csv may become hdz-489-s391_wf-metagenomics-alignment.csv and hdz-488-s390_wf-metagenomics-alignment.csv).
  3. Manually transfer all your newly downloaded but not yet renamed files to the following subdirectory (found at global$tmp_downloads in my config.yml):
  • “../bioinformatics_stats/tmp/downloads/”
  • NOTE: MAKE SURE THIS FOLDER IS EMPTY FIRST
  4. Only after you have downloaded all 24 csv files and transferred them to that directory: run the code below from the Terminal tab. A quick sanity check you can run first is sketched just after this list.
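Before running the rename step, a quick sanity check like this sketch confirms the downloads folder holds exactly the number of alignment csv files you expect (global$tmp_downloads is the same config path referenced above):

# should equal the number of samples in your multiplexed run (e.g., 24)
length(list.files(global$tmp_downloads, pattern = "^wf-metagenomics-alignment.*\\.csv$"))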

Rename and Transfer Files from the Local Terminal

Again, only do this after you have those 24 new csv files (each beginning with wf-metagenomics-alignment) in the tmp_downloads directory and the filename table we created above (tmp_table.tsv) sitting one level up in the tmp directory.

  • The code will simultaneously rename all 24 files and transfer them to the subdirectory path$read_alignments.
  • The rest of the script below uses this filename system to import and merge the files, matching every aligned read to its sequence_id.
  • If it worked, all the files should disappear from the temp directory and you should see 24 new csv files in your data directory ending in _wf-metagenomics-alignment.csv.


filenames="../tmp_table.tsv"
target_dir="../../data/loris/read_alignments"
downloads="tmp/downloads/"

cd $downloads

while IFS=$'\t' read -r old_name new_name; do
    mv "$old_name" "$target_dir/$new_name"
done < "$filenames"

Read All Files In

Now we will read in those new tables, along with any others already in place, to update one comprehensive dataframe containing all read alignments for this dataset to date. This code assumes that all of your properly named and structured alignment tables sit together in a single subdirectory, and it uses the config file’s path to that subdirectory.

alignment.files.list <- list.files(path       = path$read_alignments, 
                                   pattern    = "_wf-metagenomics-alignment\\.csv$", 
                                   full.names = TRUE)
alignment.filenames  <- list.files(path       = path$read_alignments, 
                                   pattern    = "_wf-metagenomics-alignment\\.csv$", 
                                   full.names = FALSE)
alignment.files      <- lapply(alignment.files.list, read_csv, show_col_types = FALSE)
alias                <- str_remove(alignment.filenames, "_wf-metagenomics-alignment\\.csv$")

alignment.files <- Map(function(df, id) {
  if (nrow(df) > 0) {
    df$alias <- id
  } else {
    warning(paste("Data frame for", id, "is empty. Alias not assigned."))
  }
  return(df)
}, alignment.files, alias)

In the next step, note that I filter the reads using the minimum coverage value stored under methods_16s in my config file. This is an example of how the config file helps us avoid discrepancies in our parameters over the course of the study and gives us a centralized place to update those parameters, or to compile them for reporting and publications, at any time.
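If you are not using my config setup, methods_16s only needs to resolve to a named list holding that threshold. A minimal sketch, assuming you use the config package and have a methods_16s entry in config.yml (my own scripts may load this differently):

methods_16s <- config::get("methods_16s")  # e.g., a list containing min_cov
methods_16s$min_cov                        # the minimum % coverage applied in the filter below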

alignments.long <- bind_rows(alignment.files) %>%
  as_tibble() %>%
  rename_with(~str_replace_all(., "\\s", "_")) %>%
  mutate(across(where(is.character), ~str_squish(str_remove_all(., "'|\\[|\\]")))) %>% 
  select(ref,
         taxid,
         species,
         genus,
         family,
         order,
         class,
         phylum,
         superkingdom,
         identifier = alias,
         coverage,
         n_reads = number_of_reads) %>%
  left_join(select(metadata, identifier), by = join_by(identifier)) %>%
  filter(coverage >= methods_16s$min_cov & !is.na(identifier)) %>%
  group_by(ref,
           taxid,
           species,
           genus,
           family,
           order,
           class,
           phylum,
           superkingdom,
           identifier) %>%
  summarize(samp_cov     = mean(coverage),
            samp_n_reads = mean(n_reads)) %>%
  ungroup() %>%
  mutate(identifier = factor(identifier, levels = unique(samples))) %>%
  arrange(identifier) %>%
  arrange(superkingdom,
          phylum,
          class,
          order,
          family,
          genus,
          species)

Next, we will simplify this into a table with just one row per reference represented in our dataset. This will be important for fetching representative fasta files from GenBank a few steps below.

alignments.refs <- alignments.long %>% 
                        group_by(ref,
                                 taxid,
                                 species,
                                 genus,
                                 family,
                                 order,
                                 class,
                                 phylum,
                                 superkingdom) %>%
                        summarize(ref_n_reads   = round(sum(samp_n_reads),  digits = 2),
                                  ref_mean_cov  = round(mean(samp_cov),     digits = 2)) %>%
                        ungroup() %>% group_by(species,
                                               genus,
                                               family,
                                               order,
                                               class,
                                               phylum,
                                               superkingdom) %>%
                        arrange(species, desc(ref_n_reads), desc(ref_mean_cov)) %>%
                        mutate(ref_order       = row_number(),
                               tax_total_count = sum(ref_n_reads)) %>%
                        ungroup() %>%
                        filter(ref_order == 1) %>%
                        select(-ref_order)  %>%
                        arrange(superkingdom,
                                phylum,
                                class,
                                order,
                                family,
                                genus,
                                species)

Now, we will take the original long table and wrangle it into a structure that we can use for calculating relative abundances after we do some further quality control and normalization on the data.

alignments    <- alignments.long %>% 
                        group_by(superkingdom,
                                 phylum,
                                 class,
                                 order,
                                 family,
                                 genus,
                                 species,
                                 identifier) %>%
                          summarize(counts = sum(samp_n_reads)) %>%
                          ungroup() %>%
                          left_join(alignments.refs,
                    by = join_by(superkingdom,
                                 phylum,
                                 class,
                                 order,
                                 family,
                                 genus,
                                 species)) %>%
                    pivot_wider(names_from  = identifier,
                                values_from = counts,
                                values_fill = 0) %>%
                        arrange(superkingdom,
                                phylum,
                                class,
                                order,
                                family,
                                genus,
                                species) %>%
                    mutate(organism = as.character(str_glue("txid", "{taxid}"))) 

write_tsv(alignments, "microeco/loris/alignments.tsv")

save(alignments, file = "microeco/loris/alignments.RData")

Data Check

It is a good idea to look at our read counts across our dataset at this point so we can decide how best to normalize our data later.

depth <- alignments.long %>%
  filter(!is.na(identifier)) %>%
  select(identifier,
         samp_cov,
         samp_n_reads) %>%
  group_by(identifier) %>%
  summarize(depth         = round(sum(samp_n_reads), digits = 0),
            mean_coverage = round(mean(samp_cov), digits = 2)) %>%
  mutate(mean_coverage = mean_coverage/100) %>%
  ungroup() %>%
  left_join(metadata, by = join_by(identifier)) %>%
  distinct()

write_csv(depth, "data/loris/loris_seq_depth_16s.csv")

OTU Table

otu.table <- alignments %>% 
  select(organism, any_of(samples)) %>%
  column_to_rownames("organism")

otu.sample.list <- as.list(names(otu.table))


otu.table %>%
  rownames_to_column("organism") %>%
  write_tsv("microeco/loris/otu_table.tsv")

save(otu.table, file = "microeco/loris/otu_table.RData")

otu.table

Check Workflow

This is also a good point to check for discrepancies between tables, to see whether any data have been erroneously excluded and to notice any patterns among excluded samples. I will also check the metadata table for any duplicated sample identifiers, as duplicates will cause problems in the next step.

samples.filtered <- setdiff(samples, unlist(otu.sample.list))

excluded <- metadata %>% filter(identifier %in% samples.filtered)

excluded
samples.duplicated <- metadata %>% 
  filter(identifier %in% otu.sample.list) %>%
  select(-c(
    starts_with("seq_run"),
    starts_with("extract"),
    starts_with("library"),
    ends_with("_id")
    )) %>%
  distinct() %>%
  group_by(identifier) %>%
  summarize(count = n()) %>%
  filter(count > 1) %>%
  left_join(metadata, by = join_by(identifier))

samples.duplicated

Sample Table

Now I am going to refine the metadata table so that it keeps only the versions of each variable that I may want to use in my downstream analyses.

Most importantly, I need to update my metadata table so that I have one row per column in the OTU table.

sample.table <- metadata %>% 
  filter(identifier %in% otu.sample.list) %>%
  arrange(collection_day, subject) %>%
  select(identifier,
         starts_with("subject"),
         starts_with("collection"),
         starts_with("environment"),
         starts_with("repro"),
         starts_with("bristol"),
         starts_with("supplement"),
         starts_with("diet")) %>%
  column_to_rownames("identifier")

sample.table %>%
  rownames_to_column("identifier") %>%
  write_tsv("microeco/loris/sample_table.tsv")  


save(sample.table, file = "microeco/loris/sample_table.RData")

sample.table

Taxonomy Table

tax.table <- alignments %>% 
             column_to_rownames("organism") %>%
             select(superkingdom,
                     phylum,
                     class,
                     order,
                     family,
                     genus,
                     species) %>%
             arrange(superkingdom,
                     phylum,
                     class,
                     order,
                     family,
                     genus,
                     species) %>%
             rename_with(~str_to_title(.)) %>%
             rename(Kingdom = Superkingdom) %>%
             relocate(Kingdom) %>%
             distinct() %>%
             tidy_taxonomy()

tax.table %>%
  rownames_to_column("organism") %>%
  write_tsv("microeco/loris/taxonomy_table.tsv")

save(tax.table, file = "microeco/loris/tax_table.RData")

tax.table

We also need to maintain a working taxonomy file that we can update after future sequencing runs, so I will write a version of this table to a tsv file as well.

taxonomy.list <- alignments %>% 
                 select(organism, 
                        superkingdom,
                        phylum,
                        class,
                        order,
                        family,
                        genus,
                        species) %>% 
             arrange(superkingdom,
                     phylum,
                     class,
                     order,
                     family,
                     genus,
                     species) %>%
                 rename_with(~str_to_title(.)) %>%
             rename(Kingdom = Superkingdom) %>%
                 relocate(Organism, Kingdom)

write_tsv(taxonomy.list, path$taxa_reps$table)

taxonomy.list

Next Steps

Now you have the following files ready for use in one of the data analysis workflows using the microeco package:

  1. Sample Table
  2. OTU Table
  3. Taxonomy Table

You should proceed with the microbiome_references script to prepare the following files for a complete microeco dataset:

  1. Phylogenetic Tree
  2. FASTA with GenBank Reference Sequences for each row in the Taxonomy Table

Note: The three files prepared here are actually enough for many of the basic operations in microeco, so you could proceed directly if you just want to generate basic stats. Many options do depend on reference sequences and/or a phylogenetic tree, though, so I recommend running the phylogeny workflow first to expand your options for downstream analysis.
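For orientation, here is a minimal sketch (assuming microeco is installed) of how the three tables come together into a microtable object; the phylogenetic tree and representative fasta from the next workflow can be added to the same object later.

library(microeco)

dataset <- microtable$new(otu_table    = otu.table,
                          sample_table = sample.table,
                          tax_table    = tax.table)
dataset$tidy_dataset()   # trims the tables so their sample and taxon names agree
dataset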