Here are some of the current predictor variables to consider in this
dataset:
ind_vars <- tribble(
~var_name , ~var_descr , ~recorded_for, ~relevant_to, ~quantitative_version, ~qualitative_version, ~binary, ~category ,
"collection" , "date the sample was collected on" , "both" , "both" , "discrete" , "ordinal" , "n" , "temporal" ,
"subject" , "subject collected from" , "both" , "both" , "discrete" , "nominal" , "y" , "identity" ,
"ster_mgpml" , "steroid concentration administered in mg/ml" , "both" , "culi" , "continuous" , "ordinal" , "n" , "medications" ,
"ster_dose_ml", "steroid volume administered per day in ml" , "both" , "culi" , "continuous" , "ordinal" , "n" , "medications" ,
"ster_npday" , "steroid frequency per day as count" , "both" , "culi" , "discrete" , "ordinal" , "n" , "medications" ,
"ster_mgpday" , "steroid total mg per day administered" , "both" , "culi" , "continuous" , "ordinal" , "n" , "medications" ,
"diet_trial" , "diet trial phase (see other table for more info)" , "both" , "culi" , "discrete" , "nominal" , "n" , "nutrition" ,
"location" , "old enclosure (before transfer) or new enclosure" , "both" , "both" , "discrete" , "nominal" , "y" , "environment" ,
"med_type" , "medication type given (not steroid or probiotic)" , "both" , "culi" , "discrete" , "nominal" , "n" , "medications" ,
"med_name" , "medication name given (not steroid or probiotic)" , "both" , "culi" , "discrete" , "nominal" , "n" , "medications" ,
"mgpml" , "medication (not steroid or probiotic) concentration in mg/ml" , "both" , "culi" , "continuous" , "ordinal" , "n" , "medications" ,
"dose_ml" , "medication (not steroid or probiotic) volume in ml" , "both" , "culi" , "continuous" , "ordinal" , "n" , "medications" ,
"dose_npday" , "medication (not steroid or probiotic) dose per day as count" , "both" , "culi" , "discrete" , "ordinal" , "n" , "medications" ,
"mg_p_day" , "medication (not steroid or probiotic) dose in total mg per day" , "both" , "culi" , "continuous" , "ordinal" , "n" , "medications" ,
"warble_cycle", "is warble in estrus?" , "both" , "warble" , "discrete" , "nominal" , "y" , "reproduction" ,
"warble_preg" , "is warble pregnant?" , "both" , "warble" , "discrete" , "nominal" , "y" , "reproduction" ,
"access" , "are they able to access each other or separated by caging?" , "both" , "both" , "discrete" , "nominal" , "y" , "reproduction" ,
"breed" , "was a breeding event observed today?" , "both" , "both" , "discrete" , "nominal" , "y" , "reproduction" ,
"culi_trial" , "which diet trial phase is culi in?" , "both" , "culi" , "discrete" , "nominal" , "n" , "nutrition" ,
"bristol_min" , "minimum bristol fecal score recorded for the day" , "culi" , "culi" , "discrete" , "ordinal" , "n" , "symptom" ,
"bristol_max" , "maximum bristol fecal score recorded for the day" , "culi" , "culi" , "discrete" , "ordinal" , "n" , "symptom" ,
"bristol_mean", "mean bristol fecal score recorded for the day" , "culi" , "culi" , "discrete" , "ordinal" , "n" , "symptom" ,
"note_type" , "other note on the day - category" , "both" , "both" , "none" , "nominal" , "n" , "other" ,
"note_detail" , "other note on the day - detailed note" , "both" , "both" , "none" , "nominal" , "n" , "other"
)
ind_vars_gt <- ind_vars %>%
mutate(category = if_else(category == "temporal", "timing", category),
var_descr = str_to_sentence(var_descr)) %>%
arrange(category,
relevant_to,
quantitative_version,
qualitative_version,
binary,
recorded_for,
var_name
) %>%
gt(groupname_col = "category", rowname_col = "var_name") %>%
tab_spanner(label = "For which subject(s) is it:",
columns = c("recorded_for", "relevant_to")) %>%
tab_spanner(label = "Format when analyzed as:",
columns = c("quantitative_version", "qualitative_version")) %>%
cols_label(
var_descr = "Description",
recorded_for = "Recorded",
relevant_to = "Relevant",
quantitative_version = "Quantitative",
qualitative_version = "Qualitative",
binary = "Binary?"
) %>%
opt_all_caps(locations = "row_group") %>%
tab_header(title = "Summary of Basic Metadata Variables", subtitle = "Project Loris Microbiome") %>%
opt_align_table_header("left") %>%
cols_align("left") %>%
tab_style(
style = cell_text(transform = "capitalize"),
locations = cells_body(columns = !c("var_name", "var_descr"))
) %>%
tab_source_note(paste0("last updated on ", ymd(today()))) %>%
opt_stylize(style = 3, color = "cyan")
ind_vars_gt
Summary of Basic Metadata Variables | ||||||
Project Loris Microbiome | ||||||
Description |
For which subject(s) is it:
|
Format when analyzed as:
|
Binary? | |||
---|---|---|---|---|---|---|
Recorded | Relevant | Quantitative | Qualitative | |||
environment | ||||||
location | Old enclosure (before transfer) or new enclosure | both | both | discrete | nominal | y |
identity | ||||||
subject | Subject collected from | both | both | discrete | nominal | y |
medications | ||||||
dose_ml | Medication (not steroid or probiotic) volume in ml | both | culi | continuous | ordinal | n |
mg_p_day | Medication (not steroid or probiotic) dose in total mg per day | both | culi | continuous | ordinal | n |
mgpml | Medication (not steroid or probiotic) concentration in mg/ml | both | culi | continuous | ordinal | n |
ster_dose_ml | Steroid volume administered per day in ml | both | culi | continuous | ordinal | n |
ster_mgpday | Steroid total mg per day administered | both | culi | continuous | ordinal | n |
ster_mgpml | Steroid concentration administered in mg/ml | both | culi | continuous | ordinal | n |
med_name | Medication name given (not steroid or probiotic) | both | culi | discrete | nominal | n |
med_type | Medication type given (not steroid or probiotic) | both | culi | discrete | nominal | n |
dose_npday | Medication (not steroid or probiotic) dose per day as count | both | culi | discrete | ordinal | n |
ster_npday | Steroid frequency per day as count | both | culi | discrete | ordinal | n |
nutrition | ||||||
culi_trial | Which diet trial phase is culi in? | both | culi | discrete | nominal | n |
diet_trial | Diet trial phase (see other table for more info) | both | culi | discrete | nominal | n |
other | ||||||
note_detail | Other note on the day - detailed note | both | both | none | nominal | n |
note_type | Other note on the day - category | both | both | none | nominal | n |
reproduction | ||||||
access | Are they able to access each other or separated by caging? | both | both | discrete | nominal | y |
breed | Was a breeding event observed today? | both | both | discrete | nominal | y |
warble_cycle | Is warble in estrus? | both | warble | discrete | nominal | y |
warble_preg | Is warble pregnant? | both | warble | discrete | nominal | y |
symptom | ||||||
bristol_max | Maximum bristol fecal score recorded for the day | culi | culi | discrete | ordinal | n |
bristol_mean | Mean bristol fecal score recorded for the day | culi | culi | discrete | ordinal | n |
bristol_min | Minimum bristol fecal score recorded for the day | culi | culi | discrete | ordinal | n |
timing | ||||||
collection | Date the sample was collected on | both | both | discrete | ordinal | n |
last updated on 2025-01-08 |
gtsave(ind_vars_gt, "meta_variables_loris.png", paste0(global$summaries$metadata))
Here are some ideas for variables we could eventually extract from
the video behavior files:
Here are some of the dependent variables we can extract from the
results:
How we calculated microbial abundance: Remember that
we constructed sequencing libraries that amplified the entire 1500 bp
region of the 16S gene for any and all available microbial DNA present
in our DNA extracts. Then, during sequencing, each available template
was sequenced to produce an individual “read”. Most of the time we
sequenced pooled libraries with 23 samples and 1 negative control until
we used up all available pores on one Flongle Flow Cell. Then we binned
all the reads into folders for each sample and matched every read from a
sample to all available sequences in our NCBI reference database. If a
read matched a reference sequence to a minimum threshold, then we
assigned that taxonomic identity to the read. Then we counted the number
of reads assigned to each taxon per sample. This result became our raw
count data, which we then rarefied for
normalization across all samples.
There is a complex literature available on all the pros and cons of
options for when and how to normalize 16S microbiome data. Rarefaction
is still the most commonly used option, so that’s what we have done
here, especially because microeco makes it easy to automate the process.
We did, however, apply a modified version of
rarefaction that is growing in popularity.
What rarefaction does: Our count data for each
sample are confounded by variation in the total number of reads we
started with for that sample after demultiplexing our libraries. To
correct for this, we choose a threshold read count as our
rarefaction threshold. Then we only keep samples for
which we generated at least that number of reads, and we only keep that
number of reads for all samples with more than the threshold.
Which reads to keep/lose: The next issue is how to
select which reads to keep and ignore without introducing more bias to
our data. We used the microeco option SRS,
which performs an alternative rarefaction method by scaling with ranked
subsampling.
SRS Steps:
For now, we used 7,000 as our threshold. I did this based on some of
the early rarefaction curves coming out of the data, and because the
standard in traditional short read 16S data is often somewhere between
6K and 12K reads. Nanopore sequencing generates fewer relative reads,
especially when we multiplex on a flongle flow cell, so we could
reconsider using a lower rarefaction threshold once we have more data
and look at the trends again. The trick is to balance the tradeoff
between not losing too much detail in our low-abundance taxa by choosing
too low of a read count and not losing too much detail from our sample
sizes by choosing too high of a read count.
We already filtered reads during our earlier pipelines based on their
length and quality, but then we filtered taxa again to minimize the
confounding impacts of taxa with very few representative reads in our
data.
This is fairly standard practice, as low abundance taxa across all
samples are more likely to appear as a result of false taxonomic
assignment, especially with increasing sample sizes or sequencing
depths.
I started with very generous filtering thresholds as we were looking
for subtle patterns with a very small subset of our total samples, but
over time I have incrementally adjusted this to get a clearer picture of
the key patterns in our data.
Right now we are filtering out:
Analyzing the entire 16S gene through long read sequencing affords us
a finer taxonomic resolution than traditional analysis of a
hypervariable region of the 16S gene through short read sequencing. That
means we can safely analyze our abundances to as fine as the Species
level.
Still, sometimes trends at the higher taxonomic scales are still more
informative, depending on the question. The microeco package makes it
pretty easy to merge abundance data to a given higher taxonomic level
after we construct our microbiome dataset object and normalize/filter
the reads and taxa. That is why I generally just compute our basic
abundance summary statistics at each of the following levels:
After performing some other calculations like correlation
coefficients, I merged those taxonomic scales back into our summary
tables/figures for ease of comparison. We should still take care to keep
in mind that the stats we are looking at were computed individually at
each taxonomic level.
Alpha diversity measures the overall taxonomic diversity within each
sample. We have already generated the following Alpha Diversity
metrics:
Abundance-based Coverage Estimator
Beta diversity measures the dissimilarity in composition between
samples.
Before we map betadiversity patterns using the methods in the next
section, we have to calculate a distance matrix that provides the
dissimilarity index for each pair of samples. Within MicroEco we can use
the following metrics to calculate the matrix:
This is one of the most commonly used dissimilarity indices in
ecological studies. It calculates the dissimilarity between two samples
based on the abundance of shared species (or taxa).
Takes into account both the presence/absence and abundance of
taxa.
0 (identical communities) to 1 (completely different
communities).
Quantifying differences in community composition when you are
interested in both the presence and relative abundance of taxa.
This index calculates dissimilarity based solely on the presence or
absence of taxa between two samples.
Measures community dissimilarity based on species (taxa)
presence/absence.
0 (identical communities in terms of presence/absence) to 1 (no
shared species between communities).
Cases where you are only interested in the presence or absence of
taxa, and not their abundance.
UniFrac distances are phylogenetic metrics that measure the
dissimilarity between microbial communities based on the evolutionary
distances between taxa.
Phylogenetic diversity; unweighted focuses on presence/absence, while
weighted takes abundance into account.
0 (identical communities with no evolutionary difference) to 1
(completely different communities).
When evolutionary relationships between microbial taxa are important
in determining community dissimilarity.
Canberra distance is a dissimilarity measure that gives more weight
to differences in low-abundance taxa than high-abundance taxa. It is
sensitive to small differences in the relative abundance of
species.
More sensitive to the abundance of rare taxa.
0 (identical communities) to 1 (completely different
communities).
When small differences in the abundance of rare taxa are of
particular interest.
Euclidean distance is a straight-line measure between two points in a
multidimensional space (here, representing microbial abundances).
Measures absolute differences in taxa abundances between
samples.
0 (identical communities) to a positive value, depending on the
magnitude of difference.
Basic comparison of absolute differences in community composition,
though it’s less commonly used in ecological contexts due to its
sensitivity to large differences.
Aitchison distance is used specifically for compositional data, where
relative abundances matter more than absolute counts. It is based on
log-ratio transformations.
Appropriate for comparing compositional data (relative abundances)
while avoiding issues related to the “compositional nature” of
microbiome data (e.g., total sum constraint).
Positive values, where larger values indicate greater
dissimilarity.
When analyzing microbial community compositions, especially when
addressing issues like compositionality in microbiome data.
This index is similar to the Jaccard index but gives more weight to
shared species (taxa). It measures the proportion of shared taxa between
two samples.
Presence/absence data, but more sensitive to shared taxa than
Jaccard.
0 (completely similar) to 1 (completely dissimilar).
Assessing beta diversity based on the presence/absence of taxa,
giving more weight to shared taxa.
Chao distance is an abundance-based metric that adjusts for unseen
species by estimating how many rare or unseen taxa are likely shared
between samples.
Accounts for unseen or rare taxa, improving the comparison of samples
with low-abundance taxa.
0 (identical communities) to 1 (completely different
communities).
When dealing with datasets where rare or unobserved taxa might skew
results.
Gower distance is a flexible metric that can handle different types
of data (e.g., categorical and continuous variables). For microbiome
data, it’s often used for combining different types of diversity
measures.
Measures similarity between samples while accounting for multiple
types of data or weighting schemes.
0 (complete similarity) to 1 (complete dissimilarity).
When you need a flexible approach to handling multiple types of data
within the same matrix.
Ordination methods are dimensionality reduction
techniques that reduce complex data (e.g., community composition across
multiple samples) into fewer dimensions for visualization and
interpretation. These techniques can highlight patterns of similarity or
difference between samples.
Group distance metrics calculate and compare the distances between groups of samples. This allows for the quantification of the beta diversity between predefined groups (e.g., treatment vs. control, or different geographic locations).
Clustering methods group samples together based on
their similarities in microbial composition, forming clusters that
represent similar microbial communities.
MANOVA is a statistical test that evaluates whether there are differences in the means of several dependent variables (e.g., microbial taxa) between different groups (e.g., treatment groups, environmental conditions).
Inferential statistics and mixed models are invaluable for analyzing microbiome datasets, particularly when dealing with the inherent complexities of microbial abundance data. These methods help us account for multiple sources of variability, hierarchical data structures, and overdispersion.
GLMMs are an extension of generalized linear models (GLMs), allowing for the inclusion of random effects to account for hierarchical or nested data structures.
A GLMM has the form:
\(g(E(y)) = X\beta + Zb\)
A GLMM can handle situations like:
LMEs are ideal when your response variable is continuous and approximately normally distributed. They are helpful when analyzing diversity indices or microbial abundances that have already been transformed.
An LME has the form:
\(y = X\beta + Zb + \epsilon\)
For example, suppose you want to see how Shannon diversity changes
with a new diet:
- The fixed effect could be the diet phase. - Random effects could
account for repeated measures within individuals or cages.
Microbiome data often have a high proportion of zeros due to undetected or absent taxa in samples. Zero-inflated models explicitly account for these excess zeros.
Imagine you’re modeling the abundance of a rare taxon:
- Part of the model handles whether the taxon is present or absent
(binary). - The other part models its abundance if present (e.g., using
a Poisson distribution).
They are particularly valuable for rare taxa that appear in only some samples but may still have biological importance.
Beta regression is designed for modeling response variables that are proportions or percentages (e.g., relative abundances).
If you want to analyze the relative abundance of a genus across
different treatment groups:
- Beta regression models the proportion of that genus in each sample,
accounting for group differences.
It’s well-suited for microbiome data because it handles proportional values (0 to 1) and avoids the pitfalls of linear regression on proportions.
model_selection <- tibble(
Data_Type = c(
"Raw counts",
"Presence/absence",
"Diversity indices",
"Relative abundances",
"Zero-heavy data"
),
Example = c(
"Abundance of specific taxa",
"Presence of rare taxa in samples",
"Shannon or Simpson diversity",
"Proportion of taxa across groups",
"Rare taxa abundances"
), Recommended_Model = c(
"GLMM with Poisson or Negative Binomial",
"GLMM with logistic regression",
"LME",
"Beta Regression",
"Zero-Inflated Model"
)
) %>%
gt(rowname_col = "Data_Type") %>%
cols_label(
Recommended_Model = "Recommended Model"
) %>%
tab_stubhead("Data Type") %>%
opt_stylize(style = 3, color = "pink")
model_selection
Data Type | Example | Recommended Model |
---|---|---|
Raw counts | Abundance of specific taxa | GLMM with Poisson or Negative Binomial |
Presence/absence | Presence of rare taxa in samples | GLMM with logistic regression |
Diversity indices | Shannon or Simpson diversity | LME |
Relative abundances | Proportion of taxa across groups | Beta Regression |
Zero-heavy data | Rare taxa abundances | Zero-Inflated Model |