Intro

This Workflow

This workflow takes an excerpt from the microbiome_new_data script with instructions for constructing a representative phylogenetic tree from your 16S sequencing results and expands on that to increase the flexibility of your phylogenetic tree data for downstream processing.

Phylo Trees and ONT Data

Some of the standard microbiome profiling metrics rely on on an OTU approach developed for the more traditional short-read sequences produced by Illumina and other nextgen platforms. Constructing representative sequences for each OTU enables phylogenetic analyses that can be particularly useful in visualizations.

ONT Reads are generally still too messy and long to smoothly produce consensus sequences for each taxon. I use a workaround to enable us to still generate some phylogenetic estimates and graphics, though we will really be visualizing the relationships between the reference sequences for each taxon that our reads align to. The first step of this will use Entrez Direct to source a reference fasta for each of the taxids from our minimap2 results.

The Data Used Here

I am writing this script with a version of our lab’s pygmy loris microbiome data. If you are working on one of our other microbiome projects, it should be fairly simple to adapt the original .Rmd script you are reading to a different dataset, especially if you make use of the params settings in the yaml header at the top.

The Workflow

Fetch Reference FASTA Sequences

Taxonomy List

First, we will read in our raw alignment data gathered from the wf-16s output.

alignments <- read_tsv("microeco/loris/alignments.tsv")
refseqs <- distinct(alignments, ref, taxid)
write_tsv(refseqs, "microeco/loris/fetch_references.txt")

Transfer the fetch_references.txt file from your local directory to the proper directory on Swan and then run the script below.

Tip

I create a mirror of my repository/directory on Swan so that I can just keep updating repositories as I move between swan and my local R consol.


Fetch New References Using Entrez Direct

Now you need to switch to the terminal and log into the HCC. We will use the entrez-direct package to interface with NCBI’s databases adn fetch our reference sequences. If you have not used the HCC before, then refer to the readprocessing_multiplex_16s script for more details.

Note: I recently discovered the R package taxreturn which looks like an appealing alternative to do this straight from the R console, so feel free to experiment with it if you are feeling adventurous.

Batch Script

#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --job-name=fetch_refs
#SBATCH --error=/work/richlab/aliciarich/bioinformatics_stats/logs/fetch_refs_%A_%a.err
#SBATCH --output=/work/richlab/aliciarich/bioinformatics_stats/logs/fetch_refs_%A_%a.out
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=100GB
#SBATCH --partition=guest

module purge
module load entrez-direct
module load seqkit 

uid_file="/work/richlab/aliciarich/bioinformatics_stats/microeco/loris/fetch_references.txt"
accessions_file="/work/richlab/aliciarich/bioinformatics_stats/microeco/loris/accessions.txt"
fasta_file="/work/richlab/aliciarich/bioinformatics_stats/microeco/loris/refs_unaligned.fasta"


> "$fasta_file"

cut -f1 "$uid_file" > "$accessions_file"

if efetch -db nuccore -input "$accessions_file" -format fasta -email "aliciarich@unomaha.edu" > "$fasta_file"; then
    echo "Sequences fetched successfully."
else
    echo "Error fetching sequences." >&2
    exit 1
fi

Submit Batch Script

cd "/work/richlab/aliciarich/bioinformatics_stats/batch_scripts"
sbatch "fetch_refs.sh"

Once the script finishes running, transfer the refs_unaligned.fasta file from the directory on Swan to the local matching directory to run the next step in R.


Relabel Sequences

refs_unaligned <- treeio::read.fasta("microeco/loris/refs_unaligned.fasta")

refs_labels <- enframe(labels(refs_unaligned), name = NULL, value = "label") %>%
  mutate(accession = str_extract(label, "^\\w{2}_\\d+\\.\\d(?=\\s.+)")) %>%
  left_join(select(alignments, superkingdom:taxid), by = join_by(accession == ref)) %>%
  mutate(taxid = as.character(str_glue("txid{taxid}"))) %>%
  arrange(superkingdom,
          phylum,
          class,
          order,
          family,
          genus,
          species,
          accession) %>%
  select(
    superkingdom:species,
    taxid,
    accession,
    label
  )
write_tsv(refs_labels, "microeco/loris/reference_key.tsv")
refs_rename <- refs_labels %>%
  select(label, taxid)

rename.fasta("microeco/loris/refs_unaligned.fasta", refs_rename, "microeco/loris/refs_renamed.fasta")
microeco/loris/refs_renamed.fasta has been saved to  /Users/aliciamrich/GitRepos/richlab_main/bioinformatics_stats 

Align Sequences and Build ML Tree

Batch Script

#!/bin/bash
#SBATCH --job-name=mafft_iqtree
#SBATCH --error=/work/richlab/aliciarich/bioinformatics_stats/logs/mafft_iqtree_%A.err
#SBATCH --output=/work/richlab/aliciarich/bioinformatics_stats/logs/mafft_iqtree_%A.out
#SBATCH --partition=guest
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8         
#SBATCH --mem=64G                 
#SBATCH --time=6:00:00            

module purge
module load mafft/7.526
module load iqtree/2.3


input_fasta="/work/richlab/aliciarich/bioinformatics_stats/microeco/loris/refs_renamed.fasta"
aligned_fasta="/work/richlab/aliciarich/bioinformatics_stats/microeco/loris/refs_aligned_mafft.fasta"
out_prefix="/work/richlab/aliciarich/bioinformatics_stats/microeco/loris/refs_tree"

mafft \
  --thread 8 \
  --retree 2 \
  --maxiterate 1000 \
  "$input_fasta" > "$aligned_fasta"
  

iqtree -s $aligned_fasta \
        -m MFP+MERGE \
        -B 1000 \
        -T $SLURM_CPUS_PER_TASK \
        -pre "$out_prefix" \
        -redo
  

Submit Batch Script

cd "/work/richlab/aliciarich/bioinformatics_stats/batch_scripts"
sbatch "mafft_iqtree.sh"

Once the script finishes running, transfer the refs_aligned_mafft.fasta and refs_tree.treefile files from the directory on Swan to the local matching directory to run the next step in R.

Note the Final Model Used

I also like to extract the bottom of the log file produced to keep this note on the alignment model that the MAFFT algorithm selected. It just makes it easier for copying and pasting into manuscripts later.

alg=A, model=DNA200 (2), 1.53 (4.59), -0.00 (-0.00), noshift, amax=0.0
8 thread(s)

Strategy:
 FFT-NS-i (Accurate but slow)
 Iterative refinement method (max. 16 iterations)

Next Steps

Proceed to the microbiome_preprocessing tutorial to merge these with your other microbiome results before cleaning and filtering your dataset.

---
title: "Working with 16S Data and Reference Sequences"
author: "Alicia M. Rich, Ph.D."
date: "`r Sys.Date()`"
output:
  html_document:
    theme:
      bslib: true
    toc: true
    toc_depth: 3
    df_print: paged
    css: journal.css
    code_download: true
  
---

```{r setup, include=FALSE}
library(conflicted)
library(tidyverse)
library(bslib)
library(htmltools)
library(microeco)
library(seqinr)
library(ape)
library(gtExtras)
library(ggtree)
library(treedataverse)
library(phylotools)
library(TreeTools)


source("setup/knit_engines_simple.R")
source("setup/conflicted.R")

```

# Intro

## This Workflow

This workflow takes an excerpt from the `microbiome_new_data` script with instructions for constructing a representative phylogenetic tree from your 16S sequencing results and expands on that to increase the flexibility of your phylogenetic tree data for downstream processing. 

## Phylo Trees and ONT Data

Some of the standard microbiome profiling metrics rely on on an OTU approach developed for the more traditional short-read sequences produced by Illumina and other nextgen platforms. Constructing representative sequences for each OTU enables phylogenetic analyses that can be particularly useful in visualizations.  
  
ONT Reads are generally still too messy and long to smoothly produce consensus sequences for each taxon. I use a workaround to enable us to still generate some phylogenetic estimates and graphics, though we will really be visualizing the relationships between the reference sequences for each taxon that our reads align to. The first step of this will use Entrez Direct to source a reference fasta for each of the taxids from our minimap2 results.  


## The Data Used Here

I am writing this script with a version of our lab's pygmy loris microbiome data. If you are working on one of our other microbiome projects, it should be fairly simple to adapt the original .Rmd script you are reading to a different dataset, especially if you make use of the params settings in the yaml header at the top.

## Recommended Reading

For a helpful guide, I recommend you look at this online book: [Data Integration, Manipulation and Visualization of Phylogenetic Trees](https://yulab-smu.top/treedata-book/index.html).

# The Workflow

## Fetch Reference FASTA Sequences

### Taxonomy List

First, we will read in our raw alignment data gathered from the wf-16s output.

```{r}
alignments <- read_tsv("microeco/loris/alignments.tsv")
refseqs <- distinct(alignments, ref, taxid)
write_tsv(refseqs, "microeco/loris/fetch_references.txt")
```


Transfer the fetch_references.txt file from your local directory to the proper directory on Swan and then run the script below.  
  
<details>
<summary>***Tip***</summary>
<p>I create a mirror of my repository/directory on Swan so that I can just keep updating repositories as I move between swan and my local R consol.</p>
</details>

---

## Fetch New References Using Entrez Direct

Now you need to switch to the terminal and log into the HCC. We will use the entrez-direct package to interface with NCBI's databases adn fetch our reference sequences. If you have not used the HCC before, then refer to the [readprocessing_multiplex_16s](https://rich-molecular-health-lab.github.io/tutorials/readprocessing_multiplex_16s.html) script for more details.

>*Note: I recently discovered the R package [`taxreturn`](https://alexpiper.github.io/taxreturn/index.html) which looks like an appealing alternative to do this straight from the R console, so feel free to experiment with it if you are feeling adventurous.*

### Batch Script

```{terminal, warning = FALSE, echo = FALSE}
#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --job-name=fetch_refs
#SBATCH --error=/work/richlab/aliciarich/bioinformatics_stats/logs/fetch_refs_%A_%a.err
#SBATCH --output=/work/richlab/aliciarich/bioinformatics_stats/logs/fetch_refs_%A_%a.out
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=100GB
#SBATCH --partition=guest

module purge
module load entrez-direct
module load seqkit 

uid_file="/work/richlab/aliciarich/bioinformatics_stats/microeco/loris/fetch_references.txt"
accessions_file="/work/richlab/aliciarich/bioinformatics_stats/microeco/loris/accessions.txt"
fasta_file="/work/richlab/aliciarich/bioinformatics_stats/microeco/loris/refs_unaligned.fasta"


> "$fasta_file"

cut -f1 "$uid_file" > "$accessions_file"

if efetch -db nuccore -input "$accessions_file" -format fasta -email "aliciarich@unomaha.edu" > "$fasta_file"; then
    echo "Sequences fetched successfully."
else
    echo "Error fetching sequences." >&2
    exit 1
fi

```

### Submit Batch Script

```{terminal, warning = FALSE, echo = FALSE}
cd "/work/richlab/aliciarich/bioinformatics_stats/batch_scripts"
sbatch "fetch_refs.sh"
```

Once the script finishes running, transfer the `refs_unaligned.fasta` file from the directory on Swan to the local matching directory to run the next step in R.

---

## Relabel Sequences

```{r}
refs_unaligned <- treeio::read.fasta("microeco/loris/refs_unaligned.fasta")

refs_labels <- enframe(labels(refs_unaligned), name = NULL, value = "label") %>%
  mutate(accession = str_extract(label, "^\\w{2}_\\d+\\.\\d(?=\\s.+)")) %>%
  left_join(select(alignments, superkingdom:taxid), by = join_by(accession == ref)) %>%
  mutate(taxid = as.character(str_glue("txid{taxid}"))) %>%
  arrange(superkingdom,
          phylum,
          class,
          order,
          family,
          genus,
          species,
          accession) %>%
  select(
    superkingdom:species,
    taxid,
    accession,
    label
  )
write_tsv(refs_labels, "microeco/loris/reference_key.tsv")
```

```{r}
refs_rename <- refs_labels %>%
  select(label, taxid)

rename.fasta("microeco/loris/refs_unaligned.fasta", refs_rename, "microeco/loris/refs_renamed.fasta")
```

---

## Align Sequences and Build ML Tree

### Batch Script

```{terminal, warning = FALSE, echo = FALSE}
#!/bin/bash
#SBATCH --job-name=mafft_iqtree
#SBATCH --error=/work/richlab/aliciarich/bioinformatics_stats/logs/mafft_iqtree_%A.err
#SBATCH --output=/work/richlab/aliciarich/bioinformatics_stats/logs/mafft_iqtree_%A.out
#SBATCH --partition=guest
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8         
#SBATCH --mem=64G                 
#SBATCH --time=6:00:00            

module purge
module load mafft/7.526
module load iqtree/2.3


input_fasta="/work/richlab/aliciarich/bioinformatics_stats/microeco/loris/refs_renamed.fasta"
aligned_fasta="/work/richlab/aliciarich/bioinformatics_stats/microeco/loris/refs_aligned_mafft.fasta"
out_prefix="/work/richlab/aliciarich/bioinformatics_stats/microeco/loris/refs_tree"

mafft \
  --thread 8 \
  --retree 2 \
  --maxiterate 1000 \
  "$input_fasta" > "$aligned_fasta"
  

iqtree -s $aligned_fasta \
        -m MFP+MERGE \
        -B 1000 \
        -T $SLURM_CPUS_PER_TASK \
        -pre "$out_prefix" \
        -redo
  
```

### Submit Batch Script

```{terminal, warning = FALSE, echo = FALSE}
cd "/work/richlab/aliciarich/bioinformatics_stats/batch_scripts"
sbatch "mafft_iqtree.sh"
```

Once the script finishes running, transfer the `refs_aligned_mafft.fasta` and `refs_tree.treefile` files from the directory on Swan to the local matching directory to run the next step in R.

#### Note the Final Model Used

I also like to extract the bottom of the log file produced to keep this note on the alignment model that the MAFFT algorithm selected. It just makes it easier for copying and pasting into manuscripts later.

```
alg=A, model=DNA200 (2), 1.53 (4.59), -0.00 (-0.00), noshift, amax=0.0
8 thread(s)

Strategy:
 FFT-NS-i (Accurate but slow)
 Iterative refinement method (max. 16 iterations)
```

---

# Next Steps

Proceed to the [`microbiome_preprocessing` tutorial](https://rich-molecular-health-lab.github.io/tutorials/microbiome_preprocessing.html) to merge these with your other microbiome results before cleaning and filtering your dataset.
