---
title: "Tutorial_ComAnalysis_v3.02"
author: "Gabriel Perron"
date: "2/19/2018"
output: html_document
editor_options:
chunk_output_type: console
---
# INTRODUCTION
This tutorial was developped at Bard College as part of Bio340 - Metagenomics and for Research Projects investigating microbial communities using the 16S rRNA or ITS marker. The tutorial is divided in three parts:
1) [Tutorial_16S_Install_v3.02.Rmd]: Installs (or updates) the different packages and dependendies required to run the pipeline
2) [Tutorial_16S_DADA2_v3.02.Rmd]: Runs [DADA2], a packages developped to process raw 16S (or ITS) reads while removing noise and producing a taxa table with counts.
3) [Tutorial_16S_ComAnalysis_v3.02.Rmd]: Combines taxa table and metadata into a [Phyloseq] object and runs different microbial community analyses, including alpha-diversity, beta-diversity, and taxa changes in abundance. Th
In addition to [DADA2](https://github.com/benjjneb/dada2) to infer amplicon sequence variants (AVS) the tutorial uses several other packages, primarily including:
[phyloseq](https://joey711.github.io/phyloseq/);
[vegan](https://cran.r-project.org/web/packages/vegan/vegan.pdf); [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html).
##Loading Packages & Files
First, we load the different pacakges that will be used in this tutorial. If it is not working, please consult [Tutorial_16S_Install_v3.01].
```{r loading packages}
library("ggplot2")
library("phyloseq")
library("vegan")
library("magrittr")
library("data.table")
library('DESeq2');packageVersion("DESeq2")
library("picante")
library("phangorn")
library("phytools")
```
## Loading phyloseq object
If you re-open with a fresh R session and already created you phyloseq objects, you can simply load them as R data objects. These phyloseq objects are the core objects used for the following microbial community analyses.
```{r load phyloseq object}
ps.pruned.final <- readRDS(file.path("D:/Downloads/attachments/ps.pruned.final.RDS"))
read.tree("D:/Downloads/RAxML_bestTree.result") -> tree.16s
```
### Final pre-processing
We will create two additional phyloseq objects to remove taxa not seen at least 3 times in at least 15% of the samples as well as make a sequence table with relative abunance. This preprocossing allows us to visualize and analyze high-level patterns in the dataset rather than whole communitiy shifts.
In the initial filtering step we protect against an OTU (or in this case, ASV) with small mean & trivially large C.V. [(http://joey711.github.io/phyloseq/preprocess.html)]. In the phyloseq tutorial they filter to taxa seen at least 3 times at least 20% of samples, but we chose 15% since we have many fewer taxa to begin with and all of our singletons were either removed or corrected to real sequence variants with DADA2. Remember, however, preprocessing steps are dataset dependent and you may consider trying out a few different parameters.
```{r Filter low abundance taxa for plotting}
#Filter out taxa < 15%
ps.pruned.final.filt = filter_taxa(ps.pruned.final, function(x) sum(x>2) > (0.15*length(x)), TRUE)
#Take relative abundance
ps.pruned.final.filt.rel = transform_sample_counts(ps.pruned.final.filt, (function(x) x/sum(x)))
```
## Explarotary Bar Plots
Once we are happy with the samples we wish to consider for our analysis, our first steps will be to create explaratory plots to look at overall differences between different treatments. To this end, bar plots can be very useful to highlight certain trends in your data. Here is an example looking at the abundance of the top 20 ASVs present in our samples. In the following plot, each bar represent a unique ASV; many ASV can belong to the same family.
For more bar plot options, see: https://joey711.github.io/phyloseq/plot_bar-examples.html
Here we will plot two examples of such bar plots.
### Plot taxa >15%
```{r bar plots of taxa above 15%}
plot_bar(ps.pruned.final.filt, x="Cluster", fill="Phylum") +
theme_bw() +
scale_x_discrete(limits=c("A1P","B1L","A1M", "C3S", "C3M","A5M", "A5S"))
plot_bar(ps.15000.final.filt, x="Treatment", fill="Family") +
theme_bw() +
scale_x_discrete(limits=c("control","medium","high"))
```
The graphs avove are interesting but not really publication ready. We wil thus use a custom-made function
### Define a function for plotting purposes
The figure above is nice, but not looking really good. It is also not providing all the info we need for publication. Therefore, we will create a new R functions called "plot_ordered_bar" based on the following link (https://github.com/joey711/phyloseq/issues/442). Basically what it does is rather than plot the abundance starting from the bottom with the most abundant and decreasing upwards, it stacks the same colors on top of one another in each column, in a way, allowing more related bacteria to be side by side between samples.
The way to create a new R function is to run the whole block first as a chunk. The function will then appear as a new function in your R Global Environment. You can then use it to make graph. But remember, each time you open R or you clear your Global environment, this function must be re-runed before you can use it.
```{r Plot ordered bar function}
#Set random seed so that we can reproduce the methodology
set.seed(22886)
# Make some data
otumat = matrix(sample(1:100, 100, replace = TRUE)
, nrow = 10, ncol = 10)
rownames(otumat) <- paste0("OTU", 1:nrow(otumat))
colnames(otumat) <- paste0("Sample", 1:ncol(otumat))
taxmat = matrix(sample(letters, 70, replace = TRUE),
nrow = nrow(otumat),
ncol = 7)
rownames(taxmat) <- rownames(otumat)
colnames(taxmat) <- c("Domain",
"Phylum",
"Class",
"Order",
"Family",
"Genus",
"Species")
OTU = otu_table(otumat, taxa_are_rows = TRUE)
TAX = tax_table(taxmat)
physeq = phyloseq(OTU, TAX)
########
# Plot_ordered_bar function
# (name kept the same for back compatibility
# may want to change to avoid confusion!)
#Note: There may be some hangovers
# from another script I was writing
# that I pillaged for this one
###### Start of Function Here ######
plot_ordered_bar <- function (physeq, x = "Sample",
y = "Abundance",
fill = NULL,
leg_size = 0.5,
title = NULL) {
require(ggplot2)
require(phyloseq)
require(plyr)
require(grid)
bb <- psmelt(physeq)
samp_names <- aggregate(bb$Abundance, by=list(bb$Sample), FUN=sum)[,1]
.e <- environment()
bb[,fill]<- factor(bb[,fill], rev(sort(unique(bb[,fill])))) #fill to genus
bb<- bb[order(bb[,fill]),] # genus to fill
p = ggplot(bb, aes_string(x = x, y = y,
fill = fill),
environment = .e, ordered = FALSE)
p = p +geom_bar(stat = "identity",
position = "stack",
color = "black")
p = p + theme(axis.text.x = element_text(angle = -90, hjust = 0))
p = p + guides(fill = guide_legend(override.aes = list(colour = NULL), reverse=TRUE)) +
theme(legend.key = element_rect(colour = "black"))
p = p + theme(legend.key.size = unit(leg_size, "cm"))
if (!is.null(title)) {
p <- p + ggtitle(title)
}
return(p)
}
# END #
```
### Plot phylum
In the following example, we will plot phylym with at least 1% abundance, making the figure more visually informative. Of coure, you might want to change the abundance cutoff values for your own study.
```{r Plot phylum}
#Maybe add a new step where I don't filter to 0.01 and show what comes out.
#Filtering taxa <1.0% relative abundance
ps.pruned.final.rel.filt0.01 = filter_taxa(ps.pruned.final.filt.rel, function(x) mean(x) > 0.01, TRUE)
phyplot01 = plot_ordered_bar(ps.pruned.final.rel.filt0.01, fill = "Phylum") +
geom_bar(stat="identity")+
ggtitle("a)")+
theme_classic()
phyplot01
```
Let's use the different functions of ggplot to make our figure a little more attractive:
Noteworthy thingsL
```{r}
#Make a list of sample names for plotting. This new list will order the samples so that you can compare similar categories together.
#Sample = c("A1P","B1L","A1M", "C3S", "C3M","A5M", "A5S")
#Plotting
phyplot01 = plot_ordered_bar(ps2, fill = "Phylum") +
geom_bar(stat="identity",aes(size=0.1))+
ggtitle("a)")+
scale_x_discrete(limits=metad$Cluster) + theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(panel.border = element_rect(colour = "black", fill=NA, size=0.5)) +
scale_fill_manual(values=c("#0072B2", "#009E73", "#0025EA")) +
scale_y_continuous(limits = c(-0.05, 1),
breaks=c(0.00, 0.25, 0.5, 0.75, 1.00),expand = c(0,0)) +
theme(legend.position="none",
axis.line.x=element_blank(),
axis.ticks.x = element_blank(),
axis.line.y=element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust=1, size = 4.5),
axis.title.x = element_text(margin = margin(t = 9),size=8),
axis.title.y = element_text(size=8),
axis.text.y = element_text(size = 5),
axis.ticks.y = element_line(size=0.5),
panel.border = element_blank(),
plot.title = element_text(size=8))+
annotate("rect", xmin=c(0.5,0), xmax=c(10.5,0),
ymin=c(-0.01,0) , ymax=c(-0.03,0), color="#D55E00", fill="#D55E00") +
annotate("rect", xmin=c(10.5,0), xmax=c(15.5,0),
ymin=c(-0.01,0) , ymax=c(-0.03,0), color="#F0E442", fill="#F0E442") +
annotate("rect", xmin=c(15.5,0), xmax=c(20.5,0),
ymin=c(-0.01,0) , ymax=c(-0.03,0), color="#E69F00", fill="#E69F00") +
annotate(geom = "segment", x=0, xend = 0, y = 0, yend = 1,size=0.5)
phyplot01
```
### Plot family
Choosing the cutoff of all seqs above 0.30% abundance because this leaves a visually informative amount of taxa. Then plotting families. Again, analysis dependent so choose an appropriate cutoff.
```{r Plot families}
#Plotting
famplot01 = plot_ordered_bar(ps.pruned.final.rel.filt0.01, fill = "Family") +
geom_bar(stat="identity",aes(size=0.1))+
ggtitle("a)")+
scale_x_discrete(limits=Cluster) + theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(panel.border = element_rect(colour = "black", fill=NA, size=0.5)) +
scale_y_continuous(limits = c(-0.05, 1),
breaks=c(0.00, 0.25, 0.5, 0.75, 1.00),expand = c(0,0)) +
theme(legend.position="none",
axis.line.x=element_blank(),
axis.ticks.x = element_blank(),
axis.line.y=element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust=1, size = 4.5),
axis.title.x = element_text(margin = margin(t = 9),size=8),
axis.title.y = element_text(size=8),
axis.text.y = element_text(size = 5),
axis.ticks.y = element_line(size=0.5),
panel.border = element_blank(),
plot.title = element_text(size=8))+
annotate("rect", xmin=c(0.5,0), xmax=c(10.5,0),
ymin=c(-0.01,0) , ymax=c(-0.03,0), color="#D55E00", fill="#D55E00") +
annotate("rect", xmin=c(10.5,0), xmax=c(15.5,0),
ymin=c(-0.01,0) , ymax=c(-0.03,0), color="#F0E442", fill="#F0E442") +
annotate("rect", xmin=c(15.5,0), xmax=c(20.5,0),
ymin=c(-0.01,0) , ymax=c(-0.03,0), color="#E69F00", fill="#E69F00") +
annotate(geom = "segment", x=0, xend = 0, y = 0, yend = 1,size=0.5)
famplot01
```
Now, we will create another new custom made function so that we can summarize the data into a single talbe.
```{r}
#Create function that modify phyloseq object
fast_melt = function(physeq){
# supports "naked" otu_table as `physeq` input.
otutab = as(otu_table(physeq), "matrix")
if(!taxa_are_rows(physeq)){otutab <- t(otutab)}
otudt = data.table(otutab, keep.rownames = TRUE)
setnames(otudt, "rn", "taxaID")
# Enforce character taxaID key
otudt[, taxaIDchar := as.character(taxaID)]
otudt[, taxaID := NULL]
setnames(otudt, "taxaIDchar", "taxaID")
# Melt count table
mdt = melt.data.table(otudt,
id.vars = "taxaID",
variable.name = "SampleID",
value.name = "count")
# Remove zeroes, NAs
mdt <- mdt[count > 0][!is.na(count)]
# Calculate relative abundance
mdt[, RelativeAbundance := count / sum(count), by = SampleID]
if(!is.null(tax_table(physeq, errorIfNULL = FALSE))){
# If there is a tax_table, join with it. Otherwise, skip this join.
taxdt = data.table(as(tax_table(physeq, errorIfNULL = TRUE), "matrix"), keep.rownames = TRUE)
setnames(taxdt, "rn", "taxaID")
# Enforce character taxaID key
taxdt[, taxaIDchar := as.character(taxaID)]
taxdt[, taxaID := NULL]
setnames(taxdt, "taxaIDchar", "taxaID")
# Join with tax table
setkey(taxdt, "taxaID")
setkey(mdt, "taxaID")
mdt <- taxdt[mdt]
}
return(mdt)
}
#Create function that estimate standard error
se <- function(x) sqrt(var(x)/length(x))
#Creat summarize taxa function
summarize_taxa = function(physeq, Rank, GroupBy = NULL){
Rank <- Rank[1]
if(!Rank %in% rank_names(physeq)){
message("The argument to `Rank` was:\n", Rank,
"\nBut it was not found among taxonomic ranks:\n",
paste0(rank_names(physeq), collapse = ", "), "\n",
"Please check the list shown above and try again.")
}
if(!is.null(GroupBy)){
GroupBy <- GroupBy[1]
if(!GroupBy %in% sample_variables(physeq)){
message("The argument to `GroupBy` was:\n", GroupBy,
"\nBut it was not found among sample variables:\n",
paste0(sample_variables(physeq), collapse = ", "), "\n",
"Please check the list shown above and try again.")
}
}
# Start with fast melt
mdt = fast_melt(physeq)
if(!is.null(GroupBy)){
# Add the variable indicated in `GroupBy`, if provided.
sdt = data.table(SampleID = sample_names(physeq),
var1 = get_variable(physeq, GroupBy))
setnames(sdt, "var1", GroupBy)
# Join
setkey(sdt, SampleID)
setkey(mdt, SampleID)
mdt <- sdt[mdt]
}
# Summarize
Nsamples = nsamples(physeq)
summarydt = mdt[, list(meanRA = sum(RelativeAbundance)/Nsamples,
sdRA = sd(RelativeAbundance),
seRA = sd(RelativeAbundance)/sqrt(nsamples(physeq)),
seRA = se(RelativeAbundance),
minRA = min(RelativeAbundance),
maxRA = max(RelativeAbundance)),
by = c(Rank, GroupBy)]
return(summarydt)
}
```
Now let's summarize the data for ps.pruned.final.filt.rel
```{r}
summarize_taxa(ps.pruned.final.filt.rel, Rank = "Family",GroupBy = "SampleID")
```
## STATISTICAL ANALYSES & PLOTS
Broadly define, there are two kinds of analysis that are generally used to study metagenomics data: 1) alpha-diversity; and 2) beta-diversity. Here we will use phyloseq combined with ggplot to explore both type ofs analysis. We will also use the package [vegan] to conduct statistical analyses linked to the different diversity indices.
N.B. We will try to use colorblind friendly colors whenever possible when plotting our results.
```{r colorblind palette}
cbpaletteshort = c("#D55E00","#F0E442","#E69F00", "#0072B2")
```
## Alpha-diversity
Alpha-diversity consider the taxa diversity within each sample individually. The most simple alph-diversity index is richness, which consider the total number of observed taxa. Over the past 50 years or so, many diversity indices were developped for different purposes, often accounting for the relative abundance of each taxa. Here, we will run all diveristy and abundance analyses on subsampled reads, since we are using observed species as part of our metrics. Using our DADA2 sequences this is acceptable and highly advised.
Before, we can look at the different diversity indices, however, we need to control for the fact that our samples do not present the same sequencing depth. It is easy to imagine that a sample with a larger sequencing depth (or number of reads) could present a higher diversity value. However, it does not mean the sampel is in fact more diverse from an ecological point of view. For this reason, we will rarefy our samples to a similar depth using the following code. The rarefy function simply randomly draw a specific number of reads from each samples and re-evaluate diversity indices. We will conduct alpha diversity analyses on our unfiltered table to allow rare taxa and fine-scale differences between samples to be accounted for.
### Rarefying samples
```{r rarefying samples}
set.seed(401)
ps.pruned.final.rar <- rarefy_even_depth(ps.pruned.final, sample.size = min(sample_sums(ps.pruned.final)),
rngseed = 401, replace = TRUE, trimOTUs = TRUE, verbose = TRUE)
#Let's make sure that all samples have the same number of reads:
sample_sums(ps.pruned.final.rar)
#make phyloseq object with tree
phyloseq(sample_data(ps.pruned.final),
otu_table(ps.pruned.final), tax_table(ps.pruned.final), tree.16s) -> phyloseq.tree.16s2
#root and remake tree
phyloseq.tree.16s2@phy_tree -> tree2.16s
midpoint.root(tree2.16s) -> rooted.16s
#make phyloseq object with tree
phyloseq(sample_data(ps.pruned.final),
otu_table(ps.pruned.final), tax_table(ps.pruned.final), rooted.16s) -> phyloseq.tree.16s2
#make phyloseq object with tree
phyloseq(sample_data(ps.pruned.final.rar),
otu_table(ps.pruned.final.rar), tax_table(ps.pruned.final.rar), rooted.16s) -> phyloseq.tree.16s
```
Note that when looking at beta-diversity or differential abundance (see below) we will not subsample since we can use a Negative Binomial to accurately capture our distribution, for more information see [Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissable, McMurdie and Holmes 2014](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531)
### Calculate alpha-diversity
The command below will run all alpha diversity metrics and output it as a table. You can copy and paste these values into an excel sheet if you wish to work with them that way.
First, we run `estimate_richness` on the phyloseq object. We can ignore the warnings about singletons, if any, since we processed our reads with DADA2, which relies on multiple observation of reads to correct sequences and so does not output singletons. The warning is in regards to diversity metrics (e.g., Chao1) that require singletons. Singletons are likely to be sequencing errors and were removed from our table.
Additionally, we are conducting alpha diversity analyses on our unfiltered table to allow rare species and fine-scale differences between samples to be accounted for.
```{r rarefying}
#This function will measure Richness, Shannon, for the phyloseq object ps.15000.final.rar
alpha <- estimate_richness(ps.pruned.final.rar, measures=c("Observed", "Shannon"))
#add SampleID column to the alpha dataframe
alpha$SampleID <- sample_names(ps.pruned.final.rar)
metad_alpha <- merge(metad,alpha,by="SampleID") #merge alpha and metad into a single dataframe
metad_alpha
write.csv(metad_alpha, file.path("D:/Downloads/attachments/metad_alpha.csv")) #writes out a .csv file that you can open in excel
```
#### Plot alpha-diversity
Next, we will plot the diversity metrics according to metadata categories and run statistical analysis (e.g. anova,, linear and polynomial regressions) comparing the differences in diversity between treatments.
Using the 'plot_richness()' command in R, we can simply visualize all the different alpha diversity indices in a single plot for explaratory purposes.
```{r plot alpah diversity}
plot_richness(ps.pruned.final.rar,measures=c("Observed", "Shannon"))
```
Or we can specify diversity indices of choices and highlight different treatment:
```{r plot specific alpha diversity}
plot_richness(ps.pruned.final.rar, x="Cluster", measures=c("Observed", "Shannon"), color="Cluster")
plot_richness(ps.pruned.final.rar, x="Size", measures=c("Observed", "Shannon"), color="Size")
plot_richness(ps.pruned.final.rar, x="Patch_size", measures=c("Observed", "Shannon"), color="Patch_size")
ggsave("Patch_size_summary.png", device = "png", path = "~/Metagenomics/")
plot_richness(ps.pruned.final.rar, x="Cluster2", measures=c("Observed", "Shannon"), color="Cluster2")
ggsave("Patch_summary.png", device = "png", path = "~/Metagenomics/")
```
In the next three setions, we will plot diversity measured as observed species, the Shannon (richness and eveness) index and Pielou's evennes indices. We will also run statistical analysis to see if diversity if affected by patch size. We chose to use the sequence table with no filtering of lowly abundant ASVs, since rare species can play a large role in defining a samples' alpha diversity.
### Observed species
```{r}
#Plotting observed taxa for each patch size
#N.B. We are using "Treatment" to plot the different concentrations as a categorical variable
obs.boxplot01 <- ggplot(metad_alpha, aes(Cluster,Observed)) +
geom_boxplot(aes(fill = Cluster),outlier.size=1,lwd=0.5,fatten = 1.25) +
ggtitle("a)") +
xlab(" ") + theme_classic()
obs.boxplot01
```
Now, let's try to spice up the figure to look a bit more publication ready
```{r}
#Plotting observed taxa for each patch type
obs.boxplot02 <- ggplot(metad_alpha, aes(Patch_size, Observed)) +
geom_boxplot(aes(fill = Patch_size),outlier.size=1,lwd=0.5,fatten = 1.25) +
xlab("Patch Type") + theme_classic()+ ylab("Observed Richness") +
stat_summary(fun.y=mean, geom = "point", shape = 5, size = 4) +
scale_fill_manual(values=c("#D55E00","#F0E442", "#E69F00", "#F2B2A6")) +
scale_x_discrete(limits=c("S","M","C","E")) +
theme(legend.position="none",
#axis.text.x=element_blank(),
axis.line.x=element_blank(),
axis.ticks.x = element_blank(),
axis.line.y=element_blank(),
axis.text.y=element_text(size=8),
panel.border = element_rect(colour = "black", fill=NA, size=0.5),
plot.title = element_text(size=10))
obs.boxplot02
ggsave("Observed.png", device = "png", path = "~/Metagenomics/")
```
In the following section, we run a simple statistical analysis to test whether diversity changes with the different treatments.
```{r statistical model for observed species}
#Running model with Concentration as categorical
clustered = aggregate(metad_alpha[,c('Observed',"Shannon")],list(metad_alpha$Cluster, metad_alpha$Size),mean)
clustered2 = aggregate(metad_alpha[,c('Observed',"Shannon")],list(metad_alpha$Cluster, metad_alpha$Patch_size),mean)
model.obs.OO<-aov(Observed~factor(Group.2),data=clustered) #This run the statisitics
anova(model.obs.OO) #this print the statistics
model.obs.O1<-aov(Observed~factor(Group.2),data=clustered2) #This run the statisitics
anova(model.obs.O1) #this print the statistics
```
In the analysis above, we can see that the number of observed taxa is statistically diffent (or significantly) becaus the P-value (i.e.g Pr(>F)) is lower than 0.05.
### Shannon metric
Plotting Shannon diversity. With median and hinges as 25th and 75th quartiles and means as diamonds.
```{r}
#Plotting observed taxa for each patch type
#N.B. We are using "Treatment" to plot the different concentrations as a categorical variable
shan.boxplot02 <- ggplot(metad_alpha, aes(Patch_size, Shannon)) +
geom_boxplot(aes(fill = Patch_size),outlier.size=1,lwd=0.5,fatten = 1.25) +
xlab("Patch Type") + theme_classic()+ ylab("Shannon Diversity") +
stat_summary(fun.y=mean, geom = "point", shape = 5, size = 4) +
scale_fill_manual(values=c("#D55E00","#F0E442", "#E69F00", "#F2B2A6")) +
scale_x_discrete(limits=c("S","M","C","E")) +
theme(legend.position="none",
#axis.text.x=element_blank(),
axis.line.x=element_blank(),
axis.ticks.x = element_blank(),
axis.line.y=element_blank(),
axis.text.y=element_text(size=8),
panel.border = element_rect(colour = "black", fill=NA, size=0.5),
plot.title = element_text(size=10))
shan.boxplot02
ggsave("Shannon.png", device = "png", path = "~/Metagenomics/")
```
```{r Shannon}
shan.boxplot <- ggplot(clustered, aes(Size,Avg_shan)) +
geom_boxplot(aes(fill = clustered$Size),outlier.size=1,lwd=0.5,fatten = 1.25) +
ylab("Shannon") +
xlab(" ") + theme_classic() +
stat_summary(fun.y=mean, geom = "point", shape = 5, size = 4) +
scale_fill_manual(values=c("#D55E00","#F0E442", "#E69F00")) +
scale_x_discrete(limits=c("S","M","L")) +
theme(legend.position="none",
axis.text.x=element_blank(),
axis.line.x=element_blank(),
axis.ticks.x = element_blank(),
axis.line.y=element_blank(),
axis.text.y=element_text(size=8),
panel.border = element_rect(colour = "black", fill=NA, size=0.5),
plot.title = element_text(size=10))
shan.boxplot
```
Let's go ahead and make and summarize our models...
```{r}
#Running models for Shannon diversity
model.shan.OO<-aov(Shannon~factor(Group.2),data=clustered)
anova(model.shan.OO)
model.shan.O1<-aov(Shannon~factor(Group.2),data=clustered2)
anova(model.shan.O1)
```
```{r}
#Phylogenetic Diversity
as.data.frame(phyloseq.tree.16s@otu_table) -> comm.16s
pd(comm.16s, rooted.16s) -> pd.16s
pd.16s$SampleID <- rownames(pd.16s)
merge(metad_alpha, pd.16s) -> pd.16s
clustered.16s = aggregate(pd.16s[,c('Observed',"Shannon","PD")],list(pd.16s$Cluster, pd.16s$Patch_size),mean)
clustered.16s2 = aggregate(pd.16s[,c('Observed',"Shannon","PD")],list(pd.16s$Cluster, pd.16s$Size),mean)
model.pd1.16s <- aov(clustered.16s$PD ~ clustered.16s$Group.2)
anova(model.pd1.16s)
model.pd2.16s <- aov(clustered.16s2$PD ~ clustered.16s2$Group.2)
anova(model.pd2.16s)
pd.boxplot02 <- ggplot(pd.16s, aes(Patch_size, PD)) +
geom_boxplot(aes(fill = Patch_size),outlier.size=1,lwd=0.5,fatten = 1.25) +
xlab("Patch Type") + theme_classic()+ ylab("Faith's Phylogenetix Diversity") +
stat_summary(fun.y=mean, geom = "point", shape = 5, size = 4) +
scale_fill_manual(values=c("#D55E00","#F0E442", "#E69F00", "#F2B2A6")) +
scale_x_discrete(limits=c("S","M","C","E")) +
theme(legend.position="none",
#axis.text.x=element_blank(),
axis.line.x=element_blank(),
axis.ticks.x = element_blank(),
axis.line.y=element_blank(),
axis.text.y=element_text(size=8),
panel.border = element_rect(colour = "black", fill=NA, size=0.5),
plot.title = element_text(size=10))
pd.boxplot02
ggsave("PD_16s.png", device = "png", path="D:")
```
### Summarize metrics in table
Then make a table of means of Observed, ACE, Shannon, Simpson and Fisher and Pielou metrics. This is for publication purposes. You can freely change the columns of diversity metrics to retain.
```{r Compute means for diversity table}
alphmean = aggregate(metad_alpha[,c('Observed',"Shannon")],list(metad_alpha$Cluster),mean)
alphsd = aggregate(metad_alpha[,c('Observed',"Shannon")],list(metad_alpha$Cluster),sd)
alphmean
alphsd
write.csv(alphmean,file.path(path, "alpha_means.csv"))
write.csv(alphsd,file.path(path, "alpha_sd.csv"))
```
## Beta-diversity
Often in ecological research, we are interested not only in comparing univariate descriptors of communities, like diversity (such as in alpha-diversity), but also in how the constituent species — or the composition — changes from one community to the next. One common tool to do this is ordination. The goal of ordination is to collapse information from multiple dimensions (e.g, from multiple communities, sites, etc.) into just a few, so that they can be visualized and interpreted.
### Using PCoA (principle coordinate analysis)
(From Wikipedia)
Principle coordinate analysis (PCoA) is often mistakeng for Principal Component Analysis (PCA), which is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (e.g., microbial taxa and their abundance) into a set of values of linearly uncorrelated variables called principal components (or sometimes, principal modes of variation). Principle coordinate analysis (PCoA) is similar in essence to principle component analysis but uses a dissimilarity matrix (i.e. distances mesured between instances of a data set).
### Run ordination
Here we will run Principle Coordinate Anlaysis (PCoA) on [UniFrac](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1317376/) distance scores to analyze beta diversity differences between communities. There are a number of other distance metrics available, such as Jaccard, Bray Curtis and Weighted UniFrac; and other multivariate analysis available such as multidimensional scaling (NMDS) for example when you don't have a tree.
You can see these in the phyloseq documentaion. Also, phyloseq has a great [tutorial](http://joey711.github.io/phyloseq/plot_ordination-examples.html) on using the ``plot_ordination`` function and [making scree plots](https://github.com/joey711/phyloseq/blob/master/vignettes/phyloseq-analysis.Rmd).
Here we will run the ordination calculation on the filtered but non-rarefied phyloseq object to look into specific patterns
```{r Run and plot ordinations}
#First run ordination
ps.pruned.final.filt.pc.uni <-ordinate(ps.pruned.final.filt, method="PCoA", distance = "morisita")
#Then we can see how much of total distance is captured by the eigenvalues
plot_scree(ps.pruned.final.filt.pc.uni)
```
How much variation do the first two axes (ones we will plot) explain? If this number is really low (i.e., <10%),it might be a good idea to wonder whether running this kind of analysis of meaningful at all.
```{r variation explained by axis}
#PCoA 1
(100*sum(ps.pruned.final.filt.pc.uni$values$Relative_eig[1]))
#PCoA 2
(100*sum(ps.pruned.final.filt.pc.uni$values$Relative_eig[2]))
#PCoA 1 and 2
(100*sum(ps.pruned.final.filt.pc.uni$values$Relative_eig[1:2]))
```
### Plot ordination
Comparisons of community structure
```{r}
#Plotting ordination comparing different patch sizes
#We use t-dist for ellipse b.c n < 30
p1.filt = plot_ordination(ps.pruned.final.filt,
ps.pruned.final.filt.pc.uni,
color="Size") +
geom_point(size = 1) +
scale_colour_manual(values=c("#D55E00","#F0E442", "#E69F00", "#0072B2")) +
theme_classic() +
stat_ellipse(type = "t")+
ylab("PCo2 (27.57%)") +
xlab("PCo1 (31.14%)") +
theme(legend.position="none",
axis.title.x = element_text(size=10),
axis.title.y = element_text(size=10),
axis.text.x=element_blank(),
axis.line.x=element_blank(),
axis.ticks.x = element_blank(),
axis.line.y=element_blank(),
axis.text.y=element_text(size=8),
panel.border = element_rect(colour = "black", fill=NA, size=0.5),
plot.title = element_text(size=10))
p1.filt
#Plotting ordination including edges and interiors
p2.filt = plot_ordination(ps.pruned.final.filt,
ps.pruned.final.filt.pc.uni,
color="Patch_size") +
geom_point(size = 1) +
scale_colour_manual(values=c("#D55E00","#F0E442", "#E69F00", "#0072B2")) +
theme_classic() +
stat_ellipse(type = "t")+
ylab("PCo2 (27.57%)") +
xlab("PCo1 (31.14%)") +
theme(legend.position="right",
legend.text = "Patch Size",
axis.title.x = element_text(size=10),
axis.title.y = element_text(size=10),
axis.text.x=element_blank(),
axis.line.x=element_blank(),
axis.ticks.x = element_blank(),
axis.line.y=element_blank(),
axis.text.y=element_text(size=8),
panel.border = element_rect(colour = "black", fill=NA, size=0.5),
plot.title = element_text(size=10))
p2.filt
ggsave("Morista.png", device = "png", path = "~/Metagenomics/")
ps.16s.ord<-ordinate(phyloseq.tree.16s2, method="PCoA", distance = "wunifrac")
#PCoA 1
(100*sum(ps.16s.ord$values$Relative_eig[1]))
#PCoA 2
(100*sum(ps.16s.ord$values$Relative_eig[2]))
#PCoA 1 and 2
(100*sum(ps.16s.ord$values$Relative_eig[1:2]))
p3.filt = plot_ordination(phyloseq.tree.16s2,
ps.16s.ord,
color="Patch_size") +
geom_point(size = 1) +
scale_colour_manual(values=c("#D55E00","#F0E442", "#E69F00", "#0072B2")) +
theme_classic() +
stat_ellipse(type = "t")+
ylab("PCo2 (16.27%)") +
xlab("PCo1 (38.66%)") +
theme(legend.position="right",
axis.title.x = element_text(size=10),
axis.title.y = element_text(size=10),
axis.text.x=element_blank(),
axis.line.x=element_blank(),
axis.ticks.x = element_blank(),
axis.line.y=element_blank(),
axis.text.y=element_text(size=8),
panel.border = element_rect(colour = "black", fill=NA, size=0.5),
plot.title = element_text(size=10))
p3.filt
ggsave("Wunifrac_16s.png", device = "png", path = "D:/Downloads/ITS")
ps.16s.ord<-ordinate(phyloseq.tree.16s2, method="PCoA", distance = "unifrac")
#PCoA 1
(100*sum(ps.16s.ord$values$Relative_eig[1]))
#PCoA 2
(100*sum(ps.16s.ord$values$Relative_eig[2]))
#PCoA 1 and 2
(100*sum(ps.16s.ord$values$Relative_eig[1:2]))
p4.filt.16s = plot_ordination(phyloseq.tree.16s2,
ps.16s.ord,
color="Patch_size") +
geom_point(size = 1) +
scale_colour_manual(values=c("#D55E00","#F0E442", "#E69F00", "#0072B2")) +
theme_classic() +
stat_ellipse(type = "t")+
ylab("PCo2 (5.82%)") +
xlab("PCo1 (7.96%)") +
theme(legend.position="right",
axis.title.x = element_text(size=10),
axis.title.y = element_text(size=10),
axis.text.x=element_blank(),
axis.line.x=element_blank(),
axis.ticks.x = element_blank(),
axis.line.y=element_blank(),
axis.text.y=element_text(size=8),
panel.border = element_rect(colour = "black", fill=NA, size=0.5),
plot.title = element_text(size=10))
p4.filt.16s
ggsave("Unifrac_16s.png", device = "png", path = "D:/Downloads/ITS")
```
```{r}
ggsave("Figure.jpeg", p1.filt, device = "jpeg", path = file.path(path), scale = 1, width = 12, height = 8, units = "cm", dpi = 300, limitsize = TRUE)
```
### Statistical analysis of Beta-diversity
To test whether difference in community structures (e.g., clustering) is significantly correlated with one of the metadata variables, you can use the following set of statistical analyses. Each test is designed to accommodate categorical or continuous data.
We use ADONIS (see ``vegan::adonis`` ) when we have continuous variables but have to use an ANOSIM (see ``vegan::anosim``) when there are categorical variables. We are looking at the P and R2/R values. A R2/R value of 0 means that there is no dissimilarity between the groups and a value of 1.0 suggest high dissimilarity. A value <0 suggests that the within group difference is in fact greater than between group differences. If we receive significant values when conducting these tests, we want to make sure the differencs are not due to a lack of homogeneity of group dispersions, and so we verify our results using ``vegan::betadisper``. Below we see our test statistics, p-values and anova tables printed.
#### ADONIS
ADONIS partitions a distance matrix among sources of variation in order to describe the strength and significance that a continuous variable has in determining variation of distances. This is a nonparametric method and is very similar to PERMANOVA, though it is more robust in that it can accept either categorical or continuous variables in the metadata mapping file, while PERMANOVA can only accept categorical variables. Adonis is part of the Vegan package.
```{r Run ADONIS}
#Recalculating distance and running adonis test
df = as(sample_data(ps.pruned.final.filt),"data.frame")
d=phyloseq::distance(ps.pruned.final.filt,"morisita")
#testing whether communities differ from patch size
treatment_ado = adonis(d ~ Size,df)
treatment_ado #print the results of the analysis.
#testing whether communities differ, including edges and interiors
treatment_ado2 = adonis(d ~ Patch_size,df)
treatment_ado2 #print the results of the analysis.
df = as(sample_data(phyloseq.tree.16s2), "data.frame")
d = UniFrac(phyloseq.tree.16s2, weighted = TRUE)
treatment_ado3 = adonis (d ~ Patch_size,df)
treatment_ado3
df = as(sample_data(phyloseq.tree.16s2), "data.frame")
d = UniFrac(phyloseq.tree.16s2, weighted = FALSE)
treatment_ado3 = adonis (d ~ Patch_size,df)
treatment_ado3
#Post-hoc?
subset_samples(phyloseq.tree.16s2, phyloseq.tree.16s2@sam_data[,4] == c("E","S")) -> s1
df = as(sample_data(s1), "data.frame")
d = UniFrac(s1, weighted = TRUE)
treatment_ado3s = adonis (d ~ Patch_size,df)
treatment_ado3s
subset_samples(phyloseq.tree.16s2, phyloseq.tree.16s2@sam_data[,4] == c("E","C")) -> s1
df = as(sample_data(s1), "data.frame")
d = UniFrac(s1, weighted = TRUE)
treatment_ado3s = adonis (d ~ Patch_size,df)
treatment_ado3s
subset_samples(phyloseq.tree.16s2, phyloseq.tree.16s2@sam_data[,4] == c("E","M")) -> s1
df = as(sample_data(s1), "data.frame")
d = UniFrac(s1, weighted = TRUE)
treatment_ado3s = adonis (d ~ Patch_size,df)
treatment_ado3s
subset_samples(phyloseq.tree.16s2, phyloseq.tree.16s2@sam_data[,4] == c("S","M")) -> s1
df = as(sample_data(s1), "data.frame")
d = UniFrac(s1, weighted = TRUE)
treatment_ado3s = adonis (d ~ Patch_size,df)
treatment_ado3s
subset_samples(phyloseq.tree.16s2, phyloseq.tree.16s2@sam_data[,4] == c("M","C")) -> s1
df = as(sample_data(s1), "data.frame")
d = UniFrac(s1, weighted = TRUE)
treatment_ado3s = adonis (d ~ Patch_size,df)
treatment_ado3s
subset_samples(phyloseq.tree.16s2, phyloseq.tree.16s2@sam_data[,4] == c("P","C")) -> s1
df = as(sample_data(s1), "data.frame")
d = UniFrac(s1, weighted = TRUE)
treatment_ado3s = adonis (d ~ Patch_size,df)
treatment_ado3s
```
---
This part is left for your to play with...
#DESeq2 differential abundance
DESeq2 uses methods developed to analyze gene expression data. In a nutshell, it uses differential abundance and tries to identify entries (e.g., genes or in microbiome studies taxa!) that changed in abundance between two states.
The methods described in this script are borrowed from a few places: the [phyloseq extension](http://joey711.github.io/phyloseq-extensions/DESeq2.html) and parts from [QIIME's open source DESeq2 script](https://github.com/biocore/qiime/blob/c2b62a86100f08e26011a11587a8da60ccaf409a/qiime/support_files/R/DESeq2_nbinom.r).
##Differential abundance analysis
There are three fitTypes available (local, paramteric and mean). After testing all three, we found the local fit was best. Though this is data dependent, I haven't seen anything other than a local fit used.
```{r}
#Make ISeV table a data frame
dds = phyloseq_to_deseq2(mergedps, ~ Size)
# calculate geometric means prior to estimate size factors
gm_mean = function(x, na.rm=TRUE){
exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans = apply(counts(dds), 1, gm_mean)
dds = estimateSizeFactors(dds, geoMeans = geoMeans)
dds = DESeq(dds,fitType="local")
res = results(dds)
res = res[order(res$padj, na.last=NA), ]
alpha = 0.01
sigtab = res[(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(mergedps)[rownames(sigtab), ], "matrix"))
sigtab
#Let's look at the genus that were enriched in the presence of streptomycin
posigtab = sigtab[sigtab[, "log2FoldChange"] > 0, ]
posigtab = posigtab[, c("baseMean", "log2FoldChange", "lfcSE", "padj", "Phylum", "Class", "Family", "Genus")]
posigtab
#Let's look at genus that were enriched in the absence of streptomycin
negagtab = sigtab[sigtab[, "log2FoldChange"] < 0, ]
negagtab = negagtab[, c("baseMean", "log2FoldChange", "lfcSE", "padj", "Phylum", "Class", "Family", "Genus")]
negagtab
```
In the above table you can see what taxa change in abundace between the two experimental conditions.
##Plotting dispersion estimates
```{r plot DESeq summary}
theme_set(theme_bw())
sigtabgen = subset(sigtab, !is.na(Genus))
sigtabgen
# Phylum order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Phylum = factor(as.character(sigtabgen$Phylum), levels=names(x))
# Genus order
x = tapply(sigtabgen$log2FoldChange, sigtabgen$Genus, function(x) max(x))
x = sort(x, TRUE)
sigtabgen$Genus = factor(as.character(sigtabgen$Genus), levels=names(x))
de1 = ggplot(sigtabgen, aes(y=Genus, x=log2FoldChange, color=Phylum)) +
geom_point(size=3) +
xlab("Log2 Fold Change") +
geom_vline(xintercept = 0.0, color = "gray", size = 0.5) +
scale_y_discrete(position = "right")+
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))+
theme(legend.position="none",
panel.grid.major.x = element_blank(),
panel.grid.minor.x=element_blank(),
panel.grid.major.y=element_line(size=0.2),
axis.line.x=element_blank(),
axis.line.y=element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=0.5))
de1 + scale_fill_manual(values=c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7"))
ggsave("Genus_Change.jpeg", de1, device = "jpeg", path = path, scale = 1, width = 12, height = 8, units = "cm", dpi = 300, limitsize = TRUE)
de2 = ggplot(sigtabgen, aes(y=log2FoldChange, x=Genus, color=Phylum)) +
ggtitle("b)")+
geom_point(size=1.9) +
ylab("Log2 Fold Change") +
geom_hline(yintercept = 0.0, color = "black", size = 0.4, linetype="longdash") +
scale_color_manual(values=c("#0072B2","#CC79A7", "#009E73"))+
theme(legend.position="none",
panel.grid.major.y = element_blank(),
panel.grid.minor.y=element_blank(),
panel.grid.major.x=element_line(size=0.2,linetype="dotted"),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8),
axis.line.x=element_blank(),
axis.line.y=element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, vjust=1, size = 4.5),
axis.text.y=element_text(size=5),
panel.border = element_rect(colour = "black", fill=NA, size=0.5),
plot.title = element_text(size=8))
de2
ggsave("Genus_Change.jpeg", de2, device = "jpeg", path = path, scale = 1, width = 12, height = 8, units = "cm", dpi = 300, limitsize = TRUE)
```
The above figure shows the different genus that changes in frequency between the two experimental treatment. Taxa above zero increased in the presence of streptomycin while taxa below zero decreased in the presence of streptomycin. Not that it is possible to see multiple within the same taxa if there are mutliple ASV variants within the genus (or even within the same species).
###THE END###
You are now ready to write up a manuscript describing your results. When doing so, always make sure to stay up to date with the most current version of DADA2 and phyloseq. New updates are released periodically with important changes to the ever-evolving analyses and statistical tools. Also, if you hae an especially large data set, you might consdier machine learning tools such as ???. The way to analyze 16S microbial community data is constantly improving, especially when it comes to time series and mixed models.
Good luck!