--- title: "Tutorial_ComAnalysis_v3.02" author: "Gabriel Perron" date: "2/19/2018" output: html_document editor_options: chunk_output_type: console --- # INTRODUCTION This tutorial was developped at Bard College as part of Bio340 - Metagenomics and for Research Projects investigating microbial communities using the 16S rRNA or ITS marker. The tutorial is divided in three parts: 1) [Tutorial_16S_Install_v3.02.Rmd]: Installs (or updates) the different packages and dependendies required to run the pipeline 2) [Tutorial_16S_DADA2_v3.02.Rmd]: Runs [DADA2], a packages developped to process raw 16S (or ITS) reads while removing noise and producing a taxa table with counts. 3) [Tutorial_16S_ComAnalysis_v3.02.Rmd]: Combines taxa table and metadata into a [Phyloseq] object and runs different microbial community analyses, including alpha-diversity, beta-diversity, and taxa changes in abundance. Th In addition to [DADA2](https://github.com/benjjneb/dada2) to infer amplicon sequence variants (AVS) the tutorial uses several other packages, primarily including: [phyloseq](https://joey711.github.io/phyloseq/); [vegan](https://cran.r-project.org/web/packages/vegan/vegan.pdf); [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html). ##Loading Packages & Files First, we load the different pacakges that will be used in this tutorial. If it is not working, please consult [Tutorial_16S_Install_v3.01]. ```{r loading packages} library("ggplot2") library("phyloseq") library("vegan") library("magrittr") library("phangorn") library("DECIPHER") library("phytools") library("data.table") library('DESeq2');packageVersion("DESeq2") ``` We also need to load the different data objects created in [Tutorial_16S_DADA2_v3.02] that will be used in the microbial communities analysis ```{r loading data objects} #taxa table taxa.its <- readRDS("D:/Downloads/ITS/ITS_taxa.Rdata") #sequence table seqtab.nochim.its <- readRDS("D:/Downloads/ITS/ITS.nochim.Rdata") ``` # PHYLOSEQ From the website (https://joey711.github.io/phyloseq/): The phyloseq package is a tool to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into Operational Taxonomic Units (OTUs) or ASVs, especially when there is associated sample data, phylogenetic tree, and/or taxonomic assignment of the OTUs. This package leverages many of the tools available in R for ecology and phylogenetic analysis (vegan, ade4, ape, picante), while also using advanced/flexible graphic systems (ggplot2) to easily produce publication-quality graphics of complex phylogenetic data. phyloseq uses a specialized system of S4 classes to store all related phylogenetic sequencing data as single experiment-level object, making it easier to share data and reproduce analyses. ## Creating metadata file The first step is to create a metadata file that stores all the information associated with our samples. Please see the example metadata file to see how you should structure your file. ```{r creating metadata} Metadata.its <- file.path("D:/Downloads/ITS/ITS_fragments_metadata.csv") metad.its <- read.csv(Metadata.its, header=TRUE, sep=",") metad.its$SampleID <- paste0(metad.its$SampleID) seqtab.frag <- subset(seqtab.nochim.its, rownames(seqtab.nochim.its) %in% metad.its$SampleID) #Check that sample names match between the metadata file and ASVs table. Should return TRUE all(rownames(seqtab.frag) %in% metad.its$SampleID) rownames(metad.its) <- metad.its$SampleID keep.cols <- c("SampleID","Soil_ID","Patch_size","Size","Cluster","Cluster_sep","Lab_ID") #keep whichever columns you feel are important! metad.its <- metad.its[rownames(seqtab.frag), keep.cols] #this is your final metadata object #Let us check out the metadata file #metad.its <- subset(metad.its, metad$Lab_ID !="") head(metad.its) saveRDS(metad.its, file.path("D:/Downloads/ITS/metadata_its.RDS")) ``` ## Creating phyloseq object We then creat the phylosey object using all the files you have created thus far. We will use this phyloseq object for most of our analyes and plotting moving forward. ```{r} ps <- phyloseq(tax_table(taxa.its), sample_data(metad.its), otu_table(seqtab.frag, taxa_are_rows = FALSE)) ps #print summary ``` ##Removing undesired taxa Finally, we makes sure we are only looking at bacteria and removing mitochondria and Chloroplast form our phyloseq object ```{r removing chloroplast and mitochondira} ps_its <- ps %>% subset_taxa( Kingdom == "k__Fungi" & Family != "mitochondria" & Class != "Chloroplast" ) ps_its #is the number of taxa lower? saveRDS(ps_its, file.path("D:/Downloads/ITS/ps_its.RDS")) ``` ## Pre-processing of phyloseq object Before looking into the different diversity analyses, we must make sure our results will not be biased by sequencing depth. The first thing to look for is that the number of reads in each sample should be more or less normally distributed. Also, you want to note any samples that may have very few reads. This could be due to sequencing failure (it happens!) or because a sample had a low microbial load to start with. Either way, such discrepancy could biase your results and you may wish to remove such low-reads samples from further analysis. There is no real rule of thumb here and you maybe want to prune different number of sample to see if this affect your conclusions. ### Check Reads Distribution First, let's visualize the distribution of read counts from our samples. ```{r plot read counts distribution} #First, let's compute the number of reads for each sample sample_sum_df <- data.frame(sum = sample_sums(ps_its)) #You can print the number of reads by running "sample_sum_df" #Let's plot the number of reads in each sample using a histogram ggplot(sample_sum_df, aes(x = sum)) + geom_histogram(color = "black", fill = "indianred", binwidth = 5000) + ggtitle("Distribution of sample sequencing depth") + xlab("Read counts") + theme(axis.title.y = element_blank()) + theme_bw() ``` In the current example, we can see that nine samples have count reads below 50,000. That being said, 50,000 is considered a very good coverage for this kind of analysis. We will thus look more specifically at different cutoff values. ### Pruning samples from phyloseq object Remove samples with 10,000 or below ```{r prune phyloseq} #Prune at 10,000 reads ps.pruned_its = prune_samples(sample_sums(ps_its)>10000,ps) ps.pruned_its sample_names(ps.pruned_its) sample_sum_df <- data.frame(sum = sample_sums(ps.pruned_its)) #You can print the number of reads by running "sample_sum_df" #Let's plot the number of reads in each sample using a histogram ggplot(sample_sum_df, aes(x = sum)) + geom_histogram(color = "black", fill = "indianred", binwidth = 10000) + ggtitle("Distribution of sample sequencing depth") + xlab("Read counts") + theme(axis.title.y = element_blank()) + theme_bw() ``` In the above example, we see that pruning below 15000 removed all but one low-concentration sample. For this reason, we decided to remove low-concentration samples altogether from the analysis. ```{r} ps.pruned.final.its = ps.pruned_its #Let's save this new phyloseq object saveRDS(ps.pruned.final.its, file.path("D:/Downloads/ITS/ps.pruned.final.its.RDS")) ``` ```{r} seqs.its <- getSequences(seqtab.frag) names(seqs.its) <- seqs.its #Alignment using DECIPHER alignment.its <- AlignSeqs(DNAStringSet(seqs.its), anchor=NA) #Multiple alignment using Phangorn phang.align.its <- phyDat(as(alignment.its, "matrix"), type="DNA") dm.its <- dist.ml(phang.align.its) treeNJ.its <- NJ(dm.its) fit.its = pml(treeNJ.its, data=phang.align.its) #Tree using phangorn fitGTR.its <- update(fit.its, k=4, inv=0.2) ptm <- proc.time() #This functions along with line 228 will serve to time how long it took to make the tree fitGTR.its <- optim.pml(fitGTR.its, model="GTR", optInv=TRUE, optGamma=TRUE, rearrangement = "stochastic", control = pml.control(trace = 0)) proc.time() - ptm write.tree(fitGTR.its$tree,"D:/Downloads/ITS/ITStree.tre") saveRDS(fitGTR.its$tree, "D:/Downloads/ITS/ITStree.Rdata") #UPGMA tree treeUPGMA.its <- upgma(dm.its) fit.its = pml(treeUPGMA.its, data=phang.align.its) fitGTR2.its <- update(fit.its, k=4, inv=0.2) ptm <- proc.time() #This functions along with line 228 will serve to time how long it took to make the tree fitGTR2.its <- optim.pml(fitGTR2.its, model="GTR", optInv=TRUE, optGamma=TRUE, rearrangement = "stochastic", control = pml.control(trace = 0)) proc.time() - ptm write.tree(fitGTR2.its$tree,"D:/Downloads/ITS/ITStree2.tre") saveRDS(fitGTR2.its$tree, "D:/Downloads/ITS/ITStree2.Rdata") ``` ```{r} midpoint.root(fitGTR.its$tree) -> rooted.its ```