phyloseq: Reproducible interactive analysis of microbiome census data using R

Collaborative development of phyloseq on GitHub.

Official stable release of phyloseq on Bioconductor.

Advances in DNA sequencing technology have dramatically improved the scope and scale of culture-independent investigations into microbial communities. There are effective software tools available to process raw DNA sequences and classify them taxonomically, and even provide some subsequent ecological analysis. However, additional project-specific statistical analysis is often needed, and in many cases these latter-stage custom analyses are difficult (or impossible) for peer researchers to independently reproduce.

To help address this we have created a new open-source R package, “phyloseq”, that provides a set of tools for importing, organizing, filtering, analyzing, and graphically summarizing phylogenetic sequencing data. It emphasizes the integration of taxa-abundance data with other experimental covariates, including qualitative and quantitative measurements of clinical or environmental samples, as well as phylogeny and taxonomy. The included importers bind related data types together into a single instance of a custom “experiment” class, with tools that automatically validate changes and propagate them across all relevant data. In general, researchers only need to manipulate this “experiment-level” object during analysis, which makes data smoothing and trimming less prone to mistakes and often reduces analysis commands to a single data argument. Among the supported analysis tools, phyloseq includes a native, optionally parallel implementation of Fast UniFrac (weighted and unweighted). This is incorporated into a more general “distance()” function that explicitly supports 43 other ecological distance methods, alongside a general “ordinate()” function that supports many different ordination methods. The phyloseq package leverages many of the tools available in R for ecological and phylogenetic analysis, graphics, statistics, and parallel/cloud computing, with an emphasis on flexible, publication-quality graphics built with ggplot2.
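
As a rough illustration of the "one data argument" workflow described above, the following sketch loads the “Global Patterns” example data, trims it, and passes the same object to the distance and ordination functions. It assumes phyloseq is installed from Bioconductor. The argument names follow recent releases of the package and may differ in older versions, and the cutoff of 200 taxa and the `SampleType` coloring are arbitrary choices made for this example only.

```r
# Minimal sketch, assuming a recent phyloseq release is installed.
library(phyloseq)

# Example dataset shipped with the package.
data(GlobalPatterns)

# Trimming acts on the whole experiment-level object: keep the 200 most
# abundant taxa (an arbitrary cutoff chosen for this illustration).
# The taxonomy table and tree are pruned to match the abundance table.
top_taxa <- names(sort(taxa_sums(GlobalPatterns), decreasing = TRUE))[1:200]
gp <- prune_taxa(top_taxa, GlobalPatterns)

# Sample-to-sample distances from a single data argument.
bray <- distance(gp, method = "bray")

# Weighted UniFrac uses the tree stored inside the same object.
wuf <- UniFrac(gp, weighted = TRUE)

# Ordination and a ggplot2-based plot, again from the same object.
ord <- ordinate(gp, method = "PCoA", distance = "bray")
p <- plot_ordination(gp, ord, color = "SampleType")
print(p)
```

Because `gp` carries the abundance table, sample covariates, taxonomy, and tree together, none of the later calls needs a separately subsetted tree or metadata table.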

Through in-package documentation and the phyloseq development site on GitHub, including the phyloseq wiki, we provide examples of custom interactive analysis using phyloseq with microbiome data from diverse environments, including the “Human Enterotype” and “Global Patterns” datasets. We emphasize ways to document analysis steps “as you go” so that the process can be easily and reliably shared with, and reproduced by, peers. We further emphasize tools for clustering, multiple testing, and the animation of results from time-series data, as well as planned infrastructure for integrating additional related data types (e.g. mass spectrometry, gene expression, metabolic networks). We also discuss the use of phyloseq with data from shotgun (non-amplified) metagenomic samples, and possibilities for future development. Anyone can download the complete source code, contribute code, or submit feature requests and bug reports on the phyloseq issues page.

In summary, phyloseq is a new open-source software tool for accessible and reproducible statistical analysis of phylogenetic sequencing data. It is now available on the web from both GitHub and Bioconductor.

