phyloseq: Reproducible interactive analysis of microbiome census data using R

Collaborative development of phyloseq on GitHub.

Official stable release of phyloseq on Bioconductor.

Advances in DNA sequencing technology have dramatically improved the scope and scale of culture-independent investigations into microbial communities. There are effective software tools available to process raw DNA sequences and classify them taxonomically, and even provide some subsequent ecological analysis. However, additional project-specific statistical analysis is often needed, and in many cases these latter-stage custom analyses are difficult (or impossible) for peer researchers to independently reproduce.

To help address this we have created a new open-source R package, “phyloseq”, that  provides a set of tools for importing, organizing, filtering, analyzing, and graphically-summarizing phylogenetic sequencing data. It emphasizes the integration of taxa-abundance data with other experimental covariates, including qualitative/quantitative measurements of clinical or environmental samples – as well as phylogeny and taxonomy. The included importers bind together related data types into a custom “experiment” class instance, with tools that automatically propagate/validate changes across all relevant data. In general, researchers only need to manipulate their “experiment-level” object during analysis, making data smoothing / trimming less prone to mistakes, and often simplifying analysis commands to just one data argument. Among the supported analysis tools, phyloseq includes a native, optionally-parallel implementation of Fast UniFrac (weighted and unweighted), and incorporates this in a more general “distance()” function that explicitly supports 43 other ecological distance methods, as well as a general “ordinate()” function that supports many different ordination methods. The phyloseq package leverages many of the tools available in R for ecological/phylogenetic analysis, graphics, statistics, and parallel/cloud computing, with emphasis on flexible publication-quality graphics built with ggplot2.

Through in-package documentation, the phyloseq development site on GitHub, including the phyloseq wiki, we provide examples of custom interactive analysis using phyloseq with microbiome data from diverse environments, including the “Human Enterotype” and “Global Patterns” datasets. We emphasize ways to document analysis steps “as you go” so that the process can be easily and reliably shared with – and reproduced by – peers.  We further emphasize tools for clustering, multiple-testing, the animation of results from time-series data, as well as planned infrastructure for integrating additional related data types (e.g. mass spectrometry, gene expression, metabolic networks). We also discuss the use of phyloseq with data from shotgun (non-amplified) metagenomic samples, and possibilities for future development. Anyone can download the complete source code, contribute code, as well as contribute through feature requests and bug reports on the phyloseq issues page.

In summary, phyloseq is a new open-source software tool for accessible and reproducible statistical analysis of phylogenetic sequencing data. It is now available on the web from both GitHub and Bioconductor.

Error : package does not have a name space

Here is a frustrating knot I am beginning to unravel.

R has a namespace structure to help reduce errors resulting from functions, methods, objects, classes having the same name from two different packages. Makes sense. Those R objects that are supposed to be available to other packages are said to be exported, and the designation of exported objects occurs in a file named NAMESPACE in the package root (no file extension).

Seems straightforward enough, right? Here’s the rub:

Not all packages available from the CRAN repository actually have a NAMESPACE file. This means the package doesn’t have a specified namespace. A package without a NAMESPACE file should really only specify the dependency using the “Depends:” field in the DESCRIPTION file, which will basically load the whole package in the background when your custom package is loaded.

A related problem arises when your package depends on another package with an extremely limited NAMESPACE file. Check the NAMESPACE file of that packages source. If it really only has a one or two “export( stuff )” fields, perhaps only a a few lines total, chances are it is not a sufficient NAMESPACE specification for you to use the “Import:” field in your DESCRIPTION file. If you do try, chances are that you will receive errors during package check or package build. If you are unsure, you can start by listing all package dependencies in the “Depends: ” field, and then move one-by-one the unsure package names to the “Imports: ” field until you are satisfied.

This issue is mentioned in the Bioconductor instructions for package developers:

http://www.bioconductor.org/developers/package-guidelines/#dependencies

It includes some useful suggestions and scenarios in addition to those listed here.

When I originally wrote this post, I suggested manually modifying your NAMESPACE file in order to get the package to build. I no longer suggest that as an option. Ideally, you will create your NAMESPACE file automagically using Roxygen, and your Import/Depends specifications as well as in-source roxygen commands (e.g. @import) will be sufficient to generate a functional NAMESPACE file that does not cause warnings or errors during package build. It can be done!  Just takes some fiddling to understand what the errors are.

Finally, in general, I agree with the Bioconductor directive that all packages have a namespace specification. A thoughtful discrimination between external and internal methods is going to be useful for long-term stability of packages, especially as the number of R packages increases. For now, NAMESPACE does not appear to be required by CRAN, but that does not mean you should not make one, nor does it mean others will not find it extremely useful!

p.s.  If you use roxygen (or roxygen2) you will need to remove/comment all import tags from your roxygen-comments related to the namespace-absent packages. Commenting-out those tags is preferable, so that function-level dependencies are still documented in the source file in Roxygen notation, which is useful for others (as well as yourself).

Adding Sweave.sty and Rd.sty to your LaTeX path in Mac OS X

R framework for sty

Special .sty files for building R documentation

Okay, this extremely specific, but a headache that must be solved if you plan to create or modify R extensions (aka R source code packages). This is because R requires that you also build documentation through a process that is partially dependent on LaTeX, and a few specific .sty files. On Mac OS X (snow leopard) running the latest (2.13.1) version of R and the BasicTeX version of TeXLive (2011) via MacTeX.

I find that I prefer to use an unholy (but effective) combination of command-line and GUI for navigating complex distro structures, especially when they are within “hidden” portions of the OS X filesystem, where navigating directly in Finder is not straightforward. Assuming you have the BasicTeX-2011 installed and the a recent version of R installed AND both were installed with the default directory paths for OS X, the following code will bring up the a folder with the .sty files that you need to copy (from within the R framework):

> cd /Library/Frameworks/R.framework/Resources/share/texmf/tex

> open -a Finder latex

Once you have that folder in front of you, grab the “Rd.sty” and “Sweave.sty” files, copy them with command-c, then open the LaTeX folder that you need for the paste:

> cd /usr/local/texlive/2011basic/texmf-dist/tex/latex/

> open -a Finder base

And do your command-p to paste. When done adding those .sty files to the tex/latex/base/ directory, make sure to run:

> sudo texhash

The texhash command will update tex so that its aware of the new .sty files that you just added.

You’ll also notice that the R directory you broke into has an “upquote.sty” file in it. The upquote package is also required for building R package documentation, as are the “helvetic” and “courier” packages, but they are all available from CTAN, so I recommend installing them through your favorite TeX/LaTeX package installing method.

The all-command-line version of what I described above is:

> cp /Library/Frameworks/R.framework/Resources/share/texmf/tex/latex/Rd.sty /usr/local/texlive/2011basic/texmf-dist/tex/latex/base

> cp /Library/Frameworks/R.framework/Resources/share/texmf/tex/latex/Sweave.sty /usr/local/texlive/2011basic/texmf-dist/tex/latex/base

> sudo texhash

I Googled-around for a while to try and find an official repo approach to getting these R-depending LaTeX style files into the LaTeX path. Eventually, I gave up and hacked this approach. I hope it saves someone else some time. The LaTeX-side of these instructions is version dependent, but the R-side is not (necessarily) version dependent and may stay accurate for some time.