phyloseq: Reproducible interactive analysis of microbiome census data using R

Collaborative development of phyloseq on GitHub.

Official stable release of phyloseq on Bioconductor.

Advances in DNA sequencing technology have dramatically improved the scope and scale of culture-independent investigations into microbial communities. There are effective software tools available to process raw DNA sequences and classify them taxonomically, and even provide some subsequent ecological analysis. However, additional project-specific statistical analysis is often needed, and in many cases these latter-stage custom analyses are difficult (or impossible) for peer researchers to independently reproduce.

To help address this we have created a new open-source R package, “phyloseq”, that  provides a set of tools for importing, organizing, filtering, analyzing, and graphically-summarizing phylogenetic sequencing data. It emphasizes the integration of taxa-abundance data with other experimental covariates, including qualitative/quantitative measurements of clinical or environmental samples – as well as phylogeny and taxonomy. The included importers bind together related data types into a custom “experiment” class instance, with tools that automatically propagate/validate changes across all relevant data. In general, researchers only need to manipulate their “experiment-level” object during analysis, making data smoothing / trimming less prone to mistakes, and often simplifying analysis commands to just one data argument. Among the supported analysis tools, phyloseq includes a native, optionally-parallel implementation of Fast UniFrac (weighted and unweighted), and incorporates this in a more general “distance()” function that explicitly supports 43 other ecological distance methods, as well as a general “ordinate()” function that supports many different ordination methods. The phyloseq package leverages many of the tools available in R for ecological/phylogenetic analysis, graphics, statistics, and parallel/cloud computing, with emphasis on flexible publication-quality graphics built with ggplot2.

Through in-package documentation, the phyloseq development site on GitHub, including the phyloseq wiki, we provide examples of custom interactive analysis using phyloseq with microbiome data from diverse environments, including the “Human Enterotype” and “Global Patterns” datasets. We emphasize ways to document analysis steps “as you go” so that the process can be easily and reliably shared with – and reproduced by – peers.  We further emphasize tools for clustering, multiple-testing, the animation of results from time-series data, as well as planned infrastructure for integrating additional related data types (e.g. mass spectrometry, gene expression, metabolic networks). We also discuss the use of phyloseq with data from shotgun (non-amplified) metagenomic samples, and possibilities for future development. Anyone can download the complete source code, contribute code, as well as contribute through feature requests and bug reports on the phyloseq issues page.

In summary, phyloseq is a new open-source software tool for accessible and reproducible statistical analysis of phylogenetic sequencing data. It is now available on the web from both GitHub and Bioconductor.


Convert QIIME virtualbox vdi to VMware vmdk

Earlier this year I ran into the problem of having a QIIME job that required too much RAM for the virtualbox virtual machine that I had running on my Mac Pro. The 6 GB on my machine were more than adequate, as were the 8 processor cores, however only a fraction of this was available to the VM through the virtualbox platform. The job crashed with various system resource errors, even though the VM itself had been built for this exact purpose, and the job was not especially large relative to typical datasets.

After much Googling around, I discovered a number of different approaches, including using something called qemu. I found these to take a long time, and have several steps, and generally not work for me. I eventually found a post, which appears to be no longer available, explaining how to do this conversion with single command in the terminal:

> vboxmanage clonehd “QIIME-1.2.0-amd64.vdi” “qiime.vmdk” -format VMDK -variant standard

Note: the above is a single-line. vboxmanage is a command that should be available if you have installed VirtualBox.

This will take some time. Once finished, you should build a new VMware VM using the Fusion GUI, and select the vmdk file as the source. VMware will want to “upgrade” the VM because it will appear to be an older version. You should let it do this so that you have all the latest features of VMware available. Once built, you should be able to adjust the available system resources to as much as VMware will allow.

I have made this work on later versions of the QIIME virtual box, including the current version, QIIME-1.3.0. Please feel free to post replies if you find additional issues. I only know that this works on a Mac Pro running Snow Leopard. Other systems / OSes may vary, particularly Windows variants.

Thanks to the original post that showed me how to do this:

Convert VirtualBox (vdi) to VMWare (vmdk)
Posted At : August 27, 2010 5:49 PM | Posted By : Jeff Coughlin
Related Categories: Misc

Finally, it is possible that the virtual box implementation on a Mac has improved to the point that the available system resources is competitive with VMWare Fusion. Until that is clear, I hope that these instructions are useful, not just for QIIME, but any other situations in which you need to convert a vdi to a vmdk.

Error : package does not have a name space

Here is a frustrating knot I am beginning to unravel.

R has a namespace structure to help reduce errors resulting from functions, methods, objects, classes having the same name from two different packages. Makes sense. Those R objects that are supposed to be available to other packages are said to be exported, and the designation of exported objects occurs in a file named NAMESPACE in the package root (no file extension).

Seems straightforward enough, right? Here’s the rub:

Not all packages available from the CRAN repository actually have a NAMESPACE file. This means the package doesn’t have a specified namespace. A package without a NAMESPACE file should really only specify the dependency using the “Depends:” field in the DESCRIPTION file, which will basically load the whole package in the background when your custom package is loaded.

A related problem arises when your package depends on another package with an extremely limited NAMESPACE file. Check the NAMESPACE file of that packages source. If it really only has a one or two “export( stuff )” fields, perhaps only a a few lines total, chances are it is not a sufficient NAMESPACE specification for you to use the “Import:” field in your DESCRIPTION file. If you do try, chances are that you will receive errors during package check or package build. If you are unsure, you can start by listing all package dependencies in the “Depends: ” field, and then move one-by-one the unsure package names to the “Imports: ” field until you are satisfied.

This issue is mentioned in the Bioconductor instructions for package developers:

It includes some useful suggestions and scenarios in addition to those listed here.

When I originally wrote this post, I suggested manually modifying your NAMESPACE file in order to get the package to build. I no longer suggest that as an option. Ideally, you will create your NAMESPACE file automagically using Roxygen, and your Import/Depends specifications as well as in-source roxygen commands (e.g. @import) will be sufficient to generate a functional NAMESPACE file that does not cause warnings or errors during package build. It can be done!  Just takes some fiddling to understand what the errors are.

Finally, in general, I agree with the Bioconductor directive that all packages have a namespace specification. A thoughtful discrimination between external and internal methods is going to be useful for long-term stability of packages, especially as the number of R packages increases. For now, NAMESPACE does not appear to be required by CRAN, but that does not mean you should not make one, nor does it mean others will not find it extremely useful!

p.s.  If you use roxygen (or roxygen2) you will need to remove/comment all import tags from your roxygen-comments related to the namespace-absent packages. Commenting-out those tags is preferable, so that function-level dependencies are still documented in the source file in Roxygen notation, which is useful for others (as well as yourself).

Site-Specific Mobilization of Vinyl Chloride Respiration Islands by a Mechanism Common in Dehalococcoides



Vinyl chloride is a widespread groundwater pollutant and Group 1 carcinogen. A previous comparative genomic analysis revealed that the vinyl chloride reductase operon, vcrABC, of Dehalococcoides sp. strain VS is embedded in a horizontally-acquired genomic island that integrated at the single-copy tmRNA gene, ssrA.


We targeted conserved positions in available genomic islands to amplify and sequence four additional vcrABC -containing genomic islands from previously-unsequenced vinyl chloride respiring Dehalococcoides enrichments. We identified a total of 31 ssrA-specific genomic islands from Dehalococcoides genomic data, accounting for 47 reductive dehalogenase homologous genes and many other non-core genes. Sixteen of these genomic islands contain a syntenic module of integration-associated genes located adjacent to the predicted site of integration, and among these islands, eight contain vcrABC as genetic ‘cargo’. These eight vcrABC -containing genomic islands are syntenic across their ~12 kbp length, but have two phylogenetically discordant segments that unambiguously differentiate the integration module from the vcrABCcargo. Using available Dehalococcoides phylogenomic data we estimate that these ssrA-specific genomic islands are at least as old as theDehalococcoides group itself, which in turn is much older than human civilization.


The vcrABC -containing genomic islands are a recently-acquired subset of a diverse collection of ssrA-specific mobile elements that are a major contributor to strain-level diversity in Dehalococcoides, and may have been throughout its evolution. The high similarity between vcrABC sequences is quantitatively consistent with recent horizontal acquisition driven by ~100 years of industrial pollution with chlorinated ethenes.

Adding Sweave.sty and Rd.sty to your LaTeX path in Mac OS X

R framework for sty

Special .sty files for building R documentation

Okay, this extremely specific, but a headache that must be solved if you plan to create or modify R extensions (aka R source code packages). This is because R requires that you also build documentation through a process that is partially dependent on LaTeX, and a few specific .sty files. On Mac OS X (snow leopard) running the latest (2.13.1) version of R and the BasicTeX version of TeXLive (2011) via MacTeX.

I find that I prefer to use an unholy (but effective) combination of command-line and GUI for navigating complex distro structures, especially when they are within “hidden” portions of the OS X filesystem, where navigating directly in Finder is not straightforward. Assuming you have the BasicTeX-2011 installed and the a recent version of R installed AND both were installed with the default directory paths for OS X, the following code will bring up the a folder with the .sty files that you need to copy (from within the R framework):

> cd /Library/Frameworks/R.framework/Resources/share/texmf/tex

> open -a Finder latex

Once you have that folder in front of you, grab the “Rd.sty” and “Sweave.sty” files, copy them with command-c, then open the LaTeX folder that you need for the paste:

> cd /usr/local/texlive/2011basic/texmf-dist/tex/latex/

> open -a Finder base

And do your command-p to paste. When done adding those .sty files to the tex/latex/base/ directory, make sure to run:

> sudo texhash

The texhash command will update tex so that its aware of the new .sty files that you just added.

You’ll also notice that the R directory you broke into has an “upquote.sty” file in it. The upquote package is also required for building R package documentation, as are the “helvetic” and “courier” packages, but they are all available from CTAN, so I recommend installing them through your favorite TeX/LaTeX package installing method.

The all-command-line version of what I described above is:

> cp /Library/Frameworks/R.framework/Resources/share/texmf/tex/latex/Rd.sty /usr/local/texlive/2011basic/texmf-dist/tex/latex/base

> cp /Library/Frameworks/R.framework/Resources/share/texmf/tex/latex/Sweave.sty /usr/local/texlive/2011basic/texmf-dist/tex/latex/base

> sudo texhash

I Googled-around for a while to try and find an official repo approach to getting these R-depending LaTeX style files into the LaTeX path. Eventually, I gave up and hacked this approach. I hope it saves someone else some time. The LaTeX-side of these instructions is version dependent, but the R-side is not (necessarily) version dependent and may stay accurate for some time.