phyloseq: Reproducible interactive analysis of microbiome census data using R

Collaborative development of phyloseq on GitHub.

Official stable release of phyloseq on Bioconductor.

Advances in DNA sequencing technology have dramatically improved the scope and scale of culture-independent investigations into microbial communities. There are effective software tools available to process raw DNA sequences and classify them taxonomically, and even provide some subsequent ecological analysis. However, additional project-specific statistical analysis is often needed, and in many cases these latter-stage custom analyses are difficult (or impossible) for peer researchers to independently reproduce.

To help address this, we have created a new open-source R package, “phyloseq”, that provides a set of tools for importing, organizing, filtering, analyzing, and graphically summarizing phylogenetic sequencing data. It emphasizes the integration of taxa-abundance data with other experimental covariates, including qualitative/quantitative measurements of clinical or environmental samples – as well as phylogeny and taxonomy. The included importers bind together related data types into a custom “experiment” class instance, with tools that automatically propagate/validate changes across all relevant data. In general, researchers only need to manipulate their “experiment-level” object during analysis, making data smoothing/trimming less prone to mistakes, and often simplifying analysis commands to just one data argument. Among the supported analysis tools, phyloseq includes a native, optionally parallel implementation of Fast UniFrac (weighted and unweighted), and incorporates this in a more general “distance()” function that explicitly supports 43 other ecological distance methods, as well as a general “ordinate()” function that supports many different ordination methods. The phyloseq package leverages many of the tools available in R for ecological/phylogenetic analysis, graphics, statistics, and parallel/cloud computing, with emphasis on flexible publication-quality graphics built with ggplot2.

Through in-package documentation and the phyloseq development site on GitHub, including the phyloseq wiki, we provide examples of custom interactive analysis using phyloseq with microbiome data from diverse environments, including the “Human Enterotype” and “Global Patterns” datasets. We emphasize ways to document analysis steps “as you go” so that the process can be easily and reliably shared with – and reproduced by – peers. We further emphasize tools for clustering, multiple testing, and the animation of results from time-series data, as well as planned infrastructure for integrating additional related data types (e.g. mass spectrometry, gene expression, metabolic networks). We also discuss the use of phyloseq with data from shotgun (non-amplified) metagenomic samples, and possibilities for future development. Anyone can download the complete source code, contribute code, or submit feature requests and bug reports on the phyloseq issues page.

In summary, phyloseq is a new open-source software tool for accessible and reproducible statistical analysis of phylogenetic sequencing data. It is now available on the web from both GitHub and Bioconductor.


Convert QIIME VirtualBox vdi to VMware vmdk

Earlier this year I ran into the problem of having a QIIME job that required too much RAM for the VirtualBox virtual machine that I had running on my Mac Pro. The 6 GB on my machine were more than adequate, as were the 8 processor cores; however, only a fraction of this was available to the VM through the VirtualBox platform. The job crashed with various system resource errors, even though the VM itself had been built for this exact purpose, and the job was not especially large relative to typical datasets.

After much Googling around, I discovered a number of different approaches, including using something called qemu. I found that these took a long time, involved several steps, and generally did not work for me. I eventually found a post, which appears to be no longer available, explaining how to do this conversion with a single command in the terminal:

> vboxmanage clonehd "QIIME-1.2.0-amd64.vdi" "qiime.vmdk" --format VMDK --variant Standard

Note: the above is a single line. vboxmanage is a command that should be available if you have installed VirtualBox.
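If you expect to do this conversion more than once, the command above can be wrapped in a small shell function that checks a few things before starting the long-running clone. This is only a sketch: the function names `vmdk_name` and `convert_vdi` are hypothetical helpers of my own, the command is spelled `VBoxManage` as VirtualBox installs it, and the options use the double-dash spellings (`--format`, `--variant`). I have only reasoned through it, not run it against every VirtualBox version.

```shell
#!/bin/sh
# Sketch only: wraps the VirtualBox disk-clone command shown above.
# vmdk_name and convert_vdi are hypothetical helper names.

# Print the output name for a .vdi path by swapping the extension.
vmdk_name() {
    printf '%s\n' "${1%.vdi}.vmdk"
}

# Convert a .vdi disk image to .vmdk. Refuses to run if the input is
# missing, the output already exists, or VBoxManage is not installed.
convert_vdi() {
    cv_src=$1
    cv_dst=${2:-$(vmdk_name "$cv_src")}

    if [ ! -f "$cv_src" ]; then
        echo "input not found: $cv_src" >&2
        return 1
    fi
    if [ -e "$cv_dst" ]; then
        echo "output already exists: $cv_dst" >&2
        return 1
    fi
    if ! command -v VBoxManage >/dev/null 2>&1; then
        echo "VBoxManage not found; is VirtualBox installed?" >&2
        return 1
    fi

    # The actual conversion; this is the slow step.
    VBoxManage clonehd "$cv_src" "$cv_dst" --format VMDK --variant Standard
}

# Example (not run here):
#   convert_vdi "QIIME-1.2.0-amd64.vdi" "qiime.vmdk"
```

Leaving out the second argument writes the vmdk next to the original, with the same base name. The original vdi is not modified, so you can keep it as a fallback.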

This will take some time. Once finished, you should build a new VMware VM using the Fusion GUI, and select the vmdk file as the source. VMware will want to “upgrade” the VM because it will appear to be an older version. You should let it do this so that you have all the latest features of VMware available. Once built, you should be able to adjust the available system resources to as much as VMware will allow.

I have made this work on later versions of the QIIME VirtualBox image, including the current version, QIIME-1.3.0. Please feel free to post replies if you find additional issues. I only know that this works on a Mac Pro running Snow Leopard. Other systems / OSes may vary, particularly Windows variants.

Thanks to the original post that showed me how to do this:

Convert VirtualBox (vdi) to VMWare (vmdk)
Posted by Jeff Coughlin, August 27, 2010

Finally, it is possible that the VirtualBox implementation on a Mac has improved to the point that the available system resources are competitive with VMware Fusion. Until that is clear, I hope that these instructions are useful, not just for QIIME, but for any other situation in which you need to convert a vdi to a vmdk.