Thursday, August 2, 2012

Pond et al., 2009

Pond, K.S., Wadhawan, S., Chiaromonte, F., Ananda, G., Chung, W.Y., Taylor, J., Nekrutenko, A., 2009. Windshield splatter analysis with the Galaxy metagenomic pipeline. Genome Research 19, 2144-2153.

These authors describe a novel software system that integrates several functions relevant to metagenomic analysis, generally defined as examining environmental samples of nucleic acids (typically DNA) without culturing the organisms present, and drawing inferences about the biological community (phylogenetic or functional) from the sequences. This is currently my favourite paper, for the quality of the writing, the density of information, the usefulness of the described methodology, and especially for the dataset they use to demonstrate their system. I'm going to mostly use direct quotes from this paper, because there's no way I could say any of this better by paraphrasing.

The abstract starts with: 
How many species inhabit our immediate surroundings? A straightforward collection technique suitable for answering this question is known to anyone who has ever driven a car at highway speeds. 
The Introduction describes the existing resources available for metagenomic analyses, and how those resources can be expected to deal with prokaryotic and eukaryotic data. For example, while protein sequences are often employed in studies of prokaryotes (including the use of predicted protein sequences and open reading frames (ORFs) from DNA sequences), the small fraction of eukaryote genomes that codes for proteins makes such strategies less useful for investigating community composition of eukaryotes.

The authors undertook two voyages on consecutive days in June of 2007, travelling from Pennsylvania to New Brunswick, in a minivan equipped with sticky tape on its bumper. They frequently refer to "windshield splatter", though this is slightly inaccurate, as the tape was affixed to the bumper of the vehicle, several decimeters closer to ground level than the windshield.

Jumping into the Results:
The most prominent difference between the two trips is in the number of reads identified with green plants (Viridiplantae): 10,242 in trip A versus 612 in trip B. It is unlikely that a two orders of magnitude difference reflects a genuine variation in species abundance of such a ubiquitous taxonomic group between the two trips. Because during each trip we collected two samples (left and right sides of the vehicle; see Methods) we were able to trace the majority (9317) of Viridiplantae reads to the left subsample. The most likely explanation for this overabundance is that a piece of plant material (e.g., a leaf or stem fragment) adhered to the collection surface. 
This illustrates a few of the striking differences between biology at the level of macroscopic organisms (e.g. most of botany, or the animals that a good naturalist would be expected to be familiar with) and biology at the level of microscopic organisms, especially bacteria. A single leaf or stem fragment contains thousands to millions of cells in direct contact with each other in a dense 3-dimensional structure. Bacterial cells in the environment are often found in biofilms, which are typically a single cell layer or only a few cell layers thick, and cover a tiny area. Or they occur as individual cells, separated by multiple cell-length-equivalents from their neighbours. Also, identification only to high taxonomic levels such as Order or Phylum is common in environmental microbiology, yet essentially unheard of for multicellular organisms - if it's big enough to see, it can be identified to Family or better by a person equipped with a readily-available guide. Yet they report a "green plant" - anything from roses to ginkgos is included in that high-level taxon!
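As a rough sketch of the arithmetic behind the quoted read counts, here is a short Python snippet. The three numbers come straight from the quote; the variable names are mine, and I'm assuming that all 9317 reads traced to the left subsample belong to trip A, which the quote implies but does not state outright.

```python
import math

# Read counts quoted from Pond et al. (2009) in the passage above.
trip_a_plant_reads = 10_242  # Viridiplantae reads, trip A
trip_b_plant_reads = 612     # Viridiplantae reads, trip B
left_plant_reads = 9_317     # Viridiplantae reads traced to the left subsample

# ASSUMPTION (mine, not stated in the quote): every left-subsample read
# counted here belongs to trip A.
remaining_trip_a = trip_a_plant_reads - left_plant_reads

# Fold differences between the trips, before and after setting aside
# the reads attributed to the suspected plant fragment.
raw_fold = trip_a_plant_reads / trip_b_plant_reads
adjusted_fold = remaining_trip_a / trip_b_plant_reads

# Share of trip A's plant reads that came from the left subsample.
left_share = left_plant_reads / trip_a_plant_reads

print(f"raw A/B fold difference:      {raw_fold:.1f}")
print(f"log10 of raw fold difference: {math.log10(raw_fold):.2f}")
print(f"trip A reads without left:    {remaining_trip_a}")
print(f"adjusted A/B fold difference: {adjusted_fold:.2f}")
print(f"left-subsample share:         {left_share:.1%}")
```

Under that assumption, 925 plant reads remain for trip A once the left subsample is set aside, about 1.5 times the trip B count, which fits the authors' explanation that a single stray plant fragment accounts for most of the gap.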
The list included unexpected entries such as the genus Homo even though the two trips were uneventful. Such matches are likely caused by road debris (which often includes roadkill) adhering to the collecting tape. Because few entries in NT and WGS databases are derived from, say, white-tailed deer (Odocoileus virginianus, a prevalent large mammal roadkill in the northeastern United States), reads truly representing this species are more likely to match abundant human sequences.
That first sentence, above, is probably my favourite sentence in the entire paper. "the two trips were uneventful." Just savour that, and ponder the meanings...
This is another striking difference between metagenomics (along with related microbiological sampling and study strategies) and the way multicellular eukaryotes are most often studied. No ecologist would normally need to describe the probability of mistaking a sample derived from a white-tailed deer for one from a human, yet here, because of the way the databases used for comparison and identification are structured, consideration of roadkill rates (and roadside clean-up efforts, presumably) is required to refine the raw identifications derived from comparisons of DNA sequences.

Existing tools for major steps in the environmental-sample-to-phylogeny experimental pipeline are difficult to use and to make work together, thus:
This is why our objective was to build a complete pipeline for homology-based taxonomic labeling of metagenomic reads that was self-contained and guided the user from data acquisition and QC, to database searches, and finally, actual metagenomic analyses. We demonstrate that the classification performance of our solution is on par with currently available applications...
Our second goal was to perform a eukaryotic metagenomic study on the organic matter collected on an automobile's windshield. Specifically, we were interested in addressing two questions: Can one identify eukaryotic taxa from random reads generated by the next-generation sequencing technology from environmental samples? and Is it possible to contrast species abundance between geographic locations? While this pilot analysis provides positive answers to both questions, it also raises important issues and limitations. 
I leave it to you to read this excellent paper and see the "issues and limitations" they describe.
And I *love* their methods: 
The front bumper of a 2006 Dodge Caravan ("The Wanderer") was divided at the license plate into "left" (passenger side) and "right" (driver side), and was taped with a double-sided carpet tape. On top of the carpet tape, a 3M 5414 Water Soluble Wave Solder Tape was affixed, exposing its sticky side. The tapes were applied on June 23, 2007, at 6 am EDT in State College, Pennsylvania, and removed in tubes containing Tris EDTA buffer at 12 pm EDT in Manchester, Connecticut. New tapes were again applied in Portland, Maine, at 5 pm EDT and removed in Moncton, New Brunswick, at 12 pm EDT the following day.
Note that they named the vehicle (with a pretty good name, in my opinion), and they describe "left" and "right" in opposition to the common standard among drivers - their description is based on a person standing in front of the vehicle, facing the windshield; their "left" is the vehicle's starboard side, and "right" is port. It's extremely unlikely "The Wanderer" is a right-hand-drive vehicle.

Their software is web-based and available at: www.usegalaxy.org