Next-generation sequencing

Counteracting bias in metagenomics

Ensuring metagenomic data integrity

Metagenomic approaches use processes and workflows similar to those of conventional studies (e.g., PCR and qPCR). In both cases, the first step is to obtain, isolate, and purify a nucleic acid sample – DNA for genomic studies and RNA for transcriptomic investigations. This sample is then amplified (and sequenced in the case of next-generation sequencing techniques), with the end product read and measured using specialized instrumentation. Finally, software is employed to process, compile, and analyze the resultant raw data.

What sets metagenomics apart from conventional approaches is scale. When designing and executing a metagenomics workflow, investigators must consider not only how to optimize nucleic acid yields pre- and post-amplification, but also how to ensure that the post-amplification product represents the original sample as accurately as possible. This brings the magnitude and proportion of gene expression into play across the thousands of organisms, each with a unique genetic profile, potentially present in a metagenomics sample. As a result, metagenomics studies are more difficult than conventional single-organism microbial studies.

What is bias and how is it introduced?

Unfortunately, bias – the systematic distortion of measured data values from the true values of the original sample – is present to some degree in all experimental processes, and metagenomics is no exception. From sample acquisition to sequencing and read assembly, bias can be introduced at any stage throughout the typical metagenomics workflow (1). To start, whether a sample is truly representative of the greater community that it is part of depends on sampling location and frequency. For example, when studying the gut microbiome, a fecal sample will yield a different microbiota than one obtained from the intestinal mucosa. Additionally, sample composition can be biased by how the samples were stored and transported to the laboratory.

Extracting nucleic acids for metagenomic studies typically first requires liberating them from cellular enclosures. Cell membranes and walls are broken down through chemical, enzymatic, or mechanical means. However, microbes differ in how easily they are lysed, resulting in dramatic differences in nucleic acid yield proportions. Changing extraction techniques can result in as much as a 10-fold difference in measured proportion of a given taxon from the same sample (2). As such, it is important for researchers to understand – and compensate for – the inherent biases introduced by their extraction protocol and/or reagents of choice (3).
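The effect of uneven lysis on measured proportions can be sketched with a toy calculation. The taxon names, abundances, and per-protocol extraction efficiencies below are hypothetical values chosen only for illustration, not measurements from any cited study, and the sketch assumes that each taxon's recovered DNA is simply its true abundance multiplied by an efficiency factor.

```python
def observed_proportions(true_abundance, efficiency):
    """Scale each taxon by its extraction efficiency, then renormalize to proportions."""
    recovered = {taxon: true_abundance[taxon] * efficiency[taxon]
                 for taxon in true_abundance}
    total = sum(recovered.values())
    return {taxon: amount / total for taxon, amount in recovered.items()}


# Hypothetical two-taxon community with equal true abundance.
true_abundance = {"easy_to_lyse": 0.5, "hard_to_lyse": 0.5}

# Hypothetical efficiencies: a gentle protocol recovers little DNA from the
# hard-to-lyse taxon, while a harsher protocol recovers both equally.
gentle_protocol = {"easy_to_lyse": 1.0, "hard_to_lyse": 0.1}
harsh_protocol = {"easy_to_lyse": 1.0, "hard_to_lyse": 1.0}

gentle_result = observed_proportions(true_abundance, gentle_protocol)
harsh_result = observed_proportions(true_abundance, harsh_protocol)

# Ratio of the hard-to-lyse taxon's measured proportion under the two protocols.
fold_difference = harsh_result["hard_to_lyse"] / gentle_result["hard_to_lyse"]

print(gentle_result)
print(harsh_result)
print(f"fold difference for hard_to_lyse: {fold_difference:.1f}")
```

Because proportions are renormalized, under-recovering one taxon inflates the apparent share of every other taxon in the same sample, even if those taxa were extracted perfectly.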

Sources of bias in shotgun sequencing

Similarly, individual sequencing techniques also possess their own biases. Primer construction, amplification protocol, genome size, and even whether the nucleic acid sample is single- or double-stranded have all been identified as sources of bias (3–5). For example, while shotgun sequencing creates random fragments for subsequent read generation, randomness does not automatically equate to uniformity, potentially resulting in the preferential amplification of some genomic or transcriptomic regions over others. Likewise, 16S sequencing relies on 16S ribosomal RNA (rRNA) as a phylogenetic marker to determine microbiome composition (3).

Sources of bias in 16S rRNA sequencing

16S rRNA sequencing targets the conserved regions that flank the hypervariable regions of the bacterial 16S rRNA gene. Analysis of this gene has been a mainstay of sequence-based bacterial analysis for decades (7). Analysis of the internal transcribed spacer (ITS) region allows the equivalent profiling of fungal communities (8).

Awareness leads to countermeasures

Bias is cumulative. A distortion introduced during sample preparation will be amplified during sequencing and highlighted during analysis. It is therefore critical for scientists to understand potential sources of bias and develop a thorough series of controls in an effort to compensate for it. Positive and negative controls can be used to identify variability between experimental runs using the same protocol and same sample, while databases such as the Microbiome Quality Control project can help demonstrate how changes in protocol translate to changes in the final result. Finally, researchers need to be aware that efforts to detect certain organisms of interest (e.g., a pathogen) may result in the masking of many others, thus creating a biased portrait of the microbial community (1). While fully removing bias may be impossible, understanding and mitigating bias will prove essential if metagenomics is to become a clinical diagnostic tool (1, 6).
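How bias compounds across steps, and how a positive control of known composition might be used to compensate for it, can be sketched as follows. This is a minimal illustration that assumes bias acts as a per-taxon multiplicative factor at each step and is consistent between the control and the sample, in the spirit of the model discussed in (1); whether that assumption holds for a given protocol has to be checked experimentally. All taxon names, factors, and compositions are hypothetical.

```python
from math import prod


def apply_bias(true_props, *step_factors):
    """Multiply each taxon by every step's factor, then renormalize to proportions."""
    raw = {taxon: p * prod(step[taxon] for step in step_factors)
           for taxon, p in true_props.items()}
    total = sum(raw.values())
    return {taxon: value / total for taxon, value in raw.items()}


def estimate_bias(known_props, measured_props):
    """Per-taxon bias (up to an arbitrary scale) from a control of known composition."""
    return {taxon: measured_props[taxon] / known_props[taxon] for taxon in known_props}


def correct(measured_props, bias):
    """Divide out the estimated bias and renormalize to proportions."""
    adjusted = {taxon: measured_props[taxon] / bias[taxon] for taxon in measured_props}
    total = sum(adjusted.values())
    return {taxon: value / total for taxon, value in adjusted.items()}


# Hypothetical per-step bias factors; their product is the cumulative bias.
extraction = {"A": 1.0, "B": 0.2, "C": 1.0}
amplification = {"A": 1.0, "B": 0.5, "C": 2.0}

# Positive control: a mock community with known, equal proportions.
mock_known = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
mock_measured = apply_bias(mock_known, extraction, amplification)
bias = estimate_bias(mock_known, mock_measured)

# An experimental sample processed with the same (hypothetical) protocol.
sample_true = {"A": 0.6, "B": 0.3, "C": 0.1}
sample_measured = apply_bias(sample_true, extraction, amplification)
sample_corrected = correct(sample_measured, bias)

print(sample_measured)
print(sample_corrected)
```

In this idealized setting the corrected proportions match the true ones. With real data, the correction can only be as good as the control and the consistency of the protocol between runs.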

References:

1. McLaren, M.R. et al. (2019) Consistent and correctable bias in metagenomic sequencing measurements. bioRxiv.
2. Costea, P.I. et al. (2017) Towards standards for human fecal sample processing in metagenomic studies. Nat. Biotechnol. 35(11), 1069–1076.
3. Brooks, J.P. et al. (2015) The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC Microbiol. 15, 66.
4. Brinkman, N.E. et al. (2018) Reducing inherent biases introduced during DNA viral metagenome analyses of municipal wastewater. PLoS One 13(4), e0195350.
5. Beszteri, B. et al. (2010) Average genome size: a potential source of bias in comparative metagenomics. ISME J. 4(8), 1075–1077.
6. Amrane, S. and Lagier, J.-C. (2018) Metagenomic and clinical microbiology. Hum. Microbiome J. 9, 1–6.
7. Johnson, J.S., Spakowicz, D.J., Hong, B. et al. (2019) Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis. Nat. Commun. 10, 5029.
8. Peay, K.G., Kennedy, P.G., and Bruns, T.D. (2008) Fungal community ecology: a hybrid beast with a molecular master. BioScience 58(9), 799–810.