These percentages were similar to those reported above based on the comparative method (the 3.3% of homopolymers that disagreed between the two datasets includes both Roche 454- and Illumina-specific homopolymer errors). 4, 5, 6 and Table 1). The DNA sample was divided into two aliquots of equal volume. For example, Roche 454 sequencing may be advantageous for resolving sequences with repetitive structures or palindromes or for metagenomic analyses based on unassembled reads, given the substantially longer read length (Fig. To eliminate the possibility that our results were biased by the selection of reference genomes, we used the reference assembly of Fibrobacter succinogenes subsp. Analyzed the data: CL. Thus, the results reported for Illumina based on the metagenome of Lake Lanier (47% G+C) should also be applicable to metagenomes with different G+C% contents. 2) should be independent of the NGS platform considered and broadly applicable to short-read sequencing. Thus, Roche 454 is advantageous with respect to gene calling when working with unassembled reads. In the reference genome approach, genes annotated in the Lanier.454 and Lanier.Illumina contigs were compared against their orthologs in publicly available genomes, and homopolymer errors were identified assuming the publicly available sequences contained no errors. Red bars represent the median, the upper and lower box boundaries represent the upper and lower quartiles, and the upper and lower whiskers represent the largest and smallest observations. The results for the isolate genomes were based on Illumina input reads that were about 5 times as many as the Roche 454 input reads to provide a ratio that was similar to that of the metagenomic comparisons (5:1).
Moreover, Illumina yielded longer and more accurate contigs (e.g., fewer truncated genes due to frameshifts) despite the substantially shorter read length relative to Roche 454 and the comparable average sequencing error in the raw reads of the two platforms (0.5% per base in our hands; Fig. 5), which was consistent with our observations on the assembly N50 values of the metagenomes (Fig. 4), despite the fact that reads were trimmed based on the same quality standard prior to the analysis. (A) Venn diagram showing the extent of overlapping and platform-specific raw reads between the Lanier.454 and Lanier.Illumina datasets (without assembly). For convenience, we called the two sequence data sets Lanier.454 and Lanier.Illumina, respectively. 7).
Gene sequences from assembled contigs were extracted and ClustalW2 [31] was used to align the sequences against their orthologs from the reference assembly. Conversely, protein sequences annotated on Illumina reads more frequently matched the wrong protein sequence in the reference assembly (mismatched genes) or did not match any reference gene (unmatched genes). (A) A's and T's contribute significantly more homopolymer errors than C's and G's. The alignments were used to count frameshift errors separately for each Illumina or Roche 454 dataset. Assemblies of isolate genome sequences (closed or high-draft) were downloaded from the NCBI RefSeq database (called reference assemblies for convenience); raw Illumina and Roche 454 sequencing reads were available through the Joint Genome Institute (JGI, www.jgi.doe.gov). All 2D plots (panels B, D, E, and F) represent the arithmetic average of the medians of each dataset for the same genome; Illumina medians were identical among replicate datasets; therefore, only one value is shown in panel E. The results show that Illumina sequence quality was affected less than that of Roche 454 by the G+C% content of the sequenced DNA (note the lower r-squared value and the slope in E). We aligned the assembled contigs from 9 Illumina and 8 Roche 454 assemblies from JGI data for the same genome against the TIGR reference assembly and calculated the base call error rate and gap opening error rate as described above for JGI genomes. These results revealed that, in general, the two platforms sampled the same fraction of the total diversity in the sample. Copyright: 2012 Luo et al. Graph shows the variation observed in assemblies from different (replicate) datasets of the same genome; red bars represent the median, the upper and lower box boundaries represent the upper and lower quartiles, and the upper and lower whiskers represent the largest and smallest observations.
It is, however, currently economically unfavorable to obtain coverage with the Roche 454 sequencer similar to that of the Illumina data (see Discussion below). Similarly for the Roche 454 data, a 2D-grid assembly was performed, varying the size of input sequences (20, 30, 40, …, 130) and the minimal aligned length to merge contigs or reads (30 bp, 40 bp, …, 100 bp) for Newbler.
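The 2D-grid described above can be enumerated with a few lines of Python. This is only a minimal sketch, not the authors' pipeline: the constant step of 10 between the listed values is inferred from the visible pattern, the unit of the input-size axis is not recoverable from the text, and the function names `parameter_grid` and `best_combination` are hypothetical.

```python
from itertools import product

# Assumption: both axes advance in steps of 10, as the listed values suggest.
# The unit of the input-size axis is not stated in the text as it survives.
INPUT_SIZES = list(range(20, 131, 10))       # 20, 30, ..., 130
MIN_OVERLAPS_BP = list(range(30, 101, 10))   # 30 bp, 40 bp, ..., 100 bp


def parameter_grid(input_sizes=INPUT_SIZES, min_overlaps=MIN_OVERLAPS_BP):
    """Return every (input size, minimal aligned length) pair of the grid."""
    return list(product(input_sizes, min_overlaps))


def best_combination(grid, score):
    """Return the grid point with the lowest score.

    `score` is any caller-supplied function mapping a grid point to a
    number, for example the measured base call error of that assembly.
    """
    return min(grid, key=score)


if __name__ == "__main__":
    grid = parameter_grid()
    print(len(grid), "assemblies would be needed for the full grid")
```

Each grid point corresponds to one assembly run, so the grid size directly gives the number of assemblies whose base call and gap opening errors have to be evaluated.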
Single-base sequencing errors increased by an average of 2% when non-homopolymer-associated errors were also taken into account for both platforms. Finally, gene calling on individual reads (as opposed to assembled contigs) was found to be less error prone in Lanier.454 reads than in Lanier.Illumina reads, mainly due to the longer read length. Illumina does not appear to share these limitations but it has its own systematic base calling biases [13]. Newbler (version 2.0) was used to assemble Lanier.454 with parameters set at 100 bp for overlap length and 95% for nucleotide identity. We also found that the systematic single-base errors associated with GGC-motifs in Illumina data reported recently [16] represented only a minor fraction of the non-homopolymer-associated errors (0.015% of the total bases analyzed, consistent with the frequency reported in the original study). 6). 4, which is based on isolate genome data). Consistent with these interpretations, we found that the single-base error of Illumina contigs increased by about 0.07% when we removed reads from the assembly so that the average coverage of the Illumina contigs would approximate the average coverage of the Roche 454 contigs (8×). Further, the single-base sequence and gap opening error rates of individual reads were typically higher by 0.5% and a factor of 10, respectively, for the Roche 454 compared to the Illumina reads (Fig. Lanier.454 and Lanier.Illumina reads were trimmed at both the 5′ and 3′ ends using a Phred quality score cutoff of 20. The matching gene of the assembly from the protein search using BLAT was compared to the gene matched by the raw read using Bowtie and instances of agreements (matched genes), disagreements (mismatched genes) and no match found (BLAT search did not match a gene while Bowtie mapping did) were counted and reported in Fig.
We also provide quantitative estimates of the errors in gene and contig sequences assembled from datasets characterized by different levels of complexity and G+C% content. Velvet was used to assemble each of these Illumina datasets with K-mer set at 31. It is critical to assess the quality of the derived assemblies; to this end, several studies have recently attempted to evaluate the sequencing errors and artifacts specific to each NGS platform. Venn diagram showing the extent of overlapping and platform-specific sequences of assembled contigs longer than 500 bp. We also measured the percent of the reference genome recovered in each assembly and the degree of chimerism of contigs as follows: A 500 bp window was used to slide through all assembled contig sequences longer than 500 bp with a step of 100 bp. succinogenes S85. Between 10 and 15 replicate datasets for each genome and each sequencing platform were analyzed; the exact number depended on the amount of total data available for each genome. 1B). succinogenes S85, which was sequenced independently by The Institute for Genomic Research (TIGR GenBank accession: CP002158.1; JGI GenBank accession: CP001792.1).
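The sliding-window step described above (a 500 bp window moved in 100 bp steps over contigs longer than 500 bp) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, contigs are assumed to be plain strings keyed by name, and dropping the trailing partial window is an assumption, since the text does not say how contig ends were handled.

```python
def sliding_windows(sequence, window=500, step=100):
    """Yield (start, subsequence) for every full-length window.

    Assumption: a trailing window shorter than `window` is dropped.
    """
    for start in range(0, len(sequence) - window + 1, step):
        yield start, sequence[start:start + window]


def contig_windows(contigs, window=500, step=100):
    """Map contig name -> list of windows, keeping only contigs
    strictly longer than `window`, as in the text."""
    return {
        name: list(sliding_windows(seq, window, step))
        for name, seq in contigs.items()
        if len(seq) > window
    }


if __name__ == "__main__":
    demo = {"contig_a": "A" * 800, "contig_b": "C" * 500}
    for name, windows in contig_windows(demo).items():
        print(name, [start for start, _ in windows])
```

Each window would then be searched against the reference genome; how the hits were turned into recovery and chimerism estimates is not shown here.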
Due to frameshifts caused primarily by homopolymer-associated errors in the derived consensus sequence of the contigs, genes from the Roche 454 assembly had fewer complete matches in the NR database relative to their Illumina counterparts (inset; results are based on a total of 72,709 gene sequences annotated on contigs that were shared between the two assemblies and were longer than 500 bp). The average G+C% content of the metagenome was 47.4%; thus, our results are not simply attributable to higher abundance of A's and T's in the metagenome.
Noticeably, due to the inherent biases of the Roche 454 sequencing approach to produce more frameshifts in A and T rich DNA (Fig. 2B). Competing interests: The authors have declared that no competing interests exist. Assemblies were obtained for each possible combination and the base call error and gap opening error of the resulting assemblies were determined as described for individual reads above. Hence, the majority of non-homopolymer-associated errors remain challenging to model and thus, to correct. Although low coverage contigs (e.g., 1× to 5×) are likely to contain a higher fraction of chimeric sequences than 0.2% according to our previous study [18], such contigs were rare in the results reported here, which included only contigs longer than 500 bp with average coverage 10× or higher (only about 3% of the contigs showed less than 5× coverage; Fig. https://doi.org/10.1371/journal.pone.0030087.g004. We also estimated the abundance of each contig shared between the two assemblies by counting the number of reads composing the contig, which can be taken as a proxy of the abundance of the corresponding DNA sequence in the sample [19]. The resulting datasets were 502 Mbp (Lanier.454) and 2,460 Mbp (Lanier.Illumina) in size; all our bioinformatic analyses and comparisons were based on these trimmed datasets. These results were attributable to a higher number of (artificial) frameshifts, caused by homopolymer-associated base call errors, present in the Lanier.454 versus the Lanier.Illumina assembled sequences. The slightly higher single-base accuracy of Roche 454 metagenomic reads relative to that of the isolate genome reads is presumably due to the use of the latest, optimized Roche 454 protocol in the former and slight differences in the performance of the sequencers used.
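The trimmed datasets mentioned above were obtained by trimming reads at both ends with a Phred quality score cutoff of 20. The exact trimming procedure is not described, so the following is only one plausible reading, written as a minimal sketch: bases are removed from each end until the first base with quality at or above the cutoff is reached. The function names are hypothetical, and qualities are assumed to be already decoded into integers.

```python
def phred_to_error_probability(q):
    """Convert a Phred score to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_read(sequence, qualities, cutoff=20):
    """Trim low-quality bases from both ends of a read.

    Assumption: only terminal bases below `cutoff` are removed;
    low-quality bases inside the read are left untouched.
    """
    start, end = 0, len(sequence)
    while start < end and qualities[start] < cutoff:
        start += 1
    while end > start and qualities[end - 1] < cutoff:
        end -= 1
    return sequence[start:end], qualities[start:end]


if __name__ == "__main__":
    seq, qual = trim_read("ACGTACGT", [5, 10, 30, 30, 12, 30, 19, 2])
    print(seq, qual)
    print(phred_to_error_probability(20))
```

A cutoff of 20 corresponds to an expected error probability of 1% per base, which is why it is a common threshold for end trimming.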
Finally, we calculated the average single-base call error rate and gap opening error rate of individual reads of each dataset as follows: raw reads were trimmed using the same standards as described above and subsequently mapped onto the corresponding reference assembly from RefSeq. The sample comprised DNA from the prokaryotic fraction of a planktonic microbial community of a temperate freshwater lake (Lake Lanier, Atlanta, GA); the complexity of the community sampled (in terms of species richness and evenness) was estimated to be comparable to that of surface oceanic communities, but lower than that of soil communities [17]. Correction: Direct Comparisons of Illumina vs. Roche 454 Sequencing Technologies on the Same Microbial Community DNA Sample. Consistent with the results from assembled contigs, we obtained 90% of overlapping sequences (80% when the overlapping sequences were expressed as a fraction of the total Illumina dataset) between the two datasets when we performed a similar analysis using all raw (not assembled) reads (Fig. Collectively, our results should serve as a useful practical guide for choosing proper sampling strategies and data processing protocols for future metagenomic studies. NGS platforms produce millions of short sequence reads, which vary in length from tens of base pairs (bp) to 800 bp. The two platforms agreed on over 90% of the assembled contigs and 89% of the unassembled reads as well as on the estimated gene and genome abundance in the sample (Fig. No additional external funding was received for this study. https://doi.org/10.1371/journal.pone.0030087.g007. From the human gastrointestinal tract to the ocean abyss, whole-genome shotgun metagenomics is revolutionizing our understanding of the structure, diversity, and function of microbial communities [1], [2], [3], [4]. 2B, inset).
We compared the reads from the Lanier.Illumina dataset against the Lanier.454 dataset to identify the fraction of reads shared between the two datasets. NGS platforms continue to improve, while new major advancements in sequencing chemistries are on the horizon [22], generating considerable excitement among microbial ecologists and engineers. More importantly, it is currently unclear how the above limitations affect the quality of the gene and genome sequences assembled from complex DNA samples, and whether the technologies provide different estimates of the genetic diversity in a sample due to their inherent chemistry and protocol differences. We would like to thank Chad Haase and Ryan Weil for their assistance with sequencing and Rachel Poretsky for critically reading the manuscript. For Lanier.Illumina, the SOAPdenovo [23] and Velvet [24] de novo assemblers were used to pre-assemble short reads into contigs using different K-mers. Next generation sequencing (NGS) technologies, such as the Roche 454, Illumina/Solexa, and, to a lesser extent, ABI SOLiD, have been cornerstones in this revolution [5], [6], [7]. Among these genes, Roche 454 data appeared to have the wrong (artificial) sequence more often than Illumina data. We obtained (after trimming) a total of 502 Mbp (450 bp long reads) and 2,460 Mbp (100 bp paired-end reads) from Roche 454 and Illumina sequencing, respectively, of the same community DNA sample. One aliquot was sequenced with the Roche 454 FLX Titanium sequencer (average read length, 450 bp) and the other one with the Illumina GA II (100×100 bp paired-end reads) at Emory University Genomics Facility. These errors were not observed in the Illumina data, presumably due to both the high sequence coverage that greatly facilitated the resolution of homopolymer ambiguities and the less pronounced sequencing biases of Illumina (Fig.