What Is Better: Mapping to a Reference or De Novo Assembly in RNA-Seq?

Genome Res. 2010 Oct; 20(10): 1432–1440.

Optimization of de novo transcriptome assembly from next-generation sequencing data

Yann Surget-Groba

Department of Zoology and Animal Biology, University of Geneva, 1211 Geneva 4, Switzerland

Juan I. Montoya-Burgos

Department of Zoology and Animal Biology, University of Geneva, 1211 Geneva 4, Switzerland

Received 2009 Dec 7; Accepted 2010 Jul 29.

Abstract

Transcriptome analysis has important applications in many biological fields. However, assembling a transcriptome without a known reference remains a challenging task requiring algorithmic improvements. We present two methods for substantially improving transcriptome de novo assembly. The first method relies on the observation that the use of a single k-mer length by current de novo assemblers is suboptimal for assembling transcriptomes where the sequence coverage of transcripts is highly heterogeneous. We present the Multiple-k method, in which various k-mer lengths are used for de novo transcriptome assembly. We demonstrate its good performance by assembling de novo a published next-generation transcriptome sequence data set of Aedes aegypti, using the existing genome to check the accuracy of our method. The second method relies on the use of a reference proteome to improve the de novo assembly. We developed the Scaffolding using Translation Mapping (STM) method, which uses mapping against the closest available reference proteome for scaffolding contigs that map onto the same protein. In a controlled experiment using simulated data, we show that the STM method considerably improves the assembly, with few errors. We applied these two methods to assemble the transcriptome of the non-model catfish Loricaria gr. cataphracta. Using the Multiple-k and STM methods, the assembly increases in contiguity and in gene identification, showing that our methods clearly improve quality and can be widely used. The new methods were used to successfully assemble the transcripts of the core set of genes regulating tooth development in vertebrates, while classic de novo assembly failed.

Transcriptomic information is used in a wide range of biological studies and provides fundamental insights into biological processes and applications such as levels of gene expression (Torres et al. 2008), gene expression profiles after experimental treatments or infection (Hegedus et al. 2009), discovery of tissue biomarkers (Disset et al. 2009), cancer gene expression (Morrissy et al. 2009), gene discovery (Hahn et al. 2009), gene content (Reinhardt et al. 2009), and isolation of conserved ortholog genes for phylogenomic purposes (Hughes et al. 2006; Dunn et al. 2008), among others.

Nevertheless, transcriptomic information is generally abundant only for model organisms, on which international research effort and funding are concentrated, setting aside non-model organisms. This situation is drastically changing with the emergence and generalization of next-generation DNA sequencing technologies that tremendously reduce cost, labor, and time, providing the opportunity to conduct large-scale genomic projects at lower cost for non-model organisms.

The most economical next-generation sequencing technologies are those that generate short sequence reads, typically in the range of 30–100 bp, and are the method of choice for "re-sequencing" model organisms (e.g., the Illumina technology) (Porreca et al. 2007). In this case, the analysis is performed by mapping the short reads onto the reference genome or transcriptome. This approach has recently been used for transcriptome profiling in a method called RNA-seq that is expected to allow major breakthroughs in transcriptome analysis (Mortazavi et al. 2008; Nagalakshmi et al. 2008; Wilhelm et al. 2008; Wang et al. 2009; Montgomery et al. 2010).

However, de novo assemblies of sequences without a known reference using short reads have been considered difficult (Schuster 2008), and researchers working on non-model organisms have frequently turned to the more expensive longer sequence reads (250–450 bp) obtained by the 454 Life Sciences (Roche) technology (Margulies et al. 2005; e.g., Vera et al. 2008; Hale et al. 2009). Nevertheless, the applicability of short-read methods as an appropriate choice for de novo transcriptome assembly has recently received attention. By reassembling the transcriptome of a species with a known genome using a de novo assembler, Gibbons et al. (2009) have shown that short reads can be of considerable utility for assembling transcriptomes of non-model organisms.

Despite the rapid development of assemblers able to efficiently handle more and more reads (Zerbino and Birney 2008; Simpson et al. 2009), transcriptome assembly is still difficult. For instance, elongation of contigs is impeded not only by repeats or allelic variations but also by alternatively spliced transcripts. Moreover, while genomic sequencing coverage is generally uniform across the genome, transcriptome coverage is highly variable, depending on gene expression level, which precludes the use of coverage information to resolve repeated motifs (Zerbino and Birney 2008). Therefore, the quality of a de novo transcriptome assembly is highly dependent on the user-defined sequence overlap length between two reads required to consider them as contiguous (referred to as the k-mer length). The best k-mer value for a given assembly depends on the sequencing depth, the read error rate, and the complexity of the genome/transcriptome to be assembled (Simpson et al. 2009). For transcriptome assembly, in which coverage is not uniform, using a higher k-mer length will theoretically result in a more contiguous assembly of highly expressed transcripts. Conversely, poorly expressed transcripts will be better assembled if lower k-mer lengths are used (Zerbino and Birney 2008). These theoretical expectations have been experimentally supported in a controlled de novo transcriptome assembly of a model organism (Gibbons et al. 2009). The choice of the k-mer length is then a subjective decision of whether to emphasize transcript diversity by using a short k-mer length (which will lead to the assembly of numerous but highly fragmented transcript fragments), or to emphasize contiguity by using a longer k-mer length (which will allow the recovery of longer transcript fragments but at the price of a lower transcript diversity). Hence, in most cases, an intermediate k-mer length is chosen to reach a compromise between these two extremes. Therefore, an approach for de novo transcriptome assembly that takes advantage of the assembly performances of various k-mer lengths is highly desirable.
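The interplay between expression level and k-mer length can be illustrated with the relation between nucleotide coverage C and k-mer coverage C_k given in the Velvet documentation, C_k = C (L − k + 1) / L, where L is the read length. The following is only a minimal sketch: the function name and the coverage values are illustrative assumptions and are not taken from the study.

```python
def kmer_coverage(nucleotide_coverage: float, read_length: int, k: int) -> float:
    """Return k-mer coverage C_k = C * (L - k + 1) / L for reads of length L."""
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return nucleotide_coverage * (read_length - k + 1) / read_length


# Hypothetical example: 35-bp reads, one weakly and one highly expressed transcript.
for k in (19, 21, 25, 29):
    low = kmer_coverage(5.0, 35, k)
    high = kmer_coverage(200.0, 35, k)
    print(f"k={k}: low-expression C_k={low:.2f}, high-expression C_k={high:.2f}")
```

For a fixed read length, raising k lowers the k-mer coverage of every transcript, so weakly expressed transcripts are the first to drop below the coverage an assembler needs to extend a contig.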

The analysis of genomes or transcriptomes of non-model organisms can be enhanced by performing comparisons with the genome of closely related model organisms. For instance, algorithms have been proposed for boosting the assembly of bacterial genomes using available genomes of related species (Salzberg et al. 2008). In eukaryotes, the transcriptome of a non-model plant (Pachycladon enysii), which recently diverged from the reference Arabidopsis thaliana (7–10 million years ago, Mya), was analyzed using a combination of classic read mapping against the reference transcriptome, de novo assembly, and contig mapping against the reference genome using BLAST (Collins et al. 2008). Likewise, the brain EST data set of the social wasp Polistes metricus, which was generated by next-generation 454 sequencing, was successfully analyzed by comparing it to the complete genome of the honey bee, from which it diverged 100 to 150 Mya (Toth et al. 2007). An unexplored extension of these comparisons is the use of a closely related model organism to serve as a template for improving the assembly of the transcriptome of a non-model organism. However, at the nucleotide level this approach is limited to those non-model organisms that possess a very close relative with a complete genome. This limitation is due to the increasing amount of nucleotide differences between ortholog genes with increasing evolutionary distance, which will rapidly lead to the absence of good-quality matches between the two species. Nonetheless, differences in amino acid sequence accumulate more slowly than nucleotide differences with increasing evolutionary distance, so comparing sequence translations against a reference proteome might be a promising approach to improve the assembly of the coding fraction of the transcriptome of non-model organisms, even if the reference model is distantly related.

Here we present two methods for improving de novo transcriptome assembly, which meet the expectations presented above. The principle of the first method is to perform multiple assemblies with diverse k-mer lengths and to retain the best part of each one to form the final assembly. In the second method, we assemble the coding contigs into scaffolds by mapping their translation on a distant reference proteome. The pipeline implementing this method can be applied to the results of any de novo transcriptome assembly as long as a reference genome or transcriptome of an evolutionarily related species is available. We then validated the efficiency and the accuracy of our two methods by using simulated and real data from species with a known genome. To demonstrate the efficiency of both methods on real data from non-model organisms, we applied them in assembling reads from a next-generation short-read sequencing experiment that we performed on the transcriptome of the Neotropical catfish Loricaria gr. cataphracta. We demonstrate the practical efficiency of these methods by their success in recovering the full set of transcripts belonging to the gene network regulating dental development, while classic methods failed.

Results

De novo transcriptome assembly with multiple k-mer values

The basic assumption of this new method is that different k-mers will allow the assembly of transcripts with different abundances. To verify this assumption, we first assembled the recently published next-generation transcriptome sequence data set of the yellow fever mosquito, Aedes aegypti (Gibbons et al. 2009), with different k-mer values (Table 1). We then estimated the abundance of the transcripts assembled with the different k-mer values based on their read coverage (digital gene expression). This analysis was also conducted on a simulated 35-bp RNA-seq data set based on the set of zebrafish cDNAs from Ensembl (Supplemental Table S1). As expected, the average coverage of assembled contigs increases with increasing k-mer values on both the real (Table 2) and the simulated (Supplemental Table S2) data sets. However, it is worth noting that the standard deviation of transcript abundances also increases with higher k-mer values. Hence, low k-mer values allow the assembly of numerous transcripts with relatively low abundance, while larger values allow the assembly of a lower number of transcripts but with a much larger range of abundances. Given the different characteristics of the transcripts assembled with different k-mer lengths, combining the results obtained with diverse k-mer lengths into a final assembly seems to be a promising way of improving de novo assembly of sequences with very variable coverage levels, as is the case for non-normalized transcriptomes.

Table 1.

Summary statistics of the assemblies used to assess the performance of the Multiple-k de novo assembly method based on the Aedes aegypti RNA-seq data set (Gibbons et al. 2009)

Table 2.

Coverage of the contigs assembled from the Aedes aegypti data set with different k-mer lengths

To take advantage of the assembling properties of different k-mer lengths, we have designed two alternative methods of de novo assembly that use multiple k-values. First, we designed the "subtractive Multiple-k" method, which starts the assembly with a high k-mer length and then uses the nonassembled reads of this assembly to perform another assembly with a smaller k-mer value. This procedure can be reiterated. The second method, which we called the "additive Multiple-k" method, pools the contigs obtained with different k-mer lengths and subsequently removes redundant contigs (see Methods). We investigated the performances of the two alternative methods using the transcriptome of Ae. aegypti (Gibbons et al. 2009), for which the complete genome is known (Nene et al. 2007). Hence, it is possible to evaluate the number of transcripts recovered by the different assembly methods, as well as the proportion of the reference transcriptome covered by the assembled contigs. We also compared the results obtained with these new approaches to the optimum assembly obtained with a single k. We first carried out Velvet assemblies using k-mer lengths from 19 to 29 and selected the assembly obtained with k = 21 since it gives a good compromise between the number of contigs and their length (as was already determined by Gibbons et al. 2009).
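The pooling step of the additive Multiple-k method can be sketched as follows. This is a simplified illustration rather than the procedure described in the Methods: it assumes that a contig is redundant only when it is an exact substring of a longer pooled contig on either strand, and the function names and toy sequences are hypothetical.

```python
from __future__ import annotations


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def pool_and_deduplicate(assemblies: dict[int, list[str]]) -> list[str]:
    """Pool contigs from all k values and drop those contained in a kept contig.

    `assemblies` maps each k-mer length to the contigs assembled with it.
    Contigs are visited from longest to shortest, so a contig is discarded
    when it (or its reverse complement) occurs inside a contig already kept.
    """
    pooled = sorted(
        {contig.upper() for contigs in assemblies.values() for contig in contigs},
        key=lambda s: (-len(s), s),
    )
    kept: list[str] = []
    for contig in pooled:
        rc = reverse_complement(contig)
        if not any(contig in longer or rc in longer for longer in kept):
            kept.append(contig)
    return kept


# Toy input: contigs from two hypothetical assemblies (k = 19 and k = 27).
toy = {19: ["AAACCCGGG", "TTTT"], 27: ["GGAAACCCGGGTT", "CCCGGGTTT"]}
print(pool_and_deduplicate(toy))
```

Real data would call for a tolerance to mismatches, for example through a sequence-clustering tool, since contigs assembled with different k values rarely overlap perfectly.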

We tested two assembly variants (A and B) of the subtractive Multiple-k method: assembly A with two k-values (k = 27 followed by k = 19, which are the two most extreme k-values still displaying interesting statistics; see Table 1), and assembly B with three k-values (k = 27, then k = 21, and finally k = 19). As can be seen from Table 1, the subtractive Multiple-k method does not provide a clear improvement over the Velvet de novo assembly (the assembly shows the lowest N50 and a lower number of transcripts recovered than the single-k method) and will not be discussed further.

The additive Multiple-k method was performed with all the k-values between 19 and 29. The final assembly statistics indicate that this approach outperforms all others (Table 1). The number of contigs >100 bp and total length are both doubled as compared to the single-k Velvet assembly. Interestingly, this marked increase is accompanied by a higher N50 (median length-weighted contig length) (Zerbino and Birney 2008), indicating a substantial improvement in contiguity. Furthermore, the reference transcripts recovered are more numerous, reaching almost 40% of the reference transcriptome, and base coverage of reference transcripts is doubled as compared to the single-k Velvet assembly.
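For readers unfamiliar with the N50 statistic, a minimal sketch of its computation follows. It assumes the common definition (the length of the shortest contig in the smallest set of longest contigs that together hold at least half of the assembled bases); the function name and the example lengths are illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


print(n50([10, 20, 30, 40]))
```

Because N50 weights each contig by its length, adding many short contigs lowers it only slightly, which is why it is usually reported alongside the number of contigs and the total length.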

It can be noted that the number of transcripts identified with the Multiple-k method is similar to the number of transcripts identified with k = 19, but that the number of contigs is much higher in the former. This is due to the fact that the Multiple-k method pools contigs assembled with various k-mer lengths and covering different parts of each transcript. Hence, for a given transcript, more sequence information is available in the Multiple-k assembly. This situation may be explained by splice variants with different abundances resulting in a heterogeneous coverage of exons, which are therefore assembled with different k-mer lengths. Variation in the coverage within a transcript may also be due to regions more difficult to reverse-transcribe, amplify, or sequence (like repeated motifs or strong secondary structures), to genes with alternative transcription start or stop sites, or to stochasticity.

When comparing the set of transcripts identified with the single-k Velvet assembly and the additive Multiple-k method, we identified 6697 transcripts in common between the two methods, while 1090 new transcripts were found only with the latter. Furthermore, the contigs belonging to this common set (30,749 contigs) were longer when they were assembled with the Multiple-k method (N50 = 218 vs. 180; maximum length = 2920 vs. 2017; total length = 6.3 vs. 5.5 Mb for the additive Multiple-k method and single-k Velvet assembly, respectively). Hence, the additive Multiple-k method not only improves the transcript diversity of the assembly but also increases contiguity.

Scaffolding using translation mapping (STM)

De novo transcriptome assemblies may be substantially improved by the addition of a scaffolding step where the contigs belonging to a single transcript are ordered, oriented, and assembled. This scaffolding step is generally performed using paired-end libraries, but the generation of such libraries doubles the price of a sequencing experiment. An innovative way of scaffolding without incorporating additional sequence is to use the proteome of a related species as a reference to assemble contigs belonging to the same coding sequence. We have designed a method called "Scaffolding using Translation Mapping" (STM) that exploits the fact that, by translating contigs into amino acid sequences, it is possible to search for orthologous regions in a reference proteome, even when it belongs to a distantly related organism. In this way, all translated contigs matching the same reference protein can be assembled into a scaffold, provided that they pass some accuracy checks (a diagrammatic representation of the pipeline is presented in Fig. 1). We developed two flavors of this method, STM⁻ and STM⁺, respectively without and with the incorporation of the orphan reads not included in the initial assembly; STM⁺ is applicable when reads are long enough (typically longer than 70 bp).

Figure 1.

Diagrammatic representation of the STM method. This pipeline can either use just contigs (STM⁻ method) or, if reads are long enough, contigs plus unassembled reads (STM⁺ method). These contigs/reads are mapped on the reference proteome using BLASTX. When a contig has no significant hit or is the only one to map on a given reference protein, it cannot be further assembled and is directed into the final assembly. When there are several hits on the same reference protein (Box 1: an example with five hits), their relative positions are recorded on the reference scale. If there is an overlap in the positioning of several hits (here hits 2, 3, and 4 form an overlap group), their consensus sequence is computed, and when the number of ambiguities is below a user-defined threshold, the consensus is accepted and a scaffold is constructed (Box 2: dashed line represents N's added to join the contigs). Otherwise, the consensus is rejected and the contigs of the overlap group are assembled using CAP. If the result of this assembly step is a single "super-contig," it is accepted and a scaffold is constructed (Box 3). If more than one super-contig is obtained (Box 4), the overlap group assembly is rejected and the contigs are placed as independent transcripts in the final assembly. If present, the other nonoverlapping hits (or nonambiguous overlap groups) are joined into a scaffold, which is incorporated into the final assembly.
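The formation of overlap groups described in the caption above can be sketched as an interval-merging pass over the hits on one reference protein. This is only an illustration of the grouping idea, not the authors' pipeline: the `Hit` type, the function name, and the coordinates are hypothetical, and the consensus and CAP steps are not modeled.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    contig: str
    start: int  # first aligned residue on the reference protein (1-based)
    end: int    # last aligned residue on the reference protein (inclusive)


def overlap_groups(hits: List[Hit]) -> List[List[Hit]]:
    """Group hits whose positions on the reference protein overlap."""
    groups: List[List[Hit]] = []
    group_end = 0
    for hit in sorted(hits, key=lambda h: (h.start, h.end)):
        if groups and hit.start <= group_end:
            # The hit starts inside the current group: extend that group.
            groups[-1].append(hit)
            group_end = max(group_end, hit.end)
        else:
            # The hit starts after the current group ends: open a new group.
            groups.append([hit])
            group_end = hit.end
    return groups


# Five invented hits on one protein; hits 2, 3, and 4 overlap, as in Box 1.
hits = [
    Hit("hit1", 1, 50),
    Hit("hit2", 80, 140),
    Hit("hit3", 120, 180),
    Hit("hit4", 170, 230),
    Hit("hit5", 260, 300),
]
for group in overlap_groups(hits):
    print([h.contig for h in group])
```

Groups with a single member correspond to the nonoverlapping hits that can be joined directly with N's, whereas groups with several members would go through the consensus check before being accepted.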

In order to assess the accuracy of our method, we tested it using a simulated de novo transcriptome assembly of the zebrafish (Danio rerio), a model organism with a highly studied and richly annotated genome (for simulation details, see Methods). In this way, we could estimate the number of misassemblies by comparing the scaffolds obtained to the original transcriptome using BLASTN (Altschul et al. 1997). We considered as misassemblies all scaffolds that did not perfectly match an existing transcript of the original transcriptome. To identify the error rate that can be attributed solely to the STM method, we first determined the number of misassembled contigs due to the Velvet de novo assembler via BLASTN against the initial transcriptome, which resulted in 668 erroneous contigs (0.56% of the total) (Table 3). We then performed the STM method on the zebrafish de novo assembly using the proteome of the stickleback (Gasterosteus aculeatus) as a reference. The results of the STM method show a clear improvement of the transcriptome assembly, either with STM⁺ or STM⁻ (Table 3). The number of contigs >100 bp was decreased by ∼10%, coupled with a marked increase in N50 of 31% and 42% for STM⁻ and STM⁺, respectively. The STM method also leads to a much longer maximum scaffold length and a greater total length, especially for STM⁺, which globally shows better assembly statistics than STM⁻ (Table 3). Nevertheless, the assembly error rate specific to STM⁻ is 1.16% (1.70% when including the error rate of the de novo assembly), while it is 2.42% for STM⁺ (2.91% when including the de novo assembly error rate). This test indicates that STM⁺ performs best, yet with a slightly higher error rate than STM⁻, which also substantially enhances the assembly with a minor risk of error.

Table 3.

Assembly statistics and misassembly rate for the Velvet de novo assembly and STM method applied to the Danio rerio simulated data set

We then investigated whether the efficiency of this scaffolding method varied depending on the characteristics of the transcripts being assembled. First, we classified the contigs that were most efficiently scaffolded with our new method (a set of 38 transcripts showing a 20-fold or greater increase in length after scaffolding) according to their Gene Ontology, showing no particular bias in GO categories (Supplemental Table S3). Then, to check whether the assembly efficiency depended on the known transcript length or on its abundance, we measured the correlation between the contigs' length increase after scaffolding and the real transcript length (as given in the reference transcriptome), and its read coverage. Both these correlations were quite low (Spearman rank correlation coefficient of 0.1218 and −0.1336, respectively), suggesting a lack of strong effect of transcript length or abundance on the STM method's efficiency.
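The Spearman rank correlation used above is the Pearson correlation computed on ranks, with tied values receiving their average rank. A self-contained sketch is given below; the function names and the toy numbers are illustrative and unrelated to the data of the study.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if an input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: length increase after scaffolding vs. transcript length (invented).
print(round(spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]), 3))
```

A coefficient near zero, as reported for both transcript length and read coverage, indicates that the ranking of transcripts by length increase is largely unrelated to their ranking by either property.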

Optimized de novo transcriptome assembly of the catfish Loricaria

Having demonstrated in controlled conditions the accuracy and high performance of the two new methods for de novo transcriptome assembly, we then used them to assemble the transcriptome of a non-model organism: the Neotropical catfish L. gr. cataphracta. The genus Loricaria belongs to the catfish family Loricariidae, the most species-rich family of freshwater fishes in the Neotropics. All loricariids share the presence of extra-oral and post-cranial denticles. These denticles develop in the same way as teeth do, and similar morphogenetic mechanisms underlying their formation may be inferred (Sire 2001).

Deciphering the genetic control of the development of loricariids' ectopic teeth may contribute to the understanding of tooth formation and regeneration in vertebrates and will certainly shed light on the evolutionary implications of bearing such denticles, particularly on the great species diversification of loricariids. To this aim, we sequenced and assembled the transcriptome of L. gr. cataphracta embryos to reconstruct the sequences of the transcripts known to control tooth development. Recently, the genes forming the core dental regulatory network have been identified and represent a conserved set of 14 genes that provides the molecular machinery and developmental constraints for all teeth, either jaw or pharyngeal teeth (Fraser et al. 2009). Out of these 14 genes, five are duplicated in teleosts, resulting in a set of 19 genes.

The transcriptome of whole embryos of L. gr. cataphracta, from stages ranging from the end of gastrula until hatching, was sequenced with 71-bp single-end reads on the Illumina Genome Analyzer II platform. One sequencing lane was used and resulted in 9.56 million reads. Reads were first assembled using Velvet and a range of single k-mer lengths. For this step we kept the assembly obtained with k = 41 as it gave a good compromise between the number of contigs obtained, the N50, and the number of unigenes recovered (Supplemental Table S4). Next, the additive Multiple-k method was performed, pooling the assemblies obtained with values of k = 37, 41, 45, 49, 53, 57, and 61. Summary statistics (Table 4) show that the additive Multiple-k assembly makes use of 38.7% more reads than the single-k Velvet assembly. It also displays twice as many contigs >100 bp and a higher N50, indicative of an increased contiguity. All other assembly statistics are also markedly improved. In particular, the additive Multiple-k method allowed the identification of about 2000 additional genes, representing an increase of >20% as compared to the single-k Velvet assembly (Table 4).

Table 4.

Statistics of de novo assembly of Loricaria gr. cataphracta transcriptome

We applied the two flavors of the STM method to the additive Multiple-k de novo assembly. STM⁻ was performed with a minimum contig length threshold of 73 bp. Its resulting summary statistics indicated a reduction in the number of contigs/scaffolds >100 bp due to the assembly of some of them into scaffolds. Out of the 166,490 contigs >73 bp, 23,675 (14.2%) were successfully incorporated into 6613 scaffolds. As expected, a substantial increase is observed for N50 (+27.4%), and particularly for the maximum contig/scaffold length, which reaches >82 kb (Table 4). However, the number of different transcripts has slightly decreased (−0.88%). This decrease is probably due to the few instances where two or more contigs belonging to different transcripts were erroneously joined into a single scaffold.

The STM⁺ method integrated 4.6% more reads and further improved the assembly, as indicated by the summary statistics (Table 4). Notably, this led to a marked increase in the number of unigenes identified (+72.3% compared to the single-k Velvet de novo assembly, +42% compared to the Multiple-k method). Using the STM⁺ method, we recovered 246 transcripts longer than the longest transcript obtained without it. The longest transcript identified was a transcript coding for titin b (ttnb), and the size distribution of these 246 long transcripts is presented in Supplemental Figure S1.

We then examined whether our new methods allowed a better assembly of the 19 genes representing the core set of dental development regulatory genes. Using classic de novo transcriptome assembly methods (Velvet with k = 41), we were able to retrieve transcript fragments of seven out of the 19 genes. By implementing the Multiple-k method, we identified an additional transcript belonging to the set of tooth development genes, and the sequence length of the seven transcript fragments already recovered was in most cases substantially increased (Table 5). Finally, the use of the Multiple-k together with the STM⁺ methods resulted in the assembly of transcript fragments of the full set of 19 genes, with a marked increase in sequence length for those transcripts recovered earlier (Table 5).

Table 5.

Assembly of the core set of dental development regulatory genes in Loricaria gr. cataphracta

Discussion

Improving de novo transcriptome assembly

The emergence of next-generation sequencing technologies has impressively enlarged the realm of transcriptomic analyses. For instance, these new technologies have been efficiently employed in the discovery of new genes (Hahn et al. 2009), the development of new tissue-specific or cancer biomarkers (Levin et al. 2009; Morrissy et al. 2009), the isolation of fast-evolving genes (Montoya-Burgos et al. 2010), the detection of new alternative splice variants (Carninci 2008; Gibbons et al. 2009; Tang et al. 2009), allele-specific gene expression (Main et al. 2009), SNP discovery in genes (Barbazuk et al. 2007), or epigenetic gene regulation (Elling and Deng 2009). These advances and future ones rely, however, on the size and quality of the transcriptome assembly.

In this report, we present methods to improve both the quantity and the quality of the information that can be extracted from a de novo transcriptome assembly. By taking advantage of the assembling properties of many different k-mer lengths, the Multiple-k method is able to incorporate the best parts (i.e., the most contiguous) of each assembly into the final assembly. We have demonstrated that this strategy leads to a considerable increase in both contig contiguity (by keeping long contigs of highly expressed genes assembled with high k-values) and in transcript diversity (by keeping contigs of poorly expressed genes that only assemble with low k-values). Furthermore, the use of this method avoids the subjective selection of a single k-mer length.

The second methodology we developed, the STM method, uses the information of a reference proteome to accurately join contigs into scaffolds. Simulated data demonstrated that this method efficiently joins multiple transcript fragments that are part of a single gene, providing new and valuable information on the order and the orientation of these fragments along the original transcript.

Importantly, the sequential application of these two methods to the new next-generation short-read data set of the catfish L. gr. cataphracta demonstrates their utility in improving the de novo assembly of a non-model organism transcriptome. First, the additive Multiple-k method makes use of more sequence data from the original data set than a single-k Velvet de novo assembly; the number of reads used is increased by 38.7%. This, together with an 8.8% increase in contiguity, leads to the identification of ∼21% more unigenes. A further increase in contiguity is observed when using the STM⁻ method. The STM⁺ method, which includes orphan reads in the process (4.6% more reads used), leads to a remarkable increase in the number of unigenes identified in the Loricaria transcriptome (+72%, as compared to the single-k Velvet de novo assembly), which corresponds to >56% of the zebrafish gene set Zv8 (Ensembl GeneBuild). In the single-k Velvet de novo assembly, only 33% of these were recovered.

The number of unigenes identified using the different assembly methods is illustrated in a Venn diagram (Supplemental Fig. S2) and shows that 8986 unigenes were identified by all the methods tested here. However, an analysis of the contigs assigned to this set of shared unigenes (Supplemental Table S5) indicates that the new methods allow a more contiguous assembly of these contigs. Hence, these methods not only allow the identification of a higher number of unigenes, but also allow a better assembly of the transcripts belonging to the unigenes already identified using a single-k Velvet assembly.

The increase in the number of unigenes identified is probably not artifactual, since the error rate of the STM method is unlikely to be higher in this experiment than the one determined in the assembly of the simulated zebrafish transcriptome data set using the proteome of the stickleback as reference; these two model fish species diverged ∼300 Mya, while Loricaria and the zebrafish diverged more recently, ∼150 Mya (Steinke et al. 2006). Moreover, the expected small number of misassemblies (scaffolding two or more contigs belonging to different transcripts) will only lead to an underestimate of the number of unigenes. Indeed, when using the best BLASTX hit to annotate a "chimeric" scaffold, a single gene identification will be obtained, while two or more would have been obtained with the unassembled contigs. Hence, when the main goal of a de novo transcriptome assembly experiment is gene identification, the inclusion of reads (>70 bp) with the STM⁺ method is highly recommended.

The recovery of more diverse transcripts with higher contiguity is also demonstrated by the successful recovery of the entire core set of dental development regulatory genes using a combination of the additive Multiple-k and STM⁺ methods, while less than half of them were recovered using single-k Velvet de novo assembly. Furthermore, this analysis demonstrates once again the value of our methods for assembling transcripts with low abundance, since the single-k method only assembled the transcripts with the highest sequence coverage (Table 5).

Application requirements

The Multiple-k method can be implemented in conjunction with any assembler that uses the k-mer length parameter, such as those based on the de Bruijn graph representation of sequence neighborhoods, as initially implemented in this field by Pevzner and coworkers (Pevzner and Tang 2001; Pevzner et al. 2001). Current assemblers using this graph-based approach include ALLPATHS (Butler et al. 2008) and ALLPATHS 2 (MacCallum et al. 2009), Edena (Hernandez et al. 2008), Velvet (Zerbino and Birney 2008), EULER-SR (Chaisson and Pevzner 2008), and ABySS (Simpson et al. 2009).

As for the STM method, it works on the output data set of the assembly and is therefore independent of the assembler used. This makes it of general use for de novo transcriptome assembly.

STM method limitations

The STM method relies on the assumption that the gene set of the reference proteome, which will serve as a template for joining contigs into scaffolds, is sufficiently similar in terms of gene composition, ortholog gene length, or multigene families, to the gene set of the transcriptome under assembly. Large differences may reduce the number of scaffolds or lead to an increase in misassemblies. The errors introduced by this method arise mainly when two contigs from different transcripts map on a single reference transcript. This can happen when recent paralogs or pseudogenes are present or when the reference proteome is not complete enough, especially for multigene families for which not all members are present in the reference proteome. The error rate can be reduced by increasing the similarity cutoff in the STM process at the cost of a lower scaffolding efficiency. Hence, the selection of a similarity cutoff is a trade-off between accuracy and efficiency.

In this respect, a parallel can be drawn between the STM method and gene orthology prediction, for which substantial literature exists. EST databases are being used for predicting gene orthology among species, particularly for phylogenomic purposes (e.g., Burki et al. 2008; Dunn et al. 2008). However, it has been recently shown that ortholog prediction accuracy is significantly higher when at least one of the two transcriptomes compared is complete, and that comparing two partial transcriptomes results in many more false-positive predictions and in more unpredicted true orthologs (Gibbons et al. 2009). Interestingly, this same study showed that although the number of predicted orthologs decreases with increasing evolutionary distance, the prediction accuracy remains the same. This observation is promising as it may well hold true in the context of the STM method for improving the assembly of coding parts of a non-model transcriptome, as suggested by the low error rate associated with this method, even when using the proteome of the stickleback to reconstruct the zebrafish transcriptome, two species that diverged 290–330 Mya (Steinke et al. 2006; Yamanoue et al. 2006).

In the future, the STM method will benefit not only from the promise of longer and more numerous reads resulting from next-generation sequencing technologies, but also from both the improvement of current model species transcriptomes/proteomes and the fast rate of development of new model organisms and their transcriptomes/proteomes.

Methods

Multiple-k method

As no single k-mer length is optimal for any de novo transcriptome assembly, we designed and investigated two procedures to combine the best assembly information obtained with different k-mer lengths into a final assembly. The first method consists of assembling the set of reads using a high k-value so that highly expressed genes are best assembled. The reads used in this initial assembly are then discarded, and a new assembly is performed with the remaining reads and using a lower k-value so that genes with lower expression levels are well assembled. These steps can be repeated one or more times using decreasing k-values. The contigs of the different assemblies are then pooled to form the final assembly. We called this approach the "subtractive Multiple-k" method. In the second method, the reads used in the assembly with a high k-value are not discarded before running the subsequent assembly with a lower k; each assembly uses the full set of reads. Some contigs will appear in two or more assemblies, introducing redundancy. We used CD-HIT-EST (Li and Godzik 2006) to remove redundancy and retain the longest possible contigs; the full set of contigs is mapped against itself. The short redundant contigs are removed, and the remaining contigs of the pool of assemblies compose the final assembly. We called this process the "additive Multiple-k" method.
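The control flow of the two variants can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' scripts: `assemble` is a hypothetical callable standing in for any k-mer based assembler (it is not a Velvet API) and is assumed to return the contigs together with the set of reads it used, and `remove_redundant` is a deliberately simplified stand-in for CD-HIT-EST that only detects exact containment on either strand.

```python
from typing import Callable, Iterable

# Hypothetical assembler interface (an assumption, not a real Velvet API):
# given reads and a k-mer length, return (contigs, reads used in those contigs).
Assembler = Callable[[list[str], int], tuple[list[str], set[str]]]


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def remove_redundant(contigs: Iterable[str]) -> list[str]:
    """Keep the longest contigs and drop any contig contained in a kept one.

    Simplified stand-in for CD-HIT-EST: exact containment only, either strand.
    """
    kept: list[str] = []
    for contig in sorted(set(contigs), key=lambda s: (-len(s), s)):
        rc = reverse_complement(contig)
        if not any(contig in longer or rc in longer for longer in kept):
            kept.append(contig)
    return kept


def subtractive_multiple_k(reads: Iterable[str], k_values: Iterable[int],
                           assemble: Assembler) -> list[str]:
    """Assemble at decreasing k, discarding the reads used in each round."""
    remaining = list(reads)
    pooled: list[str] = []
    for k in sorted(k_values, reverse=True):
        contigs, used = assemble(remaining, k)
        pooled.extend(contigs)
        remaining = [r for r in remaining if r not in used]
    return pooled


def additive_multiple_k(reads: Iterable[str], k_values: Iterable[int],
                        assemble: Assembler) -> list[str]:
    """Assemble the full read set at every k, then remove redundant contigs."""
    all_reads = list(reads)
    pooled: list[str] = []
    for k in sorted(k_values, reverse=True):
        contigs, _ = assemble(all_reads, k)
        pooled.extend(contigs)
    return remove_redundant(pooled)
```

Passing the assembler in as a callable mirrors the point made in the text that the approach does not depend on a particular assembler, only on the availability of a k-mer length parameter.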

Scaffolding using translation mapping (STM)

The bioinformatics pipeline for building scaffolds based on contig and read translations is diagrammed in Figure 1. Subsequent to a de novo transcriptome assembly, contigs and unassembled (orphan) reads longer than a given threshold are simultaneously translated and "blasted" against a reference proteome using BLASTX. The threshold size should be long enough to potentially result in sufficiently good BLASTX E-values (we used a threshold size of 71 bp, giving translations of 23 amino acids). BLASTX results are parsed to retain only good-quality hits; the criteria we used are contig coverage >90%, identity >60%, and E-value ≤ 10⁻⁵. The contigs with no good BLASTX hit, or orphan contigs, are directly placed into the final assembly data set. If reads (longer than the threshold size) were included in the procedure, those that showed low-quality BLASTX hits are discarded.
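The hit-quality criteria above can be written as a small filter. The sketch below is illustrative only: the `BlastxHit` record and its field names are hypothetical (they are not BLAST output column names), coverage is taken here as the fraction of the query nucleotides inside the alignment, and the length threshold is treated as inclusive, which is an assumption based on the 71-bp reads used in this study.

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 71    # threshold used in the text; 71 // 3 = 23 amino acids
MIN_COVERAGE = 0.90   # fraction of the contig/read covered by the alignment
MIN_IDENTITY = 0.60   # fraction of identical residues in the alignment
MAX_EVALUE = 1e-5


@dataclass
class BlastxHit:
    """Hypothetical record for one parsed BLASTX hit (field names assumed)."""
    query_id: str
    query_length: int       # nucleotides
    aligned_query_bp: int   # query nucleotides inside the alignment
    identity: float         # fraction, 0..1
    evalue: float


def long_enough(sequence_length_bp: int) -> bool:
    """Length check applied before translation (inclusive by assumption)."""
    return sequence_length_bp >= MIN_LENGTH_BP


def is_good_hit(hit: BlastxHit) -> bool:
    """Apply the coverage, identity, and E-value criteria quoted in the text."""
    coverage = hit.aligned_query_bp / hit.query_length
    return (coverage > MIN_COVERAGE
            and hit.identity > MIN_IDENTITY
            and hit.evalue <= MAX_EVALUE)
```

Raising `MIN_IDENTITY` corresponds to the stricter similarity cutoff discussed under the method's limitations: fewer contigs are scaffolded, but fewer are joined in error.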

The BLASTX results are parsed to retrieve the coding strand and mapping position of the contigs/reads on the reference protein. If only one contig/read maps on a given reference protein, it cannot be further assembled and is directly included in the final assembly (for contigs) or discarded (for reads). When multiple contigs/reads map on the same reference protein, their relative position is set according to their place along the reference sequence in nucleotide coordinates (termed "reference scale" in Fig. 1). The contigs/reads are then joined to form a scaffold, with N's filling the spaces between them. This way of proceeding ensures that the reading frame is maintained. Several contigs/reads belonging to the same scaffold may overlap, and sequence differences may exist in the overlapping regions. The overlapping contigs/reads, called overlap groups, are therefore checked for the presence of minor or major sequence differences at each position by computing a majority rule consensus sequence (here the majority rule parameter was set to 75%). Minor differences, which may represent allelic variations or sequencing errors, will be resolved in the consensus, and the scaffold is built by joining the various overlap groups. Ambiguous bases (N) will appear in the consensus when major sequence discrepancies exist at a given position. If ambiguous positions cover <1% of the consensus sequence length of the overlap group, they are still considered as allelic variations or sequencing errors. Otherwise, when >1% of the consensus sequence length is composed of N, which may result from the misassembly of splice variants, from transcripts displaying sequence affinities, or from indels in the reference sequence relative to the transcript being assembled, the overlap group is examined to discern among the various cases. The assembler CAP (Huang 1992) is used to reassemble the sequences composing the overlap group, without using the positioning on the reference sequence. This realignment resolves instances where indels were the cause of the problem; the scaffold is thus assembled and included in the final assembly data set. If the problem persists, then the sequences composing the overlap group are separated and placed in the final assembly.
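The placement and consensus steps can be sketched as below. This is a simplified illustration, not the published implementation: sequences are assumed to be already oriented on the coding strand and given as hypothetical `(offset, sequence)` pairs on the nucleotide reference scale, the 75% rule is treated as inclusive, and the ambiguity fraction is computed over all covered columns rather than per overlap group. The CAP realignment step is not modeled.

```python
from collections import Counter

AMBIGUITY_LIMIT = 0.01  # above this fraction of N, the text calls for realignment


def majority_consensus(placed, threshold=0.75):
    """Join placed sequences into one scaffold on the reference scale.

    placed: non-empty list of (offset, sequence) pairs, offsets in nucleotides.
    Returns (scaffold, ambiguous_fraction). Uncovered columns become 'N' (gap
    filling); covered columns become 'N' when no base reaches the threshold.
    """
    start = min(offset for offset, _ in placed)
    end = max(offset + len(seq) for offset, seq in placed)
    columns = [Counter() for _ in range(end - start)]
    for offset, seq in placed:
        for i, base in enumerate(seq):
            columns[offset - start + i][base] += 1

    scaffold = []
    covered = 0
    ambiguous = 0
    for counts in columns:
        depth = sum(counts.values())
        if depth == 0:
            scaffold.append("N")  # gap between contigs/reads
            continue
        covered += 1
        base, count = counts.most_common(1)[0]
        if count / depth >= threshold:
            scaffold.append(base)
        else:
            scaffold.append("N")  # major discrepancy at this position
            ambiguous += 1
    return "".join(scaffold), ambiguous / covered


def needs_realignment(ambiguous_fraction):
    """True when the consensus has more than 1% ambiguous positions."""
    return ambiguous_fraction > AMBIGUITY_LIMIT
```

Because offsets are expressed on the reference scale, the number of N's inserted between two contigs equals their distance on the reference, which is what keeps the reading frame intact across gaps.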

Validation of Multiple-k and STM methods

To investigate and test the performance of our two methods, we analyzed two independent data sets, one based on real data and one based on simulated data. To test the Multiple-k assembly method, we used the Ae. aegypti next-generation short-read (36 bp) data set recently published by Gibbons et al. (2009), generated from the same strain as the one used to sequence the complete genome (Nene et al. 2007). This data set was subjected to de novo assembly using Velvet v0.7.59 (Zerbino and Birney 2008) with k-mer lengths of 19 to 29. Unless otherwise specified, the assembly statistics were taken from the Velvet output file. We then applied the two versions of the Multiple-k method to this data set and evaluated their efficiency. As the Multiple-k method is not aimed at assembling reads into contigs but rather uses the contigs constructed by a de novo assembler under different k-mer lengths, we did not evaluate the misassembly rate, which depends on the assembler used. Instead, we determined the improvements by looking at the assembly statistics and the number of reference transcripts recovered as compared to the single-k Velvet assembly (obtained with the optimal k-value). The number of reference transcripts recovered was calculated by comparing the resulting contigs of the de novo assembly to the Ae. aegypti reference transcriptome (Nene et al. 2007) using BLASTN. We considered as correctly identified the hits covering at least 95% of the query sequence and having at least 99% identity with the reference transcript. We estimated the number of bases of the reference transcriptome covered by our assembly by summing the lengths of these good hits.

To investigate the behavior of the Multiple-k method, we conducted the same analyses on a simulated RNA-seq data set. First, we simulated a transcriptome from the Ensembl set of D. rerio cDNA (32,337 transcripts including splice variants). Different relative abundances were randomly assigned to each of these transcripts to mimic the variation in gene expression level observed in a real data set (the abundance profile was set according to the distribution of D. rerio EST density found in the Unigene database). This simulated transcriptome contained a total of 317,272 transcripts for a total length of 514,277,767 bp. An RNA-seq experiment was then simulated from this transcriptome. We generated 10 million 35-bp reads in a shotgun process using the simreads program of the RMAP package (Smith et al. 2009) and applied an error rate of 1% to mimic sequencing errors.

Similarly, to test the accuracy of the STM method in highly controlled conditions, we simulated a simplified next-generation sequencing experiment of the zebrafish coding transcriptome (32,337 coding transcripts, representing 51,837,753 bp) by generating 4 million random single-end reads of 76 bp in size (representing an ∼6× coverage of the transcriptome with homogeneous gene expression level). The simulation was performed with simreads of the RMAP package (Smith et al. 2009). This data set was first subjected to a Velvet de novo assembly with an arbitrary k-mer length of 41. It was then used to determine the scaffolding misassembly rate of the STM method. We first calculated the error rate due to the de novo assembler, and then that due to the STM method, by comparing the contigs assembled to the reference transcriptome of D. rerio (the same from which the reads were simulated), using the same procedure as described for the Multiple-k method.
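The coverage figure quoted for the STM simulation follows from simple arithmetic: mean coverage is the number of sequenced bases divided by the transcriptome length. The short check below uses only the numbers given in the text; the function name `mean_coverage` is introduced here for illustration.

```python
def mean_coverage(n_reads: int, read_length_bp: int, transcriptome_bp: int) -> float:
    """Mean sequencing depth: total sequenced bases over transcriptome length."""
    return n_reads * read_length_bp / transcriptome_bp


# STM simulation: 4 million 76-bp reads over the 51,837,753-bp coding transcriptome.
stm_coverage = mean_coverage(4_000_000, 76, 51_837_753)

# Multiple-k simulation: 10 million 35-bp reads over the 514,277,767-bp
# abundance-weighted transcriptome.
multiple_k_coverage = mean_coverage(10_000_000, 35, 514_277_767)

print(f"STM simulation: {stm_coverage:.2f}x")
print(f"Multiple-k simulation: {multiple_k_coverage:.2f}x")
```

The first value comes out just under 6, consistent with the ∼6× stated above. The second is a mean over a transcriptome with heterogeneous abundances, so individual genes in that simulation are expected to sit well above or below it.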

The scripts implementing the Multiple-k and STM methods are available in the Supplemental Material and can also be downloaded from http://www.surget-groba.ch/downloads/stm.tar.gz.

Quantification of transcripts from the RNA-seq experiments

To quantify the abundance of transcripts assembled in both Ae. aegypti and L. gr. cataphracta, we mapped the reads from the RNA-seq experiments onto the assembled contigs using Maq v0.7.1 (Li et al. 2008). Reads mapping with a quality below 20 were discarded, and the number of reads mapping on a given transcript was corrected by the transcript size and the total number of reads to obtain the number of "reads per kilobase per million" (RPKM).
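The normalization described above can be written out explicitly. The sketch below assumes the usual RPKM definition (reads mapped to a transcript, divided by transcript length in kilobases and by the total number of reads in millions); whether the total counts all reads or only mapped reads is not specified in the text, so `total_reads` is left to the caller. The input format `(contig_id, mapping_quality)` is hypothetical and is not Maq's native output.

```python
from collections import Counter

MIN_MAPPING_QUALITY = 20  # reads with a mapping quality below 20 are discarded


def count_reads(alignments):
    """Count reads per contig, keeping only those with mapping quality >= 20.

    alignments: iterable of (contig_id, mapping_quality) pairs (assumed format).
    """
    return Counter(contig for contig, quality in alignments
                   if quality >= MIN_MAPPING_QUALITY)


def rpkm(mapped_reads: int, transcript_length_bp: int, total_reads: int) -> float:
    """Reads per kilobase of transcript per million reads."""
    return mapped_reads * 1e9 / (transcript_length_bp * total_reads)
```

For example, 500 reads on a 2000-bp contig in a library of 10 million reads gives 500 / (2 kb × 10 million) = 25 RPKM.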

Illumina sequencing and de novo assembly of Loricaria transcriptome

To test our methods in real conditions, we conducted a complete experiment of next-generation transcriptome sequencing and de novo assembly using our methods for a non-model organism, the catfish L. gr. cataphracta.

Total RNA was extracted from fresh L. gr. cataphracta whole embryos of 2–8 d post-fecundation (stages ranging from end of gastrula to hatching) using TRIzol reagent (GIBCO). After quantification and quality verification of the total RNA, mRNA was isolated using the mRNA Isolation Kit (Roche Diagnostics) according to the manufacturer's instructions. We used the "mRNA-SEQ" Transcriptome Shotgun process and Kit (Roche) for preparing the cDNA for Illumina sequencing. The sequencing experiment was performed by the company Fasteris SA. First, 1 μg of embryo mRNA was zinc-fragmented to reach sizes ranging from 200 to 500 bases. First-strand cDNA was synthesized using random hexamer primers. Second-strand synthesis was performed by treatment with RNase H and DNA polymerase I for strand elongation, according to the manufacturer's instructions. Double-strand cDNA ends were repaired using T4 DNA polymerase, Large (Klenow) fragment of DNA polymerase I, and T4 polynucleotide kinase in the presence of ATP and the four dNTPs. After purification, adenine nucleotides were added at the 3′ end of the blunt-ended DNA fragments with Klenow fragment (exo⁻), and the fragments were then purified. Forked Illumina adapters were ligated to the cDNA overnight at 15°C, using T4 DNA ligase in the presence of ATP, and then purified. The cDNA–adapter complexes were loaded onto a well-resolved 3% agarose TBE gel, and complexes of 250–350 bp in size were extracted by excising the corresponding region of the gel and purifying the complexes with the High Pure PCR Product Purification Kit (Roche). Finally, the cDNA–adapter complexes were PCR-amplified for 15 cycles. The prepared cDNA library was sequenced with 71-bp single-end reads on one lane of the Illumina Genome Analyzer II platform and processed using the Illumina Pipeline Software v1.4.0, according to the manufacturer's instructions (Illumina). The reads data set was deposited at the NCBI Sequence Read Archive (SRA) under accession number SRA010189.

The 9.56 million reads of 71 bp that were generated were subjected to a series of de novo assemblies using k-mer lengths ranging from 37 to 61. The summary statistics were used to determine the optimal k-mer length (Supplemental Table S4). This data set was then subjected sequentially to the additive Multiple-k method and then to either the STM⁻ or the STM⁺ method.

In this experiment, we estimated the number of different genes recovered by comparing the resulting contigs to the proteome of D. rerio, using BLASTX, and kept only hits with an E-value ≤10⁻¹⁰. We then counted the number of distinct genes (unigenes) identified.
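The unigene count can be sketched as follows, assuming that each contig is annotated with its best BLASTX hit (as described in the Discussion) and that protein hits have already been mapped to gene identifiers. The `(contig_id, gene_id, evalue)` tuple format is hypothetical.

```python
MAX_EVALUE = 1e-10  # hits with a larger E-value are ignored


def count_unigenes(hits):
    """Count distinct genes among the best hits of all contigs.

    hits: iterable of (contig_id, gene_id, evalue) tuples (assumed format).
    """
    best = {}  # contig_id -> (gene_id, evalue) of the lowest-E-value hit
    for contig_id, gene_id, evalue in hits:
        if evalue > MAX_EVALUE:
            continue
        if contig_id not in best or evalue < best[contig_id][1]:
            best[contig_id] = (gene_id, evalue)
    return len({gene_id for gene_id, _ in best.values()})
```

Since every contig or scaffold contributes at most one gene, a chimeric scaffold that wrongly joins contigs from two genes is counted once. This is why misassemblies are expected to lower the unigene count rather than inflate it.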

Acknowledgments

We thank Antonis Rokas for sharing with us his Ae. aegypti next-generation short-read data set. We thank Ilham Bahechar for her help with laboratory work, and Marta Burgos and Alison R. Davis for revising the manuscript. We thank Patrick Descombes, Laurent Farinelli, and three anonymous reviewers for their useful discussions and comments. This work was supported by funds from the Canton de Genève, the Swiss National Research Fund (Number 3100A0-122303/1), and the G & A Claraz Foundation.

References

Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W, Lipman DJ 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402
Barbazuk WB, Emrich SJ, Chen HD, Li L, Schnable PS 2007. SNP discovery via 454 transcriptome sequencing. Plant J 51: 910–918
Burki F, Shalchian-Tabrizi K, Pawlowski J 2008. Phylogenomics reveals a new 'megagroup' including most photosynthetic eukaryotes. Biol Lett 4: 366–369
Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB 2008. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res 18: 810–820
Carninci P 2008. Hunting hidden transcripts. Nat Methods 5: 587–589
Chaisson MJ, Pevzner PA 2008. Short read fragment assembly of bacterial genomes. Genome Res 18: 324–330
Collins LJ, Biggs PJ, Voelckel C, Joly S 2008. An approach to transcriptome analysis of non-model organisms using short-read sequences. Genome Inform 21: 3–14
Disset A, Cheval L, Soutourina O, Duong Van Huyen JP, Li G, Genin C, Tostain J, Loupy A, Doucet A, Rajerison R 2009. Tissue compartment analysis for biomarker discovery by gene expression profiling. PLoS ONE 4: e7779 doi: 10.1371/journal.pone.0007779
Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, Smith SA, Seaver E, Rouse GW, Obst M, Edgecombe GD, et al. 2008. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452: 745–749
Elling AA, Deng XW 2009. Next-generation sequencing reveals complex relationships between the epigenome and transcriptome in maize. Plant Signal Behav 4: 760–762
Fraser GJ, Hulsey CD, Bloomquist RF, Uyesugi K, Manley NR, Streelman JT 2009. An ancient gene network is co-opted for teeth on old and new jaws. PLoS Biol 7: 233–247
Gibbons JG, Janson EM, Hittinger CT, Johnston M, Abbot P, Rokas A 2009. Benchmarking next-generation transcriptome sequencing for functional and evolutionary genomics. Mol Biol Evol 26: 2731–2744
Hahn DA, Ragland GJ, Shoemaker DD, Denlinger DL 2009. Gene discovery using massively parallel pyrosequencing to develop ESTs for the flesh fly Sarcophaga crassipalpis. BMC Genomics 10: 234 doi: 10.1186/1471-2164-10-234
Hale MC, McCormick CR, Jackson JR, DeWoody JA 2009. Next-generation pyrosequencing of gonad transcriptomes in the polyploid lake sturgeon (Acipenser fulvescens): The relative merits of normalization and rarefaction in gene discovery. BMC Genomics 10: 203 doi: 10.1186/1471-2164-10-203
Hegedus Z, Zakrzewska A, Agoston VC, Ordas A, Racz P, Mink M, Spaink HP, Meijer AH 2009. Deep sequencing of the zebrafish transcriptome response to mycobacterium infection. Mol Immunol 46: 2918–2930
Hernandez D, Francois P, Farinelli L, Osteras M, Schrenzel J 2008. De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Res 18: 802–809
Huang XQ 1992. A contig assembly program based on sensitive detection of fragment overlaps. Genomics 14: 18–25
Hughes J, Longhorn SJ, Papadopoulou A, Theodorides K, de Riva A, Mejia-Chang M, Foster PG, Vogler AP 2006. Dense taxonomic EST sampling and its applications for molecular systematics of the Coleoptera (beetles). Mol Biol Evol 23: 268–278
Levin JZ, Berger MF, Adiconis X, Rogov P, Melnikov A, Fennell T, Nusbaum C, Garraway LA, Gnirke A 2009. Targeted next-generation sequencing of a cancer transcriptome enhances detection of sequence variants and novel fusion transcripts. Genome Biol 10: R115 doi: 10.1186/gb-2009-10-10-r115
Li W, Godzik A 2006. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22: 1658–1659
Li H, Ruan J, Durbin R 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18: 1851–1858
MacCallum I, Przybylski D, Gnerre S, Burton J, Shlyakhter IA, Gnirke A, Malek J, McKernan K, Ranade S, Terrance PS, et al. 2009. ALLPATHS 2: Small genomes assembled accurately and with high continuity from short paired reads. Genome Biol 10: R103 doi: 10.1186/gb-2009-10-10-r103
Main BJ, Bickel RD, McIntyre LM, Graze RM, Calabrese PP, Nuzhdin SV 2009. Allele-specific expression assays using Solexa. BMC Genomics 10: 422 doi: 10.1186/1471-2164-10-422
Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen ZT, et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–380
Montgomery SB, Sammeth M, Gutierrez-Arcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET 2010. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464: 773–777
Montoya-Burgos JI, Foulon A, Bahechar I 2010. Transcriptome screen for fast evolving genes by Inter-Specific Selective Hybridization (ISSH). BMC Genomics 11: 126 doi: 10.1186/1471-2164-11-126
Morrissy AS, Morin RD, Delaney A, Zeng T, McDonald H, Jones S, Zhao Y, Hirst M, Marra MA 2009. Next-generation tag sequencing for cancer gene expression profiling. Genome Res 19: 1825–1835
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5: 621–628
Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M 2008. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320: 1344–1349
Nene V, Wortman JR, Lawson D, Haas B, Kodira C, Tu ZJ, Loftus B, Xi ZY, Megy K, Grabherr M, et al. 2007. Genome sequence of Aedes aegypti, a major arbovirus vector. Science 316: 1718–1723
Pevzner PA, Tang HX 2001. Fragment assembly with double-barreled information. Bioinformatics 17: S225–S233
Pevzner PA, Tang HX, Waterman MS 2001. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci 98: 9748–9753
Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, LeProust EM, Peck BJ, Emig CJ, Dahl F, et al. 2007. Multiplex amplification of large sets of human exons. Nat Methods 4: 931–936
Reinhardt JA, Baltrus DA, Nishimura MT, Jeck WR, Jones CD, Dangl JL 2009. De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae. Genome Res 19: 294–305
Salzberg SL, Sommer DD, Puiu D, Lee VT 2008. Gene-boosted assembly of a novel bacterial genome from very short reads. PLoS Comp Biol 4: e1000186 doi: 10.1371/journal.pcbi.1000186
Schuster SC 2008. Next-generation sequencing transforms today's biology. Nat Methods 5: 16–18
Simpson JT, Wong M, Jackman SD, Schein JE, Jones SJM, Birol I 2009. ABySS: A parallel assembler for short read sequence data. Genome Res 19: 1117–1123
Sire JY 2001. Teeth outside the mouth in teleost fishes: How to benefit from a developmental accident. Evol Dev 3: 104–108
Smith AD, Chung WY, Hodges E, Kendall J, Hannon G, Hicks J, Xuan ZY, Zhang MQ 2009. Updates to the RMAP short-read mapping software. Bioinformatics 25: 2841–2842
Steinke D, Salzburger W, Meyer A 2006. Novel relationships among ten fish model species revealed based on a phylogenomic analysis using ESTs. J Mol Evol 62: 772–784
Tang FC, Barbacioru C, Wang YZ, Nordman E, Lee C, Xu NL, Wang XH, Bodeau J, Tuch BB, Siddiqui A, et al. 2009. mRNA-Seq whole-transcriptome analysis of a single cell. Nat Methods 6: 377–382
Torres TT, Metta M, Ottenwalder B, Schlotterer C 2008. Gene expression profiling by massively parallel sequencing. Genome Res 18: 172–177
Toth AL, Varala K, Newman TC, Miguez FE, Hutchison SK, Willoughby DA, Simons JF, Egholm M, Hunt JH, Hudson ME, et al. 2007. Wasp gene expression supports an evolutionary link between maternal behavior and eusociality. Science 318: 441–444
Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH 2008. Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol 17: 1636–1647
Wang Z, Gerstein M, Snyder M 2009. RNA-Seq: A revolutionary tool for transcriptomics. Nat Rev Genet 10: 57–63
Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bähler J 2008. Dynamic repertoire of a transcriptome surveyed at single-nucleotide resolution. Nature 454: 1239–1243
Yamanoue Y, Miya M, Inoue JG, Matsuura K, Nishida M 2006. The mitochondrial genome of spotted green pufferfish Tetraodon nigroviridis (Teleostei : Tetraodontiformes) and divergence time estimation among model organisms in fishes. Genes Genet Syst 81: 29–39
Zerbino DR, Birney E 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18: 821–829

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2945192/
