Precise identification of RNA-coding regions and transcriptomes of eukaryotes is a significant problem in biology. Currently, eukaryote transcriptomes are analyzed using deep short-read sequencing experiments of complementary DNAs. The resulting short reads are then aligned against a genome and annotated junctions to infer biological meaning.
In this study, scientists at Stanford University use long-read complementary DNA datasets for the analysis of a eukaryotic transcriptome and generate two large datasets in the human K562 and HeLa S3 cell lines. Both datasets comprised at least 4 million reads and had median read lengths greater than 500 bp. They show that annotation-independent alignments of these reads provide partial gene structures that are very much in line with annotated gene structures, 15% of which have not been obtained in a previous analysis of short reads. For long noncoding RNA (lncRNA) genes, however, they find an increased fraction of novel gene structures among the alignments. Other important aspects of transcriptome analysis, such as the description of cell type-specific splicing, can be performed in an accurate, reliable and completely annotation-free manner, making the approach well suited to the analysis of transcriptomes of newly sequenced genomes. Furthermore, they demonstrate that long-read sequences can be assembled into full-length transcripts with considerable success. This method is applicable to all long-read sequencing technologies.
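The two dataset properties quoted above, read count and median read length, are simple to compute for any long-read dataset. The following Python code is a minimal sketch, not the authors' pipeline: it assumes reads are stored in plain four-line FASTQ records, and the function name `read_length_summary` and the toy reads are hypothetical, chosen only for illustration.

```python
import io
import statistics


def read_length_summary(fastq_handle):
    """Return (read_count, median_length) for a 4-line-per-record FASTQ stream."""
    lengths = []
    for line_number, line in enumerate(fastq_handle):
        # In a simple FASTQ record the sequence is the second of four lines.
        if line_number % 4 == 1:
            lengths.append(len(line.strip()))
    if not lengths:
        return 0, 0
    return len(lengths), statistics.median(lengths)


# Toy input standing in for a real long-read file (hypothetical data).
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGT\n+\nIIII\n"
    "@read3\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
)
count, median_length = read_length_summary(example)
print(count, median_length)
```

For a real dataset, the `io.StringIO` object would be replaced by an opened FASTQ file, and the two values could be compared against the thresholds reported in the study (at least 4 million reads, median length above 500 bp).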
Tilgner H, Raha D, Habegger L, Mohiuddin M, Gerstein M, Snyder M. (2013) Accurate identification and analysis of human mRNA isoforms using deep long read sequencing. G3 (Bethesda) 3(3), 387-97.