Accurate annotation of genes and their transcripts is a foundation of genomics, but currently no annotation technique combines high throughput with high accuracy. As a result, reference gene collections remain incomplete: many gene models are fragmentary, and thousands more remain uncataloged, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), which combines targeted RNA capture with third-generation long-read sequencing. Here, researchers from the Centre for Genomic Regulation present an experimental reannotation of the GENCODE intergenic lncRNA populations in matched human and mouse tissues that resulted in novel transcript models for 3,574 and 561 gene loci, respectively. CLS approximately doubled the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enabled the authors to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential. Thus, CLS removes a long-standing bottleneck in transcriptome annotation and generates manual-quality full-length transcript models at high-throughput scales.
Using the CLS approach to extend GENCODE lncRNA annotation
(a) The strategy for automated, high-quality transcriptome annotation. CLS can be used to complete existing annotations (blue) or to map novel transcript structures in suspected loci (gold). Capture oligonucleotides (black bars) are designed to tile across targeted regions. PacBio libraries are prepared from the captured molecules. Illumina HiSeq short-read sequencing can be carried out for independent validation of predicted splice junctions (SJ). Predicted transcription start sites can be confirmed by CAGE clusters (green), and transcription termination sites by non-genomically encoded polyA+ sequences in PacBio reads (red). Rectangles with lighter shading and dashed outlines denote novel exons. (b) A summary of the human and mouse capture library designs. The numbers of individual gene loci probed are shown. PipeR pred., ortholog predictions in mouse genome of human lncRNAs made by PipeR; snRNA, small nuclear RNA; snoRNA, small nucleolar RNA; UCE, ultraconserved elements; Prot. coding, expression-matched, randomly selected protein-coding genes; ERCC, spike-in sequences; Ecoli, randomly selected Escherichia coli genomic regions (enhancers and UCEs were probed on both strands, and these were counted separately). (c) Types of RNA samples used in the study.
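
The tiling of capture oligonucleotides across a targeted region, described in panel (a), can be illustrated with a minimal sketch. The function name `tile_probes`, the 120-nt probe length, and the 60-nt step are hypothetical values chosen for illustration only; the text above does not state the probe length or spacing used in the study, and coordinates are assumed to be 0-based and half-open.

```python
from typing import List, Tuple


def tile_probes(
    start: int,
    end: int,
    probe_len: int = 120,  # assumed probe length, not taken from the study
    step: int = 60,        # assumed offset between consecutive probe starts
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) probe intervals that tile [start, end)."""
    if end <= start:
        raise ValueError("region must have positive length")
    if probe_len <= 0 or step <= 0 or step > probe_len:
        raise ValueError("need 0 < step <= probe_len to tile without gaps")

    # A region no longer than one probe gets a single probe spanning it.
    if end - start <= probe_len:
        return [(start, end)]

    probes: List[Tuple[int, int]] = []
    pos = start
    # Place full-length probes at regular offsets while they end before
    # the region boundary.
    while pos + probe_len < end:
        probes.append((pos, pos + probe_len))
        pos += step
    # Add a final probe flush with the region end so the last bases are covered.
    probes.append((end - probe_len, end))
    return probes


if __name__ == "__main__":
    # Hypothetical 300-nt target region.
    for probe_start, probe_end in tile_probes(1000, 1300):
        print(probe_start, probe_end)
```

With these assumed parameters, consecutive probes overlap by half their length, so every base of the region is covered by at least one probe. A real design would also filter candidate probes, for example for repetitive sequence, which this sketch does not attempt.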