Accurate annotation of genes and their transcripts is a foundation of genomics, but no annotation technique presently combines throughput and accuracy. As a result, current reference gene collections remain far from complete: many gene models are fragmentary, while thousands more remain uncatalogued, particularly for long non-coding RNAs (lncRNAs).
To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), which combines targeted RNA capture with third-generation long-read sequencing.
Researchers from the CRG, Barcelona Institute of Science and Technology present an experimental re-annotation of the entire GENCODE intergenic lncRNA population in matched human and mouse tissues. CLS approximately doubles the annotated complexity of targeted loci, in terms of validated splice junctions and transcript models. The full-length transcript models produced by CLS make it possible to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential. Thus, CLS removes a longstanding bottleneck of transcriptome annotation, generating manual-quality full-length transcript models at high-throughput scales.
Capture Long Seq approach to extend the GENCODE lncRNA annotation
(A) Strategy for automated, high-quality transcriptome annotation. CLS may be used to complete existing annotations (blue), or to map novel transcript structures in suspected loci (orange). Capture oligonucleotides (black bars) are designed to tile across targeted regions. PacBio libraries are prepared from the captured molecules. Illumina HiSeq short-read sequencing can be performed for independent validation of predicted splice junctions. Predicted transcription start sites can be confirmed by CAGE clusters (green), and transcription termination sites by non-genomically encoded polyA sequences in PacBio reads. Novel exons are denoted by lighter coloured rectangles. (B) Summary of human and mouse capture library designs. Shown are the number of individual gene loci that were probed. “PipeR pred.”: orthologue predictions in mouse genome of human lncRNAs, made by PipeR (31); “UCE”: ultraconserved elements; “Prot. coding”: expression-matched, randomly-selected protein-coding genes; “ERCC”: spike-in sequences; “Ecoli”: randomly-selected E. coli genomic regions. Enhancers and UCEs are probed on both strands, and these are counted separately. “Total nts”: sum of targeted nucleotides. (C) RNA samples used.
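The independent validation step described in panel (A), in which splice junctions predicted from long-read transcript models are checked against short-read data, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the names `Junction`, `junctions_from_exons` and `fraction_supported` are hypothetical, the coordinates are invented for illustration, and an exact coordinate match is assumed to count as support.

```python
from typing import Iterable, List, NamedTuple, Optional, Set, Tuple


class Junction(NamedTuple):
    """A splice junction (hypothetical representation, 1-based coordinates)."""
    chrom: str
    donor: int      # last base of the upstream exon
    acceptor: int   # first base of the downstream exon
    strand: str


def junctions_from_exons(chrom: str, strand: str,
                         exons: Iterable[Tuple[int, int]]) -> List[Junction]:
    """Derive the junctions implied by a transcript model's exons.

    Exons are (start, end) pairs, 1-based inclusive, in genomic coordinates.
    """
    ordered = sorted(exons)
    return [
        Junction(chrom, upstream_end, downstream_start, strand)
        for (_, upstream_end), (downstream_start, _) in zip(ordered, ordered[1:])
    ]


def fraction_supported(model_junctions: List[Junction],
                       short_read_junctions: Set[Junction]) -> Optional[float]:
    """Fraction of a model's junctions that also appear in the short-read set.

    Returns None for mono-exonic models, which have no junctions to validate.
    """
    if not model_junctions:
        return None
    supported = sum(j in short_read_junctions for j in model_junctions)
    return supported / len(model_junctions)


# Invented example: a three-exon long-read model on the plus strand.
model = junctions_from_exons("chr1", "+", [(100, 200), (300, 400), (500, 600)])

# Invented short-read evidence supporting only the first junction.
short_reads = {Junction("chr1", 200, 300, "+")}

support = fraction_supported(model, short_reads)
```

In this toy case one of the two junctions is supported, so `support` is 0.5. A real analysis would also have to deal with alignment ambiguity at junction boundaries and with minimum read-count thresholds, both of which this sketch ignores.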