Next-generation sequencing approaches, in particular RNA-seq, provide genome-wide expression profiling that allows the identification of novel and rare transcripts such as long noncoding RNAs (lncRNAs). Many RNA-seq studies have now characterized lncRNAs and their possible involvement in cell development and differentiation across organisms, cell types, and tissues.
Researchers from the National Institute of Molecular Genetics, Milan, present a step-by-step pipeline for the identification and characterization of lncRNAs from RNA-seq data generated on poly-A+ fractions with paired-end Illumina reads. The pipeline includes all the software, command lines, and suggestions needed for a complete NGS analysis of lncRNAs, from quality control and read mapping to differential expression analysis or identification of an lncRNA signature. It was originally developed to identify lncRNAs expressed in human lymphocyte populations and can be applied to any available RNA-seq dataset.
Data investigation: principal component analysis and hierarchical clustering
(a) Principal component analysis (PCA) performed using DESeq2 rlog-normalized RNA-seq data from different CD4+ T cell subsets. (b) Hierarchical clustering analysis performed on the same normalized RNA-seq data. The distance metric values used for clustering are represented by a color code from white (low correlation) to dark blue (maximum correlation).
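The two views in the figure can be illustrated with a minimal, standard-library Python sketch. This is not the pipeline's own code, which relies on DESeq2 in R. The sample names and expression values below are hypothetical toy numbers standing in for rlog-normalized data. The choice of average linkage and of 1 - Pearson correlation as the distance are assumptions made for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(samples):
    """Agglomerative clustering of samples with distance = 1 - Pearson r.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    Average linkage is an assumption; the source does not name the method.
    """
    dist = {a: {b: 1.0 - pearson(samples[a], samples[b]) for b in samples}
            for a in samples}
    clusters = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dist[a][b] for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

def pc1_scores(samples, iterations=100, seed=0):
    """First principal component score of each sample.

    Genes are centered across samples, then power iteration finds the
    leading eigenvector of the sample-by-sample Gram matrix.
    """
    names = list(samples)
    n_genes = len(samples[names[0]])
    means = [sum(samples[s][g] for s in names) / len(names)
             for g in range(n_genes)]
    centered = [[samples[s][g] - means[g] for g in range(n_genes)]
                for s in names]
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered]
            for ri in centered]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in names]
    for _ in range(iterations):
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    gv = [sum(g * x for g, x in zip(row, v)) for row in gram]
    eigenvalue = sum(a * b for a, b in zip(v, gv))
    return {name: math.sqrt(eigenvalue) * x for name, x in zip(names, v)}

# Hypothetical normalized expression values: 4 samples x 6 genes.
samples = {
    "naive_1": [5.0, 5.2, 1.0, 1.1, 3.0, 3.1],
    "naive_2": [5.1, 5.0, 1.2, 1.0, 3.1, 3.0],
    "treg_1": [1.0, 1.2, 5.0, 5.1, 3.0, 2.9],
    "treg_2": [1.1, 1.0, 5.2, 5.0, 2.9, 3.1],
}

merges = average_linkage(samples)
scores = pc1_scores(samples)

for left, right, d in merges:
    print(f"merge {left} + {right} at distance {d:.3f}")
for name, score in scores.items():
    print(f"{name}: PC1 = {score:.3f}")
```

In this toy input, replicates of the same subset merge first and the two subsets fall on opposite sides of PC1. The sign of a principal component is arbitrary, so only the relative positions of samples carry meaning. For real data, established implementations (DESeq2 with R plotting functions, as in the figure) are the appropriate tools.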