COME – a robust coding potential calculation tool for lncRNA identification and characterization

Recent genomic studies suggest that novel long non-coding RNAs (lncRNAs) are specifically expressed and far outnumber annotated lncRNA sequences. To identify and characterize novel lncRNAs in RNA sequencing data from new samples, researchers at Tsinghua University have developed COME, a coding potential calculation tool based on multiple features. It integrates multiple sequence-derived and experiment-based features using a decompose-compose method, which makes it more accurate and robust than other well-known tools. The developers also showed that COME substantially improved the consistency of prediction results from other coding potential calculators. Moreover, COME annotates and characterizes each predicted lncRNA transcript with multiple lines of supporting evidence, which other tools do not provide. Remarkably, they found that one subgroup of lncRNAs classified by such supporting features (i.e. conserved local RNA secondary structure) was highly enriched in a well-validated database (lncRNAdb). Based on recent eCLIP-seq data in humans, the developers further found that the conserved structural domains on lncRNAs had a better chance than other RNA regions of interacting with RNA binding proteins, indicating their potential regulatory roles.

COME workflow: a coding potential calculator based on multiple features

COME integrates multiple features with a supervised model to classify protein-coding transcripts (mRNAs) and non-coding transcripts (lncRNAs). Multiple features (GC content, sequence conservation score, etc.) are processed by a decompose-compose procedure: feature values are initially calculated and indexed at the bin level (B). They are first indexed at the whole-genome level, then mapped to each transcript (A). (C) The feature vector of each transcript is composed at the transcript level from the maximum, mean and variance of the scores of the overlapping bins. (D) The probability of being an mRNA, as predicted by the supervised model, is the coding potential score for a given transcript.
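
The compose step described in (C) can be illustrated with a minimal Python sketch. It is not taken from the COME source code: the bin width, the function names, the dictionary of bin scores and the use of population variance are all assumptions made for illustration.

```python
from statistics import mean, pvariance

BIN_SIZE = 50  # hypothetical bin width in nucleotides (not taken from COME)


def overlapping_bins(start, end, bin_size=BIN_SIZE):
    """Return indices of fixed-width bins overlapped by the half-open interval [start, end)."""
    return range(start // bin_size, (end - 1) // bin_size + 1)


def compose_transcript_features(intervals, bin_scores, bin_size=BIN_SIZE):
    """Summarize bin-level scores over a transcript as (max, mean, variance).

    intervals  -- list of (start, end) genomic intervals covered by the transcript
    bin_scores -- dict mapping bin index to a precomputed feature value
    """
    # Collect each overlapping bin once, even if two intervals touch the same bin.
    bin_ids = sorted({i for s, e in intervals for i in overlapping_bins(s, e, bin_size)})
    values = [bin_scores[i] for i in bin_ids if i in bin_scores]
    if not values:
        return None  # no bin-level data for this transcript
    # Population variance is an assumption; it stays defined for a single bin.
    return (max(values), mean(values), pvariance(values))


# Toy genome-wide index of one feature (e.g. a conservation score per bin).
toy_bin_scores = {0: 0.25, 1: 0.5, 2: 0.75, 3: 1.0}
# Toy transcript made of two intervals; they overlap bins 0, 1 and 2.
toy_features = compose_transcript_features([(10, 60), (120, 140)], toy_bin_scores)
```

In this sketch, one such triple per feature would be concatenated into the transcript's feature vector, which a supervised classifier would then score as in (D).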

Availability – The software implementation is available at https://github.com/lulab/COME

Hu L, Xu Z, Hu B, Lu ZJ. (2016) COME: a robust coding potential calculation tool for lncRNA identification and characterization based on multiple features. Nucleic Acids Res [Epub ahead of print].
