Multiple computational tools have been widely applied to the detection of coding driver mutations in cancer; however, prioritizing pathogenic non-coding variants remains difficult. The present study aimed to distinguish non-coding disease-causing mutations from neutral ones, and to prioritize potential cancer-associated long non-coding RNAs (lncRNAs) in lung cancer with a logistic regression model. The model was constructed using 19,153 disease-associated pathogenic variants from ClinVar and the Human Gene Mutation Database as the response variable and non-coding features as the predictor variables. The model was validated with genome-wide association study (GWAS) disease- or trait-associated single nucleotide polymorphisms (SNPs) and recurrent somatic mutations. High-scoring regions were characterized with respect to their distribution across features and gene classes; potential cancer-associated lncRNA candidates were prioritized by combining the fraction of high-scoring regions with the average score predicted by the logistic regression model. H3K79me2 was the feature with the most negative contribution to the model, whereas conserved regions were the most positively informative. The area under the receiver operating characteristic curve of the model was 0.89. The model assigned significantly higher scores to GWAS SNPs than to neutral SNPs (mean, 5.9012 vs. 5.5238; P<0.001, Mann-Whitney U test) and to recurrent somatic mutations than to non-recurrent mutations (mean, 5.4677 vs. 5.2277; P<0.001, Mann-Whitney U test). Regions including splicing sites and untranslated regions, and gene classes including cancer genes and cancer-associated lncRNAs, were enriched for high-scoring regions. In total, 2,679 cancer-associated lncRNAs were identified and characterized. A total of 104 of these lncRNAs were differentially expressed between lung cancer and normal specimens. The logistic regression model is a useful and efficient scoring system to prioritize non-coding pathogenic variants and lncRNAs, and may provide the basis for detecting non-coding driver lncRNAs in lung cancer.
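The modelling idea summarized above can be illustrated with a minimal sketch: a logistic regression fitted to binary feature annotations, where a positive coefficient marks a feature enriched among pathogenic variants and a negative one marks a depleted feature. This is not the study's implementation. The data are simulated, only two of the feature names are borrowed from the text, and the overlap rates, sample sizes, learning rate, and helper names (`fit_logistic`, `linear_score`, `simulate`) are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_score(w, x):
    # Log-odds for one variant; w[0] is the intercept.
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights (intercept first) by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(linear_score(w, xi)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def simulate(n, p_conserved, p_h3k79me2):
    # Each variant is a pair of 0/1 flags: overlaps feature or not.
    return [[int(random.random() < p_conserved),
             int(random.random() < p_h3k79me2)] for _ in range(n)]


random.seed(0)
FEATURES = ["conserved_region", "H3K79me2"]

# Invented overlap rates, chosen only so that the signs mirror the abstract.
pathogenic = simulate(400, 0.7, 0.1)
background = simulate(400, 0.2, 0.4)
X = pathogenic + background
y = [1] * len(pathogenic) + [0] * len(background)

weights = fit_logistic(X, y)
for name, value in zip(["intercept"] + FEATURES, weights):
    print(f"{name}: {value:+.3f}")

mean_pathogenic = sum(linear_score(weights, x) for x in pathogenic) / len(pathogenic)
mean_background = sum(linear_score(weights, x) for x in background) / len(background)
print(f"mean score, pathogenic: {mean_pathogenic:.3f}")
print(f"mean score, background: {mean_background:.3f}")
```

In this toy setting the fitted coefficient for the conserved-region flag comes out positive and the one for H3K79me2 negative, which is the kind of pattern the regression estimates in the study describe. A real analysis would use all annotated features and an established statistics package rather than hand-written gradient descent.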
Fitting and validation of the logistic regression model
(A) Densities of ClinVar and Human Gene Mutation Database pathogenic variants for all 25 non-coding features (red line, average density in the human genome). (B) Regression estimates for all features used in the logistic regression model. (C) Receiver operating characteristic curve for the model. (D) Scaled scores for GWAS SNPs, neutral SNPs (1 million random neutral SNPs), and non-recurrent and recurrent mutations in lung cancer. CR, conserved region; TFBS, transcription factor binding site; cTFBS, conserved TFBS; UTR, untranslated region; HE, highly expressed region; SNP, single nucleotide polymorphism; Sensitive, known binding sites or motifs of transcription factors with a high ratio of rare SNPs (allele frequency <0.01); ncExon, non-coding exon; H3K4me1, H3K9ac, etc., histone modification data; ER, early replicated region; Dnase, DNase I hypersensitive site; LE, lowly expressed region; ECS, evolutionarily conserved structure; LR, late replicated region; TPR, true positive rate; FPR, false positive rate; GWAS, genome-wide association study.
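For readers less familiar with the quantities in panels C and D, the following sketch shows how TPR and FPR pairs are obtained by sweeping a score threshold, how the area under the resulting curve is computed, and how that area relates to the Mann-Whitney U statistic (AUC = U / (n_pos × n_neg)). The six scores and labels are made up for illustration and are not taken from the study; the function names are hypothetical, and the sketch computes only the U statistic, not the P-value of the test.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this threshold before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    # Area under the piecewise-linear ROC curve.
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def mann_whitney_u(pos, neg):
    """U statistic: number of (pos, neg) pairs where pos scores higher; ties count 0.5."""
    return sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)


# Made-up scores: label 1 = pathogenic-like, label 0 = neutral-like.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

points = roc_points(scores, labels)
auc = auc_trapezoid(points)

pos = [s for s, l in zip(scores, labels) if l == 1]
neg = [s for s, l in zip(scores, labels) if l == 0]
u_stat = mann_whitney_u(pos, neg)
auc_from_u = u_stat / (len(pos) * len(neg))

print(f"AUC (trapezoid): {auc:.4f}")
print(f"AUC (from U):    {auc_from_u:.4f}")
```

Both routes give the same value because the AUC equals the probability that a randomly chosen positive item scores higher than a randomly chosen negative one, which is exactly what the U statistic counts.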