from Biology Computes
Ying Sha, Sanjeev Sariya, Ali M Pirani, Adrian Lawsin
Cancer is a disease of aberrant gene expression. Genome-wide association studies in cancer reveal that more than 80% of cancer-associated SNPs (single-nucleotide polymorphism) occur in noncoding regions of the genome. Long noncoding RNAs (lncRNAs) are non-protein coding transcripts longer than 200 nucleotides. This arbitrary limit distinguishes lncRNAs from miRNAs (microRNA), piRNAs (piwi-interacting RNA), snoRNAs (small nucleolar RNAs)and other short ncRNAs (non-coding RNAs). lncRNA can act as transcriptional suppressor and activator.
Known mechanism of lncRNAs:
•Chromatin remodeling: Hetrochromatin to Euchromatin – The well-characterised lncRNAs ANRIL, XIST, HOTAIR and KCNQ1OT1 are able to recruit epigenetic modifiers to specific loci to reprogram the chromatin state.
•Transcriptional co-activation and –repression: Activation or Inhibition – Some lncRNAs may play an important role by operating as a transcriptional repressor or activator. lncRNA lincRNA-p21 contains binding sites for the tumour suppressor p53 in its promoter and is directly activated by p53 in response to DNA damage.
•Protein inhibition – Telomeres are the protective ends of chromosomes. These ends are progressively shortened during cell division until they reach a critical length triggering cell death or senescence. Telomeric ends are transcribed into a lncRNA named TERRA, which binds telomerase, inhibiting its activity in vitro. TERRA are downregulated in many cancer cells.
•Post-transcriptional modifications – Some lncRNAs are constituents of macromolecular complexes with roles in RNA processing. The lncRNA MALAT1 is thought to act at a post-transcriptional level by controlling alternative splicing of pre-mRNAs.
•Decoy – Some lncRNAs can act as decoys, sequestering biomolecules and preventing them from fulfilling their cellular functions. PTENP1 has a disrupted open reading frame, preventing it encoding a functional protein. PTENP1 can restrict cell proliferation by acting as a microRNA decoy for the tumour suppressor PTEN.
•How does lncRNA affects Lung, Breast and Prostate Cancer?
•Specifically, we want to :
(1) find long noncoding RNAs which are relevant to 3 specific cancer types
(2) predict what biological processes are lncRNAs involved in carcinogenesis.
Getting RNA-seq data from Gencode V7 (and refSeq data as reference), Sailfish was used to quantify transcript expression. The resulting gene expression tables were then used in DEGSeq R package to identify differentially expressed lincRNAs. R package was also used for cluster analysis to produce heat map and clusters. DAVID and KEGG was used on the DEGSeq results to produce pathway maps and tables.
Information of RNA-seq data
For each cancer type, we downloaded two RNA-seq datasets for cancer tissue and two RNA-seq datasets for matched/adjacent normal tissue as control group. We have analyzed 12 RNA-seq datasets in total.
•Long noncoding RNA：Gencode v7
GENCODE is a scientific project in genome research and part of the ENCODE (ENCyclopedia Of DNA Elements) scale-up project. It contains the most complete human lncRNA annotation to date.
•Protein coding RNA：Refseq
Expression table for lincRNA (5776*12 matrix)
Expression table for mRNA (40766*12 matrix)
We randomly select 1000 genes, and to compute the pearson correlation of the expression vector of these 1000 genes between 12 samples. The result shows that replicates of each type are well corresponded, samples from same tissue are similar than between different tissues, too.
Challenges of getting expression table
•We tried bowtie to get expression table, but failed because
(1) Too many lncRNAs chosen, which causes a huge bowtie index built from fastq format.
(2) Biocluster experienced shortage of space at that time.
(1) We select only those intergenic(locate between protein-coding genes), and known lncRNA annotations (remove novel, putative ones).
(2) Use sailfish, an alignment-free tool, instead of bowtie.
•DEGseq is an R package to identify differentially expressed genes from RNA-Seq data.
•(In default, it takes number of reads mapped to a gene as input, but it could also accept expression value(RPKM) obtained from Sailfish.)
Identification of differentially expressed lincRNA between cancer tissue and matched normal tissue
•Use DEGseq MA-plot-based method to identify differentially expressed genes. (MA plot is statistical analysis tool widely used to detect and visualize intensity-dependent ratio of microarray data.)
•M= log2C1- log2C2, and A = (log2C1 + log2C2)/2. Under the random sampling assumption, the conditional distribution of M given A follows an approximate normal distribution.
•For each gene on the MA plot, test the hypothesis Ho:p1=p2. H1: p1≠p2. A p Value could be assigned based on the conditional normal distribution.
C1, C2 denotes counts of reads mapped to a specific gene in sample1 and sample2, separately.
RNA seq could be modeled as random sampling process.
C ~ binomial distribution ( n, p)
This is the output of DEGseq for lncRNAs in breast cancer datasets.
The two density plot and the boxplot show the distribution of expression value after normalization.
The scatter plot shows the lncRNA expression values of breast cancer cells against those from matched normal tissue.
On the most right hand side, this graph shows the result of DEGseq, every dot represents a gene. Red dots beyond the red lines represent differentially expressed lncRNAs.
Similarly, the above shows the DEGseq results for the lung cancer datasets.
Likewise, the above is the results of DEGseq on the prostate cancer data.
Overlap of differentially expressed lincRNAs among three cancer types
Adding up the numbers, we got 363 differentially expressed lncRNAs in all.
We created a heatmap showing the pearson correlation of all 363 DE lincRNAs we identified based on their expression pattern. From the graph, we can see around 10 groups via hierarchical clustering. Within each group, these lincRNAs sharing similar expression pattern may have related regulatory roles.
To better visualize the clusters of genes, we create plots showing the clustered genes in each cancer types and list the ensemble id of lncRNAs.
If we know well-characterized lincRNAs (mechanism verified by experimental evidences), we can guess that lincRNAs in the same group with the well-characterized lincRNAs may have similar or related regulatory roles.
This could provide guidance to further experimental work.
Mds is a another method of visualizing the level of similarity of individual cases of a dataset. Mds algorithms aims to place each object in N-dimensional space such that the between-object distances are preserved as well as possible.
We used mds to visualize up/down regulated lincRNAs in each cancer type in a two-dimensional space. We found that in general the expression patterns of down regulated lincRNA expression patterns more similar than those of upregulated lincRNAs. (Maybe the function of up regulated lincRNAs are more diversified than that of down regulated lincRNAs.)
The results in general show that the down regulated genes are more clustered together than the up regulated genes.
Challenges of running pathway analysis
•Because lncRNAs are not as well annotated as protein coding RNAs, we cannot directly run pathway analysis on lncRNAs.
•Using filtered refSeq protein coding mRNAs; filtered by two conditions:
•Their expression pattern should be highly correlated with at least one of DE lincRNAs. (Spearman correlation coefficient> 0.7 or <-0.7) Why 0.7? This is a commonly used threshold. If we would like it to be more strict, we will need to compute the distribution of coefficients and do t-test .
•They should locate within 20k bp of one of DE lincRNAs. Why 20kb? Fine, we just assume that genes within 20kb are close to each other and execute their function locally.(cis-regulation)
We suppose that these protein coding RNA are the potential cis-regulated targets of our DE lncRNAs. The Spearman rank correlation method makes no assumptions about the distribution of the data. It may therefore be more appropriate for data with large outliers that are not normally distributed. It is a robost version of pearson correlation.
An example of our correlation:
If we have 3 mRNA called A,B,C and 3 DE lincRNAs called a,b,c, we compute spearman correlation between the two sets of genes and get a table look like this:
Therefore, mRNAs A and C passed our first filtering condition. They are highly correlated either positively or negatively with at least one of DE lincRNA.
Pathway Analysis: KEGG
•KEGG (Kyoto Encyclopedia of Genes and Genomes)
•A “computer representation” of the biological system.
•Used for modeling and simulation, browsing and retrieval of data.
•Links datasets to known pathways and complexes (the “Pathway” database)
Pathway Analysis: DAVID
•DAVID (Database for Annotation, Visualization and Integrated Discovery)
•provides functional interpretation of large lists of genes derived from genomic studies.
•use DAVID Functional annotation chart to identify enriched annotation terms associated with user’s gene list.
Two top terms from DAVID functional annotation charts:
•Cell adhesion – Dysfunction of cell-adhesion and cell-migration occurs during cancer metastasis. Cellular adhesion and traction can allow cells to migrate.
•Alternative splicing – It provides cells with the opportunity to create protein isoforms of differing, even opposing, functions from a single gene. Cancer cells often take advantage of this flexibility to produce proteins that promote growth and survival.
Discussion on Pathways:
•Pathways identified for three cancer types all contain terms related “cancer”, which confirmed the feasibility of our indirect way of doing pathway analysis.
•Some pathways are common among three cancer types, which means that these pathways are either critical in the carcinogenesis or downstream pathways:
1.) PI3K-Akt signaling pathway
2.) MicroRNAs in cancer
1.) PI3K-Akt signaling pathway
PI3Ks, constitute a lipid kinase family characterized by their ability to phosphorylate inositol phospholipids to generate the second messenger PIP3. RPTK activation results in PIP3 and PIP2 production by PI3K at the inner side of the plasma membrane. Akt interacts with these phospholipids, causing its translocation to the inner membrane. Activated Akt modulates the function of numerous substrates involved in the regulation of cell survival, cell cycle progression and cellular growth.
2.) MicroRNAs in cancer
MicroRNAs (miRNAs) are a class of small noncoding RNAs that control gene expression by targeting mRNAs and triggering either translation repression or RNA degradation. miRNAs function by annealing to complementary sites on the coding sequences or 3′ UTRs of target gene transcripts, where they promote recruitment of protein complexes that impair translation and/or decrease the stability of mRNA. Recent studies show that lncRNA may have biological function as “decoys” through sequestration of miRNAs to affect their regulation of expressed genes. Besides, lncRNAs can function as primary microRNA precursors(H19).
As the Venn Diagram shows, there were only 3 genes that is included in all three cancer data sets. Out of the 3, only one is well annotated, NM_003285.
Quick review of our project
•lncRNA was utilized to reach differentially expressed gene data
•Since lncRNAs are not completely annotated, we cannot make a direct correlation between lncRNA and pathway involved in cancer
•Refseq mRNAs are filtered by two conditions, and then these selected mRNAs are processed to analyze different pathways which are impacted by lncRNA
•Since lncRNA are involved in the regulation of some protein coding genes and these genes are responsible for some pathways in cancer cells, we conclude that lncRNA can have some affect on cancer.
•We identified 363 differentially expressed lincRNAs in 3 cancer types.
•From the clustering results, we can guess genes within same group may share similar regulatory roles or similar mechanisms, providing guidance to future experimental work.
•We infer from the pathway analysis that lincRNA may be involved in PI3K-Akt pathway, interaction with microRNAs, cell adhesion and alternative splicing to influence carcinogenesis.
Cheetham S W, Gruhl F, Mattick J S, et al. Long noncoding RNAs and the genetics of cancer[J]. British journal of cancer, 2013.
Da Wei Huang B T S, Lempicki R A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources[J]. Nature protocols, 2008, 4(1): 44-57.
Derrien T, Johnson R, Bussotti G, et al. The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression[J]. Genome research, 2012, 22(9): 1775-1789.
Kanehisa M, Goto S, Furumichi M, et al. KEGG for representation and analysis of molecular networks involving diseases and drugs[J]. Nucleic acids research, 2010, 38(suppl 1): D355-D360.
Patro R, Mount S M, Kingsford C. Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms[J]. arXiv preprint arXiv:1308.3700, 2013.
Wang L, Feng Z, Wang X, et al. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data[J]. Bioinformatics, 2010, 26(1): 136-138.
Wang, Kevin C., and Howard Y. Chang. “Molecular mechanisms of long noncoding RNAs.” Molecular cell 43.6 (2011): 904-914.