from International Innovation by Hao Zhu –
Researchers from Southern Medical University, China, have developed an effective computational approach that helps elucidate long non-coding RNAs’ DNA-binding motifs and binding sites.
Professor Hao Zhu describes his work on the evolution and function of long non-coding RNAs, explaining how his new computational approach could provide novel information towards an improved understanding of when and how they regulate epigenomic modification.
Firstly, can you explain why you chose to focus your research on long non-coding RNAs (lncRNAs) and provide some context regarding the importance of your computational studies to the medical sciences?
Less than 2 per cent of human DNA encodes protein-coding genes, and there are only slight differences in these highly conserved genes between species such as humans, mice and even fish. This makes it difficult to explain the distinct phenotypes of animals. Recently, substantial DNA regions previously classified as ‘dark matter’ were found to encode long RNAs that lack protein-coding capacity, and the clade- or species-specificity of these lncRNAs could explain the distinctive phenotypic differences. When I became an independent researcher in 2009, the many newly discovered genes meant new opportunities and challenges.
Many non-coding RNAs bind to DNA in unconventional ways, and lncRNAs are not an exception. Since they bind to both DNA sequences and DNA/histone-modification proteins such as DNA methyltransferase and polycomb repressive complexes, the function of many lncRNAs is to regulate epigenomic modification and alter the genome state without changing its DNA sequence. Many diseases, including cancers, are caused by dysregulated gene expression, often as a result of dysregulated epigenomic modification. I am using computational methods to systematically reveal both the DNA-binding motifs within the lncRNAs and their corresponding binding sites in genomes. Taken together, this information will significantly help to explain the mechanisms of many diseases.
How does your research build on the limited information available on lncRNA functional domains, including their origins and evolution?
Experimentally identifying lncRNAs using sequencing techniques was expensive, so I began by analysing the origin and evolution of individual lncRNAs such as HOTAIR and ANRIL. Now, lncRNAs in humans and mice have been systematically identified by the GENCODE project using RNA sequencing (RNA-seq), yet their functional domains remain unclear.
To analyse lncRNA functional domains, homologous genes are often needed. Sequence alignments can reveal which parts are conserved and which are not. Using the Infernal program developed by Professor Sean Eddy (Janelia Farm Research Campus, USA) running on the Tianhe 2 supercomputer, I spent more than a year searching 16 mammalian genomes for orthologues of the 13,562 human lncRNAs identified by GENCODE. The sequences obtained facilitate the analysis of the origin, evolution and functional domains of human lncRNAs.
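The idea that an alignment reveals conserved and non-conserved parts can be illustrated with a minimal Python sketch. It assumes the orthologous sequences have already been aligned to equal length with `-` as the gap character. The function names, the 0.8 threshold and the toy sequences are hypothetical and are not taken from Infernal or from Zhu's pipeline.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """Per column, the fraction of sequences sharing the most common non-gap base."""
    n = len(aligned_seqs)
    scores = []
    for column in zip(*aligned_seqs):
        bases = [b for b in column if b != "-"]
        top = Counter(bases).most_common(1)[0][1] if bases else 0
        scores.append(top / n)
    return scores


def conserved_blocks(scores, threshold=0.8, min_len=3):
    """Half-open (start, end) column ranges where every score meets the threshold."""
    blocks, start = [], None
    for i, s in enumerate(scores + [0.0]):  # sentinel closes a trailing block
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_len:
                blocks.append((start, i))
            start = None
    return blocks


# Toy alignment of one lncRNA region in three species (made-up sequences).
alignment = [
    "GGAUCCAUGC",
    "GGAUCGUUGC",
    "GGAUC-A-GC",
]
scores = column_conservation(alignment)
blocks = conserved_blocks(scores)
print(blocks)
```

In this toy case only the first five columns form a conserved block; the two fully conserved columns at the end are shorter than `min_len` and are dropped. A conserved block of this kind is a candidate functional domain, not proof of one.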
By what means do your computational method and program, LongTarget, overcome the difficulties involved in deciphering lncRNA functions and erroneous genome methylation?
Identifying the DNA-binding motif (normally 40-80 base pairs) in an lncRNA (which may be up to 90 kilobases long) is difficult. LongTarget adopts a simple method: reconstruct the DNA sequence of interest using base-pairing rules, align the reconstructed DNA to the lncRNA, and analyse and recognise binding motifs and binding sites simultaneously in the aligned regions. We have analysed a considerable number of lncRNAs that function in cis or control an imprinting cluster, and confirmed that this method works well. A challenge we still face is to analyse lncRNAs that function in trans, that is, lncRNAs that bind to remote sites on the same chromosome or even to sites on other chromosomes.
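The reconstruct-then-align idea described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not LongTarget's actual algorithm. It assumes a single rule set (the parallel pyrimidine-motif Hoogsteen pairing, in which a third-strand U recognises an A-T pair and a C recognises a G-C pair) and uses an exact, ungapped match in place of a real alignment. All names and sequences are hypothetical.

```python
# Assumed, simplified rule set: base on the DNA purine strand -> RNA base
# that would bind it in a parallel pyrimidine-motif triplex.
HOOGSTEEN_RULE = {"A": "U", "G": "C"}


def reconstruct_binding_rna(dna_strand):
    """Rewrite a DNA strand as the RNA that would bind it under the rule.

    Bases without a partner in the rule become 'N', which never matches.
    """
    return "".join(HOOGSTEEN_RULE.get(b, "N") for b in dna_strand.upper())


def longest_shared_region(rna_probe, lncrna, min_len=4):
    """Longest exact common substring of the probe and the lncRNA.

    Returns (probe_start, lncrna_start, length), or None if shorter than min_len.
    """
    lncrna = lncrna.upper().replace("T", "U")
    best = (0, 0, 0)
    prev = [0] * (len(lncrna) + 1)
    for i, p in enumerate(rna_probe, 1):
        cur = [0] * (len(lncrna) + 1)
        for j, r in enumerate(lncrna, 1):
            if p == r and p != "N":
                cur[j] = prev[j - 1] + 1  # extend the match along the diagonal
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best if best[2] >= min_len else None


# Made-up DNA region and lncRNA for illustration only.
dna = "TTAGGAGAAGGATT"
lncrna = "GGGAUCCUCUUCCUAGG"

probe = reconstruct_binding_rna(dna)
hit = longest_shared_region(probe, lncrna)
if hit is not None:
    dna_start, rna_start, length = hit
    binding_site = dna[dna_start:dna_start + length]      # site on the DNA
    binding_motif = lncrna[rna_start:rna_start + length]  # motif in the lncRNA
    print(binding_site, binding_motif)
```

Because the probe keeps the coordinates of the DNA it was built from, a single aligned region yields both the binding site on the DNA and the binding motif in the lncRNA, which is the sense in which the two are recognised simultaneously. A realistic implementation would need several rule sets, tolerance for mismatches and gaps, and scoring of the aligned regions.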
Do you have any plans to develop this research further in the next few years?
Yes. My first task is to improve the running speed of LongTarget by implementing an OpenMP version. I then plan to systematically analyse the DNA-binding motifs and binding sites of simian-specific, human-specific and mouse-specific lncRNAs, as this is likely to be important for deciphering why we are human and to what extent simians are different from rodents. Meanwhile, I plan to analyse the sequencing data of cancers, to pursue a mechanistic understanding of dysregulated and cell-specific gene expression in cancer cells and to determine whether cancers show commonalities in epigenomic modification.
A spotlight on genomic ‘dark matter’
LONG NON-CODING RNAs (lncRNAs) comprise a significant portion of the human genome, but do not encode proteins, and were once thought of as the ‘dark matter’ of the genome. While difficult to detect and characterise, they are known to play important roles in the control of gene expression and other cellular processes by regulating epigenomic modification.
Genes can be silenced by methylating the DNA that encodes them, by modifying the chromatin structure of the chromosome itself, or both. This epigenomic modification, performed by DNA methyltransferases (DNMTs) and polycomb repressive complexes (PRCs), is often found to be misregulated in cancerous cells where the cell cycle is incorrectly driven by consistently expressed genes. There are relatively few types of DNMTs and PRCs, so cells utilise lncRNAs with specific DNA-binding motifs to target these proteins to the required sites of the genome. The lncRNA forms a triplex with its complementary sequence on the double-stranded DNA using Hoogsteen base pairing. The identification of DNA-binding motifs in lncRNAs and their corresponding binding sites in the genome is therefore crucial for detecting target genes of lncRNAs and examining correct and erroneous epigenomic modification.
FINDING A BINDING MOTIF IN A HAYSTACK
The discovery of lncRNA DNA-binding motifs is experimentally very difficult, because little information is available on their structures, and lncRNAs can be up to 90 kilobases in length while their typical DNA binding motifs are just 40-80 bases long. Professor Hao Zhu and his team at Southern Medical University, Guangzhou, China, have developed an effective computational tool, LongTarget, that can predict lncRNAs’ DNA-binding motifs and the genes they target. “The human genome contains more than 13,562 lncRNAs, so computationally predicting their DNA-binding motifs and genomic binding sites is highly valuable,” Zhu elaborates. “Predicting an lncRNA’s DNA-binding motif could reveal how mutations would influence its ability to bind to DNA, while predicting an lncRNA’s binding site(s) reveals which genes it epigenomically regulates.”
Zhu tested LongTarget by analysing lncRNAs that silence genes at known genomic regions (imprinting clusters). The program’s predictions were found to be sensitive and specific, showing that it is feasible to predict many lncRNAs’ DNA-binding motifs and binding sites. Zhu also showed that, in addition to promoter regions and the common DNA methylation targets known as CpG sites, lncRNAs bind to many transposable elements, possibly including those within the lncRNA genes themselves. He proposes that lncRNAs targeting these transposable elements may help to regulate the highly tissue-specific expression of lncRNAs themselves. As the data available on these enigmatic molecules increase, innovative computational studies and tools such as that developed by Zhu will pave the way for novel discoveries in cell biology.
FUTURE APPLICATIONS
Determining the function of lncRNAs is fundamental to our knowledge of the epigenomic regulation of genes and the control of cellular processes. Cancers are caused by misregulated gene expression and it is becoming increasingly clear that lncRNAs may play a significant role. “For a long time, researchers believed that mutations in protein-coding genes or their associated transcriptional factors were the key drivers of many diseases, including cancer,” Zhu explains. “Now, increasing evidence shows that aberrant lncRNA-regulated epigenomic modification results in the misregulation of protein-coding genes. The elucidation of how lncRNAs alter gene expression is therefore valuable for the diagnosis and treatment of cancers and other diseases.”
LongTarget will also allow Zhu to answer some questions from his previous research on the origin and evolution of lncRNAs; comparing sequences of lncRNAs in related species could provide a useful insight into their often short evolutionary history. He is also striving to improve the efficiency of the program itself, and welcomes any suggestions or collaboration enquiries.
Availability – Website of LongTarget: lncrna.smu.edu.cn