Splice site identification in human genome using random forest

Pashaei E., Ozen M., AYDIN N.

HEALTH AND TECHNOLOGY, vol.7, no.1, pp.141-152, 2017 (ESCI)

  • Publication Type: Article
  • Volume: 7 Issue: 1
  • Publication Date: 2017
  • Doi Number: 10.1007/s12553-016-0157-z
  • Journal Indexes: Emerging Sources Citation Index (ESCI), Scopus
  • Page Numbers: pp.141-152
  • Keywords: Splice site prediction, DNA encoding methods, Random Forest classifier, Gene detection, Support vector machines, Prediction, RNA, Information, SVM
  • Kocaeli University Affiliated: No


Gene identification has become an increasingly important task owing to developments in the Human Genome Project. Splice site prediction lies at the heart of identifying human genes, so developing new methods that detect splice sites accurately is crucial. Machine learning classifiers are used to detect splice sites, and their performance depends mainly on the DNA encoding method (feature extraction) and on feature selection. Feature extraction methods try to capture as much of the information in the DNA sequences as possible, while feature selection methods provide useful biological knowledge by removing redundant information. According to the literature, Markovian models are popular encoding methods, and the support vector machine (SVM) is regarded as the best algorithm for classifying splice sites. However, random forest (RF) may outperform the SVM in this domain when the same Markovian encoding methods are used. In this study, we investigated the performance of RF for both feature selection and classification in the splice site domain. We proposed three methods, namely MM1-RF, MM2-RF and MCM-RF, by combining RF with the first-order Markov model (MM1), the second-order Markov model (MM2), and the Markov chain model (MCM). We compared the performance of RF with that of the SVM on the HS3D and NN269 benchmark datasets. We also evaluated the proposed methods against other current state-of-the-art methods such as Reduced MM1-SVM, SVM-B and LVMM2. The experimental results show that RF outperforms the SVM on both the donor and acceptor datasets when the same Markovian encoding methods are used. Furthermore, the RF classifier is much faster than the SVM classifier in detecting splice sites.
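
The abstract does not spell out how the MM1 encoding is computed, so the following is only a minimal sketch of one common reading of it: position-specific first-order transition probabilities are estimated from aligned training sequences, and each sequence is then represented by the probabilities of its own transitions. The function names `fit_mm1` and `encode_mm1`, the add-one pseudocount, and the restriction to the letters A, C, G and T are assumptions made for illustration, not details taken from the paper.

```python
ALPHABET = "ACGT"  # assumption: sequences contain only these four letters


def fit_mm1(sequences, pseudocount=1.0):
    """Estimate position-specific first-order Markov parameters.

    Returns (first, trans) where first[a] approximates P(s_0 = a) and
    trans[i][(a, b)] approximates P(s_i = b | s_{i-1} = a) for i >= 1.
    The pseudocount (an assumption) avoids zero probabilities.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must be aligned to the same length")

    first_counts = {a: pseudocount for a in ALPHABET}
    # Index 0 is unused: position 0 has no preceding nucleotide.
    pair_counts = [
        {(a, b): pseudocount for a in ALPHABET for b in ALPHABET}
        for _ in range(length)
    ]
    for seq in sequences:
        first_counts[seq[0]] += 1
        for i in range(1, length):
            pair_counts[i][(seq[i - 1], seq[i])] += 1

    total = sum(first_counts.values())
    first = {a: c / total for a, c in first_counts.items()}

    trans = [None]  # placeholder so that trans[i] lines up with position i
    for i in range(1, length):
        probs = {}
        for a in ALPHABET:
            row_total = sum(pair_counts[i][(a, b)] for b in ALPHABET)
            for b in ALPHABET:
                probs[(a, b)] = pair_counts[i][(a, b)] / row_total
        trans.append(probs)
    return first, trans


def encode_mm1(seq, first, trans):
    """Map one sequence to a numeric feature vector of the same length."""
    features = [first[seq[0]]]
    for i in range(1, len(seq)):
        features.append(trans[i][(seq[i - 1], seq[i])])
    return features


# Toy usage with made-up sequences (not data from HS3D or NN269).
train = ["AGGT", "AGGT", "CGGT", "AGGA"]
first, trans = fit_mm1(train)
vector = encode_mm1("AGGT", first, trans)
```

Vectors produced this way could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`, whose `feature_importances_` attribute gives one possible basis for feature selection. Whether the study used that library or that selection criterion is not stated in the abstract.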