Random Forest Classifier for Early-Stage Protein Structure Prediction

Tomasz Smolarczyk, Katarzyna Stąpor


The tertiary structure of a protein is difficult to determine experimentally, so it is common practice to predict it from the secondary structure or from an early-stage structure, which can in turn be predicted from the primary structure (the amino acid chain). This article presents a Random Forest classifier applied to early-stage structure prediction using physicochemical features and conformation parameters.


random forest; decision trees; classification; protein early-stage prediction

