Application of Gaussian mixture model and proteomic databases in the mass spectra analysis – architecture of software of comprehensive mass spectrometry data processing

Małgorzata Plechawska


This paper presents the design assumptions of software for the analysis of protein mass spectra. Each spectrum is modeled with a Gaussian mixture model (GMM), and the model parameters are estimated with the Expectation-Maximization (EM) algorithm. The estimated parameters are then used for further biological analysis. The software is integrated with four large protein databases available online, from which biological information about the proteins can be retrieved at a chosen level of detail.


GMM; mass spectrometry; proteomic knowledge base
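
The GMM fitting step named in the abstract can be illustrated with a minimal sketch. It is not the software described in the paper. It assumes that a spectrum is given as m/z values with intensities, that each intensity acts as a weight on its m/z point, and that the number of components and their starting values are chosen by hand. The function names `gaussian_pdf` and `fit_gmm_em` and the synthetic two-peak spectrum are hypothetical, introduced only for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_gmm_em(mz, intensity, means, stds, weights, n_iter=200, min_std=1e-3):
    """Fit a 1-D Gaussian mixture to a spectrum with weighted EM.

    Assumption: every m/z point counts with a weight equal to its intensity,
    i.e. the spectrum is treated like a histogram of ion counts.
    """
    means, stds, weights = list(means), list(stds), list(weights)
    k = len(means)
    total = sum(intensity)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each m/z point.
        resp = []
        for x in mz:
            p = [weights[j] * gaussian_pdf(x, means[j], stds[j]) for j in range(k)]
            s = sum(p)
            resp.append([v / s for v in p] if s > 0.0 else [1.0 / k] * k)
        # M-step: intensity-weighted updates of mean, spread and mixing weight.
        for j in range(k):
            nj = sum(y * r[j] for y, r in zip(intensity, resp))
            if nj <= 0.0:
                continue  # component received no mass; keep its old parameters
            means[j] = sum(y * r[j] * x for x, y, r in zip(mz, intensity, resp)) / nj
            var = sum(
                y * r[j] * (x - means[j]) ** 2 for x, y, r in zip(mz, intensity, resp)
            ) / nj
            stds[j] = max(math.sqrt(var), min_std)  # guard against collapse
            weights[j] = nj / total
    return means, stds, weights


# Synthetic, noise-free two-peak "spectrum" (illustrative values only).
mz = [1000.0 + 0.5 * i for i in range(201)]
intensity = [
    0.6 * gaussian_pdf(x, 1030.0, 4.0) + 0.4 * gaussian_pdf(x, 1070.0, 6.0)
    for x in mz
]

# Deliberately rough starting values for two components.
means, stds, weights = fit_gmm_em(
    mz, intensity, means=[1020.0, 1080.0], stds=[10.0, 10.0], weights=[0.5, 0.5]
)
for m, s, w in zip(means, stds, weights):
    print(f"peak at m/z {m:.2f}, std {s:.2f}, weight {w:.2f}")
```

In this sketch each fitted component stands for one peak: the mean gives the peak position on the m/z axis, the standard deviation its width, and the mixing weight its share of the total signal. Real spectra would first need pre-processing such as baseline correction and noise removal, and the number of components would have to be selected, for example with an information criterion.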



