Audiovisual database of Polish speech recordings

Magdalena Igras, Bartosz Ziółko, Tomasz Jadczyk


The largest audiovisual database of Polish speech, and the only one recorded in HD quality, is presented. The paper briefly describes similar databases for other languages and gives the technical specification of the AGH database. The challenges encountered while building the database are discussed, along with its planned applications.


image processing; speech recognition


