A model of genome length estimation based on k-mers detection

Mateusz Garbulowski, Andrzej Polański


Estimating genome length directly from raw sequencing data provides practical knowledge about the size of the DNA sequence at an early stage of analysis. In our research, we created a model based on random sampling of k-mers (very short DNA fragments) and used it to predict genome size. We then compared the model results with empirical whole-genome sequencing data.
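To illustrate the general idea of predicting genome size from k-mers, the following is a minimal Python sketch of a commonly used counting estimator: genome length is approximated by the total number of k-mer occurrences in the reads divided by the typical k-mer depth. It is not the authors' model, which is not specified in this abstract. The function names, the use of the median as the typical depth, and the simplifying assumptions (error-free reads, a single strand, a repeat-free random genome) are all illustrative choices made here.

```python
import random
import statistics
from collections import Counter


def random_genome(length, rng):
    """Hypothetical test genome: uniform random bases, so repeats are unlikely."""
    return "".join(rng.choices("ACGT", k=length))


def sample_reads(genome, read_len, n_reads, rng):
    """Draw error-free reads with uniformly random start positions (assumption)."""
    n_starts = len(genome) - read_len + 1
    starts = (rng.randrange(n_starts) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_length(counts):
    """Estimate length as total k-mer occurrences / median k-mer depth.

    The median depth is used here as a robust stand-in for the peak of the
    k-mer depth histogram; this is an illustrative choice.
    """
    total = sum(counts.values())
    typical_depth = statistics.median(counts.values())
    return total / typical_depth
```

Under these assumptions, calling `estimate_genome_length(count_kmers(reads, 15))` on reads sampled from a 20,000 bp random genome (for example 6,000 reads of 100 bp) should give a value close to the true length. Real data would additionally require handling of sequencing errors, repeats, and both strands.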


genome length estimation; genome size; sequencing model





DOI: http://dx.doi.org/10.21936/si2015_v36.n4.739