Principal component analysis in query selecivity estimation

Dariusz R Augustyn


Query selectivity allows to estimate the size of query results. It is required for obtaining the optimal method of query execution. This is a main goal of a query optimizer activities. Selectivity calculations for queries with a complex multi-attribute selection condition require a non-parametric estimator of multi-dimensional probability density function of distribution of table attribute values. Using a multi-dimensional histogram as a representation of multi-dimensional distribution is very space-consuming for high dimensions. The approach based on Principal Component Analysis allows to reduce dimensionality and makes the representation space efficient. Additionally the attribute value independence rule (with multiplicity of simple selectivities) may be used in a dimensions-reduced space so the method of the PCA-based selectivity estimation becomes simpler and more effective. The paper also presents the implementation of the proposed solution in DBMS Oracle as the extension of the query optimizer by using Oracle Data Cartridge Interface Statistics module.


query optimizer; PCA; query selectivity; Oracle ODCIStats; histogram

Full Text:

PDF (Polski)


Krzanowski W. J.: Principles of Multivariate Analysis: A User's Perspective. Oxford University Press, 2000.

Possala V., Ioannidis Y. E.: Selectivity Estimation without the Attribute Value Independence Assumption. Proc. of the 23rd Int. Conf. on Very Large Databases, The VLDB Journal, Athens 1997.

Gunopulos D., Kollios G., Tsotras V. J.: Approximating Multi-Dimensional Aggregate Range Queries Over Real Attributes. ACM SIGMOD 2000, Dallas 2000.

Lee. J., Deok-Hwan K., Chin-Wan Ch.: Multi-dimensional Selectivity Estimation Using Compressed Histogram Estimation Information. Proc. of ACM SIGMOD Int. Conf. on Management of Data. ACM, Philadelphia 1999.

Chakrabarti K., Garofalakis M., Rastogi R., Shim K.: Approximate Query Processing Using Wavelets. VLDB Journal. vol. 10, no. 2-3, Springer-Verlag, New York 2001.

Bruno N., Chaudhuri S., Gravano L.: STHoles: a multidimensional workload-aware histogram. Proc. of ACM SIGMOD Int. Conf. on Management of Data 30(2). ACM, New York 2001.

Getoor L., Taskar B., Koller D.: Selectivity estimation using probabilistic models. Proc. Of ACM SIGMOD Int. Conf. on Management of Data. ACM, New York 2001.

Augustyn. D. R.: Applying advanced methods of query selectivity estimation in Oracle DBMS. Advances in Soft Computing. Man-Machine Interactions. Springer-Verlag, Berlin Heidelberg 2009.

Augustyn. D. R., Warchał Ł.Ś Zastosowanie sieci Bayesa w szacowaniu selektywności zapytań w optymalizatorze zapytań serwera bazy danych Oracle. Studia Informatica, Vol. 32, No. 1A (94), Gliwice 2011.

Oracle 10g. Using extensible optimizer (2010). 01/appdev.101/b10800/dciextopt.htm.

Oracle Database SQL Reference. Analyze (2011). http//