Integrating Query-Feedback Based Statistics into Informix Dynamic Server
Abstract
Statistics that accurately describe the distribution of data values in the columns of relational tables are essential for effective query optimization in a database management system. Manually maintaining such statistics in the face of changing data is difficult and can lead to suboptimal query performance and high administration costs. In this paper, we describe a method and prototype implementation for automatically maintaining high-quality single-column statistics, as used by the optimizer in IBM Informix Dynamic Server (IDS). Our method both refines and extends the ISOMER algorithm of Srivastava et al. for maintaining a multidimensional histogram based on query feedback (QF). Like ISOMER, our new method is based on the maximum entropy (ME) principle, and therefore incorporates information about the data distribution in a principled and consistent manner. However, because IDS only needs to maintain one-dimensional histograms, we can simplify the ISOMER algorithm in several ways, significantly improving performance. First, we replace the expensive STHoles data structure used by ISOMER with a simple binning scheme, using a sweep-line algorithm to determine bin boundaries. Next, we use an efficient method for incorporating new QF into the histogram; the idea is to aggregate, prior to the ME computation, those bins that do not overlap with the new feedback records. Finally, we introduce a fast pruning method to ensure that the number of bins in the frequency distribution stays below a specified upper bound. Besides refining ISOMER to deal efficiently with one-dimensional histograms, we extend previous work by combining the reactive QF approach with a proactive sampling approach. Sampling is triggered whenever (as determined from QF records) actual and estimated selectivities diverge to an unacceptably large degree.
Our combined proactive/reactive approach greatly improves the robustness of the estimation mechanism, ensuring very high-quality selectivity estimates for queries falling inside the range of available feedback while guaranteeing reasonably good estimates for queries outside of the range. Automatically updating statistics improves query execution through better selectivity estimates and reduces the total cost of ownership (TCO), since the database administrator need not update statistics manually for monitored columns.
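To make the one-dimensional simplification more concrete, the following is a minimal Python sketch of two of the ideas named above: deriving bin boundaries from the endpoints of QF ranges, and fitting bin masses that are consistent with the feedback. It is an illustration under stated assumptions, not the implementation described in the paper. The names `bin_boundaries` and `fit_max_entropy`, the `(low, high, fraction)` feedback format, and the use of plain iterative scaling as the ME solver are all assumptions introduced here; the feedback is assumed to be mutually consistent, and aggregation of untouched bins, pruning, and sampling are omitted.

```python
from typing import List, Tuple

# Hypothetical feedback record: (low, high, observed fraction of rows in [low, high)).
Feedback = Tuple[float, float, float]


def bin_boundaries(domain: Tuple[float, float], feedback: List[Feedback]) -> List[float]:
    """Collect the endpoints of all feedback ranges, clamped to the domain,
    and sort them, so that every feedback range is a union of whole bins."""
    lo, hi = domain
    points = {lo, hi}
    for a, b, _ in feedback:
        points.add(max(lo, a))
        points.add(min(hi, b))
    return sorted(points)


def fit_max_entropy(bounds: List[float], feedback: List[Feedback],
                    iterations: int = 200) -> List[float]:
    """Iterative scaling (an assumed solver): start from the uniform
    distribution and repeatedly rescale the bins inside and outside each
    feedback range until the bin masses match the observed fractions."""
    total = bounds[-1] - bounds[0]
    # With no feedback, mass proportional to bin width is the uniform baseline.
    mass = [(b - a) / total for a, b in zip(bounds, bounds[1:])]
    for _ in range(iterations):
        for a, b, target in feedback:
            inside = {i for i in range(len(mass))
                      if bounds[i] >= a and bounds[i + 1] <= b}
            current = sum(mass[i] for i in inside)
            if current <= 0.0 or current >= 1.0:
                continue  # this record cannot be enforced by rescaling
            in_scale = target / current
            out_scale = (1.0 - target) / (1.0 - current)
            mass = [m * (in_scale if i in inside else out_scale)
                    for i, m in enumerate(mass)]
    return mass


if __name__ == "__main__":
    qf = [(0.0, 50.0, 0.8), (25.0, 75.0, 0.3)]
    bounds = bin_boundaries((0.0, 100.0), qf)
    print(bounds)
    print(fit_max_entropy(bounds, qf))
```

Because every feedback endpoint becomes a bin boundary, each feedback range covers whole bins only, which is what allows each constraint to be enforced by a simple rescaling step.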