Statistics are another area of SQL that seems to be fairly poorly understood: what they are, how they are used and how they are maintained.
Statistics store information about the distribution of data in a column or set of columns. That information includes the density of the key columns, which gives a measure of the uniqueness of the index, and a histogram that describes the distribution of values within the column.
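The density vector and histogram of an existing statistics object can be inspected with DBCC SHOW_STATISTICS. As a sketch (the table and index names here are hypothetical):

```sql
-- Show the header, density vector and histogram for the statistics
-- behind a hypothetical index on a hypothetical Orders table
DBCC SHOW_STATISTICS ('dbo.Orders', 'idx_Orders_OrderDate');

-- Return just the histogram
DBCC SHOW_STATISTICS ('dbo.Orders', 'idx_Orders_OrderDate') WITH HISTOGRAM;
```

The histogram output shows, for each step, the key value at the top of the step and the estimated number of rows equal to and below that value.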
When an index is created, statistics will automatically be created on the index columns. The stats on an index cannot be dropped while the index exists. If the index is a multi-column one, density information is stored for each left-based prefix of the key columns, but the histogram is only stored for the leading column.
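To see this in action, create an index and then look at the statistics objects on the table; the stats object created for the index shares the index's name. A quick sketch, again with hypothetical names:

```sql
-- Creating an index automatically creates a statistics object
-- with the same name as the index
CREATE INDEX idx_Orders_Customer ON dbo.Orders (CustomerID, OrderDate);

-- List the statistics on the table and see where each came from
SELECT s.name, s.auto_created, s.user_created
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.Orders');
```

Index statistics show 0 for both auto_created and user_created; optimiser-created column stats show auto_created = 1.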
It’s also possible to have statistics that aren’t associated with indexes. These are created automatically by the query optimiser (providing auto_create_statistics is enabled) or they can be created manually with the CREATE STATISTICS statement.
It’s not usually necessary to create statistics manually as the optimiser usually knows best what stats are required.
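For the occasions where a manual statistics object is wanted, the syntax is straightforward. A sketch with hypothetical names (the optimiser's auto-created column stats, by contrast, get system-generated names starting with _WA_Sys_):

```sql
-- Manually create column statistics on a column that has no index
CREATE STATISTICS st_Orders_ShipCity ON dbo.Orders (ShipCity);
```

This gives the optimiser density and histogram information for ShipCity without the overhead of maintaining an index on it.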
SQL keeps a count of the number of changes to each column and will update the statistics once it deems that sufficient changes have occurred (providing auto_update_statistics is enabled). The stats are not updated as part of the data modification itself; rather, the next select that requires those statistics will trigger the update. In most cases the auto-update doesn’t cause problems. It runs at the read-uncommitted isolation level and hence won’t cause blocking, and by default SQL will only sample a portion of the table to calculate the stats.
For larger tables, that threshold is 20% of the rows in the table. Sometimes, especially on ever-increasing columns (identity, date inserted) in larger tables, that threshold is too high and the statistics can become very inaccurate before the auto-update kicks in. In cases like that it’s sometimes necessary to update the statistics manually on a regular basis using the UPDATE STATISTICS statement.
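A scheduled manual update might look like the following (table and index names hypothetical; with no options specified, SQL uses the default sampling rate):

```sql
-- Update all statistics on the table with the default sample
UPDATE STATISTICS dbo.Orders;

-- Or target a single statistics object, such as the one behind an index
UPDATE STATISTICS dbo.Orders idx_Orders_OrderDate;
```

Running something like this nightly on tables with ever-increasing key columns keeps the histogram's upper steps close to the actual maximum values.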
The other problem with the auto-update is that it is sometimes not accurate enough. Where large tables have an uneven distribution of values in the column, the sampled update may miss values or miscalculate the distribution. In cases like this it may be necessary to update the statistics manually with a full scan, forcing SQL to evaluate the entire table. That option should be used with care, and only where necessary, as it may have a negative impact on the performance of the entire system.
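The full-scan update, and the middle ground of an explicit sample size, look like this (names hypothetical):

```sql
-- Read every row in the table when building the statistics
UPDATE STATISTICS dbo.Orders idx_Orders_OrderDate WITH FULLSCAN;

-- A compromise: sample more than the default, but less than everything
UPDATE STATISTICS dbo.Orders idx_Orders_OrderDate WITH SAMPLE 50 PERCENT;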
Finally, when indexes are rebuilt (not reorganised) the statistics on those indexes are updated with a full scan.
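That distinction is worth remembering when planning index maintenance: a rebuild gives fullscan-quality statistics for free, a reorganise leaves the stats untouched. With hypothetical names:

```sql
-- Rebuild: the index's statistics are updated with the equivalent of a full scan
ALTER INDEX idx_Orders_OrderDate ON dbo.Orders REBUILD;

-- Reorganise: the index is defragmented but the statistics are not touched
ALTER INDEX idx_Orders_OrderDate ON dbo.Orders REORGANIZE;
```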
How are stats used
Statistics are used by the query optimiser to estimate how many rows each operator in the query plan will operate on. The row count affects the cost of the various operators; some work well on small row sets, others work well on large row sets. With a good estimate of the row count, the optimiser can calculate the cost of each operator fairly accurately and hence work out the cheapest plan for the particular query.
Without accurate statistics, the optimiser is likely to miscalculate the cost of the various query operators and hence the cost of the entire plan. This can lead to very poorly performing queries.
In a future post I’ll discuss how to identify out of date or inaccurate statistics.