SQL University: Introduction to Indexes, Part the Third

July 23, 2010 at 5:00 am (SQL Server 2005, SQL Server 2008, SQLServerPedia Syndication, TSQL)

Nice to see most of you have managed to fight your way through the shoggoths outside to attend another lecture at the Miskatonic branch of SQL University. This will be the third and final part of the introduction to indexes lecture. Please, if you’re going mad, step out into the hall. Our previous two lectures introduced the concept of indexes and then talked about two types of indexes, clustered and nonclustered. This lecture will cover the concept of statistics as they relate to indexes.

If you followed the previous lecture then you know that indexes are stored in a Balanced Tree or B-Tree structure. You know that this storage mechanism is intended to provide fast retrieval of data. But, how can the query engine inside SQL Server know which index has the information it wants? How can the query engine know that a given index will provide a mechanism to speed up the query that is being run? Statistics.

Statistics are information that describes the selectivity of a column or set of columns in the database. They may or may not be part of an index. Columns alone can have statistics, and indexes can have statistics. The query optimizer uses statistics to determine how it will access data. You may have the greatest index in the world, but if your statistics are wrong, that index won't get used. By default, statistics are automatically created and automatically updated. The automatic update of statistics is based on a formula that depends on changes to the data and changes to the structure, as outlined in the Microsoft documentation. In general, it's a very good practice to leave these settings alone, because SQL Server does a great job of maintaining statistics.
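If you ever want to confirm that those defaults are in place, the settings are visible in sys.databases and can be changed with ALTER DATABASE. This is just a sketch against AdventureWorks2008R2 as an example, not a recommendation to change anything:

-- Check whether automatic creation and automatic update of statistics are enabled.
SELECT name,
       is_auto_create_stats_on,
       is_auto_update_stats_on
FROM sys.databases
WHERE name = 'AdventureWorks2008R2';

-- And, if you really must, the settings can be toggled (again, best left on).
ALTER DATABASE AdventureWorks2008R2 SET AUTO_UPDATE_STATISTICS ON;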

But what are statistics really?

Statistics are basically a sampling of the data, stored in such a way that the optimizer can understand what the data looks like without actually accessing it. It is, absolutely, meta-data: data about data. The sampling is usually a scan across the data that reads every so many rows, a percentage of the rows in the table, but it can also be a full scan that looks at all the data in the table. The meta-data that is collected is broken down into several pieces. Statistics are represented through:

  • The histogram, which provides a picture of the data in a column
  • The header, which provides information about the statistics themselves, meta-data about the meta-data
  • Density information, which describes how selective the column, or combination of columns, tracked by the statistics is

In short, a lot of information. To start understanding what this all means, let’s see how to retrieve a set of statistics:

DBCC SHOW_STATISTICS ("Sales.SalesOrderDetail", IX_SalesOrderDetail_ProductID);

This query uses tables stored in the AdventureWorks2008R2 database to display statistics information. Just a side note: so many DBCC commands have been replaced by dynamic management views that I honestly expected one for statistics too, but it's not there. Running the query returns the following set of information:

The first result set at the top is the header. Most of the information you see there should make sense to you based on what we've been talking about. A couple of points are worth noting: the "Rows Sampled" column shows how many rows were examined while the statistics were being built; Steps gives you an indication of the size of the histogram; and Density shows the selectivity of the statistics, although that column is not the one used by the optimizer. The main thing I use the header information for is to see when the statistics were last updated.
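As a side note, if the last-updated date is all you're after, you don't even need DBCC SHOW_STATISTICS; a query against sys.stats with STATS_DATE will list every set of statistics on the table along with its date. Just a minimal sketch against the same table:

-- List every statistics object on the table and when it was last updated.
SELECT s.name AS statistics_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('Sales.SalesOrderDetail');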

The second result set shows the density of the columns used to define the index. You can see that each column is measured as it gets added to the index. This shows you how a compound index makes the data more and more selective. If those numbers are at all large, as opposed to vanishingly small like the ones presented here, you have a selectivity issue with your index. The less selective the index is, the less likely it is to be used.
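If you're curious where those density numbers come from, the density for a column (or combination of columns) is roughly one divided by the number of distinct values it contains. Here's a rough sketch that computes it by hand for the leading column so you can compare it against the density vector; a sampled set of statistics won't match this exactly:

-- Manually compute density for ProductID: 1 / (number of distinct values).
SELECT 1.0 / COUNT(DISTINCT ProductID) AS manual_density
FROM Sales.SalesOrderDetail;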

Finally, the last result set is the histogram. What I have displayed is actually just the first seven of two hundred rows. A histogram is basically a small snapshot of the data in your index, specifically the data in the first column of your index. That's worth knowing. If you choose, for some crazy reason, to build an index with a bit column as the leading edge (that means the first column) and you're surprised that your index isn't getting used, this is where you can go to understand why. The histogram is one of the main tools that SQL Server uses to determine which operations it will use to retrieve your data. Here's how it works:

  • RANGE_HI_KEY shows the upper bound column value, the top of the range of rows, that this entry in the histogram represents.
  • RANGE_ROWS displays the number of rows within the range represented by this entry in the histogram (except, for this entry, there is no range; it's a single value).
  • EQ_ROWS tells you how many rows are equal to the value represented by the RANGE_HI_KEY. That number is large here because there is no range of rows; this value is represented by approximately 2838.166 rows inside the data.
  • DISTINCT_RANGE_ROWS shows you the distinct number of rows within each stepped range of values. In this case, again, because we're not dealing with a range, that number is zero.
  • AVG_RANGE_ROWS displays the average number of duplicate values within the range.
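Incidentally, if you only want one of these three result sets, DBCC SHOW_STATISTICS accepts options to return just that piece. For example, to see only the histogram:

DBCC SHOW_STATISTICS ("Sales.SalesOrderDetail", IX_SalesOrderDetail_ProductID) WITH HISTOGRAM;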

Scrolling down within that set of data you can see what a range of values looks like:

In this case the range covers two values with the upper limit being 805 and the previous upper limit being 801.
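You can sanity-check a histogram step against the real data yourself. A count over the step's range (everything above the previous RANGE_HI_KEY, up to and including this one) should come out close to RANGE_ROWS plus EQ_ROWS, assuming the statistics are current. This is just a sketch against the same table:

-- Rows covered by the histogram step with RANGE_HI_KEY = 805 (previous step ended at 801).
SELECT COUNT(*) AS ActualRows
FROM Sales.SalesOrderDetail
WHERE ProductID > 801
  AND ProductID <= 805;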

In short, you can see what the optimizer sees and determine whether you should have more up-to-date statistics. For example, if I query the table for ProductID values between 802 and 805, which would fall within step 72 of the histogram, I get 246 rows, not the 442 that the histogram says I should see. Since the statistics on the table are a little old (they haven't been updated since June 23rd), I'll update them. You can use sp_updatestats, which will sample the data and generate a histogram with some guesses, or you can do this:

UPDATE STATISTICS Sales.SalesOrderDetail WITH FULLSCAN

Now when I run DBCC SHOW_STATISTICS, the header information shows that the number of rows sampled equals the number of rows in the table. My new histogram has almost exactly the same distribution, but the row counts are extremely accurate. In my case RANGE_ROWS is equal to 200 and EQ_ROWS is equal to 46, which adds up to the 246 rows we got from querying the data directly. That represents a very accurate set of statistics.

So how do you know when to update your statistics? It's really hard to say. If you start seeing different execution plans than you did previously, or if you notice that the date on the statistics seems old, you might want to update them. How do you determine whether to use a full scan or sampled statistics? Again, hard to say. In most circumstances a sampled set of statistics should work, but in some cases, if the data distribution is skewed, you may want to run a full scan.
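For reference, the sample size is something you can control when you update statistics manually. These are just sketches of the options, not a recommendation for any one of them:

-- Let SQL Server pick the sample size (the default behavior).
UPDATE STATISTICS Sales.SalesOrderDetail IX_SalesOrderDetail_ProductID;

-- Force a specific sample size.
UPDATE STATISTICS Sales.SalesOrderDetail IX_SalesOrderDetail_ProductID WITH SAMPLE 50 PERCENT;

-- Read every row, as in the example above.
UPDATE STATISTICS Sales.SalesOrderDetail IX_SalesOrderDetail_ProductID WITH FULLSCAN;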

There is clearly a great deal more to discuss about statistics and how to maintain them, and a lot more to discuss about indexes; I didn't even get to index fragmentation. For today, you should remember:

  • Statistics are used to determine which indexes work well for a query
  • Statistics are composed of multiple sets of data: the header, density information, and the histogram
  • Histograms have a maximum of 200 steps
  • Histograms show the data distribution
  • Despite automatic maintenance, statistics can be out of date
  • The sampled update of statistics doesn’t always create the best set of data

For more information on statistics, I recommend Microsoft's Books Online for SQL Server. It's a great resource.

But I hear the whippoorwills kicking up and I see a funny looking fog developing outside. It's getting awfully dark for mid-day. One of the other professors stepped on some squealing white… thing… on his way in today, and one of the TAs went stark raving mad at something they saw in the rectory. Time to end class. Best of luck on your journey back to your dorms. I'm locking myself in my office.

While I'm there, I might churn out one or two more entries on indexes. Look for them in the SQL University E-Book (no doubt to be published posthumously).


Index Statistics

September 24, 2008 at 8:04 am (TSQL)

The other day a developer showed up at my desk. They were getting time-outs in production on a query that didn't normally give them trouble. With the parameters they provided, I ran the query. It ran for over 30 seconds, past the application-side timeout, before it returned its data. So I ran it again with an execution plan. It had a bunch of index scans with loop joins across thousands of rows and even created a table spool with 700 million rows as part of the process. Clearly not good.

Next I looked at the query itself. It wasn't too bad, as these things go. It was probably moving too many columns, and apparently the business wanted a pivot on the data, since they were using an aggregate method to pivot some of the columns. The only thing that jumped out at me initially was the use of MAX(1) instead of TOP 1 for some of the versioned queries (look for an article in the next, and last, SQL Standard magazine on the topic of MAX vs. TOP).
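To make the MAX versus TOP point a little more concrete, here's a rough sketch of the kind of rewrite involved; the table and column names are invented for illustration and are not the production query:

-- Hypothetical versioned table and parameter, purely for illustration.
DECLARE @DocumentId INT;
SET @DocumentId = 42;

-- Aggregate pattern: ask for the highest version number.
SELECT MAX(VersionId)
FROM dbo.Document
WHERE DocumentId = @DocumentId;

-- TOP pattern: the same answer, expressed as "give me the first row in descending order",
-- which gives the optimizer a different set of choices when it builds the plan.
SELECT TOP (1) VersionId
FROM dbo.Document
WHERE DocumentId = @DocumentId
ORDER BY VersionId DESC;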

I let the developer know that it was going to take a while to clean up everything, and I got to work. A couple of other DBAs saw what I was looking at and came over to offer suggestions (that sounds worse than it is; these guys are smart and helpful). There was a lot of discussion about the table spool. I said that changing the MAX queries to use TOP would fix it. There was a bit of argument, but the proof is always in the pudding, so I made the change (in a non-production environment, of course) and, sure enough, the table spool went away. We still had a bunch of index scans, and the loop joins looked funny since there were 15,000 rows being filtered against 15,000 rows. One of the DBAs with me, Scott, suggested that the statistics were out of date. So I updated the stats and reran the query. It didn't make a bit of difference. Another DBA, Det, suggested updating the statistics with a FULL SCAN. Sure enough, not only did the index scans go away, but the loop joins were replaced with HASH joins that worked much better with the data we were returning.

A little more investigation found that 5,000 rows had been added to the table for the organization we were working with, but somehow the random sampling of rows used by the normal statistics update wasn't reflecting this change accurately. It needed the FULL SCAN. So, while scheduling the full scan is easy enough, we wanted to know how to tell when it's really needed. Determining the age of the statistics is easy:

SELECT ind.name,
       stats_date(ind.id, ind.indid)
FROM sysindexes ind
WHERE object_name(ind.id) = 'tablename'

But figuring out whether the FULL SCAN is needed is still a bit elusive. Det's suggestion was to query the statistics, run a FULL SCAN update, and then query the statistics again, comparing the two. While that might show that the FULL SCAN you just did was needed, I'm still trying to find a more efficient mechanism.
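One thing I've been poking at (no promises that it's the answer) is the rowmodctr column in sysindexes, which keeps a rough count of modifications since the statistics were last updated. A large number relative to rowcnt at least tells you that the existing statistics describe data that has changed quite a bit; this is just a sketch in the same style as the query above:

SELECT ind.name,
       ind.rowcnt,
       ind.rowmodctr
FROM sysindexes ind
WHERE object_name(ind.id) = 'tablename'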
