Index Statistics

September 24, 2008 at 8:04 am (TSQL)

The other day a developer showed up at my desk. They were getting time-outs in production on a query that didn’t normally give them trouble. With the parameters they provided, I ran the query. It ran for over 30 seconds, well past the application-side timeout, before it returned its data. So I ran it again with an execution plan. It had a bunch of index scans with loop joins across thousands of rows and even created a table spool with 700 million rows as part of the process. Clearly not good.

Next I looked at the query itself. It wasn’t too bad, as these things go. It was probably moving too many columns, and apparently the business wanted a pivot on the data, since they were using an aggregate method to pivot some of the columns. The only thing that jumped out at me about the query initially was the use of MAX instead of TOP (1) in some of the versioned queries (look for an article on the topic of MAX vs. TOP in the next, and last, issue of SQL Server Standard magazine).

I let the developer know that it was going to take a while to clean up everything and got to work. A couple of other DBAs saw what I was looking at and came over to offer suggestions (it sounds worse than it is; these guys are smart and helpful). There was a lot of discussion about the table spool. I said that changing the MAX queries to use TOP would fix it. There was a bit of argument, but the proof is always in the pudding, so I made the change (in a non-production environment, of course) and sure enough the table spool went away. We still had a bunch of index scans, and the loop joins looked funny since 15,000 rows were being filtered against 15,000 rows. One of the DBAs with me, Scott, suggested that the statistics were out of date. So I updated the stats and reran the query. It didn’t make a bit of difference. Another DBA, Det, suggested updating the statistics with a FULL SCAN. Sure enough, not only did the index scans go away, but the loop joins were replaced with HASH joins that worked much better with the data we were returning.
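For reference, the statistics update that finally fixed things has this shape. It’s just a sketch; the table and index names are placeholders, not our actual production objects:

UPDATE STATISTICS dbo.tablename WITH FULLSCAN;

-- Or, to target only the statistics behind a single index:
UPDATE STATISTICS dbo.tablename indexname WITH FULLSCAN;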

A little more investigation found that 5,000 rows had been added to the table for the organization we were working with, but somehow the sampled selection of rows used by the normal statistics update wasn’t reflecting this change accurately. It needed the FULL SCAN. So, while scheduling the full scan is easy enough, we wanted to know how to tell whether it really was needed. Determining the age of the statistics is easy:

SELECT ind.name,
       STATS_DATE(ind.id, ind.indid)
FROM sysindexes ind
WHERE OBJECT_NAME(ind.id) = 'tablename'

But figuring out whether the FULL SCAN is needed is still a bit elusive. Det’s suggestion was to query the statistics, then run a FULL SCAN update, then requery the statistics and compare the two. While that might show that the FULL SCAN you just did was needed, I’m still trying to find a more efficient mechanism.
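If you want to try Det’s approach yourself, the rough shape of it looks like this. Again, tablename and indexname are placeholders; DBCC SHOW_STATISTICS returns the header, density vector and histogram that you’d compare before and after:

-- Capture the current statistics
DBCC SHOW_STATISTICS ('dbo.tablename', 'indexname');

-- Rebuild them from every row rather than a sample
UPDATE STATISTICS dbo.tablename indexname WITH FULLSCAN;

-- Requery and compare against the first result set
DBCC SHOW_STATISTICS ('dbo.tablename', 'indexname');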


Top Vs Max

June 4, 2008 at 7:38 am (SQL Server 2005, SQL Server 2008, TSQL)

I finished a set of research and tests. I put it all together and gave it to Chuck Heintzleman at SQL Server Standard (that web site can use some help). Hopefully he likes it and publishes it. I understand that the upcoming issue will be on performance tuning. If he doesn’t like it, I’ll get it over to Peter Ward at SQL Server Performance or to Steve Jones at SQL Server Central.

I’m posting this because I see so many web searches looking for information comparing TOP to MAX.

Here are some short answers for those interested; the data to support these suggestions is included with the article. ROW_NUMBER works best with smaller data sets. TOP has more consistent behavior over a range of data. MAX frequently resolves as a TOP operation, but sometimes as an aggregate, and it costs more when it resolves as an aggregate. I would suggest a bias toward TOP or ROW_NUMBER, but test this in your own environment. These behaviors were consistent between 2005 and 2008. I didn’t test with 2000.
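To make that concrete, here are the two patterns side by side, written against a versioned table like the one in the post below. The variable and value are made up purely for illustration:

DECLARE @DocumentId INT;
SET @DocumentId = 42; -- hypothetical document

-- MAX: frequently resolves as a TOP, but can resolve as an aggregate
SELECT MAX(v.VersionId)
FROM dbo.Version v
WHERE v.DocumentId = @DocumentId;

-- TOP: consistently resolves as a TOP operation
SELECT TOP (1) v.VersionId
FROM dbo.Version v
WHERE v.DocumentId = @DocumentId
ORDER BY v.VersionId DESC;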


Top vs. Max

March 21, 2008 at 2:56 pm (TSQL)

The company I work for has a very well defined need for versioned data. In a lot of instances we don’t do updates, we do inserts. That means you have to have mechanisms for storing the data that enable you to pull out the latest version of all the data, a particular version of it, or the data at a particular moment in time, regardless of version.

That means maintaining a version table and a series of inserts into various tables. Some tables will have a new row for pretty much every version; others may only have one or two versions out of a chain. With the help of a very smart Microsoft consultant, Bill Sulcius, we have a mechanism that works very well. However, questions about the ultimate tuning of the procedures remain. So we may have a query that looks like this:

SELECT *
FROM dbo.Document d
INNER JOIN dbo.Version v
ON d.DocumentId = v.DocumentId
AND v.VersionId = (SELECT TOP (1) v2.VersionId
                   FROM dbo.Version v2
                   WHERE v2.DocumentId = v.DocumentId
                   ORDER BY v2.DocumentId DESC, v2.VersionId DESC)

There’s a clustered index on the Version table that has DocumentId & VersionId in it. This query works great.
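For anyone who wants to play along at home, a minimal version of the two tables might look like this. The column types and the extra payload columns are guesses for illustration, not our actual schema:

CREATE TABLE dbo.Document (
    DocumentId INT NOT NULL PRIMARY KEY,
    DocumentName VARCHAR(50) NOT NULL -- hypothetical payload column
);

CREATE TABLE dbo.Version (
    DocumentId INT NOT NULL,
    VersionId INT NOT NULL,
    VersionDate DATETIME NOT NULL -- hypothetical payload column
);

-- The clustered index described above, leading with DocumentId
CREATE CLUSTERED INDEX ix_Version_DocumentId_VersionId
    ON dbo.Version (DocumentId, VersionId);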

But you can also write the same query to get the same results using MAX or ROW_NUMBER(). What’s more, those work well too, all nice clean clustered index seeks. You can also use CROSS APPLY rather than JOINs. All of these appear to work well, but in different circumstances some work better than others (rough sketches of the alternatives are below). That makes it difficult to establish a nice clean “if it looks like this, do this” pattern for developers to emulate. I’m creating a series of tests to outline as much of the differences as I can. I’ll write it up and submit it all to Chuck over at SQL Server Standard first. If he doesn’t like it, it’s Steve’s. I’ll also post a few tidbits here.
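For comparison, here is roughly what the MAX, ROW_NUMBER() and CROSS APPLY versions look like. Treat these as sketches of each pattern against the same two tables, not the exact queries from the tests:

-- MAX: correlated aggregate instead of TOP
SELECT *
FROM dbo.Document d
INNER JOIN dbo.Version v
ON d.DocumentId = v.DocumentId
AND v.VersionId = (SELECT MAX(v2.VersionId)
                   FROM dbo.Version v2
                   WHERE v2.DocumentId = v.DocumentId);

-- ROW_NUMBER(): number the versions per document, newest first, keep row 1
SELECT *
FROM dbo.Document d
INNER JOIN (SELECT v2.DocumentId,
                   v2.VersionId,
                   ROW_NUMBER() OVER (PARTITION BY v2.DocumentId
                                      ORDER BY v2.VersionId DESC) AS RowNum
            FROM dbo.Version v2) v
ON d.DocumentId = v.DocumentId
WHERE v.RowNum = 1;

-- CROSS APPLY: run the TOP (1) lookup once per document row
SELECT *
FROM dbo.Document d
CROSS APPLY (SELECT TOP (1) v2.VersionId
             FROM dbo.Version v2
             WHERE v2.DocumentId = d.DocumentId
             ORDER BY v2.VersionId DESC) v;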
