Welcome once more to the Miskatonic branch of SQL University. Please try to concentrate. I realize the whipoorwills singing outside the window in a coordinated fashion that sounds almost like laboured breathing can be distracting, but we’re talking about indexes here.
We left last class with a general idea what an index is, now it’s time for some specifics. There are several different kinds of indexes, as we talked about last class. But the two you’re probably going to work with the most are clustered, non-clustered. Each of these indexes is stored in a structure called a B-Tree, a balanced tree, not a binary tree. That’s a very important distinction.
A B-Tree is a double-linked list that is defined by the keys of the indexes on the top and intermediate pages, and at the leaf level by the data itself in the case of clustered indexes. Some of you no doubt think I’m quoting from De Vermis Mysteriis. Basically, for our purposes, a B-Tree consists of a series of pages. There is a top page, or root page, that defines the beginning of the index key. It points to a series of intermediate pages. Each intermediate page contains a range, a previous and a next value. These all point to each other, hence, double linked. The idea is that SQL Server can quickly identify which intermediate page has the pointers down to the leaf node, the final node in the stack. The values of these pointers are defined by the key of the index, the column or columns that you define when you create the index. There are always at least two levels, leaf & root, but there can be more, depending on the amount of data and the size of the keys. Just remember, the size of the key, which refers both to the data types in the key and the number of columns, determines how many key values can get on a page, the more key values on a page, the faster access will be, the fewer key values, the more pages that have to be read, and therefore, the slower the performance.
In general the purpose is to be able to quickly navigate to a leaf or set of leaf pages. When a B-Tree is used and the query engine is able to navigate quickly down to the leaf needed, that is an index seek. But when the B-Tree has to be moved through, in whole or in part, scanning for the values, you’re looking at an index scan. Obviously, in most cases, a seek will be faster than a scan becuase it’s going to be accessing fewer pages to get to the leaf needed to satsify the query. Just remember, that’s not always true.
Let’s get on to the indexes. It’s already been mentioned, but it bears repeating, the principle difference between a clustered and non-clustered index is what is at the leaf level. In a non-clustered index, it’s simply the key values and an values added through the use of the INCLUDE option along with a lookup value to either the clustered index key or an identifier within a table. In a clustered index, the data is stored down at the leaf. This is why people will frequently refer to a clustered index as being “better” than a non-clustered index, because you’re always going directly to the data when you’re looking information up within a clustered index. But, as with the scans vs. seek argument, this is not always true either.
I mentioned that a non-clustered index refers back to the clustered index, if there is one on the table. Because the data is stored at the leaf level of the clustered index, when you need to retreive other columns after performing a seek on a non-clustered index, you must go and get those columns from the clustered index. This is known as a key lookup, or in older parlance, a bookmark lookup. This operation is necessary when data not supplied by the non-clustered index, but can be very expensive because you’ve just added extra reads to your query.
What if there isn’t a clustered index on the table? What does the non-clustered index use to find other columns? If the table doesn’t have a clustered index, then that table is referred to as a heap. It’s called a heap because the data is simply stored in a pile, with no logical or physical ordering whatsoever. With a heap, SQL Server takes it on itself to identify the leaf level storage and creates a row id value for all the rows in the table. This row id can be used by the non-clustered index to find the data. That is referred to by the completely arcane and incomprehensible term, row id lookup. You might be thinking, hey, that means I don’t have to create a clustered index because SQL Server will create one for me. You’d be wrong. Maintaining the row id is an expensive operation and it doesn’t help in retrieving the data in an efficient manner. It’s just necessary for SQL Server to get the data back at all. In general, this is something to be avoided.
A non-clustered index doesn’t necessarily have to perform a lookup. If all the columns referred to in a query are stored within a non-clustered index, either as part of the key or as INCLUDE columns at the leaf, it’s possible to get what is called a “covering” query. This is a query where no lookup is needed. Indexes that can provide a covering query everything it needs are referred to as covering indexes. A covering query is frequently one of the fastest ways to get at data. This is because, again, depending on the size of the keys and any INCLUDE columns, a non-clustered index will have more information stored on the page than a clustered index will and so fewer pages will have to be read, making the operation faster.
By and large, a good guideline is to put a clustered index on all tables. SQL Server works extremely well with clustered indexes, and it provides you with a good access mechanism to your data. If you don’t put a clustered index on the table, SQL Server will create and maintain a row ID anyway, but as I said before, this doesn’t save much work on the server and it doesn’t provide you with any performance enhancement.
That’s a basic introduction to the three concepts of the clustered index, the non-clustered index and the heap. The points I’d like you to remember are:
- Indexes are stored in Balanced Trees
- Balanced Trees have, generally, three levels, root page, intermediate page, and leaf page
- In clustered indexes, data is stored at the leaf page
- In non-clustered indexes, a pointer is maintained back to the clustered index or the row id
- A heap is a table without a clustered index
Remember those things and you can really begin to dig down on how indexes work. Understanding how they work will assist you in designing them for your database and your queries.
Next class we’ll go over statistics.
I wouldn’t walk back to your dorm by way of the shore. I’ve seen some rather odd looking people near the docks lately that didn’t give me a good feeling. See you next time… maybe.
Right, all eldritch tomes are to be closed and Elder Signs are to be put away during this course.
Welcome to the History department here at the Miskatonic branch of SQL University. Why the History department? Well, first, because I like history and have frequently thought I would enjoy teaching it. Second, because I needed a hook upon which to hang part of the story I want to tell. What story is that you ask? Why, the story of the Dewey Decimal System. We are interested in studying history and historians must classify our subjects carefully. For advanced students we’ll be covering the Library of Congress Classification System and the…
Right, I give, this is the introductory class on indexes. If you thought we were covering something exciting and sexy like PowerShell, you’re in the wrong room.
Indexes… indexes…. There are, of course, different kinds of indexes. I’m sure that some of you, glancing ahead in your books, are already thinking, “yeah, two.” And you would, of course, be ABSOLUTELY WRONG! That’s why you’re in this class, because you don’t know. There are a large number of different kinds of indexes. Most people think of the standard indexes, of which there are two, clustered and non-clustered. But when pressed they can usually come up with the Full-Text index and possibly even the XML index. But that leaves out Spatial indexes, filtered indexes… more. Microsoft’s documentation lists eight different indexes:
- indexes with included columns
But I’ve seen other people count them other ways and arrive at varying amounts. Is a compound index a different kind of index? If it’s not, is unique really a different kind of index? Things to think about.
Why so many? What the heck is an index good for? They must be useful critters or Microsoft wouldn’t have put so many different sorts (however many that is) into SQL Server. I started off talking about the Dewey Decimal System for a reason. An index, any of the indexes we’re going to talk about, is primarily meant, like the DDS, as a mechanism to make finding things easier. That’s all it is. Pretty simple, right? Wrong. You clearly haven’t spent time with SQL Server indexes or the DDS. It’s really complicated. But, just like the DDS, learning how indexes work will make using them much easier.
Remember, the main purpose of a database, despite what your DBA may secretly feel in his or her heart, is not to keep, store and protect data. No, the main purpose of a database is to feed that data up to your business users, whoever they may be, in a timely and accurate fashion. That’s where indexes come in. They will help your queries get the data out to your users faster. Think about your data like a really huge library and your data like a bunch of books. The index acts like the DDS as a mechanism to speed you through the library and quickly and easily retrieve the book that you want.
Enough comparisons, since this is introductory, I just wanted to get the idea of indexes into your head. In the next installment I’ll take on two (or four, depends on how you count them) different kinds of indexes, starting with the standard two that you expected me to cover, clustered and non-clustered indexes. I’ll also introduce the concept of a heap and we’ll talk about what the heck a B-Tree is.
See you next class, probably. Be careful crossing the quad, I’ve heard Wilbur Whately is back on campus and we all remember what happened last time.
I was feeling quite confident about my new-found abilities with spatial indexes so I did a presentation for my team, to share what I had learned. I had also been sharing with one co-worker as I developed the knowledge of spatial indexes. While I was preparing my presentation, he was preparing his. I had focused on finding a set of data that showed it’s proximity to a test location and then showing how retrieving that set of data was faster because of the spatial index. He took a different approach. He took the idea of saying, here’s a list of different test locations, let’s see which one of our internal locations meet the proximity test. At the same time, he tried three different spatial indexes, one with high granularity, one with medium and a final one with low granularity.
The day came for presentations. I showed how the spatial index improved performance and how you could read from sp_help_spatial_geography_index to see how efficient the index was for the sample data. It all went without a hitch. Then my co-worker showed his tests. Wherein the low density index outperformed high or medium density indexes, despite having a primary_filter at about 5% efficiency. I was floored. I couldn’t explain and nor could my co-worker. But as far he was concerned, this whole spatial index thing was for the birds.
I went away and I started testing. Finally, after feeling like I had a handle on things, I asked my co-worker for his scripts. He had a cursor (which he acknowledged was a lousy way to write the query) that walked through the test locations, one at a time, and ran my query against our data. Sure enough, when I carefully built out the indexes and ran the test, the low density indexes worked better than the high density indexes. I set up a trace event and captured all the statement completion events so I could compare the high & low density runs. That’s when I spotted the issue.
The majority of the test locations didn’t find any matches in our location data. When the query ran against the low density index it would usually have a few hundred reads and a duration around 20ms to return no data. On the other hand, the same query, returning no data, on the high density index had about 5000 reads and took about 34ms. With each one of these cumulative queries running faster against the low density index, it really appeared to be better. But, when you got a query where data was returned, the low density index had more reads and a lot longer duration, about 500ms for the low density index compared to a consistant 34ms for the high density index.
There was my explanation. The cursor masked the fact that over and over again, the query was running but no data was being returned. I’m in the process now of rewriting the query to use the same set of test locations, but JOIN it against our location data to see which of the indexes is faster.
I’m still barely scratching the surface working with spatial data in SQL Server 2008. We’ve ported some of the data into a table where we built a geography spatial data column and we’re begginning to work with point data. The requirements from the developers are, so far, very simple. They’ll feed me a point and I find all the locations “close” to it. We had to go round & round on what defines “close” but finally settled on, I think, 15km.
The query to answer a question like this is ridiculously simple (a few object names have been changed):
SELECT ebe.[Location].STDistance(@Location) AS Distance, ebe.[InterestId], ebe.[Location].Lat AS Latitude, ebe.[Location].Long AS Longitude, ebe.[OrgId] FROM dbo.[ebe] AS ebe WHERE ebe[OrgId] = @OrgId AND ebe.[Location].STDistance(@Location) < @CloseDistance
I’m not even hard-coding the “close” value so it can change when they change their minds. It retrieves exactly what’s needed. But… Well, look at the STATISTICS IO & TIME:
Table 'ebe'. Scan count 3, logical reads 40179 Table 'Worktable'. Scan count 0, logical reads 0 CPU time = 376 ms, elapsed time = 373 ms.
Not exactly snappy, is it? So, the obvious answer is to provide an index it can use so that it doesn’t have to scan the clustered index. Spatial indexes only support certain functions with the spatial column, STIntersects, STEquals, STDistance. Since I’m dealing with STDistance, I have a shot at this working. Here’s the script:
CREATE SPATIAL INDEX [ebe_spatial] ON [dbo].[ebe]
( GRIDS =( LEVEL_1 = MEDIUM, LEVEL_2 = MEDIUM, LEVEL_3 = MEDIUM, LEVEL_4 = MEDIUM),
PAD_INDEX = OFF,
SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF,
ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON)
When I then run the query, well, here’s the STATISTICS IO & TIME again:
Table 'Worktable'. Scan count 0, logical reads 0 Table 'extended_index_469576711_384000'. Scan count 2076 Table 'ebe'. Scan count 0, logical reads 448064 Table 'Worktable'. Scan count 0, logical reads 0 CPU time = 1155 ms, elapsed time = 1393 ms.
You read that correctly, it used the spatial index which caused the performance to decrease. There are adjustments you can make to spatial indexes. You can change the number of cells per object or set the detail level on the grid, but it’s hard to understand what this does to the index.
After fighting with it for a bit, I sent a tweet out on twitter, just whining about the index causing a slow-down. Surprised as anything, I get a response from Paul Randall, uh, wow, suggesting an approach. This is responded to by Isaac Kunen, you know the Microsoft PM & spatial expert. You just have to love Twitter. I took the information they gave me and did a few searches on the web. This lead to Bob Beauchemin’s blog and my introduction to some new Dynamic Management Views, sp_help_spatial_geometry_index and sp_help_spatial_geography_index. Now we’re talking. A mechanism to identify how useful the spatial data index is, similar to looking at sys.dm_db_index_physical_stats to see information about a clustered index. This lead to the following query and output:
DECLARE @LocationGEOGRAPHY SET @Location = GEOGRAPHY::STPointFromText('POINT (-86.674582 36.148843)',4326) EXEC sp_help_spatial_geography_index @tabname = 'EWH_BIM_Extract2' ,@indexname='ebe_spatial' ,@verboseoutput = 1 ,@query_sample = @Location
The question is, how efficient is the index? In my case, the Primary_Filter_Efficiency was listed as 3.33333. Yes, that’s 3% efficient. No wonder it killed the performance. What do I do to make the index more efficient? I haven’t a clue. That may be the next blog post. But at least now I know how to evaluate the usefulness of a spatial index.
All I can really add to this is, yeah, me too. If you want some absolutely great advice on indexes, read this post. It’s a must.
And might I add, I’ve been the bad guy in Tim’s example. Once, many, many years ago, I was reading from the SQL Server 7.0 documentation. It suggested that compound indexes were no longer needed since the optimizer could build them on the fly using index intersection. I had a performance problem and a consultant was telling me to use a compound index. I swore up and down it wouldn’t work because Microsoft said so. He kept pushing and I kept pushing back. Finally, after a rather heated discussion in which I was convinced I had the upper hand, I got off the phone resolved to show this “ID 10 T” he didn’t know what he was talking about… Let me just say that after running some tests I did NOT enjoy the next phone call. Crow really tastes nasty.
Great post Tim. I’m looking forward to the book.
It’s kind of scary to see someone else put down thought processes that could have been your own. That’s what Gail did with this post. It’s worth a close read because it’s offering very good advice and supplying the reasoning behind that advice.
I forgot all about this, but a script I wrote on using all the new functionality of dynamic management views & functions to do index defragmentation and rebuilds got published over at SQL Server Central.
It could stand a bit of tweaking, but gets the job done on several of the systems I’ve tested it on so far.