Welcome once more to the Miskatonic branch of SQL University. Please try to concentrate. I realize the whippoorwills singing outside the window in a coordinated fashion that sounds almost like laboured breathing can be distracting, but we’re talking about indexes here.
We left last class with a general idea of what an index is. Now it’s time for some specifics. There are several different kinds of indexes, as we talked about last class, but the two you’re probably going to work with the most are clustered and non-clustered. Each of these indexes is stored in a structure called a B-Tree, a balanced tree, not a binary tree. That’s a very important distinction.
A B-Tree is a double-linked list that is defined by the keys of the indexes on the top and intermediate pages, and at the leaf level by the data itself in the case of clustered indexes. Some of you no doubt think I’m quoting from De Vermis Mysteriis. Basically, for our purposes, a B-Tree consists of a series of pages. There is a top page, or root page, that defines the beginning of the index key. It points to a series of intermediate pages. Each intermediate page contains a range of key values, plus a pointer to the previous page and the next page at its level. These all point to each other, hence, double linked. The idea is that SQL Server can quickly identify which intermediate page has the pointers down to the leaf node, the final node in the stack. The values of these pointers are defined by the key of the index, the column or columns that you define when you create the index. Except for the very smallest indexes, where a single page acts as both root and leaf, there are always at least two levels, leaf and root, and there can be more, depending on the amount of data and the size of the keys. Just remember, the size of the key, which refers both to the data types in the key and the number of columns, determines how many key values can fit on a page. The more key values on a page, the faster access will be. The fewer key values, the more pages that have to be read and, therefore, the slower the performance.
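If you’d like to see those levels for yourself, a query along these lines might help. It’s only a sketch: it assumes you’re connected to the AdventureWorks sample database and uses the Person.Person table as a stand-in for one of your own tables. The DETAILED mode reads every page of the indexes, so think twice before pointing it at a large production table.

```sql
-- Assumption: run inside the AdventureWorks sample database.
-- Person.Person is only an example; substitute your own table.
SELECT  i.name                  AS index_name,
        ps.index_type_desc,
        ps.alloc_unit_type_desc,
        ps.index_depth,         -- number of levels from root to leaf
        ps.index_level,         -- 0 is the leaf level
        ps.page_count,          -- pages at this level
        ps.record_count         -- rows (or key entries) at this level
FROM    sys.dm_db_index_physical_stats(
            DB_ID(),
            OBJECT_ID(N'Person.Person'),
            NULL,               -- all indexes on the table
            NULL,               -- all partitions
            'DETAILED') AS ps   -- DETAILED returns one row per level
JOIN    sys.indexes AS i
        ON  i.object_id = ps.object_id
        AND i.index_id  = ps.index_id
ORDER BY i.index_id, ps.index_level DESC;
```

Wider keys generally mean fewer entries per page, and that tends to show up as more pages at the upper levels and, eventually, a larger index_depth.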
In general the purpose is to be able to quickly navigate to a leaf or set of leaf pages. When a B-Tree is used and the query engine is able to navigate quickly down to the leaf needed, that is an index seek. But when the B-Tree has to be moved through, in whole or in part, scanning for the values, you’re looking at an index scan. Obviously, in most cases, a seek will be faster than a scan because it’s going to be accessing fewer pages to get to the leaf needed to satisfy the query. Just remember, that’s not always true.
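To see the difference between a seek and a scan, you could compare the execution plans for two queries along these lines. This is only a sketch. It assumes the AdventureWorks sample database and a non-clustered index whose leading key column is LastName on Person.Person, and the optimizer makes the final call, so your plans may differ.

```sql
-- Assumption: AdventureWorks sample database, with a non-clustered index
-- whose leading key column is LastName on Person.Person.

-- An equality predicate on the leading key column lets the engine navigate
-- the B-Tree straight to the matching leaf pages: a likely index seek.
SELECT LastName
FROM   Person.Person
WHERE  LastName = N'Adams';

-- A leading wildcard gives the engine no starting point in the key order,
-- so it has to read through the leaf pages: a likely index scan.
SELECT LastName
FROM   Person.Person
WHERE  LastName LIKE N'%son';
```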
Let’s get on to the indexes. It’s already been mentioned, but it bears repeating: the principal difference between a clustered and non-clustered index is what is at the leaf level. In a non-clustered index, it’s simply the key values and any values added through the use of the INCLUDE option, along with a lookup value to either the clustered index key or an identifier within a table. In a clustered index, the data is stored down at the leaf. This is why people will frequently refer to a clustered index as being “better” than a non-clustered index, because you’re always going directly to the data when you’re looking information up within a clustered index. But, as with the scans vs. seek argument, this is not always true either.
I mentioned that a non-clustered index refers back to the clustered index, if there is one on the table. Because the data is stored at the leaf level of the clustered index, when you need to retrieve other columns after performing a seek on a non-clustered index, you must go and get those columns from the clustered index. This is known as a key lookup, or in older parlance, a bookmark lookup. This operation is necessary when data is not supplied by the non-clustered index, but it can be very expensive because you’ve just added extra reads to your query.
What if there isn’t a clustered index on the table? What does the non-clustered index use to find other columns? If the table doesn’t have a clustered index, then that table is referred to as a heap. It’s called a heap because the data is simply stored in a pile, with no logical or physical ordering whatsoever. With a heap, SQL Server takes it on itself to identify the leaf level storage and creates a row id value for all the rows in the table. This row id can be used by the non-clustered index to find the data. That is referred to by the completely arcane and incomprehensible term, row id lookup. You might be thinking, hey, that means I don’t have to create a clustered index because SQL Server will create one for me. You’d be wrong. Maintaining the row id is an expensive operation and it doesn’t help in retrieving the data in an efficient manner. It’s just necessary for SQL Server to get the data back at all. In general, this is something to be avoided.
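If you’re wondering which of your tables are heaps, the catalog views can tell you. Here’s a minimal sketch, assuming you have permission to read sys.tables and sys.indexes in the current database. A heap shows up in sys.indexes as a row with a type of 0.

```sql
-- Lists the user tables in the current database that have no clustered index.
SELECT  SCHEMA_NAME(t.schema_id) AS schema_name,
        t.name                   AS table_name
FROM    sys.tables  AS t
JOIN    sys.indexes AS i
        ON i.object_id = t.object_id
WHERE   i.type = 0               -- 0 = HEAP
ORDER BY schema_name, table_name;
```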
A non-clustered index doesn’t necessarily have to perform a lookup. If all the columns referred to in a query are stored within a non-clustered index, either as part of the key or as INCLUDE columns at the leaf, it’s possible to get what is called a “covering” query. This is a query where no lookup is needed. Indexes that can provide a query with everything it needs are referred to as covering indexes. A covering query is frequently one of the fastest ways to get at data. This is because, again, depending on the size of the keys and any INCLUDE columns, a non-clustered index will have more information stored on the page than a clustered index will and so fewer pages will have to be read, making the operation faster.
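As a rough illustration of a covering index, consider the sketch below. The index name is made up for the example, and the table and columns are assumed to be those of the AdventureWorks Person.Person table. Treat it as something to try on a test system, not a recommendation for your schema.

```sql
-- Hypothetical index name; columns assumed from AdventureWorks Person.Person.
-- LastName is the key; Title and Suffix are stored only at the leaf level.
CREATE NONCLUSTERED INDEX IX_Person_LastName_TitleSuffix
ON Person.Person (LastName)
INCLUDE (Title, Suffix);

-- Every column this query touches lives in the index above, so the
-- optimizer can answer it without a key lookup into the clustered index.
SELECT LastName, Title, Suffix
FROM   Person.Person
WHERE  LastName = N'Adams';

-- Add a column that is not in the index (ModifiedDate, for example) and the
-- plan will likely need a key lookup, or fall back to scanning the
-- clustered index.
SELECT LastName, Title, Suffix, ModifiedDate
FROM   Person.Person
WHERE  LastName = N'Adams';
```

INCLUDE columns take up space only on the leaf pages, which keeps the root and intermediate pages small while still letting the leaf cover the query.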
By and large, a good guideline is to put a clustered index on all tables. SQL Server works extremely well with clustered indexes, and it provides you with a good access mechanism to your data. If you don’t put a clustered index on the table, SQL Server will create and maintain a row ID anyway, but as I said before, this doesn’t save much work on the server and it doesn’t provide you with any performance enhancement.
That’s a basic introduction to the three concepts of the clustered index, the non-clustered index and the heap. The points I’d like you to remember are:
- Indexes are stored in Balanced Trees
- Balanced Trees have, generally, three levels, root page, intermediate page, and leaf page
- In clustered indexes, data is stored at the leaf page
- In non-clustered indexes, a pointer is maintained back to the clustered index or the row id
- A heap is a table without a clustered index
Remember those things and you can really begin to dig down on how indexes work. Understanding how they work will assist you in designing them for your database and your queries.
Next class we’ll go over statistics.
I wouldn’t walk back to your dorm by way of the shore. I’ve seen some rather odd looking people near the docks lately that didn’t give me a good feeling. See you next time… maybe.
Right, all eldritch tomes are to be closed and Elder Signs are to be put away during this course.
Welcome to the History department here at the Miskatonic branch of SQL University. Why the History department? Well, first, because I like history and have frequently thought I would enjoy teaching it. Second, because I needed a hook upon which to hang part of the story I want to tell. What story is that you ask? Why, the story of the Dewey Decimal System. We are interested in studying history and historians must classify our subjects carefully. For advanced students we’ll be covering the Library of Congress Classification System and the…
Right, I give, this is the introductory class on indexes. If you thought we were covering something exciting and sexy like PowerShell, you’re in the wrong room.
Indexes… indexes…. There are, of course, different kinds of indexes. I’m sure that some of you, glancing ahead in your books, are already thinking, “yeah, two.” And you would, of course, be ABSOLUTELY WRONG! That’s why you’re in this class, because you don’t know. There are a large number of different kinds of indexes. Most people think of the standard indexes, of which there are two, clustered and non-clustered. But when pressed they can usually come up with the Full-Text index and possibly even the XML index. But that leaves out Spatial indexes, filtered indexes… more. Microsoft’s documentation lists eight different indexes:
- indexes with included columns
But I’ve seen other people count them other ways and arrive at varying amounts. Is a compound index a different kind of index? If it’s not, is unique really a different kind of index? Things to think about.
Why so many? What the heck is an index good for? They must be useful critters or Microsoft wouldn’t have put so many different sorts (however many that is) into SQL Server. I started off talking about the Dewey Decimal System for a reason. An index, any of the indexes we’re going to talk about, is primarily meant, like the DDS, as a mechanism to make finding things easier. That’s all it is. Pretty simple, right? Wrong. You clearly haven’t spent time with SQL Server indexes or the DDS. It’s really complicated. But, just like the DDS, learning how indexes work will make using them much easier.
Remember, the main purpose of a database, despite what your DBA may secretly feel in his or her heart, is not to keep, store and protect data. No, the main purpose of a database is to feed that data up to your business users, whoever they may be, in a timely and accurate fashion. That’s where indexes come in. They will help your queries get the data out to your users faster. Think about your database like a really huge library and your data like a bunch of books. The index acts like the DDS as a mechanism to speed you through the library and quickly and easily retrieve the book that you want.
Enough comparisons. Since this is introductory, I just wanted to get the idea of indexes into your head. In the next installment I’ll take on two (or four, depends on how you count them) different kinds of indexes, starting with the standard two that you expected me to cover, clustered and non-clustered indexes. I’ll also introduce the concept of a heap and we’ll talk about what the heck a B-Tree is.
See you next class, probably. Be careful crossing the quad, I’ve heard Wilbur Whateley is back on campus and we all remember what happened last time.
I’m getting my first look at a full-fledged nHibernate database developed by consultants for our company. I thought I’d share my initial impressions. I’ll be capturing trace events from the database over the next couple of weeks, so I’ll be following up on the behavior of nHibernate within this database as well.
The first thing I saw & thought was, “Foreign key constraints. Thank the gods.” That really is good news. I was frankly concerned that they might go with the “let the code handle it” approach. There are quite a few nullable columns. I’m also seeing tons & tons of nvarchar(255), which must be the default string size. Lots of bit fields too. They also used bigint in a lot of places. None of this is definitively good or bad, just observations.
There are tables that lack a primary key. That raises a bit of a concern. The primary keys I’ve looked at have all been clustered, which isn’t too problematic since that’s probably the primary access path. There are a few unique constraints on some of the tables too.
Overall though, I don’t see anything that, at first glance, makes me want to run screaming from the room (or pick up a big stick & start looking for developers). The devil is no doubt in the details. Time to get started looking at the trace events.
I’ve been tagged by a misplaced yankee, uh, New Englander, whatever. The question is, how do I/we use SQL Server where I work. That’s a tough one. It would make a much shorter, and easier, blog post to describe the things we don’t use it for. However, keeping with the spirit of these tags, I’ll try to lay it out.
For those that don’t know, I work for a rather large insurance company. This means that we have lots and lots of databases, but not much data. We also are cheap. That means we’ll run an app into the ground rather than spend the money & time to replace it. We have apps still running from the 70’s and 80’s propped up by ancient men with pocket protectors, spit, baling wire and happy thoughts. This also means that we have apps running on SQL Server 7, 2000, 2005 and 2008. Give me a couple of weeks and I’m sure I can get an R2 app deployed. There are also a few Oracle databases, our warehouse and Peoplesoft in particular. We even have a DB2 and, I think, one Sybase database somewhere.
I don’t want to go into lots of details about the type of data we store, lest I get in trouble, but basically, think insurance and you’ll get a pretty good idea of a lot of it. Add in the fact that my company prides itself on engineering to avoid risk and you’ll know that we gather quite a bit of information about the things that we insure. There are lots and lots of very small databases. Our largest db’s are just breaking 100gb, but most are in the 20-60gb range. We have a ton of OLTP systems gathering all the different data. These have been developed in just about every possible way. We even have a couple of systems using nHibernate under development. We move, mostly, pretty standard structured data. We have a few processes that are using XML, mostly from third party sources, to import data, so we’ve learned how to shred that stuff into the database. Spatial data, insurance remember, is the big new thing on the block. We’re seeing lots more implementations taking advantage of this. We don’t see much in the way of unstructured data, but some of the reports from the engineers fall into this realm. We also get quite a few photos from them that they want us to store. We’re working on using FileStream to keep those in sync with the db rather than storing them within the database itself.
Currently, and I hate this, the overwhelming majority of our OLTP data is collected in SQL Server. All our datamarts used for reporting are in SQL Server. BUT, in the middle sits our Oracle warehouse. So we have to convert our data from SQL Server into Oracle and then back into SQL Server. It’s fun. Swapping data types over & over, shrinking column names only to re-expand them into something a human being can read… It’s lots of fun.
We use SSIS for almost all our ETL processes, although we have a few DTS packages still chugging away on some of the 2000 servers. We’re running a bit of replication, but nothing major. We have several fail-over clusters in production. We’re running virtual machines in lots of places. We’re expanding our Reporting Services implementation pretty rapidly. After attending Buck Woody’s sessions at PASS this year we’re getting Central Management Servers and Policy Based Management implemented.
Most of my time is spent working with development teams. We do most of our deployment work using Visual Studio. I do database design, stored procedure writing and tuning, data access methods… you name it and I’m involved with it. The level of our involvement varies from team to team, but by & large, we get involved early in most development projects and are able to influence how databases are developed.
For monitoring we’re using Microsoft System Center Operations Manager (or whatever it’s called this week). We’ve modified it somewhat, adding special rules & monitors to keep an eye on the system better. We also use Idera’s SQL Diagnostic Manager to help keep a more detailed eye on the systems. I already mentioned that we use Visual Studio for most of our development & deployments. We also use several Red Gate products, Compare, Data Compare, Prompt, pretty much every day.
That’s pretty much it. We keep busy and keep the systems up, running & happy most of the time.
Quest Connect 2009, taking place on October 21 for 24 hours, looks like it’s going to have 64 different sessions, live and recorded, by a variety of the names in the industry. It’s another chance to dig in and learn the details on a variety of topics from some of the top names in the business. Can you say Tom LaRock? How about Tim Ford? I know you want to hear from Brent Ozar. Those are just some of the featured speakers. There are a whole slew of others. It’s worth pursuing and, did I mention, the price is right.
I recorded a session for them last night. It’s on the basics of understanding execution plans.
My first one over there. It discusses whether or not you should do two things: build your own monitoring tool, and come out in particular favor of one tool or suite of tools from a single vendor. Please read it and watch the video. And, even more importantly, leave a comment in the discussion.
I got the question the other day, when are you likely to see a spool in an execution plan? Easy, whenever SQL Server needs to walk through the data multiple times, usually in JOIN operations… Yeah, well, once again, my flip answers are not quite the entire story.
Spool operations are temporary storage of the data for later reuse in a query plan. There are two types of spool operations, eager spool and lazy spool. A spool is basically a temporary table created within the execution of the query that is used when it’s likely that data will be needed again, and again during the execution of the query. This is not an explicit #temp temporary table, but a work table for operations within the processing necessary for a given query’s behavior. A spool is created when the optimizer thinks that it can work better with a semi-permanent sub-set of data rather than have to perform multiple seeks or scans against a table or index or in other places where data re-use is important (more in a bit).
So how does this work? Take a look at a simple query:
UPDATE Person.Person SET FirstName = 'Ted' WHERE FirstName = 'Ted';
When the execution plan for this query is generated, it looks like this:
In this case, an eager spool is used as part of the roll back mechanism and to prevent the Halloween scenario. An eager spool is one where the data is retrieved immediately.
It’s possible to see the other type of spool in a query that looks like this (straight out of the Books Online):
WITH DirectReports(ManagerID, EmployeeID, EmployeeLevel) AS
(
    SELECT ManagerID, EmployeeID, 0 AS EmployeeLevel
    FROM HumanResources.Employee
    WHERE ManagerID IS NULL
    UNION ALL
    SELECT e.ManagerID, e.EmployeeID, EmployeeLevel + 1
    FROM HumanResources.Employee e
    INNER JOIN DirectReports d ON e.ManagerID = d.EmployeeID
)
SELECT ManagerID, EmployeeID, EmployeeLevel
FROM DirectReports;
Which would result in this execution plan:
Now you see a table spool that is called a lazy spool. This means that it only loads data as the data is requested. This makes a lot of sense because the lazy spool is operating as the means for gathering the recursive data together. So it’s not going to go and get all the data available, like an eager spool. Instead it’s going to only load the data as needed, lazy.
These two scenarios are much more likely than the typical join to show a table spool. Yes, it can, and does, appear in join operations, but as I said at the beginning, that’s such a flip answer. Much better to try to be complete.
Joe Sack has started a new team blog for the Microsoft SQL Server Premier Field Engineers. If you don’t know who they are, you should. The first post is just introductory, but this blog is likely to become a great resource. These are the guys that MS zip lines into tough situations with the expectations that they’ll improve them. I’d strongly suspect these are fellows worth listening to.
I just finished chapter 1 of Alastair Aitchison’s new book on SQL Server spatial data, “Beginning Spatial with SQL Server 2008.” If this is the beginner’s book… oh boy. The advanced book must be insane. Seriously though, Mr. Aitchison seems to have written a fantastic book. I’m going to tear through it as fast as I can because I’ve got two projects that are looking to start using spatial data and quite frankly, I’m a bit lost.
There’s a great discussion going on over at SSC as to the worth of technical books for DBAs. It’s based on this editorial by Tony Davis. I’m surprised by the number of people who say they don’t use books. It seems that a lot more people use blogs and articles and discussion groups to learn. Maybe I’m showing my age a bit, but I don’t think a blog post or an article is going to get the depth and knowledge that Mr. Aitchison is displaying in this book. I know I’m regularly opening Kalen Delaney’s Inside SQL Server 2005 (and the new one for 2008 just came out) to look up bits & pieces of information that just aren’t as readily available on the web. Also, it’s worth pointing out, except for the editing that comes from people who read this blog, no technical review is done of this information. I might be right about the things I post, but I could be VERY wrong. Same with any other blog you read, including blogs by the big names. Despite the errors that creep into books (and trust me, they do), books are very carefully scrutinized by multiple sets of eyes to try to catch those errors prior to publication. They miss some, but they try not to miss any. Few blogs are like that. Not that many technical publications are terribly strict about technical accuracy either. I generally find more good information in the right books than anywhere else.
End of rant. I need to get back to reading this excellent book.
I’ve said it over and over again: the costs on operators in execution plans, even in actual execution plans, are estimates. You need to understand that when looking at your execution plans. It’s vital because you need to be able to distinguish between the truly costly parts of a plan and the less costly parts of a plan. Don’t believe me? Take a look at this picture and see if you can spot the discrepancy:
Spot it yet?
Add up the costs for the operators visible in the part of the plan…
Yep, 125%. And there’s more to the plan that I’m not showing. I think this one must total near 200%. The statistics are up to date and there are no consistency errors in the database. These estimates are just off sometimes.
This is a rather costly query being run against Microsoft Dynamics CRM. There were two missing indexes identified by the optimizer and about four key lookup operations that I fixed with two index changes. This jumped out and I wanted to share. The fact is, right now, the query is working well. That particular table has one scan and some reads, identified from looking at the STATISTICS I/O, and it’s not even close to 100% of the cost of the query, but I would sure be in a panic if I believed the estimated operation cost.
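If you’d rather check the reads than trust the cost percentages, here’s a minimal sketch of turning on STATISTICS IO for a session. The query is just a stand-in against the AdventureWorks sample database, not the Dynamics CRM query discussed above.

```sql
-- Report scan counts and page reads, per table, for each statement
-- that follows in this session.
SET STATISTICS IO ON;

-- Stand-in query (assumes AdventureWorks); substitute the query you are tuning.
SELECT LastName, FirstName
FROM   Person.Person
WHERE  LastName = N'Adams';

SET STATISTICS IO OFF;
```

The scan count and logical reads for each table show up on the Messages tab in Management Studio. Those are measured values from the actual execution, which makes them a useful cross-check against the estimated operator costs.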