SQL University: Introduction to Indexes, Part the First

July 19, 2010 at 5:00 am (PowerShell, spatial data, SQL Server 2005, SQL Server 2008, SQLServerPedia Syndication, TSQL) (, , , , )

Right, all eldritch tomes are to be closed and Elder Signs are to be put away during this course.

Welcome to the History department here at the Miskatonic branch of SQL University. Why the History department? Well, first, because I like history and have frequently thought I would enjoy teaching it. Second, because I needed a hook upon which to hang part of the story I want to tell. What story is that you ask? Why, the story of the Dewey Decimal System. We are interested in studying history and historians must classify our subjects carefully. For advanced students we’ll be covering the Library of Congress Classification System and the…

Right, I give, this is the introductory class on indexes. If you thought we were covering something exciting and sexy like PowerShell, you’re in the wrong room.

Indexes… indexes…. There are, of course, different kinds of indexes. I’m sure that some of you, glancing ahead in your books, are already thinking, “yeah, two.” And you would, of course, be ABSOLUTELY WRONG! That’s why you’re in this class, because you don’t know. There are a large number of different kinds of indexes. Most people think of the standard indexes, of which there are two, clustered and non-clustered. But when pressed they can usually come up with the Full-Text index and possibly even the XML index. But that leaves out Spatial indexes, filtered indexes… more. Microsoft’s documentation lists eight different indexes:

  • clustered
  • non-clustered
  • unique
  • indexes with included columns
  • Full-Text
  • Spatial
  • Filtered
  • XML

But I’ve seen other people count them other ways and arrive at varying amounts. Is a compound index a different kind of index? If it’s not, is unique really a different kind of index? Things to think about.

Why so many? What the heck is an index good for? They must be useful critters or Microsoft wouldn’t have put so many different sorts (however many that is) into SQL Server. I started off talking about the Dewey Decimal System for a reason. An index, any of the indexes we’re going to talk about, is primarily meant, like the DDS, as a mechanism to make finding things easier. That’s all it is. Pretty simple, right? Wrong. You clearly haven’t spent time with SQL Server indexes or the DDS. It’s really complicated. But, just like the DDS, learning how indexes work will make using them much easier.

Remember, the main purpose of a database, despite what your DBA may secretly feel in his or her heart, is not to keep, store and protect data. No, the main purpose of a database is to feed that data up to your business users, whoever they may be, in a timely and accurate fashion. That’s where indexes come in. They will help your queries get the data out to your users faster. Think about your data like a really huge library and your data like a bunch of books. The index acts like the DDS as a mechanism to speed you through the library and quickly and easily retrieve the book that you want.

Enough comparisons, since this is introductory, I just wanted to get the idea of indexes into your head. In the next installment I’ll take on two (or four, depends on how you count them) different kinds of indexes, starting with the standard two that you expected me to cover, clustered and non-clustered indexes. I’ll also introduce the concept of a heap and we’ll talk about what the heck a B-Tree is.

See you next class, probably. Be careful crossing the quad, I’ve heard Wilbur Whately is back on campus and we all remember what happened last time.

Permalink 4 Comments

Maps in Reports!

June 14, 2010 at 11:07 am (spatial data, SQL Server 2008, SQLServerPedia Syndication) (, , )

I’ve been playing with SQL Server 2008 R2 for quite a while in the CTP’s and what not. But, I hadn’t made a concerted effort to look at the new version of Reporting Services… HUGE mistake. There are a number of, not insubstantial, updates to Reporting Services that, probably, are the primary selling points of 2008 R2. The big one, for me, was the incorporation of mapping directly into reports. Did I say big? I’m sorry, I meant, enormous, gigantic, galactic, really, really important… Why you ask? I work for an insurance company. We insure factories, warehouses, that sort of thing. Funny bit of information about a factory, when floods, high wind or earthquakes come calling, they just don’t seem to be able to get out of the way adequately. Weird, huh? With that in mind, it’s really useful to be able to know, not simply where a building is located, but if that location is in a 50 year or a 500 year flood zone, is it subject to high winds, does the New Madrid fault line run straight across the factory floor… Because that factory isn’t going anywhere, you want to know how it should be built to withstand that flood, wind, earth movement, etc.. In other words, spatial data is a big deal where I work.

Using maps in reports is incredibly simple. First you need a query. Here’s one that runs against AdventureWorks2008R2:

SELECT a.City ,
a.PostalCode ,
a.AddressLine1 ,
sp.Name ,
FROM Person.Address AS a
JOIN Person.StateProvince AS sp ON a.StateProvinceID = sp.StateProvinceID
WHERE sp.Name = 'Arizona'

With that query, creating a report is simple. Open up BIDS, add a data source, create a report, and plug that query into it. Click Next, Next, Next however many times you need to in order to arrive a report. You can blank it out, removing all columns or leave it alone, any way you want, but, click on the Tools menu choice and select Map. Click on the report to place a map and you’ll get this dialogue box. You can then drill down a select a number of different maps, built right into Reporting Services, just as I did for the map of Arizona:

Once you have the map you want, click Next again and you’ll see a layout of what the map piece of the report is going to look like. You can work with a number of options, up to and including adding information from Bing Maps for, for free. This is slick stuff:

You can finish from there and you get a map. But, there’s no data on it. Now you need to click on the map and add a layer, again, you can use a wizard, the Layer Wizard, to select a spatial query, maybe the one we created at the top of this experiment. This is then placed on the map as points or shapes, depending on the data. Below is the finished map:

No big deal right? Except, this was free, I didn’t type a bit of code, I didn’t set any properties or muck about with all sorts of menu choices or do much of anything at all, but I ended up with what could be a useful report. It took longer for me to type this sentence than it did to create the map. Imagine what you can do if you actually tried to put some work into it. This is a fantastic new resource and one that I’m going to be spending a lot of time exploring. I think many of you will find it useful as well.

Permalink 3 Comments

Learning Spatial Data in SQL Server 2008

June 23, 2009 at 6:39 am (PASS, spatial data) (, , , , )

PASS just published a new Top 10 list of mine over at the their web site. This one is the Top 10 Articles on the web if you’re trying to learn about spatial data. I’m not trying to say that I know what I’m doing with spatial data. I’m still feeling my way forward. These articles have proved to be the most useful in the learning I’ve done so far. I wanted to publish a little more information than we put into our Top 10 lists over at PASS. Having the list is good, but I thought it needed a bit of commentary to be complete.

All these articles are good and all the web sites hosting the articles have good stuff on them. A few of the web sites stand out. Paul Randal‘s site covers quite a lot more than spatial data, but he writes with such clarity that his posts are a must read. When you’re ready for more, you need to read the stuff at Bob Beauchemin’s blog. Again, there’s more than just spatial data to be had there, but Mr. Beauchemin has really done the research and he’s good enough to compile it for the rest of us. Those are, sort of, the introductory sites. When you’re ready to really and truly just go with all things spatial, the two sites that are going to prove most useful (or, at least that I’ve found most useful) are Spatial Ed(Ed Katibah) and Isaac Kunen. These are two of the people responsible for creating the spatial engine inside SQL Server. We can thank them for that, but better still, these guys are good about communicating what they’ve done, what it means, how it works, applications, ideas… You get the drift. If you’re really pursuing spatial data as an important component of your enterprise data, you need to read their stuff.

There are a few links that I couldn’t easily fit into the top 10 since they’re not discrete articles. I’ve already blogged about and reviewed Alastair Aitchison’s excellent book, Beginng Spatial with SQL Server 2008, but it’s worth another plug. You will also want to browse through the functionality being posted at the SQL Server Spatial Tools site over at CodePlex. Stuff there is pretty useful for getting your own functionality… functional. Finally, when you get stuck, if you get stuck, one of the best places to get unstuck is on the MSDN spatial forum, where a lot of the people already mentioned are answering questions and posting.

These are the resources I’ve found most useful in the little bit of spatial I’ve learned so far. I hope the top 10 list and this explanation of it prove useful.

UPDATE: Fixed link problem.

Permalink Leave a Comment

More Spatial Headaches

June 11, 2009 at 2:38 pm (spatial data, SQL Server 2008, SQLServerPedia Syndication, TSQL) (, , , , )

I keep thinking I’ve got a handle on spatial data and I keep finding out I’m wrong. I rewrote the cursor that my co-worker used into a query that joined the list of test criteria onto our list of locations. I used an OUTER APPLY to more or less run a check for each of the test criteria since, except for the proximity to the locations, there’s no actual relationship between the test criteria and the location data for me to join on.

The query looked something like this:

,l.[Location].Lat AS LocationLatitude
,l.[Location].Long AS LocationLongitude
FROM dbo.Alert AS a
		FROM dbo.MyLocs x
		WHERE x.OrgID = 42
		AND x.[Location].STDistance(a.AlertLocation) < 50000) AS l

The cursor was taking almost a full minute to return just under 400 rows worth of data. This query took about 25 seconds. SCORE!

Unfortunately when I looked at the execution plan, the optimizer was not able to recognize that my spatial index would increase the performance. So I modified the query to provide a hint:

FROM dbo.MyLocs x WITH (INDEX(xsHigh))

Now the query returned in 5 seconds. BIG SCORE! Then I remembered my co-worker and his low density index. So I quickly rebuilt it and changed the INDEX hint to use the low density index. It ran in just over 2 seconds. Yeah, it’s a score, but why? I thought the high density index was the way to go with point locations. I decided I just wanted to see the middle behavior and see if a medium density index took 3.5 seconds about mid-way between the low and high density indexes… It ran in just over 1 second. WHAT THE F…

I then ran a whole series of experiments with my query. I kept changing the density and the cells per object value for the three indexes and recording times. In every test, the medium density index outperformed the low and high density indexes. It never outperformed the low density index by a whole lot, usually running ~50% faster as opposed to running 5 times faster than the high density index. But it really did run better.

My current working theory, and I’m trying to figure out how I can validate this, is that the low and medium density indexes were more efficient at eliminating the negative values, the places where the criteria wasn’t in proximity to my locations. Then, the medium density index was just better at retrieving the data than the low density index. I base this on the fact that the primary filter efficiency on the medium density index is  better than the low (although not as good as the high).

Permalink 2 Comments

Spatial Indexes and a Lack of Data

June 9, 2009 at 1:44 pm (spatial data, SQL Server 2008, SQLServerPedia Syndication, TSQL) (, , , , , )

I was feeling quite confident about my new-found abilities with spatial indexes so I did a presentation for my team, to share what I had learned. I had also been sharing with one co-worker as I developed the knowledge of spatial indexes. While I was preparing my presentation, he was preparing his. I had focused on finding a set of data that showed it’s proximity to a test location and then showing how retrieving that set of data was faster because of the spatial index. He took a different approach. He took the idea of saying, here’s a list of different test locations, let’s see which one of our internal locations meet the proximity test. At the same time, he tried three different spatial indexes, one with high granularity, one with medium and a final one with low granularity.

The day came for presentations. I showed how the spatial index improved performance and how you could read from sp_help_spatial_geography_index to see how efficient the index was for the sample data. It all went without a hitch. Then my co-worker showed his tests. Wherein the low density index outperformed high or medium density indexes, despite having a primary_filter at about 5% efficiency. I was floored. I couldn’t explain and nor could my co-worker. But as far he was concerned, this whole spatial index thing was for the birds.

I went away and I started testing. Finally, after feeling like I had a handle on things, I asked my co-worker for his scripts. He had a cursor (which he acknowledged was a lousy way to write the query) that walked through the test locations, one at a time, and ran my query against our data. Sure enough, when I carefully built out the indexes and ran the test, the low density indexes worked better than the high density indexes. I set up a trace event and captured all the statement completion events so I could compare the high & low density runs. That’s when I spotted the issue.

The majority of the test locations didn’t find any matches in our location data. When the query ran against the low density index it would usually have a few hundred reads and a duration around 20ms to return no data. On the other hand, the same query, returning no data, on the high density index had about 5000 reads and took about 34ms. With each one of these cumulative queries running faster against the low density index, it really appeared to be better. But, when you got a query where data was returned, the low density index had more reads and a lot longer duration, about 500ms for the low density index compared to a consistant 34ms for the high density index.

There was my explanation. The cursor masked the fact that over and over again, the query was running but no data was being returned. I’m in the process now of rewriting the query to use the same set of test locations, but JOIN it against our location data to see which of the indexes is faster.

Permalink 3 Comments

Spatial Index & Performance & Brain Pain

May 28, 2009 at 10:36 am (spatial data, SQL Server 2008, TSQL) (, , , , , )

In my previous post, Spatial Data Hurts My Brain, I showed how a query used the spatial index, but hurt performance. Several people jumped in to try to help out. Thanks to Bob Beauchamin, Isaac Kunin, Gail Shaw, and Valerie Yakich (via Twitter).

I’ve definately learned a bit more than I knew previously about spatial data and spatial indexes. I’ve posted about it before, but it bears repeating, if you’re just starting out, I can’t recommend Alistair Aitchison’s book, Beginning Spatial with SQL Server 2008, enough. Here’s where I’m at. I think I was confusing/confused/confounded/something about what to expect from a spatial index. I’m going to present two queries and two indexes and try to explain why each works well, or not together, mainly as a means for enabling my understanding, not so much teaching anyone else. The first query is simply a retrieval of the data from a database where the distance between a supplied point and the points stored in the geography column are less than a certain value:

FROM dbo.ebe
WHERE ebe.[Location].STDistance(@location) < 15000

When this is run against the table with only a clustered index on the PK, this is the execution plan:


About what you would expect. The query returns almost 5000 rows of the 500000 in the table and runs for 14 seconds on my machine. Now I’m going to create this index and run the query again.

CREATE SPATIAL INDEX [ebe_High] ON [dbo].[ebe] 

When the query is run a second time, here’s the execution plan.


This time the query ran in under a second to return the same rows. Clearly, spatial indexes do work to speed up execution. The execution plan is hard to see here I realize. I can’t upload zipped files or .sqlplan files, so if you really want a copy of the plan, I’ve pasted both into a Word doc: spatial_plans

I’m not going to try to explain everything happening with the plan, because I’m not entirely sure what all is. I’m still figuring that out. Suffice to say that the table valued function and the constant scan operator feed data into a seek against the clustered index (which is missing statistics for some reason) where it’s fed to a compute scalar and then finally it’s loop joined to the clustered index in what I assume is the equivalent of a key lookup. Again, I’m still figuring this part out. I skipped some steps.

Now, I want to modify the query because the goal is to not simply get all locations near this point of interest, but the locations of a specified organization:

FROM dbo.ebe
WHERE ebe.[Location].STDistance(@location) < 15000
AND ebe.[OrgId] = 4464

When I run this query, I get the following execution plan:


The query ran in 96ms and returned 148 rows. It’s all you could ask for. Or is it? What if I add this index:

) INCLUDE (InterestId,Location)

Now when I run the query, I get the following execution plan:


Much, much less complex and quite easy to explain. I have a covering index that does a simple seek to retrieve the initial set of data and then it’s filtered to remove the locations that are too far from the point of interest. But, here’s where it gets interesting. The execution time was 126ms. The spatial index shaved about 30ms, or about 25%, off the time of what would normally be the fastest access path, a covering non-clustered index. That is a substantial savings.

I’ve done a lot of testing over the last few days and I didn’t always see this kind of performance benefit. Several times I’ve seen worse performance. Also, I’m not sure why a brand new index has missing statistics, as seen in the execution plans when the where the spatial index was used. Further, the optimizer is choosing the nonclustered index over the spatial index, clearly because of the estimated costs, when the spatial index does, in fact, perform better in this instance. I’m not sure what to do about that, apart from dropping the nonclustered index. That’s OK. It just means more to learn. Starting with, explaining exactly what’s happening in the spatial index execution plan. That may be the next post.

I forgot to mention. I ran the sp_help_spatial_geography_index DMV to see what information it could give, but when I supplied the parameter value listed above, I received a filter efficiency of zero. I’m clearly not understanding how to apply this DMV.

Permalink 3 Comments

Spatial Data Hurts My Brain

May 27, 2009 at 1:03 pm (spatial data, SQLServerPedia Syndication, TSQL) (, , , )

I’m still barely scratching the surface working with spatial data in SQL Server 2008. We’ve ported some of the data into a table where we built a geography spatial data column and we’re begginning to work with point data. The requirements from the developers are, so far, very simple. They’ll feed me a point and I find all the locations “close” to it. We had to go round & round on what defines “close” but finally settled on, I think, 15km.

The query to answer a question like this is ridiculously simple (a few object names have been changed):

SELECT ebe.[Location].STDistance(@Location) AS Distance,
ebe.[Location].Lat AS Latitude,
ebe.[Location].Long AS Longitude,
FROM dbo.[ebe] AS ebe
WHERE ebe[OrgId] = @OrgId
AND ebe.[Location].STDistance(@Location) < @CloseDistance

I’m not even hard-coding the “close” value so it can change when they change their minds. It retrieves exactly what’s needed. But… Well, look at the STATISTICS IO & TIME:

Table 'ebe'. Scan count 3, logical reads 40179
Table 'Worktable'. Scan count 0, logical reads 0
CPU time = 376 ms, elapsed time = 373 ms.

Not exactly snappy, is it? So, the obvious answer is to provide an index it can use so that it doesn’t have to scan the clustered index. Spatial indexes only support certain functions with the spatial column, STIntersects, STEquals, STDistance. Since I’m dealing with STDistance, I have a shot at this working. Here’s the script:

CREATE SPATIAL INDEX [ebe_spatial] ON [dbo].[ebe]

When I then run the query, well, here’s the STATISTICS IO & TIME again:

Table 'Worktable'. Scan count 0, logical reads 0
Table 'extended_index_469576711_384000'. Scan count 2076
Table 'ebe'. Scan count 0, logical reads 448064
Table 'Worktable'. Scan count 0, logical reads 0
CPU time = 1155 ms, elapsed time = 1393 ms.

You read that correctly, it used the spatial index which caused the performance to decrease. There are adjustments you can make to spatial indexes. You can change the number of cells per object or set the detail level on the grid, but it’s hard to understand what this does to the index.

After fighting with it for a bit, I sent a tweet out on twitter, just whining about the index causing a slow-down. Surprised as anything, I get a response from Paul Randall, uh, wow, suggesting an approach. This is responded to by Isaac Kunen, you know the Microsoft PM & spatial expert. You just have to love Twitter. I took the information they gave me and did a few searches on the web. This lead to Bob Beauchemin’s blog and my introduction to some new Dynamic Management Views, sp_help_spatial_geometry_index and sp_help_spatial_geography_index. Now we’re talking. A mechanism to identify how useful the spatial data index is, similar to looking at sys.dm_db_index_physical_stats to see information about a clustered index. This lead to the following query and output:

SET @Location = GEOGRAPHY::STPointFromText('POINT (-86.674582 36.148843)',4326)
EXEC sp_help_spatial_geography_index @tabname = 'EWH_BIM_Extract2'
,@verboseoutput = 1
,@query_sample = @Location

The question is, how efficient is the index? In my case, the Primary_Filter_Efficiency was listed as 3.33333. Yes, that’s 3% efficient. No wonder it killed the performance. What do I do to make the index more efficient? I haven’t a clue. That may be the next blog post. But at least now I know how to evaluate the usefulness of a spatial index.

Permalink 13 Comments

Spatial Data

April 21, 2009 at 7:33 am (spatial data, SQL Server 2008, TSQL) (, )

I work for an insurance company. If you think that maybe, we might be interested in the physical location of the things we insure, you’d be right. Actually, we’re an insurance company predicated on the idea that risk can be managed. That means that not only do we know where your factory is located. We know the wind zone, earthquake zone, flood zone, rain zone and temperature zone it’s in. We send engineers out to the site to inspect it and recommend upgrades. We track the upgrades and the condition of your facility.

With all that location specific information, just how important do you think it is that with SQL Server 2008 we’re finally getting a spatial data type? Yeah, exactly.

We’re in the process of launching our first full implementation of the spatial data type and, quite frankly, I was not ready. I had read several of the BOL entries and a few articles online, but nothing had given me enough information for me to say I understood how spatial data worked. Now I at least feel like I’ve got a basis for understanding. That’s because of Alistair Aitchison‘s book Beginning Spatial.

This was not an easy read for me. The first few chapters are frankly difficult. The concepts behind spatial data are not exactly simple. However, Mr. Aitchison did a great job of presenting the information in well explained, digestible chunks. I was able to get through. Coming out the other side, I’ve still got a lot of questions and concerns (especially around spatial indexes, he could have done another three or four chapters on just that topic), but I’m ready to support our project going forward now.

If you’re dealing with Spatial data and you’re not already a guru on the topic, I’d strongly suggest you pick this up. If you’re an expert, this book isn’t for you. There are plenty of examples and Mr. Aitchison walks you through some important concepts, such as importing spatial data. Did you realize that Microsoft didn’t include any mechanisms for dealing with in SSIS? Instead you either need to get creative with TSQL or, better still, use a third party product. I got a copy of Shape2SQL and I’ve tried it out. It seems to do what we need. We still might end up buying a commercial product (several were listed in the book). But it was the information in the book that told me what I needed to know to get started working with our spatial data.

I need to track down more info on spatial indexes now.

Permalink Leave a Comment

Reading to Learn

March 30, 2009 at 10:55 am (spatial data, SQL Server 2008, SQLServerPedia Syndication) (, , , , , , )

I just finished chapter 1 of Alastair Aitchison’snew book on SQL Server spatial data, “Beginning Spatial with SQL Server 2008.” If this is the beginners book… oh boy. The advanced book must be insane. Seriously though, Mr. Aitchison seems to have written a fantastic book. I’m going to tear through it as fast as I can because I’ve got two projects that are looking to start using spatial data and quite frankly, I’m a bit lost.

There’s a great discussiongoing on over at SSC as to the worth of technical books for DBA’s. It’s based on this editorialby Tony Davis. I’m surprised by the number of people who say they don’t use books. It seems that a lot more people use blogs and articles and discussion groups to learn. Maybe I’m showing my age a bit, but I don’t think a blog post or an article is going to get the depth and knowledge that Mr. Aitchison is displaying in this book. I know I’m regularly opening Kalen Delaney’s Inside SQL Server 2005 (and the new one for 2008 just came out) to look up bits & pieces of information that just isn’t as readily available on the web. Also, it’s worth pointing out, except for the editing that comes from people who read this blog, no technical review is done of this information. I might be right about the things I post, but I could be VERY wrong. Same with any other blog you read, including blogs by the big names. Despite the errors that creep into books (and trust me, they do), books are very carefully scrutinized by multiple sets of eyes to try to catch those errors prior to publication. They miss some, but they try not to miss any. Few blogs are like that. Not that many technical publications are terribly strict about technical accuracy either. I generally find more good information in the right books than anywhere else.

End of rant. I need to get back to reading this excellent book.

Permalink 4 Comments