More Spatial Headaches

June 11, 2009 at 2:38 pm (spatial data, SQL Server 2008, SQLServerPedia Syndication, TSQL)

I keep thinking I’ve got a handle on spatial data and I keep finding out I’m wrong. I rewrote the cursor that my co-worker used into a query that joined the list of test criteria onto our list of locations. I used an OUTER APPLY to more or less run a check for each of the test criteria since, except for the proximity to the locations, there’s no actual relationship between the test criteria and the location data for me to join on.

The query looked something like this:

SELECT a.CriteriaDesc
      ,a.CriteriaLoc
      ,l.[Location].Lat AS LocationLatitude
      ,l.[Location].Long AS LocationLongitude
      ,l.LocationDesc
FROM dbo.Alert AS a
OUTER APPLY (SELECT x.[Location]
             FROM dbo.MyLocs x
             WHERE x.OrgID = 42
             AND x.[Location].STDistance(a.AlertLocation) < 50000) AS l

The cursor was taking almost a full minute to return just under 400 rows worth of data. This query took about 25 seconds. SCORE!

Unfortunately when I looked at the execution plan, the optimizer was not able to recognize that my spatial index would increase the performance. So I modified the query to provide a hint:


FROM dbo.MyLocs x WITH (INDEX(xsHigh))
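
Plugged back into the full query, the hinted version looked something like this (the same query as above, just with the hint added to the MyLocs reference):

SELECT a.CriteriaDesc
      ,a.CriteriaLoc
      ,l.[Location].Lat AS LocationLatitude
      ,l.[Location].Long AS LocationLongitude
      ,l.LocationDesc
FROM dbo.Alert AS a
OUTER APPLY (SELECT x.[Location]
             FROM dbo.MyLocs x WITH (INDEX(xsHigh))
             WHERE x.OrgID = 42
             AND x.[Location].STDistance(a.AlertLocation) < 50000) AS l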

Now the query returned in 5 seconds. BIG SCORE! Then I remembered my co-worker and his low density index. So I quickly rebuilt it and changed the INDEX hint to use the low density index. It ran in just over 2 seconds. Yeah, it’s a score, but why? I thought the high density index was the way to go with point locations. Just to see the middle behavior, I tried a medium density index, expecting it to land somewhere around 3.5 seconds, mid-way between the low and high density indexes… It ran in just over 1 second. WHAT THE F…

I then ran a whole series of experiments with my query. I kept changing the density and the cells per object value for the three indexes and recording times. In every test, the medium density index outperformed the low and high density indexes. It never outperformed the low density index by a whole lot, usually running ~50% faster than the low, as opposed to running about 5 times faster than the high density index. But it really did run better.
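
For anyone who wants to run the same kind of experiment, the index rebuilds looked roughly like this. It’s only a sketch: the index name xsMedium and the exact CELLS_PER_OBJECT value are for illustration, and it assumes the Location column on MyLocs is geography, the same as in the other posts. Between runs I just kept swapping the GRIDS density (LOW/MEDIUM/HIGH) and the CELLS_PER_OBJECT value:

CREATE SPATIAL INDEX [xsMedium] ON [dbo].[MyLocs]
(
[Location]
)
USING GEOGRAPHY_GRID
WITH
(GRIDS =(LEVEL_1 = MEDIUM,LEVEL_2 = MEDIUM,LEVEL_3 = MEDIUM,LEVEL_4 = MEDIUM)
, CELLS_PER_OBJECT= 1024) ON [PRIMARY]
GO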

My current working theory, and I’m trying to figure out how I can validate this, is that the low and medium density indexes were more efficient at eliminating the negatives, the places where the criteria weren’t in proximity to my locations, and that the medium density index was then simply better at retrieving the data than the low density index. I base this on the fact that the primary filter efficiency on the medium density index is better than the low (although not as good as the high).


Spatial Indexes and a Lack of Data

June 9, 2009 at 1:44 pm (spatial data, SQL Server 2008, SQLServerPedia Syndication, TSQL)

I was feeling quite confident about my new-found abilities with spatial indexes so I did a presentation for my team, to share what I had learned. I had also been sharing with one co-worker as I developed my knowledge of spatial indexes. While I was preparing my presentation, he was preparing his. I had focused on finding a set of data based on its proximity to a test location and then showing how retrieving that set of data was faster because of the spatial index. He took a different approach. He took the idea of saying, here’s a list of different test locations, let’s see which of our internal locations meet the proximity test. At the same time, he tried three different spatial indexes: one with high granularity, one with medium and a final one with low granularity.

The day came for presentations. I showed how the spatial index improved performance and how you could read the output of sp_help_spatial_geography_index to see how efficient the index was for the sample data. It all went without a hitch. Then my co-worker showed his tests, wherein the low density index outperformed the high and medium density indexes, despite having a primary_filter efficiency of only about 5%. I was floored. I couldn’t explain it, and neither could my co-worker. But as far as he was concerned, this whole spatial index thing was for the birds.

I went away and I started testing. Finally, after feeling like I had a handle on things, I asked my co-worker for his scripts. He had a cursor (which he acknowledged was a lousy way to write the query) that walked through the test locations, one at a time, and ran my query against our data. Sure enough, when I carefully built out the indexes and ran the test, the low density indexes worked better than the high density indexes. I set up a trace event and captured all the statement completion events so I could compare the high & low density runs. That’s when I spotted the issue.

The majority of the test locations didn’t find any matches in our location data. When the query ran against the low density index it would usually have a few hundred reads and a duration around 20ms to return no data. On the other hand, the same query, returning no data, on the high density index had about 5000 reads and took about 34ms. With each one of these cumulative queries running faster against the low density index, it really appeared to be better. But, when you got a query where data was returned, the low density index had more reads and a much longer duration, about 500ms, compared to a consistent 34ms for the high density index.

There was my explanation. The cursor masked the fact that over and over again, the query was running but no data was being returned. I’m in the process now of rewriting the query to use the same set of test locations, but JOIN it against our location data to see which of the indexes is faster.


ORDER BY Speed

June 3, 2009 at 10:26 am (SQL Server 2008, SQLServerPedia Syndication, TSQL)

I answered a question on SSC with the comment that while an INT would perform better than a DATETIME in an ORDER BY query, assuming each has a viable index, the difference wouldn’t be all that terribly substantial. Then I realized, maybe that’s not true. So I ran up a quick test, just to see.

First I created a little test table with the right indexes and loaded it with data:

CREATE TABLE dbo.IntDate
(IntCol INT NOT NULL,
DateCol DATETIME NOT NULL);

CREATE INDEX ixInt
ON dbo.IntDate(IntCol);
CREATE INDEX ixDate
ON dbo.IntDate(DateCol);

SELECT TOP 10000 IDENTITY( INT,1,1 ) AS n
INTO #Tally
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2;

INSERT INTO dbo.IntDate (
IntCol
,DateCol)
SELECT t.n,
DATEADD(dd,- t.n,GETDATE() )
FROM #Tally AS t;

DROP TABLE #Tally;

Then I ran these two queries. Each one correctly accessed the index, doing a scan, to retrieve the data:

SELECT id.DateCol 
FROM dbo.IntDate AS id
ORDER BY DateCol DESC

SELECT  id.IntCol
FROM dbo.IntDate AS id
ORDER BY IntCol DESC

Here are the execution times:

Table 'IntDate'. Scan count 1, logical reads 26, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 70 ms.
(10000 row(s) affected)
Table 'IntDate'. Scan count 1, logical reads 21, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 59 ms.

But what happens if I reverse the order?

SELECT id.DateCol 
FROM dbo.IntDate AS id
ORDER BY DateCol ASC 

SELECT id.IntCol
FROM dbo.IntDate AS id
ORDER BY IntCol ASC

Then the results are:

Table 'IntDate'. Scan count 1, logical reads 26, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 52 ms.
(10000 row(s) affected)
Table 'IntDate'. Scan count 1, logical reads 21, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 5 ms.

So, there can be a somewhat significant difference, although at least part of that is explained by the differences in the number of reads. Not exactly earth shattering, but good to know.


Spatial Index & Performance & Brain Pain

May 28, 2009 at 10:36 am (spatial data, SQL Server 2008, TSQL)

In my previous post, Spatial Data Hurts My Brain, I showed how a query used the spatial index, but hurt performance. Several people jumped in to try to help out. Thanks to Bob Beauchemin, Isaac Kunen, Gail Shaw, and Valerie Yakich (via Twitter).

I’ve definitely learned a bit more than I knew previously about spatial data and spatial indexes. I’ve posted about it before, but it bears repeating: if you’re just starting out, I can’t recommend Alistair Aitchison’s book, Beginning Spatial with SQL Server 2008, enough. Here’s where I’m at. I think I was confusing/confused/confounded/something about what to expect from a spatial index. I’m going to present two queries and two indexes and try to explain why each works well, or not, together, mainly as a means for enabling my understanding, not so much teaching anyone else. The first query is simply a retrieval of the data from a database where the distance between a supplied point and the points stored in the geography column is less than a certain value:

SELECT *
FROM dbo.ebe
WHERE ebe.[Location].STDistance(@location) < 15000

When this is run against the table with only a clustered index on the PK, this is the execution plan:

[execution plan image: spatialindex_1]

About what you would expect. The query returns almost 5000 rows of the 500000 in the table and runs for 14 seconds on my machine. Now I’m going to create this index and run the query again.

CREATE SPATIAL INDEX [ebe_High] ON [dbo].[ebe] 
(
[Location]
)
USING GEOGRAPHY_GRID 
WITH
(GRIDS =(LEVEL_1 = HIGH,LEVEL_2 = HIGH,LEVEL_3 = HIGH,LEVEL_4 = HIGH)
, CELLS_PER_OBJECT= 1024) ON [PRIMARY]
GO

When the query is run a second time, here’s the execution plan.

[execution plan image: spatialindex_2]

This time the query ran in under a second to return the same rows. Clearly, spatial indexes do work to speed up execution. The execution plan is hard to see here I realize. I can’t upload zipped files or .sqlplan files, so if you really want a copy of the plan, I’ve pasted both into a Word doc: spatial_plans

I’m not going to try to explain everything happening with the plan, because I’m not entirely sure what all of it is doing. I’m still figuring that out. Suffice it to say that the table valued function and the constant scan operator feed data into a seek against the clustered index (which is missing statistics for some reason), where it’s fed to a compute scalar and then finally loop joined to the clustered index in what I assume is the equivalent of a key lookup. Again, I’m still figuring this part out, and I’ve skipped some steps.

Now, I want to modify the query because the goal is to not simply get all locations near this point of interest, but the locations of a specified organization:

SELECT *
FROM dbo.ebe
WHERE ebe.[Location].STDistance(@location) < 15000
AND ebe.[OrgId] = 4464

When I run this query, I get the following execution plan:

[execution plan image: spatialindex_3]

The query ran in 96ms and returned 148 rows. It’s all you could ask for. Or is it? What if I add this index:

CREATE NONCLUSTERED INDEX [ebe_nc] ON [dbo].[EBE] 
(
[OrgId]
ASC
) INCLUDE (InterestId,Location)
GO

Now when I run the query, I get the following execution plan:

[execution plan image: spatialindex_4]

Much, much less complex and quite easy to explain. I have a covering index that does a simple seek to retrieve the initial set of data and then it’s filtered to remove the locations that are too far from the point of interest. But, here’s where it gets interesting. The execution time was 126ms. The spatial index shaved about 30ms, or about 25%, off the time of what would normally be the fastest access path, a covering non-clustered index. That is a substantial savings.

I’ve done a lot of testing over the last few days and I didn’t always see this kind of performance benefit. Several times I’ve seen worse performance. Also, I’m not sure why a brand new index has missing statistics, as seen in the execution plans where the spatial index was used. Further, the optimizer is choosing the nonclustered index over the spatial index, clearly because of the estimated costs, when the spatial index does, in fact, perform better in this instance. I’m not sure what to do about that, apart from dropping the nonclustered index. That’s OK. It just means more to learn. Starting with explaining exactly what’s happening in the spatial index execution plan. That may be the next post.
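
One thing worth trying would be to hint the spatial index so the optimizer has to use it. A minimal sketch, reusing the table and index names from this post (not something I’ve fully tested here):

SELECT *
FROM dbo.ebe WITH (INDEX(ebe_High))
WHERE ebe.[Location].STDistance(@location) < 15000
AND ebe.[OrgId] = 4464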

UPDATE:
I forgot to mention. I ran the sp_help_spatial_geography_index procedure to see what information it could give, but when I supplied the parameter value listed above, I received a filter efficiency of zero. I’m clearly not understanding how to apply this procedure.
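
In case it helps anyone reproduce this, the call I was attempting looked more or less like the sketch below. The parameter names are the ones documented for sp_help_spatial_geography_index; the point passed in for @query_sample is just a placeholder, standing in for the actual @location value used in the queries above:

DECLARE @location geography;
SET @location = geography::STPointFromText('POINT(-70.1 42.3)', 4326); -- placeholder point, not the real value

EXEC sp_help_spatial_geography_index
	@tabname = 'dbo.ebe',
	@indexname = 'ebe_High',
	@verboseoutput = 1,
	@query_sample = @location;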


Execution Plan Estimated Operator Cost

March 19, 2009 at 1:09 pm (SQL Server 2008, SQLServerPedia Syndication)

I’ve said it over and over again, the costs on operators in execution plans, even in actual execution plans are estimates.  You need to understand that when looking at your execution plans. It’s vital because you need to be able to distinguish between the truly costly parts of a plan and the less costly parts of a plan. Don’t believe me? Take a look at this picture and see if you can spot the discrepancy:

[execution plan image: cost]

Spot it yet?

Add up the costs for the operators visible in this part of the plan…

Yep, 125%. And there’s more to the plan that I’m not showing. I think this one must total near 200%. The statistics are up to date and there are no consistency errors in the database. These estimates are just off sometimes.

This is a rather costly query being run against Microsoft Dynamics CRM. There were two missing indexes identified by the optimizer and about four key lookup operations that I fixed with two index changes. This jumped out and I wanted to share. The fact is, right now, the query is working well. That particular table has one scan and some reads, identified from looking at the STATISTICS IO output, and it’s not even close to 100% of the cost of the query, but I would sure be in a panic if I believed the estimated operator cost.
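
For the record, checking the actual work rather than trusting the estimated costs is as simple as wrapping the query in the STATISTICS IO/TIME settings. A generic sketch (the query here is just a stand-in; substitute the one you’re actually investigating):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- stand-in query; replace with the query you're actually investigating
SELECT TOP (10) name
FROM sys.objects
ORDER BY name;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;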


A Fix for nHibernate’s Parameter Problem

March 11, 2009 at 6:26 am (nHibernate, Visual Studio)

I had previously outlined a problem with nHibernate code that, in my opinion, was going to cause a lot of problems on production systems especially with memory and the procedure cache. However, it does appear that there is a possible solution. That doesn’t eliminate my concerns over nHibernate, but it sure does help.


Bad Performance Tip

March 4, 2009 at 2:23 pm (SQL Server 2005, SQL Server 2008, SQLServerPedia Syndication, TSQL)

I saw a performance tip that just didn’t make any sense to me:

In cases where you are using the IN clause, try to order the list of values so that the most frequently found values are placed first.

As I said, that made no sense to me. The IN clause is not like EXISTS, where the query can stop as soon as it finds a good match. So I set up a test with AdventureWorks. I found a few different ProductID values, the highest, the lowest and a few in between, and ran a very simple query to test this tip:

/*
ProductID        RowCount
870                        4688 rows
877                        1327 rows
972                        380 rows
823                        148 rows
723                        52 rows
897                        2 rows*/

DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
GO
SELECT  sod.ProductID
       ,sod.SalesOrderDetailID
FROM    Sales.SalesOrderDetail AS sod
WHERE   sod.ProductID IN (870, 877, 972, 723, 897)
GO
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
GO
SELECT  sod.ProductID
       ,sod.SalesOrderDetailID
FROM    Sales.SalesOrderDetail AS sod
WHERE   sod.ProductID IN (897, 723, 972, 877, 870)
GO
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
GO
SELECT  sod.ProductID
       ,sod.SalesOrderDetailID
FROM    Sales.SalesOrderDetail AS sod
WHERE   sod.ProductID IN (972, 870, 877, 897, 723)

It resulted in three identical execution plans, every one of them a nonclustered index seek. In the three tests, the most frequently found values were first, last and mixed. All three queries had I/O that looked like this:

(6449 row(s) affected)
Table ‘SalesOrderDetail’. Scan count 5, logical reads 26, physical reads 7, read-ahead reads 37

And the execution times were:

Query     Compile     Run
1           45ms        47ms
2           14ms        28ms
3           4ms          30ms

I then ran all three again, in reverse order:

Query     Compile     Run
3           34ms        52ms
2           25ms        46ms
1           5ms          25ms

The times varied a bit from run to run, but were essentially the same. Changing the order of the values in the IN clause affected the query neither positively nor negatively. So where does this tip come from?

If people post a simple list of tips but don’t explain what they mean, show some sample code, or in any way attempt to educate you as to why you should do X in your queries, I wouldn’t trust it.


More Refinements on the Missing Indexes Query

March 2, 2009 at 8:03 am (SQL Server 2005, SQL Server 2008, TSQL)

Greg Larson posted a really nice query to find the worst performing stored procedures in your system. He put in all kinds of attributes to make it customizable, changing the definition of “worst” to different measures, etc. Great query in general. In it he linked to sys.dm_exec_plan_attributes and got the dbid attribute. Duh! So instead of spelunking through the XML to retrieve the database name, I can pull the dbid and use the DB_NAME function. Cleans things up considerably. Thanks Greg.

Here’s the cleaned up code:
WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/showplan'
AS sp)
SELECT DB_NAME(CAST(pa.value AS INT))
,s.sql_handle
,s.total_elapsed_time
,s.last_execution_time
,s.execution_count
,s.total_logical_writes
,s.total_logical_reads
,s.min_elapsed_time
,s.max_elapsed_time
--,s.query_hash
,p.query_plan
,p.query_plan.value(N'(sp:ShowPlanXML/sp:BatchSequence/sp:Batch/sp:Statements/sp:StmtSimple/sp:QueryPlan/sp:MissingIndexes/sp:MissingIndexGroup/sp:MissingIndex/@Table)[1]',
'NVARCHAR(256)') AS TableName
,p.query_plan.value(N'(/sp:ShowPlanXML/sp:BatchSequence/sp:Batch/sp:Statements/sp:StmtSimple/sp:QueryPlan/sp:MissingIndexes/sp:MissingIndexGroup/sp:MissingIndex/@Schema)[1]',
'NVARCHAR(256)') AS SchemaName
,p.query_plan.value(N'(/sp:ShowPlanXML/sp:BatchSequence/sp:Batch/sp:Statements/sp:StmtSimple/sp:QueryPlan/sp:MissingIndexes/sp:MissingIndexGroup/@Impact)[1]',
'DECIMAL(6,4)') AS ProjectedImpact
,ColumnGroup.value('./@Usage', 'NVARCHAR(256)') AS ColumnGroupUsage
,ColumnGroupColumn.value('./@Name', 'NVARCHAR(256)') AS ColumnName
FROM (SELECT TOP 20
s.sql_handle
,s.plan_handle
,s.total_elapsed_time
,s.last_execution_time
,s.execution_count
,s.total_logical_writes
,s.total_logical_reads
,s.min_elapsed_time
,s.max_elapsed_time
--,s.query_hash
FROM sys.dm_exec_query_stats s
ORDER BY s.total_elapsed_time DESC
) AS s
CROSS APPLY sys.dm_exec_query_plan(s.plan_handle) AS p
CROSS APPLY p.query_plan.nodes('/sp:ShowPlanXML/sp:BatchSequence/sp:Batch/sp:Statements/sp:StmtSimple/sp:QueryPlan/sp:MissingIndexes/sp:MissingIndexGroup/sp:MissingIndex/sp:ColumnGroup')
AS t1 (ColumnGroup)
CROSS APPLY t1.ColumnGroup.nodes('./sp:Column') AS t2 (ColumnGroupColumn)
CROSS APPLY sys.dm_exec_plan_attributes(s.plan_handle) pa
WHERE p.query_plan.exist(N'/sp:ShowPlanXML/sp:BatchSequence/sp:Batch/sp:Statements/sp:StmtSimple/sp:QueryPlan//sp:MissingIndexes') = 1
AND pa.attribute = 'dbid'


View vs. Table Valued Function vs. Multi-Statement Table Valued Function

August 13, 2008 at 8:36 am (TSQL)

About five years ago, I was checking an app before it went to production. I hadn’t seen the app before then; a junior dba had worked with the developers designing and building it. It didn’t use a single stored procedure or view. Instead, it was built entirely of multi-statement UDF’s. These UDF’s called other UDF’s which joined to UDF’s… It was actually a very beautiful design in terms of using the functions more or less like objects within the database. Amazing. It also would not, and could not, perform well enough to function, let alone scale. It was a horror because they thought they were done and ready to go to production, but no one had ever tested more than a couple of rows of data in any of the tables. Of course, a couple of rows of data worked just fine. It was when we put in 10, 1000, a few million, that the thing came to a total and complete halt. We spent weeks arguing about the stupid thing. The developers insisted that since it was “possible” to do what they did, that, in fact, it was OK to do what they did.

Anyway, with the help of a Microsoft consultant, we finally cleaned up the app and got it on its feet. Ever since then, I’ve preached the dangers of the multi-statement table valued function. The thing to remember is, there are no statistics generated for these things. That means the optimizer thinks they return a single row of data. When they only return a few rows, everything is fine. When they return even as little as a hundred rows, like the example I’m posting below, they stink.

Anyway, I boiled up this silly example because some developer accused me and several other DBA’s of spreading Fear, Uncertainty, and Doubt because we suggested that the multi-statement UDF is something to avoid if possible. Actually, he all but stated that we didn’t know what we were talking about. I was peeved. Hence this example. Feel free to check it out. Oh, and if you check the execution plans, note that the multi-statement UDF is marked as the least costly even though it actually performs twice as slow as the others. One more example of execution plans being wrong.

Here are the time results from one run of the view & UDF’s:

(99 row(s) affected)

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.

(99 row(s) affected)

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 2 ms.

(99 row(s) affected)

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 3 ms.

And the code to test for yourself:

CREATE TABLE dbo.Parent
(ParentId int identity(1,1)
,ParentDate datetime)

CREATE TABLE dbo.Child
(ChildId int identity(1,1)
,ParentId int
,ChildDate datetime)

DECLARE @i int
DECLARE @j int
SET @i = 1
SET @j = 1
WHILE @i < 100
BEGIN
INSERT INTO dbo.Parent
(ParentDate)
SELECT GETDATE()
WHILE @j < 100
BEGIN
INSERT INTO dbo.Child
(ParentId
,ChildDate)
SELECT @i
,GETDATE()
SET @j = @j + 1
END
SET @i = @i + 1
END

GO

CREATE VIEW dbo.vJoin
AS
SELECT p.ParentId
,p.ParentDate
,c.ChildId
,C.ChildDate
FROM dbo.Parent p
JOIN dbo.Child c
ON p.ParentId = c.ParentId

GO

CREATE FUNCTION dbo.SingleUDF ()
RETURNS TABLE
AS
RETURN
(
SELECT p.ParentId
,p.ParentDate
,c.ChildId
,C.ChildDate
FROM dbo.Parent p
JOIN dbo.Child c
ON p.ParentId = c.ParentId
)

GO

CREATE FUNCTION dbo.MultiUDF ()
RETURNS @Multi TABLE
(ParentId int
,ParentDate datetime
,ChildId int
,ChildDate datetime)
AS
BEGIN
INSERT INTO @Multi
(ParentId
,ParentDate
,ChildId
,ChildDate)
SELECT p.ParentId
,p.ParentDate
,c.ChildId
,C.ChildDate
FROM dbo.Parent p
JOIN dbo.Child c
ON p.ParentId = c.ParentId
RETURN
END

GO

set statistics time on
select * from vJoin
select * from SingleUDF()
select * from MultiUDF()
set statistics time off


Why are Subqueries Dangerous?

June 13, 2008 at 1:51 pm (TSQL)

If you go around to the various forums, you’ll see postings, including some I’ve put up, that say using subqueries, especially correlated subqueries, in the SELECT list of queries is bad because it effectively acts as a cursor.
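
Just so we’re all looking at the same pattern, here’s a minimal sketch of the kind of query in question, using hypothetical Sales.Customers and Sales.Orders tables. The first version has a correlated subquery in the SELECT list, which is the construct that gets compared to a cursor; the second is the usual join/GROUP BY rewrite:

-- correlated subquery in the SELECT list, evaluated per row of Customers
SELECT c.CustomerID
      ,c.CustomerName
      ,(SELECT COUNT(*)
        FROM Sales.Orders AS o
        WHERE o.CustomerID = c.CustomerID) AS OrderCount
FROM Sales.Customers AS c;

-- equivalent set-based rewrite
SELECT c.CustomerID
      ,c.CustomerName
      ,COUNT(o.OrderID) AS OrderCount
FROM Sales.Customers AS c
LEFT JOIN Sales.Orders AS o
    ON o.CustomerID = c.CustomerID
GROUP BY c.CustomerID
        ,c.CustomerName;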

I got called on it. So I had to do a bit of research.

Books Online talks about it right off when describing subqueries. That’s not proof though.

I did some more looking around. Adam Machanic does a nice job of slicing up Functions, which are basically the same thing. But it’s not exactly the same.

I kept looking. This blog entry is just flat wrong, but the second comment points out the fallacy. Jeff Moden would also like this example since it shows the disparity between the actual cost of queries and the estimated cost used by the graphical execution plans to display results. But this isn’t in the SELECT statement. Moving on.

Of course. The real answers always come from the SOURCE: Itzik

