SQL Server 2008 | Home of the Scary DBA

Switching Off Parameter Sniffing

November 15, 2010 at 8:00 am (SQL Server 2008, SQLServerPedia Syndication, TSQL) (parameter, parameter sniffing, parameters, variables)

Or, another way to put it, in most cases, shooting yourself in the foot.

I was not aware that the cumulative update for SQL Server 2008 back in June included a switch that allows you to turn parameter sniffing off within SQL Server. Thanks to Kendra Little (blog|twitter) for letting me know about it (although she let me know by “stumping the chump” during my lightening talk at the Summit, thanks Kendra!).

When I first saw the switch, I thought about the places where turning off parameter sniffing could be helpful. But, as I thought about it, the more I realized that removing parameter sniffing was an extremely dangerous switch. Why? Because, most people only ever hear about parameter sniffing when they run into a problem. Someone says “Parameter sniffing” and you see people cringe. Too many people will take this information in and go, “Hey, I can just switch parameter sniffing off and I’ll have a much faster system, all the time.” But… even when you’re not hitting a problem with parameter sniffing, you’re still getting parameter sniffing. Here is where I see a problem. Let’s discuss what parameter sniffing is.

Parameter sniffing is applicable to stored procedures and parameterized queries. What happens is, when a value is passed to a parameter, the optimizer has the ability to read, or “sniff,” the value of that parameter. It can do this because it knows exactly what the value is when the proc/query is called. This is not applicable to local variables, because the optimizer can’t really know what those values might be, where as it knows exactly what the values of parameters are going in. Why does it do this? One word: statistics. Statistics are what the optimizer uses to determine how queries will be executed. If the optimizer is given a specific value, it can then compare that value to the statistics on the index or table in question and get as good an answer as is possible from those statistics as to how selective this value may be. That information determines how the optimizer will run the query and because it is using specific values, it’s looking at specific information within the stats. If the parameters are not sniffed, the statistics are sampled and a generic value is assumed, which can result in a different execution plan.

The problem with parameter sniffing occurs when you have out of date statistics or data skew (certain values which return a wildly different set of results compared to the rest of the data within the table). The bad statistics or skew can result in an execution plan that is not consistent with most of the data that the stats represent. However, most of the time, in most situations, this is an edge case. Notice that hedging though. When parameter sniffing goes bad, it hurts.

Most of the time we’re going to gain huge benefits from parameter sniffing because the use of specific values leads to more accurate, not less accurate, execution plans. Sampled data, basically an average of the data in the statistics, can lead to a more stable execution plan, but a much less accurate one. Switching parameter sniffing off means that all queries will use sampled data, which creates a serious negative impact on performance. Most of the time, most of us are benefitting wildly from the strengths of parameter sniffing and only occasionally are we seeing the problems.

Unless you know, and I mean know, not suspect, that your system has major and systematic issues with parameter sniffing, leave this switch alone and let the optimizer make these choices for you. If you don’t, it’s very likely that you’ll see a performance hit on your system.

Permalink Leave a Comment

Powershell is Really Easy… If you know what you’re doing

October 18, 2010 at 8:00 am (PowerShell, SQL Server 2005, SQL Server 2008, SQLServerPedia Syndication) (PowerShell)

Sigh… I just spent almost a full work day trying to come up with, what turned out to be a single line script. The requirements were simple. After patching of servers is completed, and everything is rebooted, sometimes, not often, but sometimes, SQL Server Agent doesn’t restart correctly. I was asked to put together a script that will check a list of servers to see if Agent is running. Any PowerShell guru’s reading this are either laughing or crying, already.

So, I started exploring how to connect to the server using using some type of PS-Drive, but I quickly figured out that I’d have to loop through the servers that way. Plus, I don’t think there’s a way to check Agent status through the SQL drive. So that went out the window. Then, it was to use Invoke-SQLCmd. After all, I can call multiple servers through the -ServerInstance property, so I just need to find a good T-SQL statement to check Agent status… Googling & Binging didn’t turn up a good T-SQL statement to check Agent status (not saying there isn’t a way, but I couldn’t find one easily, plus, this is a story about PowerShell). On to SMO. All I had to do was figure out how to connect to the server and the SQL Agent object through SMO. I’d have to loop through the servers, but for such a simple script, that shouldn’t be hard. Ever tried to connect directly to the Agent through SMO? Me neither. It’s a pain.

Finally, I referred back to Don Jones (blog|twitter) book, Windows Powershell 2.0, TFM (and a more appropriately named book, there isn’t), and noticed a statement, Get-Service… Hey! Isn’t SQL Agent basically a service? Yes. Yes it is. Here’s the final script, in all it’s simplistic glory:

Get-Service -DisplayName "SQL Server Agent*" -ComputerName (Get-Content "servers.txt") | Select-Object MachineName,Status

Yep, it was that easy. Kids, don’t try this at home. Take the time to learn what you’re doing. Don’t assume you know.

Permalink 13 Comments

How to Tell if Execution Plans are Reused

October 4, 2010 at 8:00 am (SQL Server 2005, SQL Server 2008, SQLServerPedia Syndication, TSQL) (compiles, dynamic management object, execution plans, plan recompile, plan reuse, recompiles)

I try to watch the search phrases that point people to the blog because sometimes, you get a sense of what problems people are running into. The latest question or phrase I’ve seen a lot lately is along the lines of “how do you know if an execution plan is being reused.”

Since compiling an execution plan can be an extremely expensive operation, it’s worth your time to understand how well a given plan is getting reused. If you’ve seen me present, I’ll frequently talk about the application that had a query with an 86 table join. Recompiles on that thing were frequent and extremely costly. The only good news was, they were recompiles. If we weren’t getting plan reuse it would have been an even worse system than it was.

There are a number of ways you can see if a plan is being reused. One of the easiest is to hit the DMOs that look into the plan cache. You can look at either sys.dm_exec_procedure_stats or sys.dm_exec_query_stats. These DMOs are somewhat different and somewhat the same. Actually, they sort of cross each other. For sys.dm_exec_procedure_stats, you get aggregate information about stored procedures. For sys.dm_exec_query_stats you get aggregate information about queries, which, may be run within stored procedures. So either or both could be useful depending on what you’re attempting to measure. However, both return a very useful counter, execution_count. Quite simply, that tells you that the plan, whether for the statement or the procedure, is being reused. To get maximum gain out using either of these, you’ll want to combine them with other DMOs. Something along these lines can show you the plans for procedures that have been running against a system, ordered by the number of times they’ve been reused:

SELECT deps.execution_count ,
OBJECT_NAME(deps.object_id, deps.database_id) 'Procedure' ,
deqp.query_plan
FROM sys.dm_exec_procedure_stats AS deps
CROSS APPLY sys.dm_exec_query_plan(deps.plan_handle) AS deqp
ORDER BY deps.execution_count DESC

The only thing wrong with using the DMO like this is that you can only see what’s currently in cache. This means no history, depending on how volatile the cache is on your system.

Another way to tell if a plan is being reused, is to set up a server side trace and capture the event for the SP:CacheMiss or SP:CacheHit. This is pretty straight forward. If the query was not in, it generates a miss event. If it was there, it generates a hit. But, this has to be running in order for you to know if you had a hit or a miss. The good news is, if it’s running, you’ve got historical information since this captures the event as it occurs.

You can also catch recompile events using SQL:StmtRecompile. Yes, you can get SP:Recompile if you’re only ever dealing with procedures, but if you’ve got any kind of ad-hoc querying going on in the system or triggers, you’ll miss recompile events. Basically, because recompiles are at the statement level starting in 2005, BOL recommends only using SQL:StmtRcompile.

That’s pretty much it. Using these methods in the appropriate place will let you know if the plan is being reused or not.

Permalink 1 Comment

Parameter Name Size And Performance

September 27, 2010 at 8:00 am (SQL Server 2005, SQL Server 2008, SQLServerPedia Syndication, TSQL) (execution plan, execution plans, name size, parameter, parameter name size, parameters)

I saw an odd statement the other day, “The size of the name of the parameter does not affect performance.” My first thought was, “Well, duh!” But then, I had one of those, “Ah, but are you sure” thoughts. And you know what, I wasn’t sure.

If size of the parameter name did affect performance, I figured, the one sure place where that would be evident is in the size of the execution plan. Right? I mean, if there was an impact on memory, and hence on performance, that’s probably where you’d see evidence of it. I wrote two queries:

 DECLARE @ThisIsAVeryVeryLongParameterNameThatIsTrulyRidiculousButItIllustratesThePointThatParameterLengthDoesNotAffectPerformance int
SET @ThisIsAVeryVeryLongParameterNameThatIsTrulyRidiculousButItIllustratesThePointThatParameterLengthDoesNotAffectPerformance = 572
SELECT soh.SalesOrderID
,sod.SalesOrderDetailID
FROM Sales.SalesOrderHeader AS soh
JOIN Sales.SalesOrderDetail AS sod
ON soh.SalesOrderID = sod.SalesOrderID
WHERE soh.SalesOrderID = @ThisIsAVeryVeryLongParameterNameThatIsTrulyRidiculousButItIllustratesThePointThatParameterLengthDoesNotAffectPerformance

DECLARE @v int
SET @v = 572
SELECT soh.SalesOrderID
,sod.SalesOrderDetailID
FROM Sales.SalesOrderHeader AS soh
JOIN Sales.SalesOrderDetail AS sod
ON soh.SalesOrderID = sod.SalesOrderID
WHERE soh.SalesOrderID = @v

If you run this against AdventureWorks2008R2 you’ll get two distinct, but identical, execution plans:

You can see that they look identical, but how do I know they’re distinct? If you run this query:

 SELECT deqs.creation_time,
deqs.query_hash,
deqs.query_plan_hash
FROM sys.dm_exec_query_stats AS deqs
CROSS APPLY sys.dm_exec_query_plan(deqs.plan_handle) AS deqp
CROSS APPLY sys.dm_exec_sql_text(deqs.sql_handle) AS dest
WHERE dest.text LIKE '%SELECT soh.SalesOrderID%'

You’ll get this back as a result:

creation_time                               query_hash                                 query_plan_hash
2010-09-23 18:18:09.347      0x8D0FB9D524B8DD4D       0x13707445560737BA
2010-09-23 18:18:16.223       0x8D0FB9D524B8DD4D       0x13707445560737BA

Two distinct queries, but with identical hash values, so the plans generated are nearly the same, but clearly different, due to the fact that they were built with different parameters, including the monster name. So, how to see if there is a difference in the plan generated that could affect performance? How about the execution plan properties. First, the property sheet for the SELECT operator for the query with the long parameter name:

Of particular note is the Cache Plan Size. Let’s compare it to the same property sheet for the small parameter name:

If you compare the two, you’ll see that they’re the same. In fact, if you look at almost all the values, you’ll see that the Compile CPU, Compile Memory, and Compile Time are all identical. Based on all this information, I have to conclude that no, the size of the name of the parameter doesn’t affect performance, positively or negatively. But why?

I’m actually not 100% sure, but based on some things I know, here’s what I think. The Algebrizer within the Query Optimizer breaks down all the objects referred to within a query plan. It assigns them all values and identifiers for that plan, part of getting the information together to feed the plan into the mathematical part of the Optimizer. I’ll bet it just assigns values to parameters that are the same type of value, if not the same exact value, and that value is the same size from one execution plan to the next.

This means that you don’t save memory by assigning parameters @a, @b, @c when in fact you mean @ReferenceCount, @MaxRetries, @BeginDate. Do I think you should put in parameters of the silly length I put in before? No, of course, not, because it makes the TSQL code less clear. But, so does putting in equally silly, short, parameter names.

Don’t make your TSQL code hard to read. It doesn’t help performance.

Permalink 3 Comments

Recursive Hiearchies in Reporting Services

September 3, 2010 at 8:44 am (SQL Server 2008, SQLServerPedia Syndication, Uncategorized) (recursive hieararchy, reporting services, sql server reporting services, SSRS)

I am not a Reporting Services guru and nor do I play one on TV. I am however forced to be all things Microsoft Data where I work. So I frequently find myself stretching way beyond my abilities. I just had to get a report running that feeds from a web service and has a recursive hiearchy with customized aggregation on multiple fields with drill down to a different set of details. Yeah, through the internet I can see the SSRS monsters rolling their eyes at the ease of this task. But for us mere mortals it was work. Since I spent so much time learning how to do it, I thought I’d share.

XML as a Source

First, because we have a very paranoid (and appropriately so) group of PeopleSoft administrators, I couldn’t get direct access to the Oracle database. Instead, they provided me with a web service. Easy enough to consume, but it comes back as XML. Good news is Reporting Services can consume XML through a URL. Bad news is that it has a sort of proprietary XQuery language that is extremely obtuse (or I find it so, but then I’ve had trouble with SQL Server’s XQuery as well).

Setting up the Data Source is extremely simple. When you select XML from the Type dialogue, it’s going to ask you for a Connection String. Supply the URL. Done.

The work comes when you need to set up the DataSet. When you set the Data Source the Query Type will change to Text. No options. And you’ll be looking at a big blank box of nothing. My initial XML data set was this stacked hiearchy that had nested departments, accurately portraying the structure of the data. To query this XML you can do one of two things, set up the XML path as described in this excellent Microsoft white paper or allow SSRS to parse the XML for you. I tried working through the path, but I kept excluding parts of the structure. Basically I needed a method to recursively union the data within XML and, frankly, that was too hard. So I tried the automatic route. What’s the query look like for the automatic route?

That was tough. But, the same problem occurred. According to the white paper referenced above, letting SSRS figure out how to parse the XML means it will walk through and identify the first repeating group within the XML and that will be the structure it uses for the rest of the data. So, in my example, I have Departments and Personnel. The Personnel are inside the Department and Departments are inside Departments which have other Personnel… etc. It looks something like this:

<?xml version="1.0"?>
<ROOT_SEGMENT>
<REPORT_TITLE>Monster Hunters Status</REPORT_TITLE>
<SUMMARY>
<DEPTID>997</DEPTID>
<PARENT_DEPTID></PARENT_DEPTID>
<DETAIL>
<EMPLID>000001</EMPLID>
<NAME>Shackleford, Julie</NAME>
<TERMINATED>N</TERMINATED>
<RETIRED>N</RETIRED>
</DETAIL>
<DETAIL>
<EMPLID>000002</EMPLID>
<NAME>Jones, Trip</NAME>
<TERMINATED>Y</TERMINATED>
<RETIRED>N</RETIRED>
</DETAIL>
<SUMMARY>
<DEPTID>998</DEPTID>
<PARENT_DEPTID>997</PARENT_DEPTID>
<DETAIL>
<EMPLID>000003</EMPLID>
<NAME>Pitt, Owen</NAME>
<TERMINATED>N</TERMINATED>
<RETIRED>N</RETIRED>
</DETAIL>
<DETAIL>
<EMPLID>000003</EMPLID>
<NAME>Newcastle, Holly</NAME>
<TERMINATED>N</TERMINATED>
<RETIRED>N</RETIRED>
</DETAIL>
<SUMMARY>
<DEPTID>342</DEPTID>
<PARENT_DEPTID>998</PARENT_DEPTID>
<DETAIL>
<EMPLID>000022</EMPLID>
<NAME>Harbinger, Earl</NAME>
<TERMINATED>Y</TERMINATED>
<RETIRED>Y</RETIRED>
</DETAIL>
</SUMMARY>
</SUMMARY>
</SUMMARY>
</ROOT_SEGMENT>

Problem is, the first repeating group didn’t include the nesting. That was a deviation, so it didn’t read in the same way. What I had to do, in order to use the automated parsing, was flatten the structure, moving the SUMMARY areas outside of each other. With the new structure, the query returned all the data. Now the trick was to get the department hiearchy into the report

Recursive Hiearchies

Thankfully, after a bit of searching, I found this in the documentation on SSRS. It shows exactly what I needed, the, incredibly simple, method for creating a recursive hiearchy. The only trick was to have the Parent field stored with the child records. You can see that in the XML above, but the original didn’t have it. Once that modification was in place, it was simple. Follow the directions. In my case, DEPTID became the grouping field. To support other functions I also changed the name of the group so it could be referenced in functions.

Once it was created, simply going into the Advanced tab in the Row Groups property window and setting PARENT_DEPTID as the recursive parent was all that was needed.

Way too easy. But, how to get the drill down and the aggregates?

Drill Down & Aggregates

With that in place, the query will return hiearchical data, grouping on the DEPTID and keeping the parent child relationships in order. To establish drill down, it’s just a matter of going into the Row Group properties for the Department group again. In the Visibility tab, you set the visibility to Hide and check “Display can be toggled by this report item:”

Once that’s done, the recursive groups are only displayed as the little plus signs expand and contract the groups. It works great. You can even get fancy and add an indent function as shown in this bit of the documentation. But, how to do get the totals to display recursively? Not tricky at all. In fact, pretty easy. Since the data coming out has a set of flags that I have to check for positive or negative values, I have to use a expression to check them anyway. Something like this:

=Sum(CInt(IIf(Fields!HIRE_FLG.Value.Equals("Y"),1,0))

Luckily, built right into the function is a method to make it work recursively, so that you get totals of the children displayed with the parent. All that’s necessary is to supply the group, which I named earlier, Department, and tell it to total this stuff in a recursive manner, like this:

=Sum(CInt(IIf(Fields!Terminated.Value.Equals("Y"),1,0)),"Department",Recursive)

Put one of these with the appropriate field and you have a nice neat report.

Conclusion

To finish up, none of this is rocket science. It’s just a question of knowing where to go and how to put it all together. Being a newb when it comes to Reporting Services, I spent a lot of time struggling. Now, you won’t have to.

Permalink 5 Comments

Viva Las Vegas!

August 11, 2010 at 11:00 am (SQL Server 2005, SQL Server 2008, TSQL) (execution plans, presenting, virtual presentations)

I won’t be going to Las Vegas, but I will be presenting to the Las Vegas SQL Server Users Group, S3OLV. The sesssion will be “Introduction to Execution Plans.” Please swing by if you’re in the area.

I’m not sure if they’ll make the LiveMeeting available to the public or if they’ll record the session. But if they do, please attend that way too.

Permalink Leave a Comment

24 Hours of PASS: Summit Preview

August 11, 2010 at 8:58 am (PASS, SQL Server 2008, SQLServerPedia Syndication, TSQL) (24 hours of PASS, PASS, PASS Community Summit, performance, performance metrics, performance monitoring, performance tuning, professional association of sql server users)

Registration is open for the second 24 Hours of PASS this year. This one is going to be a preview of the Summit itself. So all the sessions are tied, in some manner, to sessions being given at the summit.Here’s a link to go and register.

I’m very excited to be able to say that I’ll be presenting in this 24HOP. One of my presentations at the Summit this year is Identifying and Fixing Performance Problems Using Execution Plans. It covers pretty much what it says, methods for fixing performance problems by exploring the information available within execution plans. But, how do you know you have a performance problem? That’s where my preview session comes in. Identifying Costly Queries will show you several ways to gather metrics on your system so that you can understand which queries are causing you the most pain. Once you know which queries need tuning, you can use execution plans to tune them. Whether you’ll be attending the PASS Summit or not, and whether or not you’ll go to my session once you’re there, I think this 24HOP session will be useful to help you understand where the pain points are within your own systems. I hope you’ll attend.

More importantly though, check out all the other great sessions. This is an excellent collection of presenters and presentations. For anyone who has ever said “PASS doesn’t do anything for me,” I want you especially to take a look at the amazing training opportunities being offered by PASS, for free. The volunteers that run PASS do amazing things and this is just one of them. Take advantage of this opportunity and, hopefully, recognize that PASS is doing things for you. This just barely scratches the surface of all that PASS offers.

Permalink 3 Comments

Powershell Remoting with SQL Server

July 27, 2010 at 8:00 am (PowerShell, SQL Server 2008, SQLServerPedia Syndication) (PowerShell, remote management, remoting)

One of the best things to come out with Powershell V2 is remoting and asynchronous calls. Between the two of these, you can basically send commands simultaneously to a number of SQL Server instances. BUT… and there always seems to be one of those, there is a lot of work required to get this running right. I’m going to outline what I did recently to test the ability of PowerShell to begin administering my servers remotely. Hopefully this provide the basis for a checklist and a how-to. I’ll update this post over time so that I get things right.

Enable remoting on the machines you wish to call

This requires admin privileges, but it’s pretty simple unless you need to modify which ports are available, etc. But to get it going the easiest way:

Enable-PSRemoting

You’ll get a warning outlining what this will do to the system and asking if you want to continue:

Running command “Set-WSManQuickConfig” to enable this machine for remote management through WinRM service.
This includes:
    1. Starting or restarting (if already started) the WinRM service
    2. Setting the WinRM service type to auto start
    3. Creating a listener to accept requests on any IP address
    4. Enabling firewall exception for WS-Management traffic (for http only).

Do you want to continue?
[Y] Yes [A] Yes to All [N] No [L] No to All [S] Suspend [?] Help (default is “Y”):

There may be other prompts about the operations that are listed above (in fact, “will be” is a better way to phrase it). In general, work through the help files to understand what all that means.

Configure Sessions

We’re talking about working with SQL Server here, so you know that you need to load the snapin. Easy right. In fact, you probably already have it loading by default through your configuration. Ah, but here’s the issue. When you’re running remotely, the session on that other machine is not running under your context. So it’s not going to use your configuration at all. Instead you need to define a session. There are several ways you could do this. Steven Muraski (blog|twitter) has a mechanism for forcing configurations to the remote machines. I haven’t tried this yet. I did the brute force approach.

First you create a script that holds the session information you want on each of the machines, in this case:

add-pssnapin SqlServerCmdletSnapin100

add-pssnapin SqlServerProviderSnapin100

That loads the necessary snapins for SQL Server. Once that’s done, you have to register this as a named configuration on the system:

[sourcode language=”powershell”]Register-PSSessionConfiguration -Name SQLShell -StartupScript C:\scripts\sql.ps1[/sourcecode]

Again, this is run on each machine that you wish to remote to. You can do this through remote sessions, funny enough. Once you’ve enabled remoting and set up the sessions, they’ll stay in place, as is, even after rebooting the system. This means you only have to do this once.

Remoting to SQL Server

You can create multiple sessions and then call them asynchronously, but instead, I tried a different tack. This is my full script:


$instances = @{"Server1"="Server1\InstanceA"; "Server2"="Server2\InstanceB"}

$job = Invoke-Command -ComputerName (Get-Content "c:\scripts\serverlist.txt") -ScriptBlock {param($rinstances) Invoke-Sqlcmd -ServerInstance $rinstances.$env:computername -Query "SELECT * FROM sys.dm_exec_requests WHERE session_id > 50"} -JobName tsql -ConfigurationName SqlShell -ArgumentList $instances

Wait-Job tsql

$results = Receive-Job -Name tsql
$results

Stop-Job -Name tsql
Remove-Job -Name tsql

I created a list of servers and put it in a file, serverlist.txt. Because I’m dealing with instances, I need a way to match the list of servers to the instances. I did this with the associative array (aka hash table), $instances. I’m calling Invoke-Command and passing it a list of computer names through the serverlist.txt file. I pass that a script block, more about that in a minute, and I set a JobName as tsql. This makes the Invoke-Command into a remote job, but in my case, a set of remote jobs. I pass it the the configuration we created earlier with -ConfigurationName.

The ScriptBlock is pretty straight forward from there. The one extra wrinkle that I had to get right, and thanks to Steven Muraski, who got me through this little roadblock as well as several others (all through Twitter by the way), is that you have to get your local parameter into the script block through the use of param and -ArgumentList. Then I just called it based on the server name I was running on (yes, what if I had more than one instance, I didn’t, OK?).

Then, because I wanted to see it and didn’t want to go through a process of collecting information as it finished, I just told it to wait on the job, by name. Then I collected the results, displayed them through the default and shut down the job and removed it.

So this is one, admittedly less than perfect, approach to getting remoting working with SQL Server in PowerShell. I’m going to spend more time on this to refine the process.

For a completely different view of the exact same problem, go check out what Aaron Nelson (blog|twitter) did. We were working on this at roughly the same time and exchanging notes. He helped me, I hope I helped him a bit.

Permalink 4 Comments

SQL University: Introduction to Indexes, Part the Third

July 23, 2010 at 5:00 am (SQL Server 2005, SQL Server 2008, SQLServerPedia Syndication, TSQL) (fragmentation, indexes, indexing, sql university, statistics, statistics update, TSQL)

Nice to see most of you have managed to fight your way through the shoggoths outside to attend another lecture at the Miskatonic branch of SQL University. This will be the third and final part of the introduction to indexes lecture. Please, if you’re going mad, step out into the hall. Our previous two lectures introduced the concept of indexes and then talked about two types of indexes, clustered and nonclustered. This lecture will cover the concept of statistics as they relate to indexes.

If you followed the previous lecture then you know that indexes are stored in a Balanced Tree or B-Tree structure. You know that this storage mechanism is intended to provide fast retrieval of data. But, how can the query engine inside SQL Server know which index has the information it wants? How can the query engine know that a given index will provide a mechanism to speed up the query that is being run? Statistics.

Statistics are information generated defining the selectivity of a column or columns in the database. They may or may not be part of an index. Columns alone can have statistics and indexes can have statistics. The query optimizer uses statistics to determine how it will access data. You may have the greatest index in the world, but if your statistics are wrong, that index won’t get used. By default statistics are automatically created and automatically updated. The automatic update of statistics is based on a formula that depends on changes in the data and changes in structure, as outlined in this Microsoft documentation. In general, it’s a very good practice to leave this alone because SQL Server does a great job of maintaining statistics.

But what are statistics really?

Statistics are basically a sampling of data stored in such a way as to make it possible for the optimizer to understand what the data looks like without actually accessing it. It is, absolutely, meta-data, data about data. The sampling of the data is, usually, a scan across the data, dropping in every certain number of rows as a percentage of the data in the table, but it can be a full scan, where it looks at all the data in the table. This meta data that is collected is broken down into several pieces. Statistics are represented through:

The histogram, which provides a picture of the data in a column
The header, which provides information about the statistics or meta-data about meta-data
Density information, which stores the distribution of the data across the columns being tracked by the statistics

In short, a lot of information. To start understanding what this all means, let’s see how to retrieve a set of statistics:

DBCC SHOW_STATISTICS ("Sales.SalesOrderDetail", IX_SalesOrderDetail_ProductID);

This query uses tables stored in the AdventureWorks2008R2 to display statistics information. Just a side note, so many DBCC functions have been replaced by dynamic management views that I honestly expected one for statistics too, but it’s not there. Running the query results in the following set of information:

The first result set at the top is the header. Most of the information you see there should make sense to you based on what we’ve been talking about. A couple of interesting points worth noting are the “Rows Sampled” column which shows how many of the rows were looked at while the stats were being built, Steps, which gives you an indication of the size of the histogram, and Density, which shows the selectivity of the statistics, but that column is not the one used by the optimizer. The main thing I use header information for is to see when the statistics were last updated.

The second set of data shows the density of the columns used to define the index. You can see that each column is measured as it gets added to the index. This shows you how a compound index makes the data more and more selective. If those numbers are at all big, as opposed to vanishingly small like those presented, you have a selectivity isssue with your index. The less selective the index is, the less likely it will be used.

Finally, the last data set is the histogram. What I have displayed is actually just the first seven of two hundred rows. A histogram is basically a small snapshot of the data in your index. Specifically, the data in the first column of your index. That’s worth knowing. If you choose to, for some crazy reason, build an index with a bit column as the leading edge (that means the first column) and you’re surpised that your index isn’t getting used, this is where you can go to understand why. The histogram is one of the main tools that SQL Server uses to determine which operations it will use to retrieve your data. Here’s how it works. First you get the RANGE_HI_KEY which shows the upper bound column value, the top of the range of rows, that the this entry in the histogram represents. Then you see the RANGE_ROWS that displays the number of rows within the range represented by the this entry in the histogram (except, for this entry, there is no range, it’s a single value). Next is the EQ_ROWS that tells you how many rows are equal to the value represented by the RANGE_HI_KEY. That number is large here because there is no range of rows, this value is represented by approximately 2838.166 rows inside the data. Following that is DISTINCT_RANGE_ROWS which shows you the distinct number of rows within each stepped range of values. In this case, again, because we’re not dealing with a range, that number is zero. Finally, you can see the AVG_RANGE_ROWS displaying the average number of duplicate values within the range.

Scrolling down within that set of data you can see what a range of values looks like:

In this case the range covers two values with the upper limit being 805 and the previous upper limit being 801.

In short, you can see what the optimizer sees and determine if you should have more up to date statistics. For example, if I query the table and retrieve ProductID values that are between 802 and 805, which would represent step 72 in the histogram, I get 246 rows, not 442, which is what I should see. Since the statistics are a little old on the table, they haven’t been updated since June 23rd, I’ll update them. You can use sp_updatestats, which will sample the data and generate a histogram with some guesses, or you can do this:

UPDATE STATISTICS Sales.SalesOrderDetail WITH FULLSCAN

Now when I run DBCC SHOW_STATISTICS, the header information shows that the rows sampled equal the rows in the table. My new histogram has almost exactly the same distribution, but the row counts are incredibly accurate. In my case the number of RANGE_ROWS is equal to 200 and the number of EQ_ROWS is equal to 46. Which equals the 246 rows that we got from querying the data directly. That represents a very accurate set of statistics.

So how do you know when to update your statistics? It’s really hard to say. If you start seeing different execution plans than you did previously, or if you note that the date on the statistics seems old, you might want to update stats. How do you determine if you should use a full scan or sampled statistics? Again, hard to say. In most circumstances a sampled set of statistics should work, but in some areas, if the data distribution is off, you may want to run a full scan.

There is clearly more and more to discuss about statistics and how to maintain them. There is a lot more to discuss about indexes. I didn’t even get to index fragmentation. Today you should remember:

Statistics are used to determine which indexes work well for a query
Statistics are composed of multiples sets of data, the header, density information, and the histogram
Histograms have a maximum of 200 steps
Histograms show the data distribution
Despite automatic maintenance, statistics can be out of date
The sampled update of statistics doesn’t always create the best set of data

For more information on statistics, I recommend Microsoft’s Books On Line for SQL Server. It’s a great resource.

But I hear the whipoorwills kicking up and I see a funny looking fog developing outside. It’s getting awfully dark for mid-day. One of the other professors stepped on some squealing white… thing… on his way into today and a one of the TA’s went stark raving mad at something they saw in the rectory. Time to end class. Best of luck on your journey back to your dorms. I’m locking myself in my office.

While I’m there. I might churn out one or two more entries on indexes. Look for them in the SQL University E-Book (no doubt to be published posthumously)

Permalink 4 Comments

SQL University: Introduction to Indexes, Part the Second

July 21, 2010 at 5:00 am (SQL Server 2005, SQL Server 2008, SQLServerPedia Syndication, TSQL) (index, indexes, indexing, SQL Server, sql university)

Welcome once more to the Miskatonic branch of SQL University. Please try to concentrate. I realize the whipoorwills singing outside the window in a coordinated fashion that sounds almost like laboured breathing can be distracting, but we’re talking about indexes here.

We left last class with a general idea what an index is, now it’s time for some specifics. There are several different kinds of indexes, as we talked about last class. But the two you’re probably going to work with the most are clustered, non-clustered. Each of these indexes is stored in a structure called a B-Tree, a balanced tree, not a binary tree. That’s a very important distinction.

A B-Tree is a double-linked list that is defined by the keys of the indexes on the top and intermediate pages, and at the leaf level by the data itself in the case of clustered indexes. Some of you no doubt think I’m quoting from De Vermis Mysteriis. Basically, for our purposes, a B-Tree consists of a series of pages. There is a top page, or root page, that defines the beginning of the index key. It points to a series of intermediate pages. Each intermediate page contains a range, a previous and a next value. These all point to each other, hence, double linked. The idea is that SQL Server can quickly identify which intermediate page has the pointers down to the leaf node, the final node in the stack. The values of these pointers are defined by the key of the index, the column or columns that you define when you create the index. There are always at least two levels, leaf & root, but there can be more, depending on the amount of data and the size of the keys. Just remember, the size of the key, which refers both to the data types in the key and the number of columns, determines how many key values can get on a page, the more key values on a page, the faster access will be, the fewer key values, the more pages that have to be read, and therefore, the slower the performance.

In general the purpose is to be able to quickly navigate to a leaf or set of leaf pages. When a B-Tree is used and the query engine is able to navigate quickly down to the leaf needed, that is an index seek. But when the B-Tree has to be moved through, in whole or in part, scanning for the values, you’re looking at an index scan. Obviously, in most cases, a seek will be faster than a scan becuase it’s going to be accessing fewer pages to get to the leaf needed to satsify the query. Just remember, that’s not always true.

Let’s get on to the indexes. It’s already been mentioned, but it bears repeating, the principle difference between a clustered and non-clustered index is what is at the leaf level. In a non-clustered index, it’s simply the key values and an values added through the use of the INCLUDE option along with a lookup value to either the clustered index key or an identifier within a table. In a clustered index, the data is stored down at the leaf. This is why people will frequently refer to a clustered index as being “better” than a non-clustered index, because you’re always going directly to the data when you’re looking information up within a clustered index. But, as with the scans vs. seek argument, this is not always true either.

I mentioned that a non-clustered index refers back to the clustered index, if there is one on the table. Because the data is stored at the leaf level of the clustered index, when you need to retreive other columns after performing a seek on a non-clustered index, you must go and get those columns from the clustered index. This is known as a key lookup, or in older parlance, a bookmark lookup. This operation is necessary when data not supplied by the non-clustered index, but can be very expensive because you’ve just added extra reads to your query.

What if there isn’t a clustered index on the table? What does the non-clustered index use to find other columns? If the table doesn’t have a clustered index, then that table is referred to as a heap. It’s called a heap because the data is simply stored in a pile, with no logical or physical ordering whatsoever. With a heap, SQL Server takes it on itself to identify the leaf level storage and creates a row id value for all the rows in the table. This row id can be used by the non-clustered index to find the data. That is referred to by the completely arcane and incomprehensible term, row id lookup. You might be thinking, hey, that means I don’t have to create a clustered index because SQL Server will create one for me. You’d be wrong. Maintaining the row id is an expensive operation and it doesn’t help in retrieving the data in an efficient manner. It’s just necessary for SQL Server to get the data back at all. In general, this is something to be avoided.

A non-clustered index doesn’t necessarily have to perform a lookup. If all the columns referred to in a query are stored within a non-clustered index, either as part of the key or as INCLUDE columns at the leaf, it’s possible to get what is called a “covering” query. This is a query where no lookup is needed. Indexes that can provide a covering query everything it needs are referred to as covering indexes. A covering query is frequently one of the fastest ways to get at data. This is because, again, depending on the size of the keys and any INCLUDE columns, a non-clustered index will have more information stored on the page than a clustered index will and so fewer pages will have to be read, making the operation faster.

By and large, a good guideline is to put a clustered index on all tables. SQL Server works extremely well with clustered indexes, and it provides you with a good access mechanism to your data. If you don’t put a clustered index on the table, SQL Server will create and maintain a row ID anyway, but as I said before, this doesn’t save much work on the server and it doesn’t provide you with any performance enhancement.

That’s a basic introduction to the three concepts of the clustered index, the non-clustered index and the heap. The points I’d like you to remember are:

Indexes are stored in Balanced Trees
Balanced Trees have, generally, three levels, root page, intermediate page, and leaf page
In clustered indexes, data is stored at the leaf page
In non-clustered indexes, a pointer is maintained back to the clustered index or the row id
A heap is a table without a clustered index

Remember those things and you can really begin to dig down on how indexes work. Understanding how they work will assist you in designing them for your database and your queries.

Next class we’ll go over statistics.

I wouldn’t walk back to your dorm by way of the shore. I’ve seen some rather odd looking people near the docks lately that didn’t give me a good feeling. See you next time… maybe.

Permalink 3 Comments

Home of the Scary DBA

Switching Off Parameter Sniffing

Powershell is Really Easy… If you know what you’re doing

How to Tell if Execution Plans are Reused

Parameter Name Size And Performance

Recursive Hiearchies in Reporting Services

XML as a Source

Recursive Hiearchies

Drill Down & Aggregates

Conclusion

Viva Las Vegas!

24 Hours of PASS: Summit Preview

Powershell Remoting with SQL Server

Enable remoting on the machines you wish to call

Configure Sessions

Remoting to SQL Server

SQL University: Introduction to Indexes, Part the Third

SQL University: Introduction to Indexes, Part the Second

RSS Feed

About

Blogroll

Books

SCOM

Special Awards

SQL Server

Twits, Tweets, Whatever

Pages

Archives

Meta