Michael J. Swart

February 26, 2015

When Parameter Sniffing Caused Deadlocks

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 10:06 am

Last week I was asked to troubleshoot some deadlocks in production. I was surprised to find out that parameter sniffing was one of the causes. I describe the investigation below.

Parameter Sniffing

SQL Server does this neat trick when you give it a query with parameters. The query optimizer will take the parameter values into account when making cardinality estimates. It finds the best query plan it can for these values. This is called parameter sniffing.

But parameter sniffing combined with query plan caching means that SQL Server seems to only care about parameter values the first time it sees a query. So ideally the first parameter values should be typical parameter values.
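Here’s a minimal sketch of that behaviour (the procedure and table names are hypothetical, not from this post): whichever parameter value arrives first is the one that gets sniffed, and every later call reuses the plan built for it.

-- Hypothetical procedure and table: the first execution determines the cached plan.
create procedure dbo.s_GetOrdersByCustomer
  @CustomerId bigint
as
  select OrderId, OrderDate
  from dbo.Orders
  where CustomerId = @CustomerId;
go
 
exec dbo.s_GetOrdersByCustomer @CustomerId = 42; -- plan compiled and cached using the value 42
exec dbo.s_GetOrdersByCustomer @CustomerId = 7;  -- reuses the plan that was optimized for 42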

When people talk about parameter sniffing problems, it’s usually because this assumption doesn’t hold. Either the query was compiled with atypical values, or maybe the data has an uneven distribution (meaning that “typical” parameter values don’t exist).

But normally:

Does this smell funny to you?

The Problem

The problem I saw in production involved some fairly typical looking tables. They looked something like this:

CollectionERD

This is how they were defined.

create table dbo.Collections
(
  CollectionId bigint identity primary key, 
  Name nvarchar(20) default ('--unnamed--') not null,
  Extrastuff char(100) not null default ('')
);
 
create table dbo.CollectionItems
(
  CollectionItemId bigint identity primary key,
  CollectionId bigint not null references dbo.Collections(CollectionId),
  Name nvarchar(20) default ('--unnamed--') not null,
  ExtraStuff char(100) not null default ('')
);
 
create index ix_CollectionItems
on dbo.CollectionItems(CollectionId);

The errors we were getting were deadlock errors and the deadlock graphs we collected were always the same. They looked something like this:

CollectionsDeadlockGraph

See that procedure called s_CopyCollection? It was defined like this:

create procedure dbo.s_CopyCollection 
  @CollectionId bigint
as
 
  set nocount on;
 
  declare @NewCollectionId bigint;
 
  if @CollectionId = 0
     return;
 
  if not exists (select 1 from dbo.Collections where CollectionId = @CollectionId)
     return;
 
  set xact_abort on;
  begin tran;
 
    insert dbo.Collections (Name, Extrastuff)
    select Name, ExtraStuff
    from dbo.Collections
    where CollectionId = @CollectionId;
 
    set @NewCollectionId = SCOPE_IDENTITY();
 
    insert dbo.CollectionItems (CollectionId, Name, ExtraStuff)
    select @NewCollectionId, Name, ExtraStuff
    from dbo.CollectionItems
    where CollectionId = @CollectionId;
 
  commit;

It’s a pretty standard copy procedure, right? Notice that this procedure exits early if @CollectionId = 0. That’s because 0 is used to indicate that the collection is in the “recycle bin”. And in practice, there can be many recycled collections.

Some Digging

I began by reproducing the problem on my local machine. I used this method to generate concurrent activity. But I couldn’t reproduce it! The procedure performed well and had no concurrency issues at all.

This meant more digging. I looked at the procedure’s behavior in production and saw that it was performing abysmally. So I grabbed the query plan from prod and here’s what that second insert statement looked like:

CollectionsBadPlan

This statement inserts into CollectionItems but it was scanning Collections. That was a little confusing. I knew that the insert needed to check for the existence of a row in Collections in order to enforce the foreign key, but I didn’t think it had to scan the whole table. Compare that to what I was seeing on my local database:

CollectionGoodPlan

I looked at the compilation parameters (SQL Sentry Plan Explorer makes this easy) of the plan seen in production and saw that the plan was compiled with @CollectionId = 0. In this case, the assumption about parameter sniffing I mentioned earlier (that the compilation parameters should be typical parameters) did not hold.
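If you don’t have Plan Explorer handy, the compilation parameters are also recorded in the cached plan’s XML. Here’s a rough sketch of one way to pull them out (my own query, not from the original post; it assumes the procedure’s plan is still in cache):

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
  OBJECT_NAME(st.objectid, st.dbid) AS procedure_name,
  p.n.value('@Column', 'nvarchar(128)') AS parameter_name,
  p.n.value('@ParameterCompiledValue', 'nvarchar(100)') AS compiled_value
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY qp.query_plan.nodes('//ParameterList/ColumnReference') AS p(n)
WHERE st.objectid = OBJECT_ID('dbo.s_CopyCollection');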

This procedure was performing poorly in production (increasing the likelihood of overlapping execution times) but also, each one was taking shared locks on the whole Collections table right after having inserted into it. The whole procedure uses an explicit transaction and that’s a recipe for deadlocks.

Doing Something About It

Here are things I considered and some actions I took. My main goal was to avoid the bad plan shown above.

  • Never call the procedure with @CollectionId = 0. The early-exit in the procedure was not enough to avoid bad query plans. If the procedure never gets called with @CollectionId = 0, then SQL Server can never sniff the value 0.
  • I began to consider query hints. Normally I avoid them because I don’t like telling SQL Server “I know better than you”. But in this case I did. So I began to consider hints like: OPTIMIZE FOR (@CollectionId UNKNOWN).
  • I asked some experts. I know Paul White and Aaron Bertrand like to hang out at SQLPerformance. So I asked my question there. It’s a good site which is slightly better than dba.stackexchange when you want to ask about query plans.
  • Aaron Bertrand recommended OPTION (RECOMPILE). A fine option. I didn’t really mind the impact of the recompiles, but I like keeping query plans in cache when I can, just for reporting purposes (I can’t wait for the upcoming Query Store feature)
  • Paul White recommended a LOOP JOIN hint on the insert query. That makes the INSERT query look like this:
       insert dbo.CollectionItems (CollectionId, Name, ExtraStuff)
        select 432, Name, ExtraStuff
        from dbo.CollectionItems
        where CollectionId = 21
        option (loop join);

    That was something new for me. I thought LOOP JOIN hints were only join hints, not query hints.

  • Paul White also mentioned some other options: a FAST 1 hint or a plan guide, and he also suggested OPTION (RECOMPILE).

So I stopped calling the procedure with @CollectionId = 0 and I used a query hint to guarantee the better plan shown above. The performance improved and the procedure was no longer vulnerable to inconsistent performance due to parameter sniffing. 

 In general, there seem to be only two ways to avoid deadlocks. The first way minimizes the chance that two queries are executing at the same time. The second way carefully coordinates the locks that are taken and the order they’re taken in. Most deadlock solutions boil down to one of these methods. I was happy with this solution because it did both.

January 23, 2015

Designing Indexed Views for OLTP Workloads

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 8:00 am

When I look at indexed views defined on OLTP databases, I’m encouraged when their join diagrams resemble snowflake schemas.

It must be nice in Vermont this time of the year.

When you create an indexed view, SQL Server will enforce a number of restrictions. These restrictions ensure that your views are deterministic and easy to maintain. The restrictions are more than a recommendation; SQL Server simply won’t let you create the index if your view doesn’t meet those criteria.

Indexed Views Can Sometimes Cause Poor Performance

I once thought that if I followed Microsoft’s prerequisites for indexed views, then the maintenance of those indexed views was guaranteed to always be safe. I thought the restrictions would guarantee performance comparable to the maintenance of a regular index. But I was wrong; sometimes it can be much worse. Let’s look at an example I invented for this post. Check out the following UPDATE statement. SQL Server reports eighteen logical reads:

use AdventureWorks2012
 
SET STATISTICS IO ON
 
-- An update of the Product table (no indexed views defined)
UPDATE Production.Product
SET Color = 'Midnight' 
WHERE Color = 'Black';
 
-- Table 'Product'. Scan count 1, logical reads 18

But when I create this indexed view:

CREATE VIEW dbo.v_AggregateQuantityByColor WITH SCHEMABINDING
AS 
 
SELECT 
  p.Color,
  SUM(th.Quantity) AS [total quantity],
  COUNT_BIG(*) AS [transaction count]
FROM Production.TransactionHistory th
JOIN Production.Product p
  ON th.ProductID = p.ProductID
GROUP BY p.Color;
 
GO
 
CREATE UNIQUE CLUSTERED INDEX ix_v_AggregateQuantityByColor
  ON dbo.v_AggregateQuantityByColor (Color)
 
GO

Then the same UPDATE statement becomes significantly more expensive requiring over 1000 reads:

use AdventureWorks2012
 
SET STATISTICS IO ON
 
-- An update of the Product table (maintenance of an indexed view is required)
UPDATE Production.Product
SET Color = 'Midnight' 
WHERE Color = 'Black';
 
/*
Table 'v_AggregateQuantityByColor'. Scan count 1, logical reads 6, 
Table 'Workfile'.                   Scan count 0, logical reads 0, 
Table 'Worktable'.                  Scan count 2, logical reads 377, 
Table 'TransactionHistory'.         Scan count 1, logical reads 797, 
Table 'Product'.                    Scan count 1, logical reads 18, 
*/

You can see the extra work caused by the indexed view in the query plan:

Indexed View Maintenance


The maintenance cost for this UPDATE statement got significantly worse. If statements like this are executed frequently it could be disastrous. That’s one of the reasons that Microsoft promotes indexed views as ideal for read-heavy scenarios such as those seen in data warehousing.

But I think that indexed views still have a place in OLTP systems. It’s just that extra care must be taken so that no SQL statement causes indexed view maintenance to be significantly worse than the regular table index maintenance. I want to talk about some things I look for when I evaluate views meant for OLTP databases.

Look For A Join Diagram Like a Snowflake Schema

Look at your view’s select statement. Specifically focus on the tables in the FROM clause and draw a “join diagram” for yourself. I’ve got a shortcut for that work. I start by running a query like this:

SELECT * FROM dbo.v_AggregateQuantityByColor OPTION (EXPAND VIEWS);

This gets me the query plan for the statement. I open the query plan in SQL Sentry’s Plan Explorer which has a handy dandy Join Diagram tab.

Using your join diagram, ask yourself these questions:

  • Does the join diagram look like a snowflake schema with one “fact” table?
  • Do the joins correspond to defined indexed foreign keys?
  • Are the columns included in the “dimension” tables modified infrequently?

Whenever I’ve dealt with poor performance caused by indexed views, these views have always given a “no” to at least one of these questions.

Example

Let’s apply these questions to dbo.v_AggregateQuantityByColor from my example. Here’s the join diagram:

Join Diagram

This diagram does in fact look like a snowflake schema with the TransactionHistory table acting as the fact table and the Product table acting as the dimension table. The one join follows an actual foreign key FK_TransactionHistory_Product_ProductID. And this foreign key is indexed (IX_TransactionHistory_ProductID).
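If you’d rather not eyeball the diagram for the foreign key question, a catalog-view query gets you most of the way there. This is just a sketch (my own, handling single-column foreign keys and leading index columns only), not something from the original post:

SELECT
  fk.name AS foreign_key,
  COL_NAME(fkc.parent_object_id, fkc.parent_column_id) AS fk_column,
  i.name AS index_with_matching_leading_column
FROM sys.foreign_keys fk
JOIN sys.foreign_key_columns fkc
  ON fkc.constraint_object_id = fk.object_id
LEFT JOIN sys.index_columns ic
  ON ic.object_id = fkc.parent_object_id
  AND ic.column_id = fkc.parent_column_id
  AND ic.key_ordinal = 1
LEFT JOIN sys.indexes i
  ON i.object_id = ic.object_id
  AND i.index_id = ic.index_id
WHERE fk.parent_object_id = OBJECT_ID('Production.TransactionHistory');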

Now let’s answer the last question: “Are the columns included in the dimension tables modified infrequently?”. In the context of this question, that’s the Color column in the Product table. Now it’s impossible to actually tell how frequently colors get updated in the Product table because this is a hypothetical example. But it’s unlikely that any OLTP workload would update product colors that often. So let’s give the answer: “infrequently updated”.

So according to my criteria, this indexed view gets the green light. Even though it seems like it could be expensive to maintain, I don’t have any automatic objections with it because product colors are rarely updated.

FAQ

Q: Is it possible to ignore these rules and still create effective indexed views?
A: Yes!

Q: Is it possible to follow these rules and still create indexed views that cause performance problems?
A: Yes!

Q: If I follow these rules, can I skip any performance testing steps?
A: No!

Q: So why the heck am I reading this post?
A: Lots of reasons. At the very least, it is useful when you want to identify indexed views and testing scenarios that deserve extra scrutiny.

December 12, 2014

Obvious and Not-So-Obvious Writing Tips

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 10:54 am

Takeaway: I leave SQL Server behind this week and I give two tips for technical bloggers:

  1. An obvious tip: Practice a lot
  2. A not-so-obvious tip: Help your readers skip reading your article

First the obvious tip.

Practice in Volume

As far as tips go, practice makes perfect is kind of obvious, and ultimately a little disappointing. Just like “Eat right and exercise”, the phrase “Go practice more” is one of those things that is easier said than done.

I first heard about a Composition Derby when I read The Underachieving School by John Holt. John Holt was an English teacher and author and he describes the Composition Derby as a device he used to help kids practice writing. The kids in his English class get divided into teams and they are asked to write about anything they want (spelling and grammar don’t count). At the end of the competition, the team that has written the most words wins. That’s the only criterion: number of words. When kids don’t worry about making mistakes they feel free to practice more. And that frees them to improve faster.

But I think the volume of practice is the key here. I believe in Malcolm Gladwell’s 10,000 hours rule. The rule claims that it takes 10,000 hours to become an expert at something. I like the idea of the 10,000 hour rule, but the one thing I don’t like is that it gives a definite number. Eight hours of writing practice can yield results and 10,000 hours implies a finish line. For example, compare these two illustrations I drew. They both use the same reference photo but they’re spaced apart by about 1,000 hours of practice.
Two illustrations of an upset-looking E. F. Codd (Ted Codd)

It’s easy to compare illustrations when presented side by side. It’s not as easy to compare writing but feel confident that with practice, you’ll improve and your readers will notice.

Make Your Article Skippable

The second tip is a little counter-intuitive. Make it easy for your readers to skim your article or even skip reading your article altogether.

You have something important to write, and I get that. But when thinking about the reader-writer relationship, your article is all about your readers. Their need to read actually outweighs your need to write and ultimately your readers will decide what’s important. I’m notoriously bad at predicting whether a post of mine will be well received or not. And so I make my blog posts skippable. The readers who find what I write important will stick around.

Here are some methods I use that help readers stop reading. Consider using these methods in your own writing:

  • Topic sentence (which I frame as a takeaway). Condense your whole blog into a tweet-sized sentence. Give everything away as quickly and clearly as you can. Leave suspense-building for mystery writers. For example, if you only read SQL Server articles, you probably haven’t made it this far. You probably didn’t make it past the first sentence.
  • Organize your article into sections with headings that can stand alone as an outline. It improves skimmability.
  • In general, put a high value on your reader’s time. Make every word count in helping you say the one thing you want to say and don’t say anything else.

Now here’s the crazy part: when you make your article skippable, it actually has the opposite effect. These methods I use actually help readers stick around. Readers have a better mental roadmap of the content and they stay (see, you’ve stuck around this far!).

December 3, 2014

Materialized Views in SQL Server

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 9:28 am

What’s the difference between Oracle’s “materialized views” and SQL Server’s “indexed views”? They both persist the results of a query, but how are they different? Sometimes it’s difficult to tell.


I’m on the left (or am I?)

One difference is that SQL Server’s indexed views are always kept up to date. In SQL Server, if a view’s base tables are modified, then the view’s indexes are also kept up to date in the same atomic transaction.

Let’s take a look at Oracle now. Oracle provides something similar called a materialized view. If Oracle’s materialized views are created without the REFRESH FAST ON COMMIT option, then the materialized view is not modified when its base tables are. So that’s one major difference. While SQL Server’s indexed views are always kept current, Oracle’s materialized views can be static.

Static Materialized Views In SQL Server?

Yeah, we just call that a table. You can use a SELECT INTO statement and it’s pretty easy. In fact, for fun I wrote a procedure that does the work for you. Given the name of a view it can create or refresh a table:

/* This is a proof-of-concept and is written for illustration purposes, don't use this in production */
create procedure dbo.s_MaterializeView
  @viewName nvarchar(300),
  @yolo bit = 0 -- use @yolo = 1 to execute the SQL immediately
as
 
declare @persistedViewName nvarchar(300);
 
if not exists (select 1 from sys.views where object_id = object_id(@viewName))
  THROW 50000, N'That @viewName does not exist', 1;
 
select 
  @viewName = QUOTENAME(object_schema_name(object_id)) 
  + N'.'
  + QUOTENAME(object_name(object_id)),
  @persistedViewName = QUOTENAME(object_schema_name(object_id)) 
  + N'.'
  + QUOTENAME(N'persisted_' + object_name(object_id))
from sys.views
where object_id = object_id(@viewName);
 
set xact_abort on;
begin tran
  declare @sql nvarchar(2000);
  set @sql = N'
    IF OBJECT_ID(''' + @persistedViewName + N''') IS NOT NULL
      DROP TABLE ' + @persistedViewName + N';
 
    SELECT *
	INTO ' + @persistedViewName + N'
    FROM ' + @viewName + N';'
 
  if (@yolo = 1)
    exec sp_executesql @sql;  
  else 	
    print @sql;
commit

It can be used to generate SQL something like this:

    IF OBJECT_ID('[dbo].[persisted_vSomeView]') IS NOT NULL
      DROP TABLE [dbo].[persisted_vSomeView];
 
    SELECT *
	INTO [dbo].[persisted_vSomeView]
    FROM [dbo].[vSomeView];
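Calling the procedure looks something like this (the view name is just a stand-in). With the default @yolo = 0 it only prints the generated script; passing @yolo = 1 executes it immediately:

-- print the generated script without running it
exec dbo.s_MaterializeView @viewName = N'dbo.vSomeView';
 
-- actually (re)build the persisted table
exec dbo.s_MaterializeView @viewName = N'dbo.vSomeView', @yolo = 1;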

Are Such Static Materialized Views Useful?

Yes:

  • They can be used to get around all the constraints placed on regular indexed views. And if you’ve ever implemented indexed views, you understand that that’s a lot of constraints. I think this benefit is what makes this whole blog post worth consideration.
  • Because it’s static, you can avoid all the potential performance pitfalls that accompany the maintenance of an indexed view (more on this next week).
  • Good or bad, the view doesn’t have to be created with SCHEMABINDING.
  • Indexing is strictly do-it-yourself. Chances are you want more than a single heap of data for your materialized view.

… and no:

  • Most obviously, the data is static, which is another way of saying stale. But notice how Microsoft promotes indexed views. They say that indexed views are best suited for improving OLAP, data mining and other warehousing workloads. Such workloads can typically tolerate staleness better than OLTP workloads. And so maybe materialized views are a feasible alternative to indexed views.
  • You have to manage when these views get refreshed. This means scheduling jobs to do extra maintenance work (yuck). For me that’s a really high cost but it’s less costly if I can incorporate it as part of an ETL process.
  • Using Enterprise Edition, SQL Server’s query optimizer can choose to expand indexed views or not. It can’t do that with these materialized views.

I didn’t write the procedure for any important reason, I just wrote it because it was fun. But I have used this materialized view technique in SQL Server at work and I’ve been quite successful with it. It’s not something that should be used often, but it’s always worth considering if you can understand the trade-offs.

October 3, 2014

Watch Out for Misleading Behaviour From SQL Server

Takeaway: To get consistent behaviour from SQL Server, I share a set of statements I like to run when performing tuning experiments.

Inconsistent Behaviour From SQL Server?

I often have conversations where a colleague wants to understand why SQL Server performs faster in some cases and slower in other cases.

The conversation usually starts “Why does SQL Server perform faster when I…” (fill in the blank):

  1. … changed the join order of the query
  2. … added a transaction
  3. … updated statistics
  4. … added a comment
  5. … crossed my fingers
  6. … simply ran it again

What’s Going On?

It can actually seem like SQL Server performs differently based on its mood. Here are some reasons that can affect the duration of queries like the ones above:

  • You changed something insignificant in the query. What you may be doing is comparing the performance of a cached plan with a newly compiled plan. Examples 1 – 4 might fall under this scenario. If that’s the case, then you took a broken thing and gave it a good thump. This percussive maintenance may be good for broken jukeboxes, but maybe not for SQL Server.
  • What about those last two? Say you hit F5 to execute a query in Management Studio, and wait a minute for your results. You immediately hit F5 again and watch the same query take fifteen seconds. Then I like to point out that maybe all that data is now cached in memory.

In order to tune queries effectively, we need consistent behaviour from SQL Server, if only to test theories and be able to rely on the results. SQL Server doesn’t seem to want to give us consistent behaviour…

So Is It Possible To Get Straight Answers?

Best line from all Star Wars

But maybe we can get straight answers from SQL Server. Here’s a test framework that I like to use before all experiments when I want consistent behaviour:

/* Only do this on dev sql servers! */
CHECKPOINT 
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
SET STATISTICS IO, TIME ON
-- Ctrl+M in Management Studio to include actual query plan

The first two statements are meant to clear SQL Server’s cache of data. Because of write-ahead logging, SQL Server writes log records to disk immediately, but it may take its time writing data changes to disk. Executing CHECKPOINT makes SQL Server do that immediately. After the checkpoint there should be no dirty buffers. That’s why DBCC DROPCLEANBUFFERS will succeed in dropping all data from memory.

The DBCC FREEPROCCACHE command will remove all cached query plans.

These commands give SQL Server a fresh starting point. It makes it easier to compare behaviour of one query with the behaviour of another.
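If you want to convince yourself that the slate really is clean, a couple of quick DMV counts will confirm it. This is just a sanity-check sketch (dev servers only, and not part of the original framework):

-- both counts should be at or near zero right after running the statements above
SELECT COUNT(*) AS cached_data_pages FROM sys.dm_os_buffer_descriptors WHERE database_id = DB_ID();
SELECT COUNT(*) AS cached_plans FROM sys.dm_exec_cached_plans;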

The SET STATISTICS IO, TIME ON and the Ctrl+M are there in order to retrieve better information about the performance of the query. Often CPU time, Logical IO, and the actual query plan are more useful when tuning queries than elapsed time.

September 18, 2014

SQL Server Ignores Trailing Spaces In Identifiers

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 10:27 am

Takeaway: According to SQL Server, an identifier with trailing spaces is considered equivalent to the same identifier with those spaces removed. That was unexpected to me because that’s not how other programming languages work. My investigation was interesting and I describe that here.

The First Symptom

Here’s the setting: I work with a tool developed internally that reads metadata from a database (table names, column names, column types and that sort of thing). Recently the tool told me that a table had an unexpected definition. In this case, a column name had an extra trailing space. I expected the column name "Id" (2 characters), but my tool was reporting an actual value of "Id " (notice the blank at the end, 3 characters). That’s what started my investigation.

But that’s really weird. What would lead to a space accidentally getting tacked on to a column name? I couldn’t think of any reason. I also noticed a couple other things. Redgate SQL Compare was reporting no discrepancies and the database users weren’t complaining at all; they seemed just fine. A bug in the in-house tool seemed most likely. My hunch was that there was a problem with the way we were collecting or storing these column names (how did a space sneak in there?).

Where Are Column Names Stored?

I wanted to look at the real name of the column – straight from the source – so I ran:

SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME LIKE 'Id%'

It told me that my tool wasn’t wrong. That the column was actually named "Id " with the space. So maybe Red Gate is getting its metadata from somewhere else? I know of a few places to get column information. Maybe Red Gate is getting it from one of those? Specifically I wanted to look closer at these views:

  • sys.columns
  • sys.syscolumns
  • INFORMATION_SCHEMA.COLUMNS

Because these objects are views, I used sp_helptext to learn that all the column names ultimately come from a system table called sys.syscolpars. But sys.syscolpars is a system table and you can’t look at its contents unless you connect to the database using the dedicated administrator connection. And that’s exactly what I did.
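For the curious, here’s roughly what that looks like (a sketch, and the column names in sys.syscolpars are from memory). Connect with the dedicated administrator connection by prefixing the server name with admin: and then query the base table directly:

-- requires the DAC, e.g. connect to "admin:MyServer"
SELECT name, DATALENGTH(name) AS name_length_in_bytes
FROM sys.syscolpars
WHERE id = OBJECT_ID('dbo.SomeTable'); -- hypothetical table name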

I learned that there is only one version of column names, only one place on disk where SQL Server persists the name of the column. It’s interesting because this implies that Red Gate’s SQL Compare trims trailing spaces from identifier names.

But Doesn’t SQL Server Care?

Well, there’s one way to check:

CREATE TABLE [MyTest] ( [id ] INT );
INSERT INTO [MyTest] VALUES (1);
 
SELECT [id ], [id] -- one column name with a space, one column name without
FROM [MyTest]; 
-- returns a dataset with column names as specified in the query.
go

Just like Red Gate’s SQL Compare, it seems like SQL Server doesn’t care about trailing spaces in identifiers either.

Google? Stackoverflow? Want to Weigh In?

A quick search led me to the extremely relevant Is SQL Server Naming trailing space insensitive?.

And that question has answers which link to the Books Online page documenting Delimited Identifiers. That page claims that “SQL Server stores the name without the trailing spaces.” Hmmm, they either mean in memory, or the page is inaccurate. I just looked at the system tables a moment ago and the trailing spaces are definitely retained.

The stackoverflow question also led me to a reported defect, the Connect item Trailing space in column names. This item was closed as “by design”. So this behavior is deliberate.

What do other SQL Vendors do?

I want to do experiments on SQL databases from other vendors but my computer doesn’t have a large number of virtual machines or playground environments. But do you know who does? SQL Fiddle.
It’s very easy to use this site to see what different database vendors do. I just pick a vendor and I can try out any SQL I want. It took very little effort to be able to compile this table:

RDBMS        CREATE TABLE...                SELECT "id ", "id"...
MySQL        Incorrect column name 'id '    (table was not created)
Oracle       Success                        "id": invalid identifier
PostgreSQL   Success                        Column "id" does not exist
SQLite       Success                        could not prepare statement (1 no such column: id)
SQL Server   Success                        Succeeds, returning both columns (ID, ID) with values (1, 1)

And What Does the ANSI standard say?

Look at the variety of behaviors from each vendor. I wonder what the “standard” implementation should be.

Mmmm... SQL Syntax rules.

I googled “ANSI SQL 92” and found its Wikipedia page and that led me to the SQL-92 Standard itself.

ANSI (paraphrased) says that

<delimited identifier> ::= <double quote><one or more characters><double quote>

And it also says explicitly that delimited identifiers can include spaces.

What About String Comparisons In General?

During my experiments on SQL Server I found myself executing this query:

SELECT *
FROM sys.columns
WHERE name = 'Id'

I was surprised to find out that my three-character "Id " column came back in the results. This means that SQL Server ignores trailing spaces for all string comparisons, not just for identifiers.

I changed my google search and looked for “sql server string comparison trailing space”. This is where I found another super-relevant document from Microsoft: INF: How SQL Server Compares Strings with Trailing Spaces.

Microsoft pointed to the ANSI standard again. I mean they explained exactly where to look: they pointed straight to Section 8.2, General Rules #3, which is the section where ANSI explains how the comparison of two character strings is determined. The ANSI standard says that for string comparisons, the shorter string is effectively padded with trailing spaces so that comparisons can always be performed on strings with an equal number of characters. Why? I don’t know.
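Here’s a tiny illustration of that padding rule (my own example, not from the original post). LIKE and DATALENGTH don’t pad, which makes the trailing space visible even though the equals comparison ignores it:

SELECT
  CASE WHEN 'Id' = 'Id '    THEN 'equal' ELSE 'not equal' END AS equals_comparison, -- 'equal'
  CASE WHEN 'Id ' LIKE 'Id' THEN 'match' ELSE 'no match'  END AS like_comparison,   -- 'no match'
  DATALENGTH('Id ') AS bytes_with_trailing_space,    -- 3
  DATALENGTH('Id')  AS bytes_without_trailing_space; -- 2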

And that’s where identifier comparisons come in. I found another part of the standard (Syntax rule #11) which tells me that Identifiers are equivalent if they compare as equivalent according to regular string comparison rules. So that’s the link between string comparisons and identifier comparisons.

Summary

There are a number of things I learned about string comparisons. But does any of this matter? Hardly. No one deliberately chooses to name identifiers using trailing spaces. And I could have decided to sum this whole article up in a single tweet (see the title).

But did you figure out the head fake? This blog post is actually about investigation. The investigation is the interesting thing. This post describes the tools I like to use and how I use them to find things out for myself including:

  • Queries against SQL Server itself, the obvious authority on SQL Server behavior.
    • Made use of sp_helptext
    • Made use of the Dedicated Administrator Connection to look at system tables
  • Microsoft’s Books Online (used this twice!)
  • Microsoft Connect
  • Google
  • Stackoverflow
  • SQLFiddle
  • Wikipedia
  • the ANSI Standard

Maybe none of these resources are new or exciting. You’ve likely used many of these in the past. But that’s the point, you can find out about any topic in-depth by being a little curious and a little resourceful. I love to hear about investigation stories. Often how people find things can be at least as interesting as the actual lesson.

June 27, 2014

Trivia about Trivial Plans

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 12:43 pm

Takeaway: I found an example of a query plan which performs better than the “trivial” query plan.

This post is trivia in that it won’t help you do your job as a developer or DBA. But it’s interesting anyway. It offers a look into an interesting part of SQL Server’s query optimizer.

The Setup

I use the 2012 AdventureWorks database and I mess around with the indexes. It’s a setup that Kendra Little developed in order to demonstrate index intersection.

use AdventureWorks2012
GO
 
DROP INDEX Person.Person.IX_Person_LastName_FirstName_MiddleName;
GO
 
CREATE INDEX [IX_Person_FirstName_LastName] ON [Person].[Person] 
( FirstName, LastName ) WITH (ONLINE=ON);
GO
 
CREATE INDEX [IX_Person_MiddleName] ON [Person].[Person] 
( MiddleName ) WITH (ONLINE=ON);
GO

The Trivial Plan

In Management Studio, include the actual query plan and run this query:

SET STATISTICS IO ON
 
SELECT FirstName, MiddleName, LastName
FROM Person.Person
 
-- 19972 rows returned
-- 1 scan, 3820 logical reads
-- optimization level: TRIVIAL
-- estimated cost: 2.84673

With such a simple query – one against a single table with no filters – SQL Server will choose to scan the narrowest covering index and it won’t bother optimizing the plan any further. That’s what it means to say the optimization level is TRIVIAL.
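If you want to confirm the optimization level without opening the graphical plan, the showplan XML records it: the StmtSimple element has a StatementOptmLevel attribute of either TRIVIAL or FULL. A quick sketch (not from the original post):

SET STATISTICS XML ON;
 
SELECT FirstName, MiddleName, LastName
FROM Person.Person;
 
SET STATISTICS XML OFF;
-- in the returned showplan XML, look at the StatementOptmLevel attribute on StmtSimple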

For this query, the only index that contains all three columns is the clustered one. So it seems there’s no alternative but to scan it. That sounds reasonable, right? That’s what we see in the query plan; it looks like this:

ClusteredScan

But notice that SQL Server is doing a lot of reading with this plan choice. The table Person.Person has a column called Demographics. This XML column makes the table very wide, so wide that a typical page in Person.Person can only fit about 5 or 6 rows on average.

The Better-Than-Trivial Plan

Now look at this query:

SELECT FirstName, MiddleName, LastName
FROM Person.Person
WHERE FirstName LIKE '%'
 
-- 19972 rows returned
-- 2 scans, 139 logical reads
-- optimization level: FULL
-- estimated cost: 1.46198

The filter is put in place to have no logical effect.  It complicates things just enough so that SQL Server won’t use a trivial plan. SQL Server fully optimizes the query and the query plan now looks like this:

IndexIntersect

Notice that the plan has scans on two nonclustered indexes and a hash join. SQL Server figures (correctly) that scans of two narrow indexes plus a hash join are still cheaper than the single scan of the fat clustered index.

Careful

I don’t think I need to say this, but I do not recommend adding WHERE column like '%' anywhere except maybe in contrived examples for demo purposes.

(MJS — Enjoy the summer, See you in September!)

May 22, 2014

Enabling the New Cardinality Estimator in SQL Server 2014

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 8:33 am

Takeaway: SQL Server 2014 will make use of its newly re-written Cardinality Estimator when the database’s compatibility mode is at least 120. But there’s more to the story.

What’s a Cardinality Estimator (CE)?

Say you’ve been hired to phone everyone on a particular list. If it’s a list of all Americans taller than seven feet, you might manage quite well on your own. But if it’s a list of all Americans shorter than seven feet, you’ll probably need help from others. That’s not surprising because the sizes of the lists are wildly different. One list could have 300 people on it and the other could have 300 million. The expected sizes of the lists influence how you tackle this problem.

This was after phones, but before the do not call list.

SQL Server does the same thing. It uses statistics to find the best ways to execute queries. To find a good query plan, SQL Server often needs to make many choices (which join type, join order, parallelism etc…) It needs to estimate the cost of each choice and it uses educated guesses to evaluate these costs. That’s what the CE was built to do. It provides educated guesses about the number of rows a query plan has to process. That’s why it’s called the cardinality estimator. The accuracy of these estimates will influence the quality of query plans, and consequently, the performance of queries.

With SQL Server 2014, Microsoft released a rewritten version of SQL Server’s CE. I can’t wait to take advantage of it. I’m looking forward to tuning fewer poorly performing queries. Queries that seem to be written well, but are vulnerable to bad query plans.

Risk of Regressions

The CE is part of the query optimizer, so the rewrite represents a significant change to the database engine. And with any pervasive change, there’s always a risk of regressions. While rare, some workloads are expected to perform worse with the new CE. Joe Sack’s excellent white paper Optimizing Your Query Plans with the SQL Server 2014 Cardinality Estimator has some essential tips and suggestions on how to assess and deal with these potential regressions.

Some users may want to continue using the legacy CE. And some users may want to decouple the adoption of the new CE from the adoption of SQL Server 2014. Microsoft anticipated this and so they give DBAs a choice. DBAs have the option to either use the new CE or to stick with the legacy CE.

Enabling the New CE – the Official Details

Simply put, CE behavior can be controlled using the compatibility mode and/or trace flags:

  • The new CE is enabled when compatibility mode is 120 and disabled when it is less than that. The compatibility mode of a database is not modified automatically during an upgrade to 2014, so remember to adjust it accordingly.
  • New trace flags are introduced. Trace flag 9481 can force SQL Server to use the legacy CE when it would otherwise use the new one. Conversely, trace flag 2312 can force SQL Server to use the new CE. And if flags 9481 and 2312 are ever both enabled (in any context), then neither flag takes effect. They cancel each other out and the CE behavior is determined only by the compatibility mode.

Just those two things allow you to influence the CE behavior depending on the granularity you require:

  • For a single query – You could use the QUERYTRACEON hint but it’s not a tempting option. Sysadmin privileges or a forced plan are required.
  • Based on your session – Use session trace flags (again, sysadmin privileges are required).
  • Based on the database you’re connected to – Use compatibility mode.
  • For the whole server – Use global trace flags.

Again, Joe Sack’s white paper explains this in more detail. He provides syntax examples and methods to determine which CE was used based on a query plan.
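One of those methods uses the plan XML: each statement records a CardinalityEstimationModelVersion attribute (70 for the legacy CE, 120 for the new one). Here’s a rough sketch of pulling that value from the plan cache (my own query, not from the white paper):

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20)
  qp.query_plan.value('(//StmtSimple/@CardinalityEstimationModelVersion)[1]', 'varchar(10)') AS ce_version,
  st.text AS query_text
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
WHERE qp.query_plan IS NOT NULL;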

Corner Use Cases

This leads to some surprising behaviors:

Connect to a System Database to Avoid Compatibility Mode issues

For example, this works:

use master -- in SQL Server 2014, master will always be at compatibility mode 120
GO
 
-- any query (regardless of participating tables) will now use the new CE. e.g.:
SELECT COUNT(*) 
FROM Adventureworks2012.Sales.SalesOrderHeader;

But it’s just a trick and not a technique I would recommend. Besides, this trick doesn’t work when calling stored procedures from other databases.

Using a Trace Flag to Cancel Another One

Trace flags 2312 and 9481 don’t play together well. There is no scenario where one takes precedence over the other. If they’re both enabled, then they cancel each other out:

use Adventureworks2012 -- at compatibility mode 110
GO
DBCC TRACEON( 9481 );
 
SELECT COUNT(*) 
FROM Sales.SalesOrderHeader
OPTION( QUERYTRACEON 2312 ); -- 2312 normally enables the new CE 
-- the 2312 hint is canceled by the 9481 trace flag, the legacy CE is still used.

Again, I avoid this scenario so that I don’t need to worry.

How I Plan To Adopt the New CE

I’d like to begin using the new CE as soon as I upgrade to 2014.

But if I wanted to, I would feel comfortable using compatibility mode as a feature toggle for the new CE. There are other behavior differences between compatibility modes 110 and 120. But I don’t use them and won’t encounter them. They’re obscure and easy to review. So for me, I can ignore those other features and use compatibility mode 120 as the CE feature toggle.

The trace flags 2312 and 9481 are new in SQL Server 2014. So if SQL Server is not at version 2014, it will ignore those trace flags. I intend to do the same no matter what version I’m using. I don’t expect to see many queries showing serious regressions with the new CE, but if I encounter any I’m not going to manage them with these trace flags. Instead, I plan to:

  1. Use hints (whether that means index hints, join hints or query hints) to stabilize the plan temporarily.
  2. Spend time tuning or rewriting the query so that it performs well without these hints.


April 23, 2014

Removing Comments from SQL

Filed under: Miscelleaneous SQL,SQL Scripts,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 10:20 am

Check out the following deliberately crazy SQL Script:

create table [/*] /* 
  -- huh? */
(
    "--
     --" integer identity, -- /*
    [*/] varchar(20) /* -- */
         default '*/ /* -- */' /* /* /* */ */ */
); 
go

It’s not surprising that my blog’s syntax colorer has trouble with this statement. But SQL Server will run this statement without complaining. Management Studio doesn’t even show any red squiggly lines anywhere. The same statement without comments looks like this:

create table [/*] 
(
    "--
     --" integer identity, 
    [*/] varchar(20) 
         default '*/ /* -- */' 
); 
go

I want a program to remove comments from any valid SQL and I want it to handle even this crazy example. I describe a handy method that lets me do that.

Using C#

  • In your C# project, find and add a reference to Microsoft.SqlServer.TransactSql.ScriptDom. It’s available with SQL Server 2012’s Feature Pack (search for “ScriptDom” and download).
  • Add using Microsoft.SqlServer.TransactSql.ScriptDom; to your “usings”.
  • Then add this method to your class:
    public string StripCommentsFromSQL( string SQL ) {
     
        TSql110Parser parser = new TSql110Parser( true );
        IList<ParseError> errors;
        var fragments = parser.Parse( new System.IO.StringReader( SQL ), out errors );
     
        // clear comments
        string result = string.Join ( 
          string.Empty,
          fragments.ScriptTokenStream
              .Where( x => x.TokenType != TSqlTokenType.MultilineComment )
              .Where( x => x.TokenType != TSqlTokenType.SingleLineComment )
              .Select( x => x.Text ) );
     
        return result;
     
    }

… and profit! This method works as well as I hoped, even on the given SQL example.

Why I Prefer This Method

A number of reasons. By using Microsoft’s own parser, I don’t have to worry about comments in strings, or strings in comments which are problems with most T-SQL-only solutions. I also don’t have to worry about nested multiline comments which can be a problem with regex solutions.

Did you know that there’s another sql parsing library by Microsoft? It’s found at Microsoft.SqlServer.Management.SqlParser.Parser. This was the old way of doing things and it’s not supported very well. I believe this library is mostly intended for use by features like Management Studio’s Intellisense. The ScriptDom library is better supported and it’s easier to code with.

Let Me Know If You Found This Useful

Add comments below. Be warned though, if you’re a spammer, I will quickly remove your comments. I’ve had practice.

April 11, 2014

Implementing the Recycle Bin Pattern In SQL

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 8:00 am

Kitchener Ontario, recycling since 1983

I participated in a week-long hackathon recently. It was great to be able to spend the whole week on a self-directed project. I’m excited to write about what my team accomplished, but actually I want to blog about what another team accomplished. That team implemented a really nice “send to recycle bin” feature and they gave me the green light to write about it here.

The recycle bin feature is ultimately a data-hiding feature. Users don’t necessarily want to destroy data, they just don’t want to look at it right now. There are a lot of ways to implement this feature, but one way is by making a few changes in the database (as opposed to the application).

What Needs To Change?

Surprisingly not much. Take your table and give it a nullable RecycleDate column. This is all you need to track the recycled rows. Then create a view that filters out recycled items. That’s pretty much it. Afterwards, if you rename the table, then the view can take its place. This is what that would look like on AdventureWorks’ Sales.ShoppingCartItem table:

ALTER TABLE Sales.ShoppingCartItem
  ADD RecycleDate DATE NULL
    CONSTRAINT DF_ShoppingCartItem_RecycleDate DEFAULT NULL;
 
GO
 
EXEC sp_rename 'Sales.ShoppingCartItem', 'AllShoppingCartItems'
 
GO
 
CREATE VIEW Sales.ShoppingCartItem
WITH SCHEMABINDING
AS
    SELECT  ShoppingCartItemID ,
            ShoppingCartID ,
            Quantity ,
            ProductID ,
            DateCreated ,
            RecycleDate
    FROM    Sales.AllShoppingCartItems
    WHERE   RecycleDate IS NULL;
 
GO
 
CREATE PROCEDURE Sales.s_RecycleShoppingCartItem
    (
      @ShoppingCartItemId INT
    )
AS 
    UPDATE  Sales.ShoppingCartItem
    SET     RecycleDate = GETDATE()
    WHERE   ShoppingCartItemID = @ShoppingCartItemId;
 
GO

DML Impact

So what’s the impact on other Delete, Insert, Update or Select statements that are executed against your modified table?

  • Delete statements shouldn’t be affected. You’ll notice that recycle bin contents can’t be deleted via the view. That’s okay.
  • Old Insert statements should work as expected with no adjustments, especially if you name your columns in a column list.
  • Update statements? Check, they’ll continue to work.
  • Select statements will also be unaffected. Especially if you’ve avoided SELECT *.

What About Foreign Keys?

Okay, this is where it gets a little tricky. If you don’t use ON DELETE or ON UPDATE clauses with your foreign keys, then you have to be a little careful. I want to show just one example of how things can get a bit messy. Returning to our AdventureWorks example, let’s think about a query that deletes a “shopping cart” as long as it has no items.

DELETE Sales.ShoppingCart
WHERE ShoppingCartId = @ShoppingCartIdToDelete
AND NOT EXISTS
  (
    -- any items in the cart?
    SELECT 1
    FROM Sales.ShoppingCartItem
    WHERE ShoppingCartId = @ShoppingCartIdToDelete
  )

In the old world, this worked no problem. But our check for items in the cart misses items that have been recycled, and so this query would fail. You’ll have to remember to find queries like this and update them to check Sales.AllShoppingCartItems instead.
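The adjusted query checks the renamed base table instead of the view. Something like this (a sketch, following the rename shown earlier):

DELETE Sales.ShoppingCart
WHERE ShoppingCartId = @ShoppingCartIdToDelete
AND NOT EXISTS
  (
    -- any items in the cart, recycled or not?
    SELECT 1
    FROM Sales.AllShoppingCartItems
    WHERE ShoppingCartId = @ShoppingCartIdToDelete
  );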

Data Lifecycle Policy Concerns

You have a policy right? The lack of one can make it too easy to retain data indefinitely. The concern isn’t necessarily storage, but whether you’re meeting any policies or regulations concerning privacy or other things like that.

The recycle bin feature may make it a little easier to accidentally retain data you didn’t mean to. It may be worth regression testing any delete or purge functionality that you have.

Indexing

Depending on how much data is hidden in the recycle bin, you shouldn’t have to re-evaluate your indexing strategy. Your indexes should probably serve you just as well after this implementation. But if you find yourself storing more than 90% of your data as recycled data, then you may want to start considering re-assessing the table’s indexes. You could consider things like filtered indexes, filtered stats and/or partitioned tables. But before you do, see Data Lifecycle Policy Concerns above.
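For example, a filtered index that only covers the rows users can still see might look something like this (a sketch against the renamed AdventureWorks table from earlier, not a blanket recommendation):

CREATE INDEX IX_AllShoppingCartItems_NotRecycled
ON Sales.AllShoppingCartItems (ShoppingCartID)
WHERE RecycleDate IS NULL;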

Other Things To Watch

Any changes to schema or any code should lead to extra testing and the changes I’m proposing are no different.

You have to know your app and environment. Is your recycle bin against a table that participates in downstream Business Intelligence projects? How about Change-Data-Capture? Service Broker? Notification Services? You know better than I do.

Other Recycle Bin Implementations

There are lots of methods.

For example, you don’t have to implement this pattern using SQL. You can implement it in your application. Hiding recycled data via the application makes a lot of sense. Especially if you’re more of a programmer than a SQL developer (By the way, where’d you come from? Who let you in here?)

It’s worth giving this some thought. Without a recycle bin, the demand to retrieve “deleted” data can be great enough to prompt someone to dig through a restored backup. Digging through restored backups actually counts as a recycle bin implementation even if it is an unintentional and painful one.
