Michael J. Swart

August 15, 2013

A Tuning Technique For Parallel Queries

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 12:00 pm

Takeaway: I describe a technique I use to tune parallel queries. I tune the serial version of the query using MAXDOP 1. It’s a simpler query plan. Faster understanding leads to faster solutions.

In a parallel universe, I'm blogging about cartography and projection mapping.

How better to explain than by example!

The Example

For this example, I use AdventureWorks2012 and I beef up the sales order tables using Jonathan Kehayias’s script Enlarging the AdventureWorks Sample Databases. For my example, I want to know which salesperson sold the most red items in their last ten orders (looking at orders with at least one red item).

My first crack at the query gives me this:

SELECT P.FirstName, P.LastName, SUM(OrderQty) Items
FROM Person.Person P
CROSS APPLY (
    SELECT TOP 10 SUM(OrderQty) as OrderQty, H.SalesOrderID
    FROM Sales.SalesOrderDetailEnlarged D
    JOIN Sales.SalesOrderHeaderEnlarged H
        ON D.SalesOrderID = H.SalesOrderID
    JOIN Production.Product Pr
        ON D.ProductID = Pr.ProductID
    WHERE H.SalesPersonID = P.BusinessEntityID
        AND Pr.Color = 'red'
    GROUP BY H.SalesOrderID, H.ShipDate
    ORDER BY H.ShipDate DESC
) AS LastTenOrdersWithRedItems
GROUP BY P.BusinessEntityID, P.FirstName, P.LastName
ORDER BY SUM(OrderQty) DESC

Great! That query gives me the Adventureworks salespeople and the number of red items they’ve sold in their last ten orders. But the execution time was a bit slow and so I decide to tune the query. The query plan looks like this:
The query plan for the previous query

There are a couple of things that jump out at me right away: some of the thick lines tell me that there’s a lot of data getting processed. There’s also a warning that tells me tempdb was used for a sorting operation. And in general, I notice that there are a lot of operators here, including parallelism operators.

So now I’m going to try to look at a serial query plan for the same query. I have to tell SQL Server that the maximum degree of parallelism it should consider is one thread. That’s done with the MAXDOP query hint like this:

SELECT P.FirstName, P.LastName, SUM(OrderQty) Items
FROM Person.Person P
CROSS APPLY (
    SELECT TOP 10 SUM(OrderQty) as OrderQty, H.SalesOrderID
    FROM Sales.SalesOrderDetailEnlarged D
    JOIN Sales.SalesOrderHeaderEnlarged H
        ON D.SalesOrderID = H.SalesOrderID
    JOIN Production.Product Pr
         ON D.ProductID = Pr.ProductID
    WHERE H.SalesPersonID = P.BusinessEntityID
        AND Pr.Color = 'red'
    GROUP BY H.SalesOrderID, H.ShipDate
    ORDER BY H.ShipDate DESC
) AS LastTenOrders
GROUP BY P.BusinessEntityID, P.FirstName, P.LastName
ORDER BY SUM(OrderQty) DESC
OPTION (MAXDOP 1) -- this line has been added

The hint is in there, but it’s not meant to be a permanent thing. I put it in there in order to look at the serial version of the query plan. Sure enough, it looks simpler and it’s easier to examine:

The query plan for the previous query

The information is still the same, but to me it’s a little clearer where the problem is. I see that SalesOrderHeader is scanned completely and then spooled (one iteration per person) and then sorted by date. I also see that there’s a lot of data coming from Person.Person.

So in this case, I may decide to add an index on SalesOrderHeader(SalesPersonId, ShipDate) like this:

CREATE INDEX IX_SalesOrderHeaderEnlarged_SalesPersonShipDate 
    ON Sales.SalesOrderHeaderEnlarged(SalesPersonId ASC, ShipDate DESC)

A Bonus Optimization
There’s another optimization I’d like to consider. Notice that all 19,972 rows in the Person.Person table are scanned. Well I can do better than that. I can join Person.Person to HumanResources.Employee because I know that all salespeople are employees. The join acts as a filter and it really cuts down on processing. The query now looks like:

SELECT P.FirstName, P.LastName, SUM(OrderQty) Items
FROM Person.Person P
JOIN HumanResources.Employee E
    ON P.BusinessEntityID = E.BusinessEntityID
CROSS APPLY (
    SELECT TOP 10 SUM(OrderQty) as OrderQty, H.SalesOrderID
    FROM Sales.SalesOrderDetailEnlarged D
    JOIN Sales.SalesOrderHeaderEnlarged H
        ON D.SalesOrderID = H.SalesOrderID
    JOIN Production.Product Pr
        ON D.ProductID = Pr.ProductID
    WHERE H.SalesPersonID = P.BusinessEntityID
        AND Pr.Color = 'red'
    GROUP BY H.SalesOrderID, H.ShipDate
    ORDER BY H.ShipDate DESC
) AS LastTenOrders
GROUP BY P.BusinessEntityID, P.FirstName, P.LastName
ORDER BY SUM(OrderQty) DESC

With the new index, this query performs about as well as we can hope. The new plan now looks like this:

The query plan for the previous query

Why this Works For Me

Adding MAXDOP 1 is a quick way to give me a serial version of the query plan and the serial version of the plan is simpler. This trick saves me time. Or it has often enough that it’s worthwhile to try.

Another reason this works for me is that the queries I write and maintain are often required to run sub-second. You might be in the same boat if you write your queries for high-volume OLTP systems. This usually means that there’s a high standard for query performance. Acceptably tuned queries are fast enough that SQL Server will rarely decide to consider parallel query plans. And that means it’s tempting to look for parallelism to find candidates for poorly tuned queries.
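
One rough way to find those candidates is to search the plan cache for expensive statements whose cached plans contain parallel operators. Something like this sketch (it assumes VIEW SERVER STATE permission, and note that searching every cached plan’s XML can be slow on a busy server):

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20)
    qs.execution_count,
    qs.total_worker_time,
    st.text AS query_text,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE qp.query_plan.exist('//RelOp[@Parallel="1"]') = 1 -- plans with at least one parallel operator
ORDER BY qs.total_worker_time DESC;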

Okay, Here Come The Caveats

  • Reading the title, some will assume that I’m claiming a parallel query plan is a bad query plan. Absolutely not. Queries that are executed with parallel plans are queries that do a lot of work. That’s not bad in itself. Sometimes queries need to do a lot of work. The question that should be asked is whether these queries are doing too much work.
  • I’m not claiming MAXDOP 1 is a tuning technique in itself. It’s just a trick I use to look at a query from a different point of view. If your query is a thousand times faster after adding MAXDOP 1 and you don’t know why, then you’re not done yet.
  • Parallel execution plans are not dull or useless! They’re fascinating! I’ve learned that the mental effort spent following a Paul White post on parallelism always pays off.

July 3, 2013

It Took Me Ages To Figure This One Out: A Puzzle

Filed under: Miscelleaneous SQL,SQL Scripts,Technical Articles — Michael J. Swart @ 12:00 pm

Takeaway: I tackled a problem recently and I present a version of the problem here as a challenge. The challenge is to figure out why a query and a stored procedure return different results even though the procedure is defined using the same query.

If no one cracks this one, I’ll leave the answer in the comments in a couple days.

Too Easy

The Setup

Okay, I use AdventureWorks2012 here. I define a table data type IDList and a stored procedure s_GetCustomerProductPurchases:

use Adventureworks2012
go
 
create type IDList as TABLE(id int);
go
 
CREATE PROCEDURE dbo.s_GetCustomerProductPurchases 
	@CustomerIds IDList READONLY,
	@ProductIds IDList READONLY,
	@SearchString NVARCHAR(1000)
AS
SET NOCOUNT ON 
 
DECLARE @ResultList TABLE ( 
	ProductId INT, 
	ProductName Name 
	);
 
insert @ResultList (ProductId, ProductName)
select distinct TOP (2000) p.ProductID, p.Name
from sales.salesorderdetail sod
join sales.salesorderheader soh 
    on soh.SalesOrderID = sod.SalesOrderID
join Production.Product p
    on p.ProductID = sod.ProductID
join Sales.Customer c
    on c.CustomerID = soh.CustomerID
join @CustomerIds CIds
    on CIds.id = c.CustomerID
join @ProductIds PIds
	on PIds.id = p.ProductID
where p.Name like '%' + @SearchString + '%'
	and soh.[Status] = 5
 
IF	@@ROWCOUNT = 0 
	SELECT	0 ProductId, 'No Matching Products' ProductName
ELSE 
	SELECT	 ProductId, ProductName
	FROM	 @ResultList 
	ORDER BY ProductId

Executing the Query

Next, let’s run a query against Adventureworks2012. We define some variables that are used as filters in a query. This query populates a temporary results table and then we select from it:

DECLARE 
	@ProductIds IDList,
	@CustomerIds IDList,
	@SearchString Name = 'Red';
insert @ProductIds (id) VALUES (707), (717), (727)
insert @CustomerIds (id) VALUES (11000), (11103), (11201)
 
SET NOCOUNT ON 
 
DECLARE @ResultList TABLE ( 
	ProductId INT, 
	ProductName Name 
	);
 
insert @ResultList (ProductId, ProductName)
select distinct TOP (2000) p.ProductId, p.Name
from sales.salesorderdetail sod
join sales.salesorderheader soh 
    on soh.SalesOrderID = sod.SalesOrderID
join Production.Product p
    on p.ProductID = sod.ProductID
join Sales.Customer c
    on c.CustomerID = soh.CustomerID
join @CustomerIds CIds
    on CIds.id = c.CustomerID
join @ProductIds PIds
	on PIds.id = p.ProductID
where p.Name like '%' + @SearchString + '%'
	and soh.[Status] = 5
 
IF	@@ROWCOUNT = 0 
	SELECT	0 ProductId, 'No Matching Products' ProductName
ELSE 
	SELECT	 ProductId, ProductName
	FROM	 @ResultList 
	ORDER BY ProductId

These are the results I get:

ProductId ProductName
707 Sport-100 Helmet, Red

Executing the Procedure

Now run the procedure which contains the exact same query:

DECLARE 
	@ProductIds IDList,
	@CustomerIds IDList,
	@SearchString Name = 'Red';
insert @ProductIds (id) VALUES (707), (717), (727)
insert @CustomerIds (id) VALUES (11000), (11103), (11201)
 
exec dbo.s_GetCustomerProductPurchases @ProductIds, @CustomerIds, @SearchString

The results we get now look like:

ProductId ProductName
0 No Matching Products

So Why Are These Results Different?

That’s what I want you to figure out. Try to crack this one before looking at the comments.

SQL Fiddle

Have you used SQL Fiddle yet? It’s a site that allows SQL developers to set up different scenarios for use in online discussions. So if you don’t have Adventureworks2012 handy, I’ve set up two SQL Fiddles here and here. Your job then is to follow these two links and explain why the results are different.

April 17, 2013

The Sch-M lock is Evil

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 12:00 pm

A necessary evil, but still evil. Why? Because it won’t share with Sch-S, and Sch-S is taken and held for absolutely everything (yes, even your NOLOCK queries). And that can lead to some interesting concurrency problems. Let me explain.

Sch-M

Sch-M is an abbreviation for Schema Modification. It refers to a kind of lock that is taken on a table, index, or other object whenever you want to modify that object. SQL Server allows only one Sch-M lock on an object at a time. So if you want to modify a table, your process waits to take a Sch-M lock on that table. Once that lock is granted, the modification is completed and then the lock is released.

Sch-S

Sch-S is an abbreviation for Schema Stability. It is a kind of lock that is taken on an object when a process doesn’t want that object to change its definition. It makes sense. If I’m reading a set of rows from a table, I don’t want a column to disappear on me halfway through. SQL Server allows many Sch-S locks on a table.

Sch-S vs. Sch-M

But Sch-S locks are incompatible with Sch-M locks. This means that when you want to modify a table, you’re not granted a Sch-M lock immediately. You have to wait for everyone using that table to finish using it. You’re essentially put on hold until the existing queries complete and their existing Sch-S locks are released. This also means that while you’re waiting, every query who wants to begin using that table is waiting in line behind you too. Basically “Everybody outta the pool while the lifeguards change shifts.” But that’s usually acceptable right? Database schema modifications are a big enough change to require a maintenance window.
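
If you ever want to watch this standoff happen, the lock DMV shows who holds and who waits for these schema locks. A sketch along these lines (run it in the database in question so OBJECT_NAME resolves):

-- which sessions hold or wait for schema locks right now?
SELECT tl.request_session_id,
       OBJECT_NAME(tl.resource_associated_entity_id) AS locked_object,
       tl.request_mode,    -- Sch-S or Sch-M
       tl.request_status   -- GRANT or WAIT
FROM sys.dm_tran_locks AS tl
WHERE tl.resource_type = 'OBJECT'
  AND tl.resource_database_id = DB_ID()
  AND tl.request_mode IN ('Sch-S', 'Sch-M')
ORDER BY tl.request_session_id;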

Index Rebuilds Are Table Modifications

It’s true, if you have the luxury of maintenance windows for your DB changes, you’ll be alright. But you also have to consider your database maintenance plans (automated or otherwise). Those plans can launch index rebuilds while the database is online. And all index rebuilds also count as table modifications and take Sch-M locks. An index rebuild has syntax like this:

ALTER INDEX [PK_MyTable] ON [MyTable] REBUILD WITH (ONLINE=ON)

Hopefully you’ve remembered that ONLINE=ON part. When you use that part, the index is rebuilt in the background and at the end of that processing time, a Sch-M lock is taken and released very quickly.

But maybe you’re not so lucky. Maybe you’re not running 2012 yet and have an index that includes blobs. Or maybe you’re running on Standard Edition. In those cases you won’t be able to use the ONLINE=ON feature. In that case, the Sch-M lock is taken by the rebuild process and it’s held the entire time that index is rebuilt. During the rebuild, that index is now truly offline. No access for you.

You Can Get Stuck

Just like I did. A while ago, I was asked to help with this exact situation. An offline index rebuild had been running for hours, and the Sch-M lock it held was preventing anybody from using or even looking at that table. I was stuck between a rock and a hard place. I had to choose between letting the index rebuild complete (which could take hours) or cancelling the job (whose rollback could take hours). There was nothing I could do to avoid additional hours of downtime. There was another bit of irony in my situation: we didn’t care about the existing data in that table. A truncate or a drop/recreate of the table would have suited us just fine.

… Like Really Stuck

It occurred to me to try something out. What if I created an identical empty table with a different name? We didn’t need any data in the locked table, so using a new table could work. And because the locked table is only accessed by stored procedures, I could modify those procedures to use the new table instead of the locked one.

Nope!

For some reason, the ALTER PROCEDURE requires a Sch-S lock on the old table, the table it no longer refers to. The sprocs can’t operate or be modified without a Sch-S lock on that locked table. This sketch illustrates my situation.

Make sure your index plans never attempt to rebuild big indexes offline.
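
A quick way to spot the risky ones ahead of time is to look for tables with LOB columns, since (before SQL Server 2012) their clustered indexes can only be rebuilt offline. Something like this sketch:

-- tables containing LOB columns (text, ntext, image, xml, or the max types);
-- before SQL Server 2012, their clustered indexes cannot be rebuilt with ONLINE = ON
SELECT DISTINCT t.name AS table_name
FROM sys.tables AS t
JOIN sys.columns AS c
    ON c.object_id = t.object_id
JOIN sys.types AS ty
    ON ty.user_type_id = c.user_type_id
WHERE ty.name IN ('text', 'ntext', 'image', 'xml')
   OR (ty.name IN ('varchar', 'nvarchar', 'varbinary') AND c.max_length = -1);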


April 3, 2013

Altering Text Columns: Only a Metadata Change?

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 8:00 am

Say you want to change the type of a text column using the ALTER TABLE … ALTER COLUMN syntax. It is valuable to know how much work SQL Server will have to do to fulfill your request. When your tables are large, it can mean the difference between a maintenance window that lasts five minutes, or one that lasts five hours or more.

I give a list of exactly when you’ll feel that pain and when you won’t.

A joke about char(max)

(BTW, CHAR(MAX) columns are impossible, you’ll get the gag if you figure out why)

When is the Whole Table Processed?

Here are conditions which require processing the entire table:

  • switching from unicode to non-unicode or vice versa.
  • changing a column from nullable to not nullable.
  • going from fixed length field to variable length field or vice versa.
  • decreasing the maximum length of a field.
  • increasing the maximum length of a fixed length field.
  • converting from limited length columns to unlimited or vice versa. (e.g. varchar(10) to varchar(max)).
  • collation modifications on non-unicode columns that change character set or code page. (See Collation Hell Part 3 by Dan Guzman)

On large tables, any of the above alterations will be a heavy hitter: it will take time and fill the transaction log (except that shrinking a fixed length field seems to require only a scan).

What Changes are Metadata Only Changes?

That’s a lot of conditions! What alterations are left?

Not much:

  • Increasing the maximum length of a variable length column.
  • Changing the type from text to varchar(max).
  • Changing the type from ntext to nvarchar(max).
  • Any of the above while making a non-nullable field nullable.
  • Any of the above with a change in collation (with some big caveats, see Collation Hell Part 3 by Dan Guzman).

These changes are metadata only changes which means SQL Server doesn’t have to touch any of the actual data. So the size of the table will not impact the time it takes SQL Server to process the ALTER TABLE command. But see some notes about concurrency below.
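
If you want to convince yourself, try both kinds of change on a throwaway table (the dbo.AlterDemo name here is just an example), load it with a few million rows, and compare the durations:

-- a throwaway table to experiment on
CREATE TABLE dbo.AlterDemo
(
    Id int NOT NULL PRIMARY KEY,
    Note varchar(10) NOT NULL
);

-- (load a few million rows here before comparing the two ALTERs)

-- metadata only: increasing the maximum length of a variable length column
ALTER TABLE dbo.AlterDemo ALTER COLUMN Note varchar(100) NOT NULL;

-- whole table processed: switching from non-unicode to unicode
ALTER TABLE dbo.AlterDemo ALTER COLUMN Note nvarchar(100) NOT NULL;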

Some Notes

Some notes about the above:

  • I ignore, and make no claims about, migrations where text or ntext is the target column type because the exceptions are strange and that scenario seems to fall under “Why would you want to do that?”
  • The above applies to only versions I’ve tested. Specifically 2008, and 2012.
  • The metadata-only changes I described above are not entirely online. There are still concurrency concerns to watch out for. These ALTER statements still request Schema Modification (Sch-M) locks on the table and, once granted, hold them only briefly. But if you try to alter a column on a live environment and some long-running query blocks your ALTER TABLE statement, then other queries that need access to the table will be blocked as well.
  • Terms I used
    • fixed length: char(x), nchar(x)
    • variable length: varchar(x), nvarchar(x)
    • unlimited length: varchar(max), nvarchar(max), text, ntext
    • unicode: nchar(x), nvarchar(x), nvarchar(max), ntext
    • non-unicode: char(x), varchar(x), varchar(max), text

March 19, 2013

Checking Out Amazon Redshift

In order to refresh my memory about what I learned in university, I’ve been watching a course that UC Berkeley made available on YouTube. It’s been a good course so far. Recently I got to the topic of logical fallacies. And so I’m reminded that the following is not a valid argument: Jimmy advises X; Jimmy profits if you follow advice X; therefore, not X. It’s not valid in the academic sense. In the practical sense, I have to remember not to distrust all commercials and marketing videos.

But it’s hard. When Googling for “Big Data” or “Analytics”, I find it very difficult to find anything to read that’s both unbiased and useful. Maybe that’s because I’m still skeptical about any solution promoted by people who stand to profit if I follow their advice (I’m trying not to discount that advice, I just squint at it a little).

So when Amazon came out with a publicly available Beta for their new Redshift Datawarehouse service (accompanied by a slick marketing video), I decided to kick the tires. Here’s some notes I made:

Amazon’s Redshift Commercial

Their commercial is here. And I have to say it reminded me of an infomercial. You know the kind where they try to sell fancy mop gadgets by showing how ridiculously clumsy people can be with regular mops. The Amazon commercial is light on technical details but I don’t think it was meant for an audience like me and you. I made a conscious effort not to hold that against them.

Warehouse in the cloud

Having a warehouse in the cloud makes a lot of sense in some ways. Pay-as-you-go pricing is what you want when you prefer operational costs over capital costs. Many businesses that don’t have I.T. as a core competency will find this a pretty attractive alternative to running servers or an appliance themselves. But it can get expensive quickly. Amazon advertises less than $1000 / Terabyte / year, but that assumes upfront costs for reserved pricing and a two terabyte minimum, so the smallest rate you can get away with is $3000 per year for three years. In practice it will likely be significantly more. I can’t see anyone choosing Redshift without doing due diligence on the pricing, but it’s probably not going to be dirt cheap.

Connecting to Redshift
Star Trek's Jean Luc Picard says "Very well Data, open a channel"
Connections are made to Redshift only through ODBC (or JDBC) using Postgres drivers. Some challenges:

  • Picking my way through 32 bit vs 64 bit was tricky. Getting and using the right driver took some care.
  • Uploading through this connection is slow. So don’t try it this way. I mean it’s possible, but it’s simply not recommended. I learned this lesson not by reading through Amazon docs, but by attempting to run an SSIS job to funnel data into an ODBC destination. It was actually working, but it was impractical because it was so slow.

Creating the warehouse schema wasn’t too hard: I scripted a subset of tables from an existing warehouse. I stripped indexes, constraints and identities. There were a couple syntax differences (int becomes integer, bit becomes boolean, and GO becomes semicolon) but it went pretty smooth. For Redshift, in lieu of indexes, you choose columns on each table for the distribution key and the sort key. Underneath the covers, Redshift uses ParAccel and so if you’re familiar with that, you’ve got a great head start. Otherwise, Amazon’s Developer Guide is a good start. I’d like to hear more from others who know more about modeling for Redshift; it seems like a regular star schema will work well here.
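
To give a flavour of the translation, a Redshift table definition ends up looking something like this sketch (the table and columns here are made up for illustration):

create table sales_fact
(
    order_id    integer not null,
    customer_id integer not null,
    ship_date   date,
    is_return   boolean,        -- bit became boolean
    quantity    integer         -- int became integer
)
distkey (customer_id)           -- chosen in lieu of indexes
sortkey (ship_date);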

Loading data is best through S3

I created an SSIS package that takes data from my relational warehouse and dumps it into delimited files (hat tip to @peschkaj for advice on delimiter values). Then I gzipped the whole thing, loaded those gzipped files into Amazon S3, and loaded the data into Redshift from those files. Initially, I was afraid of Amazon S3 (What? I have to sign up for and learn about another service?) but working with Amazon S3 was dead simple.
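
The load itself happens with Redshift’s COPY command pointed at those S3 files. It looks roughly like this sketch (the bucket and credentials below are placeholders):

copy sales_fact
from 's3://my-bucket/sales_fact/'
credentials 'aws_access_key_id=<your key>;aws_secret_access_key=<your secret>'
delimiter '|'
gzip;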

Consuming data

I connected a Tableau client to Redshift using an ODBC connection. This Tableau discussion tells me that’s the current way to connect Tableau and Redshift. There are quite a few SQL limitations imposed by the ODBC protocol. So the Tableau experience was not too smooth. Tableau has a new Redshift connector coming out in a matter of weeks which should clear these limitations.
The point is that Amazon claims Redshift integrates seamlessly with a list of reporting apps (including Tableau). I believe it really will by the time Redshift is released; it’s just that right now there’s a pretty big seam.

Next?

I’m going to get around to evaluating Google Big Query and of course Microsoft’s HDInsight. You might notice that lately, I find myself in a phase of learning (fun!) and so this post and the next few will be a description of my exploration.

February 28, 2013

Follow up on Ad Hoc TVP contention

Filed under: SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 12:00 pm

Last month I posted PAGELATCH Contention on 2:1:103. It described a troubleshooting experience I had. I was troubleshooting a performance problem that was felt when applications issued very frequent ad hoc queries that used Table Valued Parameters (TVPs).

I’ve recently learned some extra things from the SQL Server team at Microsoft that I wanted to pass on. So this post acts as a kind of update to that one.

I know. I've drawn Columbo before, but it's my blog.

Reproduction

One thing I did was to come up with an application that reproduces the contention on 2:1:103 in the same way as production. I share that here now.

What this reproduction does is launch fifty processes. Each process uses multiple threads to hit a db server with an ad hoc query that uses a TVP.  For the curious here it is: Download AdHocTVPContention.zip and then open readme.txt. This reproduction is a mashup of

  • A simple batch script which launches an executable 50 times at once. A technique I gleaned from Paul Randal in a demo of his.
  • A multi-threaded executable which sends queries asynchronously. A framework I first developed in a post I wrote on concurrency.
  • The ad hoc TVP query itself. It was a bit of a modification of my example at Table Valued Parameters, A Short Complete Example. The only difference is that the command is a query instead of a stored procedure.
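
To picture what that means, the batch that reaches the server looks roughly like this sketch (the dbo.IDList type, the table and the query text are placeholders for illustration):

declare @p1 dbo.IDList;
insert into @p1 values (1);
insert into @p1 values (2);
insert into @p1 values (3);

-- the command is an ad hoc parameterized query, not a stored procedure call
exec sp_executesql
    N'select count(*) from dbo.SomeTable t join @ids i on i.id = t.id',
    N'@ids dbo.IDList readonly',
    @ids = @p1;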

What I’ve Learned Since Then

The Microsoft team helped me understand a few things. The main thing is that the particular version of SQL Server (including service pack level) matters a lot:

SQL Server 2005 and Earlier

TVPs weren’t supported then. (This post isn’t for you).

SQL Server 2008 and R2

In my earlier post, I dismissed a fix provided by Microsoft. Fix: Poor performance in SQL Server 2008 R2 when table-valued functions use many table variables. At the time, I dismissed it for a number of reasons:

  • The workaround listed on that kb article page did not help us.
  • The symptoms that were described at that issue did not match my own scenario (i.e. I wasn’t using table-valued functions).
  • The title mentioned R2 and I thought (mistakenly) that this fix was not applicable to SQL Server 2008. (Update April 5, 2013: The fix used to be available as a hotfix for 2008 SP3, but as of March 2013, this fix is now included in a cumulative update. Upgrade to SP3 CU10 if you’re using 2008 and want this fix. Microsoft has also updated the title of the fix to mention 2008)

Microsoft helped me learn that although the symptoms in that fix differ from mine, the fix involves a code change that also improves my scenario. With the fix applied, on my test environment, I saw that SQL Server could handle at least six times the volume of ad hoc queries it could handle before (your mileage may vary).

SQL Server 2012

But there’s one other thing I noticed. When I ran the reproduction on a SQL Server 2012 environment, the process just flew through the workload! Temp table creation was no longer a bottleneck; the contention was gone. The Microsoft engineer I worked with noticed this too, and eventually it prompted a CSS blog entry. You can read all about this 2012 performance improvement at Temp table caching improvement for table valued parameters in SQL Server 2012.

Partner vs MVP

I’m very lucky to have many investigation options available to me when I’m digging into a particular quirk of SQL Server behaviour. There’s the public options like #sqlhelp or dba.stackexchange or Microsoft support. And for me there’s some private options like the MVP mailing list. But this month, I feel lucky to work as a Microsoft partner. It’s through that connection that I learned so much.

Too Long; Didn’t Read

If you’re experiencing 2:1:103 contention because of too frequent ad hoc TVP queries then:

  • on SQL Server 2008 or 2008 R2, apply the fix described above (for 2008 it’s included in SP3 CU10), and
  • consider SQL Server 2012, where the temp table caching improvement removes this bottleneck.

December 5, 2012

Well That Wasn’t Obvious

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 12:00 pm

I’ve discovered that non-intuitive lessons stick in the brain the best. I share three examples that I either learned recently (or at least re-learned recently).

Uniqueifiers on Indexes That I Thought Shouldn’t Need Them

Do you remember what a Uniquifier is? SQL Server uses these hidden values on rows to keep non-clustered indexes in sync with their non-unique clustered indexes. Think of this questionable schema:

create table Actors
(
	ActorId int identity not null,
	LastName nvarchar(100) not null,
	FirstName nvarchar(100) not null,
	AgentId int null,
	primary key nonclustered (ActorId)
)
 
create clustered index IX_Actors
	on Actors(LastName, FirstName)

Then add some data:

insert Actors (LastName, FirstName, AgentId)
values ('Douglas', 'Michael', 3),
       ('Douglas', 'Michael', 4),
       ('Keaton', 'Michael', 5),
       ('Douglas', 'Kirk', 4)

I can demonstrate the uniquifier value here because of the non-unique clustered index:

select sys.fn_PhysLocFormatter (%%physloc%%) as [physical rid], *
from Actors -- in my case, this shows page 231
 
DBCC TRACEON (3604);
DECLARE @dbid int = DB_ID(); 
DBCC PAGE(@dbid, 1, 231, 3); 
 
/* showing (among other things):
...
Slot 1 Offset 0x60 Length 51
UNIQUIFIER = 0                       
LastName = Douglas                   
FirstName = Michael                  
ActorId = 1                          
AgentId = 3                          
KeyHashValue = (16035ff378a3)        
 
Slot 2 Offset 0x93 Length 55
UNIQUIFIER = 1                       
LastName = Douglas                   
FirstName = Michael                  
ActorId = 2                          
AgentId = 4                          
KeyHashValue = (16035e22af06)        
...
*/

Cool! Just like real life (See Michael Douglas (I – XXVI) at IMDB). This uniquifier value behaves very much like the roman numerals you see at IMDB.

Here’s the part that wasn’t obvious… what if I chose a clustered index whose columns include the primary key…

drop index IX_Actors ON Actors;
create clustered INDEX IX_Actors
	ON Actors(LastName, FirstName, ActorId)

That should be unique right? The index includes the primary key column (ActorId) that guarantees it. So SQL Server shouldn’t need the uniquifier right?

But it does! 

SQL Server doesn’t count on the index being unique unless you say so. I actually discovered the above while using Kendra Little’s sp_BlitzIndex, a tool so underexposed it’s pasty.
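
A quick way to see the difference is to declare the uniqueness explicitly and repeat the inspection:

drop index IX_Actors ON Actors;
create unique clustered index IX_Actors
	on Actors(LastName, FirstName, ActorId);
 
-- re-run the DBCC PAGE inspection from above:
-- the UNIQUIFIER field no longer appears in the row output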

UPDATE Modifies Column Values Only Once

What does the following code produce?

-- set up test table 
declare @test TABLE (id int, value int);
declare @source TABLE (id int, increment int);
 
insert @test (id, value)
values (1, 0), (2, 0)
 
insert @source (id, increment)
values (1, 10),(1, 20),(1, 30),(2, 100),(2, 100),(2, 100)
 
-- "sum": add each value in the source to the existing value
update @test 
set value = value + increment
from @test t
join @source s
    on s.id = t.id
 
-- check the results
select id, value
from @test
/* The results:
id          value
----------- -----------
1           10
2           100
*/

Surprised? I think this example (or one very much like it) surprises those of us who started our careers as programmers. Many of us follow up this lesson with the related “UPDATE statements don’t have GROUP BY clauses” and then the lesson “How do I use CTEs?”
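
One way to get the intended “sum” behaviour is to aggregate the source first and then join. A sketch (run it against a fresh copy of the setup above, in place of the original UPDATE):

-- pre-aggregate the source so each target row is matched exactly once
;with src as (
    select id, sum(increment) as total_increment
    from @source
    group by id
)
update t
set value = value + s.total_increment
from @test t
join src s
    on s.id = t.id;
 
select id, value from @test;
/* now returns:
id          value
----------- -----------
1           60
2           300
*/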

Indexed Views Don’t Support Max/Min Aggregates

Indexed views support a couple aggregate functions like COUNT_BIG() and SUM(). And with some trickery you can calculate AVG() and STDEV(). But SQL Server restricts the use of MAX() and MIN() in indexed views despite how useful they’d be.

It might help to understand why. SQL Server maintains indexed views as physical database objects and it can maintain aggregate values like COUNT() and SUM() by processing only the changing rows in the base table while safely ignoring the rest of the table. But I can’t say the same for MAX() or MIN(). If SQL Server supported MAX() and MIN() in indexed views, then when you delete the row in the base table that represents the MAX() value, SQL Server would have to search the rest of the base table to find the new MAX() value.
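
For contrast, here’s the general shape of an indexed view using the aggregates that are supported (the dbo.Sales table is made up; note the SCHEMABINDING, the two-part names, the required COUNT_BIG(*), and the ISNULL that keeps the SUM over a non-nullable expression):

create view dbo.v_SalesByProduct
with schemabinding
as
    select ProductId,
           count_big(*) as NumRows,
           sum(isnull(Amount, 0)) as TotalAmount
    from dbo.Sales
    group by ProductId;
go
 
create unique clustered index IX_v_SalesByProduct
    on dbo.v_SalesByProduct (ProductId);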

Check out the Microsoft Connect feature suggestion Expand aggregate support in indexed views (MIN/MAX). Aaron Bertrand created this Connect suggestion and I like it because it shows how effective constructive feedback can be: the description, the comments and the useful workaround. The Microsoft team even gave some insight into how they almost included this feature in SQL Server 2008. This Connect item only seems to be lacking an E.T.A., so go and cast your vote!

Bonus Content

I didn’t draw any illustrations this week so I’m including some bonus content (admittedly written by others):

Non-obvious Things From Twitter Friends

Request for Tweets: What thing in sql surprised you? Something not obvious. For me it was how the TRY and CATCH didn't always.

 I love how the log shipping "use default data and log folders" ... doesn't

how about how INSERT <tbl> VALUES (1),(2),(3) limits you to 1000 entries

Rolling time-window filtered index. The cool surprise - building each new index uses the existing one, so it's really quick.

Trivia

  • Did you know Michael Keaton (born Michael Douglas) changed his name to avoid confusion with that other guy? Pretty wise.
  • Jes Schultz Borland reminded me recently that good writers use more active verbs. I took that advice to heart and turned the writing of this article into an exercise. I avoided using words like is, was or are here and I think it turned out pretty well.

October 31, 2012

Triggers On Views? What For?

Filed under: SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 12:00 pm

What’s up with triggers on views? What kind of patchwork monster is this?

Happy Halloween!

I’m sharing something that I learned this week. It’s something I learned while talking to a friend at work. He wondered whether SQL Server had any feature similar to the rule system over at Postgres. (I thought no at first, but keep reading).  You know, I continue to learn tons while talking to friends at work even though I’m the in-house SQL Server specialist. I don’t mind at all.

Aside: Have I ever mentioned how amazing the people I work with are? I work for a great company and we’re hiring like crazy.

Back to the post. Here’s what I learned:

  • I knew that insert/update/delete statements could be executed on views if the view was simple enough. If it was simple enough we call the view “updatable”.
  • I thought that triggers couldn’t be defined on views but
  • I was wrong and …
  • I learned that not only are triggers allowed on views but INSTEAD-OF triggers will work on any kind of view, updatable or not.

Example

Here’s what I mean.

create table A (id int primary key)
create table B (id int primary key)
go
 
create view AB as
    select id, 'a' as tablename from A
    union all
    select id, 'b' as tablename from B
go

If I were to try to insert rows into the view, I get this error message:

'UNION ALL view ~ is not updatable because a partitioning column was not found.'

Other views give other errors, but the message is the same: “Your view is too complicated to be updatable”. That’s okay. I never really expected inserts to work on anything but the simplest views so I’m not too disappointed.

But like I said, here’s how you would make this view updatable – or rather insertable – with an INSTEAD OF trigger:

create trigger t_AB
on AB
instead of insert
as
    if exists (select 1 from inserted where isnull(tablename,'') not in ('a','b'))
         raiserror (N'column ''tablename'' must be either ''a'' or ''b''', 16, 1);
    else
    begin
        insert A (id) select id from inserted where tablename = 'a'
        insert B (id) select id from inserted where tablename = 'b' 
    end
GO

Boom! No more this-view-is-not-updatable errors.

But When Would I Use This?

I thought of the perfect use case. This strategy helps with  SCHEMA REFACTORING. Say your schema contains this table:

create table ANIMALS 
(
    Name varchar(100) primary key,
    IsVertebrate bit not null,
    VertebraeCount int
)

and you’ve decided to change it to look like this:

create table ANIMALS_BASE
(
    Name varchar(100) primary key
)
 
create table VERTEBRATES
(
    Name varchar(100) primary key
        references ANIMALS_BASE(Name),
    VertebraeCount int not null
)

After you’ve done that, you can create a view called ANIMALS with triggers to accommodate applications that still expect the old schema. That would look something like:

CREATE VIEW ANIMALS
AS
    SELECT AB.Name, 
        CASE WHEN V.Name IS NULL THEN CAST(0 AS BIT) ELSE CAST(1 AS BIT) END as IsVertebrate,
        V.VertebraeCount
    FROM ANIMALS_BASE AB
    LEFT JOIN VERTEBRATES V
        ON AB.Name = V.Name

Next create the insert, delete and update triggers. I only show the insert trigger here. The delete and update shouldn’t be too hard to write either.

create trigger t_ANIMALS_i
on ANIMALS
instead of insert
as
    INSERT ANIMALS_BASE (Name) SELECT Name FROM inserted;
    INSERT VERTEBRATES (Name, VertebraeCount)
        SELECT Name, VertebraeCount 
        FROM inserted 
        WHERE IsVertebrate = 1;

Here’s a few ways that this helps:

  • You might want to use this strategy when you don’t have control over every application using your database.
  • This helps with migrations because applications are now free to expect the new or old schema. The applications can transition gradually to the new way of doing things.

Even if you believe every application uses only the new schema, you can still make sure of it gradually by logging whenever someone uses the old schema. Just add a logging statement to the definition of the triggers. You can log or track trigger use any way you’re comfortable with. One suggestion is to use the user-settable performance counter via sp_increment_counter1, and then poof! You’ve got your own deprecated-features performance counter, just like Microsoft.
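
Here’s a sketch of the logging idea using a plain table rather than a performance counter (the dbo.OldSchemaUsage table is just an illustration):

create table dbo.OldSchemaUsage
(
    LoggedAt    datetime not null default getutcdate(),
    ObjectName  sysname  not null,
    HostName    sysname  null,
    ProgramName sysname  null
);
go
 
-- add something like this at the top of each INSTEAD OF trigger:
insert dbo.OldSchemaUsage (ObjectName, HostName, ProgramName)
values ('ANIMALS view', HOST_NAME(), PROGRAM_NAME());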

October 11, 2012

You Can “Do Science” Anywhere

Filed under: SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 12:00 pm

Consider the Following

Did you know you can do science anywhere? Science is about learning how Nature behaves. Well, what if we replace Mother Nature with SQL Server? SQL Server’s not as pretty, but we can still learn things. It’s 2012! When we wonder about things, even databases, we don’t have to guess.

Consider the following...

The Question

To keep things simple, I’m going to follow the scientific method here and this is step one. The question. All good science starts with a question. The question itself doesn’t matter. I just want to talk about the process.

So I take the question from this DBA Stackexchange question: Why are NULLS sorted first?

Good question! The SQL Standard does not actually prescribe any behavior here (says Wikipedia). So the actual behavior is left up to the RDBMS vendor. The likely answer to the question is that NULLS are sorted that way because Microsoft had to pick something and they probably just made a choice based on what other RDBMS’s were doing or other implementation considerations. Or who knows, maybe they just flipped a coin.

But that leads me to this question: Are they always sorted first? If I reverse the order of a sort by using ORDER BY column DESC do Null values still show up first? If I look at the manual, Microsoft tells me “Null values are treated as the lowest possible values” so that ORDER BY column DESC will place Null values at the end.

But the way the stack overflow question is worded makes me wonder a little and I’d like to find out for sure. So to repeat:

“If I sort the results of a query using ORDER BY column DESC do Null values come last?”

Hypothesis / Prediction

I’m placing my bets with Microsoft’s documentation and so I answer YES.

I predict that any query written with “ORDER BY column DESC” will place the Null values at the end. If I’m wrong, then it would seem that “null values always come first” is the behavior here. (Note: NULLS FIRST is actually mentioned in the SQL Standards as an extension to the ORDER BY clause, but it’s not implemented by SQL Server. This also lets me believe I’m not crazy for wondering.)

Test

That’s simple enough:

select *
from (values (1),(2),(3),(null)) as v(value)
order by value desc

The prediction says that the fourth value in the results will be NULL. And tada! The prediction is correct.

Analysis

There’s nothing surprising here, but maybe it raises some new questions. Like:

  • What do other vendors do? (Turns out they do the same. This lends weight to Microsoft picking this behavior because it was common practice)
  • What’s up with this NULLS FIRST keyword?
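
On that last point: SQL Server doesn’t implement NULLS FIRST or NULLS LAST, but you can emulate them with an extra sort expression. A quick sketch using the same test values:

select *
from (values (1),(2),(3),(null)) as v(value)
order by case when value is null then 1 else 0 end,  -- push NULLs to the end
         value desc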

There’s a publish part here. The default assumption (that Microsoft documentation is often correct) is still valid so there’s nothing surprising to publish. Probably not worth a blog post (despite what you’re reading).

Boiled Down

Normally the thought processes aren’t so structured, but as long as the question comes first and the guess is checked against experiment, it counts.

It seems like common sense. And maybe I could have picked a better example. Science helps satisfy our curiosity. But it also can save us from kidding ourselves. Without this kind of thinking we risk turning into superstitious pigeons. I once saw someone perform a task (release and renew his ip address) three times out of habit “just so it takes”. That way lies madness.

September 27, 2012

My Short Struggle With SQL Agent

Filed under: SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 12:00 pm

You know I’ve always thought of SQL Agent as one of the “Agent Programs” from the movie the Matrix:

So what follows is a mini-lesson about the SQL Agent. I recently got into trouble when I created a SQL Agent Job step. The only thing this step needed to do was to move some text files from one network drive to another.

I just wanted to use the dos command:

move "\\server1\somedir\*.txt" "\\server2\somedir"

But if there were no files copied, the step failed because the return code was 1 instead of 0. It succeeded in copying all the files that were there (all zero of them), but apparently that’s unacceptable.

I then tried using robocopy. Robocopy is the “robust copy” command for Windows. It’s got all the bells and whistles… so I tried:

robocopy "\\server1\somedir\" "\\server2\somedir\" *.txt /mov

but the return code for robocopy is very complicated. Even more so than regular move. I’m not the first to deal with this headache… Check out this topic on SQL Server Central’s forum. The solution there uses a batch script to accomplish the task. If it’s successful, it returns a consistent error code. But doesn’t it seem like there should be a better way?

Only One Successful Return Code For Cmd Steps

Yep, the problem stems from the fact that only one return code is allowed for SQL Agent Job steps that are Operating System commands. That’s shown on this screen:

There can be only one ... error code.

Powershell Beats ‘em All

Long story short… I went with a powershell script. It was the simplest:

move-item "\\server1\somedir\*.txt" "\\server2\somedir"

And I don’t worry about return codes here. The powershell script succeeds or it doesn’t. The times when the script fails include when the move would overwrite an existing file or when permissions prevent the file move, and that’s what I wanted.

In powershell, simple things are simple. Hard things are possible.
