t-sql | Michael J. Swart Michael J. Swart

August 17, 2012

Be Careful with the Merge Statement

Filed under: SQLServerPedia Syndication,Technical Articles — Tags: merge statement, t-sql — Michael J. Swart @ 8:38 am

So there’s a lot of good documentation provided by Microsoft for SQL Server. Microsoft in general is the best model I know of for how technical documentation should be provided to the public (If you’re laughing, it’s time to take another look).

But to the unfamiliar, it can still be hard to judge good docs from bad (or good docs from better). If you knew better, then you probably wouldn’t need the docs right? Wouldn’t it be great to have some third party point to a particular article and say “Read this; this is important.”

For example consider this article: Optimizing MERGE Statement Performance. It’s well written and it explains how performance gains are possible using the MERGE statement because source and target datasets can be processed only once. The alternative to the MERGE statement is to process the datasets multiple times by using more than one INSERT, UPDATE or DELETE statements. But then Microsoft goes on to say “… performance gains depend on having correct indexes, joins, and other considerations in place.” That is an understatement dear readers and I’d like to call your attention to it.

I’m not going to repeat Microsoft’s advice (it is like I said, an excellent article), but I am going to add to it by describing some of the dangers of not following their advice.

Why Is MERGE Dangerous?

Here’s why. It’s because there is a risk that if you don’t get the “indexing, joins and other considerations” right, then it is possible to not only lose the performance benefits you hoped for, but to suffer from performance problems much much worse than if you had written the straight-forward INSERT, UPDATE and/or DELETE statements in the first place. Because of their complexity, I believe MERGE statements seem more vulnerable to bad query plans than other DML statements.

My Own Experience

I’ve been bitten by this problem twice in the past month the most recent time being yesterday. So for the curious, you could say that this lesson counts as the latest thing about SQL Server I learned the hard way.

Experience #1

The first time was with a merge statement that was implementing the standard logic of updating an existing set of table rows with using a dataset sent from an application:

Insert rows that are new
Update rows that have changed
Delete rows that are gone

But the logic of the whole thing required a fussy “WHEN NOT MATCHED BY SOURCE AND …” and the resulting query plan was doing something like a full-outer join. The query processed the entire target table even though it didn’t need to. If you’re really really curious, look for more details look at this similar question I found on Stack Overflow: Sql Server Delete and Merge performance.

I was about to rewrite the whole thing using only UPDATE/INSERT/DELETE statements, but then my colleague found a way to make the merge statement work. Woo hoo.

Experience #2

And then just yesterday, I had another problem with a MERGE statement. Some of the facts:

The source query was a CTE and produced about 4000 rows max. That CTE when run on its own never took more than 3 seconds in either on test or production databases.
The MERGE’s search condition – the ON [merge_search_condition] clause – used columns that matched a unique non-clustered index. Microsoft recommends a clustered index here, but what can you do?
Testing with a target table of several thousand rows went just fine (in hindsight, a test db with millions of rows would have saved me).

But testing is testing and production is production. The optimizer crossed a threshold somewhere and said “Whelp, that’s it. I guess a non-clustered index isn’t going to cut it here any more. Let’s scan the whole thing.” Now most of the time the optimizer comes up with a really good plan. But in this case, it just didn’t.

I plan to rewrite the query as separate INSERT and UPDATE statements and early tests are promising.

Where Do We Go Now

Well, as a veteran problem-avoider, I now feel distrustful and wary of the MERGE statement. And maybe I don’t mind the multiple INSERT/UPDATE statements so much any more. If my MERGE statement takes one second to run and my INSERT/UPDATE statements takes two, then maybe two seconds is acceptable when the alternative takes so much extra effort to properly test. What do you think?

-- Comments (11)

May 30, 2012

MERGE Statement Generator

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Tags: merge, t-sql — Michael J. Swart @ 12:00 pm

So what I've got here below is a web form. It takes a source query and target table and spits out a merge statement for the standard insert/update use case. I made it for myself but you might find it useful too. I'm counting on this to be a real time saver (at least for me).

Clustered Index Merge Showplan Operator

Some Neat Things

This webform doesn't post to any server. It's all javascript that gets executed in your browser. This means you don't have to worry about anyone stealing your code and I don't have to worry about this form going viral (ha ha).
SQL Snippets (either kind) don't quite handle the dynamic list of columns here. Otherwise, you would be reading a different post today.
The MERGE statement is a lot more versatile than what I show here. I'm just handling the most common use case.

Things to Watch For

Concurrency. If this is important to you, remember to use appropriate locks (usually UPDLOCK) on the target table.
Make sure the column list of the source query matches that of the target table
You're going to have to clean up the script if your target table has columns that are rowversion, identity, computed etc...
User input is being used to generate code here. So my SQL-injection spidey-sense starts to tingle. But it's okay in this case because I'm not running anything. I'm just displaying it. You're the one who's running this stuff so it's up to you to vouch for any generated code. Take care.

The Form

Put in your own data (or just use the sample here). When you're ready, hit MERGE!

Source Query

Target Table

Key Columns (one column per line)

Other Columns (one column per line)

Your Merge Query

-- Comments (6)

June 8, 2011

Forget About PIVOT, You Don’t Need It

Filed under: SQLServerPedia Syndication,Technical Articles — Tags: "sql server", PIVOT, pivot tables, t-sql — Michael J. Swart @ 12:00 pm

Takeaway: Until recently, I was a huge T-SQL PIVOT fan. But the more I think about it, the more I realize that it’s probably better to pivot data anywhere else than to pivot data in SQL Server. I talk about easier and better alternatives to PIVOT. Remember PIVOT is a T-SQL command that was introduced in SQL Server 2005 that lets you return results as pivot tables or more commonly as cross tabulations.

SQL Server 2000 and Earlier

I remember the first time I came across the need for pivot tables. I knew what I needed the data to look like, but I didn’t know what they were called. It didn’t matter, PIVOT wasn’t yet a T-SQL keyword and I couldn’t make use of it any way.

I was asked to write a query that returned data about the number of widgets and their categories. That seems like no problem right? Except I was asked to return the data in a two dimensional grid (with widgets as rows and categories as columns). It was a perfect use case for PIVOT! Except that it wasn’t yet available and so I was forced to write something complicated in T-SQL that used subqueries, CASE statements and other crazy stuff.

The resulting query was at least 100 lines. It looked complicated but it did the trick. I was actually congratulated on my SQL skillz (although in hindsight I shouldn’t have been).

But it got me to thinking. Why was that so hard? I used to think that any reasonable English question can be translated into simple SQL. But here was a big fat hairy counter-example. I eventually came up with an answer: It’s because the requested data has information about different sets of data in the same row.

SQL Server 2005 and Later

Fast forward a few years. I was learning about all the new stuff that SQL Server 2005 could do. And I came across the new PIVOT keyword. I knew immediately what it could do for me. I knew immediately how it was used. And I used it when the results called for it. I never remembered the syntax because I knew I could always look it up.

Eventually I gained a reputation as a SQL know-it-all (whether I deserved it or not) and I started fielding database questions. If any colleagues were struggling to return a cross-table, it was easy to recognize what they were doing and easier to point them to the online docs for PIVOT.

Or Just Let The App Deal With It

But I realized recently that it’s really not necessary. At the database level, I probably don’t have to pivot this. I can ask the question: Do I really need to deliver the data that way – pre-pivoted? Heck No! I can delegate that stuff. It’s actually a lot easier to pivot this data almost anywhere else besides SQL Server.

So in essence I’m giving myself (and you!) permission to forget something: T-SQL’s PIVOT syntax.

Pivot Inside Excel

Say the data is a one-time-only query and you want to include a cross table in a spreadsheet or email. Well Excel’s pivot feature turns out to be dead simple. This pivot functionality is also found in Open Office’s Calc and any other spreadsheet application built this century. Just a couple extra tips:

Using the “format as table” feature can save yourself a couple clicks.
Find the “Create Pivot Table” feature under the Insert tab.
If you’re still having any trouble, I’m sure there’s tons more help here.

Pivot Inside a .Net App

Okay, so say you’re writing some C# and you have a DataTable that you wish were more pivot-y. Ask and ye shall receive:

DataTable Pivot( DataTable dt, DataColumn pivotColumn, DataColumn pivotValue ) {
	// find primary key columns 
	//(i.e. everything but pivot column and pivot value)
	DataTable temp = dt.Copy();
	temp.Columns.Remove( pivotColumn.ColumnName );
	temp.Columns.Remove( pivotValue.ColumnName );
	string[] pkColumnNames = temp.Columns.Cast<DataColumn>()
		.Select( c => c.ColumnName )
		.ToArray();
 
	// prep results table
	DataTable result = temp.DefaultView.ToTable(true, pkColumnNames).Copy();
	result.PrimaryKey = result.Columns.Cast<DataColumn>().ToArray();
	dt.AsEnumerable()
		.Select(r => r[pivotColumn.ColumnName].ToString())
		.Distinct().ToList()
		.ForEach (c => result.Columns.Add(c, pivotValue.DataType));
 
	// load it
	foreach( DataRow row in dt.Rows ) {
		// find row to update
		DataRow aggRow = result.Rows.Find(
			pkColumnNames
				.Select( c => row[c] )
				.ToArray() );
		// the aggregate used here is LATEST 
		// adjust the next line if you want (SUM, MAX, etc...)
		aggRow[row[pivotColumn.ColumnName].ToString()] = row[pivotValue.ColumnName];
	}
 
	return result;
}

If you know the shape of your data ahead of time, you could have coded this more easily by hard coding column names. But what I’ve given here is a general procedure (It works on most test cases. Error handling and extra testing are left as an exercise for the reader 🙂 ).

Pivot Inside Other B.I. Clients

Most other clients I can think of have even easier pivot features:

Reporting Service’s matrix control is a pivot table.
Integration Service’s pivot transformation is easy and handy.
Analysis Service’s cubes are nothing but a multi-dimensional pivot tables.
Other Business Intelligence software is all over this. If your B.I. software can’t do cross-tables or pivot tables, I would ask for your money back.

You don’t need T-SQL’s PIVOT table at all. Forget that stuff…

… Except When You Do Need T-SQL’s PIVOT

Okay, maybe I’m a bit over-zealous there. Here are some reasons why you might want to use T-SQL’s PIVOT tables. They’re not great reasons but I present them any way. You’ll need to know this stuff:

when studying for certifications, interviews or other tests,
when the application or data destination you’re working with is not suited to pivoting (although nothing comes to mind). Your choice is then PIVOT using T-SQL or not at all.
when the data’s destination is the SSMS results pane. That’s fair.
when you don’t have control over the app, or the app is too difficult to modify and it’s expecting a cross-table for data.

So for the most part, you can forget about PIVOT syntax. Next week maybe I’ll talk about remembering the syntax for the MERGE statement. In my opinion, SQL’s MERGE statement is a statement worth remembering, but that’s another story.

-- Comments (35)

March 7, 2011

The Aggregate Function PRODUCT()

Filed under: SQLServerPedia Syndication,Technical Articles — Tags: aggregate functions, product, sum, t-sql — Michael J. Swart @ 8:00 pm

So it’s T-SQL Tuesday time again and this month it’s hosted by Jes Schultz Borland.
The topic this month is aggregation. Which is a great topic. Real T-SQL topics are my favorite kind. In the past I’ve actually written a couple of posts on the topic of aggregation which would have fit in perfectly this month:

But this month, I want to tell you about the aggregate function PRODUCT().

The PRODUCT() Aggregate Function

Okay, I’m pulling your leg a bit. There is no function defined in T-SQL that is called PRODUCT(). But everything is in place to let you build one without having to resort to CLR aggregate functions. All we need is to do is remember a bit of math. Remember that:

So when ever you would want to write

SELECT PRODUCT(field)
FROM SomeTable

You can feel free to write

SELECT EXP(SUM(LOG(field)))
FROM SomeTable

Examples

Multiplying the first six primes

SELECT EXP(SUM(LOG(num))) as product
FROM (VALUES (2),(3),(5),(7),(11),(13)) as NUMS(num)
/*
product
----------------------
30030
*/

The Value of a 2005 U.S. Dollar in 2010

DECLARE @StartYear INT = 2005;
DECLARE @EndYear INT = 2010;
 
SELECT EXP(SUM(LOG(1+ (inflationRate/100.0))))
	AS ValueOfUSDollarAfterInflation
FROM (VALUES
	(2001, 2.85),
	(2002, 1.58),
	(2003, 2.28),
	(2004, 2.66),
	(2005, 3.39),
	(2006, 3.23),
	(2007, 2.85),
	(2008, 3.84),
	(2009, -0.36),
	(2010, 1.64),
	(2011, 0.99)
	) AS RATES([year], inflationRate)
WHERE [year] BETWEEN @StartYear + 1 AND @EndYear
/*
ValueOfADollarAfterInflation
----------------------------
1.11653740799858
*/

More About This Method

I don’t know the first person who came up with this trick (Most likely Napier). With a quick search, I understand that others have written about its implementation in SQL many times before. In fact, I wouldn’t be surprised if this tip comes up again this T-SQL Tuesday. But I post anyway because I like my examples and had fun writing it.

-- Comments (18)

August 26, 2010

Ten Things I Hate to See in T-SQL

Filed under: SQLServerPedia Syndication,Technical Articles — Tags: "sql server", t-sql — Michael J. Swart @ 11:35 am

So here’s my top ten list of things that I see in written in SQL that I dislike or that irk me in some way. And yeah, in some cases, I would use the word hate.

Fury

They’re not necessarily things that go against best practices. But each one gets under my skin in some way and I thought I’d share that list at the risk of giving my enemies buttons to push.

So without further ado.

10. Code that’s Been Commented Out

It’s in every set of code I’ve ever worked with and I’m sure you’ve seen it too. It’s not even SQL Server specific, but you’ll see it in any code that changes over time. The code that get’s commented out instead of deleted.

I’m guessing that the reasoning behind this is that some developers want to keep code around to make reverts easier, or to keep a sense of history of the code. But that’s what source control systems were built for!

9. The isDeleted Column

“Deleted records aren’t deleted. Look, they’re right there!”

You sometimes see this. Records that are marked as deleted with a flag. It’s sometimes needed to support an undelete feature but it still bugs me.

It’s also confusing when looking at procedures called s_OBJECT_Delete(). Without looking at the definition, does this procedure delete the record, or just mark it as deleted?

8. Fluff in Object Definitions

When I create a table from scratch like:

CREATE TABLE MyTest
(
	Id INT IDENTITY NOT NULL,
	Value1 NVARCHAR (100) NOT NULL,
	Value2 NVARCHAR (max),
	CONSTRAINT PK_MyTest PRIMARY KEY (Id)
)

I can get SQL Server to script out the table for me later on and it will look like this:

/****** Object:  Table [dbo].[MyTest]    Script Date: 08/25/2010 19:08:29 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[MyTest](
	[Id] [int] IDENTITY(1,1) NOT NULL,
	[Value1] [nvarchar](100) NOT NULL,
	[Value2] [nvarchar](max) NULL,
 CONSTRAINT [PK_MyTest] PRIMARY KEY CLUSTERED
(
	[Id] ASC
) WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, 
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, 
ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

All the added stuff is marked in green. This is stuff that is either optional, or that specifies an option that happens to be the default. This is okay, because SQL Server didn’t know if the table was created with options left as the default, or whether they were explicitly set. So the scripter spells it out.

And that’s totally fine. I just get peeved when some of this stuff makes it into checked-in object definitions.

7. Filter in the JOIN Clause Instead of the WHERE Clause

Look at this query:

SELECT p.name, COUNT(1)
FROM Sales.SalesOrderDetail sod
JOIN Production.Product p
	ON sod.ProductID = p.ProductID
	AND p.Name like 'Sport%'
GROUP BY p.Name

I’d rather see that filter on p.Name in a where clause:

SELECT p.name, COUNT(1)
FROM Sales.SalesOrderDetail sod
JOIN Production.Product p
	ON sod.ProductID = p.ProductID
WHERE p.Name like 'Sport%'
GROUP BY p.Name

Of course if you’re dealing with LEFT OUTER JOINS, then it matters where the filter is placed.

6. Concatenating Strings to Build SQL in C#

This is more of a C# gripe than a SQL gripe but it’s my list 🙂

Old C++ developers create strings filled with SQL to be passed to the database. But long statements would get broken up in code like:

string sql = "SELECT name, location " +
    "FROM MyTable " +
    "WHERE id = 12312 " +
    "ORDER BY name";

I hate that format because it makes things impossible to cut and paste. But C# has this nifty string prefix @ that allows newlines inside string literals:

string sql = @"SELECT name, location
    FROM MyTable
    WHERE id = 12312
    ORDER BY name";

5. Concatenating Variables Values Into SQL

Pretty self-explanatory. This practice puts a database in danger of SQL injection attacks.

(Always happy for an excuse to mention little Bobby Tables from xkcd)

4. Triggers that maintain data in other tables

This is a symptom of a database design that’s not normalized. Essentially, the same data is stored in two places. The development effort needed to maintain both tables properly is often more trouble than it’s worth.

Very closely related is the indexed view. But I have nothing but love for these.

3. Missing Foreign Keys

Missing foreign keys are more common than other problems because there’s no obvious symptom to react to:

A new table will support an application just as nicely with or without foreign keys.
Software testers will see no error messages coming from the database.

It’s often later on that data integrity becomes a problem, now you’ve got more things to worry about:

Adding the foreign key.
Cleaning up dirty data.
and Public Relations.

Here’s a post I wrote earlier on finding missing keys.

2. Identifiers That Are Also Reserved Key Words

It’s confusing to have column names, alias names or table names that are also keywords. In some cases the syntax is allowed and what’s left is merely confusing.

I often see this pet peeve when columns are named the same as a particular type. like Date or Text. I’d rather see names like CreateDate, LogDate or CommentText. The following script can identify those:

SELECT c.TABLE_NAME, c.COLUMN_NAME, c.DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS c
JOIN sys.types t
	ON t.name = c.COLUMN_NAME
ORDER BY c.TABLE_NAME, c.COLUMN_NAME

1. Column Names That End in a Digit

Now, I don’t mind stuff like AddressLine1 and AddressLine2. They serve a purpose in their own way.

But normally when I see this it smells like unnormalized database design (not denormalized, there’s a distinction). So this is a good litmus test to check to see whether tables are at least in first normal form.

I use this script to check:

select t.name as table_name,
	c.name as column_name
from sys.tables t
join sys.columns c
	on t.[object_id] = c.[object_id]
where c.name like '%[0-9]'
order by t.name, c.name

~ Thanks to xkcd.com and victorianweb.org for sharing their images

-- Comments (26)

June 3, 2010

Keeping Track Of Root Nodes in a Hierarchy

Filed under: SQLServerPedia Syndication,Technical Articles — Tags: "sql server", hierarchy, t-sql — Michael J. Swart @ 2:53 pm

Takeaway: I explain an interesting problem I encountered. I explain how I maintain data for a table that stores hierarchy info. The problem might not be relevant to you dear reader, but the technique might.

The Problem

So say that you’re keeping track of a hierarchy in a particular table and to make queries faster, you’ve decided to store the a key to the root level node as a column in each row.

Constraints: For whatever reason, the data in this column is untrustworthy and it’s our job to fix it! It’s a huge table, most of the data is okay, and we’d like to fix everything without blocking too much of the table.

Table Description

You’re keeping track of a hierarchies in a table using a foreign key to itself
Parent nodes are indicated using a ParentId column (see table below):
Root nodes are indicated when ParentId is NULL.
BaseObjectId indicates the root node of hiearchies (i.e. this equals ObjectId when ParentObjectId is NULL)
You’re using SQL 2005 and you can’t use 2008’s new hierarchy data type (or you don’t want to).

The Table Diagram would look something like this:

Hierarchy Example

And the code to create this table would look like:

CREATE TABLE HierarchyExample (
	ObjectId int NOT NULL IDENTITY(1,1),
	ParentObjectId int NULL,
	BaseObjectId int NULL,
	OtherDataGoesHere nvarchar(max) NOT NULL,
	CONSTRAINT PK_HierarchyExample
		PRIMARY KEY (ObjectId),
	CONSTRAINT FK_HierarchyExample
		FOREIGN KEY (ParentObjectId)
		REFERENCES HierarchyExample (ObjectId)
)

Example: You might store this kind of data useful if you were really interested in finding the head or root node of the hierarchy. Like with vampires:

Vampire Hierarchy

A Solution

The query:

declare @rc int;
declare @batch int;
select @rc = 1, @batch = 1000;
 
while @rc > 0
begin
      UPDATE TOP (@batch) HE
      SET BaseObjectId  = HE_Parent.BaseObjectId
      FROM HierarchyExample HE
      JOIN HierarchyExample AS HE_Parent
		ON HE.ParentObjectId = HE_Parent.ObjectId
      WHERE HE.BaseObjectId <> HE_Parent.BaseObjectId;
 
      SET @rc = @@ROWCOUNT
end

This script updates the BaseObjectId a thousand rows at a time which keeps the query relatively quick so it doesn’t hold on to locks too long (at least not x-locks). It also handles hierarchies with multiple levels.

Index for Efficiency: The best indexes for this table are the obvious nonclustered covering ones and really turn this query into something fast (using a merge join no less!):

HierarchyExample(ParentObjectId, BaseObjectId, ObjectId)
HierarchyExample(ObjectId) Include (BaseObjectId)

The original solution I tried used recursive CTEs where the performance was less than ideal.

What now?

This is the kind of thing I like working on and I thought I’d share. So what about you? What kind of interesting T-SQL challenges have you come across recently? Leave something in the comments or blog about it and post back here.

-- Comments (2)

April 14, 2010

Forcefully Rename a SQL Server Database

Filed under: SQL Scripts,SQLServerPedia Syndication,Technical Articles — Tags: "sql server", alter database, Error 5030, rename, rename database, sp_renamedb, t-sql — Michael J. Swart @ 3:41 pm

If you’ve ever used Object Explorer (in SSMS) in order to rename a database, you may have come across this error message from Management Studio:

which says:

Unable to rename [database]. (ObjectExplorer)
Additional information:
--> Rename failed for Databse '[databse]'.
	(Microsoft.SqlServer.Smo)
	--> An exception occured while executing a Transact-SQL statement or batch.
		(Microsoft.SqlServer.ConnectionInfo)
		--> The database could not be exclusively locked to perform the operation.
			(Microsoft SQL Server, Error: 5030)

Management Studio is telling you that there are connections that it doesn’t want to mess with.

Or when using sp_renamedb from a query window, you might see this similar error:

Msg 5030, Level 16, State 2, Line 1
The database could not be exclusively locked to perform the operation.

If you really really want to rename the database and don’t care at all about in-flight transactions, then use this script/template I recently came up with:

A Forceful Script:

ALTER DATABASE [old_name]
 
	SET SINGLE_USER WITH ROLLBACK IMMEDIATE
GO
ALTER DATABASE [old_name]
 
	MODIFY NAME = [new_name]
GO
ALTER DATABASE [new_name]
	SET MULTI_USER
GO

Caveat:

Remember that this script failed kindegarten for not playing well with others. So use this script only on QA and Dev boxes. And if you’re going to do something on a production box, make sure you do this only during scheduled downtime.

One More Interesting Thing:

A friend of mine stored this script in One Note and found that when it came to use it, the script failed because One Note had replaced some of the white space with non breaking spaces!

This issue is a One Note issue and applies to all scripts and code, not just T-SQL scripts.

-- Comments (6)