Michael J. Swart

August 17, 2012

Be Careful with the Merge Statement

Filed under: SQLServerPedia Syndication, Technical Articles — Michael J. Swart @ 8:38 am

So there’s a lot of good documentation provided by Microsoft for SQL Server. Microsoft in general is the best model I know of for how technical documentation should be provided to the public (If you’re laughing, it’s time to take another look).

But to the unfamiliar, it can still be hard to judge good docs from bad (or good docs from better). If you knew better, then you probably wouldn’t need the docs, right? Wouldn’t it be great to have some third party point to a particular article and say “Read this; this is important.”

For example, consider this article: Optimizing MERGE Statement Performance. It’s well written, and it explains how performance gains are possible using the MERGE statement because the source and target datasets can be processed only once. The alternative to the MERGE statement is to process the datasets multiple times using more than one INSERT, UPDATE, or DELETE statement. But then Microsoft goes on to say “… performance gains depend on having correct indexes, joins, and other considerations in place.” That is an understatement, dear readers, and I’d like to call your attention to it.

I’m not going to repeat Microsoft’s advice (it is, like I said, an excellent article), but I am going to add to it by describing some of the dangers of not following their advice.

Why Is MERGE Dangerous?

Here’s why: if you don’t get the “indexes, joins, and other considerations” right, then you risk not only losing the performance benefits you hoped for, but suffering performance problems much, much worse than if you had written the straightforward INSERT, UPDATE and/or DELETE statements in the first place. Because of their complexity, I believe MERGE statements are more vulnerable to bad query plans than other DML statements.

My Own Experience

I’ve been bitten by this problem twice in the past month, the most recent time being yesterday. So for the curious, you could say that this lesson counts as the latest thing about SQL Server I learned the hard way.

Experience #1

The first time was with a MERGE statement that implemented the standard logic of synchronizing an existing set of table rows with a dataset sent from an application:

  • Insert rows that are new
  • Update rows that have changed
  • Delete rows that are gone

But the logic of the whole thing required a fussy “WHEN NOT MATCHED BY SOURCE AND …” clause, and the resulting query plan did something like a full outer join. The query processed the entire target table even though it didn’t need to. If you’re really curious, you can find more details in this similar question I found on Stack Overflow: Sql Server Delete and Merge performance.
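For concreteness, the three-bullet pattern above with its fussy extra filter looks roughly like this. This is a hedged sketch only: every table, column, and variable name here is invented for illustration, not taken from the actual statement in question.

```sql
-- Hypothetical sketch of the insert/update/delete sync pattern.
-- The extra predicate on WHEN NOT MATCHED BY SOURCE scopes the delete,
-- but it can push the optimizer toward scanning the whole target.
MERGE dbo.TargetTable AS t
USING @SourceRows AS s
    ON t.BusinessKey = s.BusinessKey
WHEN MATCHED AND t.SomeValue <> s.SomeValue THEN
    UPDATE SET t.SomeValue = s.SomeValue          -- rows that have changed
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, SomeValue)
    VALUES (s.BusinessKey, s.SomeValue)           -- rows that are new
WHEN NOT MATCHED BY SOURCE AND t.GroupId = @GroupId THEN
    DELETE;                                       -- rows that are gone
```

Note that WHEN NOT MATCHED BY SOURCE has to consider every target row that the ON clause didn’t match, which is why a filter like the one above matters so much to the plan.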

I was about to rewrite the whole thing using only UPDATE/INSERT/DELETE statements, but then my colleague found a way to make the MERGE statement work. Woo hoo.

Experience #2

And then just yesterday, I had another problem with a MERGE statement. Some of the facts:

  • The source query was a CTE and produced about 4000 rows max. That CTE, when run on its own, never took more than 3 seconds on either the test or production databases.
  • The MERGE’s search condition – the ON <merge_search_condition> clause – used columns that matched a unique non-clustered index. Microsoft recommends a clustered index here, but what can you do?
  • Testing with a target table of several thousand rows went just fine (in hindsight, a test db with millions of rows would have saved me).
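The shape of that setup was roughly the following. Again, this is a sketch under stated assumptions: all names are hypothetical, and the real statement was more involved.

```sql
-- The ON clause columns were backed by a unique NONCLUSTERED index,
-- where Microsoft's guidance prefers a unique clustered index.
-- All names below are invented for illustration.
CREATE UNIQUE NONCLUSTERED INDEX IX_TargetTable_BusinessKey
    ON dbo.TargetTable (BusinessKey);

WITH SourceCte AS
(
    SELECT BusinessKey, SomeValue
    FROM dbo.StagingTable    -- ~4000 rows max on its own
)
MERGE dbo.TargetTable AS t
USING SourceCte AS s
    ON t.BusinessKey = s.BusinessKey   -- matches the unique NC index
WHEN MATCHED THEN
    UPDATE SET t.SomeValue = s.SomeValue
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, SomeValue)
    VALUES (s.BusinessKey, s.SomeValue);
```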

But testing is testing and production is production. The optimizer crossed a threshold somewhere and said “Whelp, that’s it. I guess a non-clustered index isn’t going to cut it here any more. Let’s scan the whole thing.” Now most of the time the optimizer comes up with a really good plan. But in this case, it just didn’t.

I plan to rewrite the query as separate INSERT and UPDATE statements and early tests are promising.
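The rewrite I have in mind follows the familiar two-step upsert shape. A minimal sketch, assuming the same invented names as above; each statement gets its own simple, predictable plan:

```sql
-- Hypothetical sketch: the same upsert expressed as separate statements.
BEGIN TRANSACTION;

-- Update rows that already exist and have changed.
UPDATE t
SET    t.SomeValue = s.SomeValue
FROM   dbo.TargetTable AS t
JOIN   @SourceRows     AS s ON s.BusinessKey = t.BusinessKey
WHERE  t.SomeValue <> s.SomeValue;

-- Insert rows that don't exist yet.
INSERT dbo.TargetTable (BusinessKey, SomeValue)
SELECT s.BusinessKey, s.SomeValue
FROM   @SourceRows AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM dbo.TargetTable AS t
                   WHERE t.BusinessKey = s.BusinessKey);

COMMIT TRANSACTION;
```

Two statements mean the source is read twice, but each one joins on the indexed key in an obvious way, which is exactly the predictability I’m after here.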

Where Do We Go Now

Well, as a veteran problem-avoider, I now feel distrustful and wary of the MERGE statement. And maybe I don’t mind the multiple INSERT/UPDATE statements so much any more. If my MERGE statement takes one second to run and my INSERT/UPDATE statements take two, then maybe two seconds is acceptable when the alternative takes so much extra effort to properly test. What do you think?

9 Comments »

  1. “found a way to make the merge statement work.”

    You can’t leave this hanging like that. How did you get it to work?

    Personally I love MERGE, but I have seen similar issues where splitting things out works better.

    Comment by Mike Donnelly — August 17, 2012 @ 3:44 pm

  2. Well, I can’t go into great detail, but it involved experimenting with putting a row filter either in a CTE, in the MERGE’s ON clause, or in the “WHEN NOT MATCHED AND …” clause.

    One of them happened to work… The only reason I didn’t give the exact example (or an anonymized version of it) is because the indexes present on the table made for a unique situation and the solution here will not work (or not work for the same reasons) for other merge problems.

    It reminds me of the opening of Anna Karenina:
    “Happy Merge statements are all alike; every unhappy Merge statement is unhappy in its own way.”

    Comment by Michael J. Swart — August 17, 2012 @ 3:49 pm

  3. On your second experience, since the CTE is your data source, invoke it inline in your merge statement. We have seen marvelous performance improvement by not explicitly naming the source table in the merge command but selecting from it instead.

    Comment by Jim O'Hearn — August 20, 2012 @ 1:05 pm

  4. @Jim
    Really? Wow, I’ll have to try that. But it’s surprising. I would have thought that if the CTE contains no recursive statements, then it behaves identically to what you describe. Thanks again Jim. It gives me another thing to consider.

    Comment by Michael J. Swart — August 20, 2012 @ 1:09 pm

  5. @Jim
    I’ve seen the same thing multiple times with CTEs in views. The view with the CTE performs horribly, but when you take the same query and make it inline the performance gain is huge.

    Comment by Eric Ness — August 21, 2012 @ 1:58 pm

  6. [...] Be Careful with the Merge Statement - A cautionary tale from Michael J. Swart (Blog|Twitter). [...]

    Pingback by Something for the Weekend - SQL Server Links 24/08/12 — August 24, 2012 @ 10:52 am

  7. [...] Be Careful with the Merge Statement [...]

    Pingback by The MERGE statement in SQL Server | James Serra's Blog — September 5, 2012 @ 11:01 am

  8. I know I’ve heard that if a CTE is referenced multiple times in a query that the benefits of using it go down. I wonder if the MERGE statement is doing that behind the scenes.

    Comment by Chris F — September 6, 2012 @ 11:16 am

  9. Good advice. After changing our CSV file processing from using MERGE to UPDATE + INSERT, the time required went down from 23 minutes to 13.5 minutes, a 40% reduction.

    There are 9 CSV files to be loaded into 9 tables. For the smaller ones, MERGE actually performs better. However, for larger CSV files / destination tables, where performance really matters, MERGE performs worse.

    Note that I did follow the “optimization” guide, including sorting the CSV files and adding the ORDER and UNIQUE hints in the OPENROWSET(BULK…) clause.

    I think the fundamental problem is that the query planner for SQL Server simply just sucks. Instead of making the planner faster and better (like what the PostgreSQL guys do), we get plan caching and query hints instead, both of which make everything way more complicated.

    Then, to tick the “we support MERGE statement” checkbox, a shitty implementation got pushed out without a good query planner to drive it.

    Comment by sayap — February 25, 2013 @ 12:25 am
