Michael J. Swart

July 20, 2017

SQL Server UPSERT Patterns and Antipatterns

Filed under: Miscellaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 11:34 am
Upsert Patterns and Antipatterns
Good ways and bad ways to update or insert rows

In 2015, Postgres added support for ON CONFLICT DO UPDATE to their INSERT statements. Their docs say that using this syntax guarantees an atomic INSERT or UPDATE outcome; one of those two outcomes is guaranteed, even under high concurrency.

Now, that is one amazing feature, isn't it? Postgres worked really hard to include that feature and kudos to them. I would trade ten features like stretch db for one feature like this. It makes a very common use case simple, and it takes the awkwardness away from the developer.

But if you work with SQL Server, the awkwardness remains and you have to take care of doing UPSERT correctly under high concurrency.

Easy things should be easy

I wrote a post in 2011 called Mythbusting: Concurrent Update/Insert Solutions. But since then, I learned new things, and people have suggested new UPSERT methods. I wanted to bring all those ideas together on one page. Something definitive that I can link to in the future.

The Setup

I’m going to consider multiple definitions of a procedure called s_AccountDetails_Upsert. The procedure will attempt to perform an UPDATE or INSERT to this table:

CREATE TABLE dbo.AccountDetails
(
    Email NVARCHAR(400) NOT NULL CONSTRAINT PK_AccountDetails PRIMARY KEY,
    Created DATETIME NOT NULL DEFAULT GETUTCDATE(),
    Etc NVARCHAR(MAX)
);

Then I’m going to test each definition by using a tool like SQL Query Stress to run this code:

declare @rand float = rand() + DATEPART( second, getdate() );
declare @Email nvarchar(330) = cast( cast( @rand * 100 as int) as nvarchar(100) ) + '@somewhere.com';
declare @Etc nvarchar(max) = cast( newid() as nvarchar(max) );
exec dbo.s_AccountDetails_Upsert @Email, @Etc;

Antipattern: Vanilla Solution

If you don’t care about concurrency, this is a simple implementation. Nothing fancy here:

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  IF EXISTS ( SELECT * FROM dbo.AccountDetails WHERE Email = @Email )
 
    UPDATE dbo.AccountDetails
       SET Etc = @Etc
     WHERE Email = @Email;
 
  ELSE 
 
    INSERT dbo.AccountDetails ( Email, Etc )
    VALUES ( @Email, @Etc );

Unfortunately, under high concurrency, this procedure fails with primary key violations.
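
The failure mode is a classic check-then-act race: two sessions can both pass the EXISTS check before either one inserts. An illustrative interleaving (comments only, not a runnable script; both sessions happen to share the same @Email value):

```sql
-- Session A: IF EXISTS (...)  -- finds no row, falls through to ELSE
-- Session B: IF EXISTS (...)  -- also finds no row, falls through to ELSE
-- Session A: INSERT dbo.AccountDetails ...  -- succeeds
-- Session B: INSERT dbo.AccountDetails ...  -- fails:
--   Msg 2627: Violation of PRIMARY KEY constraint 'PK_AccountDetails'
```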

Antipattern: MERGE Statement

The MERGE statement still suffers from the same concurrency issues as the vanilla solution.

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
    MERGE dbo.AccountDetails AS myTarget
    USING (SELECT @Email Email, @Etc etc) AS mySource
        ON mySource.Email = myTarget.Email
    WHEN MATCHED THEN UPDATE
        SET etc = mySource.etc
    WHEN NOT MATCHED THEN 
        INSERT (Email, Etc) 
        VALUES (@Email, @Etc);

Primary Key violations are still generated under high concurrency. Even though it’s a single statement, it’s not isolated enough.

Antipattern: Inside a Transaction

If you try putting the vanilla solution inside a transaction, it makes the whole thing atomic, but still not isolated. You still get primary key violations:

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  BEGIN TRAN
    IF EXISTS ( SELECT * FROM dbo.AccountDetails WHERE Email = @Email )
 
      UPDATE dbo.AccountDetails
         SET Etc = @Etc
       WHERE Email = @Email;
 
    ELSE 
 
      INSERT dbo.AccountDetails ( Email, Etc )
      VALUES ( @Email, @Etc );
 
  COMMIT

Antipattern: Inside a Serializable Transaction

So this should be isolated enough, but now it’s vulnerable to deadlocks:

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
  BEGIN TRAN
    IF EXISTS ( SELECT * FROM dbo.AccountDetails WHERE Email = @Email )
 
      UPDATE dbo.AccountDetails
         SET Etc = @Etc
       WHERE Email = @Email;
 
    ELSE 
 
      INSERT dbo.AccountDetails ( Email, Etc )
      VALUES ( @Email, @Etc );
 
  COMMIT
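
The deadlocks come from lock conversion. Under SERIALIZABLE, the EXISTS check takes shared key-range locks and holds them until the transaction ends. Shared locks are compatible with each other, so two sessions can both pass the check and then block each other trying to acquire the exclusive locks needed for the INSERT. A sketch of the interleaving (assuming default lock granularity):

```sql
-- Session A: IF EXISTS (...)  -- takes a shared range lock, held until COMMIT
-- Session B: IF EXISTS (...)  -- shared locks are compatible: also succeeds
-- Session A: INSERT ...       -- needs an exclusive range lock; blocked by B
-- Session B: INSERT ...       -- needs an exclusive range lock; blocked by A
-- Deadlock: SQL Server picks a victim and raises error 1205.
```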

Antipattern: Using IGNORE_DUP_KEY

I want to mention one more bad solution before I move on to the good solutions. It works and it’s interesting, but it’s a bad idea: IGNORE_DUP_KEY is a property of the index, not of this procedure, so every INSERT into the table, from any code path, will silently discard duplicates instead of raising an error.

ALTER TABLE dbo.AccountDetails 
  DROP CONSTRAINT PK_AccountDetails;
 
ALTER TABLE dbo.AccountDetails
  ADD CONSTRAINT PK_AccountDetails 
  PRIMARY KEY ( Email ) WITH ( IGNORE_DUP_KEY = ON )
GO
 
CREATE PROCEDURE dbo.s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  -- This statement will not insert if there's already a duplicate row,
  -- But it won't raise an error either
  INSERT dbo.AccountDetails ( Email, Etc ) 
  VALUES ( @Email, @Etc );
 
  UPDATE dbo.AccountDetails 
     SET Etc = @Etc
   WHERE Email = @Email;
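
To see the table-wide side effect, consider a plain INSERT issued by some unrelated piece of code (a sketch; the literal values are made up):

```sql
INSERT dbo.AccountDetails ( Email, Etc )
VALUES ( N'a@somewhere.com', N'first' );
-- (1 row affected)
 
INSERT dbo.AccountDetails ( Email, Etc )
VALUES ( N'a@somewhere.com', N'second' );
-- Duplicate key was ignored.
-- (0 rows affected)
```

The second statement succeeds with only a warning, so a bug that produces duplicates elsewhere in the application goes unnoticed.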

So what are some better solutions?

Pattern: Inside a Serializable Transaction With Lock Hints

Still my preferred solution.

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  BEGIN TRAN
 
    IF EXISTS ( SELECT * FROM dbo.AccountDetails WITH (UPDLOCK) WHERE Email = @Email )
 
      UPDATE dbo.AccountDetails
         SET Etc = @Etc
       WHERE Email = @Email;
 
    ELSE 
 
      INSERT dbo.AccountDetails ( Email, Etc )
      VALUES ( @Email, @Etc );
 
  COMMIT

This is a solid solution, but every implementation is different so every time you use this pattern, test for concurrency.

Pattern: MERGE Statement With Serializable Isolation

Nice, but be careful with MERGE statements.

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
 
    MERGE dbo.AccountDetails AS myTarget
    USING (SELECT @Email Email, @Etc etc) AS mySource
        ON mySource.Email = myTarget.Email
    WHEN MATCHED THEN UPDATE
        SET etc = mySource.etc
    WHEN NOT MATCHED THEN 
        INSERT (Email, Etc) 
        VALUES (@Email, @Etc);

For alternative syntax, skip the SET TRANSACTION ISOLATION LEVEL SERIALIZABLE and include a lock hint: MERGE dbo.AccountDetails WITH (HOLDLOCK) AS myTarget.
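
Spelled out, that alternative is the same procedure with the hint in place of the SET statement:

```sql
CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
    MERGE dbo.AccountDetails WITH (HOLDLOCK) AS myTarget
    USING (SELECT @Email Email, @Etc etc) AS mySource
        ON mySource.Email = myTarget.Email
    WHEN MATCHED THEN UPDATE
        SET etc = mySource.etc
    WHEN NOT MATCHED THEN 
        INSERT (Email, Etc) 
        VALUES (@Email, @Etc);
```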

Pattern: Just Do It

Just attempt the INSERT and, if it fails with a duplicate key error, catch it and do the UPDATE instead:

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
 
BEGIN TRY
 
  INSERT INTO dbo.AccountDetails (Email, Etc) VALUES (@Email, @Etc);  
 
END TRY
 
BEGIN CATCH
 
  -- if the insert failed because the row already exists, update it instead;
  -- rethrow anything else.
  IF ERROR_NUMBER() IN (2601, 2627) 
    UPDATE dbo.AccountDetails
       SET Etc = @Etc
     WHERE Email = @Email;
  ELSE
    THROW;
 
END CATCH

I wrote a post about it in Ugly Pragmatism For The Win.
And I wrote about a rare problem with it in Case study: Troubleshooting Doomed Transactions.

I have no concurrency concerns here, but because of that rare issue, not to mention performance concerns, it’s no longer my first choice.

Performance and Other Concerns

When describing each pattern so far, I haven’t paid attention to performance, just concurrency. There are many opportunities to improve the performance of these solutions. In particular, pay very close attention to your average use case. If your workload UPDATEs a row 99% of the time and INSERTs a row only 1% of the time, then the optimal implementation will look different than if the frequencies are reversed.
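
For example, if updates dominate, an update-first variant skips the existence check entirely. This is a sketch built from the serializable pattern above, inserting only when the UPDATE touched no row (a pattern along these lines also appears in the comments below):

```sql
CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  BEGIN TRAN
 
    -- Try the common case first.
    UPDATE dbo.AccountDetails
       SET Etc = @Etc
     WHERE Email = @Email;
 
    -- @@ROWCOUNT still holds the UPDATE's row count here,
    -- so this INSERT produces a row only when nothing was updated.
    INSERT dbo.AccountDetails ( Email, Etc )
    SELECT @Email, @Etc
    WHERE @@ROWCOUNT = 0;
 
  COMMIT
```

As with the other patterns, test this one for concurrency against your own workload before trusting it.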

Check out these links (thanks Aaron Bertrand)

14 Comments »

  1. Nice work as always, sir. Look at you with the table of contents at the top right! Nifty! Try viewing the post in incognito mode in your browser though – there’s two checkboxes for “Notify me of followup comments via email”.

    Comment by Brent Ozar — July 20, 2017 @ 2:58 pm

  2. (fixed)
    What? You don’t want double the opportunity to be notified that someone on the internet said something!
    Michael

    Comment by Michael J. Swart — July 20, 2017 @ 3:00 pm

  3. Hi,
    I have tried to reproduce the issues, but could not. I have tried only the vanilla solution.
    Could you please share the .sqlstress file?
    Which version and edition of SQL Server do you use?
    Thanks.
    Martin

    Comment by Martin — July 26, 2017 @ 5:42 am

  4. Hi Martin,

    I think there are a few things I could suggest

    • I updated the query in this post to run inside SQL Query Stress; the way emails are generated now, there is a better chance of key collisions
    • I didn’t mention this, but remember to delete all the rows from AccountDetails when you begin each run.
    • When I use query stress, I use 50 threads for 1000 iterations
    • Personally, I’ve had better luck with a program I wrote which invokes tempdb.dbo.s_DoSomething. That’s described under The C# heading on this post: Generating Concurrent Activity. But I recognize that not everyone has a C# compiler on hand.

    Good luck Martin, let me know what works for you, I’ll update the post.

    Comment by Michael J. Swart — July 27, 2017 @ 9:28 am

  5. Great stuff! All SQL devs should read this.

    Comment by Alex Friedman — September 12, 2017 @ 11:20 am

  6. I tend to use MERGE with UPDLOCK, HOLDLOCK, ROWLOCK. (And HOLDLOCK = SERIALIZABLE). That packages everything that’s needed up into one statement.

    Another credible pattern would be using SERIALIZABLE, any other technique, and retry in case of deadlock.

    Comment by tobi — September 12, 2017 @ 1:58 pm

  7. Thanks Tobi,
    The retry by the application in case of deadlock is definitely a good pattern. I wish I had included it.
    I think I’m changing my point of view on it too. I used to think that retrying a failure was a last-resort solution and amounted to admitting defeat.
    But now I think that retries are a necessary part of the transaction landscape, and they become almost necessary when considering more distributed solutions, or other high-availability solutions in general.

    Comment by Michael J. Swart — September 12, 2017 @ 2:28 pm

  8. Hi Michael,
    Have you considered this pattern, it does slightly favour UPDATE over INSERT, but performs well for both?

    ALTER PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
    AS

    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRAN

    UPDATE dbo.AccountDetails
    SET Etc = @Etc
    WHERE Email = @Email

    INSERT dbo.AccountDetails ( Email, Etc )
    SELECT @Email, @Etc
    WHERE @@ROWCOUNT=0

    COMMIT

    Comment by Mister Magoo — October 5, 2017 @ 8:28 pm

  9. Nice! That does indeed work with no deadlocks.
    I like this especially because it avoids hints and the merge statement.

    Comment by Michael J. Swart — October 6, 2017 @ 9:27 am

  10. The “Just do it” solution made me think “Thank God I don’t work with this guy”.

    Comment by Leo — March 7, 2018 @ 11:42 am

  11. Yeah, the feeling’s mutual.

    Comment by Michael J. Swart — March 7, 2018 @ 2:23 pm

  12. Serializable isolation is sort of an anti-pattern, since it hampers concurrent writes to the table being upserted to. SQL Server is going to take out a range lock on the key field.

    For the sake of concurrency, a better pattern is to use sp_getapplock, where you can lock on your key(s). sp_getapplock is extremely performant, and since you can lock on a single key, concurrency doesn’t take a hit. Here’s the gist of the pattern:

    -- Only open a new transaction if there is no outer transaction
    -- (We're going to tie the app lock to our transaction -- which means a rollback or commit will unlock)
    declare @hasOuterTran bit = case when @@trancount > 0 then 1 else 0 end
    if (@hasOuterTran = 1) save tran outerTran
    else begin tran
    begin try
    	declare @lockName varchar(128) = ''
    	declare @result int = 0
    	exec @result = sp_getapplock @Resource = @lockName, @LockMode = 'Exclusive', @LockTimeout = 5000
     
    	if (@result < 0) raiserror('Could not lock...', 16, 1)
    	-- We got the lock
    	else begin
    		MERGE dbo.AccountDetails AS myTarget
    		USING (SELECT @Email Email, @Etc etc) AS mySource
    		ON mySource.Email = myTarget.Email
    		WHEN MATCHED THEN UPDATE
    			SET etc = mySource.etc
    		WHEN NOT MATCHED THEN 
    			INSERT (Email, Etc) 
    			VALUES (@Email, @Etc);
    	end
    	if (@hasOuterTran = 0) commit tran
    end try
    begin catch
    	if (@hasOuterTran = 0) rollback tran
    	else if (xact_state() <> -1) rollback tran outerTran
    	exec Error_Rethrow
    end catch

    Comment by Toofy Snaddlepuff — October 12, 2018 @ 4:43 pm

  13. Hey there Toofy (great name)
    Your experience with sp_getapplock and mine are really really different. In fact, my first reaction to your post was surprise because I’ve always seen sp_getapplock as an enemy of concurrency.

    My dumb blog software stripped out the value that you used for @lockName. But I’m assuming it’s email or something. I have to admit that I’ve never thought about using sp_getapplock that way.

    Correct me if I’m wrong, but the idea is that two inserts that have different keys, but adjacent key values, would block each other because of the range lock (with the same ranges).
    And your method avoids that by locking the value of the key.

    That’s really a cool idea. I’m going to test it out soon.

    Comment by Michael J. Swart — October 12, 2018 @ 5:16 pm

  14. OK. I was actually turned onto sp_getapplock by the former chief architect of Ancestry.com. The amount of concurrency that they support was / is ridiculous, and he said he used it all over the place. In fact, concurrency / load was so high, that the SQL Server tiger team that MS sent them told them to yank all foreign keys as rule #1 on a db of that size and load (since fkeys cause concurrency issues when you’re at that scale).

    You do need to be careful not to leave a lock hanging open. That’s why I prefer the transaction based approach, where a rollback or commit closes the lock. There can also arise deadlock issues as with any other pessimistic locking strategy.

    Granularity of the lock is one benefit, but there are others. For one, you can make the “critical section” as large or small as you like, ie, you can release the lock before the entire batch finishes. Secondly, you’re not locking rows / tables with serializable isolation, which has an impact on other transactions in addition to the current transaction. Remember that serializable isolation includes locking used for repeatable read, which affects reads.

    Comment by Toofy Snaddlepuff — October 15, 2018 @ 4:57 pm
