Michael J. Swart

January 10, 2018

100 Percent Online Deployments: Blue Green Details

Filed under: SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 9:54 am
100 Percent Online Deployments
How to deploy schema changes without scheduled downtime

So now for the nitty gritty. In my last post, Blue-Green Deployment, I talked about replacing old blue things with new green things as an alternative to altering things. But Blue-Green doesn’t work with databases, so I introduced the Blue-Aqua-Green method. This helps keep databases and other services online 24/7.

The Aqua Database

What does the Aqua database look like? It’s a smaller version of Blue-Green, but only for those database objects that are being modified. Borrowing some icons from Management Studio’s Object Explorer, here’s what one Blue-Aqua-Green migration might look like:

Start with a database in the original blue state:

After the pre-migration scripts run, the database is in the aqua state: the new green objects have been created and are ready for traffic from the green application servers. Any type of database object can use the Blue-Green method, even objects as granular as indexes or columns.

Finally, when the load has switched over to the green servers and they’re nice and stable, run the post-migration steps to get to the green state.

Blue-Green for Database Objects

How is the Blue-Green method applied to each kind of database object? With care. Each kind of object has its own subtle differences.

PROCEDURES:
Procedures are very easy to Blue-Green. Brand new procedures are added during the pre-migration phase. Obsolete procedures are dropped during the post-migration phase.

If the procedure is changing but is logically the same, then it can be altered during the pre-migration phase. This is common when the only change to a procedure is a performance improvement.

But if the procedure is changing in other ways (for instance, when a parameter is added or dropped, or the resultset is changing), then use the Blue-Green method to replace it. During the pre-migration phase, create a new version of the procedure. It must be named differently, and the green version of the application has to be updated to call the new procedure. The original blue version of the procedure is dropped during the post-migration phase. It’s not always elegant calling a procedure something like s_USERS_Create_v2, but it works.
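As a sketch of what those two phases might contain (the procedure, table and parameter names here are invented for illustration):

```sql
-- Pre-migration: create the green version alongside the blue one.
-- s_USERS_Create and dbo.USERS are hypothetical names.
CREATE PROCEDURE dbo.s_USERS_Create_v2
    @DisplayName NVARCHAR(100),
    @Email       NVARCHAR(400)  -- new parameter in the green version
AS
BEGIN
    INSERT dbo.USERS ( DisplayName, Email )
    VALUES ( @DisplayName, @Email );
END
GO

-- Post-migration: once no blue servers remain, drop the blue version.
DROP PROCEDURE dbo.s_USERS_Create;
```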

VIEWS:
Views are treated the same as procedures, with the exception of indexed views.
That SCHEMA_BINDING keyword is a real thorn in the side of Blue-Green and online migrations in general. If you’re going to use indexed views, remember that you can’t change the underlying tables as easily.

Creating an index on a view is difficult because the ONLINE = ON option can’t be used. If you want to get fancy, go look at How to Create Indexed Views Online.

INDEXES:
The creation of other indexes is nice and easy if you have Enterprise Edition because you can use the ONLINE = ON option. But if you’re on Standard Edition, you’re a bit stuck. In SQL Server 2016 SP1, Microsoft brought a whole bunch of Enterprise features to Standard, but online index builds didn’t make the cut.

If necessary, the Blue-Green process works for indexes that need to be altered too. The blue index and the green index will exist at the same time during the aqua phase, but that’s usually acceptable.
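A rough sketch of that aqua phase for an index (names invented; ONLINE = ON assumes Enterprise Edition):

```sql
-- Pre-migration: create the green index online, next to the blue one.
CREATE INDEX IX_USERS_Email_v2
    ON dbo.USERS ( Email )
    INCLUDE ( DisplayName )
    WITH ( ONLINE = ON );

-- Post-migration: drop the blue index once the green servers are stable.
DROP INDEX IX_USERS_Email ON dbo.USERS;
```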

CONSTRAINTS:
Creating constraints like CHECK and FOREIGN KEY constraints can be tricky because they require size-of-data scans, which can block activity for the duration of the scan.

My preferred approach is to use the WITH NOCHECK syntax. The constraint is created and enabled, but existing data is not looked at. The constraint will be enforced for any future rows that get updated or inserted.

That seems kind of weird at first. The constraint is marked internally as not trusted. For peace of mind, you could always run a query on the existing data.
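A sketch of what that might look like (the table, column and constraint names are made up):

```sql
-- Create and enable the constraint without scanning existing data.
ALTER TABLE dbo.USERS WITH NOCHECK
    ADD CONSTRAINT FK_USERS_AccountId
    FOREIGN KEY ( AccountId ) REFERENCES dbo.ACCOUNTS ( AccountId );

-- SQL Server marks the constraint as not trusted:
SELECT name, is_not_trusted
  FROM sys.foreign_keys
 WHERE name = 'FK_USERS_AccountId';

-- For peace of mind, check the existing rows yourself:
SELECT COUNT(*) AS OrphanRows
  FROM dbo.USERS u
 WHERE NOT EXISTS ( SELECT *
                      FROM dbo.ACCOUNTS a
                     WHERE a.AccountId = u.AccountId );
```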

TABLES:
The creation of tables doesn’t present any problems: it’s done in the pre-migration phase. Dropping tables is done in the post-migration phase.

What about altering tables? Does the Blue-Green method work? Replacing a table while online is hard because it involves co-ordinating changes during the aqua phase. One technique is to create a temporary table, populate it, keep it in sync and cut over to it during the switch. It sounds difficult. It requires time, space, triggers and an eye for detail. Some years ago, I implemented this strategy on a really complicated table and blogged about it if you want to see what that looks like.

If this seems daunting, take heart. A lot of this work can be avoided by going more granular: When possible, Blue-Green columns instead.

COLUMNS:
New columns are created during the pre-migration phase. If the table is large, then the new columns should be nullable or have a default value. Old columns are removed during the post-migration phase.
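In script form, the two phases might look like this (names invented for illustration):

```sql
-- Pre-migration: a nullable column (or one with a default) is a quick
-- metadata-only change, even on a large table.
ALTER TABLE dbo.USERS ADD DisplayName NVARCHAR(100) NULL;

-- Post-migration: remove the obsolete blue column.
ALTER TABLE dbo.USERS DROP COLUMN UserName;
```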

Altering columns is sometimes easy. Most of the time it’s quick, such as when the change only involves metadata.

But sometimes it’s not easy. When altering columns on a large table, it may be necessary to use the Blue-Green technique to replace a column. Then you have to use triggers and co-ordinate the changes with the application, but the process is much easier than doing it for a whole table. Test well and make sure each step is “OLTP-Friendly”. I will give an example of the Blue-Green method for a tricky column in the post “Stage and Switch”.

Persisted computed columns can be challenging. Creating them on large tables can lock the table for too long. Sometimes indexed views fill the same need.

DATA:
Technically, data changes are not schema changes, but migration scripts often require data changes too, so it’s important to keep those online as well. See my next post, “Keep Changes OLTP-Friendly”.

Automation

Easy things should be easy and hard things should be possible, and this applies to writing migration scripts. Steve Jones asked me on Twitter about “some more complex idempotent code”. He would like to see an example of a migration script that is re-runnable when making schema changes. I have the benefit of some helper migration procedures we wrote at work. So a migration script that I write might look something like this:

declare @Columns migration.indexColumnSet;
 
INSERT @Columns (column_name, is_descending_key, is_included_column)
VALUES ('UserId', 0, 0)
 
exec migration.s_INDEX_AlterOrCreateNonClustered_Online
	@ObjectName = 'SOME_TABLE',
	@IndexName = 'IX_SOME_TABLE_UserId',
	@IndexColumns = @Columns;

We’ve got these helper scripts for most standard changes. Unfortunately, I can’t share the definition of s_INDEX_AlterOrCreateNonClustered_Online because it’s not open source. But if you know of any products or open source scripts that do the same job, let me know. I’d be happy to link to them here.

Where To Next?

So that’s Blue-Green, or more accurately Blue-Aqua-Green. Decoupling database changes from application changes allows instant cut-overs. In the next post Keep Changes OLTP-Friendly I talk about what migration scripts are safe to run concurrently with busy OLTP traffic.

January 8, 2018

100 Percent Online Deployments: Blue-Green Deployment

Filed under: SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 12:06 pm
100 Percent Online Deployments
How to deploy schema changes without scheduled downtime

The Blue-Green technique is a really effective way to update services without requiring downtime. One of the earliest references I could find for the Blue-Green method is in a book called Continuous Delivery by Humble and Farley. Martin Fowler also gives a good overview of it at BlueGreenDeployment. Here’s a diagram of the typical blue green method (adapted from Martin Fowler).

When using the Blue-Green method, basically nothing gets changed. Instead everything gets replaced. We start by setting up a new environment – the green environment – and then cut over to it when we’re ready. Once we cut over to the new environment successfully, we’re free to remove the original blue environment. The technique is all about replacing components rather than altering components.


Before I talk about the database (databases), notice a couple of things. We need a router: load balancers are used to distribute requests, but they can also be used to route requests, which enables the quick cut-over. The web servers or application servers have to be stateless as well.

What About The Database Switch?

The two databases in the diagram really threw me for a loop the first time I saw this. This whole thing only works if you can replace the database on a whim. I don’t know about you, but this simply doesn’t work for us. The Continuous Delivery book suggests putting the database into read-only mode, but implementing a temporary read-only mode for applications is difficult and rare (I’ve only ever heard of Stack Overflow doing something like this successfully).

But we don’t do that. We want our application to be 100% online for reads and writes. We’ve modified the Blue-Green method to work for us. Here’s how we change things:

Modified Blue-Green: The Aqua Database

Leave the database where it is and decouple the database changes from the application changes. Make database changes ahead of time so that the db can serve blue or green servers. We call this forward-compatible database version “aqua”.

The changes that are applied ahead of time are “pre-migration” scripts. The changes we apply afterwards are “post-migration” scripts. More on those later. So now our modified Blue-Green migration looks like this:

Start with the original unchanged state of a system:

Add some new green servers:

Apply the pre-migration scripts to the database. The database is now in an “aqua” state:

Do the switch!

Apply the post-migration scripts to the database. The database is now in the new “green” state:

Then remove the unused blue servers when you’re ready:

We stopped short of replacing the entire database. That “aqua” state for the database is the Blue-Green technique applied to individual database objects. In my next post, I go into a lot more detail about this aqua state, with examples of what these kinds of changes look like.

Ease in!

It takes a long time to move to a Blue-Green process. It took us a few years. But it’s possible to chase some short-term intermediate goals which pay off early:

Start with the goal of minimizing downtime. For example, create a pre-migration folder. This folder contains migration scripts that can be run online before the maintenance window. The purpose is to reduce the amount of offline time. New objects like views or tables can be created early, new indexes too.

Process changes are often disruptive, and the move to Blue-Green is no different. It’s good, then, to change the process in smaller steps (each step with its own benefits).

After adding the pre-migration folder, continue adding folders. Each new folder involves a corresponding change in process. So over time, the folder structure evolves:

  • The original process has all changes made during an offline maintenance window. Make sure those change scripts are checked into source control and put them in a folder called offline: (offline)
  • Then add a pre-migration folder as described above: (pre, offline)
  • Next add a post-migration folder which can also be run while online: (pre, offline, post)
  • Drop the offline step to be fully online: (pre, post)

Safety

Automated deployments allow for more frequent deployments. Automated tools and scripts are great at taking on the burden of menial work, but they’re not too good at thinking on their feet when it comes to troubleshooting unexpected problems. That’s where safety comes in. By safety, I just mean that as many risks are mitigated as possible. For example:

Re-runnable Scripts
If things go wrong, it should be easy to get back on track. This is less of an issue if each migration script is re-runnable. By re-runnable, I just mean that the migration script can run twice without error. Get comfortable with system tables and begin using IF EXISTS everywhere:

-- not re-runnable:
CREATE INDEX IX_RUN_ONCE ON dbo.RUN_ONCE (RunOnceId);

The re-runnable version:

-- re-runnable:
IF NOT EXISTS (SELECT * 
                 FROM sys.indexes 
                WHERE name = 'IX_RUN_MANY' 
                  AND OBJECT_NAME(object_id) = 'RUN_MANY' 
                  AND OBJECT_SCHEMA_NAME(object_id) = 'dbo')
BEGIN
    CREATE INDEX IX_RUN_MANY ON dbo.RUN_MANY (RunManyId);
END
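The same idea works for other object types. For a column, COL_LENGTH makes a compact existence check (table and column names here are examples):

```sql
-- re-runnable column add:
IF COL_LENGTH('dbo.RUN_MANY', 'NewColumn') IS NULL
BEGIN
    ALTER TABLE dbo.RUN_MANY ADD NewColumn INT NULL;
END
```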

Avoid Schema Drift
Avoid errors caused by schema drift by asserting the schema before a deployment. Unexpected schema definitions lead to one of the largest classes of migration script errors. Errors that surprise us like “What do you mean there’s a foreign key pointing to the table I want to drop? That didn’t happen in staging!”

Schema drift is real and almost inevitable if you don’t look for it. Tools like SQL Compare are built to help you keep an eye on what you’ve got versus what’s expected. I’m sure there are other tools that do the same job. SQL Compare is just a tool I’ve used and like.

Schema Timing
When scripts are meant to be run online, duration becomes a huge factor, so it needs to be measured.

When a large number of people contribute migration scripts, it’s important to keep an eye on the duration of those scripts. We’ve set up a nightly restore and migration of a sample database to measure the duration of those scripts. If any script takes a long time and deserves extra scrutiny, then it’s better to find out early.

Measuring the duration of these migration scripts helps us determine whether they are “OLTP-Friendly” which I elaborate on in Keep Changes OLTP Friendly.

Tackle Tedious Tasks with Automation

That’s a lot of extra steps and it sounds like a lot of extra work. It certainly is and the key here is automation. Remember that laziness is one of the three great virtues of a programmer. It’s the “quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs…”. That idea is still true today.

Coming Next: Blue-Green Deployment (Details).

January 5, 2018

100% Online Deployments

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 10:01 am
100 Percent Online Deployments
How to deploy schema changes without scheduled downtime

There’s increasing pressure to keep software services available all the time. But there’s also pressure to deploy improvements frequently. How many of us would love to reduce the duration of migration windows or, better yet, eliminate them entirely? It’s always been challenging to make changes safely without interrupting availability, especially database schema changes.

And when responsibilities are split between different groups – DBAs responsible for availability, developers responsible for delivering improvements – it causes some tension:

You truly belong with us in the cloud.

So I’m beginning a series describing the “Blue-Green” continuous delivery technique. It’s a technique that works really well where I work. It helps us manage thousands of databases on hundreds of servers with monthly software updates and zero downtime.

Outline

Blue-Green
We use the Blue-Green deployment techniques described in Continuous Delivery by Humble and Farley and made popular by Martin Fowler in Blue Green Deployment. I’ll describe what the Blue-Green deployment technique is and how we use it.

We actually don’t follow the book perfectly. I’ll describe what we do differently, and why.

OLTP-Friendly
With some effort and creativity, we can break our database migrations into small chunks and deploy them while the application is still live.

Many changes to schema will lock tables for longer than we can tolerate. Often, the schema changes will need to take a brief SCH-M lock on certain objects and so this technique works best with OLTP workloads (workloads that don’t send many long-running queries).

I explore ways to make schema changes that run concurrently with an OLTP workload. What kinds of changes are easy to deploy concurrently and what kind of changes are enemies of concurrency?

Co-ordination is Key

This series is meant to help people investing in automation and other process improvements. It takes careful co-ordination between those responsible for uptime and those responsible for delivering improvements. So in your organization, if Dev Vader and DBA Calrissian are on the same team (or even the same person) then this series is especially for you.

I know this topic really well. It should be fun.

Coming Next: Blue-Green Deployment

December 20, 2017

When Measuring Timespans, try DATEADD instead of DATEDIFF

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 11:25 am

Recently I tackled an issue where a DateTime field was getting updated too often. The query looked something like this:

UPDATE dbo.SOMETABLE
SET MyDateTime = GETUTCDATE()
WHERE SomeTableId = @SomeTableId;

So I decided to give up accuracy for concurrency. Specifically, I decided to only update MyDateTime if the existing value was more than a second ago.

First Attempt: Use DATEDIFF With SECOND

My first attempt looked like this:

UPDATE dbo.SOMETABLE
SET MyDateTime = GETUTCDATE()
WHERE SomeTableId = @SomeTableId
AND DATEDIFF(SECOND, MyDateTime, GETUTCDATE()) > 1;

But I came across some problems. I assumed that DATEDIFF worked this way: subtract the two dates to get a timespan value, then return the number of seconds (rounded somehow) in that timespan.

But that’s not how it works. The docs for DATEDIFF say:

“Returns the count (signed integer) of the specified datepart boundaries crossed between the specified startdate and enddate.”

There’s no rounding involved. It just counts the ticks on the clock that are heard during a given timespan.
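Two quick queries show the boundary-counting behavior:

```sql
-- Only 6 milliseconds apart, but one second boundary (:01) is crossed:
SELECT DATEDIFF(SECOND, '2017-12-20 10:00:00.997',
                        '2017-12-20 10:00:01.003'); -- returns 1

-- 990 milliseconds apart, but no second boundary is crossed:
SELECT DATEDIFF(SECOND, '2017-12-20 10:00:00.003',
                        '2017-12-20 10:00:00.993'); -- returns 0
```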

Check out this timeline. It shows three timespans and the DATEDIFF values that get reported:

But that’s not the behavior I want.

How About DATEDIFF with MILLISECOND?

Using milliseconds gets a little more accurate:

UPDATE dbo.SOMETABLE
SET MyDateTime = GETUTCDATE()
WHERE SomeTableId = @SomeTableId
AND DATEDIFF(MILLISECOND, MyDateTime, GETUTCDATE()) > 1000;

And it would be good for what I need, except that DATEDIFF using MILLISECOND will overflow for any timespan longer than about 25 days (the count no longer fits in an int). For example,

SELECT DATEDIFF (millisecond, '2017-11-01', '2017-12-01')

gives this error:

Msg 535, Level 16, State 0, Line 1
The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart.

SQL Server 2016 introduced DATEDIFF_BIG to get around this specific problem. But I’m not there yet.

Use DATEADD

I eventually realized that I don’t actually need to measure a timespan. I really just need to answer the question “Does a particular DateTime occur before one second ago?” And I can do that with DATEADD:

UPDATE dbo.SOMETABLE
SET MyDateTime = GETUTCDATE()
WHERE SomeTableId = @SomeTableId
AND MyDateTime < DATEADD(SECOND, -1, GETUTCDATE()) ;

Update: Adam Machanic points out another benefit of this syntax. The predicate AND MyDateTime < DATEADD(SECOND, -1, GETUTCDATE()) is SARGable (unlike the DATEDIFF examples). Even though there might not be a supporting index, or SQL Server might not choose to use one in this specific case, I prefer this syntax even more.

So How About You?

Do you use DATEDIFF at all? Why? I'd like to hear about what you use it for. Especially if you rely on the datepart boundary crossing behavior.

November 10, 2017

Postponing Our Use of In Memory OLTP

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 9:30 am

Devops, Agile and Continuous Delivery

Sometimes I get asked about work and our software development practices. Often these questions use words like agile, devops, or continuous delivery as in “Are you agile?” or “Do you do continuous delivery?”. But these questions rarely have yes or no answers. I always want to answer “It’s a work in progress”.

One of the things I like best about the PASS Summit is the opportunity to talk to people to find out how they do development. After speaking with others, it seems like everyone’s processes are always going to be works-in-progress. But I’ve come to realize that at D2L, we’re pretty far along. Here are just some of the things we do:

Things We Do

  • Deployments to production are scheduled and almost completely automatic. This lets us deploy way more often than we used to.
  • All code – including procedures and table definitions – is checked in.
  • Automatic tests are run on every pull request and merge.
  • Safety is a huge concern. We use feature flags and other techniques, but it remains difficult to maintain large complicated systems safely.
  • We use blue-green deployments for zero downtime.
  • The database layer is an exception to this blue-green technique. So it’s very important to be able to rollback any changes or configuration.

Sardines and Whales

This means we must also support thousands of copies of our main database. They’re used for client sites, test sites, qa sites, or whatever. So that leads to a variety of server configurations that I refer to as sardines and whales:

Look at those sardines. They’re quite happy where they are. The server can handle up to a thousand databases when there’s almost no activity.

But that whale is on a huge server and is extremely busy. Because of the high volume of transactions, we sometimes encounter tempdb contention due to our frequent use of table valued parameters. One technique I’ve been looking forward to evaluating is using memory optimized table types.

Maybe We Can Use In Memory OLTP?

I’m actually not very interested in memory optimized tables. I’m much more interested in the memory optimized table types. Those types can be used for table valued parameters. I can’t tell you how excited I was that it might solve my tempdb pet peeve.
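What we hoped to use looked something like this (a sketch; the type name and bucket count are invented, and the database needs a MEMORY_OPTIMIZED_DATA filegroup first):

```sql
-- A memory-optimized table type for table valued parameters.
-- TVPs declared with this type live in memory instead of tempdb.
CREATE TYPE dbo.UserIdList AS TABLE
(
    UserId INT NOT NULL,
    INDEX IX_UserId NONCLUSTERED HASH ( UserId )
        WITH ( BUCKET_COUNT = 1024 )
)
WITH ( MEMORY_OPTIMIZED = ON );
```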

But our dreams for the feature died.

We’re leaving the feature behind for a few reasons. There’s an assumption we relied on for the sardine servers: databases that contain no data and serve no activity should not require significant resources like disk space or memory. However, when we turned on In Memory OLTP by adding the filegroup for the memory-optimized data, we found that the database began consuming memory and disk (about 2 gigabytes of disk per database). This required extra resources for the sardine servers. So, for example, 1,000 databases * 2 GB = 2 TB of disk for a server that should be empty.

Another reason is that checkpoints began to take longer. Checkpoints are not guaranteed to be quick, and even on small systems they take a while, which impacts some of our Continuous Integration workflows.

At the PASS Summit, I talked to a Hekaton expert panel. I also talked to a couple people in the Microsoft SQL Server clinic about some of my issues. They all recommended that we upgrade to SQL Server 2016 (which we can’t yet). Maybe I didn’t phrase my questions well, but I didn’t come away with any useful strategy to pursue.

I later talked to Speaker Idol contestant Brian Carrig (@briancarrig) after hearing him talk briefly about his experiences with In Memory OLTP. He mentioned his own hard-fought lessons with In Memory OLTP, including some uncomfortable outages.

The final nail in the coffin, as it were, is that once you turn on In Memory OLTP, you can’t turn it off. Once the In Memory OLTP filegroup is added, it’s there for good. Like I said, safety is a huge concern for us so we’re giving up on the feature for now.

Resurrecting the Feature?

The feature was designed for whales, not sardines. Maybe someday we will try to get those sardine servers to not fail with In Memory OLTP. Until then, the feature goes back on the shelf.

August 2, 2017

Problem With Too Many version_ghost_records

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 8:00 am

We encountered a CPU issue which took over a month to understand and fix. Since then, it’s been another couple months and so I think it may be time for a debrief.

The cause was identified as a growing number of ghost records that SQL Server would not clean up no matter what. Our fix was ultimately to restart SQL Server.


Symptoms and Data Collection

Here’s what we found.

  • At time marked ‘A’ on the graph, we noticed that CPU increased dramatically. It was hard not to notice.
  • We used sp_whoisactive to identify a frequent query that was taking a large amount of CPU. That query had never been a problem before. It was a select against a queue table – a table whose purpose was to store data for an implementation of a work queue. This table had a lot of churn: many INSERTS and DELETES. But it was small, never more than 100 rows.
  • So next, we ran the query manually in Management Studio. SET STATISTICS IO, TIME ON gave us a surprise. A simple COUNT(*) of the table told us there were 30 rows in the table, but reading it took 800K logical page reads!
  • What pages could it be reading? It’s impossible for a table to be that fragmented; it would mean less than one row per page. To look at the physical stats we ran:
    select 
           sum(record_count) as records,
           sum(ghost_record_count) as ghost_records,
           sum(version_ghost_record_count) as version_ghost_records
      from sys.dm_db_index_physical_stats(db_id(), object_id('<table_name>'), default, default, 'detailed')
     where index_id = 1
           and index_level = 0

    And that gave us these results:


    Interesting. The ghost records that remain are version_ghost_records, not ghost_records, which sounds like we’re using some sort of snapshot isolation (which we’re not), or online reindexing (which we are), or something else that uses row versions.

  • Over time, those version_ghost_records would constantly accumulate. This ghost record accumulation was also present in all other tables, but it didn’t hurt as much as in the queue table, which had the most frequent deletes.

Mitigation – Rebuild Indexes

Does an index rebuild clean these up? In this case, yes. An index rebuild reduced the number of version ghost records for the table. At the time marked ‘B’ in the timeline, we saw that an index rebuild cleaned up these records and restored performance. But only temporarily. The version_ghost_records continued to accumulate gradually.

At time ‘C’ in the timeline, we created a job that ran every 15 minutes to rebuild the index. This restored performance to acceptable levels.
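The job itself only needs one statement per affected index; something like this (the table and index names are invented, and ONLINE = ON assumes Enterprise Edition):

```sql
-- Rebuild the queue table's clustered index to flush out the
-- accumulated version ghost records.
ALTER INDEX PK_WorkQueue ON dbo.WorkQueue
    REBUILD WITH ( ONLINE = ON );
```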

More investigation online

Kendra Little – Why Is My Transaction Log Growing In My Availability Group?
This is a great video. Kendra describes a similar problem. Long running queries on secondary replicas can impact the primary server in an Availability Group (AG). But we found no long running queries on any replica. We looked using sp_whoisactive and DBCC OPENTRAN. We didn’t see any open transactions anywhere.

Amit Banerjee – Chasing the Ghost Cleanup in an Availability Group
Amit also mentions that log truncation would be prevented in the case of a long-running query on a replica. However, in our case, log truncation was occurring.

Uwe Ricken – Read Committed Snapshot Isolation and high number of version_ghost_record_count
Uwe Ricken also blogged recently about a growing number of version_ghost_records. He talked about looking for open transactions that use one of the snapshot isolation levels. Unfortunately it didn’t apply to our case.

Bitemo Erik Gergely – A very slow SQL Server
Another example of a long running query keeping version_ghost_records around.

dba.stackexchange – GHOST_CLEANUP Lots of IO and CPU Usage
This stackexchange question also describes a problem with lots of CPU and an inefficient, ineffective ghost cleanup task for databases in an AG. There’s an accepted answer there, but it’s not really a solution.

Calling Microsoft Support

So we called Microsoft support. We didn’t really get anywhere. We spoke with many people over a month. We generated memory dumps, PSSDiag sessions and we conducted a couple screen sharing sessions. Everyone was equally stumped.

After much time and many diagnostic experiments, here’s what we did find.

  • Availability Groups with readable secondaries are necessary (but not sufficient) to see the problem. This is where the version_ghost_records come from in the first place. Readable secondaries make use of the version store.
  • We ran an extended event histogram on the ghost_cleanup event. There were a ridiculous number of events, like millions per minute, but they weren’t actually cleaning up anything:
    CREATE EVENT SESSION ghostbusters ON SERVER 
    ADD EVENT sqlserver.ghost_cleanup( ACTION( sqlserver.database_id ) )
    ADD TARGET package0.histogram( SET filtering_event_name=N'sqlserver.ghost_cleanup', source=N'sqlserver.database_id' )
  • Microsoft let me down. They couldn’t figure out what was wrong. Each new person on the case had to be convinced that there were no open transactions. They couldn’t reproduce our problem. And on our server, we couldn’t avoid the problem.

Resolution

Ultimately the time came for an unrelated maintenance task. We had to do a rolling upgrade for some hardware driver update. We manually failed over the Availability Group and after that, no more problem!

It’s satisfying and not satisfying. Satisfying because the problem went away for us and we haven’t seen it since. After the amount of time I spent on this, I’m happy to leave this problem in the past.

But it’s not satisfying because we didn’t crack the mystery. And restarting SQL Server is an extreme solution for a problem associated with an “Always On” feature.

If you’re in the same boat as I was, go through the links in this post. Understand your environment. Look for long running queries on all replicas. And when you’ve exhausted other solutions, mitigate with frequent index rebuilds and go ahead and restart SQL Server when you can.

July 20, 2017

SQL Server UPSERT Patterns and Antipatterns

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 11:34 am
Upsert Patterns and Antipatterns
Good ways and bad ways to update or insert rows

In 2015, Postgres added support for ON CONFLICT DO UPDATE to their INSERT statements. Their docs say that using this syntax guarantees an atomic INSERT or UPDATE outcome; one of those two outcomes is guaranteed, even under high concurrency.

Now, that is one amazing feature, isn’t it? Postgres worked really hard to include that feature, and kudos to them. I would trade ten features like Stretch DB for one feature like this. It makes a very common use case simple. And it takes away the awkwardness from the developer.

But if you work with SQL Server, the awkwardness remains and you have to take care of doing UPSERT correctly under high concurrency.

Easy things should be easy

I wrote a post in 2011 called Mythbusting: Concurrent Update/Insert Solutions. But since then, I learned new things, and people have suggested new UPSERT methods. I wanted to bring all those ideas together on one page. Something definitive that I can link to in the future.

The Setup

I’m going to consider multiple definitions of a procedure called s_AccountDetails_Upsert. The procedure will attempt to perform an UPDATE or INSERT to this table:

CREATE TABLE dbo.AccountDetails
(
    Email NVARCHAR(400) NOT NULL CONSTRAINT PK_AccountDetails PRIMARY KEY,
    Created DATETIME NOT NULL DEFAULT GETUTCDATE(),
    Etc NVARCHAR(MAX)
);

Then I’m going to test each definition by using a tool like SQL Query Stress to run this code:

declare @rand float = rand() + DATEPART( second, getdate() );
declare @Email nvarchar(330) = cast( cast( @rand * 100 as int) as nvarchar(100) ) + '@somewhere.com';
declare @Etc nvarchar(max) = cast( newid() as nvarchar(max) );
exec dbo.s_AccountDetails_Upsert @Email, @Etc;

Antipattern: Vanilla Solution

If you don’t care about concurrency, this is a simple implementation. Nothing fancy here:

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  IF EXISTS ( SELECT * FROM dbo.AccountDetails WHERE Email = @Email )
 
    UPDATE dbo.AccountDetails
       SET Etc = @Etc
     WHERE Email = @Email;
 
  ELSE 
 
    INSERT dbo.AccountDetails ( Email, Etc )
    VALUES ( @Email, @Etc );

Unfortunately, under high concurrency, this procedure fails with primary key violations.

Antipattern: MERGE Statement

The MERGE statement still suffers from the same concurrency issues as the vanilla solution.

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
    MERGE dbo.AccountDetails AS myTarget
    USING (SELECT @Email Email, @Etc etc) AS mySource
        ON mySource.Email = myTarget.Email
    WHEN MATCHED THEN UPDATE
        SET etc = mySource.etc
    WHEN NOT MATCHED THEN 
        INSERT (Email, Etc) 
        VALUES (@Email, @Etc);

Primary Key violations are still generated under high concurrency. Even though it’s a single statement, it’s not isolated enough.

Antipattern: Inside a Transaction

If you try putting the vanilla solution inside a transaction, it makes the whole thing atomic, but still not isolated. You still get primary key violations:

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  BEGIN TRAN
    IF EXISTS ( SELECT * FROM dbo.AccountDetails WHERE Email = @Email )
 
      UPDATE dbo.AccountDetails
         SET Etc = @Etc
       WHERE Email = @Email;
 
    ELSE 
 
      INSERT dbo.AccountDetails ( Email, Etc )
      VALUES ( @Email, @Etc );
 
  COMMIT

Antipattern: Inside a Serializable Transaction

So this should be isolated enough, but now it’s vulnerable to deadlocks:

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
  BEGIN TRAN
    IF EXISTS ( SELECT * FROM dbo.AccountDetails WHERE Email = @Email )
 
      UPDATE dbo.AccountDetails
         SET Etc = @Etc
       WHERE Email = @Email;
 
    ELSE 
 
      INSERT dbo.AccountDetails ( Email, Etc )
      VALUES ( @Email, @Etc );
 
  COMMIT

Antipattern: Using IGNORE_DUP_KEY

I want to mention one more bad solution before I move on to the good solutions. It works and it’s interesting, but it’s a bad idea: IGNORE_DUP_KEY changes the behavior of the constraint for every INSERT against the table, not just for this procedure, so duplicates everywhere else get silently swallowed too.

ALTER TABLE dbo.AccountDetails 
  DROP CONSTRAINT PK_AccountDetails;
 
ALTER TABLE dbo.AccountDetails
  ADD CONSTRAINT PK_AccountDetails 
  PRIMARY KEY ( Email ) WITH ( IGNORE_DUP_KEY = ON )
GO
 
CREATE PROCEDURE dbo.s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  -- This statement will not insert if there's already a duplicate row,
  -- But it won't raise an error either
  INSERT dbo.AccountDetails ( Email, Etc ) 
  VALUES ( @Email, @Etc );
 
  UPDATE dbo.AccountDetails 
     SET Etc = @Etc
   WHERE Email = @Email;

So what are some better solutions?

Pattern: Inside a Serializable Transaction With Lock Hints

Still my preferred solution.

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  BEGIN TRAN
 
    IF EXISTS ( SELECT * FROM dbo.AccountDetails WITH (UPDLOCK) WHERE Email = @Email )
 
      UPDATE dbo.AccountDetails
         SET Etc = @Etc
       WHERE Email = @Email;
 
    ELSE 
 
      INSERT dbo.AccountDetails ( Email, Etc )
      VALUES ( @Email, @Etc );
 
  COMMIT

This is a solid solution, but every implementation is different so every time you use this pattern, test for concurrency.

Pattern: MERGE Statement With Serializable Isolation

Nice, but be careful with MERGE statements.

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
 
    MERGE dbo.AccountDetails AS myTarget
    USING (SELECT @Email Email, @Etc etc) AS mySource
        ON mySource.Email = myTarget.Email
    WHEN MATCHED THEN UPDATE
        SET etc = mySource.etc
    WHEN NOT MATCHED THEN 
        INSERT (Email, Etc) 
        VALUES (@Email, @Etc);

For alternative syntax, skip the SET TRANSACTION ISOLATION LEVEL SERIALIZABLE and include a lock hint: MERGE dbo.AccountDetails WITH (HOLDLOCK) AS myTarget.
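Written out in full, that alternative looks like this (the same procedure as above, with the table hint in place of the SET TRANSACTION statement):

```sql
CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
    -- HOLDLOCK gives this one statement serializable semantics
    MERGE dbo.AccountDetails WITH (HOLDLOCK) AS myTarget
    USING (SELECT @Email Email, @Etc etc) AS mySource
        ON mySource.Email = myTarget.Email
    WHEN MATCHED THEN UPDATE
        SET etc = mySource.etc
    WHEN NOT MATCHED THEN 
        INSERT (Email, Etc) 
        VALUES (@Email, @Etc);
```

Both spellings take the key-range locks needed to close the race between the existence check and the write.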

Pattern: Just Do It

Just attempt the INSERT, and handle any duplicate key error in the CATCH block:

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
 
BEGIN TRY
 
  INSERT INTO dbo.AccountDetails (Email, Etc) VALUES (@Email, @Etc);  
 
END TRY
 
BEGIN CATCH
 
  -- handle duplicate key errors with an UPDATE; rethrow everything else.
  IF ERROR_NUMBER() IN (2601, 2627) 
    UPDATE dbo.AccountDetails
       SET Etc = @Etc
     WHERE Email = @Email;
  ELSE
    THROW;
 
END CATCH

I wrote a post about it in Ugly Pragmatism For The Win.
And I wrote about a rare problem with it in Case study: Troubleshooting Doomed Transactions.

I have no concurrency concerns here but because of this issue – not to mention performance concerns – it’s no longer my first preference.

Performance and Other Concerns

When describing each pattern so far, I haven’t paid attention to performance, just concurrency. There are many opportunities to improve the performance of these solutions. In particular, pay very close attention to your average use case. If your workload UPDATEs a row 99% of the time and INSERTs a row only 1% of the time, then the optimal implementation will look different than if those frequencies are reversed.
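For example, if the UPDATE is the overwhelmingly common case, one variation (my sketch, not one of the patterns above) is to try the UPDATE first and fall back to an INSERT only when no row was touched:

```sql
CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  BEGIN TRAN
 
    -- common case first: a single seek, no separate EXISTS check
    UPDATE dbo.AccountDetails WITH (UPDLOCK)
       SET Etc = @Etc
     WHERE Email = @Email;
 
    -- rare case: the row wasn't there yet
    IF @@ROWCOUNT = 0
      INSERT dbo.AccountDetails ( Email, Etc )
      VALUES ( @Email, @Etc );
 
  COMMIT
```

In the 99%-UPDATE workload this saves the separate existence check on almost every call. As with every pattern on this page, test it for concurrency before trusting it.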

Check out these links (thanks Aaron Bertrand)

July 16, 2017

10 Things I Learned While Working At D2L

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication — Michael J. Swart @ 3:49 pm

Today marks my tenth anniversary working for D2L and it also marks ten years since I specialized in databases full time.
I’ve never worked anywhere this long. I always told myself that I’d stick around as long as there were new and interesting challenges to tackle and there’s no end in sight.

So I thought I’d compile a list of things I didn’t know in 2007.

1. Database Performance Improvements Are Often Dramatic

A poorly performing database query can run orders of magnitude slower than a tuned query. Compare that to poorly performing application code. Improvements there are usually more modest.

But this can work out. It makes you look like a genius. You get to brag how that report that once took 32 hours now takes 10 minutes. Even though that missing index suggestion in the query plan turned out to be right this time, take the credit, you deserve it.

2. The Community Has Your Back

Speaking of appearing like a genius, here are three resources I still use when solving problems.

3. People Love Pictures With Their Content

I’ve been writing here for nearly ten years. One day, I decided to include an illustration with each blog post. For some reason it makes each post just a little more compelling than a wall of text. Drawing has been a very rewarding hobby for me. I highly recommend it.

4. There Are Dozens of Walkable Lunch Options In Downtown Kitchener

Here are a few of my favorites

  • Kava Bean: Excellent all day breakfast
  • City Cafe: Awesome bagels in the morning.
  • Darlise: You have to hunt for it, but it’s great food
  • Holy Guacamole: Especially if you’re a fan of cilantro

If you’re in town, send me an email. I’ll introduce you to one of these places.

5. Don’t Overestimate Normal Forms

Knowing normal forms is a lot less useful than I thought it would be. Normal forms are an academic way of describing a data model and its ability to avoid redundant data. But knowing the difference between the different normal forms has never helped me avoid trouble.

Just use this rule of thumb: Avoid storing values in more than one place. That will get you most of the way. Do it right and you’ll never hear anyone say things like

  • We need a trigger to keep these values in sync
  • We need to write a script to correct bad data

6. “How Do You Know That?”

This is one of my favorite lines. I use it when troubleshooting a technical problem. It helps me avoid red herrings, and it tells me whether the info is arriving via the telephone game. It’s kind of my version of Missouri’s “you’ve got to show me”. It might go something like this:

Say a colleague says “I/O is slow lately.”

So I’d ask “How do you know?”

They might say something like:

  • “Hinky McUnreliable thought that’s what it was. He told me in the hallway just now”
  • The error log reports 3 instances of I/O taking longer than 15 seconds
  • I measured it on the QA box.

7. Reduce Query Frequency

The impact of a particular procedure or query on a server is simply the frequency of that query multiplied by its cost. To reduce the impact on the server, it’s often more valuable to focus on reducing the number of calls to the query in addition to tuning it.

  • Look for duplicate queries. You’d be surprised how often a web page asks for the same information.
  • Cache the results of queries that are cache-able. I rely on this one heavily.
  • Consider techniques like lazy loading. Only request values the moment they’re needed.

8. Schema Drift Is Real

If you have multiple instances of a database then schema will drift over time. In other words, the definition of objects like tables, views and procedures will be out of alignment with what’s expected. The drift is inevitable unless you have an automated process to check for misalignment.

We wrote our own but tools like Redgate’s Schema Compare can help with this kind of effort.

9. Get Ahead Of Trouble

Software will change over time and it requires deliberate attention and maintenance to remain stable. Watch production and look for expensive queries before they impact end users.

When I type it out like that, it seems obvious. But the question really then becomes, how much time do you want to invest in preventative maintenance? My advice would be more.

Not enough time means that your team gets really good practice firefighting urgent problems. I work directly with a dream team of troubleshooters. They’re smart, they’re cool during a crisis and they’re awesome at what they do. Fixing issues can be thrilling at times, but it pulls us away from other projects.

And what do they say about people who can’t remember the past? Sometimes we have to relearn this lesson every few years.

10. D2L Is A Great Employer

D2L values learning of course and that includes employee professional development. They have always supported my blogging and speaking activities and they recognize the benefits of it.

The people I work with are great. I couldn’t ask for a better team.

And we’ve got a decent mission. I’m really glad to be part of a team that works so hard to transform the way the world learns. Thanks D2L!

July 6, 2017

Converting from DateTime to Ticks using SQL Server

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 9:55 am

A .NET tick is a duration of time lasting 0.1 microseconds. When you look at the Ticks property of DateTime, you’ll see that it represents the number of ticks since January 1st, 0001.
But why 0.1 microseconds? According to stackoverflow user CodesInChaos “ticks are simply the smallest power-of-ten that doesn’t cause an Int64 to overflow when representing the year 9999”.

Even though it’s an interesting idea, just use one of the datetime data types, that’s what they’re there for. I avoid ticks whenever I can.


But sometimes things are out of your control. If you do want to convert between datetime2 and ticks inside SQL Server instead of letting the application handle it, then you need to do some date math. It can get a bit tricky avoiding arithmetic overflows and keeping things accurate:

DateTime2 to Ticks (SQL Server 2016 and later)

CREATE FUNCTION dbo.ToTicks ( @DateTime datetime2 )
  RETURNS bigint
AS
BEGIN
 
    RETURN DATEDIFF_BIG( microsecond, '00010101', @DateTime ) * 10 +
           ( DATEPART( NANOSECOND, @DateTime ) % 1000 ) / 100;
END

DateTime2 to Ticks (Versions < SQL Server 2016)

CREATE FUNCTION dbo.ToTicks ( @DateTime datetime2 )
  RETURNS bigint
AS
BEGIN
    DECLARE @Days bigint = DATEDIFF( DAY, '00010101', cast( @DateTime as date ) );
    DECLARE @Seconds bigint = DATEDIFF( SECOND, '00:00', cast( @DateTime as time( 7 ) ) );
    DECLARE @Nanoseconds bigint = DATEPART( NANOSECOND, @DateTime );
    RETURN  @Days * 864000000000 + @Seconds * 10000000 + @Nanoseconds / 100;
END

Ticks to DateTime2

CREATE FUNCTION dbo.ToDateTime2 ( @Ticks bigint )
  RETURNS datetime2
AS
BEGIN
    DECLARE @DateTime datetime2 = '00010101';
    SET @DateTime = DATEADD( DAY, @Ticks / 864000000000, @DateTime );
    SET @DateTime = DATEADD( SECOND, ( @Ticks % 864000000000) / 10000000, @DateTime );
    RETURN DATEADD( NANOSECOND, ( @Ticks % 10000000 ) * 100, @DateTime );
END
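A quick sanity check for the functions above: ticks at the epoch should be zero, one day should be exactly 864,000,000,000 ticks (24 × 60 × 60 × 10,000,000), and a round trip through both functions should give back the original value.

```sql
SELECT dbo.ToTicks( '0001-01-01' ) AS TicksAtEpoch,     -- 0
       dbo.ToTicks( '0001-01-02' ) AS TicksAfterOneDay, -- 864000000000
       dbo.ToDateTime2( dbo.ToTicks( '2017-07-06 09:55:00' ) ) AS RoundTrip;
```

If the round trip ever disagrees with the input, check the nanosecond terms first; that’s where the truncation to 100-nanosecond ticks happens.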

May 24, 2017

A Table Of Contents For the Data Industry

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 10:52 am

A review of Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann, published by O’Reilly.

In my time exploring the world of data, I often feel intrigued by the unfamiliar. But like most, I’m leery of buzzwords and maybe a little worried that I’m missing out on the thing that we’ll all need to learn to extend our careers beyond the next five years.

Adding to the frustration, I have difficulty evaluating new technologies when the only help I’ve got is a vendor’s brochure, or a blog titled “Why X is way way better than Y”. Balanced, unbiased help is hard to find. And I really value unbiased ideas. Personally, I try not to appear the Microsoft groupie. I value my MVP award, but I also like to stress the “independent” part of that.

I’m also wary of some academic researchers. The theory can sometimes drift too far away from the practical (dba.stackexchange only has 411 results for “normal form” and I suspect many of those are homework questions). Some academics almost seem offended whenever a vendor deviates from relational theory.

That’s why I was so thrilled (and a bit relieved) to discover Martin Kleppmann’s Designing Data-Intensive Applications. Martin is an amazing writer who approached his book with a really balanced style. He’s also a researcher with real-world experience that helps him focus on the practical.

Designing Data-Intensive Applications

First let me just say that the book has a really cool wild boar on the cover. The boar reminds me of Porcellino at the University of Waterloo.


Designing Data-Intensive Applications is a book that covers database systems very comprehensively. He covers both relational systems and distributed systems. He covers data models, fault tolerance strategies and so much more. In fact he covers so many topics that the whole book seems like a table of contents for our data industry.

Here’s the thing. When I read the parts I know, he’s a hundred percent right and that helps me trust him when he talks about the parts that I don’t know about.

Martin talks a lot about distributed systems, both the benefits and drawbacks, and even though you may have no plans to write a Map-Reduce job, you’ll be equipped to talk about it intelligently with those that do. The same goes for other new systems. For example, after reading Martin’s book, you’ll be able to read the spec sheet on Cosmos DB and feel more comfortable reasoning about its benefits (but that’s a post for another day).

Event Sourcing

Martin then goes on to write about event sourcing. Martin is a fan of event sourcing (as are several of my colleagues). In fact I first learned about event sourcing when a friend sent me Martin’s video Turning the Database Inside Out With Apache Samza.
A ridiculously simplified summary is this. Make the transaction log the source of truth and make everything else a materialized view. It allows for some really powerful data solutions and it simplifies some problems you and I have learned to live with. So I wondered whether his chapter on event sourcing would sound like a commercial. Nope, that chapter is still remarkably well balanced.

By the way, when the revolution comes it won’t bother me at all. There’s no longer such thing as a one-size-fits-all data system and the fascinating work involves fitting all the pieces together. I’m working in the right place to explore this brave new world and excited to learn the best way to move from here to there.

Some Other Notes I Made

  • This feels like the Red Book (Readings in Database Systems) put together by Michael Stonebraker, but it deals with more kinds of systems. I hope Martin refreshes this book every few years with new editions as the industry changes.
  • Martin suggests that the C in ACID might have been introduced for the purpose of the acronym. I knew it!
  • Martin calls the CAP theorem unhelpful and explains why. He admits the CAP theorem “encouraged engineers to explore a wider design space” but it is probably better left behind.
  • My wife Leanne doesn’t like fantasy books and she won’t read a book that has a map in the front. My friend Paul won’t read a book without one. Martin is very professional, but his style shows and I love it. He’s got a map at the beginning of every chapter.
  • The quotes at the beginning of each chapter are really well chosen. They come from Douglas Adams, Terry Pratchett and others. But my favorite is from Thomas Aquinas “If the highest aim of a captain were to preserve his ship, he would keep it in port forever.”

Martin writes that the goal of the book is to help you navigate the diverse and fast-changing landscape of technologies for processing and storing data. I believe he met that goal.
