Michael J. Swart

November 25, 2022

Use RCSI to tackle most locking and blocking issues in SQL Server

Filed under: Miscelleaneous SQL,Technical Articles — Michael J. Swart @ 12:54 pm

What’s the best way to avoid most blocking issues in SQL Server? Turn on Read Committed Snapshot Isolation (RCSI). That’s it.

Configuring RCS isolation level

To see if it’s enabled on your database, use the is_read_committed_snapshot_on column in sys.databases like this:

select is_read_committed_snapshot_on
from sys.databases
where database_id = db_id();

To enable the setting alter the database like this:

ALTER DATABASE CURRENT
SET READ_COMMITTED_SNAPSHOT ON

Is it that easy?

Kind of. For the longest time at work, we ran our databases with this setting off. Mostly because that’s the default setting for SQL Server. As a result, we encountered a lot of blocking and deadlocks. I got really really good at interpreting deadlocks and blocking graphs. I’ve written many blog posts on blocking and I even wrote a handy tool (the blocked process report viewer) to help understand who the lead blocker was in a blocking traffic jam.

Eventually after a lot of analysis we turned on RCSI. Just that setting change probably gave us the biggest benefit for the least effort. We rarely have to deal with blocking issues. I haven’t made use of the blocked process report viewer in years.

Be like Severus Snape

I’m reminded of a note that Snape (from the Harry Potter books) wrote in his textbook on poison antidotes “Just shove a bezoar down their throats.” The idea was that you didn’t have to be good at diagnosing and creating antidotes because a bezoar was simply an “antidote to most poisons”.

In the same way, I’ve found that RCSI is an antidote to most blocking.

October 12, 2022

You Can Specify Two Indexes In Table Hint?

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 12:00 pm

Yes, It turns out that you can specify two indexes in a table hint:

SELECT Id, Reputation
FROM dbo.Users WITH (INDEX (IX_Reputation, PK_Users_Id))
WHERE Reputation > 1000

And SQL Server obeys. It uses both indexes even though the nonclustered index IX_Reputation is covering:
Two Indexes

But Why?

I think this is a solution looking for a problem.

Resolving Deadlocks?
My team wondered if this could be used as to help with a concurrency problem. We recently considered using it to resolve a particular deadlock but we had little success.

It’s useful to think that SQL Server takes locks on index rows instead of table rows. And so the idea we had was that perhaps taking key locks on multiple indexes can help control the order that locks are taken. But after some effort, it didn’t work at avoiding deadlocks. For me, I’ve had better luck using the simpler sp_getapplock.

Forcing Index Intersection?
Brent Ozar wrote about index intersection a while ago. Index intersection is a rare thing to find in a query plan. Brent can “count on one hand the number of times [he’s] seen this in the wild”.

In theory, I could force index intersection (despite the filter values):

SELECT Id
FROM dbo.Users WITH (INDEX (IX_UpVotes, IX_Reputation))
WHERE Reputation > 500000
AND UpVotes > 500000

But I wouldn’t. SQL Server choosing index intersection is already so rare. And so I think the need to force that behavior will be even rarer. This is not a tool I would use for tuning queries. I’d leave this technique alone.

Have You Used More Than One Index Hint?

I’d love to hear about whether specifying more than one index in a table hint has ever helped solve a real world problem. Let me know in the comments.

October 6, 2022

The Tyranny Of Cumulative Costs (Save and Forget Build Up)

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 12:00 pm

50:50 Triangle

Using the right triangle above draw a vertical line separating the area of the triangle in to two parts with the same area.
The triangle on the left is 70.7% of the width of the original triangle.

Cumulative Storage Costs

Think of this another way. The triangle above is a graph of the amount of data you have over time. And if you pay for storage as an operational expense such as when you’re renting storage in the cloud (as opposed to purchasing physical drives). Then the cost of storage is the area of the graph. The monthly bills are ever-increasing, so half of the total cost of storage will be for the most recent 29%.

Put yet another way: If you started creating a large file in the cloud every day since March 2014, then the amount you paid to the cloud provider before the pandemic started is the same amount you paid after the pandemic started (as of August 2022).

How Sustainable is This?

If the amount of data generated a day isn’t that much, or the storage you’re using is cheap enough then it really doesn’t matter too much. As an example, AWS’s cheapest storage, S3 Glacier Deep Archive, works out to about $0.001 a month per GB.

But if you’re using Amazon’s Elastic Block Storage like the kind of storage needed for running your own SQL Servers in the cloud, the cost can be closer to $.08 a month per GB.

The scale on the triangle graph above really matters.

Strategies

This stresses the need for a data life-cycle policy. An exit story for large volumes of data. Try to implement Time-To-Live (TTL) or clean up mechanisms right from the beginning of even the smallest project. Here’s one quick easy example from a project I wrote that collects wait stats. The clean-up is a single line.

Look at Netflix does approaches this issue. I like how they put it. “Data storage has a lot of usage and cost momentum (i.e. save-and-forget build-up).”

Netflix stresses the importance of “cost visibility” and they use that to offer focused recommendations for cleaning up unused data. I recommend reading that whole article. It’s fascinating.

It’s important to implement such policies before that triangle graph gets too large.

September 28, 2022

When are Non-Updating Updates Treated Like Regular Updates?

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 12:00 pm

Takeaway: I look at different features to see whether non-updates are treated the same as other updates. Most of the time they are.

According to Microsoft’s documentation, an UPDATE statement “changes existing data in a table or view”. But what if the values don’t actually change? What if affected rows are “updated” with the original values? Some call these updates non-updating. This leads to a philosophical question: “If an UPDATE statement doesn’t change any column to a different value, has the row been updated?”

I answer yes to that question. I consider all affected rows as “updated” regardless of whether the values are different. I think of the UPDATE statement as more of an OVERWRITE statement. I also think of “affected rows” rather than “changed rows”. In most cases SQL Server thinks along the same lines.

I list some features and areas of SQL Server and whether non-updating updates are treated the same or differently than other updates:

The Performance of Non-Updates Non-Updates treated differently than other Updates

In 2010, Paul White wrote The Impact of Non-Updating Updates where he points out optimizations Microsoft has made to avoid unnecessary logging when performing some non-updating updates. It’s a rare case where SQL Server actually does pay attention to whether values are not changing to avoid doing unnecessary work.

In the years since, I’ve noticed that this optimization hasn’t changed much except that Microsoft has extended these performance improvements to cases where RCSI or SI is enabled.

Regardless of this performance optimization, it’s still wise to limit affected rows as much as possible. In other words, I still prefer

UPDATE FactOnlineSales 
SET DiscountAmount = NULL
WHERE CustomerKey = 19036
AND DiscountAmount IS NOT NULL;

over this logically equivalent version:

UPDATE FactOnlineSales 
SET DiscountAmount = NULL
WHERE CustomerKey = 19036;

Although the presence of triggers and cascading foreign keys require extra care as we’ll see.

Triggers Non-Updates are treated the same as Updates

Speaking of triggers, remember that inside a trigger, non-updating rows are treated exactly the same as any other changing row. Just remember that:

  • Triggers are always invoked, even when there are zero rows affected or even when the table is empty.
  • For UPDATE statements, the UPDATE() function only cares about whether a column appeared in the SET clause. It can be useful for short-circuit logic.
  • The virtual tables inserted and deleted are filled with all affected rows (not just changed rows).

ON UPDATE CASCADE Non-Updates are treated the same as Updates

When foreign keys have ON UPDATE CASCADE set, Microsoft says “corresponding rows are updated in the referencing table when that row is updated in the parent table”.

Non-updating updates are no exception. To demonstrate, I create an untrusted foreign key and perform a non-updating update. It’s not a “no-op”, the constraint is checked as expected.

CREATE TABLE dbo.TestReferenced (
	Id INT PRIMARY KEY
);
 
INSERT dbo.TestReferenced (Id) VALUES (1), (2), (3), (4);
 
 
CREATE TABLE dbo.TestReferrer (
	Id INT NOT NULL
);
 
INSERT dbo.TestReferrer (Id) VALUES (2), (4), (6), (8);
 
ALTER TABLE dbo.TestReferrer 
WITH NOCHECK ADD FOREIGN KEY (Id) 
REFERENCES dbo.TestReferenced(Id)
ON UPDATE CASCADE;
 
-- trouble with this non-updating update:
UPDATE dbo.TestReferrer
SET Id = Id
WHERE Id = 8;
-- The UPDATE statement conflicted with the FOREIGN KEY constraint ...

@@ROWCOUNT Non-Updates are treated the same as Updates

SELECT @@ROWCOUNT returns the number of affected rows in the previous statement, not the number of changed rows.

Temporal Tables Non-Updates are treated the same as Updates

Non-updating updates still generate new rows in the history table. This can lead to puzzling results if you’re not prepared for them. For example, I can make some changes and query the history like this:

INSERT MyTest(Value) VALUES ('Mike')
UPDATE MyTest SET Value = 'Michael';
UPDATE MyTest SET Value = 'Michael';

When I take the union of rows in the table and the history table, I might see this output:

It reminds me of when my GPS says something like “In two miles, continue straight on Highway 81.” The value didn’t change, but there are still two distinct ranges.

Change Tracking Non-Updates are treated the same as Updates

Change tracking could be called “Overwrite Tracking” because all non-updating updates are tracked:

ALTER DATABASE CURRENT
set change_tracking = ON  
(change_retention = 2 days, auto_cleanup = on);
GO
 
create table dbo.test (id int primary key);
insert dbo.test (id) values (1), (2), (3); 
 
alter table dbo.test enable change_tracking with (track_columns_updated = on)  
 
-- This statement produces 0 rows:
SELECT t.id, c.*
FROM CHANGETABLE (CHANGES dbo.Test, 0) AS c  
JOIN dbo.Test AS t ON t.id = c.id ;
 
-- "update"
update dbo.test set id = id; 
 
-- This statement produces 3 rows:
SELECT t.id, c.*
FROM CHANGETABLE (CHANGES dbo.Test, 0) AS c  
JOIN dbo.Test AS t ON t.id = c.id;

Change Data Capture (CDC) Non-Updates treated differently than other Updates

Here’s a rare exception where a SQL Server feature is named properly. CDC does indeed capture data changes only when data is changing.
Paul White provided a handy set up for testing this kind of stuff. I reran his tests with CDC turned on. I found that:

  • When CDC is enabled, an update statement is always logged and the data buffers are always marked dirty.
  • But non-updating updates almost never show up as captured data changes, not even when the update was on a column in the clustering key.
  • I was able to generate some CDC changes for non-updates by updating the whole table with an idempotent expression (e.g. SET some_column = some column * 1)
    CREATE TABLE dbo.SomeTable
    (
        some_column integer NOT NULL,
        some_data integer NOT NULL,
    	index ix_sometable unique clustered (some_column)
    );
     
    UPDATE dbo.SomeTable SET some_column = some_column*1;

If you’re using this feature, this kind of stuff is important to understand! If you’re using CDC for DIY replication (God help you), then maybe the missing non-updates are acceptable. But if you’re looking for a kind of audit, or a way to analyze user-interactions with the database, then CDC doesn’t give the whole picture and is not the tool for you.

September 21, 2022

Batching Follow-Up

Filed under: Miscelleaneous SQL,SQL Scripts,Technical Articles — Michael J. Swart @ 12:00 pm

When I wrote Take Care When Scripting Batches, I wanted to guard against a common pitfall when implementing a batching solution (n-squared performance). I suggested a way to be careful. But I knew that my solution was not going to be universally applicable to everyone else’s situation. So I wrote that post with a focus on how to evaluate candidate solutions.

But we developers love recipes for problem solving. I wish it was the case that for whatever kind of problem you got, you just stick the right formula in and problem solved. But unfortunately everyone’s situation is different and the majority of questions I get are of the form “What about my situation?” I’m afraid that without extra details, the best advice remains to do the work to set up the tests and find out for yourself.

Your Own Batches

But despite that. I’m still going to answer some common questions I get. But I’m going to continue to focus on how I evaluate each solution.
(Before reading further, you might want to re-familiarize yourself with the original article Take Care When Scripting Batches).

Here are some questions I get:

What if the clustered index is not unique?

Or what if the clustered index had more than one column such that leading column was not unique. For example, imagine the table was created with this clustered primary key:

ALTER TABLE dbo.FactOnlineSales
ADD CONSTRAINT PK_FactOnlineSales
PRIMARY KEY CLUSTERED (DateKey, OnlineSalesKey)

How do we write a batching script in that case? It’s usually okay if you just use the leading column of the clustered index. The careful batching script looks like this now:

DECLARE
  @LargestKeyProcessed DATETIME = '20000101',
  @NextBatchMax DATETIME,
  @RC INT = 1;
 
WHILE (@RC > 0)
BEGIN
 
  SELECT TOP (1000) @NextBatchMax = DateKey
  FROM dbo.FactOnlineSales
  WHERE DateKey > @LargestKeyProcessed
    AND CustomerKey = 19036
  ORDER BY DateKey ASC;
 
  DELETE dbo.FactOnlineSales
  WHERE CustomerKey = 19036
    AND DateKey > @LargestKeyProcessed
    AND DateKey <= @NextBatchMax;
 
  SET @RC = @@ROWCOUNT;
  SET @LargestKeyProcessed = @NextBatchMax;
 
END

The performance is definitely comparable to the original careful batching script:

Logical Reads Per Delete

Logical Reads Per Delete

But is it correct? A lot of people wonder if the non-unique index breaks the batching somehow. And the answer is yes, but it doesn’t matter too much.

By limiting the batches by DateKey instead of the unique OnlineSalesKey, we are giving up batches that are exactly 1000 rows each. In fact, most of the batches in my test process somewhere between 1000 and 1100 rows and the whole thing requires three fewer batches than the original script. That’s acceptable to me.

If I know that the leading column of the clustering key is selective enough to keep the batch sizes pretty close to the target size, then the script is still accomplishing its goal.

What if the rows I have to delete are sparse?

Here’s another situation. What if instead of customer 19036, we were asked to delete customer 7665? This time, instead of deleting 45100 rows, we only have to delete 379 rows.

I try the careful batching script and see that all rows are deleted in a single batch. SQL Server was looking for batches of 1000 rows to delete. But since there aren’t that many, it scanned the entire table to find just 379 rows. It completed in one batch, but that single batch performed as poorly as the straight algorithm.

One solution is to create an index (online!) for these rows. Something like:

CREATE INDEX IX_CustomerKey 
ON dbo.FactOnlineSales(CustomerKey) 
WITH (ONLINE = ON);

Most batching scripts are one-time use. So maybe this index is one-time use as well. If it’s a temporary index, just remember to drop it after the script is complete. A temp table could also do the same trick.

With the index, the straight query only needed 3447 logical reads to find all the rows to delete:

DELETE dbo.FactOnlineSales WHERE CustomerKey = 7665;

Careful

Logical Reads

Can I use the Naive algorithm if I use a new index?

How does the Naive and other algorithms fare with this new index on dbo.FactOnlineSales(CustomerKey)?

The rows are now so easy to find that the Naive algorithm no longer has the n-squared behavior we worried about earlier. But there is some extra overhead. We have to delete from more than one index. And we’re doing many b-tree lookups (instead of just scanning a clustered index).

Remember the Naive solution looks like this:

DECLARE	@RC INT = 1;
 
WHILE (@RC > 0)
BEGIN
 
  DELETE TOP (1000) dbo.FactOnlineSales
  WHERE CustomerKey = 19036;
 
  SET @RC = @@ROWCOUNT
 
END

But now with the index, the performance looks like this (category Naive with Index)
Logical Reads Per Delete

The index definitely helps. With the index, the Naive algorithm definitely looks better than it did without the index. But it still looks worse than the careful batching algorithm.

But look at that consistency! Each batch processes 1000 rows and reads exactly the same amount. I might choose to use Naive batching with an index if I don’t know how sparse the rows I’m deleting are. There are a lot of benefits to having a constant runtime for each batch when I can’t guarantee that rows aren’t sparse.

Explore new solutions on your own

There are many different solutions I haven’t explored. This list isn’t comprehensive.

But it’s all tradeoffs. When faced with a choice between candidate solutions, it’s essential to know how to test and measure each solution. SQL Server has more authoritative answers about the behavior of SQL Server than me or any one else. Good luck.

September 14, 2022

The Effect of a Slow Registry on SQL Server

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 12:00 pm

I want to describe some symptoms that SQL Server may display when its Windows Registry is non-responsive or slow. From the symptoms, it’s hard to know that it’s a slow registry and so if a web search brought you here, hopefully this helps.

How does SQL Server use the Windows registry?

First, it’s useful to know a bit about how SQL Server uses the registry. We can watch registry activity using Process Monitor (procmon). On a fairly quiet local machine, I see these SQL Server processes “querying” registry keys:

  • There is some background process reading Query Store settings (every minute).
    HKLM\Software\Microsoft\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQLServer\QueryStoreSettings
    QueryStore
  • There is also some background process writing uptime info (every minute).
    HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQLServer\uptime_pid
    Uptime
  • When a login is requested from a new connection, SQL Server will check to see if R Services are installed (aka Advanced Analytics).
    SQL Server will check SERVERPROPERTY('IsAdvancedAnalyticsInstalled') every time to see if it has to care about logins associated with something called implied authentication. This happens on every login which will be important later.
    HKLM\Software\Microsoft\Microsoft SQL Server\MSSQL15.MSSQLSERVER\Setup\AdvancedAnalytics
    AdvancedAnalytics
  • If I use a function like HASHBYTES, SQL Server looks up some cryptography settings. These settings get queried only on the first call to HASHBYTES in each session.
    e.g. HKLM\SOFTWARE\Microsoft\Cryptography\Defaults\Provider\Microsoft Enhanced RSA and AES Cryptographic Provider
    Cryptography

That’s not an exhaustive list, there are many other ways SQL Server uses the Windows Registry. For example:

  • Many SQL Agent settings are stored there and are read regularly
  • xp_regread coming from using wizards in SQL Server Management Studio.
  • SERVERPROPERTY(N'MachineName') gets its info from HKLM\System\CurrentControlSet\Services\Tcpip\Parameters\Hostname
  • And many others.

What happens when the Windows registry is slow?

SQL Server’s use of the registry can be fairly quiet – even on a busy server – so you may not see any symptoms at all. But if the calls to the registry are slow in responding, here is what you might see:

  • New logins will ask whether Advanced Analytics Extensions is installed. Leading to a non-yielding scheduler and a memory dump. With some effort, you might find a stack trace like the one in the appendix below.
  • Any other kind of memory dump caused by non-yielding schedulers in which the saved stack trace ends with ntdll!NtOpenKeyEx. The AdvancedAnalytics is just one example but it’s the most common because it’s executed first on each login.
  • Queries calling HASHBYTES (or other cryptography functions) will be suspended and wait with PREEMPTIVE_OS_CRYPTACQUIRECONTEXT. I mostly see this when the login checks are skipped i.e. when an open connection from a connection pool is used.
  • Another symptom is Availability Group failovers (allegedly). It’s harder (for me) to do AG failover post mortems and tie them definitively to slow Windows registries

Why might the registry be slow?

I’m not sure. Perhaps it’s associated with some registry cleanup process. It may have something to do with an IO spike on the C: drive.

We rebuilt a virtual machine image from scratch which seems to avoid the problem. I’m keeping my fingers crossed.

I’d love to hear if you’ve come across anything similar.

Appendix: Sample call stack for non-yielding scheduler

00 ntdll!NtOpenKeyEx
01 KERNELBASE!LocalBaseRegOpenKey
02 KERNELBASE!RegOpenKeyExInternalW
03 KERNELBASE!RegOpenKeyExW
04 sqlmin!IniRegOpenKeyExW
05 sqlmin!GetServerProperty
06 sqlmin!IsAdvancedAnalyticsInstalled
07 sqllang!IsExtensibilityFeatureEnabled
08 sqllang!ImpliedAuthenticationManager::IsImpliedAuthenticationEnabled
09 sqllang!FindLogin
0a sqllang!login
0b sqllang!process_login_finish
0c sqllang!process_login
0d sqllang!process_commands_internal
0e sqllang!process_messages
0f sqldk!SOS_Task::Param::Execute
10 sqldk!SOS_Scheduler::RunTask
11 sqldk!SOS_Scheduler::ProcessTasks
12 sqldk!SchedulerManager::WorkerEntryPoint
13 sqldk!SystemThreadDispatcher::ProcessWorker
14 sqldk!SchedulerManager::ThreadEntryPoint
15 kernel32!BaseThreadInitThunk
16 ntdll!RtlUserThreadStart

September 7, 2022

This Function Generates UNPIVOT Syntax

Filed under: Miscelleaneous SQL,SQL Scripts,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 12:00 pm

Just like PIVOT syntax, UNPIVOT syntax is hard to remember.
When I can, I prefer to pivot and unpivot in the application, but here’s a function I use sometimes when I want don’t want to scroll horizontally in SSMS.

CREATE OR ALTER FUNCTION dbo.GenerateUnpivotSql (@Sql NVARCHAR(MAX))
  RETURNS NVARCHAR(MAX) AS
BEGIN 
RETURN '
WITH Q AS 
(
  SELECT TOP (1) ' + 
  (
    SELECT 
      STRING_AGG(
        CAST(
          'CAST(' + QUOTENAME(NAME) + ' AS sql_variant) AS ' + QUOTENAME(NAME) 
          AS NVARCHAR(MAX)
        ), ',
    '
      )
    FROM sys.dm_exec_describe_first_result_set(@sql, DEFAULT, DEFAULT)
  ) + '
  FROM ( 
    ' + @sql + '
  ) AS O 
)
SELECT U.FieldName, U.FieldValue
FROM Q
UNPIVOT (FieldValue FOR FieldName IN (' +
  (
    SELECT STRING_AGG( CAST( QUOTENAME(name) AS NVARCHAR(MAX) ), ',
  ' ) 
  FROM sys.dm_exec_describe_first_result_set(@sql, DEFAULT, DEFAULT)
  ) + '
  )) AS U';
END
GO

And you might use it like this:

declare @sql nvarchar(max) ='SELECT * FROM sys.databases WHERE database_id = 2';
declare @newsql nvarchar(max) = dbo.GenerateUnpivotSql (@sql);
exec sp_executesql @sql;
exec sp_executesql @newsql;

to get results like this:
Results

Uses

I find this function useful whenever I want to take a quick look at one row without all that horizontal scrolling. Like when looking at sys.dm_exec_query_stats and other wide dmvs. This function is minimally tested, so caveat emptor.

August 9, 2022

Formatting Binary(10) LSN Values For Use In sys.fn_dblog()

Filed under: Miscelleaneous SQL,SQL Scripts,Technical Articles — Michael J. Swart @ 3:06 pm

System procedures like sp_replincrementlsn and system functions like fn_cdc_get_min_lsn and fn_cdc_get_max_lsn return values that are of type binary(10).

These values represent LSNs, Log Sequence Numbers which are an internal way to represent the ordering of transaction logs.

Typically as developers, we don’t care about these values. But when we do want to dig into the transaction log, we can do so with sys.fn_dblog which takes two optional parameters. These parameters are LSN values which limit the results of sys.fn_dblog. But the weird thing is that sys.fn_dblogis a function whose LSN parameters are NVARCHAR(25).

The function sys.fn_dblog doesn’t expect binary(10) values for its LSN parameters, it wants the LSN values as a formatted string, something like: 0x00000029:00001a3c:0002.

Well, to convert the binary(10) LSN values into the format expected by sys.fn_dblog, I came up with this function:

CREATE OR ALTER FUNCTION dbo.fn_lsn_to_dblog_parameter(
    @lsn BINARY(10)
)
RETURNS NVARCHAR(25)
AS 
BEGIN
  RETURN
    NULLIF(
      STUFF (
        STUFF (
          '0x' + CONVERT(NVARCHAR(25), @lsn, 2),
          11, 0, ':' ),
        20, 0, ':' ),
      '0x00000000:00000000:0000'
    )
END
GO

Example

I can increment the LSN once with a no-op and get back the lsn value with sp_replincrementlsn.
I can then use fn_lsn_to_dblog_parameter to get an LSN string to use as parameters to sys.fn_dblog.
This helps me find the exact log entry in the transaction that corresponds to that no-op:

DECLARE @lsn binary(10);
DECLARE @lsn_string nvarchar(25)
exec sp_replincrementlsn @lsn OUTPUT;
SET @lsn_string = dbo.fn_lsn_to_dblog_parameter(@lsn);
 
select @lsn_string as lsn_string, [Current LSN], Operation
from sys.fn_dblog(@lsn_string, @lsn_string);

March 18, 2022

UPSERT Requires a Unique Index

Filed under: Miscelleaneous SQL,SQL Scripts,Technical Articles — Michael J. Swart @ 10:11 am

To avoid deadlocks when implementing the upsert pattern, make sure the index on the key column is unique. It’s not sufficient that all the values in that particular column happen to be unique. The index must be defined to be unique, otherwise concurrent queries can still produce deadlocks.

Say I have a table with an index on Id (which is not unique):

CREATE TABLE dbo.UpsertTest(
	Id INT NOT NULL,
	IdString VARCHAR(100) NOT NULL,
	INDEX IX_UpsertTest CLUSTERED (Id)
)

I implement my test UPSERT procedure the way I’m supposed to like this:

CREATE OR ALTER PROCEDURE dbo.s_DoSomething  
AS 
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION 
	DECLARE @Id BIGINT = DATEPART(SECOND, GETDATE());
	DECLARE @IdString VARCHAR(100) = CAST(@Id AS VARCHAR(100)); 
 
	IF EXISTS ( 
		SELECT * 
		FROM dbo.UpsertTest WITH (UPDLOCK) 
		WHERE Id = @Id 
	) 
	BEGIN 
		UPDATE dbo.UpsertTest 
		SET IdString = @IdString 
		WHERE Id = @Id; 
	END 
	ELSE 
	BEGIN 
		INSERT dbo.UpsertTest (Id, IdString) 
		VALUES (@Id, @IdString); 
	END; 
COMMIT

When I exercise this procedure concurrently with many threads it produces deadlocks! I can use extended events and the output from trace flag 1200 to find out what locks are taken and what order.

What Locks Are Taken?

It depends on the result of the IF statement. There are two main scenarios to look at. Either the row exists or it doesn’t.

Scenario A: The Row Does Not Exist (Insert)
These are the locks that are taken:

    For the IF EXISTS statement:

    • Acquire Range S-U lock on resource (ffffffffffff) which represents “infinity”

    For the Insert statement:

    • Acquire RangeI-N lock on resource (ffffffffffff)
    • Acquire X lock on resource (66467284bfa8) which represents the newly inserted row

Insert Scenario

Scenario B: The Row Exists (Update)
The locks that are taken are:

    For the IF EXISTS statement:

    • Acquire Range S-U lock on resource (66467284bfa8)

    For the Update statement:

    • Acquire RangeX-X lock on resource (66467284bfa8)
    • Acquire RangeX-X lock on resource (ffffffffffff)

Update Scenario

Scenario C: The Row Does Not Exist, But Another Process Inserts First (Update)
There’s a bonus scenario that begins just like the Insert scenario, but the process is blocked waiting for resource (ffffffffffff). Once it finally acquires the lock, the next locks that are taken look the same as the other Update scenario. The locks that are taken are:

    For the IF EXISTS statement:

    • Wait for Range S-U lock on resource (ffffffffffff)
    • Acquire Range S-U lock on resource (ffffffffffff)
    • Acquire Range S-U lock on resource (66467284bfa8)

    For the Update statement:

    • Acquire RangeX-X lock on resource (66467284bfa8)
    • Acquire RangeX-X lock on resource (ffffffffffff)

Update After Wait Scenario

The Deadlock

And when I look at the deadlock graph, I can see that the two update scenarios (Scenario B and C) are fighting:
Scenario B:

  • Acquire RangeX-X lock on resource (66467284bfa8) during UPDATE
  • Blocked RangeX-X lock on resource (ffffffffffff) during UPDATE

Scenario C:

  • Acquire RangeS-U lock on resource (ffffffffffff) during IF EXISTS
  • Blocked RangeS-U lock on resource (66467284bfa8) during IF EXISTS

Why Isn’t This A Problem With Unique Indexes?

To find out, let’s take a look at one last scenario where the index is unique:
Scenario D: The Row Exists (Update on Unique Index)

    For the IF EXISTS statement:

    • Acquire U lock on resource (66467284bfa8)

    For the Update statement:

    • Acquire X lock on resource (66467284bfa8)

Visually, I can compare scenario B with Scenario D:
Update Two Scenarios

When the index is not unique, SQL Server has to take key-range locks on either side of the row to prevent phantom inserts, but it’s not necessary when the values are guaranteed to be unique! And that makes all the difference. When the index is unique, no lock is required on resource (ffffffffffff). There is no longer any potential for a deadlock.

Solution: Define Indexes As Unique When Possible

Even if the values in a column are unique in practice, you’ll help improve concurrency by defining the index as unique. This tip can be generalized to other deadlocks. Next time you’re troubleshooting a deadlock involving range locks, check to see whether the participating indexes are unique.

This quirk of requiring unique indexes for the UPSERT pattern is not unique to SQL Server, I notice that PostgreSQL requires a unique index when using their “ON CONFLICT … UPDATE” syntax. This is something they chose to do very deliberately.

Other Things I Tried

This post actually comes from a real problem I was presented. It took a while to reproduce and I tried a few things before I settled on making my index unique.

Lock More During IF EXISTS?
Notice that there is only one range lock taken during the IF EXISTS statement, but there are two range needed for the UPDATE statement. Why is only one needed for the EXISTS statement? If extra rows get inserted above the row that was read, that doesn’t change the answer to EXISTS. So it’s technically not a phantom read and so SQL Server doesn’t take that lock.

So what if I changed my IF EXISTS to

IF ( 
	SELECT COUNT(*)
	FROM dbo.UpsertTest WITH (UPDLOCK) 
	WHERE Id = @Id 
) > 0

That IF statement now takes two range locks which is good, but it still gets tripped up with Scenario C and continues to deadlock.

Update Less?
Change the update statement to only update one row using TOP (1)

UPDATE TOP (1) dbo.UpsertTest 
SET IdString = @IdString
WHERE Id = @Id;

During the update statement, this only requires one RangeX-X lock instead of two. And that technique actually works! I was unable to reproduce deadlocks with TOP (1). So it is indeed a candidate solution, but making the index unique is still my preferred method.

February 7, 2022

Five Ways Time Makes Unit Tests Flaky

Filed under: Miscelleaneous SQL,Technical Articles — Michael J. Swart @ 12:21 pm
Five Ways Time Makes Unit Tests Flaky
I explore different sources of test flakiness related to time:

A flaky test is a unit test that sometimes passes and sometimes fails. The causes of these flaky tests are often elusive because they’re not consistently reproducible.

I’ve found that unit tests that deal with dates and times are notorious for being flaky – especially such tests that talk to SQL Server. I want to explore some of the reasons this can happen.

My Setup


All scripts and code samples are available on github.
In the examples I discuss below, I’m using a table defined like this:

CREATE TABLE dbo.MESSAGE_LOG
(
	LogId INT IDENTITY NOT NULL 
		PRIMARY KEY,
	LogMessage NVARCHAR(MAX) NOT NULL,
	CreatedDate DATETIME
		DEFAULT (SYSDATETIME())
)

I also wrote some methods that execute these queries:

AddLogMessage

INSERT dbo.MESSAGE_LOG(LogMessage)
OUTPUT inserted.LogId
VALUES (@Message);

AddLogMessageWithDate
Same method but this allows the application to supply the LastUpdate value

INSERT dbo.MESSAGE_LOG(LogMessage, LastUpdate)
OUTPUT inserted.LogId
VALUES (@Message, @LastUpdate)

UpdateLogMessage

UPDATE dbo.MESSAGE_LOG
SET Message = @Message,
    LastUpdatedTime = SYSDATETIME()
WHERE LogId = @LogId

Sources of Flaky Tests

In no particular order:

Tests Run Too Quick?

The following test checks to see that UpdateMessage updated the LastUpdate column.

[Test]
public void UpdateMessage_DateIsUpdated_1() {
    string message = Guid.NewGuid().ToString();
    int logId = m_data.AddLogMessage( message );
    LogMessageDto? dto = m_data.GetLogMessage( logId );
    DateTime createdDate = dto.LastUpdate;
 
    string newMessage = Guid.NewGuid().ToString();
    m_data.UpdateLogMessage( logId, newMessage );
 
    dto = m_data.GetLogMessage( logId );
    DateTime updatedDate = dto.LastUpdate;
 
    // The following assertion may fail! 
    // updatedDate and createdDate are Equal if the server is fast enough
    Assert.Greater( updatedDate, createdDate ); 
}

The test ran so quickly that updatedDate has the same value as createdDate. This test may fail with this error message:

    Failed UpdateMessage_DateIsUpdated [55 ms]
    Error Message:
    Expected: greater than 2022-02-05 15:18:10.33
    But was: 2022-02-05 15:18:10.33

It’s tempting to to get around this by adding a Thread.Sleep call between the insert and update. I don’t recommend it. That kind of pattern adds up and really lengthens the time it takes to run all tests.

Another solution might involve changing Greater to GreaterOrEqual but then we can’t verify that the value has actually been updated.

Storing dates using a more precise datetime type like DATETIME2 may help avoid more failures, but maybe not all failures.

The Right Way
Ideally we want to set up the test case such that the LastUpdate value is a constant date that’s definitely in the past. I would change this test to use AddLogMessageWithDate instead of AddLogMessage:

    DateTime then = new DateTime(2000, 01, 01);
    int logId = m_data.AddLogMessageWithDate( message, then );

Not All DateTimes Are Created Equal


.Net’s DateTime is different than SQL Server’s DATETIME. Specifically they have different precisions. DateTime values in SQL Server are rounded to increments of .000, .003, or .007 seconds. This means that you can’t store a .Net DateTime value in SQL Server and get it back. This test demonstrates the problem:

[Test]    
public void StoreDate_ReadItBack() {
    // Store date
    string message = Guid.NewGuid().ToString();
    DateTime now = DateTime.Now;
    int logId = m_data.AddLogMessageWithDate( message, now );
 
    // Read it back
    LogMessageDto? dto = m_data.GetLogMessage( logId );
 
    // The following assertion may fail! 
    // SQL Server's DATETIME has a different precision than .Net's DateTime
    Assert.AreEqual( now, dto.LastUpdate );
}

It may fail with:

    Failed StoreDate_ReadItBack [101 ms]
    Error Message:
    Expected: 2022-02-04 15:11:20.4474577
    But was: 2022-02-04 15:11:20.447

The Right Way
Understanding the resolution limitations of SQL Server’s DateTime is important here. A few solutions come to mind:

  • Maybe use a constant value instead of “now”
  • Modify the database columns to use SQL Server’s DATETIME2 which has a better resolution
  • Just fetch “now” from the database. I like this idea. When I use it again later, I’ll go into more detail.

Time Zones (Of Course)


Running integration tests that talk to a database on a separate servercan mean translating server times back and forth between both servers. This leads to another common source of flakiness: time zones. It’s not easy to avoid this kind of issue. Both Azure and AWS try to tackle this by use UTC everywhere.

A flaky test might look like this.

public void UpdateMessage_DateIsUpdated_2() {
    string message = Guid.NewGuid().ToString();
    DateTime now = DateTime.Now;
    int logId = m_data.AddLogMessageWithDate( message, now );
 
    string newMessage = Guid.NewGuid().ToString();
    m_data.UpdateLogMessage( logId, newMessage );
 
    LogMessageDto? dto = m_data.GetLogMessage( logId );
 
    // This next assertion can fail if the database is in a different time zone        
    Assert.GreaterOrEqual( dto.LastUpdate, now );
}

It fails like this:

    Failed UpdateMessage_DateIsUpdated_2 [19 ms]
    Error Message:
    Expected: greater than or equal to 2022-02-05 21:06:54.521464
    But was: 2022-02-05 16:06:54.52

Why is this pattern a source of flaky tests? The success of the test depends on the time zones of the test server and the database server. But even if you control both time zones, this particular example is still vulnerable to clock drift as we’ll see later.

The Right Way
Use a constant time or try fetching “now” from the database.

DateTime now = m_nowProvider.Now();

Here I’m using a method I wrote which simply returns the value of SELECT GETDATE(); from the database.

Clock Drift


Related to time zones is clock drift which again causes errors when you compare dates from two different servers.

No server’s clock is perfect and I like to think of each server’s clock as having its own time zone. Windows tells me that my laptop is set at (UTC -05:00) but with clock drift it’s probably something like (UTC -05:00:01.3). You can work at synchronizing clocks, but unless you’re testing that synchronization, you shouldn’t depend on it in your tests.

Just like in the case with time zones, this test may fail when it compares times from two different clocks:

public void UpdateMessage_DateIsUpdated_3() {
    string message = Guid.NewGuid().ToString();
    DateTime now = DateTime.Now;
    int logId = m_data.AddLogMessageWithDate( message, now );
 
    string newMessage = Guid.NewGuid().ToString();
    m_data.UpdateLogMessage( logId, newMessage );
 
    LogMessageDto? dto = m_data.GetLogMessage( logId );
 
    // This next test can fail if the clocks on the database server is off by a few seconds
    Assert.GreaterOrEqual( dto.LastUpdate, now );
}

The Right Way
Just like before, use a constant value or try fetching “now” from the database.

DateTime now = m_nowProvider.Now();

This way we’re comparing times from only one server’s clock.

Daylight Savings (Of Course)


This next test is flaky because of daylight savings time. It’s not specific to SQL Server but I thought I’d include it because I have been burned by this before:

[Test]    
public void StoreDateInTheFuture() {
    string message = Guid.NewGuid().ToString();
    DateTime inAMonth = DateTime.Now + TimeSpan.FromDays( 30 );        
 
    // CovertTime may fail because "a month from now" may be an invalid DateTime (with daylight savings)
    inAMonth = TimeZoneInfo.ConvertTime( inAMonth, TimeZoneInfo.Local );
    m_data.AddLogMessageWithDate( message, inAMonth );
    Assert.Pass();
}

I saw a test just like this one fail at 2:18 AM on February 9th, 2018. Adding 30 days to that date brought us to 2:18AM which was right in the middle of the hour we were skipping for daylight savings time and that’s what caused the error. This test fails with:

    Failed StoreDateInTheFuture [32 ms]
    Error Message:
    System.ArgumentException : The supplied DateTime represents an invalid time. For example, when the clock is adjusted forward, any time in the period that is skipped is invalid. (Parameter ‘dateTime’)

Summary


Flaky tests come from non-deterministic tests. To quote Martin Fowler, “Few things are more non-deterministic than a call to the system clock”. Try to:

  • Write tests with hard coded dates
  • Avoid comparing dates sourced from two different clocks
  • Consider writing a “NowProvider” (which can be mocked!)
  • Be very deliberate about time zones
  • Be very deliberate about data types (both in C# and SQL Server)
Older Posts »

Powered by WordPress