In the past I’ve written about monitoring identity columns to ensure there’s room to grow.
But there’s a related danger that’s a little more subtle. Say you have a table whose identity column is an 8-byte bigint. An application that converts those values to a 4-byte integer will not always fail! Those applications will only fail if the value is larger than 2,147,483,647.
If the conversion of a large value is done in C#, you’ll get an OverflowException or an InvalidCastException, and if the conversion is done in SQL Server you’ll see this error message:
Msg 8115, Level 16, State 2, Line 21
Arithmetic overflow error converting expression to data type int.
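To see the boundary on the SQL Server side, here is a minimal sketch. The variable name is only for illustration, and TRY_CAST assumes SQL Server 2012 or later.

```sql
DECLARE @id bigint = 2147483647;                  -- largest value that fits in a 4-byte int
SELECT CAST(@id AS int) AS fits;                  -- succeeds

SET @id = @id + 1;                                -- 2147483648, one past the int maximum
SELECT TRY_CAST(@id AS int) AS try_cast_result;   -- returns NULL instead of raising an error
SELECT CAST(@id AS int) AS overflows;             -- raises Msg 8115 (arithmetic overflow)
```

The first CAST is the dangerous case: it keeps succeeding for years, which is why the defect stays hidden.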
The danger
If such conversions exist in your application, you won’t see any problems until the bigint identity values are larger than 2,147,483,647. My advice then is to test your application with large identity values in a test environment. But how?
Use this script to set large values on BIGINT identity columns
On a test server, run this script to get commands to adjust bigint identity values to beyond the maximum value of an integer:
-- increase bigint identity columns
select
'DBCC CHECKIDENT(''' +
QUOTENAME(OBJECT_SCHEMA_NAME(object_id)) + '.' +
QUOTENAME(object_Name(object_id)) + ''', RESEED, 2147483648);
' as script
from
sys.identity_columns
where
system_type_id = 127
and object_id in (select object_id from sys.tables);
-- increase bigint sequences
select
'ALTER SEQUENCE ' +
QUOTENAME(OBJECT_SCHEMA_NAME(object_id)) + '.' +
QUOTENAME(object_Name(object_id)) + '
RESTART WITH 2147483648 INCREMENT BY ' +
CAST(increment as sysname) +
' NO MINVALUE NO MAXVALUE;
' as script
from
sys.sequences
where
system_type_id = 127;
Prepared for testing
The identity columns in your test database are now prepared for testing. And hopefully you have an automated way to exercise your application code to find sneaky conversions to 4-byte integers. I found several of these hidden defects myself and I’m really glad I had the opportunity to tackle these before they became an issue in production.
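Before exercising the application, it may be worth confirming that the generated commands were actually run. The following is a rough sketch, not part of the original script, that reuses the same catalog views and filters and lists the current values so that any small ones stand out.

```sql
-- bigint identity columns and the last value they generated
SELECT
    QUOTENAME(OBJECT_SCHEMA_NAME(object_id)) + '.' + QUOTENAME(OBJECT_NAME(object_id)) AS table_name,
    name AS column_name,
    CAST(last_value AS bigint) AS last_value
FROM sys.identity_columns
WHERE system_type_id = 127
  AND object_id IN (SELECT object_id FROM sys.tables)
ORDER BY CAST(last_value AS bigint);

-- bigint sequences and their current value
SELECT
    QUOTENAME(OBJECT_SCHEMA_NAME(object_id)) + '.' + QUOTENAME(OBJECT_NAME(object_id)) AS sequence_name,
    CAST(current_value AS bigint) AS current_value
FROM sys.sequences
WHERE system_type_id = 127
ORDER BY CAST(current_value AS bigint);
```

Rows that still show values below 2147483648 (or NULL) are worth a second look before testing begins.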
When I wrote Take Care When Scripting Batches, I wanted to guard against a common pitfall when implementing a batching solution (n-squared performance). I suggested a way to be careful. But I knew that my solution was not going to be universally applicable to everyone else’s situation. So I wrote that post with a focus on how to evaluate candidate solutions.
But we developers love recipes for problem solving. I wish it was the case that for whatever kind of problem you got, you just stick the right formula in and problem solved. But unfortunately everyone’s situation is different and the majority of questions I get are of the form “What about my situation?” I’m afraid that without extra details, the best advice remains to do the work to set up the tests and find out for yourself.
But despite that, I’m still going to answer some common questions I get, and I’m going to continue to focus on how I evaluate each solution.
(Before reading further, you might want to re-familiarize yourself with the original article Take Care When Scripting Batches).
Here are some questions I get:
What if the clustered index is not unique?
Or what if the clustered index has more than one column and the leading column is not unique? For example, imagine the table was created with this clustered primary key:
How do we write a batching script in that case? It’s usually okay if you just use the leading column of the clustered index. The careful batching script looks like this now:
DECLARE
@LargestKeyProcessed DATETIME = '20000101',
@NextBatchMax DATETIME,
@RC INT = 1;
WHILE (@RC > 0)
BEGIN
SELECT TOP (1000) @NextBatchMax = DateKey
FROM dbo.FactOnlineSales
WHERE DateKey > @LargestKeyProcessed
AND CustomerKey = 19036
ORDER BY DateKey ASC;
DELETE dbo.FactOnlineSales
WHERE CustomerKey = 19036
AND DateKey > @LargestKeyProcessed
AND DateKey <= @NextBatchMax;
SET @RC = @@ROWCOUNT;
SET @LargestKeyProcessed = @NextBatchMax;
END
The performance is definitely comparable to the original careful batching script:
Logical Reads Per Delete
But is it correct? A lot of people wonder if the non-unique index breaks the batching somehow. And the answer is yes, but it doesn’t matter too much.
By limiting the batches by DateKey instead of the unique OnlineSalesKey, we are giving up batches that are exactly 1000 rows each. In fact, most of the batches in my test process somewhere between 1000 and 1100 rows and the whole thing requires three fewer batches than the original script. That’s acceptable to me.
If I know that the leading column of the clustering key is selective enough to keep the batch sizes pretty close to the target size, then the script is still accomplishing its goal.
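One way to check that assumption up front is to count how many of the rows to delete share each DateKey. This is a sketch against the same table and customer as above. A batch can overshoot the target of 1000 only by rows that share the DateKey of the last row in the batch, so the overshoot is always smaller than the largest count returned here.

```sql
-- Largest groups of to-be-deleted rows that share a single DateKey
SELECT TOP (10)
    DateKey,
    COUNT(*) AS RowsForDateKey
FROM dbo.FactOnlineSales
WHERE CustomerKey = 19036
GROUP BY DateKey
ORDER BY COUNT(*) DESC;
```

If the top counts are small compared to the batch size, limiting batches by the leading column should behave much like the original script.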
What if the rows I have to delete are sparse?
Here’s another situation. What if instead of customer 19036, we were asked to delete customer 7665? This time, instead of deleting 45100 rows, we only have to delete 379 rows.
I try the careful batching script and see that all rows are deleted in a single batch. SQL Server was looking for batches of 1000 rows to delete. But since there aren’t that many, it scanned the entire table to find just 379 rows. It completed in one batch, but that single batch performed as poorly as the straight algorithm.
One solution is to create an index (online!) for these rows. Something like:
CREATE INDEX IX_CustomerKey
ON dbo.FactOnlineSales(CustomerKey)
WITH (ONLINE = ON);
Most batching scripts are one-time use. So maybe this index is one-time use as well. If it’s a temporary index, just remember to drop it after the script is complete. A temp table could also do the same trick.
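For the temporary-index route, the cleanup step once every batch has finished could be as small as the sketch below. DROP INDEX IF EXISTS assumes SQL Server 2016 or later; on older versions, drop the IF EXISTS clause.

```sql
-- Run only after the batching script has completed.
DROP INDEX IF EXISTS IX_CustomerKey ON dbo.FactOnlineSales;
```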
With the index, the straight query only needed 3447 logical reads to find all the rows to delete:
DELETE dbo.FactOnlineSales WHERE CustomerKey = 7665;
Logical Reads
Can I use the Naive algorithm if I use a new index?
How do the Naive and other algorithms fare with this new index on dbo.FactOnlineSales(CustomerKey)?
The rows are now so easy to find that the Naive algorithm no longer has the n-squared behavior we worried about earlier. But there is some extra overhead. We have to delete from more than one index. And we’re doing many b-tree lookups (instead of just scanning a clustered index).
DECLARE @RC INT = 1;
WHILE (@RC > 0)
BEGIN
DELETE TOP (1000) dbo.FactOnlineSales
WHERE CustomerKey = 19036;
SET @RC = @@ROWCOUNT
END
But now with the index, the performance looks like this (category Naive with Index)
The index definitely helps. With the index, the Naive algorithm looks better than it did without it, but it still looks worse than the careful batching algorithm.
But look at that consistency! Each batch processes 1000 rows and reads exactly the same amount. I might choose to use Naive batching with an index if I don’t know how sparse the rows I’m deleting are. There are a lot of benefits to having a constant runtime for each batch when I can’t guarantee that rows aren’t sparse.
Explore new solutions on your own
There are many different solutions I haven’t explored. This list isn’t comprehensive.
But it’s all tradeoffs. When faced with a choice between candidate solutions, it’s essential to know how to test and measure each solution. SQL Server has more authoritative answers about the behavior of SQL Server than me or anyone else. Good luck.
Just like PIVOT syntax, UNPIVOT syntax is hard to remember.
When I can, I prefer to pivot and unpivot in the application, but here’s a function I use sometimes when I don’t want to scroll horizontally in SSMS.
CREATE OR ALTER FUNCTION dbo.GenerateUnpivotSql (@Sql NVARCHAR(MAX))
RETURNS NVARCHAR(MAX) AS
BEGIN
RETURN '
WITH Q AS
(
SELECT TOP (1) ' +
(
SELECT
STRING_AGG(
CAST(
'CAST(' + QUOTENAME(NAME) + ' AS sql_variant) AS ' + QUOTENAME(NAME)
AS NVARCHAR(MAX)
), ',
'
)
FROM sys.dm_exec_describe_first_result_set(@sql, DEFAULT, DEFAULT)
) + '
FROM (
' + @sql + '
) AS O
)
SELECT U.FieldName, U.FieldValue
FROM Q
UNPIVOT (FieldValue FOR FieldName IN (' +
(
SELECT STRING_AGG( CAST( QUOTENAME(name) AS NVARCHAR(MAX) ), ',
' )
FROM sys.dm_exec_describe_first_result_set(@sql, DEFAULT, DEFAULT)
) + '
)) AS U';
END
GO
And you might use it like this:
declare @sql nvarchar(max) = 'SELECT * FROM sys.databases WHERE database_id = 2';
declare @newsql nvarchar(max) = dbo.GenerateUnpivotSql(@sql);
exec sp_executesql @sql;
exec sp_executesql @newsql;
I find this function useful whenever I want to take a quick look at one row without all that horizontal scrolling, like when looking at sys.dm_exec_query_stats and other wide DMVs. This function is minimally tested, so caveat emptor.
System procedures like sp_replincrementlsn and system functions like fn_cdc_get_min_lsn and fn_cdc_get_max_lsn return values that are of type binary(10).
These values represent LSNs (Log Sequence Numbers), which are an internal way to order the records in the transaction log.
Typically as developers, we don’t care about these values. But when we do want to dig into the transaction log, we can do so with sys.fn_dblog, which takes two optional parameters. These parameters are LSN values which limit the results of sys.fn_dblog. But the weird thing is that sys.fn_dblog is a function whose LSN parameters are NVARCHAR(25).
The function sys.fn_dblog doesn’t expect binary(10) values for its LSN parameters; it wants the LSN values as a formatted string, something like: 0x00000029:00001a3c:0002.
Well, to convert the binary(10) LSN values into the format expected by sys.fn_dblog, I came up with this function:
CREATE OR ALTER FUNCTION dbo.fn_lsn_to_dblog_parameter(
@lsn BINARY(10)
)
RETURNS NVARCHAR(25)
AS
BEGIN
RETURN
NULLIF(
STUFF (
STUFF (
'0x' + CONVERT(NVARCHAR(25), @lsn, 2),
11, 0, ':' ),
20, 0, ':' ),
'0x00000000:00000000:0000'
)
END
GO
Example
I can increment the LSN once with a no-op and get back the lsn value with sp_replincrementlsn.
I can then use fn_lsn_to_dblog_parameter to get an LSN string to use as parameters to sys.fn_dblog.
This helps me find the exact log entry in the transaction that corresponds to that no-op:
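The original example is not reproduced here, so what follows is only a sketch of those three steps under stated assumptions: that sp_replincrementlsn returns the new LSN through an output parameter, that the current database permits the call (this has not been verified for any particular configuration), and that passing the same string as both the start and end parameter narrows sys.fn_dblog to that one record. The variable names are illustrative.

```sql
-- 1. Write a no-op log record and capture its LSN
DECLARE @lsn binary(10);
EXEC sp_replincrementlsn @lsn OUTPUT;

-- 2. Convert the binary(10) LSN into the string format sys.fn_dblog expects
DECLARE @dblog_lsn nvarchar(25) = dbo.fn_lsn_to_dblog_parameter(@lsn);
SELECT @lsn AS lsn_binary, @dblog_lsn AS lsn_for_fn_dblog;

-- 3. Use the same value as start and end LSN to look up that log record
SELECT [Current LSN], Operation, Context, [Transaction ID]
FROM sys.fn_dblog(@dblog_lsn, @dblog_lsn);
```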
To avoid deadlocks when implementing the upsert pattern, make sure the index on the key column is unique. It’s not sufficient that all the values in that particular column happen to be unique. The index must be defined to be unique, otherwise concurrent queries can still produce deadlocks.
Say I have a table with an index on Id (which is not unique):
CREATE TABLE dbo.UpsertTest(
Id INT NOT NULL,
IdString VARCHAR(100) NOT NULL,
INDEX IX_UpsertTest CLUSTERED (Id)
)
CREATE OR ALTER PROCEDURE dbo.s_DoSomething
AS
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
DECLARE @Id BIGINT = DATEPART(SECOND, GETDATE());
DECLARE @IdString VARCHAR(100) = CAST(@Id AS VARCHAR(100));
IF EXISTS (
SELECT *
FROM dbo.UpsertTest WITH (UPDLOCK)
WHERE Id = @Id
)
BEGIN
UPDATE dbo.UpsertTest
SET IdString = @IdString
WHERE Id = @Id;
END
ELSE
BEGIN
INSERT dbo.UpsertTest (Id, IdString)
VALUES (@Id, @IdString);
END;
COMMIT
When I exercise this procedure concurrently with many threads it produces deadlocks! I can use extended events and the output from trace flag 1200 to find out what locks are taken and in what order.
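If setting up extended events or trace flag 1200 is inconvenient, a rougher alternative (not the method used to produce the lock lists in this post) is to hold a transaction open in one session and inspect sys.dm_tran_locks. This sketch assumes the dbo.UpsertTest table above and a hypothetical Id value of 1. It shows only the locks still held at that moment, not the order in which they were acquired.

```sql
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;

-- Same existence check as the procedure, with an arbitrary Id
IF EXISTS (SELECT * FROM dbo.UpsertTest WITH (UPDLOCK) WHERE Id = 1)
    PRINT 'row exists';
ELSE
    PRINT 'row does not exist';

-- Locks currently held or requested by this session
SELECT resource_type, resource_description, request_mode, request_status
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID
ORDER BY resource_type, resource_description;

ROLLBACK TRANSACTION;
```

Key-range locks appear in request_mode with names such as RangeS-U, and the hashed key resources appear in resource_description.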
What Locks Are Taken?
It depends on the result of the IF statement. There are two main scenarios to look at. Either the row exists or it doesn’t.
Scenario A: The Row Does Not Exist (Insert)
These are the locks that are taken:
For the IF EXISTS statement:
Acquire RangeS-U lock on resource (ffffffffffff) which represents “infinity”
For the Insert statement:
Acquire RangeI-N lock on resource (ffffffffffff)
Acquire X lock on resource (66467284bfa8) which represents the newly inserted row
Scenario B: The Row Exists (Update)
The locks that are taken are:
For the IF EXISTS statement:
Acquire RangeS-U lock on resource (66467284bfa8)
For the Update statement:
Acquire RangeX-X lock on resource (66467284bfa8)
Acquire RangeX-X lock on resource (ffffffffffff)
Scenario C: The Row Does Not Exist, But Another Process Inserts First (Update)
There’s a bonus scenario that begins just like the Insert scenario, but the process is blocked waiting for resource (ffffffffffff). Once it finally acquires the lock, the next locks that are taken look the same as the other Update scenario. The locks that are taken are:
For the IF EXISTS statement:
Wait for RangeS-U lock on resource (ffffffffffff)
Acquire RangeS-U lock on resource (ffffffffffff)
Acquire RangeS-U lock on resource (66467284bfa8)
For the Update statement:
Acquire RangeX-X lock on resource (66467284bfa8)
Acquire RangeX-X lock on resource (ffffffffffff)
The Deadlock
And when I look at the deadlock graph, I can see that the two update scenarios (Scenario B and C) are fighting:
Scenario B:
Acquire RangeX-X lock on resource (66467284bfa8) during UPDATE
Blocked RangeX-X lock on resource (ffffffffffff) during UPDATE
Scenario C:
Acquire RangeS-U lock on resource (ffffffffffff) during IF EXISTS
Blocked RangeS-U lock on resource (66467284bfa8) during IF EXISTS
Why Isn’t This A Problem With Unique Indexes?
To find out, let’s take a look at one last scenario where the index is unique:
Scenario D: The Row Exists (Update on Unique Index)
For the IF EXISTS statement:
Acquire U lock on resource (66467284bfa8)
For the Update statement:
Acquire X lock on resource (66467284bfa8)
Visually, I can compare scenario B with Scenario D:
When the index is not unique, SQL Server has to take key-range locks on either side of the row to prevent phantom inserts, but it’s not necessary when the values are guaranteed to be unique! And that makes all the difference. When the index is unique, no lock is required on resource (ffffffffffff). There is no longer any potential for a deadlock.
Solution: Define Indexes As Unique When Possible
Even if the values in a column are unique in practice, you’ll help improve concurrency by defining the index as unique. This tip can be generalized to other deadlocks. Next time you’re troubleshooting a deadlock involving range locks, check to see whether the participating indexes are unique.
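For the dbo.UpsertTest example, redefining the index could look like the sketch below. It drops and recreates the clustered index, which rebuilds the table, so on a large table this is a change to plan for rather than run casually. The duplicate check is an added precaution, not something from the original investigation.

```sql
-- Any rows returned here must be resolved before a unique index can be created.
SELECT Id, COUNT(*) AS copies
FROM dbo.UpsertTest
GROUP BY Id
HAVING COUNT(*) > 1;

-- Replace the non-unique clustered index with a unique one of the same name.
DROP INDEX IX_UpsertTest ON dbo.UpsertTest;
CREATE UNIQUE CLUSTERED INDEX IX_UpsertTest ON dbo.UpsertTest (Id);
```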
This quirk of requiring unique indexes for the UPSERT pattern is not unique to SQL Server. I notice that PostgreSQL requires a unique index when using their “ON CONFLICT … DO UPDATE” syntax. This is something they chose to do very deliberately.
Other Things I Tried
This post actually comes from a real problem I was presented. It took a while to reproduce and I tried a few things before I settled on making my index unique.
Lock More During IF EXISTS?
Notice that there is only one range lock taken during the IF EXISTS statement, but there are two range locks needed for the UPDATE statement. Why is only one needed for the EXISTS statement? If extra rows get inserted above the row that was read, that doesn’t change the answer to EXISTS. So it’s technically not a phantom read and so SQL Server doesn’t take that lock.
So what if I changed my IF EXISTS to
IF (
SELECT COUNT(*)
FROM dbo.UpsertTest WITH (UPDLOCK)
WHERE Id = @Id
) > 0
That IF statement now takes two range locks which is good, but it still gets tripped up with Scenario C and continues to deadlock.
Update Less?
Change the update statement to only update one row using TOP (1)
UPDATE TOP (1) dbo.UpsertTest
SET IdString = @IdString
WHERE Id = @Id;
During the update statement, this only requires one RangeX-X lock instead of two. And that technique actually works! I was unable to reproduce deadlocks with TOP (1). So it is indeed a candidate solution, but making the index unique is still my preferred method.
The configuration setting cost threshold for parallelism has a default value of 5. As a default value, it’s probably too low and should be raised. But what benefit are we hoping for? And how can we measure it?
The database that I work with is a busy OLTP system with lots of very frequent, very inexpensive queries and so I don’t like to see any query that needs to go parallel.
What I’d like to do is raise the configuration cost threshold to something larger and look at the queries that have gone from multi-threaded to single-threaded. I want to see that these queries become cheaper on average. By cheaper I mean consume less cpu. I expect the average duration of these queries to increase.
How do I find these queries? I can look in the cache. The view sys.dm_exec_query_stats can tell me if a query plan is parallel, and I can look into the plans for the estimated cost. In my case, I have relatively few parallel queries. Only about 300 which means the xml parsing piece of this query runs pretty quickly.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
sql_text.[text] as sqltext,
qp.query_plan,
xml_values.subtree_cost as estimated_query_cost_in_query_bucks,
qs.last_dop,
CAST( qs.total_worker_time / (qs.execution_count + 0.0) as money ) as average_query_cpu_in_microseconds,
qs.total_worker_time,
qs.execution_count,
qs.query_hash,
qs.query_plan_hash,
qs.plan_handle,
qs.sql_handle
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
CROSS APPLY sys.dm_exec_query_plan (qs.plan_handle) qp
CROSS APPLY
(
SELECT SUBSTRING(st.[text],(qs.statement_start_offset + 2) / 2,
(CASE
WHEN qs.statement_end_offset = -1 THEN LEN(CONVERT(NVARCHAR(MAX),st.[text])) * 2
ELSE qs.statement_end_offset + 2
END - qs.statement_start_offset) / 2)
) as sql_text([text])
OUTER APPLY
(
SELECT
n.c.value('@QueryHash', 'NVARCHAR(30)') as query_hash,
n.c.value('@StatementSubTreeCost', 'FLOAT') as subtree_cost
FROM qp.query_plan.nodes('//StmtSimple') as n(c)
) xml_values
WHERE qs.last_dop > 1
AND sys.fn_varbintohexstr(qs.query_hash) = xml_values.query_hash
AND execution_count > 10
ORDER BY xml_values.subtree_cost
OPTION (RECOMPILE);
What Next?
Keep track of the queries you see whose estimated subtree cost is below the new threshold you’re considering. Especially keep track of the query_hash and the average_query_cpu_in_microseconds.
Then make the change and compare the average_query_cpu_in_microseconds before and after. Remember to use the query_hash as the key because the query_plan_hash will have changed.
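Making the change itself is a server configuration step. A minimal sketch follows; 50 is the value tried in this post, not a universal recommendation, and the setting is an advanced option, so it has to be made visible first.

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Show the current value (config_value and run_value columns)
EXEC sp_configure 'cost threshold for parallelism';

-- Raise the threshold
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;
```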
Here’s the query modified to return the “after” results:
Measure the Cost of Those Queries After Config Change
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
sql_text.[text] as sqltext,
qp.query_plan,
xml_values.subtree_cost as estimated_query_cost_in_query_bucks,
qs.last_dop,
CAST( qs.total_worker_time / (qs.execution_count + 0.0) as money ) as average_query_cpu_in_microseconds,
qs.total_worker_time,
qs.execution_count,
qs.query_hash,
qs.query_plan_hash,
qs.plan_handle,
qs.sql_handle
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
CROSS APPLY sys.dm_exec_query_plan (qs.plan_handle) qp
CROSS APPLY
(
SELECT SUBSTRING(st.[text],(qs.statement_start_offset + 2) / 2,
(CASE
WHEN qs.statement_end_offset = -1 THEN LEN(CONVERT(NVARCHAR(MAX),st.[text])) * 2
ELSE qs.statement_end_offset + 2
END - qs.statement_start_offset) / 2)
) as sql_text([text])
OUTER APPLY
(
SELECT
n.c.value('@QueryHash', 'NVARCHAR(30)') as query_hash,
n.c.value('@StatementSubTreeCost', 'FLOAT') as subtree_cost
FROM qp.query_plan.nodes('//StmtSimple') as n(c)
) xml_values
WHERE qs.query_hash in ( /* put the list of query_hash values you saw from before the config change here */ )
AND sys.fn_varbintohexstr(qs.query_hash) = xml_values.query_hash
ORDER BY xml_values.subtree_cost
OPTION (RECOMPILE);
What I Found
Increasing the threshold from 5 to 50 generally had a very good effect on those queries that went from multithreaded to single-threaded. Over half of the queries improved by at least one order of magnitude (and a couple improved by three orders of magnitude!)
I have trouble with procedures that use SELECT *. They are often not “Blue-Green safe”. In other words, if a procedure has a query that uses SELECT * then I can’t change the underlying tables without causing some tricky deployment issues. (The same is not true for ad hoc queries from the application).
I also have a lot of procedures to look at (about 5000) and I’d like to find the procedures that use SELECT *.
I want to maybe ignore SELECT * when selecting from a subquery with a well-defined column list.
I also want to maybe include related queries like OUTPUT inserted.*.
The Plan
So I’m going to make a schema-only copy of the database to work with.
I’m going to add a new dummy-column to every single table.
I’m going to use sys.dm_exec_describe_first_result_set_for_object to look for any of the new columns I created
Any of my new columns that show up were selected with SELECT *.
The Script
use master;
DROP DATABASE IF EXISTS search_for_select_star;
DBCC CLONEDATABASE (the_name_of_the_database_you_want_to_analyze, search_for_select_star);
ALTER DATABASE search_for_select_star SET READ_WRITE;
GO
use search_for_select_star;
DECLARE @SQL NVARCHAR(MAX);
SELECT
@SQL = STRING_AGG(
CAST(
'ALTER TABLE ' +
QUOTENAME(OBJECT_SCHEMA_NAME(object_id)) +
'.' +
QUOTENAME(OBJECT_NAME(object_id)) +
' ADD NewDummyColumn BIT NULL' AS NVARCHAR(MAX)),
N';')
FROM
sys.tables;
exec sp_executesql @SQL;
SELECT
SCHEMA_NAME(p.schema_id) + '.' + p.name AS procedure_name,
r.column_ordinal,
r.name
FROM
sys.procedures p
CROSS APPLY
sys.dm_exec_describe_first_result_set_for_object(p.object_id, NULL) r
WHERE
r.name = 'NewDummyColumn'
ORDER BY
p.schema_id, p.name;
use master;
DROP DATABASE IF EXISTS search_for_select_star;
Update
Tom from StraightforwardSQL pointed out a nifty feature that Microsoft has already implemented.
Not sure here, but doesn't dm_sql_referenced_entities.is_select_all achieve the same thing?
select distinct SCHEMA_NAME(p.schema_id) + '.' + p.name AS procedure_name
from sys.procedures p
cross apply sys.dm_sql_referenced_entities(
object_schema_name(object_id) + '.' + object_name(object_id), default) re
where re.is_select_all = 1
Comparing the two, I noticed that my query – the one that uses dm_exec_describe_first_result_set_for_object – has some drawbacks. Maybe the SELECT * isn’t actually included in the first result set, but some subsequent result set. Or maybe the result set couldn’t be described for one of various reasons.
On the other hand, I noticed that dm_sql_referenced_entities has a couple drawbacks itself. It doesn’t seem to capture select statements that use `OUTPUT INSERTED.*` for example.
In practice though, I found the query that Tom suggested works a bit better. In the product I work most closely with, dm_sql_referenced_entities only missed 3 procedures that dm_exec_describe_first_result_set_for_object caught. But dm_exec_describe_first_result_set_for_object missed 49 procedures that dm_sql_referenced_entities caught!
Takeaway: For most use cases, using sp_releaseapplock is unnecessary, especially when using @LockOwner = 'Transaction' (which is the default).
The procedure sp_getapplock is a system stored procedure that can be helpful when developing SQL for concurrency. It takes a lock on an imaginary resource and it can be used to avoid race conditions.
But I don’t use sp_getapplock a lot. I almost always depend on SQL Server’s normal locking of resources (like tables, indexes, rows etc…). But I might consider it for complicated situations (like managing sort order in a hierarchy using a table with many different indexes).
But there’s a problem with this pattern, especially when using RCSI. After sp_releaseapplock is called, but before the COMMIT completes, another process running the same code can read the previous state. In the example above, both processes will think a time slot is available and will try to make the same reservation.
What I really want is to release the applock after the commit. But because I specified that the lock owner is 'Transaction', that gets done automatically when the transaction ends! So really what I want is this:
BEGIN TRAN
exec sp_getapplock
@Resource = @LockResourceName,
@LockMode = 'Exclusive',
@LockOwner = 'Transaction';
/* read stuff (e.g. "is time slot available?") */
/* change stuff (e.g. "make reservation") */
COMMIT -- all locks are freed after this commit
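One refinement worth considering is checking the return code of sp_getapplock, since a negative value means the lock was not granted and the protected work should not run. This is a sketch under assumptions: the resource name and the five-second timeout are hypothetical, and the error number 50000 is arbitrary.

```sql
DECLARE @LockResourceName nvarchar(255) = N'ReservationSlot'; -- hypothetical resource name
DECLARE @result int;

BEGIN TRAN
    EXEC @result = sp_getapplock
        @Resource = @LockResourceName,
        @LockMode = 'Exclusive',
        @LockOwner = 'Transaction',
        @LockTimeout = 5000; -- milliseconds; an assumed value

    -- 0 and 1 mean the lock was granted; negative values mean it was not
    -- (-1 timeout, -2 canceled, -3 deadlock victim, -999 other error)
    IF @result < 0
    BEGIN
        ROLLBACK;
        THROW 50000, 'Could not acquire the application lock.', 1;
    END

    /* read stuff (e.g. "is time slot available?") */
    /* change stuff (e.g. "make reservation") */
COMMIT -- the applock is released here, with the rest of the transaction's locks
```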
But that gives the total amount of waits for each wait type accumulated since the server was started. And that isn’t ideal when I’m troubleshooting trouble that started recently. No worries, Paul also has another fantastic post Capturing wait statistics for a period of time.
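The idea of capturing waits over an interval can be sketched as two snapshots and a difference. This is not Paul’s script and not the collection scripts mentioned below, just a minimal illustration; the temp table name and the 30-second interval are arbitrary.

```sql
-- First snapshot of the cumulative counters
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
INTO #WaitsBefore
FROM sys.dm_os_wait_stats;

WAITFOR DELAY '00:00:30';

-- Difference between now and the first snapshot
SELECT
    a.wait_type,
    a.waiting_tasks_count - b.waiting_tasks_count AS waits_in_interval,
    a.wait_time_ms - b.wait_time_ms AS wait_ms_in_interval,
    a.signal_wait_time_ms - b.signal_wait_time_ms AS signal_ms_in_interval
FROM sys.dm_os_wait_stats a
JOIN #WaitsBefore b
    ON b.wait_type = a.wait_type
WHERE a.wait_time_ms - b.wait_time_ms > 0
ORDER BY wait_ms_in_interval DESC;

DROP TABLE #WaitsBefore;
```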
Taking that idea further, I can collect data all the time and look at it historically, or just for a baseline. Lots of monitoring tools do this already, but here’s what I’ve written:
Mostly I’m creating these scripts for me. I’ve created a version of these a few times now and for some reason, I can’t find them each time I need them again!
This stuff can be super useful, especially if you combine it with a visualization tool (like PowerBI or even Excel).
For example, here’s a chart I made when we were experiencing the XVB_LIST spinlock issues I wrote about not too long ago. Good visualizations can really tell powerful stories.
I’m talking here about spins and not waits of course, but the idea is the same and I’ve included the spinlock monitoring scripts in the same repo on github.
Scaling SQL Server High
The beginning of the school year is behind us and what a semester start! 2020 has been tough on many of us and I’m fortunate to work for a company whose services are in such high demand. In fact we’ve seen some scaling challenges like we’ve never seen before. I want to talk about some of them.
Detect Excessive Spinlock Contention on SQL Server
Context
As we prepared to face unprecedented demand this year, we began to think about whether bigger is better. Worried about CPU limits, we looked to what AWS had to offer in terms of their instance sizes.
We were already running our largest SQL Servers on r5 instances with 96 logical CPUs. But we decided to evaluate the pricy u instances which have 448 logical CPUs and a huge amount of memory.
Painful Symptoms
Well, bigger is not always better. We discovered that as we increased the load on the u-series servers, there would come a point where all processors would jump to 100% and stay there. You could say it plateaued (based on the graph, would that be a plateau? A mesa? Or a butte?)
When that occurred, the number of batch requests that the server could handle dropped significantly. So we saw more CPU use, but less work was getting done.
The high demand kept the CPU at 100% with no relief until the demand decreased. When that happened, the database seemed to recover. Throughput was restored and the database’s metrics became healthy again. During this trouble we looked at everything including the number of spins reported in the sys.dm_os_spinlock_stats dmv.
The spins and backoffs reported seemed extremely high, especially for the category “XVB_LIST”, but we didn’t really have a baseline to tell whether those numbers were problematic. Even after capturing the numbers and visualizing them we saw larger than linear increases as demand increased, but were those increases excessive?
How To Tell For Sure
Chris Adkin has a post Diagnosing Spinlock Problems By Doing The Math. He explains why spinlocks are useful. It doesn’t seem like a while loop that chews up CPU could improve performance, but it actually does when it helps avoid context switches. He gives a formula to help find how much of the total CPU is spent spinning. That percentage can then help decide whether the spinning is excessive.
But I made a tiny tweak to his formula and I wrote a script to have SQL Server do the math:
You still have to give the number of CPUs on your server. If you don’t have those numbers handy, you can get them from SQL Server’s log. I include one of Glenn Berry’s diagnostic queries for that.
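As an alternative to reading the error log, sys.dm_os_sys_info may expose the same numbers directly. Treat this as an assumption to verify on your build: the socket_count and cores_per_socket columns are believed to exist from SQL Server 2016 SP2 onward, and it is worth comparing what they report against the error log message before plugging them into the script.

```sql
SELECT
    cpu_count,          -- logical CPUs visible to SQL Server
    hyperthread_ratio,  -- logical CPUs exposed per physical processor package
    socket_count,
    cores_per_socket
FROM sys.dm_os_sys_info;
```

If the query fails because the columns are missing, fall back to the error log query in the script.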
There’s an assumption in Chris’s calculation that one spin consumes one CPU clock cycle. A spin is really cheap (because it can use the test-and-set instruction), but it probably consumes more than one clock cycle. I assume four, but I have no idea what the actual value is.
EXEC sys.xp_readerrorlog 0, 1, N'detected', N'socket';
-- SQL Server detected 2 sockets with 24 cores per socket ...
declare @Sockets int = 2;
declare @PhysicalCoresPerSocket int = 24;
declare @TicksPerSpin int = 4;
declare @SpinlockSnapshot TABLE (
SpinLockName VARCHAR(100),
SpinTotal BIGINT
);
INSERT @SpinlockSnapshot ( SpinLockName, SpinTotal )
SELECT name, spins
FROM sys.dm_os_spinlock_stats
WHERE spins > 0;
DECLARE @Ticks bigint
SELECT @Ticks = cpu_ticks
FROM sys.dm_os_sys_info
WAITFOR DELAY '00:00:10'
DECLARE @TotalTicksInInterval BIGINT
DECLARE @CPU_GHz NUMERIC(20, 2);
SELECT @TotalTicksInInterval = (cpu_ticks - @Ticks) * @Sockets * @PhysicalCoresPerSocket,
@CPU_GHz = ( cpu_ticks - @Ticks ) / 10000000000.0
FROM sys.dm_os_sys_info;
SELECT ISNULL(Snap.SpinLockName, 'Total') as [Spinlock Name],
SUM(Stat.spins - Snap.SpinTotal) as [Spins In Interval],
@TotalTicksInInterval as [Ticks In Interval],
@CPU_GHz as [Measured CPU GHz],
100.0 * SUM(Stat.spins - Snap.SpinTotal) * @TicksPerSpin / @TotalTicksInInterval as [%]
FROM @SpinlockSnapshot Snap
JOIN sys.dm_os_spinlock_stats Stat
ON Snap.SpinLockName = Stat.name
GROUP BY ROLLUP (Snap.SpinLockName)
HAVING SUM(Stat.spins - Snap.SpinTotal) > 0
ORDER BY [Spins In Interval] DESC;
This is what I see on a very healthy server (r5.24xlarge). The server was using 14% cpu. And .03% of that is spent spinning (or somewhere in that ballpark).
More Troubleshooting Steps
So what’s going on? What is that XVB_LIST category? Microsoft says “internal use only”, but I can guess. Paul Randal talks about the related latch class Versioning Transaction List. It’s an instance-wide list that is used in the implementation of features like Read Committed Snapshot Isolation (RCSI) which we do use.
Microsoft also has a whitepaper on troubleshooting this stuff Diagnose and resolve spinlock contention on SQL Server. They actually give a technique to collect call stacks during spinlock contention in order to try and maybe glean some information about what else is going on. We did that, but we didn’t learn too much. We learned that we use RCSI with lots of concurrent queries. Something we really can’t give up on.
So Then What?
What We Did
Well, we moved away from the u instance with its hundreds of CPUs and we went back to our r5 instance with only (only!) 96 logical CPUs. We’re dealing with the limits imposed by that hardware and accepting that we can’t scale higher using that box. We’re continuing to do our darnedest to move data and activity out of SQL Server and into other solutions like DynamoDB. We’re also trying to partition our databases into different deployments which spreads the load out, but introduces a lot of other challenges.
Basically, we gave up trying to scale higher. If we did want to pursue this further (which we don’t), we’d probably contact Microsoft support to try and address this spinlock contention. We know that these conditions are sufficient (if not necessary) to see the contention we saw:
SQL Server 2016 SP2
U-series instance from Amazon
Highly concurrent and frequent queries (>200K batch requests per second with a good mix of writes and reads on the same tables)
RCSI enabled.
Thank you Erin Stellato
We reached out to Erin Stellato to help us through this issue. We did this sometime around the “Painful Symptoms” section above. We had a stressful time troubleshooting all this stuff and I really appreciate Erin guiding us through it. We learned so much.