But that gives the total amount of waits for each wait type accumulated since the server was started. And that isn’t ideal when I’m troubleshooting a problem that started recently. No worries, Paul also has another fantastic post Capturing wait statistics for a period of time.
Taking that idea further, I can collect data all the time and look at it historically, or just for a baseline. Lots of monitoring tools do this already, but here’s what I’ve written:
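The scripts live in the repo, but the core idea is a snapshot of sys.dm_os_wait_stats taken on a schedule so consecutive snapshots can be diffed later. Here’s a minimal sketch of that idea (the table and column names are mine, not necessarily what’s in the repo):

```sql
-- Minimal sketch: capture wait stats on a schedule (a SQL Agent job, say)
-- and diff consecutive snapshots later for the period you care about.
CREATE TABLE dbo.WaitStatsHistory (
    CollectionTime DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
    WaitType NVARCHAR(60) NOT NULL,
    WaitingTasksCount BIGINT NOT NULL,
    WaitTimeMs BIGINT NOT NULL,
    SignalWaitTimeMs BIGINT NOT NULL
);

INSERT dbo.WaitStatsHistory ( WaitType, WaitingTasksCount, WaitTimeMs, SignalWaitTimeMs )
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;
```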
Mostly I’m creating these scripts for me. I’ve created a version of these a few times now and for some reason, I can’t find them each time I need them again!
This stuff can be super useful, especially if you combine it with a visualization tool (like Power BI or even Excel).
For example, here’s a chart I made when we were experiencing the XVB_LIST spinlock issues I wrote about not too long ago. Good visualizations can really tell powerful stories.
I’m talking here about spins and not waits of course, but the idea is the same and I’ve included the spinlock monitoring scripts in the same repo on github.
Scaling SQL Server High
The beginning of the school year is behind us and what a semester start! 2020 has been tough on many of us and I’m fortunate to work for a company whose services are in such high demand. In fact we’ve seen some scaling challenges like we’ve never seen before. I want to talk about some of them.
Detect Excessive Spinlock Contention on SQL Server
Context
As we prepared to face unprecedented demand this year, we began to think about whether bigger is better. Worried about CPU limits, we looked to what AWS had to offer in terms of their instance sizes.
We were already running our largest SQL Servers on r5 instances with 96 logical CPUs. But we decided to evaluate the pricy u instances which have 448 logical CPUs and a huge amount of memory.
Painful Symptoms
Well, bigger is not always better. We discovered that as we increased the load on the u-series servers, there would come a point where all processors would jump to 100% and stay there. You could say it plateaued (based on the graph, would that be a plateau? A mesa? Or a butte?)
When that occurred, the number of batch requests that the server could handle dropped significantly. So we saw more CPU use, but less work was getting done.
The high demand kept the CPU at 100% with no relief until the demand decreased. When that happened, the database seemed to recover. Throughput was restored and the database’s metrics became healthy again. During this trouble, we looked at everything, including the number of spins reported in the sys.dm_os_spinlock_stats DMV.
The spins and backoffs reported seemed extremely high, especially for the category “XVB_LIST”, but we didn’t really have a baseline to tell whether those numbers were problematic. Even after capturing the numbers and visualizing them, we saw larger-than-linear increases as demand increased, but were those increases excessive?
How To Tell For Sure
Chris Adkin has a post Diagnosing Spinlock Problems By Doing The Math. He explains why spinlocks are useful. It doesn’t seem like a while loop that chews up CPU could improve performance, but it actually does when it helps avoid context switches. He gives a formula to help find how much of the total CPU is spent spinning. That percentage can then help decide whether the spinning is excessive.
But I made a tiny tweak to his formula and I wrote a script to have SQL Server do the math:
You still have to give the number of CPUs on your server. If you don’t have those numbers handy, you can get them from SQL Server’s log. I include one of Glenn Berry’s diagnostic queries for that.
There’s an assumption in Chris’s calculation that one spin consumes one CPU clock cycle. A spin is really cheap (because it can use the test-and-set instruction), but it probably consumes more than one clock cycle. I assume four, but I have no idea what the actual value is.
EXEC sys.xp_readerrorlog 0, 1, N'detected', N'socket';
-- SQL Server detected 2 sockets with 24 cores per socket ...
declare @Sockets int = 2;
declare @PhysicalCoresPerSocket int = 24;
declare @TicksPerSpin int = 4;
declare @SpinlockSnapshot TABLE (
SpinLockName VARCHAR(100),
SpinTotal BIGINT
);
INSERT @SpinlockSnapshot ( SpinLockName, SpinTotal )
SELECT name, spins
FROM sys.dm_os_spinlock_stats
WHERE spins > 0;
DECLARE @Ticks bigint
SELECT @Ticks = cpu_ticks
FROM sys.dm_os_sys_info
WAITFOR DELAY '00:00:10'
DECLARE @TotalTicksInInterval BIGINT
DECLARE @CPU_GHz NUMERIC(20, 2);
SELECT @TotalTicksInInterval = (cpu_ticks - @Ticks) * @Sockets * @PhysicalCoresPerSocket,
@CPU_GHz = ( cpu_ticks - @Ticks ) / 10000000000.0
FROM sys.dm_os_sys_info;
SELECT ISNULL(Snap.SpinLockName, 'Total') as [Spinlock Name],
SUM(Stat.spins - Snap.SpinTotal) as [Spins In Interval],
@TotalTicksInInterval as [Ticks In Interval],
@CPU_GHz as [Measured CPU GHz],
100.0 * SUM(Stat.spins - Snap.SpinTotal) * @TicksPerSpin / @TotalTicksInInterval as [%]
FROM @SpinlockSnapshot Snap
JOIN sys.dm_os_spinlock_stats Stat
ON Snap.SpinLockName = Stat.name
GROUP BY ROLLUP (Snap.SpinLockName)
HAVING SUM(Stat.spins - Snap.SpinTotal) > 0
ORDER BY [Spins In Interval] DESC;
This is what I see on a very healthy server (r5.24xlarge). The server was using 14% CPU, and 0.03% of that was spent spinning (or somewhere in that ballpark).
More Troubleshooting Steps
So what’s going on? What is that XVB_LIST category? Microsoft says “internal use only”, but I can guess. Paul Randal talks about the related latch class Versioning Transaction List. It’s an instance-wide list that is used in the implementation of features like Read Committed Snapshot Isolation (RCSI), which we do use.
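Since XVB_LIST seems tied to the versioning transaction list, one quick sanity check is whether RCSI or snapshot isolation is actually enabled on your databases. A minimal check:

```sql
-- Which databases have row-versioning features enabled?
SELECT name,
       is_read_committed_snapshot_on,   -- RCSI
       snapshot_isolation_state_desc    -- snapshot isolation
FROM sys.databases
ORDER BY name;
```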
Microsoft also has a whitepaper on troubleshooting this stuff Diagnose and resolve spinlock contention on SQL Server. They actually give a technique to collect call stacks during spinlock contention in order to try and maybe glean some information about what else is going on. We did that, but we didn’t learn too much. We learned that we use RCSI with lots of concurrent queries, something we really can’t give up.
So Then What?
What We Did
Well, we moved away from the u instance with its hundreds of CPUs and we went back to our r5 instance with only (only!) 96 logical CPUs. We’re dealing with the limits imposed by that hardware and accepting that we can’t scale higher using that box. We’re continuing to do our darnedest to move data and activity out of SQL Server and into other solutions like DynamoDb. We’re also trying to partition our databases into different deployments which spreads the load out, but introduces a lot of other challenges.
Basically, we gave up trying to scale higher. If we did want to pursue this further (which we don’t), we’d probably contact Microsoft support to try and address this spinlock contention. We know that these conditions are sufficient (if not necessary) to see the contention we saw:
SQL Server 2016 SP2
U-series instance from Amazon
Highly concurrent and frequent queries (>200K batch requests per second with a good mix of writes and reads on the same tables)
RCSI enabled.
Thank you Erin Stellato
We reached out to Erin Stellato to help us through this issue. We did this sometime around the “Painful Symptoms” section above. We had a stressful time troubleshooting all this stuff and I really appreciate Erin guiding us through it. We learned so much.
Scaling SQL Server High
It’s been a whirlwind of a month. The first day of school, or a new semester is always a busy time for online education. And in 2020 that’s an understatement. This year, we had to answer how busy could one SQL Server get?
We’ve always approached this question using the usual techniques
Cache like crazy (the cheapest query is the one that doesn’t have to be run)
Tackle expensive queries (CPU, IO)
Tackle large wait categories (so much easier said than done)
Offload reporting queries to other servers.
This year we smashed records and faced some interesting challenges along the way. I want to talk about those challenges in the next few blog posts. This first post is about user connections. Something I’ve never really paid much attention to before now.
A Metric I’d Never Thought I’d Have To Worry About
The maximum number of user connections that SQL Server can support is 32,767. That’s it. That’s the end of the line. You can buy faster I/O or a server with more CPUs, but you can’t buy more connections.
With such a hard limit, it’s important to watch this metric carefully. You can do that with the performance counter SQLServer: General Statistics – User Connections or with this query:
SELECT ISNULL(DB_NAME(database_id), 'Total On Server') AS DatabaseName,
COUNT(*) AS Connections,
COUNT(DISTINCT host_name) ClientMachines
FROM sys.dm_exec_sessions
WHERE host_name IS NOT NULL
GROUP BY ROLLUP( database_id )
The Maximum is 32,767
If you haven’t changed the maximum number of user connections by some method like sp_configure 'user connections', then the default is 0 and @@MAX_CONNECTIONS will return 32,767. I think the UI for this property is a bit misleading; 0 is absolutely not equivalent to unlimited.
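To see both the configured value and the effective limit, something like this works (the default 'user connections' value of 0 yields the 32,767 ceiling):

```sql
-- The configured 'user connections' value (0 by default)
SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'user connections';

-- The effective maximum (32,767 when the default 0 is in effect)
SELECT @@MAX_CONNECTIONS AS [Max Connections];
```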
What Does Trouble Look Like?
The issue shows up on the client when it’s unable to establish a connection to the server. There’s a variety of errors you might see such as:
Client unable to establish connection because an error was encountered during handshakes before login. Common causes include client attempting to connect to an unsupported version of SQL Server, server too busy to accept new connections or a resource limitation (memory or maximum allowed connections) on the server.
A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 0 – An existing connection was forcibly closed by the remote host.)
Or simply the generic:
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 40 – Could not open a connection to SQL Server)
It’s Almost Always a Symptom
It’s often a symptom and not the root cause. For example, imagine you have a busy server with lots of clients and there’s a sudden slowdown.
Perhaps the slowdown is an increase in a certain wait category like blocking (LCK_M_X) or perhaps the slowdown is an increase in CPU because of a bad query plan. Either way, queries suddenly require a longer duration. They can’t complete quickly and remain open. So any new queries that come along have to make brand new connections (because there are no available connections in the connection pool). Maybe THREADPOOL waits start to pile up. And now even more sessions are waiting and more sessions get created. Then you might reach the maximum number of connections in this scenario.
But decent monitoring will often identify the actual root cause here (the blocking or whatever). And it’s actually easy to ignore the user connection limit because it wasn’t really the bottleneck.
But Sometimes it’s the Root Cause
Once we’ve tackled all the other scaling challenges and the demand gets cranked up to 11, then we can see trouble.
In our case, we have an elastic scaling policy that allows us to spin up client machines in response to demand. So hundreds of web servers times the number of connection pools times the number of connections in each pool can really add up fast. Those web servers scale really nicely, but without a decent sharding strategy, SQL Server doesn’t.
On one of our servers, this is where we’re sitting now, and the number of connections is uncomfortably high. There’s not a lot of room to tolerate bursts of extra activity here:
The red lines indicate 10% and 100% of the maximum connection count.
What We Can Do About It
Monitor the number of connections that is typical for your servers using the query above for a point in time. Or use the performance counter: SQLServer: General Statistics – User Connections
Make sure you’re not leaking any connections. This SQLPerformance post tells you how to find that.
Use connection pooling efficiently. This Microsoft article SQL Server Connection Pooling (ADO.NET) has some great tips on avoiding pool fragmentation. The article describes pool fragmentation as a web server issue. But the tips are also appropriate to minimizing the total user connections on the server.
Keep the variety of connection strings used by your application small. With connection pooling, connections are only made when they’re needed. But still there’s a little bit of overhead. It takes 4 to 8 minutes for an idle connection to be released by the pool. So minimizing the number of connection pools actually does help.
Queries should be as quick as possible. Get in and out.
Offload reads as much as possible to other servers. Availability groups have read-only routing features but be careful how you implement this. If you have some connection strings that use ApplicationIntent=ReadOnly and some that don’t, then that’s two different connection pools. If you want to defer configuring AGs until after the connection strings are done, then there can be some tricky overlapping scenarios. It’s complicated, and it just highlights the importance of monitoring.
When tuning queries, the metric to focus on is the total duration of queries. Try to minimize that number. That’s the total_elapsed_time column in sys.dm_exec_query_stats. Or the elapsed time query in http://michaeljswart.com/go/top20. This is an interesting one. I’ve always preferred to focus on optimizing CPU or logical reads. But in this case, connection count is most sensitive to long-running queries no matter the reason.
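For example, a sketch that ranks cached queries by total elapsed time (similar in spirit to the top-20 query linked above; the exact column choices here are mine):

```sql
-- Top queries by total duration since the plan was cached
SELECT TOP (20)
       qs.total_elapsed_time / 1000 AS [Total Elapsed ms],
       qs.execution_count,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
           ((CASE qs.statement_end_offset
                 WHEN -1 THEN DATALENGTH(st.text)
                 ELSE qs.statement_end_offset
             END - qs.statement_start_offset) / 2) + 1) AS [Statement Text]
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_elapsed_time DESC;
```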
I’d love to hear about others who have tackled this problem. Let me know what strategies you came up with.
database table’s auto-incrementing ID column exceeded [maxint]. When we attempted to insert larger integers into the column, the database rejected the value
This led to a discussion about setting up monitoring for this kind of problem in our software. We have a place for monitoring and health-checks for all our databases. We just need to know how to define them.
So how do I create a script that reports any tables whose current identity values are above a certain threshold? This is what I came up with. Maybe you’ll find it useful too.
Find Tables in SQL Server Running Out Of Identity Room
declare @percentThreshold int = 70;
select t.name as [table],
c.name as [column],
ty.name as [type],
IDENT_CURRENT(t.name) as [identity],
100 * IDENT_CURRENT(t.name) / 2147483647 as [percent full]
from sys.tables t
join sys.columns c
on c.object_id = t.object_id
join sys.types ty
on ty.system_type_id = c.system_type_id
where c.is_identity = 1
and ty.name = 'int'
and 100 * IDENT_CURRENT(t.name) / 2147483647 > @percentThreshold
order by t.name
Other Notes
I really prefer sequences for this kind of thing. Monitoring goes along similar lines.
I only worry about ints. Bigints are just too big.
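If you do use sequences, a similar check against sys.sequences might look like this (a sketch; the casts assume int or bigint sequences):

```sql
declare @percentThreshold int = 70;

select s.name as [sequence],
       cast(s.current_value as bigint) as [current value],
       100.0 * cast(s.current_value as bigint) / cast(s.maximum_value as bigint) as [percent full]
from sys.sequences s
where 100.0 * cast(s.current_value as bigint) / cast(s.maximum_value as bigint) > @percentThreshold
order by s.name;
```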
I was looking at the docs for DROP TABLE and I noticed this in the syntax: [ ,...n ]. I never realized that you can drop more than one table in a statement.
You Still Have to Care About Order
I think that’s great. When dropping tables one at a time, you always had to be careful about order when foreign keys were involved. Alas, you still have to care about order. The docs say:
If both the referencing table and the table that holds the primary key are being dropped in the same DROP TABLE statement, the referencing table must be listed first.
That means that when you run
CREATE TABLE A ( id INT PRIMARY KEY );
CREATE TABLE B ( id INT FOREIGN KEY REFERENCES A(id) );
DROP TABLE IF EXISTS A, B;
It fails with
Msg 3726, Level 16, State 1, Line 4
Could not drop object ‘A’ because it is referenced by a FOREIGN KEY constraint.
This is a suggestion made by Matt Smith. Currently DROP TABLE behaves this way:
DROP TABLE IF EXISTS succeeds if the table exists and is deleted.
DROP TABLE IF EXISTS succeeds when there is no object with that name.
DROP TABLE IF EXISTS fails when that object name refers to an object that is not a table
For example, this script
CREATE VIEW C AS SELECT 1 AS One;
go
DROP TABLE IF EXISTS C;
gives the error
Msg 3705, Level 16, State 1, Line 9
Cannot use DROP TABLE with ‘C’ because ‘C’ is a view. Use DROP VIEW.
It’s really not in the spirit of what was intended with “IF EXISTS”. If you want to vote for that suggestion, it’s here: DROP TABLE IF EXISTS fails.
DROP TABLE is not Atomic
I’ve gotten really used to relying on atomic transactions. I know that when I update a set of rows, I can rely on the fact that all of the rows are updated, or in the case of an error, none of the rows are updated. There’s no situation where some of the rows are updated. But a DROP TABLE statement that tries to drop multiple tables using the [ ,...n ] syntax doesn’t behave that way. If there’s an error, SQL Server continues with the list dropping the tables that it can.
We can see that with the first example. Here it is again:
CREATE TABLE A ( id INT PRIMARY KEY );
CREATE TABLE B ( id INT FOREIGN KEY REFERENCES A(id) );
DROP TABLE IF EXISTS A, B;
-- Could not drop object 'A' because it is referenced by a FOREIGN KEY constraint.
-- B is dropped
That example throws an error and drops a table.
The same nonatomic behavior is seen in a simpler example:
CREATE TABLE D ( id INT );
DROP TABLE E, D;
-- Invalid object name 'E'
-- D is dropped
Solving real-world problems is different than answering interview questions or twitter polls. The biggest difference is that real problems aren’t always fair. There’s not always a right answer.
Answer this multiple choice question:
Which of the following SQL statements is used to modify existing data in a table?
A) SELECT
B) INSERT
C) DELETE
Give it some thought. Which option would you pick? The correct answer is UPDATE but it wasn’t one of the options listed and that’s not fair. But neither is real life. Many real problems don’t have an easy answer and some real problems are impossible to solve. That can be discouraging.
Real Problems Allow For Creativity
But if your problems are unfair, then maybe you’re allowed to cheat too.
“None of the above” is always an option. Understand the goal so that you can stretch or ignore requirements.
Example – Changing an INT to a BIGINT
I have a table that logs enrollments into courses. It’s append only and looks something like this:
CREATE TABLE dbo.LOG_ENROLL (
LogId INT IDENTITY NOT NULL, -- This identity column is running out of space
UserId INT NOT NULL,
CourseId INT NOT NULL,
RoleId INT NULL,
EnrollmentType INT NOT NULL,
LogDate DATETIME NOT NULL DEFAULT GETUTCDATE(),
INDEX IX_LOG_ENROLL_CourseId CLUSTERED ( CourseId, UserId ),
CONSTRAINT PK_LOG_ENROLL PRIMARY KEY NONCLUSTERED ( LogId ),
INDEX IX_LOG_ENROLL_UserId NONCLUSTERED ( UserId, CourseId ),
INDEX IX_LOG_ENROLL_LogDate NONCLUSTERED ( LogDate, LogId )
);
The table has over 2 billion rows and it looks like it’s going to run out of space soon because the LogId column is defined as an INT. I need to change this table so that it’s a BIGINT. But changing an INT to a BIGINT is known as a “size of data” operation. This means SQL Server has to process every row to expand the LogId column from 4 to 8 bytes. But it gets trickier than that.
The biggest challenge is that the table has to remain “online” (available for queries and inserts).
Compression?
Gianluca Sartori (spaghettidba) had the idea of enlarging the columns with no downtime using compression. It’s promising, but I discovered that for this to work, all indexes need to be compressed, not just the ones that contain the changed column. Also, any indexes which use the column need to be disabled for this to work.
Cheating
I gave up on solving this problem in general and constrained my focus to the specific problem I was facing. There’s always some context that lets us bend the rules. In my case, here’s what I did.
Ahead of time:
I removed extra rows. I discovered that many of the rows were extraneous and could be removed. After thinning out the table, the number of rows went from 2 billion down to 300 million.
I compressed two of the indexes online (IX_LOG_ENROLL_UserId and IX_LOG_ENROLL_CourseId) because I still want to use the compression trick.
But I’m not ready yet. I still can’t modify the column because the other two indexes depend on the LogId column. If I try, I get this error message:
Msg 5074, Level 16, State 1, Line 22
The index ‘IX_LOG_ENROLL_LogDate’ is dependent on column ‘LogId’.
Msg 5074, Level 16, State 1, Line 22
The object ‘PK_LOG_ENROLL’ is dependent on column ‘LogId’.
Msg 4922, Level 16, State 9, Line 22
ALTER TABLE ALTER COLUMN LogId failed because one or more objects access this column.
So I temporarily drop those indexes!
Drop the constraint PK_LOG_ENROLL and the index IX_LOG_ENROLL_LogDate
Do the switch! ALTER TABLE LOG_ENROLL ALTER COLUMN LogId BIGINT NOT NULL; This step takes no time!
Recreate the indexes online that were dropped.
Hang on, that last step is a size of data operation. Anyone who needs those indexes won’t be able to use them while they’re being built.
Exactly, and this is where I cheat. It turns out those indexes were used for infrequent reports, and I was able to coordinate my index rebuild around the reporting schedule.
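Putting those steps together, the switch might look something like this (a sketch against the example table above; ONLINE = ON needs Enterprise Edition, and the metadata-only ALTER COLUMN relies on the compression trick already being in place):

```sql
-- 1. Temporarily drop the objects that depend on LogId
ALTER TABLE dbo.LOG_ENROLL DROP CONSTRAINT PK_LOG_ENROLL;
DROP INDEX IX_LOG_ENROLL_LogDate ON dbo.LOG_ENROLL;

-- 2. The switch itself; with the compression trick this takes no time
ALTER TABLE dbo.LOG_ENROLL ALTER COLUMN LogId BIGINT NOT NULL;

-- 3. Recreate what was dropped (a size-of-data operation)
ALTER TABLE dbo.LOG_ENROLL
    ADD CONSTRAINT PK_LOG_ENROLL PRIMARY KEY NONCLUSTERED ( LogId )
    WITH ( ONLINE = ON );
CREATE NONCLUSTERED INDEX IX_LOG_ENROLL_LogDate
    ON dbo.LOG_ENROLL ( LogDate, LogId )
    WITH ( ONLINE = ON );
```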
You can’t always make an operation online, but with effort and creativity, you can get close enough. I have found that every real problem allows for a great degree of creativity when you’re allowed to bend the rules or question requirements.
The transaction is touching two different databases. So it makes sense that the two actions should be atomic and durable together using the one single transaction.
However, databases implement durability and atomicity using their own transaction log. Each transaction log takes care of its own database. So from another point of view, it makes sense that these are two separate transactions.
A transaction within a single instance of the Database Engine that spans two or more databases is actually a distributed transaction. The instance manages the distributed transaction internally; to the user, it operates as a local transaction.
I can actually see that happening with this demo script:
use master
if exists (select * from sys.databases where name = 'D1')
begin
alter database D1 set single_user with rollback immediate;
drop database D1;
end
go
if exists (select * from sys.databases where name = 'D2')
begin
alter database D2 set single_user with rollback immediate;
drop database D2;
end
go
create database d1;
go
create database d2;
go
create table d1.dbo.T1 (id int);
create table d2.dbo.T1 (id int);
go
use d1;
CHECKPOINT;
go
begin transaction
insert d1.dbo.T1 values (1);
insert d2.dbo.T1 values (1);
commit
select [Transaction ID], [Transaction Name], Operation, Context, [Description]
from fn_dblog(null, null);
That shows a piece of what’s going on in the transaction log like this:
If you’re familiar with fn_dblog output (or even if you’re not), notice that when a transaction touches two databases, there are extra entries in the transaction log. D1 has LOP_PREP_XACT and LOP_FORGET_XACT and D2 only has LOP_PREP_XACT. Grahaeme Ross wrote a lot more about what this means in his article Understanding Cross-Database Transactions in SQL Server
Well that’s good. I can count on that, can’t I?
Except When …
You Break Atomicity On Purpose
Well, they are two databases after all. If you want to restore one database to a point in time before the transaction occurred but not the other, I’m not going to stop you.
Availability Groups
But there’s another wrench to throw in with Availability Groups. Again Microsoft’s docs are pretty clear on this (Thanks Brent for pointing me to them). In Transactions – availability groups and database mirroring they point out this kind of thing is pretty new:
In SQL Server 2016 SP1 and before, cross-database transactions within the same SQL Server instance are not supported for availability groups.
There’s support in newer versions, but the availability group had to have been created with WITH DTC_SUPPORT = PER_DB. There’s no altering the availability group after it’s been created.
It’s also interesting that availability groups’ older brother, database mirroring is absolutely not supported. Microsoft says so several times and wants you to know that if you try and you mess up, it’s on you:
… any issues arising from the improper use of distributed transactions are not supported.
Long story short:
Cross DB Transactions in the same server are supported with Availability Groups in SQL Server 2017 and later
Cross DB Transactions are not supported with mirrored databases at all
After identifying a database you’re curious about, you may want to drill down further. I wrote about this problem earlier in Tackle WRITELOG Waits Using the Transaction Log and Extended Events. The query I wrote for that post combines results of an extended events session with the transaction log in order to identify which procedures are doing the most writing.
But it’s a tricky kind of script. It takes a while to run on busy systems. There’s a faster way to drill into writes if you switch your focus from which queries are writing so much to which tables are being written to so much. Both methods of drilling down can be helpful, but the table approach is faster, doesn’t require an extended events session, and may be enough to point you in the right direction.
Use This Query
use [specify your databasename here]
-- get the latest lsn for current DB
declare @xact_seqno binary(10);
declare @xact_seqno_string_begin varchar(50);
exec sp_replincrementlsn @xact_seqno OUTPUT;
set @xact_seqno_string_begin = '0x' + CONVERT(varchar(50), @xact_seqno, 2);
set @xact_seqno_string_begin = stuff(@xact_seqno_string_begin, 11, 0, ':')
set @xact_seqno_string_begin = stuff(@xact_seqno_string_begin, 20, 0, ':');
-- wait a few seconds
waitfor delay '00:00:10'
-- get the latest lsn for current DB
declare @xact_seqno_string_end varchar(50);
exec sp_replincrementlsn @xact_seqno OUTPUT;
set @xact_seqno_string_end = '0x' + CONVERT(varchar(50), @xact_seqno, 2);
set @xact_seqno_string_end = stuff(@xact_seqno_string_end, 11, 0, ':')
set @xact_seqno_string_end = stuff(@xact_seqno_string_end, 20, 0, ':');
WITH [Log] AS
(
SELECT Category,
SUM([Log Record Length]) as [Log Bytes]
FROM fn_dblog(@xact_seqno_string_begin, @xact_seqno_string_end)
CROSS APPLY (SELECT ISNULL(AllocUnitName, Operation)) AS C(Category)
GROUP BY Category
)
SELECT Category,
[Log Bytes],
100.0 * [Log Bytes] / SUM([Log Bytes]) OVER () AS [%]
FROM [Log]
ORDER BY [Log Bytes] DESC;
Results look something like this (Your mileage may vary).
Notes
Notice that some space in the transaction log is not actually about writing to tables. I’ve grouped them into their own categories and kept them in the results. For example LOP_BEGIN_XACT records information about the beginning of transactions.
I’m using sp_replincrementlsn to find the current last lsn. I could have used log_min_lsn from sys.dm_db_log_stats but that dmv is only available in 2016 SP2 and later.
This method is a little more direct measurement of transaction log activity than a similar query that uses sys.dm_db_index_operational_stats
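For comparison, here’s roughly what that sys.dm_db_index_operational_stats approach looks like (a sketch; it counts leaf-level row operations rather than log bytes, and the counters reset when the index metadata leaves the cache):

```sql
-- Approximate write activity per table since the counters were last reset
SELECT OBJECT_SCHEMA_NAME(ios.object_id) + '.' + OBJECT_NAME(ios.object_id) AS [Table],
       SUM(ios.leaf_insert_count + ios.leaf_update_count + ios.leaf_delete_count) AS [Leaf Writes]
FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS ios
WHERE ios.object_id > 100  -- skip system objects
GROUP BY ios.object_id
ORDER BY [Leaf Writes] DESC;
```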
Taking a small break from my blogging sabbatical to post one script that I’ve found myself writing from scratch too often.
My hope is that the next time I need this, I’ll look it up here.
The User Settable Counter
Use this to monitor something that’s not already exposed as a performance counter. Like the progress of a custom task or whatever. If you can write a quick query, you can expose it to a counter that can be plotted by Performance Monitor.
Here’s the script (adjust SomeMeasurement and SomeTable to whatever makes sense, and adjust the delay interval if 1 second is too short):
declare @deltaMeasurement int = 0;
declare @totalMeasurement int = 0;
while (1=1)
begin
select @deltaMeasurement = SomeMeasurement - @totalMeasurement
from SomeTable;
set @totalMeasurement += @deltaMeasurement;
exec sp_user_counter1 @deltaMeasurement;
waitfor delay '00:00:01'
end
Monitoring
Now you can monitor “User Counter 1” in the object “SQLServer:User Settable” which will look like this:
Don’t forget to stop the running query when you’re done.
A couple weeks ago, I wrote about how to find lonely tables in SQL Server. This is a follow-up to that post. I’m now going to talk about small sets of tables that are joined to each other, but no one else.
It’s Not Just Me
It seems everyone’s talking about this.
As I was writing this post and its code, I noticed an amazing coincidence: the same ideas I was writing about were being discussed on Twitter by Kelly Sommers, Ben Johnson, and others.
They discuss Uber’s microservice graph. When visualized, it’s a big mish-mash of dependencies. Kelly points out how hard it is to reason about and Ben points to a small decoupled piece of the system that he wants to work on.
Me too Ben! And I think that’s the value of that visualization. It can demonstrate to others how tangled your system is. It can also identify small components that are not connected to the main mess. When I tie it to my last post and consider this idea in the database world, I can expand my idea of lonely tables to small sets of tables that are never joined to other tables.
I want to find them because these tables are also good candidates for extraction but how do I find them? I start by visualizing tables and their joins.
Visualizing Table Joins
I started by looking for existing visualizations. I didn’t find exactly what I wanted so I coded my own visualization (with the help of the d3 library). It’s always fun to code your own physics engine.
Here’s what I found
A monolith with some smaller isolated satellites
An example that might be good to extract
That ball of mush in the middle is hard to look at, but the smaller disconnected bits aren’t! Just like Ben, I want to work on those smaller pieces too! And just like the lonely tables we looked at last week, these small isolated components are also good candidates for extracting from SQL Server.
There’s a query at the end of this post. When you run it, you’ll get pairs of table names, and when you paste them into the Show Graph page, you’ll see a visualization of your database.
(This is all client-side code, I don’t collect any data).
The Query
use [your database name goes here];

-- gather plan handles for queries cached for this database
select
qs.query_hash,
qs.plan_handle,
cast(null as xml) as query_plan
into #myplans
from sys.dm_exec_query_stats qs
cross apply sys.dm_exec_plan_attributes(qs.plan_handle) pa
where pa.attribute = 'dbid'
and pa.value = db_id();
-- keep only one plan per query_hash
with duplicate_queries as
(
select ROW_NUMBER() over (partition by query_hash order by (select 1)) r
from #myplans
)
delete duplicate_queries
where r > 1;
-- fetch the actual plan XML for each remaining plan handle
update #myplans
set query_plan = qp.query_plan
from #myplans mp
cross apply sys.dm_exec_query_plan(mp.plan_handle) qp
-- pull the tables referenced by each query out of the plan XML
;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan'),
mycte as
(
select q.query_hash,
obj.value('(@Schema)[1]', 'sysname') AS schema_name,
obj.value('(@Table)[1]', 'sysname') AS table_name
from #myplans q
cross apply q.query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') as nodes(stmt)
CROSS APPLY stmt.nodes('.//IndexScan/Object') AS index_object(obj)
)
select query_hash, schema_name, table_name
into #myExecutions
from mycte
where schema_name is not null
and object_id(schema_name + '.' + table_name) in (select object_id from sys.tables)
group by query_hash, schema_name, table_name;
-- emit pairs of tables that appear together in the same query
select DISTINCT A.table_name as first_table,
B.table_name as second_table
from #myExecutions A
join #myExecutions B
on A.query_hash = B.query_hash
where A.table_name < B.table_name;
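Since #myExecutions is still around after that last select, a rough follow-up (a sketch of mine, not part of the original script) is to count how many distinct join partners each table has. Tables with few partners sit at the edges of the graph and are the likelier extraction candidates:

```sql
-- Count distinct join partners per table; low counts suggest small,
-- loosely connected components. Tables that are never joined at all
-- won't appear here (the inner join filters them out).
select A.table_name,
       count(distinct B.table_name) as join_partners
from #myExecutions A
join #myExecutions B
    on A.query_hash = B.query_hash
   and A.table_name <> B.table_name
group by A.table_name
order by join_partners asc, A.table_name;
```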