Another recap of a problem we recently tackled where we had to find out why a doomed transaction was still trying to do work. A doomed transaction is one that is about to roll back. In our case, it was caused by a rare set of circumstances:
- Turning XACT_ABORT ON
- Using the “Just Do It (JDI)” pattern for upserts
- Not following the KISSS rule Keep It Simple SQL Server
If you don’t feel like reading ahead, here are the lessons. Don’t program in T-SQL. Avoid triggers. Extended events are awesome.
Getting Error 3930
Our app was getting an error coming from SQL Server. Error 3930 gives this message:
Msg 3930, Level 16, State 1, Line 14
The current transaction cannot be committed and cannot support operations that
write to the log file. Roll back the transaction.
Google pointed me to this Stackoverflow question where Remus Rusanu talks about doomed transactions. You’re not allowed to do work in a catch block if your transaction is doomed. He then points to more information about uncommittable transactions in the TRY/CATCH docs. According to the TRY/CATCH docs, you can get this kind of error when XACT_ABORT is turned ON. This is something we do and I wrote about why in Don’t Abandon Your Transactions.
So there’s a CATCH block catching some error in a TRY block. Now I wonder what that error is. The SqlException that got raised to the client only had the 3930 error above. It had nothing about the original error. I did have the procedure name though. I can look there.
So I took a look at the procedure getting called. I saw that it was calling another procedure. And that procedure called other procedures. It was a big complicated mess. The calling tree looked like this:
So the procedure was complicated and it used explicit transactions, but I couldn’t find any TRY/CATCH blocks anywhere! What I needed was a stack trace, but for T-SQL. People don’t talk about T-SQL stack traces very often. Probably because they don’t program like this in T-SQL. We can’t get a T-SQL stack trace from the SQLException (the error given to the client), so we have to get it from the server.
More Info Needed From the Server
So luckily, extended events do have T-SQL stack traces. I wanted to look at those stack traces whenever there was an error on the server. My session looked something like this:
CREATE EVENT SESSION [errors] ON SERVER ADD EVENT sqlserver.error_reported( ACTION(sqlserver.sql_text,sqlserver.tsql_stack)) ADD TARGET package0.event_file(SET filename=N'errors') GO
Then I waited
Eventually It Failed Again
First I looked for the error. The one where
error_code = 3930. I did it using Management Studio to view the session’s data.
Then I looked at the error immediately before it. The one where
is_intercepted = true. That looks something like this
The T-SQL stack trace for that error is formatted as xml
<frames> <frame level="1" handle="0x03001C021AD32B677F977801C8A6000001000000000000000000000000000000000000000000000000000000" line="17" offsetStart="634" offsetEnd="808" /> <frame level="2" handle="0x03001C02A9D0D86D90977801C8A6000000000000000000000000000000000000000000000000000000000000" line="8" offsetStart="342" offsetEnd="582" /> <frame level="3" handle="0x03001C02202218770CA27801C8A6000001000000000000000000000000000000000000000000000000000000" line="56" offsetStart="2822" offsetEnd="3218" /> </frames>
It doesn’t look to useful at first glance but you can get the actual SQL query using sys.dm_exec_sql_text.
select * from sys.dm_exec_sql_text(0x03001C021AD32B677F977801C8A6000001000000000000000000000000000000000000000000000000000000)
The Original Error
Here’s what I found. The original error was a PK violation in a procedure using the Just Do It (JDI) pattern. It’s a technique which tackles upsert concurrency problems by inserting a row and catching and suppressing any PK errors that might be thrown.
So that’s where the TRY/CATCH block was.
Also, the stack trace told me that the procedure was called from a trigger. So my complicated procedure was actually more complicated than I thought. It actually looked like this
That’s why i didn’t see it before. I hate the way triggers can hide complexity. They’re sneaky.
The CATCH block catches a real error and the XACT_ABORT setting dooms the transaction. Because I didn’t do anything with the error, the transaction was allowed to continue. It was actually some query in
s_ProcM that attempted to do work inside the doomed transaction.
To fix, we adjusted the text of the query to be concurrent without using TRY/CATCH for flow control. For the extra curious, we used method 4 from Mythbusting: Concurrent Update/Insert Solutions.
- Don’t use the Just Do It (JDI) upsert pattern inside triggers
- In fact don’t use TRY/CATCH for flow control
- And also don’t use triggers
- In fact don’t program inside SQL Server if you can help it
- Oh, and Extended Events include a T-SQL stack trace if God forbid you need it