Case study: Troubleshooting Doomed Transactions

January 16, 2017

Case study: Troubleshooting Doomed Transactions

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 8:00 am

Part of this post discusses the topic of UPSERT. For a recent, comprehensive post on UPSERT, see SQL Server UPSERT Patterns and Antipatterns

Another recap of a problem we recently tackled where we had to find out why a doomed transaction was still trying to do work. A doomed transaction is one that is about to roll back. In our case, it was caused by a rare set of circumstances:

Turning XACT_ABORT ON
Using the “Just Do It (JDI)” pattern for upserts
Not following the KISSS rule Keep It Simple SQL Server

If you don’t feel like reading ahead, here are the lessons. Don’t program in T-SQL. Avoid triggers. Extended events are awesome.

Getting Error 3930

Our app was getting an error coming from SQL Server. Error 3930 gives this message:

Msg 3930, Level 16, State 1, Line 14
The current transaction cannot be committed and cannot support operations that
write to the log file. Roll back the transaction.

Google pointed me to this Stackoverflow question where Remus Rusanu talks about doomed transactions. You’re not allowed to do work in a catch block if your transaction is doomed. He then points to more information about uncommittable transactions in the TRY/CATCH docs. According to the TRY/CATCH docs, you can get this kind of error when XACT_ABORT is turned ON. This is something we do and I wrote about why in Don’t Abandon Your Transactions.

So there’s a CATCH block catching some error in a TRY block. Now I wonder what that error is. The SqlException that got raised to the client only had the 3930 error above. It had nothing about the original error. I did have the procedure name though. I can look there.

Complicated Procedure

So I took a look at the procedure getting called. I saw that it was calling another procedure. And that procedure called other procedures. It was a big complicated mess. The calling tree looked like this:

Something Complicated

So the procedure was complicated and it used explicit transactions, but I couldn’t find any TRY/CATCH blocks anywhere! What I needed was a stack trace, but for T-SQL. People don’t talk about T-SQL stack traces very often. Probably because they don’t program like this in T-SQL. We can’t get a T-SQL stack trace from the SQLException (the error given to the client), so we have to get it from the server.

More Info Needed From the Server

So luckily, extended events do have T-SQL stack traces. I wanted to look at those stack traces whenever there was an error on the server. My session looked something like this:

CREATE EVENT SESSION [errors] ON SERVER 
ADD EVENT sqlserver.error_reported(
    ACTION(sqlserver.sql_text,sqlserver.tsql_stack))
ADD TARGET package0.event_file(SET filename=N'errors')
GO

Then I waited

Eventually It Failed Again

Woohoo!
First I looked for the error. The one where error_code = 3930. I did it using Management Studio to view the session’s data.
Then I looked at the error immediately before it. The one where is_intercepted = true. That looks something like this

The T-SQL stack trace for that error is formatted as xml

<frames>
  <frame level="1" handle="0x03001C021AD32B677F977801C8A6000001000000000000000000000000000000000000000000000000000000" line="17" offsetStart="634" offsetEnd="808" />
  <frame level="2" handle="0x03001C02A9D0D86D90977801C8A6000000000000000000000000000000000000000000000000000000000000" line="8" offsetStart="342" offsetEnd="582" />
  <frame level="3" handle="0x03001C02202218770CA27801C8A6000001000000000000000000000000000000000000000000000000000000" line="56" offsetStart="2822" offsetEnd="3218" />
</frames>

It doesn’t look too useful at first glance but you can get the actual SQL query using sys.dm_exec_sql_text.

select * from sys.dm_exec_sql_text(0x03001C021AD32B677F977801C8A6000001000000000000000000000000000000000000000000000000000000)

The Original Error

Here’s what I found. The original error was a PK violation in a procedure using the Just Do It (JDI) pattern. It’s a technique which tackles upsert concurrency problems by inserting a row and catching and suppressing any PK errors that might be thrown.
So that’s where the TRY/CATCH block was.

Also, the stack trace told me that the procedure was called from a trigger. So my complicated procedure was actually more complicated than I thought. It actually looked like this

Something More Complicated

That’s why i didn’t see it before. I hate the way triggers can hide complexity. They’re sneaky.

The CATCH block catches a real error and the XACT_ABORT setting dooms the transaction. Because I didn’t do anything with the error, the transaction was allowed to continue. It was actually some query in s_ProcM that attempted to do work inside the doomed transaction.

To fix, we adjusted the text of the query to be concurrent without using TRY/CATCH for flow control. For the extra curious, we used method 4 from Mythbusting: Concurrent Update/Insert Solutions.

Lessons

Don’t use the Just Do It (JDI) upsert pattern inside triggers
In fact don’t use TRY/CATCH for flow control
And also don’t use triggers
In fact don’t program inside SQL Server if you can help it
Oh, and Extended Events include a T-SQL stack trace if God forbid you need it

-- Comments (7)

7 Comments »

It is amazing how screwed up error management in T-SQL is. Other languages such as C# show that it’s quite possible to make a language in which errors are easy to handle.

Comment by tobi — January 16, 2017 @ 11:13 am
You’re so right tobi,
Coming from a C# background myself, I’ve made so many incorrect assumptions about how T-SQL TRY/CATCH behaves.

My advice is “Don’t use TRY/CATCH unless you really know what’s going on with TRY/CATCH” and as followup advice “Unless your name is Remus Rusanu, you probably don’t really know what’s going on with TRY/CATCH”

Comment by Michael J. Swart — January 16, 2017 @ 12:04 pm
[…] Michael Swart talks about doomed transactions: […]

Pingback by Doomed Transactions – Curated SQL — January 17, 2017 @ 8:11 am
Hi Guys,
Michael, nice approach to tracking down this beast. What a mess!

However, I disagree that you guys about avoiding Try-Catch. The pattern is a significant advance over the old “IF @@Error != 0 GoTo Label” pattern.:)
I think part of the complexity is that T-SQL Try-Catch was added on to the other error handling features and suffers from many complications in those other features.

Error and Transaction Handling in SQL Server by Erland Sommarskog is a comprehensive (exhaustive, exhausting:) ) treatment of the topic. I was surprised at the complexity of Transaction interactions (explicit or not), Xact_Abort, even before considering Try-Catch.

C#’s designers had the advantages of both a relatively clean sheet of paper and knowledge of relatively modern error handling patterns. T-SQL’s designers are stuck with a 30 year old design pattern that they generally must retain for backward compatibility.

I think the real key is Michael’s KISSS principle. Don’t program in T-SQL. Query, update, delete, search, sort if you must but don’t put tons of multi-layer business logic, with hidden triggers and expect it to work flawlessly. I imagine that ya’ll would make similar disparaging comments but direct them at the programmer, should I put together a C# processing tree similar to this example with nested interdependent trees including a couple of Out-of-Process steps implemented behind a curtain.

Comment by Ray Herring — January 18, 2017 @ 3:22 pm
http://www.sommarskog.se/error_handling/Part1.html
Sorry, I left out the URL for Erland’s excellent article.

Comment by Ray Herring — January 18, 2017 @ 3:23 pm
Well put Ray*

Yes, the try-catch pattern is certainly better than IF @@Error ... pattern.
But as you pointed out programming in T-SQL is a tricky subject.
In an ideal world, every procedure follows the KISSS principle and if that’s the case, then all of a sudden, we don’t need any error handling in T-SQL because we only catch exceptions/errors that we can handle (and if procedures are only called from code, then let the error handling happen there).

You mentioned something I hadn’t thought of before. Would I be equally critical if I saw the same programming in C#? I think yes but I’m not sure. In a real programming language, there are better opportunities to “uncomplicate” something complicated. But the example above is a real example (with changed sproc names) and I’m wondering now how I would implement a solution if I had to start from scratch.

Thanks for stopping by Ray

*Really well put, do you have a blog?

Comment by Michael J. Swart — January 18, 2017 @ 3:45 pm
Thanks Michael,
As for the Blog, alas no. I start occasionally and then find that someone else has already covered the topic and usually much better than I would have. Your’s is a good example 🙂
As for real languages and complexity we need to do beer sometime and swap war stories. I might start with the 400+ line conditional that reduced to If 1 = 0 do something. That one took a day or so with logic diagrams to work out. Or the one where the entire data record was a DOM embedded in XML. That one had multiple, independent processes to update the DOM including Client Side, Server Side, T-SQL and Triggers.
I think the real issue is that developers (VB, C#, T-SQL) are pushed hard to get the solution out the door. In the process a monster accretes.
When I first started as a developer we were actively discouraged from what is now called “refactoring” and was then called “meddling”. Sometimes I think it is still the true in many shops.

Comment by Ray Herring — January 19, 2017 @ 12:53 pm