Merry Christmas readers! It’s story time. This is about a problem I encountered a few weeks ago. We were looking at a production site using sp_whoisactive and we noticed a lot of blocking on one particular procedure. I’m going to explain how we tackled it.
In this case, I think it’s interesting that we were able to mitigate the problem without requiring sysadmin access.
Using sp_whoisactive and other tools, we noticed several symptoms.
- SQLException timeout errors were reported by the app when calling one procedure in particular.
- Many sessions were executing that procedure concurrently. Or at least they were attempting to.
- There was excessive blocking and the lead blocker was running the same procedure.
- The lead blocker had been running the longest (about 29 seconds)
- The blocking was caused by processes waiting on
Sch-Mlocks for a table used by that query
Here’s what was going on:
SQL Server was struggling to compile the procedure in time and the application wouldn’t let it catch its breath. The query optimizer was attempting to create statistics automatically that it needed for optimizing the query, but after thirty seconds, the application got impatient and cancelled the query.
So the compilation of the procedure was cancelled and this caused two things to happen. First, the creation of the statistics was cancelled. Second, the next session in line was allowed to run. But the problem was that the next session had already spent 28 seconds blocked by the first session and only had two seconds to try to compile a query before getting cancelled itself.
The frequent calls to the procedure meant that nobody had time to compile this query. And we were stuck in an endless cycle of sessions that wanted to compile a procedure, but could never get enough time to do it.
Why was SQL Server taking so long to compile anyway?
After a bunch of digging, we found out that a SQL Server bug was biting us. This bug involved
- SQL Server 2014
- Trace flag 2389 and 2390
- Filtered Indexes on very large base tables
Kind of a perfect storm of factors that exposed a SQL Server quirk that caused long compilation times, timeouts and pain.
What We Did About It
Well, in this case, I think that the traceflag 2389, 2390 kind of outlived its usefulness (the cure is worse than the disease and all that). So the main fix for this problem is to get rid of those traceflags. But it would be some time before we could get that rolled out.
So for the short term, we worked at getting that procedure compiled and into SQL Server’s cache.
We called the procedure ourselves in Management Studio. Our call waited about thirty seconds before it got its turn to run. Then it spent a little while to compile and run the procedure. Presto! The plan is in the cache now! And everything’s all better right? Nope. Not quite. The timeouts continued.
If you’ve read Erland Sommarskog’s Slow in the Application, Fast in SSMS you may have guessed what’s going on. When we executed the procedure in SSMS, it was using different settings. So the query plan we compiled couldn’t be reused by the application. Remember, all settings (including
ARITHABORT) need to match before cached plans can be reused by different sessions. We turned
ARITHABORT off in SSMS and called the procedure again.
After a minute, the query completed and all blocking immediately stopped. Whew! The patient was stable.
The whole experience was a pain. And an outage is an outage. Though the count of the snags for the year had increased …