Michael J. Swart

December 22, 2016

That Time We Fixed Prod Without Admin Credentials

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 9:00 am

Merry Christmas readers! It’s story time. This is about a problem I encountered a few weeks ago. We were looking at a production site using sp_whoisactive and we noticed a lot of blocking on one particular procedure. I’m going to explain how we tackled it.

In this case, I think it’s interesting that we were able to mitigate the problem without requiring sysadmin access.

The Symptoms

Using sp_whoisactive and other tools, we noticed several symptoms.

  • SQLException timeout errors were reported by the app when calling one procedure in particular.
  • Many sessions were executing that procedure concurrently. Or at least they were attempting to.
  • There was excessive blocking and the lead blocker was running the same procedure.
  • The lead blocker had been running the longest (about 29 seconds)
  • The blocking was caused by processes waiting on Sch-M locks for a table used by that query

Here’s what was going on:
SQL Server was struggling to compile the procedure in time and the application wouldn’t let it catch its breath. The query optimizer was attempting to create statistics automatically that it needed for optimizing the query, but after thirty seconds, the application got impatient and cancelled the query.

So the compilation of the procedure was cancelled and this caused two things to happen. First, the creation of the statistics was cancelled. Second, the next session in line was allowed to run. But the problem was that the next session had already spent 28 seconds blocked by the first session and only had two seconds to try to compile a query before getting cancelled itself.

The frequent calls to the procedure meant that nobody had time to compile this query. And we were stuck in an endless cycle of sessions that wanted to compile a procedure, but could never get enough time to do it.

Why was SQL Server taking so long to compile anyway?
After a bunch of digging, we found out that a SQL Server bug was biting us. This bug involved

  • SQL Server 2014
  • Trace flag 2389 and 2390
  • Filtered Indexes on very large base tables

Kind of a perfect storm of factors that exposed a SQL Server quirk that caused long compilation times, timeouts and pain.

What We Did About It

Well, in this case, I think that the traceflag 2389, 2390 kind of outlived its usefulness (the cure is worse than the disease and all that). So the main fix for this problem is to get rid of those traceflags. But it would be some time before we could get that rolled out.

So for the short term, we worked at getting that procedure compiled and into SQL Server’s cache.

We called the procedure ourselves in Management Studio. Our call waited about thirty seconds before it got its turn to run. Then it spent a little while to compile and run the procedure. Presto! The plan is in the cache now! And everything’s all better right? Nope. Not quite. The timeouts continued.

If you’ve read Erland Sommarskog’s Slow in the Application, Fast in SSMS you may have guessed what’s going on. When we executed the procedure in SSMS, it was using different settings. So the query plan we compiled couldn’t be reused by the application. Remember, all settings (including ARITHABORT) need to match before cached plans can be reused by different sessions. We turned ARITHABORT off in SSMS and called the procedure again.

After a minute, the query completed and all blocking immediately stopped. Whew! The patient was stable.

The whole experience was a pain. And an outage is an outage. Though the count of the snags for the year had increased …

the Grinch had warded off downtime at least

4 Comments »

  1. […] Michael Swart has a story on blocking due to an edge case scenario: […]

    Pingback by Solving Blocking Without Sysadmin – Curated SQL — December 27, 2016 @ 8:05 am

  2. “Here’s what was going on:
    SQL Server was struggling to compile the procedure in time and the application wouldn’t let it catch its breath. The query optimizer was attempting to create statistics automatically that it needed for optimizing the query, but after thirty seconds, the application got impatient and cancelled the query. ”

    I would love to hear how you went from Sch-M locks to knowing it was struggling to compile the procedure please? 🙂

    Comment by David — May 16, 2017 @ 11:59 am

  3. Hi David,

    I noticed that the number of reads and cpu reported by sp_whoisactive for the lead blocker was at zero. That hinted to me that there was something preventing the query from even getting started. There are a couple things during query compilation that might be going on. One of them is statistics gathering.

    And the reason I suspected that is because I spent a lot of time on a similar lesson from 8 years ago:
    https://michaeljswart.com/2009/12/tracking-down-elusive-query-timeouts/

    Comment by Michael J. Swart — May 18, 2017 @ 8:43 am

  4. Brilliant – thank you Michael 😀

    Comment by David — May 18, 2017 @ 8:52 am

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress