Michael J. Swart

August 2, 2017

Problem With Too Many version_ghost_records

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 8:00 am

We encountered a CPU issue which took over a month to understand and fix. Since then, it’s been another couple months and so I think it may be time for a debrief.

The cause was identified as a growing number of ghost records that SQL Server would not clean up no matter what. Our fix was ultimately to restart SQL Server.

When there's somethin' strange in your neighborhood

Symptoms and Data Collection

Here’s what we found.

  • At time marked ‘A’ on the graph, we noticed that CPU increased dramatically. It was hard not to notice.
  • We used sp_whoisactive to identify a frequent query that was taking a large amount of CPU. That query had never been a problem before. It was a select against a queue table – a table whose purpose was to store data for an implementation of a work queue. This table had a lot of churn: many INSERTS and DELETES. But it was small, never more than 100 rows.
  • So next, we ran the query manually in Management Studio. SET STATISTICS IO, TIME ON gave us a surprise. A simple COUNT(*) of the table told us there were 30 rows in the table, but reading it took 800K logical page reads!
  • What pages could it be reading? It’s impossible for a table to be that fragmented, it would mean less than one row per page. To look at the physical stats we ran:
    select 
           sum(record_count) as records,
           sum(ghost_record_count) as ghost_records,
           sum(version_ghost_record_count) as version_ghost_records
      from sys.dm_db_index_physical_stats(db_id(), object_id('<table_name>'), default, default, 'detailed')
     where index_id = 1
           and index_level = 0

    And that gave us these results:


    Interesting. The ghost records that remain are version_ghost_records, not ghost_records. Which sounds like we’re using some sort of snapshot isolation (which we’re not), or online reindexing (which we are), or something else that uses row versions.

  • Over time, those version_ghost_records would constantly accumulate. This ghost record accumulation was also present in all other tables, but it didn’t hurt as much as the queue table which had the most frequent deletes.

Mitigation – Rebuild Indexes

Does an index rebuild clean these up? In this case, yes. An index rebuild reduced the number of version ghost records for the table. At the time marked ‘B’ in the timeline, we saw that an index rebuild cleaned up these records and restored performance. But only temporarily. The version_ghost_records continued to accumulate gradually.

At time ‘C’ in the timeline, we created a job that ran every 15 minutes to rebuild the index. This restored performance to acceptable levels.

More investigation online

Kendra Little – Why Is My Transaction Log Growing In My Availability Group?
This is a great video. Kendra describes a similar problem. Long running queries on secondary replicas can impact the primary server in an Availability Group (AG). But we found no long running queries on any replica. We looked using sp_whoisactive and DBCC OPENTRAN. We didn’t see any open transactions anywhere.

Amit Banerjee – Chasing the Ghost Cleanup in an Availability Group
Amit also mentions that log truncation would be prevented in the case of a long-running query on a replica. However, in our case, log truncation was occurring.

Uwe Ricken – Read Committed Snapshot Isolation and high number of version_ghost_record_count
Uwe Ricken also blogged recently about a growing number of version_ghost_records. He talked about looking for open transactions that use one of the snapshot isolation levels. Unfortunately it didn’t apply to our case.

Bitemo Erik Gergely – A very slow SQL Server
Another example of a long running query keeping version_ghost_records around.

dba.stackexchange – GHOST_CLEANUP Lots of IO and CPU Usage
This stackexchange question also describes a problem with lots of CPU and an inefficient, ineffective ghost cleanup task for databases in an AG. There’s an accepted answer there, but it’s not really a solution.

Calling Microsoft Support

So we called Microsoft support. We didn’t really get anywhere. We spoke with many people over a month. We generated memory dumps, PSSDiag sessions and we conducted a couple screen sharing sessions. Everyone was equally stumped.

After much time and many diagnostic experiments. Here’s what we did find.

  • Availability Groups with readable secondaries are necessary (but not sufficient) to see the problem. This is where the version_ghost_records come from in the first place. Readable secondaries make use of the version store.
  • We ran an extended event histogram on the ghost_cleanup event. There was a ridiculous amount of events. Like millions per minute, but they weren’t actually cleaning up anything:
    CREATE EVENT SESSION ghostbusters ON SERVER 
    ADD EVENT sqlserver.ghost_cleanup( ACTION( sqlserver.database_id ) )
    ADD TARGET package0.histogram( SET filtering_event_name=N'sqlserver.ghost_cleanup', source=N'sqlserver.database_id' )
  • Microsoft let me down. They couldn’t figure out what was wrong. Each new person on the case had to be convinced that there were no open transactions. They couldn’t reproduce our problem. And on our server, we couldn’t avoid the problem.

Resolution

Ultimately the time came for an unrelated maintenance task. We had to do a rolling upgrade for some hardware driver update. We manually failed over the Availability Group and after that, no more problem!

It’s satisfying and not satisfying. Satisfying because the problem went away for us and we haven’t seen it since. After the amount of time I spent on this, I’m happy to leave this problem in the past.

But it’s not satisfying because we didn’t crack the mystery. And restarting SQL Server is an extreme solution for a problem associated with an “Always On” feature.

If you’re in the same boat as I was, go through the links in this post. Understand your environment. Look for long running queries on all replicas. And when you’ve exhausted other solutions, mitigate with frequent index rebuilds and go ahead and restart SQL Server when you can.

Powered by WordPress