Michael J. Swart

August 22, 2024

Large Uniquifier Values

Hey what’s up? It’s been a while.

If you define a clustered index that’s not unique, SQL Server will add a hidden 4-byte column called UNIQUIFIER. You can’t see it directly but it’s there. When you add a row whose key is a duplicate of an existing row, the new row gets a new unique value for it’s uniqueifier. If you add over 2.1 billion rows with the same key, the uniquifier value exceeds the limit and you will see error 666.

A while ago, we nearly got into trouble because of a bad choice for clustering key that went undetected for so long.

Is your database nearing trouble?

Here’s a script that might help you find out.

select top 100
  OBJECT_SCHEMA_NAME(t.object_id) as [schema],
  t.name as [table],
  i.name as [index],
  i.type_desc,
  i.is_unique,
  s.range_high_key, 
  s.equal_rows
from 
  sys.indexes i
inner join
  sys.tables t
  on i.object_id = t.object_id
cross apply 
  sys.dm_db_stats_histogram(i.object_id, i.index_id) s
left join
  sys.index_columns ic
  on ic.object_id = i.object_id
  and ic.index_id = i.index_id
  and ic.index_column_id = 2
where 
  i.index_id = 1
  and i.is_unique = 0  
  and ic.index_id is null
order by s.equal_rows desc

This query looks at clustered indexes that are not unique. It looks through the stats histogram and reports any with a high “EQUAL” range count.
If any of the equal_rows values are nearing 2 billion, then look out.

Kimberly Tripp always advocated uniqueness as a property of a good clustering index. That implies that equal_rows in the results here should be closer to 1 – nowhere near 2 billion!. So Microsoft’s advice holds true: “review tables that rely on uniqueifiers and proactively work to improve its design”.

If you find instances where any of the equal_rows are above 10000, you might choose to look closer.

Update: A careful look at the script will tell you that I’m excluding clustered indexes with more than one key column from the results. It’s still possible to get into trouble with multi-column clustered indexes, but because the stats histogram only focuses on the first column, this script can’t (easily) warn you about those times. See the comment section for lots more discussion.

6 Comments »

  1. Ooo, I love the use of sys.dm_db_stats_histogram to quickly identify these!

    You might wanna say that this script only catches single-column clustered indexes, not multi-column ones. I spent a few minutes trying to figure out if I could do it for multi-column indexes using the density vector or something, but it’s going to be a little more complex than that – would require multiple joins to sys.dm_db_stats_histogram for each column in the index, and of course that function requires a stat_id, not a column_id. I’m so curious though and I’m on the hunt!

    Comment by Brent Ozar — August 23, 2024 @ 7:20 am

  2. Yves reminds me that we have a flight to catch, so I’m off, but I had to share this part because it was a fun exercise:

    SELECT ac.object_id, ac.name, ac.column_id, MAX(h.equal_rows) AS equal_rows_max
    FROM sys.all_columns ac
    INNER JOIN sys.stats_columns sc ON ac.object_id = sc.object_id AND ac.column_id = sc.column_id AND sc.stats_column_id = 1
    CROSS APPLY sys.dm_db_stats_histogram(ac.object_id, sc.stats_id) h
    GROUP BY ac.object_id, ac.name, ac.column_id

    For all of the columns, I’m getting the max equal_rows from any available statistic’s histogram. I realized that I didn’t really care which statistic it was – doesn’t matter if it’s an auto-created stat or an index’s stat, and it doesn’t matter which statistic is the most up to date or most sampled. If ANY stat suggests that a single value has a high number of equal rows, then you probably wanna investigate more deeply.

    The next step would be to take the non-unique clustered index, join all of its columns to that. It’ll produce false positives in some cases though. This is such a fun thought exercise!

    Comment by Brent Ozar — August 23, 2024 @ 7:46 am

  3. […] Michael J. Swart is one of a kind: […]

    Pingback by Uniquifiers Doing Heavy Lifting – Curated SQL — August 23, 2024 @ 8:00 am

  4. Neat! One issue I’ve run into in a previous shop was an anti-pattern by the application whereby they’d first insert data into the table with the (same) bogus clustering key and then update it immediately to the correct value. T-log inefficiency aside, this not only incremented the uniquifier but also thwarts things like you’re describing here to detect it since the evidence was super ephemeral. I was there for the *second* time they ran out of uniquifiers on that table (my understanding was the first time it took them several hours while they rebuilt the clustered index on a 2b row table; the second time, they just changed to a different nonce for the bogus clustering key).

    Comment by Ben Thul — August 23, 2024 @ 10:03 am

  5. Hey Brent, thanks for the comments.

    It’s a bit of a tradeoff between avoiding false negatives and avoiding false positives. (I’ll probably update the post to make that clearer)

    With the script in the blog, I can still miss high uniquifier values like you mentioned, i.e. with the multicolumn nonunique clustered indexes.

    But I thought of another way we can miss high uniquifier values. Similar to Ben’s ephemeral issue, this is one way to reproduce the issue while keeping database sizes small:

    use tempdb;
    set nocount on
     
    create table example (
    	id int identity not null
    		index ix_example1 nonclustered,
    	cluster_column int not null 
    		index ix_example2 clustered
     
    )
     
    declare @x bigint = 1;
    while (@x < 100000000) 
    begin
    	insert example (cluster_column) select 0 from generate_series(0,100000,1);
    	delete example where id < (select max(id) from example);
    	set @x += 1;
    end

    So as long as you have enough tempdb log space, you can get a 666 error in a few hours.

    Comment by Michael J. Swart — August 23, 2024 @ 10:47 am

  6. I looked to find the density vector info using DMVs. I don’t think it’s available to use easily 🙁

    Comment by Michael J. Swart — August 23, 2024 @ 11:45 am

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress