Hashing for Indexes

November 22, 2013

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 12:07 am

Hashing functions. I’m going to talk about two ways to hash your data inside SQL Server and one way outside SQL Server.

CHECKSUM options

Let’s talk about BINARY_CHECKSUM() and CHECKSUM().

These are both vulnerable to collisions but they’re nice and quick. What’s the difference between these two? CHECKSUM is collation aware but BINARY_CHECKSUM is not. So strings which evaluate as equal – for example, lower case ‘a’ and uppercase ‘A’ – produce equal hashes when using CHECKSUM but different hashes when using BINARY_CHECKSUM.

Either of these are good for sanity tests (in test frameworks) and for hash indexes. But I wouldn’t count on such an index to enforce uniqueness, the frequency of hash collisions is too high.

If SHA1 is the new MD5, What's the new SHA1?

HASHBYTES options

HASHBYTES is another good option. You’ve got all kinds of hash algorithms with this one function. You can specify any of the following hash algorithms, MD2, MD4, MD5, SHA, SHA1, or SHA2.

Even with the limitation that you can only hash strings up to 8000 bytes, I like HASHBYTES a lot. I don’t use it for security, but I use it for unique hash indexes for database rows. I’ve described how to be careful when dealing with unicode strings in Careful Hashing

Here’s an example, consider

CREATE TABLE Dim_Tweet
(
    TweetId INT IDENTITY PRIMARY KEY,
    Tweet NVARCHAR(140) UNIQUE, 
    Frequency int
)

It’s got a pretty wide unique key there. I’d prefer something like

CREATE TABLE Dim_Tweet
(
    TweetId INT IDENTITY PRIMARY KEY,
    Tweet NVARCHAR(140),
    TweetHash as HASHBYTES('SHA1', Tweet) PERSISTED UNIQUE,
    Frequency int
)

just for the narrower index.

But what about hash collisions? Lordy, lordy, this again? I’m firmly in the camp that the chance of collisions is so small that the risk is negligible. I mean, the birthday paradox barely takes effect here. Just let me do some math, what if you inserted a unique row to this table once a second for forty years, you’d still be way more likely to win the lottery than to encounter a collision. MD5 and SHA1 isn’t exactly cutting edge cryptography, but it’s still good enough for what I need it to do (At least I’d choose it over the Hill Cipher).

Hashing Outside SQL Server

You’re not storing your files in the database right? That means that if you want to generate a checksum or hash for files, you’re probably going to want to do that processing outside SQL Server. I’ve bookmarked a C# solutions from stackoverflow How do I do a SHA1 File Checksum in C#? It’s a good start.

-- Comments (3)

3 Comments »

Something that’s probably relevant to how you hash is what you’re going to use the hash for.

If you’re using the hash for security purposes, then it’s important to you that collisions are very unlikely and the hashing function is hard to reverse. However if you’re just storing a precalculated hash for verifying the integrity of something, you might care less about those properties.

Comment by Chris Frey — November 24, 2013 @ 7:42 am
[…] Hashing for Indexes – Michael J. Swart (Blog|Twitter) […]

Pingback by (SFTW) SQL Server Links 06/12/13 • John Sansom — December 6, 2013 @ 5:24 am
The Fowler–Noll–Vo hash function is my favorite, as it offers an acceptable rate of collisions (lower than CHECKSUM), but does not sacrifice computational performance of the hash result. There is also a handy SQL Server CLR for the FNV hash algorithm which I typically use throughout many databases that exhibit a hashing requirement.

Comment by Kendra Everett — December 11, 2013 @ 7:49 am

RSS feed for comments on this post. TrackBack URL

Michael J. Swart

November 22, 2013