Hashing functions. I’m going to talk about two ways to hash your data inside SQL Server and one way outside SQL Server.
These are both vulnerable to collisions but they’re nice and quick. What’s the difference between these two? CHECKSUM is collation aware but BINARY_CHECKSUM is not. So strings which evaluate as equal – for example, lower case ‘a’ and uppercase ‘A’ – produce equal hashes when CHECKSUM but different BINARY_CHECKSUM.
Either of these are good for sanity tests (in test frameworks) and for hash indexes. But I wouldn’t count on such an index to enforce uniqueness, the frequency of hash collisions is too high.
HASHBYTES is another good option. You’ve got all kinds of hash algorithms with this one function. You can specify any of the following hash algorithms, MD2, MD4, MD5, SHA, SHA1, or SHA2.
Even with the limitation that you can only hash strings up to 8000 bytes, I like HASHBYTES a lot. I don’t use it for security, but I use it for unique hash indexes for database rows. I’ve described how to be careful when dealing with unicode strings in Careful Hashing
Here’s an example, consider
CREATE TABLE Dim_Tweet ( TweetId INT IDENTITY PRIMARY KEY, Tweet NVARCHAR(140) UNIQUE, Frequency int )
It’s got a pretty wide unique key there. I’d prefer something like
CREATE TABLE Dim_Tweet ( TweetId INT IDENTITY PRIMARY KEY, Tweet NVARCHAR(140), TweetHash as HASHBYTES('SHA1', Tweet) PERSISTED UNIQUE, Frequency int )
just for the narrower index.
But what about hash collisions? Lordy, lordy, this again? I’m firmly in the camp that the chance of collisions is so small that the risk is negligible.
I mean, the birthday paradox barely takes effect here. Just let me do some math, what if you inserted a unique row to this table once a second for forty years, you’d still be way more likely to win the lottery than to encounter a collision. MD5 and SHA1 isn’t exactly cutting edge cryptography, but it’s still good enough for what I need it to do (At least I’d choose it over the Hill Cipher).
Hashing Outside SQL Server
You’re not storing your files in the database right? That means that if you want to generate a checksum or hash for files, you’re probably going to want to do that processing outside SQL Server. I’ve bookmarked a C# solutions from stackoverflow How do I do a SHA1 File Checksum in C#? It’s a good start.