Cryptographic hashing functions like MD5 or SHA-1 are examples of one-way functions that take any data you give it and return a fixed set of bytes (for example, 16 bytes for MD5 and 20 bytes for SHA-1).
I recently had trouble coming up with the same hash value for what I thought were the exact same inputs. Compare the following two examples:
// C# example byte[] byteArray = Encoding.Unicode.GetBytes("I'm a lumberjack and I'm okay."); SHA1 sha = new SHA1CryptoServiceProvider(); byte[] hashedPasswordBytes = sha.ComputeHash(byteArray); Console.WriteLine(BitConverter.ToString(hashedPasswordBytes)); //30-F1-CE-3E-7E-1F-17-27-A7-B5-97-39-44-D0-FE-EA-DA-69-53-2B |
That’s different than this:
-- SQL example select HASHBYTES('SHA1', 'I''m a lumberjack and I''m okay.') --0x4D6936CEDD0DE794AC86F9A4099DCBB4EFED8E1F |
So what gives? Before reading any further, can you spot the problem? In both examples above, two things are happening. First a string of characters is being converted into a string of bytes and then that string of bytes gets hashed. I’ve learned that it’s not the hash function that’s the problem, it’s converting the character string into binary that is inconsistent.
String Conversions
Maybe that’s not surprising, but still, it’s one more thing to be careful about: converting strings to bytes. SQL Server’s wide characters (NCHARs and NVARCHARs) are equivalent to .Net’s Encoding.Unicode and SQL Server’s single non-unicode characters (CHAR and VARCHAR) (single-byte characters) are equivalent to .net’s Encoding.ASCII.
So for completeness sake, these should be equivalent:
-- wide characters (remember unicode string literals are prefixed with 'N') select HASHBYTES('SHA1', N'I''m a lumberjack and I''m okay.') --0x30F1CE3E7E1F1727A7B5973944D0FEEADA69532B -- non-unicode characters: select HASHBYTES('SHA1', 'I''m a lumberjack and I''m okay.') --0x4D6936CEDD0DE794AC86F9A4099DCBB4EFED8E1F |
And
// Unicode byte[] byteArray = Encoding.Unicode.GetBytes("I'm a lumberjack and I'm okay."); SHA1 sha = new SHA1CryptoServiceProvider(); byte[] hashedPasswordBytes = sha.ComputeHash(byteArray); Console.WriteLine(BitConverter.ToString(hashedPasswordBytes)); //30-F1-CE-3E-7E-1F-17-27-A7-B5-97-39-44-D0-FE-EA-DA-69-53-2B //ASCII byteArray = Encoding.ASCII.GetBytes("I'm a lumberjack and I'm okay."); hashedPasswordBytes = sha.ComputeHash(byteArray); Console.WriteLine(BitConverter.ToString(hashedPasswordBytes)); //4D-69-36-CE-DD-0D-E7-94-AC-86-F9-A4-09-9D-CB-B4-EF-ED-8E-1F |
This means that if you’re going to hash strings in SQL Server and compare them to strings that are hashed in C#, you want to use
- Encoding.Unicode for NCHARS and NVARCHARS
- Encoding.ASCII for CHARS and VARCHARS
This also means that for this use case, I wouldn’t feel safe using other encodings (such as Encoding.UTF8) because SQL Server can’t duplicate that string to binary conversion.
Caveats
- Don’t trust me and test it out for yourself!
- Collations matter a little (sort order is not relevant here). The code page for your SQL Server collation should match the code page of the .net encoding when dealing with single byte characters. Unicode strings should not care about collation.
- If you do find any counter-examples for anything I’ve written here, let me know in the comments.
As the person who kept bringing Collation when we talked about this it is interesting that it has little to no effect. Thanks for yet another excellent post.
Comment by Scott MacLellan — February 15, 2012 @ 2:16 pm
[…] Careful Hashing | Michael J. Swart […]
Pingback by SQLQuill – Link Round Up – February 2012 Edition « SQL Feather and Quill — February 16, 2012 @ 10:03 am
[…] Careful Hashing – A cautionary tale this week from Michael J. Swart (Blog|Twitter). […]
Pingback by Something for the Weekend – SQL Server Links 17/02/12 — February 17, 2012 @ 9:13 am
Thank you. Very clear and helpful !
Comment by Rick Willemain — February 16, 2015 @ 12:35 pm