Michael J. Swart

February 27, 2012

I Was Kidding Myself

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 6:34 pm

Do you know who Vizzini is? He’s a character from the Princess Bride story (I loved the book and if you want to read it, I would highly recommend the hard-to-find unabridged version). Most people know Vizzini from the movie. He’s this guy:

... self-kidder

For those who haven’t seen it… do! I’ll wait. This blog post will still be here when you get back.

As one of the villains in the Princess Bride, he faces the protagonist Westley and they have a battle of wits to the death. Vizzini is asked to decide which of two cups contains poison. Through some pretty heavy-duty reasoning he makes his choice and guesses wrong. His fatal flaw was that he was over-confident in his knowledge. He assumed that he knew a fact and was 100% sure that he was correct. He risked his life based on that assumption and failed. He was kidding himself.

I Kid Myself

I do that too. And way more often than I’m comfortable with. I forget who said it first, but you may have heard “it’s less dangerous to not know something than to know something that’s wrong for sure.”

Here’s how I know that I’m kidding myself (And I’ve played through this scenario many times)

Stage 1: I see behaviour from SQL Server that I don’t expect:

For example, some time ago the performance of a stored procedure took a nosedive when we switched some table columns from VARCHAR to NVARCHAR.

“I’ve seen this behaviour before” I thought,  “The parameters are mismatched with the column type and an implicit conversion is being done. We just have to match up the types and problem will be solved. I’m sure of it.”

But that wasn’t the case. The developer came back to me and showed me I was wrong. The parameters for the stored procedures had already been changed to match the table columns. So what was going on?

Stage 2: I get uncomfortable about it:

I’m supposed to know this stuff. But now the behaviour of SQL Server is clearly demonstrating that I don’t and it’s not up for discussion! Sometimes it feels as if I woke up and the sky was green. Why would Unicode string searches behave that many times worse than Ascii string searches?

That’s when I know I’ve been kidding myself. And I feel uncomfortable about it because if I’m wrong about this stuff, what else am I wrong about?

Stage 3: So I go to learn more things:

This is the fun part. Investigation time. I use google, #sqlhelp, and everything I can to help me get to the bottom of the mystery. In the case of the Unicode string mystery, I opened a  StackOverflow item and got some quick quality answers. Also after the mystery got solved, it made a great topic for a blog post.

Stage 4: I get more comfortable about it:

So beforehand, I thought I knew what I knew. But now I feel sure about it.

It’s hard to describe. Do you know how when you’re asleep, you still think you’re awake? But when you’re actually awake there’s somehow this extra level of sureness? It’s like that.

A Slightly Different Point of View

Like I said, I’ve been through this process many many times. My typical learning journey goes from knowing I don’t know, through the discomfort and back to optimism again.

And I don’t get too uncomfortable any more about not knowing something. It’s because when I recognize myself in stage 2, I’m anticipating stage 3 already: It’s learning time!

So how about you reader? Have you been through this yourself? I’d love to hear about some of your own experiences. What is the craziest thing you believed that turned out to be wrong?

February 15, 2012

Careful Hashing

Filed under: SQLServerPedia Syndication,Technical Articles — Tags: , , , , , — Michael J. Swart @ 12:00 pm

Cryptographic hashing functions like MD5 or SHA-1 are examples of one-way functions that take any data you give it and return a fixed set of bytes (for example, 16 bytes for MD5 and 20 bytes for SHA-1).

I recently had trouble coming up with the same hash value for what I thought were the exact same inputs. Compare the following two examples:

// C# example
byte[] byteArray = Encoding.Unicode.GetBytes("I'm a lumberjack and I'm okay.");
SHA1 sha = new SHA1CryptoServiceProvider();
byte[] hashedPasswordBytes = sha.ComputeHash(byteArray);

That’s different than this:

-- SQL example
select HASHBYTES('SHA1', 'I''m a lumberjack and I''m okay.')

So what gives? Before reading any further, can you spot the problem? In both examples above, two things are happening. First a string of characters is being converted into a string of bytes and then that string of bytes gets hashed. I’ve learned that it’s not the hash function that’s the problem, it’s converting the character string into binary that is inconsistent.

String Conversions

Maybe that’s not surprising, but still, it’s one more thing to be careful about: converting strings to bytes. SQL Server’s wide characters (NCHARs and NVARCHARs) are equivalent to .Net’s Encoding.Unicode and SQL Server’s single non-unicode characters (CHAR and VARCHAR) (single-byte characters) are equivalent to .net’s Encoding.ASCII.

So for completeness sake, these should be equivalent:

-- wide characters (remember unicode string literals are prefixed with 'N')
select HASHBYTES('SHA1', N'I''m a lumberjack and I''m okay.')
-- non-unicode characters:
select HASHBYTES('SHA1', 'I''m a lumberjack and I''m okay.')


// Unicode
byte[] byteArray = Encoding.Unicode.GetBytes("I'm a lumberjack and I'm okay.");
SHA1 sha = new SHA1CryptoServiceProvider();
byte[] hashedPasswordBytes = sha.ComputeHash(byteArray);
byteArray = Encoding.ASCII.GetBytes("I'm a lumberjack and I'm okay.");
hashedPasswordBytes = sha.ComputeHash(byteArray);

This means that if you’re going to hash strings in SQL Server and compare them to strings that are hashed in C#, you want to use

  • Encoding.Unicode for NCHARS and NVARCHARS
  • Encoding.ASCII for CHARS and VARCHARS

This also means that for this use case, I wouldn’t feel safe using other encodings (such as Encoding.UTF8) because SQL Server can’t duplicate that string to binary conversion.


  • Don’t trust me and test it out for yourself!
  • Collations matter a little (sort order is not relevant here). The code page for your SQL Server collation should match the code page of the .net encoding when dealing with single byte characters. Unicode strings should not care about collation.
  • If you do find any counter-examples for anything I’ve written here, let me know in the comments.

February 9, 2012

Microsoft Model Databases: Some Unconventional Alternatives?

Filed under: SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 8:53 pm

It seems like Microsoft has always provided example databases in order to illustrate database concepts.

Pubs: Back when SQL Server 2000 was new, Microsoft provided this sample database and maybe you remember it. Pubs was a really simple database that was designed to represent data that a fictional book publishing company might use. The database had a name (pubs) but I don’t think anyone ever gave the company a name. From here we’re told that it was “used to demonstrate many of the options available” for SQL Server.

Northwind also was available for SQL Server 2000. It was a database that was used for a fictitious company “Northwind Traders”. The company “imports and exports specialty foods from around the world.” It was better than pubs, but that wasn’t saying much. Pubs set a pretty low bar.

AdventureWorks: The Adventureworks database came along with SQL Server 2005 and we were introduced to the bicycle manufacturer Adventure Works Cycle. The database was pretty comprehensive and it touched on a lot more features of SQL Server. I’ve seen countless performance demos that query Sales.SalesOrderDetail. There’s even an Adventureworks Data Warehouse, an Adventureworks OLAP cube and even a light version: AdventureworksLT.

Contoso And lately there’s a new sample database Contoso. It’s another fictional retail company (Contoso Ltd. This time they’re selling electronics and appliances). This database is used for Business Intelligence demos. With Contoso, Microsoft has developed and made a data warehouse and an SSAS cube available.

Presumably, each of these databases is an example of Microsoft putting their best foot forward. Or at least it’s a database which enables them to show off what their database software can do.

Beyond Demos

So those databases are good for demos.

But if we’re looking for other Microsoft-designed databases, we don’t have to stop there. There’s other ones that we can look at. I’m talking about the databases that Microsoft ships along with their products. When looking at these, remember:

  • There are some good examples, …
    We can look at those databases and see how they’re constructed. If you see some interesting db designs, you can use that.
  • … and there are some not so good examples …
    These databases weren’t built as database role models. So there’s bound to be a lot of database design choices that are not ideal. I know that Microsoft has huge pressure to maintain backwards compatibility.
  • … and that’s okay.
    It’s actually a fun exercise to pick out the good designs from the bad. It’s like a smorgasbord, you take the lessons you like and leave the rest.

Some examples of what I’m talking about.


Install and configure an instance of SQL Server Reporting Services. Take a look at the ReportServer. It’s a pretty nice database schema. One thing that was interesting was the choice of identity column data types:

  • Catalog seems to be one of the main tables here and they’re using uniqueidentifiers (GUIDs) for identity columns. Normally people scoff at that, but I think it makes sense in this case. Catalog is the table that tracks reports, datasources and folder structures. Even at it’s largest, it can’t be that big (I just checked my own installation and the clustered index fits on 1 extent). That means that all the worries about GUIDs for primary keys are mitigated and the choice of using a globally unique identifier might make sense here.
  • ExecutionLogStorage Contrast that with ReportServer’s largest table, ExecutionLogStorage. Depending on how busy your Reporting Services installation is, this table can grow very large with one row for every report view. When I look at the choice for their primary key, I see that it’s clustered index on a single BIGINT column defined as it’s IDENTITY column. Well done.

Any Database’s Metadata

  • I like the object oriented influence here:
    • The relationship between sys.objects –> sys.tables and between sys.objects –> sys.procedures is the subtype relationship class –> subclass. And that shows in the schema, for example, there’s one sys.object row for every sys.table row.
    • The relationship between sys.foreign_keys –> and sys.foreign_key_columns is the object composition relationship. And that also shows in the schema too. The “primary key” for the table foreign_key_columns is (constraint_object_id, constraint_column_id) and it contains the primary key of its owner.
  • Another good example is the sys.index_columns table. Look at the schema diagram below. It’s a perfect instance of my favourite DB Design example

Any Sharepoint database

  • [Sigh] These are mostly counter-examples here, I took a brief glance at the schema once and decided quickly that there are no good lessons. Can anyone tell me if that’s changed?

February 2, 2012

Removing Duplicate Dimension Rows in SSIS

Something New For Me

I’m relatively new to SQL Server Integration Services (SSIS). And so I want to explain how I solved a problem and invite Business Intelligence (BI) experts (and wannabe experts) to comment or point out things that I missed.

A Short Description of the Problem

So here’s a short description of the problem I’m trying to solve. It involves populating a dimension table. I want a data flow transformation which removes duplicate key values. In other words, every time I come across a new key value for a dimension, I want to pass through that entire row (not just the key values). This is kind of like a cross between T-SQL’s DISTINCT and FIRST_VALUE.

I found out recently that people have treated this topic before:

But because of my bad luck, I only found these posts after I muddled through a solution. But I was surprised to find out my solution had some interesting differences.

The Setup

First I’ll explain my situation. Say I’ve got a flat file containing data about Prime Ministers:

The source data

My main goal is to use this data to populate a data warehouse. That data warehouse has a schema like this:

The Data Warehouse Schema

This example is just for illustration. In real life, the number of fact rows is very large, and the number of dimension rows is relatively small. The time dimension will be pre-populated, but I want to load the other dimensions using values from the flat file. So that means I plan to process the data twice: The first time will be to load these dimensions, and the second time will be to load the fact table. When I process the data for dimensions, I :

  • Use a lookup transform to find whether that dimension already exists in the warehouse
  • Use the No Match Output to identify dimension rows I haven’t seen before. Those rows will be used to load the dimension. (i.e. I’m treating the dimension like a Never-Changing-Dimension rather than a Slowly Changing Dimension.)

The problem is that these unmatched rows can be duplicated and I don’t want those duplicate dimensions rows getting into the dimension table.

Things I considered:

  • I considered using a staging table on the DB side. But that required too much space, the amount of space needed was comparable to the fact table, not the dimension table.
  • I also considered inserting into dimensions a row-at-a-time using the MERGE statement on the database side. But I didn’t consider that for long. The performance with that strategy was so unacceptable.
  • I could have used a sort/aggregate, but that was a blocking transformation and I really didn’t need the data sorted.

What I ended up with

I created a script transform which seems to do the trick for what I want. My Data Flow task looks like this:

In those script transforms you see up there, I told SSIS that each script will be used to filter rows out or let them pass through. I did that by setting ExclusionGroup to 1 (by default, it’s normally 0). Here’s where that’s done:

Then in the script, I created a list of seen dimensions. And I let rows pass through if they haven’t yet been seen. The script simply looks like this:

private HashSet<string> seenParty = new HashSet<string>(); 
public override void Input0_ProcessInputRow(Input0Buffer Row)
    if (Row.Party_IsNull)
    if (!seenParty.Contains(Row.Party))

Update October 9, 2013: And the seenParty variable should be defined as a HashSet. (See Abacus’s comments)

Logically it behaves exactly the same as the sort transformation (as explained in “Dead Easy Duplicate Key Removal” by Todd McDermid). Except for these things:

  • It seems to perform fast
  • It’s not a blocking transformation
  • But you don’t get all that for nothing. What we’re not getting with this method is that the rows aren’t sorted. In most cases, that doesn’t matter.
  • It takes some memory, so you have to be aware of that. Ask yourself, how well can all unseen dimension keys fit in memory?

Some questions for you BI guys

And here’s where I’d like you to chime in.

  1. What’s the most common approach to this problem?
  2. I get the feeling that I could use the Cache Transform for this technique. But I haven’t tried it.
  3. Are there any new features in SQL 2012 that help?
  4. Am I re-inventing the wheel here?
  5. Am I missing anything else that should be obvious?

Powered by WordPress