Data Modelling: Composite Relationships

July 26, 2012

Data Modelling: Composite Relationships

Filed under: SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 12:00 pm

Takeaway: In this article, I recommend that you use composite primary keys for child tables in composite relationships. If a child record is part of its parent record, then the parent’s primary key should be part of its child’s primary key.

That means that this:

create table CD
(
    CDId int identity,
    Name nvarchar(max),
    Artist nvarchar(max),
    Genre nvarchar(max),
    primary key (CDId)
)
 
create table Track
(
    CDId int references CD(CDId),
    TrackId int,
    Name nvarchar(max),
    primary key (CDId, TrackId)
)

… is better than this:

create table CD
(
    CDId int identity,
    Name nvarchar(max),
    Artist nvarchar(max),
    Genre nvarchar(max),
    primary key (CDId)
)
 
create table Track
(
    TrackId int identity,
    CDId int references CD(CDId),
    TrackNumber int,
    Name nvarchar(max),
    primary key (TrackId)
)

First some definitions, and then I’ll briefly describe the benefits of this practice.

Composite Relationship

I’m using the UML definitions when I talk about these relationships. Composite relationships between two tables occur when a child points to a parent and when the child is considered part of the parent. A quick test for this relationship is to ask whether records are deleted when the parent is. Or you can ask whether an object gets copied when the parent does. Track-CD is a good example of an composite relationship. Other examples might be Department-Company or SalesOrderDetail-SalesOrderHeader .

Relationships That Are Not Composite

For contrast, some relationships that are not composite include aggregation, association or dependency relationships. For example, the driver-vehicle relationship is not an composite relationship. It’s a “has a” relationship. Both vehicles and drivers have their own life cycle and it’s common for vehicles to change owners.

Benefits

Since the child records live and die (and get copied) with the parent record, it makes sense to have the parent PK column be the first column of a child’s primary key. The values of the child’s primary key almost never change. Some other benefits:

Index lookups and index scans become more efficient.
When you extend this practice to children of children, it’s even better. You have the opportunity for less joins on many queries.
The SQL you have to write turns out so much cleaner. I hinted about that here: My Favourite Data Modelling Example.
In some cases business rules can be enforced that couldn’t before.
Delete statements and purge strategies become simpler.

Is This Obvious?

Maybe when you read the Track-CD example above, it was obvious to you that the first example is better. But I see the second example way too often. So often that I felt it deserved a blog post.

If you want, you can look for this stuff in your own tables to double check. Try this out a couple times:

Find a foreign key somewhere in your database.
Is the relationship enforced by the foreign key an composite relationship? In other words, is one record part of the record it refers to? (e.g. Tracks are a part of their CDs)
Then are the foreign key columns part of the primary key? I hope so.

-- Comments (15)

15 Comments »

[…] Data Modelling: Composite Relationships – Michael J. Swart (Blog|Twitter) […]

Pingback by Something for the Weekend - SQL Server Links 27/07/12 — July 27, 2012 @ 7:41 am
I agree with the point you are making, but not the particulars. Tables models ets so the names need to be collective nouns. IDENTITY is never a key by definition; CDs have industry standard IDs (use the UPC if nothing else!). “Name” is too vague to be a data element; it changes from the track to the album in this schema. Why invite garbage data with the proprietary (MAX) in strings? Where is an ON DELETE CASCADE to preserve data integrity? I would have a CHECK (album_genre IN (..)) constraint. But I can argue that genre goes with the track in case of mixes. Why is the coutn of tracks allowed to be negtive and huge; SMALLINT with a CHECK() would be safer.

Child and parent are terms from the old network databases; we say referenced and referencing tables instead. We also refer to weak and strong entities in a data model.

Ghod! I am obsessive!

Comment by Joe Celko — July 27, 2012 @ 1:12 pm
Hi Joe! Welcome to my blog!

I actually struggled whether to blow out the data model into something I would actually check in, or let it be more minimalist. I opted for the minimalist because it was visually easier to compare the two examples. Now that you mention it, I may be doing a disservice to the readers this way.

I agree with almost every point you make. Some other thoughts about that:
* The proprietary (max) was just laziness.
* So was the “Name” for column names
* The check constraints were omitted for brevity.
* Collective nouns are a matter of taste I think.
* Here’s one extra thing that I also omitted that you didn’t mention. Naming your constraints. It helps with subsequent modifications and migrations.

I actually gave some thought about the parent-child versus referenced-referencing terms. I’d love to use referenced-referencing, but it doesn’t really roll off the tongue does it? If I were a better author, I could make it work.

Thanks for stopping by. I just took another look at “SQL for Smarties” (It’s been a while since I read it). I took a look to see how you write your CREATE TABLE statements. I wouldn’t say you’re obsessive, but man you’re thorough.

Comment by Michael J. Swart — July 27, 2012 @ 2:39 pm
Hi Michael,

I like the post and Joe’s very thorough thought process that went into his comments.

I’ve got a couple of scenarios that might require additional work on the data model, though (assuming for a moment that UPC was adopted as the primary key for the CD table):
1. multi-CD releases – one UPC for the boxed set, but not for the individual CDs
2. self-released CDs by small bands – they might not have a UPC for their limited distribution

For 1. – does that introduce a new table or can you limit yourself to an extra column in the Track table?

For 2. – creating your own UPC for these CDs has its own problems. But what is an alternative natural key in this case – maybe a combination of band name, album name, release date? But – do you want to carry these fields into the referencing table?

Cheers and Happy Weekend!

Comment by Karl Baader — July 27, 2012 @ 4:46 pm
One issues I have with this approach is that in any complex model, you often get children of children, and keys start becoming complex. That’s not a big issue, I admit, particularly if you are using synthetic keys. However, if you are using domain-specific keys then the primary key of a grand-child or great-grandchild becomes cumbersome. For that reason, I make a point of deciding whether a table will have children of its own, and if so, ensuring it has a single-column primary key.

In a *logical* model, it’s all good, and I’d recommend that approach. For a physical model, the size of keys being checked for referential integrity can make a performance difference.

Comment by Bruce W Cassidy — July 28, 2012 @ 10:57 pm
I have to disagree with Mr. Cassidy. You seldom get a key beyond 2 levels; esentailly an (x,y) co-ordindate in a data space. Getting the (x,y,z) is the most likely with z usually a temporal dimension. I cannot think of anything I have done with more than five (multiple warehouse locations that went down to package in a bin).

What makes these longer compound keys hard for SQL Server is that MS does not have ANSI row comparisons. In other complaint products I can write (x,y,z) = (1,2,3) and use cut&paste to make code easy and readable.

What I hate is seeing someone add the overhead of GUID that is actually biger than a few strings or small numerics, think this will save space.

Comment by Joe Celko — July 29, 2012 @ 6:00 pm
There are reasons why we will see non-composite primary keys still used (a lot):

1. Often, not always, it is faster to have a single PK (rows are more unique, making for better statistics on the first column).

Let us assume we have a non-clustered index on the foreign key field(s) to the parent record, a quite common method i would argue. Finding all children of a parent is still quite fast this way, most such relationships have only a small number of children. More likely as not, the unique child numbers are pretty closely spaced, if not adjacent (likely on the same page). This makes I/O overhead of a bookmark lookup minimal and you gain flexibility in that a non-clustered indexes can be made to cover many other queries too. We could add a status column and include it in the this index for example and quickly find all child records with a particular status of a particular parent .

Load/modify/store operations are likely more efficient using this kind of modeling as you have statistics on the primary column, not the secondary column in an index (to my knowledge). Does it matter a whole lot? It depends, more so in cases where there exist many parents that have moderate to large number of child’s. Homing in on the records to work with is what needs to be quick here.

Then there is the tradeoff to be made when several more indexes come into play. Each index containing the PK or the table, which would become two columns instead of one. This adds overhead serving no purpose.

2. Application layers are simpler when primary keys consist of just one number.

It can be argued that a well setup layer does not need to be constraint by this, but fact of life is that it simplifies code a whole lot. Instead of working with collections, representing the composite key, a single number can be used. Given that a lot of modeling is done by application developers and not DBA’s, this consideration will play a big role in deciding which way to go. I am not just speaking of developers with no data modeling experience that just take the easiest possible route.

Comment by Peter — July 30, 2012 @ 6:16 am
Hi @Karl,

For 1. I think then that you have to decide what you’re storing in that table. Is it a product? Or is it an actual compact disc (It’s like the classic example of a volume vs. a book: The bible is one volume with many books and the Lord of the Rings is a book with many volumes (sort of)).

For 2. I’m not sure what the natural key would be. A surrogate key would be almost necessary in the case where no natural key is found. People are another group that are hard to find natural keys for. It’s very difficult to find a natural key for us. Names aren’t unique. Not everyone has a Social Security Number. (I know someone with two!)

To every other reader. I’m going to brag for a bit: I work with Karl!

Comment by Michael J. Swart — July 30, 2012 @ 8:32 am
@Bruce, @Joe,
For what it’s worth, my experience tells me Joe’s right here. I’ve seen a couple examples of three levels of composite relationships. And one example of four levels. In the case of having four levels of ownership, it actually caused the most pain writing queries against that lowest tables which were missing easy ways to join to the higher levels. It was in that case that I wished most that the PKs were composite.

Comment by Michael J. Swart — July 30, 2012 @ 8:37 am
@Peter,

For your first point. Indexing is always done on a case by case basis of course (for example composite relationships often have only a small number of children, but not all). In my experience I’ve never regretted clustering on the parent table. In fact where I’ve forgotten, I’ve learned later that the typical usage of these tables indicated that I needed to re-evaluate indexing and put the clustered index back on the parent’s PK.

For your second point, I have a much *much* different experience than you. I see developers have an easier time with the composite keys. Especially on Deletes and Selects. But every shop is different. There are so many circumstances to take into account; if you found something that works great.

Comment by Michael J. Swart — July 30, 2012 @ 8:53 am
Hi Mike! Who’s on the front cover of that CD? It made me think of Milli Vanilli. Great writing!

Comment by Craig — July 30, 2012 @ 9:45 pm
Nice read!

I have a situation where clustering on the parent key would certainly help the performance of some queries. Say it’s a CRM database so a contact is parented by an organization. In the composite model, every transaction for each contact will require the composite key. Won’t the additional column begin to have negative effects in large systems? Or would the performance benefit from the more alighned indexing outweigh the storage cost?

Comment by Chris Ross — July 31, 2012 @ 9:08 am
@Craig, Totally Milli Vanilli. Many of my co-workers have never heard of them.

Comment by Michael J. Swart — July 31, 2012 @ 9:09 am
Hi Chris,
In this case, the CONTACT is not a part-of the ORGANIZATION. Contacts are owned-by Organizations, but I don’t think that CONTACT-ORGANIZATION relationship is a composite relationship the way I’ve used the term.

For performance and indexing, you really have to take each table on a case by case basis. I didn’t mention too much about indexing in my article on purpose because I don’t know any rules of thumb which are always useful. The best thing I can say is consider clustering on the parent’s key. Make sure it’s an option to be evaluated among others.

Comment by Michael J. Swart — July 31, 2012 @ 9:15 am
Fair enough! The org/contact isn’t the best example, but I am curious as to how a truly composite key model would work with a high volume of grandchildren (or referencing-ing?). In any case it’s a great model to keep in mind!

Comment by Chris Ross — July 31, 2012 @ 9:31 am