Michael J. Swart

January 6, 2014

SQL Simplicity Methods

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 9:00 am

Takeaway: Keep your database queries simple. Simple for you, simple for others and simple for SQL Server.

This isn’t a controversial topic. It’s widely accepted that simplicity is a virtue. That’s the message; what I hope to add are some methods and motivation.


Write Simple SQL For Yourself

Revisit old code from time to time and write code that you won’t mind revisiting.

All SQL developers are translators. We translate English descriptions of what we want into SQL. Simpler queries mean less mental energy is required when “translating” these requirements into SQL. It also goes the other way: if I write some SQL and then revisit it a year later, I have to translate backwards in order to understand the query’s intent.

What? You never revisit old code? Whether it’s code or blog articles, when I look back at what I wrote two short years ago, I’m often not happy with what I read. I sometimes cringe a bit at what I’ve written, and I’ll often update old blog posts that need it. SQL is no exception. But reviewing old writing is a useful exercise. Joan Didion, a favorite author of mine, points out that “we are well advised to keep on nodding terms with the people we used to be.”

Write Simple SQL For Others

Try organizing complicated queries using CTEs. It helps others understand them.

Simpler SQL doesn’t need to be shorter SQL. Jeremiah Peschka pointed me at a blog post by Selena Deckelmann, “How I Write Queries Using PLSQL: CTEs”.
Selena describes (better than I can) how useful CTEs can be when developing SQL. The CTEs provide a way to construct “a set of understandable ‘paragraphs’ of SQL” which can be used to “explain the logic of the query with others”.

Now that is how you do self-documenting code with SQL. When written with CTEs, your SQL will be much clearer than when you use subqueries. But don’t take my word for it. Check out a brilliant example in Brad Schulz’s “Playing Poker With SQL”. In that blog post he develops a single query which reports the results of a 10-hand poker game. Here’s an abbreviated version of his final query:

with DeckOfCards as
(. . .)
,ShuffleAndDeal as
(. . .)
,HandEvaluation1 as
(. . .)
. . .
,HandEvaluation7 as
(. . .)
select PlayerID
      ,Hand=Card1+' '+Card2+' '+Card3+' '+Card4+' '+Card5
      ,HandDescript
from HandEvaluation7
pivot (max(CardName) for CardSeqName in (Card1,Card2,Card3,Card4,Card5)) P
order by PlayerRanking
/*
PlayerID  Hand            HandDescript
--------  --------------- ---------------
       3  2♠ 7♠ 8♠ 9♠ 10♠ Flush
       1  5♥ 5♣ 5♠ 6♣ K♣  Three of a Kind
       5  2♥ 2♦ 6♥ K♠ K♦  Two Pair
       4  6♠ 10♦ Q♠ Q♥ K♥ Two of a Kind
       6  4♦ 7♣ 7♦ 8♥ J♦  Two of a Kind
       2  2♣ 3♣ 3♠ 9♥ J♣  Two of a Kind
       9  5♦ 9♦ J♥ Q♦ A♦  Nothing
       7  3♥ 4♣ 6♦ 10♥ A♠ Nothing
       8  3♦ 4♥ 8♣ 9♣ A♣  Nothing
      10  4♠ 8♦ 10♣ J♠ Q♣ Nothing
*/

In his post, Brad develops his query incrementally using CTEs. It accomplishes something very complicated, yet it seems simple. I use it as a model for how to organize complex queries (when I must).

Write Simple SQL For SQL Server

Shorter SQL performs better. Consider breaking larger queries up into smaller chunks.

Why?

  • Fewer joins mean the optimizer has fewer query plans to evaluate, and that means superior (or even optimal) plans.
  • Larger query trees can mean less accurate cardinality estimates. With less accurate estimates, inferior plans can be chosen, resulting in poor performance behaviors like excessive CPU, excessive I/O, or tempdb spills.

Paul White gives a better and more in-depth explanation in an answer to a SQLPerformance.com question.

Read his tips; they’re very well explained. After his explanation, he mentions a strategy to deal with complex queries. He writes, “My usual approach is to break the query into manageable parts, storing reasonably-sized intermediate results in #temporary tables.” I want to show an example demonstrating how something like that could work.

I’ll use a query I made up against AdventureWorks2012 which, for blogging purposes, we’ll call complicated:

USE AdventureWorks2012
SELECT 
    p.BusinessEntityID
    ,p.FirstName
    ,p.LastName
    ,e.JobTitle  
    ,pp.PhoneNumber
    ,pnt.Name AS PhoneNumberType
    ,a.AddressLine1
    ,a.AddressLine2
    ,a.City
    ,sp.Name AS StateProvinceName 
    ,a.PostalCode
    ,cr.Name AS CountryRegionName 
    ,cust.Purchases
    ,sale.Sales as SalesCount
FROM Person.Person p
LEFT JOIN HumanResources.Employee e
    ON p.BusinessEntityID = e.BusinessEntityID 
LEFT JOIN Person.BusinessEntityAddress bea 
    JOIN Person.[Address] a 
        ON a.AddressID = bea.AddressID
    JOIN Person.StateProvince sp 
        ON sp.StateProvinceID = a.StateProvinceID
    JOIN Person.CountryRegion cr 
        ON cr.CountryRegionCode = sp.CountryRegionCode
    ON bea.BusinessEntityID = p.BusinessEntityID 
LEFT JOIN Person.PersonPhone pp
    JOIN Person.PhoneNumberType pnt
        ON pp.PhoneNumberTypeID = pnt.PhoneNumberTypeID
    ON pp.BusinessEntityID = p.BusinessEntityID
LEFT JOIN 
    (
        SELECT COUNT(1), c.PersonID
        FROM Sales.SalesOrderHeader soh
        JOIN Sales.Customer c
            ON c.CustomerID = soh.CustomerID
        GROUP BY c.PersonID
    ) as cust(Purchases, PersonID)
    ON p.BusinessEntityID = cust.PersonID
LEFT JOIN 
    (
        SELECT COUNT(1), SalesPersonID
        FROM Sales.SalesOrderHeader
        GROUP BY SalesPersonID
    ) as sale(Sales, PersonId)
    ON p.BusinessEntityID = sale.PersonId
WHERE p.FirstName = 'Michael'

Most people’s intuition is that a single query is preferable. But just like Paul White, I have found that performance can sometimes be improved when the work is split into many queries. Here’s an example of what that might look like:

USE AdventureWorks2012
 
CREATE TABLE #Results (
    BusinessEntityID int,
    FirstName nvarchar(50),
    LastName nvarchar(50),
    JobTitle nvarchar(50),
    PhoneNumber nvarchar(25),
    PhoneNumberType nvarchar(50),
    AddressLine1 nvarchar(60),
    AddressLine2 nvarchar(60),
    City nvarchar(30),
    StateProvinceName nvarchar(50),
    PostalCode nvarchar(15),
    CountryRegionName nvarchar(50),
    Purchases int,
    SalesCount int 
);
 
INSERT #Results (BusinessEntityID, FirstName, LastName)
SELECT BusinessEntityID, FirstName, LastName
FROM Person.Person
WHERE FirstName = 'Michael';
 
UPDATE #Results
SET JobTitle = e.JobTitle
FROM #Results r
JOIN HumanResources.Employee e
    on r.BusinessEntityID = e.BusinessEntityID;
 
UPDATE #Results
SET AddressLine1 = a.AddressLine1,
    AddressLine2 = a.AddressLine2,
    City = a.City,
    StateProvinceName = sp.Name,
    PostalCode = a.PostalCode,
    CountryRegionName = cr.Name
FROM #Results r
JOIN Person.BusinessEntityAddress bea 
    ON bea.BusinessEntityID = r.BusinessEntityID 
JOIN Person.[Address] a 
    ON a.AddressID = bea.AddressID
JOIN Person.StateProvince sp 
    ON sp.StateProvinceID = a.StateProvinceID
JOIN Person.CountryRegion cr 
    ON cr.CountryRegionCode = sp.CountryRegionCode;
 
UPDATE #Results
SET PhoneNumber = pp.PhoneNumber,
    PhoneNumberType = pnt.Name
FROM #Results r
JOIN Person.PersonPhone pp
    ON pp.BusinessEntityID = r.BusinessEntityID
JOIN Person.PhoneNumberType pnt
    ON pp.PhoneNumberTypeID = pnt.PhoneNumberTypeID;
 
WITH cust (Purchases, PersonID) AS
(
    SELECT COUNT(1), c.PersonID
    FROM Sales.SalesOrderHeader soh
    JOIN Sales.Customer c
        ON c.CustomerID = soh.CustomerID
    GROUP BY c.PersonID
)
UPDATE #Results
SET Purchases=cust.Purchases
FROM #Results r
JOIN cust
    on cust.PersonID = r.BusinessEntityID;
 
WITH sale (SalesCount, PersonId) AS
(
    SELECT COUNT(1), soh.SalesPersonID
    FROM Sales.SalesOrderHeader soh
    GROUP BY soh.SalesPersonID
)
UPDATE #Results
SET SalesCount=sale.SalesCount
FROM #Results r
JOIN sale
    ON sale.PersonId = r.BusinessEntityID;
 
SELECT * FROM #Results;

When is this technique appropriate?

I like to use this performance technique before I consider query hints (but after other, simpler improvements like indexing). Even so, this technique is not always appropriate. I’ve seen it work best on complicated queries (how do you know when they’re complicated?) and against large datasets (processing millions of rows, for example). Complicated queries have a higher risk of generating poor query plans, and breaking these huge queries into smaller parts addresses that problem.

In my example, I’ve split the original query into seven. That was just for illustration. Maybe splitting your monster query into only three queries would work better. Always test.
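One simple way to “always test”: run both versions with SQL Server’s timing and I/O statistics enabled and compare the numbers. Here’s a minimal sketch (the comments mark where each version of the query goes):

```sql
-- Report CPU time, elapsed time and logical reads for each statement.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Version 1: run the single monster query here.

-- Version 2: run the split-up batch (the #Results version) here.

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
-- Compare "CPU time", "elapsed time" and "logical reads"
-- in the Messages tab for each version.
```

Run each version a couple of times from the same session so that caching affects both equally, and compare warm-cache numbers against warm-cache numbers.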

Empirical evidence tells me that simpler SQL performs better. When I’ve split up complicated queries, they’ve often become much easier to maintain as well, though almost as often I’ve found I never needed to split them at all!

Really?

I’ve added a few comments below that qualify some of the things I’ve said here.

9 Comments »

  1. I had a couple of conversations with work colleagues who questioned me about this article. The first conversation was about a pitfall of CTEs that I didn’t mention.

    Can CTEs sometimes mask complexity?
    The concern involves the problem of evaluating a subquery twice. When a CTE is written once, but referenced twice, it’s not always clear that it’s being evaluated twice. The same query written using derived tables will be written with two identical subqueries. This makes it obvious that the subquery is being evaluated twice.
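    To make the concern concrete, here’s a sketch (the table and column names are made up for illustration). The CTE is defined once but referenced twice, so SQL Server can expand and evaluate the underlying aggregate twice, even though it appears only once in the text:

    ```sql
    -- Hypothetical example: customers whose order total beats the average.
    WITH TotalsPerCustomer AS
    (
        SELECT CustomerID, SUM(OrderTotal) AS Total
        FROM dbo.Orders              -- made-up table
        GROUP BY CustomerID
    )
    SELECT t.CustomerID, t.Total
    FROM TotalsPerCustomer t
    WHERE t.Total > (SELECT AVG(Total) FROM TotalsPerCustomer);
    -- The aggregate over dbo.Orders may run twice: once for t and once
    -- for the AVG. Written with derived tables, the duplicated subquery
    -- appears twice in the text, making the double evaluation obvious.
    ```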

    I think I still recommend using CTEs. I acknowledge there’s a small chance that CTE usage masks complexity – complexity which would be obvious otherwise. But I’ll take that risk if it means easier-to-understand SQL. I’m on board with Selena Deckelmann’s approach of “designing with CTEs and optimizing for performance only if needed.”

    Comment by Michael J. Swart — January 9, 2014 @ 10:53 am

  2. The other conversation I had concerned the usage of temp tables.

    Why are you advocating #temp tables? I thought you avoided those.
    I advocate breaking up large complex queries into smaller chunks using #temporary tables. But what about our tempdb woes as described in recent blog posts?

    I seem to be saying two things: in those articles I’m advocating reduced usage of things that hit tempdb, but in this article I’m unashamedly pushing #temp tables again.

    The difference is in how these queries are being used. Pick your favorite query to tune and try to answer these questions:

    1. Are you running this query hundreds of times per second? (as opposed to daily)
    2. Are you hoping to tune this query from 5 seconds down to 5 milliseconds? (as opposed to 5 minutes to 5 seconds)
    3. Does this query get triggered by frequent end-user activity? (as opposed to infrequent business analyst activity?)
    4. Does your query plan fit on one screen?

    If you answered mostly “yes”, then it seems you’re tuning a query in an OLTP system, and I would not recommend using temporary tables in this case. You probably won’t be able to reach acceptable query durations using this technique anyway. For OLTP systems, avoid tempdb.

    But for infrequent reporting use cases, this technique may help you. Remember, tempdb problems are most commonly caused by frequent tempdb queries.

    Comment by Michael J. Swart — January 9, 2014 @ 11:10 am

  3. […] SQL Simplicity Methods – Michael J. Swart (Blog|Twitter) on why it’s important to try to lead a simple life as a Data Professional. […]

    Pingback by (SFTW) SQL Server Links 10/01/14 • John Sansom — January 10, 2014 @ 5:22 am

  4. ” When a CTE is written once, but referenced twice, it’s not always clear that it’s being evaluated twice. The same query written using derived tables will be written with two identical subqueries. This makes it obvious that the subquery is being evaluated twice.”
    That’s an interesting perspective, and not one I really subscribe to. If one were to write the same subquery twice then that violates the principle of DRY (http://en.wikipedia.org/wiki/Don%27t_repeat_yourself), and I consider DRY to be *the* most important guiding principle when I write any code (not just T-SQL code).
    I love me some good encapsulation :)

    Comment by Jamie Thomson — January 10, 2014 @ 2:49 pm

  5. You bet, Jamie. But I think my colleague’s point was that when one writes a subquery twice, it acts as a visual prompt for the developer to consider rewriting the whole thing to avoid the duplication. The argument is that a developer might not see or recognize the duplicated work when a CTE is used.

    I like the DRY principle in theory, but in practice I’ve seen it betray us quite badly. In order to encourage code re-use, I’ve seen views (and views within views) that sweep complexity under the rug. One such query had a dozen or so nested views, but no single SQL statement ever indicated what a monster this query actually was (of course, the query plan doesn’t lie).

    I happen to have Martin Fowler’s Refactoring on my desk. He has Duplicated Code as his #1 code smell, but it is still just one of 22.

    I’m just saying that I can understand and sympathize with both points of view, but I’ll reiterate that “I think I still recommend using CTEs”.

    Comment by Michael J. Swart — January 10, 2014 @ 4:57 pm

  6. […] J. Swart’s SQL Simplicity Methods recommends that SQL is kept simple and provides some advice on how that can be […]

    Pingback by My links of the week – December 12, 2014 | R4 — January 12, 2014 @ 2:28 am

  7. Use of temp tables is not really a bad idea in OLTP systems. We use lots of temp tables within stored procedures to break up large queries into several small queries. The main advantage is that if you use a large query with five or six JOINs to live tables, it holds locks on those live tables until the statement completes. Since the query is large, it may take more time to execute, thus not releasing the locks. This creates blocking in busy OLTP systems. The recommendation would be to use temp tables and then use those temp tables in the large queries so that blocking is minimized.
    It is clear that use of temp tables will increase the use of tempdb, and it might create a bottleneck. To avoid that you need to optimize your tempdb. Follow this link for more details (http://technet.microsoft.com/en-us/library/ms175527(v=sql.105).aspx)

    Comment by Susantha B — January 14, 2014 @ 1:11 am

  8. Hi Susantha,

    I understand where you’re coming from.
    The tempdb bottleneck I fear most is PAGELATCH contention. But the link you provided mostly deals with tempdb capacity planning and placement. Microsoft does have a tempdb concurrency article here (http://support.microsoft.com/kb/328551). I’ve found a lot of their advice helps, but as a last resort they suggest “making changes to the workload/code”. And that’s why I avoid using temp tables frequently.

    You talk about the trouble you have with regular concurrency (locking and excessive blocking) that is alleviated with the use of temp tables. In my experience, I have often been able to tackle such queries on a case by case basis, mostly by tuning (via rewrites, indexing, application-caching etc…).

    Basically, when presented with such scenarios, I like to highlight the contrast between a “query that is large” and a “busy OLTP system”. Those don’t usually go hand in hand.

    Comment by Michael J. Swart — January 14, 2014 @ 11:03 am

  9. Hi Michael,
    Thanks for the feedback.
    I have experienced LATCH contention on tempdb due to frequently creating and dropping temp tables in stored procedures that have very high execution counts (in my case, more than 3000 calls per second). I agree with you that in these cases you may avoid temp tables or consider an application rewrite to reduce the number of stored procedure calls. There are some other solutions suggested for this type of contention using trace flags (same link you mentioned).
    Finally, sometimes you may have stored procedures with complex business logic which often contain large queries with many joins. These can be found in busy OLTP systems, and that is not uncommon.

    Comment by Susantha B — January 15, 2014 @ 12:37 pm
