A Trick for Aggregating Data Fast

October 14, 2009

A Trick for Aggregating Data Fast

Filed under: Technical Articles — Tags: "sql server 2005", "sql server 2008", "sql server", aggregate — Michael J. Swart @ 5:49 pm

I want to explain a trick I came up with that helps an OLTP database do part of the job that OLAP cubes do. (Isn’t that cute, the database engine thinks it’s a cube.*)

If for some reason you find yourself wanting to use an OLTP database (i.e. regular old database) for something that is better suited to an OLAP cube (like Analysis Services for example) then you’re in good company! It seems to be a very common challenge. Even Microsoft seems to recognize this by implementing T-SQL tricks that step over the line into OLAP territory. Grouping sets comes to mind; Also another sign is that CUBE is now a keyword.

Aggregating Data Along a Dimension Before it’s Needed.
One task that is deep in the heart of OLAP territory is reporting the aggregate of a measurement along a dimension. I want to explain how to do this efficiently inside an OLTP db.

In the following examples the OLTP database (and wannabe OLAP cube) will be the standard AdventureWorks database. I’ll be using the Sales.SalesOrderDetail table. The measurement here is the LineTotal field, and the dimension here is the Product.

So we can report on this kind of data easily by maintaining statistics using an indexed view. (BTW, this is a great use-case for indexed views).

Here’s the trick. If n is the measurement, then the only stats we need to store are the sums of n⁰, n¹, n², grouped by the dimension. In other words, we keep track of the count, the sum and the sum of squares.

CREATE VIEW Sales.v_SalesOrderDetailAggregates WITH SCHEMABINDING
AS
SELECT ProductID,
   SUM(LineTotal) AS Sum1,
   COUNT_BIG(*) AS Count1,
   SUM(LineTotal*LineTotal) AS Sum2
FROM Sales.SalesOrderDetail
GROUP BY ProductID
GO
 
CREATE UNIQUE CLUSTERED INDEX IX_v_SalesOrderDetailAggregates
ON Sales.v_SalesOrderDetailAggregates(ProductID)
GO

Using the Indexed View

So that’s the clustered index I propose we create. As I’ll show later, this is enough. The work of aggregating the data is offloaded to the db. And the work is done once at creation and maintained automatically during any relevant transactions.

In the following code sample, I show how to use this view to report on many common stats (aggregates):

SET STATISTICS IO ON
 
-- a report that aggregates LineTotal over Products
SELECT
   ProductId,
   COUNT(LineTotal),
   SUM(LineTotal),
   AVG(LineTotal),
   STDEVP(LineTotal),
   STDEV(LineTotal),
   VARP(LineTotal),
   VAR(LineTotal)
FROM Sales.SalesOrderDetail
GROUP BY ProductID
ORDER BY ProductID
 
-- the same report (aggregate LineTotal over Products)
-- that uses the indexed view.
SELECT
   ProductID,
   [COUNT] = Count1,
   [SUM] = Sum1,
   [AVG] = Sum1 / Count1,
   [STDEVP] = 1.0/Count1 * SQRT(Count1*Sum2 - Sum1*Sum1),
   [STDEV] = 1.0/SQRT(Count1*(Count1-1)) * SQRT(Count1*Sum2 - Sum1*Sum1),
   [VARP] = (Sum2 - (Sum1*Sum1/Count1))/(Count1),
   [VAR] = (Sum2 - (Sum1*Sum1/Count1))/(Count1-1)
FROM Sales.v_SalesOrderDetailAggregates
WHERE Count1 > 1
ORDER BY ProductID
 
/*
results:
 
(266 row(s) affected)
Table 'SalesOrderDetail'. Scan count 1, logical reads 485, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 
(266 row(s) affected)
Table 'v_SalesOrderDetailAggregates'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
*/

So why not expand the indexed view to include everything from the beginning? Well, there are a couple problems with that. First is that you’ll get an error message like:

Msg 10125, Level 16, State 1, Line 2
Cannot create index on view "AdventureWorks.Sales.v_SalesOrderDetailAggregates" because it uses
aggregate "STDEV". Consider eliminating the aggregate, not indexing the view, or using alternate
aggregates. For example, for AVG substitute SUM and COUNT_BIG, or for COUNT, substitute
COUNT_BIG.

And so what I’ve shown here is that you can use a calculation involving COUNT_BIG, SUM(n) and SUM(n*n) to calculate common statistical aggregates like AVG, STDEV and VAR.

Besides, this strategy also takes the least space, and for a little extra work, you can aggregate on a grouping higher up in the dimension. That is, by using the entries in the index as subtotals in order to generate totals for larger groupings.

*– I’m probably missing an opportunity to include an awesome illustration for this post; something involving databases, cubes and square and round pegs.

-- Comments (5)

5 Comments »

By the way, if you try the reports above, you'll likely get larger reads for the normal report. I wanted to give the normal report as much of an advantage as possible so I created a non-clustered covering index on
Sales.SalesOrderDetail(ProductId) INCLUDE (LineItem)

Comment by Michael J. Swart — October 15, 2009 @ 4:47 am
Excellent… cool stuff!

Keep up the great work.

How about this for an image (no cubes or round pegs, but perhaps appropriate): http://www.hgs.k12.va.us/Bruce_Norton_folder/statistics_made_easy.gif

Comment by Brad Schulz — October 15, 2009 @ 9:59 am
HA HA HA! Very good. I like it Brad, There's always next time.

Comment by Michael J. Swart — October 15, 2009 @ 10:18 am
[…] A Trick for Aggregating Data Fast […]

Pingback by The Aggregate Function PRODUCT() | Michael J. Swart — March 8, 2011 @ 9:37 pm
[…] views support a couple aggregate functions like COUNT_BIG() and SUM(). And with some trickery you can calculate AVG() and STDEV(). But SQL Server restricts the use of MAX() and MIN() in indexed […]

Pingback by Well That Wasn't Obvious | Michael J. Swart — December 5, 2012 @ 12:01 pm

RSS feed for comments on this post. TrackBack URL

Michael J. Swart

October 14, 2009

A Trick for Aggregating Data Fast

5 Comments »

Leave a comment