Michael J. Swart

February 9, 2012

Microsoft Model Databases: Some Unconventional Alternatives?

Filed under: SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 8:53 pm

It seems like Microsoft has always provided example databases in order to illustrate database concepts.

Pubs: Back when SQL Server 2000 was new, Microsoft provided this sample database and maybe you remember it. Pubs was a really simple database that was designed to represent data that a fictional book publishing company might use. The database had a name (pubs) but I don’t think anyone ever gave the company a name. From here we’re told that it was “used to demonstrate many of the options available” for SQL Server.

Northwind also was available for SQL Server 2000. It was a database that was used for a fictitious company “Northwind Traders”. The company “imports and exports specialty foods from around the world.” It was better than pubs, but that wasn’t saying much. Pubs set a pretty low bar.

AdventureWorks: The Adventureworks database came along with SQL Server 2005 and we were introduced to the bicycle manufacturer Adventure Works Cycle. The database was pretty comprehensive and it touched on a lot more features of SQL Server. I’ve seen countless performance demos that query Sales.SalesOrderDetail. There’s even an Adventureworks Data Warehouse, an Adventureworks OLAP cube and even a light version: AdventureworksLT.

Contoso And lately there’s a new sample database Contoso. It’s another fictional retail company (Contoso Ltd. This time they’re selling electronics and appliances). This database is used for Business Intelligence demos. With Contoso, Microsoft has developed and made a data warehouse and an SSAS cube available.

Presumably, each of these databases is an example of Microsoft putting their best foot forward. Or at least it’s a database which enables them to show off what their database software can do.

Beyond Demos

So those databases are good for demos.

But if we’re looking for other Microsoft-designed databases, we don’t have to stop there. There’s other ones that we can look at. I’m talking about the databases that Microsoft ships along with their products. When looking at these, remember:

  • There are some good examples, …
    We can look at those databases and see how they’re constructed. If you see some interesting db designs, you can use that.
  • … and there are some not so good examples …
    These databases weren’t built as database role models. So there’s bound to be a lot of database design choices that are not ideal. I know that Microsoft has huge pressure to maintain backwards compatibility.
  • … and that’s okay.
    It’s actually a fun exercise to pick out the good designs from the bad. It’s like a smorgasbord, you take the lessons you like and leave the rest.

Some examples of what I’m talking about.

ReportServer

Install and configure an instance of SQL Server Reporting Services. Take a look at the ReportServer. It’s a pretty nice database schema. One thing that was interesting was the choice of identity column data types:

  • Catalog seems to be one of the main tables here and they’re using uniqueidentifiers (GUIDs) for identity columns. Normally people scoff at that, but I think it makes sense in this case. Catalog is the table that tracks reports, datasources and folder structures. Even at it’s largest, it can’t be that big (I just checked my own installation and the clustered index fits on 1 extent). That means that all the worries about GUIDs for primary keys are mitigated and the choice of using a globally unique identifier might make sense here.
  • ExecutionLogStorage Contrast that with ReportServer’s largest table, ExecutionLogStorage. Depending on how busy your Reporting Services installation is, this table can grow very large with one row for every report view. When I look at the choice for their primary key, I see that it’s clustered index on a single BIGINT column defined as it’s IDENTITY column. Well done.

Any Database’s Metadata

  • I like the object oriented influence here:
    • The relationship between sys.objects –> sys.tables and between sys.objects –> sys.procedures is the subtype relationship class –> subclass. And that shows in the schema, for example, there’s one sys.object row for every sys.table row.
    • The relationship between sys.foreign_keys –> and sys.foreign_key_columns is the object composition relationship. And that also shows in the schema too. The “primary key” for the table foreign_key_columns is (constraint_object_id, constraint_column_id) and it contains the primary key of its owner.
  • Another good example is the sys.index_columns table. Look at the schema diagram below. It’s a perfect instance of my favourite DB Design example

Any Sharepoint database

  • [Sigh] These are mostly counter-examples here, I took a brief glance at the schema once and decided quickly that there are no good lessons. Can anyone tell me if that’s changed?

February 2, 2012

Removing Duplicate Dimension Rows in SSIS

Something New For Me


I’m relatively new to SQL Server Integration Services (SSIS). And so I want to explain how I solved a problem and invite Business Intelligence (BI) experts (and wannabe experts) to comment or point out things that I missed.

A Short Description of the Problem

So here’s a short description of the problem I’m trying to solve. It involves populating a dimension table. I want a data flow transformation which removes duplicate key values. In other words, every time I come across a new key value for a dimension, I want to pass through that entire row (not just the key values). This is kind of like a cross between T-SQL’s DISTINCT and FIRST_VALUE.

I found out recently that people have treated this topic before:

But because of my bad luck, I only found these posts after I muddled through a solution. But I was surprised to find out my solution had some interesting differences.

The Setup

First I’ll explain my situation. Say I’ve got a flat file containing data about Prime Ministers:

The source data

My main goal is to use this data to populate a data warehouse. That data warehouse has a schema like this:

The Data Warehouse Schema

This example is just for illustration. In real life, the number of fact rows is very large, and the number of dimension rows is relatively small. The time dimension will be pre-populated, but I want to load the other dimensions using values from the flat file. So that means I plan to process the data twice: The first time will be to load these dimensions, and the second time will be to load the fact table. When I process the data for dimensions, I :

  • Use a lookup transform to find whether that dimension already exists in the warehouse
  • Use the No Match Output to identify dimension rows I haven’t seen before. Those rows will be used to load the dimension. (i.e. I’m treating the dimension like a Never-Changing-Dimension rather than a Slowly Changing Dimension.)

The problem is that these unmatched rows can be duplicated and I don’t want those duplicate dimensions rows getting into the dimension table.

Things I considered:

  • I considered using a staging table on the DB side. But that required too much space, the amount of space needed was comparable to the fact table, not the dimension table.
  • I also considered inserting into dimensions a row-at-a-time using the MERGE statement on the database side. But I didn’t consider that for long. The performance with that strategy was so unacceptable.
  • I could have used a sort/aggregate, but that was a blocking transformation and I really didn’t need the data sorted.

What I ended up with

I created a script transform which seems to do the trick for what I want. My Data Flow task looks like this:

In those script transforms you see up there, I told SSIS that each script will be used to filter rows out or let them pass through. I did that by setting ExclusionGroup to 1 (by default, it’s normally 0). Here’s where that’s done:

Then in the script, I created a list of seen dimensions. And I let rows pass through if they haven’t yet been seen. The script simply looks like this:

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
 
    if (Row.Party_IsNull)
        return;
 
    if (!seenParty.Contains(Row.Party))
    {
        seenParty.Add(Row.Party);
        Row.DirectRowToOutput0();
    }
 
}

Logically it behaves exactly the same as the sort transformation (as explained in “Dead Easy Duplicate Key Removal” by Todd McDermid). Except for these things:

  • It seems to perform fast
  • It’s not a blocking transformation
  • But you don’t get all that for nothing. What we’re not getting with this method is that the rows aren’t sorted. In most cases, that doesn’t matter.
  • It takes some memory, so you have to be aware of that. Ask yourself, how well can all unseen dimension keys fit in memory?

Some questions for you BI guys

And here’s where I’d like you to chime in.

  1. What’s the most common approach to this problem?
  2. I get the feeling that I could use the Cache Transform for this technique. But I haven’t tried it.
  3. Are there any new features in SQL 2012 that help?
  4. Am I re-inventing the wheel here?
  5. Am I missing anything else that should be obvious?

January 25, 2012

Rebuild Your Indexes Online (When You Can)

So I was recently reading SQL Server’s blog, specifically the article Customers using SQL Server 2012 today! and some brave businesses are already using and enjoying some of SQL Server 2012 features that help their databases stay available. I’m excited about the new features too. For example, index rebuilds have been improved. We can now rebuild indexes online that include “blob” columns (like nvarchar(max), image, etc…). This means that (almost) every index can be rebuilt without requiring a table lock for the entire operation. And that’s good news for availability!

This wasn’t the case in earlier versions. In earlier versions, you couldn’t rebuild an index online if it included columns that were large strings. If you tried, you would get this message:

/*
An online operation cannot be performed for index 'ix_t1_1' because the 
index contains column '<column name>' of data type text, ntext, image, varchar(max), 
nvarchar(max), varbinary(max), xml, or large CLR type. For a non-clustered 
index, the column could be an include column of the index. For a clustered 
index, the column could be any column of the table. If DROP_EXISTING is 
used, the column could be part of a new or old index. The operation must be 
performed offline. 
*/

Logic to determine when ONLINE=ON is supported

So that means there’s one more thing to check when finding out whether you can rebuild an index on-line. The logic now goes like this. You can rebuild an index online if:

  • You’re using Enterprise Edition or higher
  • And you’re using SQL Server 2012 or later
  • Or you’re using SQL Server 2008 or earlier
  • And your index is clustered and the table contains no blob columns
  • Or your index is non-clustered and the index includes no blob columns

Hmm… You can’t really put parentheses in a bullet point list. Here try this flow chart:

Clear as mud

Hmm… that really doesn’t clear things up too well either. And I haven’t even mentioned partitioned indexes.

Just Try It

Maybe you’re like me, you may have to deal with multiple versions, multiple editions, and multiple tables and their indexes. Instead of wading through that logic above, just TRY it and let SQL Server figure it out. Here’s a sproc which takes a table name and index name and tries to rebuild the index online. If it can’t, it builds it offline.

create procedure s_TryRebuildOnlineOtherwiseOffline
(
	@schema sysname = 'dbo',
	@tablename sysname,
	@indexname sysname
)
as
begin
	set @schema = QUOTENAME(@schema);
	set @tablename = QUOTENAME(@tablename);
	set @indexname = QUOTENAME(@indexname);
 
	declare @sqlRebuild nvarchar(max)
	set @sqlRebuild = N'ALTER INDEX ' + @indexname + ' ON ' + @schema + '.' + @tablename + ' REBUILD';
	declare @sqlRebuildOnline nvarchar(max)
	set @sqlRebuildOnline = @sqlRebuild + ' WITH (ONLINE=ON)';
 
	begin try
		EXEC sp_executesql @sqlRebuildOnline;
		print @sqlRebuildOnline;
	end try
	begin catch
		EXEC sp_executesql @sqlRebuild;
		print @sqlRebuild;
	end catch
end
go

Alternatively, you can adjust the script to REORGANIZE an index when it can’t rebuild it online.

By the way, if you use Ola Hallengren’s maintenance scripts, he’s already taken all this logic into account! If you write your own maintenance scripts, feel free to incorporate what I have here.

January 13, 2012

Thank you notes…

Filed under: SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 12:00 pm

(It’s been a hectic week. The good news is that I’ve got loads of topics to write about. The bad news is that I’ve got no time to do it! Illustrations will make a return next week too because I’m finally getting that Adobe Illustrator license I’ve had my eye on. Maybe I’ll also include a retroactive Jimmy Fallon illustration).

Aaron Bertrand maintains the definitive list of free SQL Server tools. I want to give a shout out to just a couple of them. Without these tools, this week would have been a lot rougher at work for our team.

sp_whoisactive

Thank you Adam Machanic,
For making a lightweight sproc to help me see at a glance the activity on a SQL Server database. (Also kudos to the parameters @get_plans and @find_block_leaders).

If you’re unfamiliar with this stored procedure, start here.

Plan Explorer

Thank you SQL Sentry,
For making a tool that helps me understand everything about a query plan is doing. (Also kudos to the Expressions tab and the Parameters tab).

Plan Explorer presents SQL Server query plans just a bit nicer than SQL Server Management Studio does. When looking at a problematic query plan, I want to go from looking at a plan to a solution as quick as possible. For me this week, it would be safe to say that Plan Explorer cut that analysis time in half.

If you’re unfamiliar with this tool, start here.

Dynamic Management Views

Thank you Microsoft (SQL-Server-2005-and-later-DMVs),
For tracking statistics about query executions and making it possible to examine the history of the db activity without having to trace server activity (so extra kudos to dm_exec_query_stats, dm_exec_cached_plans and dm_exec_query_plan).

I don’t know whether to call these views free. They’re available with any database at SQL Server 2005 (or later). So if you’re using SQL Server, then you can use these DMVs.

If you’re unfamiliar with these views, start here.

January 4, 2012

Then, Now and Later

Filed under: SQLServerPedia Syndication — Tags: , — Michael J. Swart @ 12:00 pm

So happy New Year! This is the week that a lot of people look back on their year and make plans for their new year. Well, I’m no different. This week I’m taking a break from the technical content to take a look at 2011 and what 2012 might bring.

The Past (Looking back at 2011)

I started blogging a couple years ago. I picked the domain MichaelJSwart.com (with the middle initial) because MichaelSwart.com was taken at the time. I liked the sound of it any way, probably because of Michael J. Fox. My favorite character of his was Marty McFly:

And a year ago, (prompted by Jenn McCown’s t-sql tuesday post about resolutions) I made a new year’s resolution to blog weekly, keep the articles technical and to include illustrations.

And I had an absolute blast doing it! So it was easy to keep it up:

  • The topics came from whatever interested me. Luckily I have an interesting job with new and different challenges almost daily.
  • The time came as by treating the blog as a hobby. I figure the average person has that much free time, it’s just a matter of how to spend it.
  • I like to write when I’m feeling particularly awake.
  • I like to draw when I’m not because it’s more fun and it probably uses a different part of the brain so that it feels like less mental effort.

I particpated in a couple SQL Saturdays. The first one in Cleveland was great. Among other things, I learned that if you visit Cleveland’s art museum at 10 am on  Superbowl Sunday, you have the entire museum to yourself. The second SQL Saturday I attended this year was in my own backyard. I presented for the first time in Toronto and it was exciting.

The Present (this week)

So I was excited to learn this week that I received Microsoft’s Most Valuable Professional (MVP) award for SQL Server (achievement unlocked!). I feel honoured to receive the award and inspired to continue writing. I feel like I’ve learned and benefited from the SQL crowd more than I provided (expertise consumer first, expertise provider second). So many people have helped me but I’m not going to name names. That’s next week – seriously :-) .

The Future (2012)

So I’m going to resolve to continue blogging the way I have for the past year. It’s a challenging (if not a surprising) resolution. It’s challenging even though this resolution boils down to resolving to continue a fun hobby. This starts now but I’m going to hold off on the technical stuff for just one more week: I want to write something about people who inspire me (and why they might inspire you too). Stay tuned.

And also this year, I’m going to try something new. I want to try to participate in a local user group. Easier said than done because for me, there is no local user group and I’m going to do something about that. (Ohioans, I love you but the commute is killer :-) )

It’s all very exciting. Look out 2012, you won’t know what hit you.

December 22, 2011

A simple message: Merry Christmas

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Tongue In Cheek — Michael J. Swart @ 10:00 am

From my family to yours, Merry Christmas.

And just to throw in some SQL, here is a query that will return the complete lyrics to Feliz Navidad, (the most repetitive Christmas song in the world):

select [processing-instruction(complete_lyrics)] = REPLICATE(REPLICATE(REPLICATE('
Feliz Navidad, ', 3) + 'próspero año y felicidad', 2) + REPLICATE(REPLICATE('
I wanna wish you a Merry Christmas, ', 3) + 'from the bottom of my heart', 2),3)
for xml path(''),type

December 8, 2011

CXPACKET? What’s That? And What’s Next?

Filed under: SQLServerPedia Syndication,Technical Articles — Tags: , , — Michael J. Swart @ 8:54 pm

Takeaway: Is the wait type CXPACKET bothering you? It means parallelism. Here’s how to find the queries that are giving you this grief.

The Wait Type CXPACKET

Since 2005, SQL Server provides dynamic management views (DMVs). These views help monitor database health and they’re the first place I go to when troubleshoot problems. For example the view sys.dm_os_wait_stats gives an aggregated way to tell you where the waits are. Often you’ll come across the wait type CXPACKET. Microsoft explains:

“[This wait type] Occurs when trying to synchronize the query processor exchange iterator.”

Hmmm… I know what those words mean individually… Any way, if you keep searching, you’ll see others explain: “CXPACKET means parallelism”. We’re getting a bit closer now. If a query is run with multiple threads, and one thread completes before the others, then SQL Server will report the wait type CXPACKET for that thread.

It’s like several chefs are making soup. If one chef’s only job is to add the cilantro, there’s going to be some waiting involved. So is the extra help worth the trouble?

SQL Server thought it could use more cooks for a particular “recipe”… Maybe it’s time to make simpler recipes. (The answer is almost always yes, especially for OLTP systems). SQL Server doesn’t use parallelism often. It only comes up with a multi-threaded query plan when it needs to, i.e. when the plan is estimated to be costly enough. It’s usually those queries with the high estimated cost that need to be fixed, the CXPACKET wait type is a symptom of that.

Those High Cost Queries

So which queries are those? Again, I like to go to the DMVs. The one I like to look at is sys.dm_exec_query_stats except that it doesn’t have a flag called uses_parallelism. For the definitive way to find parallel queries, you have to scan and parse all the query plans that are cached using sys.dm_exec_query_plan.

But that takes for ever so I don’t recommend it. But recently I found out that we don’t really need it: Parallel queries are interesting, they’re one of the only kind of queries that can use more CPU time than actual elapsed time. So if that’s true, then we just need to examine these queries:

select * from sys.dm_exec_query_stats 
where total_elapsed_time < total_worker_time
	-- total_worker_time is measured to the microsecond, but is accurate to the millisecond so:
	and ( ( total_worker_time – total_elapsed_time ) / execution_count ) * 1.0 > 1000 -- avg difference is at least 1 ms

But it might not be true, for example, a parallel query might be suffer from blocking too long. Long enough that the elapsed time might be longer than the total amount of CPU time. That’s an interesting question: How good an indication of parallelism is the ratio between elapsed time and CPU time?

(total_elapsed_time < total_cpu_time) Vs. uses_parallelism

I’m lucky. I’ve got a large set of query data that can help me.

This chart needs a little explanation. Pick a dot in the chart above. It represents a set of queries whose ratio of cpu to elapsed time are the same. Look at the value on the x-axis to see what that common ratio is. Say that it’s 1.5, this means that that dot represents a set of queries whose elapsed time is exactly 50% greater than its cpu time. Now look at its value on the Y axis. You might see 2%. That’s means that 2% of those queries use parallelism.

I expected a slope, not a waterfall and this is what that means to me. Knowing that a query’s worker time is greater than its elapsed time is a great indication of parallelism. And knowing by how much doesn’t change that.

CXPACKET? Here’s What’s Next

This query lists the top 20 worst parallel queries cached in your database.

-- TOP 20 parallel queries (by CPU)
SELECT TOP (20)
    [Total CPU] = total_worker_time,
	[Total Elapsed Time] = total_elapsed_time,
	[Execution Count] = execution_count,
    [Average CPU in microseconds] = cast(total_worker_time / (execution_count + 0.0) as money),
    [DB Name] = DB_NAME(ST.dbid),
    [Object Name] = OBJECT_NAME(ST.objectid, ST.dbid),
    [Query Text] = (SELECT [processing-instruction(q)] = CASE 
            WHEN [sql_handle] IS NULL THEN ' '
            ELSE (SUBSTRING(ST.TEXT,(QS.statement_start_offset + 2) / 2,
                (CASE 
                        WHEN QS.statement_end_offset = -1 THEN LEN(CONVERT(NVARCHAR(MAX),ST.text)) * 2
                        ELSE QS.statement_end_offset
                        END - QS.statement_start_offset) / 2))
            END
			FOR XML PATH(''), type),
    [Query Plan] = qp.query_plan
FROM sys.dm_exec_query_stats QS
CROSS APPLY sys.dm_exec_sql_text([sql_handle]) ST
CROSS APPLY sys.dm_exec_query_plan ([plan_handle]) QP
WHERE total_elapsed_time < total_worker_time
	AND ( ( total_worker_time - total_elapsed_time ) / execution_count ) * 1.0 > 1000 -- average difference is more than a millisecond
ORDER BY total_worker_time DESC

Some interesting things about this query:

  • I’ve written it for SQL Server Management Studio. In SSMS, you can click on the query and the query plan to explore these queries further.
  • I’ve added a filter to weed out queries whose average cpu time is more than a millisecond. I’m sure you won’t miss those stats. Microsoft reports CPU time in microseconds, but it’s only accurate to milliseconds. So you can get false-positives if you don’t ignore the blazing fast queries. That’s why I added that filter.

November 28, 2011

SQL Fingertips

Filed under: SQLServerPedia Syndication,Technical Articles,Tongue In Cheek — Tags: , , — Michael J. Swart @ 11:36 pm

They Might Be Giants (TMBG) released an album called Apollo 18. It has 38 tracks and it runs for 42 minutes. It’s an album that will always remind me of my friend and first University roommate (Hey, what’s up Chris!). So the average track is about a minute long and this is due to a set of 21 tracks collectively called “fingertips”. Most of these fingertips are five to fifteen seconds long. Some are a bit longer. These fingertips are probably not good enough to be put on an album on their own. But collectively, they make a really cool album (especially on shuffle mode).

The List

So when I found myself with a collection of tiny tips that might not qualify as a blog post on their own, I thought of that album and a microsecond later, we’ve got SQL Fingertips!

So here they are, in no particular order.

  • Stackoverflow tip: Use the bounty feature without being afraid of the hit to your reputation. A bounty offered on a decent question usually attracts enough attention to get your question up-votes. And unless it’s a crummy question, those up-votes will earn you back the reputation points you spent.
  • T-SQL tip: You can use the following script as a template to be able to run a single query on multiple databases and have it return a single result set:
    DECLARE @temp TABLE (
    	ServerName sysname,
    	DBName sysname,
    	ProductCount int
    );
     
    insert @temp (ServerName, DBName, ProductCount)
    exec master..sp_MSforeachdb N'use [?];
    	if exists (select 1 from sys.tables where name = ''Product'' and SCHEMA_NAME(schema_id) = ''Production'')
    		exec sp_executesql N''
    			select @@servername, db_name(), Count(1) as [count]
    			from Production.Product''';
     
    select * from @temp;
  • SQL Server Management Studio tip: GO can take a parameter. If you write GO 100, your batch will be executed 100 times. Many people know this but I always here a couple “wow”s each time I demonstrate it. And here’s a couple more notes about that batch separator.
    • Don’t forget to SET ROWCOUNT OFF.
    • The batch separator doesn’t have to be GO. It’s configurable! I used to set it to ENGAGE for a while, but the joke got old and I switched it back.
  • Blog tip: If you like following blogs or people, but maybe you’re falling out of love with your RSS reader. No worries. You can get any RSS feed at all sent straight to your inbox.  There are lots of RSS-to-inbox websites out there. The one I recommend is FeedMyInbox.com. So, for example, to get my own blog delivered to your inbox, start here.
  • Data → Eye → Brain, A Data Visualization tip: Data visualizations (charts, diagrams, etc..) help us gain knowledge and understand data that maybe we couldn’t understand otherwise (See Anscombe’s Quartet). And for large and complex datasets, the best visualizations are usually custom designed and interactive. There’s a lot of power in customized visualizations: See what David McCandless and Miriah Meyers have to say.
  • Twitter DBA Tip #1: I put a call on twitter asking people to tweet the one database tip they thought more people should know. The first one comes from @BrentO: “Don’t leave your database in full recovery mode if you’re not doing t-log backups.” He’s right. Full recovery mode usually implies a backup strategy that includes transaction log backups. (And full or differential backups are not enough either).
  • Twitter DBA Tip #2: This one’s from @wnylibrarian“If using SQLExpress & no SQLAgent, try using SQLCMD in a batchfile.” I’m glad Gary tweeted this one. I’ve come up against this problem in my own work. It turns out that when downgrading editions of SQL Server, often one of the first things that you miss is SQL Agent. If you follow Gary’s advice, it’s not a hurdle, it’s only an inconvenience.
  • Twitter DBA Tip #3: Next I heard from @DaveH0ward who wants to remind people that “You need to size your data /log files and set autogrowth amounts appropriately.” I think a lot of people don’t know that the default auto-growth settings for SQL Server databases are a bad idea. Tim Ford warns us of dire consequences for not following this tip.
  • Twitter DBA Tip #4: Finally I heard from @SQLSandwhiches (aka Adam Mikolaj). He shared a pet peeve “I’ve seen so many servers that don’t have logs to auto cycle. Don’t you love opening a log file with 300,000 entries?” Adam wrote about his experience earlier this year (including a solution of course).
  • MixedTape Fingertip Tip: Say you find yourself back in the nineties making a mixed tape and you find that you have a minute left to fill on one side of the tape. They Might Be Giants to the rescue! Pick several fingertips from their Apollo 18 album and optimize every second of that mixed tape! My personal favorite is the 45 second Minimum Wage off of their other album Flood.

So how many of these SQL Fingertips did you know? Have you got a small tip? Share it in the comments. Have you got a whole bunch of small tips? Write your own SQL Fingertip post. I’d be especially grateful to read a set of Business Intelligence fingertips.

November 16, 2011

Be Quick About Reading The Data You Asked For

So last week, I mentioned that if an application takes it’s time processing data returned from a query it executed, it will not cause ADO.net to time out. This week I’m going to qualify that and on the way, I’ll explain a problem associated with “lazy” readers. I’ll talk about what lazy readers look like from SQL Server’s point of view. And I’ll talk about a potential pitfall with the C# keyword “yield“.

So say you’ve written an application that accesses the database using ADO.net’s SqlDataReader. As you process the rows returned from the server, you might take a while between one Read() call and the next. If that happens, it can lead to problems. In other words:

If…

  • Your application has issued a query using a DataReader,
  • And the amount of data returned is more than 8 kilobytes (the size of an SNI packet),
  • And your application takes its time processing rows (i.e. time between Read() calls),

Then…

  • Your database query is still active!
  • And your transaction is still open! (Remember that even if you don’t use an explicit BEGIN TRANSACTION, SQL Server’s AUTOCOMMIT mode means that each query is it’s own mini-transaction and that transaction isn’t over yet!)
  • Memory used by the database for your query is still in use
  • And resource locks that were taken to process your query are still held!

But you might get lucky…

  • For example, you might be using READ COMMITTED SNAPSHOT isolation level (which is “READ COMMITTED with more awesome” according to Kendra Little). Because even though your query is still using the same number of resources, there’s a smaller risk of seeing concurrency issues.
  • Or if you’re using the default isolation level READ COMMITTED (which you probably are) SQL Server might have given up some of the locks that have been acquired while processing your rows. In this case, the number of locks held at any time is fairly small. And you might not see any problems (at first).

Or you might not be so lucky…

  • More likely, your query will be doing something that’s not simple. It could be using a sort, or parallelism or a particularly involved join. In that case, the query will hold onto the locks it uses without releasing them.
  • Then you might have concurrency issues.
  • Concurrency problems lead to deadlocks and excessive blocking.
  • And that can lead to…

Timeouts!
All because of the processing  on the client side took too long.

But How Can You Tell?
How do you know if this is a problem in your case?

From The Server Side

Look at the wait stats that are collected with sys.dm_os_wait_stats system view. (Or even better, look at Glenn Berry’s DMV query that reports on the same thing). Now take a look at the top wait statistics. If SQL Server is waiting on a client to process data, you’ll see ASYNC_NETWORK_IO as one of the top wait statistics. One interesting note about SQL Server 2012. I noticed that you might see this wait stat instead: PREEMPTIVE_OS_WAITFORSINGLEOBJECT.

That tells you whether this has been a problem since the last server restart. But if you want to know if the database is waiting on some client right now, you can use sp_whoisactive by Adam Machanic. (You could also query sys.dm_exec_requests, or in a pinch sys.sysprocesses). Again, if you see a query waiting on things like ASYNC_NETWORK_IO or PREEMPTIVE_OS_WAITFORSINGLEOBJECT, then that query is waiting for a lazy client application to read its data.

From The Client Side

Make sure your DataReaders are quick. You can tell pretty easily most of the time. Check out these two examples:

IEnumerable<int> ReturnASum()
{
	int sum = 0;
	using (SqlConnection conn = new SqlConnection("blah blah"))
	using(SqlCommand cmd = new SqlCommand(sqlStatement, conn))
	{
		conn.Open();
		using (SqlDataReader dr = cmd.ExecuteReader())
		{
			while (dr.Read())
			{
				sum += dr.GetInt32(0);
			}
		}
	}
	return sum;
}

Pretty straightforward, but compare that to the next example. This next example might be a problem:

void SendDataIntoSpace()
{
	using (SqlConnection conn = new SqlConnection("blah blah"))
	using(SqlCommand cmd = new SqlCommand(sqlStatement, conn))
	{
		conn.Open();
		using (SqlDataReader dr = cmd.ExecuteReader())
		{
			while (dr.Read())
			{
				SendToMarsRover(dr.GetInt32(0));
			}
		}
	}
}

Sometimes it can be tricky to tell whether code will be processed quickly or not. If your data rows are yielded to the calling procedure, then the processing speed is no longer under your control. In this next example, we can’t tell by inspection how quickly the data will be processed. To do that, we have to look at all the places where this method is called.

IEnumerable<int> LetSomeoneElseReadDataAtTheirOwnPace()
{
	using (SqlConnection conn = new SqlConnection("blah blah"))
	using(SqlCommand cmd = new SqlCommand(sqlStatement, conn))
	{
		conn.Open();
		using (SqlDataReader dr = cmd.ExecuteReader())
		{
			while (dr.Read())
			{
				yield return dr.GetInt32(0);
			}
		}
	}
}

A good data access layer shouldn’t make the rest of the application care about how fast they process rows. If you’re familiar with yield already, then you have to weigh memory benefits of yield against concurrency problems and make an informed decision (good luck!)

November 7, 2011

T-SQL Tuesday #024: Procedures “By The Numbers”

Filed under: SQLServerPedia Syndication,Tongue In Cheek — Tags: , , , — Michael J. Swart @ 10:12 pm

T-SQL Tuesday Logo

I’m super excited about this month’s T-SQL Tuesday, for a couple reasons. First of all, this month’s T-SQL Tuesday is hosted by Brad Schulz. Brad is a guy that I got to interview a couple years ago. Second, is that the topic is Sprox ‘n’ Funx (Stored Procedures and Functions). Which I like because the topic is nice and technical.

I started out thinking that I could treat the question “Are stored procedures a good thing?” But that issue is a contentious one and it would take a book to treat that issue properly (So maybe later)

Today, I’m giving some attention to some interesting numbers related to stored procedures (the format is taken from Harper’s Index a popular feature in Harper’s magazine)

Stored Procedures By The Numbers

250: The largest size in megabytes allowed for the source text of a stored procedure.

2,100: The largest number of parameters allowed for a stored procedure.

32: The maximum number of nested levels allowed for stored procedures.

1,598: The number of mentions in sys.messages that contain the word procedure.

858: The number of mentions for function.

389,000: The number of Google results for should use stored procedures.

29,500: The number of Google results reported for should not used stored procedures.

1 in 866: The odds that a random word taken from MichaelJSwart.com is procedure.

1 in 879: The odds that a random word taken from the 2011 State of the Union address is budget.

18: The number of participating blogs in Brad Schulz’ T-SQL Tuesday on the subject of stored procedures and functions.

1875: Earliest year I could find a published mention for the word sproc*

* ~  Lives of the Irish Saints (1875) mentions a St. Sproc daughter of Colum. I am not making that up!

« Newer PostsOlder Posts »

Powered by WordPress