Michael J. Swart

February 16, 2011

Searching Inside Strings: CPU is Eight Times Worse For Unicode Strings

Filed under: SQLServerPedia Syndication, Technical Articles — Michael J. Swart @ 12:00 pm

Every time I see behavior from SQL Server that I don’t understand, it’s very disconcerting. But it often turns out to be a learning experience. And if I’m lucky enough to get to the bottom of it, the knowledge becomes one more tool in my mental utility belt. I had that experience a couple of weeks ago and I want to write about it.

Opening a pocket on a utility belt.

The thing I learned most recently is that searching inside Unicode strings (NVARCHAR strings) is a lot more CPU-intensive than searching inside single-byte strings (VARCHAR strings). By searching I mean looking for substrings. This has nothing to do with index lookups, and full-text indexing is a different topic.

So here’s how I learned it: in a database I work on, we recently changed many of our database strings into Unicode strings (VARCHAR to NVARCHAR) in order to support multiple languages. And we discovered that the CPU time taken by a couple of procedures shot through the roof! It was an 800% increase in CPU, and this was without any significant I/O increase.

This is probably karma coming back to bite me after I said that I/O is the only performance metric you need.

The Setup

Luckily, I was able to reproduce and isolate the behavior. I show the example here. First we build a table of a million rows:

use tempdb
 
create table test
(
    testid bigint primary key,
    v varchar(36),
    nv nvarchar(36)
)
 
go
-- Itzik Ben Gan's trick (via Brad Schulz) to get around SQL Server's lack of a numbers table
with
   L0 as (select 1 as C union all select 1)       --2 rows
  ,L1 as (select 1 as C from L0 as A, L0 as B)    --4 rows
  ,L2 as (select 1 as C from L1 as A, L1 as B)    --16 rows
  ,L3 as (select 1 as C from L2 as A, L2 as B)    --256 rows
  ,L4 as (select 1 as C from L3 as A, L3 as B)    --65536 rows
  ,L5 as (select 1 as C from L4 as A, L4 as B)    --4,294,967,296 rows
  ,Nums as (select row_number() over (order by (select 0)) as N from L5)
insert test
select N,
    CAST (newid() as varchar(36)),
    CAST (newid() as nvarchar(36))
from Nums
where N<=1000000;
go

The Queries

Now look at these two queries. The query that searches the Unicode string performs eight times worse than its single-byte counterpart even though they use identical I/O:

set statistics time on
set statistics io on
-- search single-byte (varchar) string
select COUNT(1) from test where v like '%abcd%' option (maxdop 1)
-- CPU time = 797 ms,  elapsed time = 791 ms.
-- Table 'test'. Scan count 1, logical reads 16472,...
 
-- search unicode string
select COUNT(1) from test where nv like N'%abcd%' option (maxdop 1)
-- CPU time = 6828 ms,  elapsed time = 6862 ms.
-- Table 'test'. Scan count 1, logical reads 16472,...
 
/*
(Aside: 36 characters isn't much of a test, but it turns out CPU usage
scales linearly on tests using larger input strings: searching 72-character
strings takes twice the CPU, 108-character strings take three times the CPU, etc.)
*/

But Why?

The extra CPU cannot be explained away by the wider characters. My gut feeling says that strings twice as long should not take eight times the CPU to search. My first thought was that there was an implicit conversion somewhere but that wasn’t the case.
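
For what it’s worth, here’s the kind of check I mean. This is a hedged sketch of my own (not from the original post): the deliberately mismatched literal forces a conversion, the matched one doesn’t, and the difference shows up in the actual plan’s predicate.

-- Mismatched: the N'...' literal forces the varchar column to be converted
-- (nvarchar has higher data type precedence), so the actual plan's predicate
-- shows CONVERT_IMPLICIT() wrapped around the column:
select COUNT(1) from test where v like N'%abcd%' option (maxdop 1)

-- Matched: the predicate uses the column directly, with no CONVERT_IMPLICIT():
select COUNT(1) from test where v like '%abcd%' option (maxdop 1)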

After some stackoverflow.com help, it turns out that this has something to do with collations. Strings with different binary representations can still compare as equal (e.g. ‘MICHAEL’ is equal to ‘michael’ under a case-insensitive collation), and Unicode string comparisons have even more complicated rules than VARCHAR comparisons do.

So if SQL Server collations have something to do with it, then we can hope that by using a binary collation, we’ll save the extra CPU. And in fact, we see something like that:

-- search unicode string with binary collation
select COUNT(1)
from test
where nv COLLATE Latin1_General_Bin
    like N'%ABCD%'
option (maxdop 1)
-- Table 'test'. Scan count 1, logical reads 16472, ...
-- CPU time = 781 ms,  elapsed time = 777 ms.

However, if we use a case-sensitive, accent-sensitive collation, we might hope for better CPU, but we’d be disappointed:

select COUNT(1)
from test
where nv COLLATE SQL_Latin1_General_CP1_CS_AS
    like N'%ABCD%'
option (maxdop 1)
-- CPU time = 6688 ms,  elapsed time = 6702 ms.
-- Table 'test'. Scan count 1, logical reads 16472, ...
/* similar results for all other non-binary collations */

So what do we know so far?

  • Searching inside NVARCHAR strings is slower than searching inside VARCHAR strings.
  • Specifying different collations (i.e. case sensitivity, accent sensitivity) for NVARCHAR strings doesn’t improve performance.
  • Binary collations are the exception. Searching inside strings using a binary collation is much faster.

Not Just SQL Server

It turns out that this is most likely not SQL Server’s fault. SQL Server relies on the operating system for its string methods. In particular, it probably relies on any one of these methods found in Kernel32.dll:

  • lstrcmpi or lstrcmp
  • FindNLSStringEx, FindNLSString and FindStringOrdinal
  • CompareStringEx, CompareString and CompareStringOrdinal

The docs for the ~Ordinal functions indicate that these functions are meant for binary (non-linguistic) string comparisons. I’d bet a lot of money that this explains the behavior we see in SQL Server. It accounts for why comparisons using binary collations are faster while comparisons using other collations are not.

February 2, 2011

Ridiculously Unnormalized Database Schemas – Part Three

Filed under: SQLServerPedia Syndication, Technical Articles — Michael J. Swart @ 12:00 pm
Ridiculously Unnormalized Database Schemas

Today, I describe 3rd Normal Form and give a ridiculous example of a table that breaks the rule badly. I’ve seen 3NF defined this way:

“A relation (table) R, in second normal form is in third normal form if every non-prime attribute of R is non-transitively dependent on every candidate key of R.”

That’s Wikipedia’s definition of third normal form. And they do a decent job explaining what that means. They reference E. F. Codd’s paper Further Normalization of the Data Base Relational Model 1972. But I couldn’t find a reference online and I was too busy this week to dig out my alumni card and visit the Davis Centre Library.

The House That Jack Built

Before we get into the ridiculous example, I want to talk a bit about the English nursery rhyme The House That Jack Built.

It’s a cumulative poem that starts with the line This is the house that Jack built. It repeats and expands (like the songs I Know an Old Lady Who Swallowed a Fly or The Twelve Days of Christmas).
By the end of the poem we have a beautiful run-on sentence where the subject of the poem has very, very little to do with the house that Jack built. Here is the rhyme:

This is the house that Jack built.

This is the malt that lay in the house that Jack built.

This is the rat that ate the malt
that lay in the house that Jack built.

and so on until

This is the farmer sowing his corn
That kept the cock that crowed in the morn
That waked the priest all shaven and shorn
That married the man all tattered and torn
That kissed the maiden all forlorn
That milked the cow with the crumpled horn
That tossed the dog that worried the cat
That killed the rat that ate the malt
That lay in the house that Jack built.

The cow with the crumpled horn tosses the dog.

But the thing that makes this nursery rhyme interesting is the same thing that makes tables violate third normal form: if we think of each database row as a sentence, then the subject is the primary key, and everything else in the sentence should describe that subject directly. This brings us to our ridiculous example.

The Ridiculous Example

Houses (hope you like scroll bars)

| Builder | Contents | Content Consumer | Content Consumer Killer | Content Consumer Killer Worrier | Content Consumer Killer Worrier Tosser | Content Consumer Killer Worrier Tosser Milker | Content Consumer Killer Worrier Tosser Milker Spouse | Content Consumer Killer Worrier Tosser Milker Spouse Wedder | Content Consumer Killer Worrier Tosser Milker Spouse Wedder Waker | Content Consumer Killer Worrier Tosser Milker Spouse Wedder Waker Owner |
|---|---|---|---|---|---|---|---|---|---|---|
| Jack | Malt | Rat | Cat | Dog | Cow (crumpled horn) | Maiden (all forlorn) | Man (tattered and torn) | Priest (shaven and shorn) | Cock (crowed in the morn) | Farmer (sowing corn) |

This table is nowhere near third normal form; it’s in a different time zone! This table is supposed to be about houses, not livestock and 16th-century English people. So when you’re designing your tables remember: no run-on sentences!
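
To make the point concrete, here’s a hedged sketch of where some of those columns might go instead (the table and column names are mine, not from the original post): each fact lives in a table keyed on the thing it actually describes.

-- Houses are about houses:
CREATE TABLE Houses (
    HouseId INT PRIMARY KEY,
    Builder NVARCHAR(50) NOT NULL    -- Jack
);

-- The malt is a fact about the house's contents, so it gets its own table:
CREATE TABLE HouseContents (
    HouseId  INT NOT NULL REFERENCES Houses (HouseId),
    Contents NVARCHAR(50) NOT NULL   -- Malt
);

-- The rat, the cat, the dog and company describe each other, not the house.
-- They'd live in their own tables (Animals, People, ...) with foreign keys
-- expressing "ate", "killed", "tossed" and so on.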

The Other Normal Forms

So that’s the series. I could continue with 4NF, 5NF and BCNF (and others!) but I won’t for a few reasons.

  • Most people who pay attention to 1NF, 2NF and 3NF almost always design databases in the other normal forms without trying.
  • The other normal forms are dull enough that it becomes a challenge to come up with a counter-example that can seriously be called ridiculous.
  • I’ve never heard of any problems encountered by anyone caused by not paying attention to the other normal forms. If you know of any, let me know!

Liked This Series?

Hey you, yeah you. The web surfer. Thanks for coming by. I’m glad you’re visiting this post. That’s what I wrote it for. If you liked this series, you may also like my series on Transaction Levels. Or maybe have a look at some of my favorite posts (or some of yours).

Now clicking the links, that’s the first step. Subscribing to my RSS feed – That’s a whole other thing. So here’s the deal, if you put my blog into your RSS reader, I promise to write stuff worth belonging in it. Go on, subscribe.

January 26, 2011

Ridiculously Unnormalized Database Schemas – Part Two

Filed under: Data Cartoons, SQLServerPedia Syndication, Technical Articles — Michael J. Swart @ 12:00 pm

Ridiculously Unnormalized Database Schemas

Second Normal Form (2NF)

I’ve seen 2NF defined this way:

“A table is considered 2NF if it is 1NF and if its non-prime attributes depend on the whole of every candidate key, not just part of it.”

It means that for tables that have multicolumn primary keys (or could have them), the other columns must depend on all the columns in the key, not just a subset. If you follow this rule you’ll see that a field in a record will contain data that is a fact only about the record it belongs to, not others. From another point of view, I’ve seen normalization defined as removing redundant data.

Get rid of redundant data (Star Trek's Data gets fired)

Things to Remember

  • For 2NF, you only need to look at the “non-prime attributes” or in other words the attributes that aren’t part of the primary key.
  • Look at these columns and ask whether they depend on the whole primary key.
  • Tables with single-column primary keys are automatically in 2NF.
  • BUT it’s not fair to make identity columns your primary keys on every table and call your job done (The definition of 2NF closes this loophole by mentioning candidate keys).

The Example

Take a look at the following table. It tracks reviews from the talent-search TV show “American Idol”. The primary key is defined on the columns (Date, Singer, Song, Reviewer). Look at each of the other columns and ask whether it depends on the whole key, or a subset of the key.

American Idol Reviews

| Date | Singer | Song | Reviewer | Age | Show Order | Show Theme | Result | Review |
|---|---|---|---|---|---|---|---|---|
| Aug-21 | Justin Guarini | Before Your Love | Paula Abdul | 23 | 1 | Finale | Runner-Up | Beautiful |
| Aug-21 | Justin Guarini | Before Your Love | Randy Jackson | 23 | 1 | Finale | Runner-Up | A little pitchy dawg |
| Aug-21 | Justin Guarini | Before Your Love | Simon Cowell | 23 | 1 | Finale | Runner-Up | Rubbish |
| Aug-21 | Kelly Clarkson | A Moment Like This | Paula Abdul | 23 | 2 | Finale | Winner | Beautiful |
| Aug-21 | Kelly Clarkson | A Moment Like This | Randy Jackson | 23 | 2 | Finale | Winner | A little pitchy dawg |
| Aug-21 | Kelly Clarkson | A Moment Like This | Simon Cowell | 23 | 2 | Finale | Winner | Rubbish |
| Aug-21 | Justin Guarini | Get Here | Paula Abdul | 20 | 3 | Finale | Runner-Up | Beautiful |
| Aug-21 | Justin Guarini | Get Here | Randy Jackson | 20 | 3 | Finale | Runner-Up | A little pitchy dawg |
| Aug-21 | Justin Guarini | Get Here | Simon Cowell | 20 | 3 | Finale | Runner-Up | Rubbish |
| Aug-21 | Kelly Clarkson | Respect | Paula Abdul | 23 | 4 | Finale | Winner | Beautiful |
| Aug-21 | Kelly Clarkson | Respect | Randy Jackson | 23 | 4 | Finale | Winner | A little pitchy dawg |
| Aug-21 | Kelly Clarkson | Respect | Simon Cowell | 23 | 4 | Finale | Winner | Rubbish |
| Aug-21 | Justin Guarini | A Moment Like This | Paula Abdul | 23 | 5 | Finale | Runner-Up | Beautiful |
| Aug-21 | Justin Guarini | A Moment Like This | Randy Jackson | 23 | 5 | Finale | Runner-Up | A little pitchy dawg |
| Aug-21 | Justin Guarini | A Moment Like This | Simon Cowell | 23 | 5 | Finale | Runner-Up | Rubbish |
| Aug-21 | Kelly Clarkson | Before Your Love | Paula Abdul | 23 | 6 | Finale | Winner | Beautiful |
| Aug-21 | Kelly Clarkson | Before Your Love | Randy Jackson | 23 | 6 | Finale | Winner | A little pitchy dawg |
| Aug-21 | Kelly Clarkson | Before Your Love | Simon Cowell | 23 | 6 | Finale | Winner | Rubbish |

You can see that, with the exception of the Review column, the other columns don’t depend on the whole key. We can pull these columns into separate tables:

Normalized:

ShowContestants

| Singer | Date | Age | Result |
|---|---|---|---|
| Justin Guarini | Aug-21 | 23 | Runner-Up |
| Kelly Clarkson | Aug-21 | 20 | Winner |

Performances

| Date | Singer | Song | Show Order |
|---|---|---|---|
| Aug-21 | Justin Guarini | Before Your Love | 1 |
| Aug-21 | Kelly Clarkson | A Moment Like This | 2 |
| Aug-21 | Justin Guarini | Get Here | 3 |
| Aug-21 | Kelly Clarkson | Respect | 4 |
| Aug-21 | Justin Guarini | A Moment Like This | 5 |
| Aug-21 | Kelly Clarkson | Before Your Love | 6 |

Shows

| Date | Show Theme |
|---|---|
| Aug-21 | Finale |
We still have a table to hold all the reviews by the judges. Defined as:

    Reviews(Date, Singer, Song, Reviewer, Review)

But it’s still a point of debate whether the reviews depend on the whole primary key or just a subset of it (especially given this example, where each judge gives the same review every time).
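
If it helps to see the normalized design as DDL, here’s a hedged sketch (the data types, lengths, constraint names and foreign keys are my own assumptions, not from the original post):

CREATE TABLE ShowContestants (
    [Date]  DATE         NOT NULL,
    Singer  NVARCHAR(50) NOT NULL,
    Age     INT          NOT NULL,
    Result  NVARCHAR(20) NOT NULL,
    CONSTRAINT PK_ShowContestants PRIMARY KEY ([Date], Singer)
);

CREATE TABLE Shows (
    [Date]    DATE         NOT NULL PRIMARY KEY,
    ShowTheme NVARCHAR(50) NOT NULL
);

CREATE TABLE Performances (
    [Date]    DATE          NOT NULL,
    Singer    NVARCHAR(50)  NOT NULL,
    Song      NVARCHAR(100) NOT NULL,
    ShowOrder INT           NOT NULL,
    CONSTRAINT PK_Performances PRIMARY KEY ([Date], Singer, Song),
    CONSTRAINT FK_Performances_ShowContestants
        FOREIGN KEY ([Date], Singer) REFERENCES ShowContestants ([Date], Singer)
);

CREATE TABLE Reviews (
    [Date]    DATE          NOT NULL,
    Singer    NVARCHAR(50)  NOT NULL,
    Song      NVARCHAR(100) NOT NULL,
    Reviewer  NVARCHAR(50)  NOT NULL,
    Review    NVARCHAR(200) NOT NULL,
    CONSTRAINT PK_Reviews PRIMARY KEY ([Date], Singer, Song, Reviewer),
    CONSTRAINT FK_Reviews_Performances
        FOREIGN KEY ([Date], Singer, Song) REFERENCES Performances ([Date], Singer, Song)
);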

January 19, 2011

Ridiculously Unnormalized Database Schemas – Part One

Filed under: SQLServerPedia Syndication, Technical Articles — Michael J. Swart @ 12:00 pm
Ridiculously Unnormalized Database Schemas

So last week I introduced this series where I try to explore the different normal forms by breaking them in absurd ways. This time I’m talking about first normal form.

First Normal Form (1NF)

First normal form has a kind of fuzzy definition. E. F. Codd described a normalization process in 1970 that came to be known as first normal form (even though 1NF has been redefined and clarified a few times since). The paper is called A Relational Model of Data For Large Shared Data Banks. So in this article I’m going to deal with Codd’s description of 1NF in that paper. A database model is in 1NF if it is relational and

“all relations are defined on simple domains”

Think of relations as tables and simple domains as column spaces that are atomic (i.e. can’t be decomposed). And so today’s ridiculously un-normalized database example is from that paper.
Good and bad relational schemas
So having tables as values inside tables is not something that SQL Server (or any RDBMS) supports. But the example was actually given to explain how to normalize (to 1NF) and to avoid the hierarchical data models that were common at the time. Those hierarchical data models were (and are) implemented by systems like IMS and other systems that look like file systems. So in 1970, something like XML might have been easier to understand than relational data models in 1NF. The example above would actually be visualized like this in XML:

<Employee Man#="123" Name="Michael Swart" Birthdate="Nov. 22">
    <JobHistory JobDate="2000" Title="Lackey">
            <SalaryHistory SalaryDate="2000" Salary="$1,000,000" />
            <SalaryHistory SalaryDate="2001" Salary="$2,000,000" />
    </JobHistory>
    <JobHistory JobDate="2002" Title="Senior Lackey">
            <SalaryHistory SalaryDate="2002" Salary="$3,000,000" />
            <SalaryHistory SalaryDate="2003" Salary="$4,000,000" />
    </JobHistory>
    <Children ChildName="Mini-me 1" Birthyear="2000" />
    <Children ChildName="Mini-me 2" Birthyear="2002" />
</Employee>

XML Inside Databases

In 2005, SQL Server introduced the xml data type. And my take is that it’s good… mostly. It’s good as long as you treat it as an atomic value (i.e. non-decomposable).

  • We can think of an xml column as just a string column with a constraint that makes it well-formed XML. And that’s not terribly different from an integer column with a constraint that enforces that values are not negative.
  • XML columns are perfect for software messages (like SQL Server broker messages).
  • And they’re great for serialized objects (like query plans).

But …

XML is the most decomposable data type I can think of and when we treat these columns as non-atomic, then the tables that contain them are not in 1NF. It’s certainly not what E. F. Codd had in mind when he designed the relational data model.
An upset looking E. F. Codd
I’ve found that trouble comes when we try to query the XML or modify it in place, because then we’re not treating it as atomic, and that’s when the shortcomings of xml columns inside a database become very apparent. SQL Server (as of 2005) does give us ways to query and modify xml values, with xml indexes and XQuery, but down that road we find complicated code and performance issues that are just as complicated.
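
For example, here’s a hedged sketch of my own (not from the original post) showing that kind of decomposition, using a cut-down version of the Employee document above:

declare @employees table (EmployeeId int, JobData xml);

insert @employees values (123,
N'<Employee Name="Michael Swart">
    <JobHistory JobDate="2000" Title="Lackey" />
    <JobHistory JobDate="2002" Title="Senior Lackey" />
</Employee>');

-- querying inside the xml value (shredding it into rows):
select e.EmployeeId,
    j.node.value('@Title', 'nvarchar(50)') as Title
from @employees e
cross apply e.JobData.nodes('/Employee/JobHistory') as j(node);

-- modifying the xml value in place:
update @employees
set JobData.modify('
    replace value of (/Employee/JobHistory/@Title)[1] with "Junior Lackey"');

Neither statement treats the xml as atomic, and both get hairier as the documents and the XPath expressions grow.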

What are your thoughts?

Do you have opinions on 1NF or the history of databases? It’s all fascinating stuff. Let me know in the comments what you think. I’ve got a bunch of ideas that can’t fit in one article and your comments might let me know what to continue writing about.

January 12, 2011

Ridiculously Unnormalized Database Schemas – Part Zero

Filed under: SQLServerPedia Syndication, Technical Articles — Michael J. Swart @ 12:00 pm
Ridiculously Unnormalized Database Schemas

I’m presenting a small series of articles that describes database normalization. But the topic is already covered quite well by Wikipedia’s Database Normalization article. And so in this series I’m going to explain database normalization by using examples. Actually, by counter-example. The goal is to come up with counter-examples that break the rules so badly that you’ll never forget the rules for each normal form.

DB Normalization is a good thing

But I’m not going to explain why. At least not in depth.  That’s a topic for another day. Let me just say that SQL Server – and Oracle and DB2 and the rest – were designed to support Normalized Databases. Designing Normalized databases makes the best use of SQL Server and doing so avoids a lot of messy problems.

Before We Start…

I’m not going to jump into first normal form (1NF). That’s the next article. All database schemas that are 1NF must be relational. So in this article I’m looking at information stores that aren’t relational.

I start by giving an example of an information container. It’s not a database schema; it’s not even a data model (strictly speaking).

The Human Brain


So as an information store, the brain is not super-reliable but it has its place. I chose this example because even if you’re not in I.T., the brain still holds a ton of information!

  • To the theoretical physicist, it has a lot of information by virtue of the fact that it’s a physical object.
  • The chemist and biologist would be fascinated by the information stored in the DNA of every cell.
  • And of course if connected to a living body, the brain has information for everyone else, in the usual sense (just ask!)

The Relational Database Model

Getting back to databases. Specifically ones that aren’t relational.

It’s surprising that DB normalization was defined at the exact same time as the Relational Database Model (Well, not exactly the same time. They were a chapter apart in Codd’s paper A Relational Model of Data for Large Shared Data Banks 1970)

So for a schema to even be considered for first normal form, it has to be relational. And while SQL Server is called an RDBMS, it turns out to be surprisingly easy to define tables that aren’t relational. Take this example:

-- not a relational table
CREATE TABLE Employee (
    EmployeeNumber INT NOT NULL,
    FirstName NVARCHAR (50) NOT NULL,
    LastName NVARCHAR (50) NOT NULL
)

There’s no primary key! If your tables are missing primary keys, then you’re automatically disqualified. No 1NF for you! (Incidentally the same goes for missing foreign keys.)
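
If you find a table like this one, the fix is usually small. A hedged sketch (the constraint name is mine):

ALTER TABLE Employee
    ADD CONSTRAINT PK_Employee PRIMARY KEY (EmployeeNumber);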

In the next few articles I leave the non-relational stuff behind. We’ll look at examples of relational tables (don’t worry, there’s still a lot of ridiculousness coming).

January 5, 2011

Table Valued Parameters, A Short, Complete Example

After surfing for a while I had trouble finding an example of an application that shows how to use table-valued parameters (TVPs) from beginning to end using C#, ADO.NET, and SQL Server 2008. Official docs covering TVPs are found at Table-Valued Parameters in SQL Server 2008 (ADO.NET).

So this is my own TVP example. It consists of a SQL script, a C# program, and the commands that compile and run it.

The DB Setup (a SQL Script)

Run this on a SQL Server 2008 (or later) database that you can test on:

use tempdb;
 
CREATE TYPE BigIntList
    AS TABLE (i bigint)
GO
 
CREATE PROCEDURE ReturnEvenNumbers (@list BigIntList READONLY) AS
SELECT i
FROM @list
WHERE i % 2 = 0
GO
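
Before moving on to C#, you can give the type and the procedure a quick smoke test from T-SQL alone (a sketch of my own, not part of the original example):

DECLARE @numbers BigIntList;

INSERT @numbers (i)
VALUES (1), (2), (3), (4), (5), (6), (7);

EXEC ReturnEvenNumbers @list = @numbers;
-- returns 2, 4, 6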

The C# program

Aptly called Program.cs, this is the definition of a program that calls the new procedure with a list of seven numbers and prints the list of numbers that comes back (i.e. the even numbers).
Edit the connection string here and then save this as Program.cs in some directory.

using System.Data.SqlClient;
using System.Data;
 
namespace TVParameterTest {
    class Program {
        static void Main( ) {
            // build table
            DataTable dt = new DataTable();
            dt.Columns.Add( "i", typeof( long ) );
            foreach( long l in new long[] {1,2,3,4,5,6,7} )
                dt.LoadDataRow( new object[] { l }, true );
 
            // build connection and command
            SqlCommand cmd = new SqlCommand(
                "ReturnEvenNumbers",
                new SqlConnection( @"Data Source=.\sql2k8;Initial Catalog=tempdb;Integrated Security=True" ) );
            cmd.Connection.Open();
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.Add( "list", SqlDbType.Structured).Value = dt;
 
            // execute and output
            SqlDataReader reader = cmd.ExecuteReader();
            while( reader.Read() )
                System.Console.WriteLine( reader[0].ToString());
            cmd.Connection.Close();
 
        }
    }
}

Running it!

First make sure you’ve got Visual Studio on your machine, then you should be able to open a command prompt (or powershell!) and see this:

E:\Temp>csc program.cs
Microsoft (R) Visual C# 2008 Compiler version 3.5.30729.1
for Microsoft (R) .NET Framework version 3.5
Copyright (C) Microsoft Corporation. All rights reserved.

E:\Temp>Program.exe
2
4
6

October 11, 2010

Updating Statistics Helps, But Not For The Reasons You Think

Filed under: SQLServerPedia Syndication, Technical Articles — Michael J. Swart @ 8:12 pm

T-SQL Tuesday Logo

This month it’s Sankar Reddy’s turn to host T-SQL Tuesday. I wish him luck! I know I had a blast last month. This month we’re busting SQL Server misconceptions wherever we find them.

So today I want to explain something about the practice of updating statistics as a technique to improve or maintain performance. It works and it works well, but often not for the reasons you think, and I’m going to explain a simpler equivalent technique.

The Story

So you’ve been given the job of finding out why a particular stored procedure is performing poorly. It’s a complicated query plan and one of the first things you notice is that there are hundreds of thousands of rows coming out of a particular table. After looking closer you figure out that the estimated number of rows was only around thirty!

Remember that estimated row counts are based on what the database knows about the data and that info is stored in objects called statistics. As the table grows, these statistics can get out of date. So you try:

UPDATE STATISTICS dbo.MyTable WITH FULLSCAN

which solves the problem. (Good job, take a bow!)

Why It Works

Statistics are used to get an idea of how data is distributed. Based on the data distribution, SQL Server can make better choices on how it chooses query plans. If tables change and stats are out of date, the chosen query plans can be less than optimal. And bad plans have a bad habit of performing orders of magnitude worse than optimal plans.

Why It Really Works

I’ve found that out-of-date statistics can be responsible for bad query plans. But more frequently, I see that the parameter values that are used to compile the plan have just as big an effect.

SQL Server doesn’t just use statistics to estimate the number of rows it’s going to retrieve from a table. It also uses the parameter values and constants found in the query. And in the case of parameters, it uses the values it sees the first time the query gets compiled.

So for example, if I’m looking up a person in the phonebook by last name, the number of people I find is going to depend on whether that last name is common like “Martin” or unique like “Jingleheimer-Schmidt”. Based on the search argument, different plans can get picked, which affects database performance.

So even if statistics are already up to date, the act of calling UPDATE STATISTICS can still have an effect. When you update statistics on a table, SQL Server throws out all the query plans that depend on that table (but see comments). And each of those queries gets recompiled with the next parameter values that come along, which may be different from the original values.

It’s the equivalent of giving SQL Server a good thumping, or jolt.

If you’re getting ahead of me, you can see that updating stats won’t always work. But I’ve seen others use it as part of a trial-and-error approach to query tuning, and because it works sometimes, it becomes part of that mental toolbelt of tricks to try.

Alternative Technique 1

So you’ve used STATS_DATE and you know that statistics are up to date and you want to give SQL Server the same “thump” to build a new query plan without updating stats. You can use:

sp_recompile @objname='MyTable'

Or even more targeted:

DBCC FREEPROCCACHE (plan_handle)

But I would only recommend this for acute performance problems; it by no means addresses the root cause.

Alternative Technique 2

If you decide that SQL Server needs help in picking its plan consistently, you can use something more permanent. Use query hints like:

  • OPTIMIZE FOR UNKNOWN
  • OPTIMIZE FOR (@LastName = 'Martin')
  • USE PLAN …

But this means you’re telling SQL Server you know better than it does. So make sure you understand every nook and cranny of your tables’ data distribution and query plans.
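
For what it’s worth, here’s a hedged sketch of the OPTIMIZE FOR hint in context (the procedure, table and parameter names are mine):

CREATE PROCEDURE dbo.GetPeopleByLastName ( @LastName NVARCHAR(50) )
AS
SELECT PersonId, FirstName, LastName
FROM dbo.People
WHERE LastName = @LastName
OPTION ( OPTIMIZE FOR ( @LastName = N'Martin' ) );
GO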

Conclusion

The misconception is that statistics have a greater effect on query performance than they actually do. The query plans that SQL Server chooses depend on both statistics and the compiled parameter values.

Try to get away from a trial-and-error method of query tuning. Learn how to look at the following before making a plan (a couple of these checks are sketched after the list):

  • Find the query plan handle and query plan for a poor performing query.
  • Estimated row counts vs. Actual row counts
  • What parameter values were used to compile a query plan.
  • When statistics were last updated for a particular index.
  • The data distribution (histogram) of statistics.
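
For instance, here’s a hedged sketch of a couple of those checks (the table name is a placeholder):

-- When were statistics last updated for a particular table?
SELECT s.name, STATS_DATE(s.[object_id], s.stats_id) AS last_updated
FROM sys.stats s
WHERE s.[object_id] = OBJECT_ID('dbo.MyTable');

-- Estimated vs. actual row counts: run the query with SET STATISTICS XML ON
-- and compare EstimateRows with ActualRows in the plan that comes back.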

September 14, 2010

Guts Of A Clustered Index

Filed under: SQLServerPedia Syndication, Technical Articles — Michael J. Swart @ 8:31 am

T-SQL Tuesday Logo

So T-SQL Tuesday rolls around again. And this month it’s hosted by some handsome devil named Michael J Swart. I got to pick the topic, so you might expect great things. Well, we’ll see (no pressure). Here we go.

I want to show visually what’s happening on the disk when you create and grow a clustered index. I think this post will be most useful if you’re an application developer. You know what a table is and what a clustered index is, but beyond that, it’s not clear where SQL Server keeps your data physically. Hopefully you’ll soon find it clearer.

So I’m going to keep it simple. I’m going to show a simple table with a clustered index on its identity column (no partitions, no compression, one filegroup on one file). Remember, a clustered index defines the ordering of the table, so essentially a clustered index is the table. That’s good, because it means that in the example below, I’ll only talk about one thing. You can call it either the table or the clustered index; it’s the same thing.

First Rule of Database Storage…

First thing you should know about database storage is that everything, I mean absolutely everything in the database is stored in pages. Each page is a chunk of storage on disk exactly 8 kilobytes in size, no more, no less. You’ll see that a page never contains records from different tables.

Create A Table

CREATE TABLE TEST (
	id INT IDENTITY NOT NULL,
	variableLength VARCHAR (MAX) NOT NULL DEFAULT (REPLICATE('#', 1000)),
	CONSTRAINT pk_TEST PRIMARY KEY CLUSTERED (id)
);

Nothing happens on the disk here. There is no new space that is reserved in your database for this new table. Think of SQL Server as a lazy librarian. It’s not going to find shelves for books that don’t exist. But if you could look closely, you would see that there are changes to the system. SQL Server records in system tables that there’s a new table. Here’s one of the system tables:

Some System Table

New Row In A System Table

This particular system table describes the space that’s allocated to table TEST. It’s the system table you go to when you want to find where your table is located on disk. Right now no space is allocated until we create some data.

First Row

insert TEST default values;

Okay, something’s happening now. Two new pages are allocated for table TEST after just a single row.

New Pages

Two New Pages

Page Addresses: Notice that pages always have a name (or address) that looks like 1:153. The first number is the file number in that database (in our example it will always be 1 because the database uses only one file). The second number (153) is an incrementing counter that identifies the page within that file.

So a page with the address (1:153) is the 153rd page (or stretch of 8 kilobytes) in the 1st file of the current database.
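
If you want to see these addresses for your own rows, here’s a hedged way to do it (fn_PhysLocCracker is undocumented, like the other commands at the end of this post):

SELECT t.id, pl.file_id, pl.page_id, pl.slot_id
FROM TEST t
CROSS APPLY sys.fn_PhysLocCracker(%%physloc%%) AS pl;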

The IAM page: The first page you see here is an Index Allocation Map (IAM) page. If database tables were books on shelves, the IAM page would be like a floor plan of the library. Paul Randal wrote the definitive web resource on this stuff and so it’s something I’m not going to cover any deeper.

The Data Page: This is where your data is. There’s a header that SQL Server uses for some context. And there’s also a footer that helps track the order of the rows on a page. But still there are roughly 8,000 bytes left on a data page for your actual data. In my example, the rows I create are pretty big (about 1 KB) so there’s space for about six more rows in this page. What happens when we try to add seven?

First Index Node

set nocount on
insert TEST default values;
GO 7

Okay, there are two data pages now. The first one, 1:153, got full and SQL Server allocated a second one, 1:156, for the eighth row:

New Index Page


The Index Page: A new kind of page is created.  If you’re looking for a single row and you know the id, you don’t start at the first data page. It’s better to look at the root page. Using the information in the root page, you know the address of the data page your row is in.

In fact a page address and an integer Id column are pretty tiny fields, so there’s a lot of room on this index page. There’s space in this page to index over 600 pages of data. That works out to over 4,000 rows (inside the 600 data pages). What happens when you create one too many?

More and More Rows

set nocount on
insert TEST (variableLength) default values;
GO 4347

New Index Level


There’s a new level!

Finding Row 42: So let’s say that you want to look up the row with Id=42. First read the root page (1:791). This information tells you to look at page (1:155). You’re still not at a data page. So reading (1:155), you know to look at (1:160). And voila, after three page reads, there’s your row.

B-Trees: Tables that are quite large or that have really wide indexes can have more and more levels. This structure is called a B-Tree (B for balanced). The idea is that no matter where stuff is inserted or deleted, it’s not really difficult to shuffle page pointers around to make the tree balanced.
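
You can ask SQL Server how many levels your own b-tree has (a hedged sketch; the 'DETAILED' mode reads the whole index, so don’t run it casually against a big production table):

SELECT index_depth, index_level, page_count
FROM sys.dm_db_index_physical_stats(
    DB_ID(), OBJECT_ID('TEST'), 1, NULL, 'DETAILED');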

See For Yourself

This is all stuff that you can see for yourself. The following are examples of commands I’ve used. They’re not documented, but they’re extremely useful when trying to understand this stuff.

SELECT pl.*
FROM sys.system_internals_allocation_units iau
JOIN sys.partitions p
	ON iau.container_id = p.partition_id
CROSS APPLY sys.fn_PhysLocCracker(iau.root_page) as pl
WHERE p.object_id = OBJECT_ID('test')
	AND p.index_id = 1 -- the clustered index
	AND iau.type = 1 -- in row data
 
DBCC TRACEON(3604); -- without this, DBCC PAGE output doesn't come back to the client
DBCC PAGE('sandbox', 1, 153, 3);
 
DBCC IND('sandbox', 'test',1);

So that’s it. There’s tons left to say about all this stuff (metric tons!) and I’m hoping that the other T-SQL Tuesday bloggers will cover some of it.

September 7, 2010

Invitation to Participate in T-SQL Tuesday #10 – Indexes

Filed under: SQLServerPedia Syndication, Technical Articles — Michael J. Swart @ 10:00 am

T-SQL Tuesday Logo

What are your thoughts on Database Indexes?

(Update 09/20/2010: Find the roundup here) September already! Summer holidays are over and that means no more lazing around. The second Tuesday is fast approaching and that means that T-SQL Tuesday is almost here. And this month I will be your humble host and the topic is Indexes.

A book index

A book's index is not that different than a database index.

Indexes

Indexes are strange things. You never need to explicitly create one to create a fully-functional database, but if you want a database to perform well, they’re indispensable.

And there are so many aspects to write about! Like internals, covering, clustered, xml, fulltext, b-trees, hints, maintenance of, included columns, filtered, redundant, missing and tons more.

In fact my SQL Server 2008 Administrator’s Pocket Consultant (The first handy textbook I could grab) has an index entry on “indexes” that has 22 sub-entries.

About T-SQL Tuesday

For those not familiar with T-SQL Tuesday, it’s an idea hatched by Adam Machanic (Blog|Twitter). A monthly event, T-SQL Tuesday is a chance for any blogger to write and post an article on a single topic determined by the host (i.e. yours truly). Leave a comment here on that day and a day or two later, I’ll include your post in the round up.

Follow These Rules

  1. The post must go live on your blog between 00:00 GMT Tuesday, September 14, 2010 and 00:00 GMT Wednesday, September 15, 2010.
    In other words, set your sql server date, time and timezone properly and run this script:

    IF GETUTCDATE() BETWEEN '20100914' AND '20100915'
    	SELECT 'You Can Post'
    ELSE
    	SELECT 'Not Time To Post'
  2. Your post has to link back to the hosting blog, and the link must be anchored from the logo (found above), which must also appear at the top of the post.
  3. Leave a comment here (below) or I won’t be able to find your post. I expect trackbacks to work properly, but if they don’t, check back here just in case and leave a comment if necessary.

We also encourage you to …

  • … include a reference to T-SQL Tuesday in the title of your post. (The more we bloggers advertise T-SQL Tuesday, the more T-SQL Tuesday readers we all get.)
  • … tweet using the hash tag #TSQL2sDay to follow links and other relevant conversations.
  • … consider hosting T-SQL Tuesday yourself. If you’re interested, let Adam Machanic know. If you’ve participated in two T-SQL Tuesdays previously and you don’t let your blog go stale (you’ve blogged at least once a month for the last six months) then he’ll put you in the rotation.
  • And out of curiosity… Are you team indexes or are you team indices? Personally I’m team indexes but I work with a guy who seems to be so team indices that I’ve caught him saying indice for the singular!

That’s it! Good luck! Can’t wait to see what you guys have in store.

Your humble host,
Michael J. Swart

August 26, 2010

Ten Things I Hate to See in T-SQL

Filed under: SQLServerPedia Syndication, Technical Articles — Michael J. Swart @ 11:35 am

So here’s my top ten list of things that I see written in SQL that I dislike or that irk me in some way. And yeah, in some cases, I would use the word hate.

Tweedledum's fury.

Fury

They’re not necessarily things that go against best practices. But each one gets under my skin in some way and I thought I’d share that list at the risk of giving my enemies buttons to push.

So without further ado.

10. Code that’s Been Commented Out

It’s in every set of code I’ve ever worked with and I’m sure you’ve seen it too. It’s not even SQL Server specific, but you’ll see it in any code that changes over time. The code that gets commented out instead of deleted.

I’m guessing that the reasoning behind this is that some developers want to keep code around to make reverts easier, or to keep a sense of history of the code. But that’s what source control systems were built for!

9. The isDeleted Column

“Deleted records aren’t deleted. Look, they’re right there!”

You sometimes see this. Records that are marked as deleted with a flag. It’s sometimes needed to support an undelete feature but it still bugs me.

It’s also confusing when looking at procedures called s_OBJECT_Delete(). Without looking at the definition, does this procedure delete the record, or just mark it as deleted?

8. Fluff in Object Definitions

When I create a table from scratch like:

CREATE TABLE MyTest
(
	Id INT IDENTITY NOT NULL,
	Value1 NVARCHAR (100) NOT NULL,
	Value2 NVARCHAR (max),
	CONSTRAINT PK_MyTest PRIMARY KEY (Id)
)

I can get SQL Server to script out the table for me later on and it will look like this:

/****** Object:  Table [dbo].[MyTest]    Script Date: 08/25/2010 19:08:29 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[MyTest](
	[Id] [int] IDENTITY(1,1) NOT NULL,
	[Value1] [nvarchar](100) NOT NULL,
	[Value2] [nvarchar](max) NULL,
 CONSTRAINT [PK_MyTest] PRIMARY KEY CLUSTERED
(
	[Id] ASC
) WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, 
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, 
ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

Everything the scripter added (the SET statements, the schema name and brackets, the IDENTITY seed and increment, the explicit NULL, the CLUSTERED and ASC keywords, the WITH options and the ON [PRIMARY] clauses) is stuff that is either optional, or that specifies an option that happens to be the default. This is okay, because SQL Server didn’t know whether the table was created with options left at their defaults, or whether they were explicitly set. So the scripter spells it out.

And that’s totally fine. I just get peeved when some of this stuff makes it into checked-in object definitions.

7. Filter in the JOIN Clause Instead of the WHERE Clause

Look at this query:

SELECT p.name, COUNT(1)
FROM Sales.SalesOrderDetail sod
JOIN Production.Product p
	ON sod.ProductID = p.ProductID
	AND p.Name like 'Sport%'
GROUP BY p.Name

I’d rather see that filter on p.Name in a where clause:

SELECT p.name, COUNT(1)
FROM Sales.SalesOrderDetail sod
JOIN Production.Product p
	ON sod.ProductID = p.ProductID
WHERE p.Name like 'Sport%'
GROUP BY p.Name

Of course if you’re dealing with LEFT OUTER JOINS, then it matters where the filter is placed.
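
With an outer join the two placements mean different things. A hedged sketch reusing the same tables:

-- Filter in the ON clause: every SalesOrderDetail row survives, and the
-- product columns are NULL unless the name starts with 'Sport':
SELECT sod.SalesOrderID, p.Name
FROM Sales.SalesOrderDetail sod
LEFT JOIN Production.Product p
	ON sod.ProductID = p.ProductID
	AND p.Name like 'Sport%'

-- Filter in the WHERE clause: non-matching rows are removed afterwards,
-- which effectively turns the query back into an inner join:
SELECT sod.SalesOrderID, p.Name
FROM Sales.SalesOrderDetail sod
LEFT JOIN Production.Product p
	ON sod.ProductID = p.ProductID
WHERE p.Name like 'Sport%'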

6. Concatenating Strings to Build SQL in C#

This is more of a C# gripe than a SQL gripe but it’s my list 🙂

Old C++ developers create strings filled with SQL to be passed to the database. But long statements would get broken up in code like:

string sql = "SELECT name, location " +
    "FROM MyTable " +
    "WHERE id = 12312 " +
    "ORDER BY name";

I hate that format because it makes things impossible to cut and paste. But C# has this nifty string prefix @ that allows newlines inside string literals:

string sql = @"SELECT name, location
    FROM MyTable
    WHERE id = 12312
    ORDER BY name";

5. Concatenating Variable Values Into SQL

Pretty self-explanatory. This practice puts a database in danger of SQL injection attacks.

(Always happy for an excuse to mention little Bobby Tables from xkcd)
Exploits of a Mom
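
When you really do need dynamic SQL, the fix is to keep the value out of the string. A hedged T-SQL sketch using sp_executesql (the table and column names are mine):

DECLARE @name NVARCHAR(50) = N'Robert''); DROP TABLE Students;--';

DECLARE @sql NVARCHAR(MAX) =
    N'SELECT * FROM dbo.Students WHERE FirstName = @FirstName';

-- The value travels as a parameter, so it can't change the meaning of the statement:
EXEC sp_executesql @sql, N'@FirstName NVARCHAR(50)', @FirstName = @name;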

4. Triggers that maintain data in other tables

This is a symptom of a database design that’s not normalized. Essentially, the same data is stored in two places. The development effort needed to maintain both tables properly is often more trouble than it’s worth.

Very closely related is the indexed view. But I have nothing but love for these.
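
For the curious, here’s a hedged sketch of that indexed-view alternative (the table and column names are mine, and it assumes dbo.Orders exists with a non-nullable OrderAmount). SQL Server maintains the summary rows itself, so there’s no trigger to get wrong:

CREATE VIEW dbo.OrderTotals
WITH SCHEMABINDING
AS
SELECT CustomerId,
	COUNT_BIG(*) AS OrderCount,
	SUM(OrderAmount) AS TotalAmount
FROM dbo.Orders
GROUP BY CustomerId
GO

-- The unique clustered index is what materializes the view's rows:
CREATE UNIQUE CLUSTERED INDEX IX_OrderTotals ON dbo.OrderTotals (CustomerId)
GO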

3. Missing Foreign Keys

Missing foreign keys are more common than other problems because there’s no obvious symptom to react to:

  • A new table will support an application just as nicely with or without foreign keys.
  • Software testers will see no error messages coming from the database.

It’s often later on that data integrity becomes a problem, and now you’ve got more things to worry about:

  • Adding the foreign key.
  • Cleaning up dirty data.
  • and Public Relations.

Here’s a post I wrote earlier on finding missing keys.

2. Identifiers That Are Also Reserved Key Words

It’s confusing to have column names, alias names or table names that are also keywords. In some cases the syntax is allowed and what’s left is merely confusing.

I often see this pet peeve when columns are named the same as a particular type, like Date or Text. I’d rather see names like CreateDate, LogDate or CommentText. The following script can identify those:

SELECT c.TABLE_NAME, c.COLUMN_NAME, c.DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS c
JOIN sys.types t
	ON t.name = c.COLUMN_NAME
ORDER BY c.TABLE_NAME, c.COLUMN_NAME

1. Column Names That End in a Digit

Now, I don’t mind stuff like AddressLine1 and AddressLine2. They serve a purpose in their own way.

But normally when I see this it smells like unnormalized database design (not denormalized, there’s a distinction). So this is a good litmus test to check to see whether tables are at least in first normal form.

I use this script to check:

select t.name as table_name,
	c.name as column_name
from sys.tables t
join sys.columns c
	on t.[object_id] = c.[object_id]
where c.name like '%[0-9]'
order by t.name, c.name

~ Thanks to xkcd.com and victorianweb.org for sharing their images
