Michael J. Swart

February 16, 2011

Searching Inside Strings: CPU is Eight Times Worse For Unicode Strings

Filed under: SQLServerPedia Syndication, Technical Articles — Michael J. Swart @ 12:00 pm

Every time I see behavior from SQL Server that I don’t understand, it’s very disconcerting. But it often turns out to be a learning experience. And if I’m lucky enough to get to the bottom of it, the knowledge becomes one more tool in my mental utility belt. I had that experience a couple of weeks ago and I want to write about it.

Opening a pocket on a utility belt.

The thing I learned most recently is that searching inside Unicode strings (NVARCHAR strings) is a lot more CPU-intensive than searching inside single-byte strings (VARCHAR strings). By searching I mean looking for substrings; this has nothing to do with index lookups, and full-text indexing is a different topic.

So here’s how I learned it: In a database I work on, we recently changed many of our database strings into Unicode strings (VARCHAR to NVARCHAR) in order to support multiple languages. And we discovered that the CPU time taken by a couple of procedures shot through the roof! It was an 800% increase in CPU, and this was without any significant I/O increase.

This is probably karma coming back to bite me after I said that I/O is the only performance metric you need.

The Setup

Luckily, I was able to reproduce and isolate the behavior. I’ll show the example here. First we build a table of a million rows:

use tempdb
 
create table test
(
    testid bigint primary key,
    v varchar(36),
    nv nvarchar(36)
)
 
go
-- Itzik Ben-Gan's trick (via Brad Schulz) to get around SQL Server's lack of a numbers table
with
   L0 as (select 1 as C union all select 1)       --2 rows
  ,L1 as (select 1 as C from L0 as A, L0 as B)    --4 rows
  ,L2 as (select 1 as C from L1 as A, L1 as B)    --16 rows
  ,L3 as (select 1 as C from L2 as A, L2 as B)    --256 rows
  ,L4 as (select 1 as C from L3 as A, L3 as B)    --65536 rows
  ,L5 as (select 1 as C from L4 as A, L4 as B)    --4,294,967,296 rows
  ,Nums as (select row_number() over (order by (select 0)) as N from L5)
insert test
select N,
    CAST (newid() as varchar(36)),
    CAST (newid() as nvarchar(36))
from Nums
where N<=1000000;
go

The Queries

Now look at these two queries. The query that searches the Unicode string performs eight times worse than its single-byte counterpart even though they do identical I/O:

set statistics time on
set statistics io on
-- search single-byte (varchar) string
select COUNT(1) from test where v like '%abcd%' option (maxdop 1)
-- CPU time = 797 ms,  elapsed time = 791 ms.
-- Table 'test'. Scan count 1, logical reads 16472,...
 
-- search unicode string
select COUNT(1) from test where nv like N'%abcd%' option (maxdop 1)
-- CPU time = 6828 ms,  elapsed time = 6862 ms.
-- Table 'test'. Scan count 1, logical reads 16472,...
 
/*
(Aside: 36 characters isn't much of a test, but it turns out CPU usage
scales linearly on tests using larger input strings: searching 72-character
strings takes twice the CPU, 108-character strings take three times the CPU, etc.)
*/
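If you want to reproduce that aside, here’s a rough sketch of one way to set up the longer-string test. Widening the column and concatenating extra GUIDs is just one convenient way to get 108-character values; it isn’t necessarily the exact setup I used:

-- widen the column and concatenate a couple of extra GUIDs to get 108-character values
alter table test alter column nv nvarchar(108);
 
update test
set nv = nv + cast(newid() as nvarchar(36)) + cast(newid() as nvarchar(36));
 
select COUNT(1) from test where nv like N'%abcd%' option (maxdop 1)
-- expect roughly three times the CPU of the 36-character search above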

But Why?

The extra CPU cannot be explained away by the wider characters. My gut feeling says that strings twice as long should not take eight times the CPU to search. My first thought was that there was an implicit conversion somewhere but that wasn’t the case.

After some help from stackoverflow.com, it turns out that this has something to do with collations. Strings can compare as equal even when they have different binary representations: for VARCHAR strings, ‘MICHAEL’ is equal to ‘michael’ under a case-insensitive collation, and Unicode string comparisons follow even more complicated linguistic rules than that.
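You can see that equality for yourself with a quick check (this assumes a typical case-insensitive default collation such as SQL_Latin1_General_CP1_CI_AS):

select case when 'MICHAEL' = 'michael' then 'equal' else 'not equal' end;
-- 'equal' under a case-insensitive collation
 
select case when 'MICHAEL' = 'michael' collate Latin1_General_Bin then 'equal' else 'not equal' end;
-- 'not equal' once a binary collation is applied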

So if SQL Server collations have something to do with it, then we can hope that by using a binary collation, we’ll save the extra CPU. And in fact, we see something like that:

-- search unicode string with binary collation
select COUNT(1)
from test
where nv COLLATE Latin1_General_Bin
    like N'%ABCD%'
option (maxdop 1)
-- Table 'test'. Scan count 1, logical reads 16472, ...
-- CPU time = 781 ms,  elapsed time = 777 ms.

However, if we use a case-sensitive, accent-sensitive collation, we might hope for better CPU, but we’d be disappointed:

select COUNT(1)
from test
where nv COLLATE SQL_Latin1_General_CP1_CS_AS
    like N'%ABCD%'
option (maxdop 1)
-- CPU time = 6688 ms,  elapsed time = 6702 ms.
-- Table 'test'. Scan count 1, logical reads 16472, ...
/* similar results for all other non-binary collations */
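For reference, the full list of collations to plug into that COLLATE clause can be pulled from the server itself. This is just a pointer to where the names come from, not the exact test harness I used:

-- enumerate candidate collations; the binary ones end in _BIN or _BIN2
select name, description
from sys.fn_helpcollations()
where name like 'Latin1_General%';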

So what do we know so far?

  • Searching inside NVARCHAR strings is slower than searching inside VARCHAR strings.
  • Specifying different collations (i.e. case sensitivity, accent sensitivity) for NVARCHAR strings doesn’t improve performance.
  • Binary collations are the exception. Searching inside strings that use binary collations is much faster (a sketch of applying one to a column follows below).
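If binary comparison semantics are acceptable for a column (searches become case-sensitive and sorting follows code points), one option is to change the column’s collation permanently instead of adding COLLATE to every query. This is only a sketch, not a blanket recommendation:

-- change the column's collation so every search gets the cheaper binary comparison
-- (warning: this also changes equality and sort behavior for the column)
alter table test alter column nv nvarchar(36) collate Latin1_General_Bin;
 
select COUNT(1) from test where nv like N'%ABCD%' option (maxdop 1)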

Not Just SQL Server

It turns out that this is most likely not SQL Server’s fault. SQL Server relies on the operating system for its string handling. In particular, it probably relies on one of these functions found in Kernel32.dll:

  • lstrcmpi or lstrcmp
  • FindNLSStringEx, FindNLSString and FindStringOrdinal
  • CompareStringEx, CompareString and CompareStringOrdinal

The docs for the ~Ordinal functions indicate that these functions are meant for binary (non-linguistic) string comparisons. I’d bet a lot of money that this explains the behavior we see in SQL Server. It accounts for why comparisons using binary collations are faster while comparisons using other collations are not.

September 25, 2009

CHAR: What is it Good For?

Filed under: Technical Articles — Michael J. Swart @ 7:06 am

Takeaway: … Absolutely nothing (say it again).

What’s the difference between the CHAR datatype and the VARCHAR datatype? At the time of this writing, Google gives thousands of web pages that are willing to explain the difference. Or, if not explaining the difference, at least explaining which is better and when. Here are two of the more interesting or important links:

What this Blogger Thinks
I’m going to put my vote with VARCHAR over CHAR in almost all circumstances. I’m going to list what I think are the stated benefits of CHAR and then hopefully show that those benefits are outweighed by the consistency you get from always applying VARCHAR as a rule of thumb. (N.B. Throughout this post, the same arguments apply to NCHAR vs. NVARCHAR.)

Data Integrity:
Does SQL Server complain if you give CHAR (10) fewer than ten characters?
At first glance, you might assume (the way I did) that

CREATE TABLE example1 ( col CHAR (10) )

is somehow equivalent to

CREATE TABLE example2
(
    col VARCHAR (10),
    CHECK (LEN (col) = 10)
)

But it turns out not to be the case. Inserting strings shorter than ten characters into the column in example 1 does not produce an error. Instead, SQL Server pads the value with enough spaces to fill out the column.
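You can see the padding for yourself with a quick check against the example1 table above (just a sketch):

INSERT example1 VALUES ('abc');
 
SELECT col,
    DATALENGTH (col) AS bytes_stored,   -- 10: the value is padded with trailing spaces
    LEN (col) AS character_count        -- 3: LEN ignores trailing spaces
FROM example1;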

Size
And so we get to it. The main consideration when deciding between these types always seems to be size. VARCHAR, as you know, takes only the characters it needs, but it uses an extra two bytes to store the length of the string. That’s why Microsoft recommends using CHAR for strings that are going to be a consistent length.

So I decided to find out exactly how much. Using the sample database AdventureWorks, I changed the tables that used NCHAR columns into identical tables that use NVARCHAR columns. And I found that the number of pages consumed was exactly the same (even after rebuilding indexes). I expected that though; the small savings in that case weren’t enough to fit extra rows onto a database page.

But here’s something that surprised me regarding nullable columns. If you are counting every byte, you’ll see that VARCHAR is the clear space saver when it comes to NULLs. You’ll find that CHAR (10) columns that are NULL still take ten bytes while VARCHAR (10) columns take zero bytes. This is another reason to pick VARCHAR over CHAR.
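Here’s a rough way to verify that claim. It’s only a sketch: the table names are made up for the test, and the exact sizes will vary a little with row overhead:

CREATE TABLE char_nulls    ( id INT IDENTITY PRIMARY KEY, c CHAR (10) NULL );
CREATE TABLE varchar_nulls ( id INT IDENTITY PRIMARY KEY, v VARCHAR (10) NULL );
 
INSERT char_nulls (c)    SELECT TOP (100000) NULL FROM sys.all_columns a CROSS JOIN sys.all_columns b;
INSERT varchar_nulls (v) SELECT TOP (100000) NULL FROM sys.all_columns a CROSS JOIN sys.all_columns b;
 
EXEC sp_spaceused 'char_nulls';    -- noticeably more data space here...
EXEC sp_spaceused 'varchar_nulls'; -- ...than here, because each NULL CHAR (10) still reserves its ten bytes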

Semantics
What about semantics? When application developers see a CHAR (10) column, it’s clear to them that data in the column is expected to be exactly ten characters. This is a benefit of the CHAR data type, and it’s not something that the VARCHAR data type conveys nicely, even with a CHECK constraint or documentation.

So I’ll concede that point, even though it’s possible to create user-defined types that do something similar:

CREATE TYPE PostalCode FROM VARCHAR (6);
CREATE TYPE SIN FROM VARCHAR (9);
CREATE TYPE OneChar FROM VARCHAR (1);

So Are CHARs Completely Useless?
No, but I don’t see any real significant advantage either. And so I think it’s easier to apply VARCHAR as a rule of thumb and as a best practice without worrying about whether it’s better than CHAR. So pick VARCHAR and be done with it.
