Michael J. Swart

February 16, 2011

Searching Inside Strings: CPU is Eight Times Worse For Unicode Strings

Filed under: SQLServerPedia Syndication, Technical Articles — Michael J. Swart @ 12:00 pm

Every time I see behavior from SQL Server that I don’t understand, it’s very disconcerting. But it often turns out to be a learning experience, and if I’m lucky enough to get to the bottom of it, the knowledge becomes one more tool in my mental utility belt. I had that experience a couple of weeks ago and I want to write about it.

Opening a pocket on a utility belt.

The thing I learned most recently is that searching inside Unicode strings (NVARCHAR strings) is a lot more CPU-intensive than searching inside single-byte strings (VARCHAR strings). By searching I mean looking for substrings. This has nothing to do with index lookups, and full-text indexing is a different topic.

So here’s how I learned it: in a database I work on, we recently changed many of our database strings into Unicode strings (VARCHAR to NVARCHAR) in order to support multiple languages. And we discovered that the CPU time taken by a couple of procedures shot through the roof! It was an 800% increase in CPU, and this was without any significant I/O increase.
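For context, the change itself was an ordinary column migration, something along these lines (a minimal sketch with a made-up table and column, not our actual schema):

-- hypothetical example: widening a single-byte string column to Unicode
alter table dbo.Users alter column DisplayName nvarchar(100) not null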

This is probably karma coming back to bite me after I said that I/O is the only performance metric you need.

The Setup

Luckily, I was able to reproduce and isolate the behavior. I show the example here. First we build a table of a million rows:

use tempdb
 
create table test
(
    testid bigint primary key,
    v varchar(36),
    nv nvarchar(36)
)
 
go
-- Itzik Ben-Gan's trick (via Brad Schulz) to get around SQL Server's lack of a numbers table
with
   L0 as (select 1 as C union all select 1)       --2 rows
  ,L1 as (select 1 as C from L0 as A, L0 as B)    --4 rows
  ,L2 as (select 1 as C from L1 as A, L1 as B)    --16 rows
  ,L3 as (select 1 as C from L2 as A, L2 as B)    --256 rows
  ,L4 as (select 1 as C from L3 as A, L3 as B)    --65536 rows
  ,L5 as (select 1 as C from L4 as A, L4 as B)    --4,294,967,296 rows
  ,Nums as (select row_number() over (order by (select 0)) as N from L5)
insert test
select N,
    CAST (newid() as varchar(36)),
    CAST (newid() as nvarchar(36))
from Nums
where N<=1000000;
go

The Queries

Now look at these two queries. The query that searches the Unicode string performs eight times worse than its single-byte counterpart even though they use identical I/O:

set statistics time on
set statistics io on
-- search single-byte (varchar) string
select COUNT(1) from test where v like '%abcd%' option (maxdop 1)
-- CPU time = 797 ms,  elapsed time = 791 ms.
-- Table 'test'. Scan count 1, logical reads 16472,...
 
-- search Unicode (nvarchar) string
select COUNT(1) from test where nv like N'%abcd%' option (maxdop 1)
-- CPU time = 6828 ms,  elapsed time = 6862 ms.
-- Table 'test'. Scan count 1, logical reads 16472,...
 
/*
(Aside: 36 characters isn't much of a test, but it turns out CPU usage
scales linearly on tests with longer input strings: searching 72-character
strings takes twice the CPU, 108-character strings three times the CPU, and so on.)
*/
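If you want to reproduce that scaling, here's one way to extend the setup above (my own variation, not part of the original test; note the wider column changes the I/O numbers too, so compare CPU only):

-- build a 108-character column out of three copies of the 36-character value
alter table test add nv3 nvarchar(108)
go
update test set nv3 = nv + nv + nv
go
-- this search should take roughly three times the CPU of the nv search
select COUNT(1) from test where nv3 like N'%abcd%' option (maxdop 1)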

But Why?

The extra CPU cannot be explained away by the wider characters alone: my gut feeling says that strings twice as long should not take eight times the CPU to search. My first thought was that there was an implicit conversion somewhere, but that wasn’t the case.
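(For the record, here's what an implicit conversion would have looked like in this repro. This is my own illustration, not what our procedures were doing: searching the varchar column with an N'...' pattern forces the column up to nvarchar, the plan shows a CONVERT_IMPLICIT on v, and the search pays the Unicode comparison cost anyway.)

-- nvarchar has higher data type precedence, so v is implicitly converted
-- and the comparison happens under the expensive Unicode rules
select COUNT(1) from test where v like N'%abcd%' option (maxdop 1)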

After some help on stackoverflow.com, it turns out that this has something to do with collations. Under a linguistic collation, many strings compare as equal even though they have different binary representations: for VARCHAR, ‘MICHAEL’ equals ‘michael’ under a case-insensitive collation. And Unicode string comparisons have even more complicated rules than that.
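To make that concrete, here's a quick illustration of my own (the last comparison assumes the Windows collation treats canonically equivalent Unicode sequences as equal, which is my understanding of how the underlying OS comparison behaves):

-- equal under a case-insensitive collation, not equal under a binary one
select case when 'MICHAEL' = 'michael' collate SQL_Latin1_General_CP1_CI_AS
            then 'equal' else 'not equal' end as case_insensitive,
       case when 'MICHAEL' = 'michael' collate Latin1_General_Bin
            then 'equal' else 'not equal' end as binary_comparison

-- Unicode adds more rules: a precomposed é and an 'e' followed by a
-- combining acute accent have different byte patterns but can compare equal
select case when N'é' = N'e' + nchar(0x0301) collate Latin1_General_CI_AS
            then 'equal' else 'not equal' end as canonical_equivalence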

So if SQL Server collations have something to do with it, then we can hope that a binary collation will save us the extra CPU. And in fact, that’s just what we see:

-- search unicode string with binary collation
select COUNT(1)
from test
where nv COLLATE Latin1_General_Bin
    like N'%ABCD%'
option (maxdop 1)
-- Table 'test'. Scan count 1, logical reads 16472, ...
-- CPU time = 781 ms,  elapsed time = 777 ms.

However, if we use a case-sensitive, accent-sensitive collation, we might hope for better CPU, but we’d be disappointed:

select COUNT(1)
from test
where nv COLLATE SQL_Latin1_General_CP1_CS_AS
    like N'%ABCD%'
option (maxdop 1)
-- CPU time = 6688 ms,  elapsed time = 6702 ms.
-- Table 'test'. Scan count 1, logical reads 16472, ...
/* similar results for all other non-binary collations */

So what do we know so far?

  • Searching inside NVARCHAR strings is slower than searching inside VARCHAR strings.
  • Specifying different non-binary collations (i.e. case sensitivity, accent sensitivity) for NVARCHAR strings doesn’t improve performance.
  • Binary collations are the exception: searching inside strings using binary collations is much faster (one way to exploit this is sketched below).
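Those three facts suggest a workaround, which I'll sketch here (my own idea, with the usual caveats; it buys back case-insensitive matching only, not accent-insensitivity): store an upper-cased copy of the column and search it under a binary collation, upper-casing the search term as well.

-- hypothetical workaround: a persisted upper-cased shadow column
alter table test add nv_upper as upper(nv) persisted
go
select COUNT(1)
from test
where nv_upper collate Latin1_General_Bin like N'%ABCD%'
option (maxdop 1)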

Not Just SQL Server

It turns out that this is most likely not SQL Server’s fault. SQL Server relies on the operating system for its string comparisons. In particular, it probably relies on one of these functions found in Kernel32.dll:

  • lstrcmpi or lstrcmp
  • FindNLSStringEx, FindNLSString and FindStringOrdinal
  • CompareStringEx, CompareString and CompareStringOrdinal

The docs for the ~Ordinal functions indicate that these functions are meant for binary (non-linguistic) string comparisons. I’d bet a lot of money that this explains the behavior we see in SQL Server. It accounts for why comparisons using binary collations are faster while comparisons using other collations are not.
