Michael J. Swart

February 16, 2011

Searching Inside Strings: CPU is Eight Times Worse For Unicode Strings

Filed under: SQLServerPedia Syndication, Technical Articles — Michael J. Swart @ 12:00 pm

Every time I see behavior from SQL Server that I don’t understand, it’s very disconcerting. But it often turns out to be a learning experience. And if I’m lucky enough to get to the bottom of it, the knowledge becomes one more tool in my mental utility belt. I had that experience a couple of weeks ago and I want to write about it.

Opening a pocket on a utility belt.

The thing I learned most recently is that searching inside Unicode strings (NVARCHAR strings) is a lot more CPU-intensive than searching inside single-byte strings (VARCHAR strings). By searching I mean looking for substrings. This has nothing to do with index lookups, and full-text indexing is a different topic.

So here’s how I learned it: in a database I work on, we recently changed many of our database strings into Unicode strings (VARCHAR to NVARCHAR) in order to support multiple languages. And we discovered that the CPU time taken by a couple of procedures shot through the roof! It was an 800% increase in CPU without any significant increase in I/O.

This is probably karma coming back to bite me after I said that I/O is the only performance metric you need.

The Setup

Luckily, I was able to reproduce and isolate the behavior. I show the example here. First we build a table of a million rows:

use tempdb
 
create table test
(
    testid bigint primary key,
    v varchar(36),
    nv nvarchar(36)
)
 
go
-- Itzik Ben-Gan's trick (via Brad Schulz) to get around SQL Server's lack of a numbers table
with
   L0 as (select 1 as C union all select 1)       --2 rows
  ,L1 as (select 1 as C from L0 as A, L0 as B)    --4 rows
  ,L2 as (select 1 as C from L1 as A, L1 as B)    --16 rows
  ,L3 as (select 1 as C from L2 as A, L2 as B)    --256 rows
  ,L4 as (select 1 as C from L3 as A, L3 as B)    --65536 rows
  ,L5 as (select 1 as C from L4 as A, L4 as B)    --4,294,967,296 rows
  ,Nums as (select row_number() over (order by (select 0)) as N from L5)
insert test
select N,
    CAST (newid() as varchar(36)),
    CAST (newid() as nvarchar(36))
from Nums
where N<=1000000;
go

The Queries

Now look at these two queries. The query that searches the Unicode string performs eight times worse than its single-byte counterpart even though they use identical I/O:

set statistics time on
set statistics io on
-- search single-byte (varchar) string
select COUNT(1) from test where v like '%abcd%' option (maxdop 1)
-- CPU time = 797 ms,  elapsed time = 791 ms.
-- Table 'test'. Scan count 1, logical reads 16472,...
 
-- search unicode string
select COUNT(1) from test where nv like N'%abcd%' option (maxdop 1)
-- CPU time = 6828 ms,  elapsed time = 6862 ms.
-- Table 'test'. Scan count 1, logical reads 16472,...
 
/*
(Aside: 36 characters isn't much of a test, but it turns out CPU usage
scales linearly on tests using larger input strings: searching 72-character
strings takes twice the CPU, 108-character strings take three times the CPU, etc.
A sketch of such a test follows right after this listing.)
*/
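
Here's roughly what such a test could look like. This is only a sketch; the v72 and nv72 columns are made up for illustration and aren't part of the setup above:

alter table test add v72 varchar(72), nv72 nvarchar(72)
go
-- repeat each value to get 72-character strings
update test
set v72  = v + v,
    nv72 = nv + nv
go
-- re-run the substring searches against the wider columns;
-- going by the linear scaling above, CPU should roughly double
-- (I/O goes up a little too because the rows are now wider)
select COUNT(1) from test where v72 like '%abcd%' option (maxdop 1)
select COUNT(1) from test where nv72 like N'%abcd%' option (maxdop 1)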

But Why?

The extra CPU cannot be explained away by the wider characters. My gut feeling says that strings twice as long should not take eight times the CPU to search. My first thought was that there was an implicit conversion somewhere but that wasn’t the case.
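
For what it's worth, one quick way to rule implicit conversion in or out (a sketch, not part of my original tests) is to create a type mismatch on purpose and compare the plans:

-- deliberate mismatch: varchar column vs. an N'...' pattern
select COUNT(1) from test where v like N'%abcd%' option (maxdop 1)
-- the actual plan for this query typically shows CONVERT_IMPLICIT on the column v;
-- the nvarchar query above has no such mismatch (both sides are already nvarchar),
-- so implicit conversion can't account for its extra CPU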

After some help from stackoverflow.com, it turns out that this has something to do with collations. Strings with different binary representations can still compare as equal: for VARCHAR strings, 'MICHAEL' is equal to 'michael' under a case-insensitive collation. And Unicode string comparisons follow even more complicated rules than that.
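
For example (a quick sketch, not from my original tests):

-- under a case-insensitive collation the two spellings compare as equal
select case when 'MICHAEL' = 'michael' collate SQL_Latin1_General_CP1_CI_AS
            then 'equal' else 'not equal' end
-- returns 'equal'

-- under a binary collation they don't
select case when 'MICHAEL' = 'michael' collate Latin1_General_Bin
            then 'equal' else 'not equal' end
-- returns 'not equal'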

So if SQL Server collations have something to do with it, then we can hope that by using a binary collation, we’ll save the extra CPU. And in fact, we see something like that:

-- search unicode string with binary collation
select COUNT(1)
from test
where nv COLLATE Latin1_General_Bin
    like N'%ABCD%'
option (maxdop 1)
-- Table 'test'. Scan count 1, logical reads 16472, ...
-- CPU time = 781 ms,  elapsed time = 777 ms.

However, if we use a case-sensitive, accent-sensitive collation, we might hope for better CPU, but we'd be disappointed:

select COUNT(1)
from test
where nv COLLATE SQL_Latin1_General_CP1_CS_AS
    like N'%ABCD%'
option (maxdop 1)
-- CPU time = 6688 ms,  elapsed time = 6702 ms.
-- Table 'test'. Scan count 1, logical reads 16472, ...
/* similar results for all other non-binary collations */

So what do we know so far?

  • Searching inside NVARCHAR strings is slower than searching inside VARCHAR.
  • Specifying different collations (i.e. case sensitivity, accent sensitivity) for NVARCHAR strings doesn’t improve performance.
  • Binary collations are the exception. Searching inside strings using binary collations is much faster (see the sketch just below).
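
If the speed of a binary collation is tempting but case-insensitive matching is still needed, one rough workaround (the same pattern I mention in the comments below) is to fold both sides to one case first. This is only a sketch; accent rules and other linguistic equivalences are still bypassed:

-- approximate case-insensitive substring search on the fast binary-collation path
select COUNT(1)
from test
where UPPER(nv) COLLATE Latin1_General_Bin
    like UPPER(N'%abcd%')
option (maxdop 1)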

Not Just SQL Server

It turns out that this is most likely not SQL Server’s fault. SQL Server relies on the operating system for its string methods. In particular, it probably relies on any one of these methods found in Kernel32.dll:

  • lstrcmpi or lstrcmp
  • FindNLSStringEx, FindNLSString and FindStringOrdinal
  • CompareStringEx, CompareString and CompareStringOrdinal

The docs for the ~Ordinal functions indicate that these functions are meant for binary (non-linguistic) string comparisons. I’d bet a lot of money that this explains the behavior we see in SQL Server. It accounts for why comparisons using binary collations are faster while comparisons using other collations are not.

18 Comments

  1. Nice post! Thanks for sharing.

    Comment by SQLRockstar — February 16, 2011 @ 3:17 pm

  2. Really cool investigation …

    I feel bad because you did the digging and I am going to add this to my utility belt as well. 🙂

    Thanks again for a great post!

    Comment by David Nelles — February 17, 2011 @ 11:15 am

  3. Whew, turns out that R Meyyappan covers this quite well in a SQL Bits session he gave almost a year ago: http://bit.ly/hsjfJs

    I wish I had seen it then; it could have saved me some time this year. It was also encouraging because he describes a solution we ended up using, and his demo performance numbers are in line with mine (8x worse with Unicode).

    Comment by Michael J. Swart — February 17, 2011 @ 6:13 pm

  4. Michael this is an excellent post, thanks very much for raising it and it certainly gives me a few things to think about. I have to confess I’d missed the SQL Bits session too so I’ll definitely be taking a look at that.

    Comment by Mark Broadbent — February 18, 2011 @ 4:48 am

  5. Very nice tip, I very much appreciate you sharing it!

    Comment by Greg Palmer — February 20, 2011 @ 5:41 pm

  6. Just had to tweak your setup code:
    ,L5 as (select 1 as C from L4 as A, L2 as B) --1,048,576 rows
    (Since you had a cutoff, doesn’t make any difference. Just feels better.)

    -- yours: CPU time = 797 ms, elapsed time = 791 ms.
    -- mine: CPU time = 904 ms, elapsed time = 956 ms.
    -- yours: CPU time = 6828 ms, elapsed time = 6862 ms.
    -- mine: CPU time = 5928 ms, elapsed time = 5955 ms.

    Then I just had to tweak your test code:
    select COUNT(*) from test where charindex('ABCD',nv COLLATE Latin1_General_Bin)>0 option (maxdop 1)

    No surprise that lower case had no matches. This did have results which surprised me because I wasn’t comparing a Unicode string to the binary unicode. (I threw in charindex to see if that had any performance benefit. No more than random execution stats could improve it, but this binary comparison was nearly twice as fast as the byte comparison)

    Nice article. (In the first tests using your original test script, the first query was 0.2 seconds slower on my machine and the second was nearly a second faster.)

    Comment by Ken — February 20, 2011 @ 11:17 pm

  7. Thanks Ken. Your timing is interesting (6.5 times slower instead of 8.5). I can’t account for it; there are a million different variables to consider 🙂 But the conclusion is still the same.

    Now CHARINDEX, that’s something I should have considered. Thanks for reporting back.

    Comment by Michael J. Swart — February 21, 2011 @ 9:33 am

  8. Unicode string searches are definitely more expensive — the comparison rules are much more complex than the relatively simple ones for non-Unicode SQL sort orders. But it’s worth pointing out that this overhead is generally only noticeable if you are doing full table or index scans of a fairly large table. In queries with search predicates that can be satisfied via a selective index seek, this will generally be a non-issue.

    In other words, do not assume that moving to Unicode will make an arbitrary application 8X slower. That’s close to the worst case scenario… that kind of additional cost will only be seen by a subset of applications (and even with the affected apps, only in a small subset of queries).

    Comment by BartD — February 21, 2011 @ 12:12 pm

  9. Of course it’s worth noting that “Your mileage may vary”. But when coming across a huge jump in CPU as I and others have done, it’s good to have an explanation and that’s what I’m doing here.

    I did point out that “This has nothing to do with index lookups”. SARG-able queries are in fact best, of course. I never meant to imply that Unicode = 8x worse always, only when searching for substrings (hence the title).
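
    For illustration only, here's a sketch of what a SARG-able version could look like (assuming an index on the column; this isn't from the post above):

    create index ix_test_nv on test (nv)
    -- a pattern without a leading wildcard can be satisfied with an index seek
    select COUNT(1) from test where nv like N'abcd%' option (maxdop 1)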

    Comment by Michael J. Swart — February 21, 2011 @ 12:22 pm

  10. Sorry, I didn’t mean to imply that you had stated anything that contradicted the point I was making.

    I’ve run into this problem myself. But I’ve also worked with many, many apps that used Unicode types without issue. I just want to make sure that people don’t walk away with an incorrect conclusion that they should avoid using Unicode types. If you want your app to be multilingual, you want Unicode; I would caution people not to try storing multilingual data in varchar fields. With most apps the Unicode overhead will be so small that it would be difficult to measure.

    That said, this potential perf issue is important to be aware of. And in cases where you’re storing strings that are simple programmatic identifiers that don’t need to store multilingual data (in other words, the field’s content is interpreted by an app but is not intended for direct consumption by end users), the additional cost of Unicode string comparisons is a good reason to use varchar instead of nvarchar.

    Comment by BartD — February 21, 2011 @ 2:12 pm

  11. Ahh… Sorry for misunderstanding your original post. And I understand your point now. 🙂

    It’s interesting how I’m a slave to my own point of view sometimes. I looked at this Unicode vs. cpu issue from an “explaining weird behavior” point of view and that’s how I wrote the post.

    It never occurred to me that someone might mistake this article to mean: “When designing databases columns, avoid Unicode or face the dire CPU consequences”. Which couldn’t be farther from what I meant.

    Cheers Bart, thanks for your feedback.

    Comment by Michael J. Swart — February 21, 2011 @ 4:54 pm

  12. Something related (from Microsoft)
    https://connect.microsoft.com/SQLServer/feedback/details/512459/replace-function-extremely-slow-with-non-binary-windows-collation

    Comment by Michael J. Swart — February 22, 2011 @ 2:15 pm

  13. Interesting observation Michael, I read it with pleasure! Collations, once in a while we all have some fun with them 😉

    What I’d like to point out though: the speed improvement of casting the value to a binary collation is good, but the results may end up being different. Under a binary collation, every distinct character is treated as different. The biggest impact of that is case-related: if your regular collation is CI, you won’t be able to ignore case using the binary collation (well, unless you use the LOWER() or UPPER() function on your column and search term). I’m sure you’re aware of this but I didn’t see it explicitly mentioned above, which means that some readers may get unexpected results.

    Having said that, this method is definitely interesting to keep in mind for certain circumstances such as filtering on GUIDs (erm, yeah right :-))

    Best regards, Valentino.

    Comment by Valentino Vranken — March 3, 2011 @ 7:55 am

  14. Thanks Valentino!

    In troubleshooting mode, we all look for the solution to a problem. In this article it seems like I’m describing a problem (slow string searching). But it would be a mistake to think that switching collations is a solution I’m proposing. That certainly wasn’t my intent. String comparison behaviour is greatly impacted by collations (which is the point 🙂 )

    But it’s interesting that in the particular circumstances that prompted me to write this article, I temporarily implemented a comparison that looked something like:
    WHERE UPPER(NvarcharColumnToSearch) COLLATE Latin1_General_Bin like UPPER(N'%SearchString%')
    until I could replace the whole thing with something “SARG”-able.

    Comment by Michael J. Swart — March 3, 2011 @ 8:43 am

  15. I’m here right now!

    Comment by Ted Musgrove — May 23, 2011 @ 3:01 am

  16. @Ted,
    Wherever you go, there you are. I try to be where I am 24/7 (or at least 24/5, weekends are my time).

    Comment by Michael J. Swart — May 26, 2011 @ 1:38 pm

  17. […] do find). I like to think that I’m helping people in the same situation I was in. (Examples: Searching Inside Strings: CPU is Eight Times Worse For Unicode Strings, Eliminated Null […]

    Pingback by Write Better | Michael J. Swart — December 20, 2011 @ 11:20 pm

  18. […] This is the fun part. Investigation time. I use google, #sqlhelp, and everything I can to help me get to the bottom of the mystery. In the case of the Unicode string mystery, I opened a  StackOverflow item and got some quick quality answers. Also after the mystery got solved, it made a great topic for a blog post. […]

    Pingback by I Was Kidding Myself | Michael J. Swart — February 27, 2012 @ 6:34 pm
