Michael J. Swart

June 15, 2018

ORDER BY newid() is an Unbiased Way To Randomize

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 9:47 am

Mike Bostock is a data-visualization specialist. And it really shows in his blog. Every article is really well designed (which makes sense… many of the articles are about design).

One of his articles, Visualizing Algorithms has some thoughts on shuffling at https://bost.ocks.org/mike/algorithms/#shuffling.

He says that sorting using a random comparator is a rotten way to shuffle things. Not only is it inefficient, but the resulting shuffle is really really biased. He goes on to visualize that bias (again, I really encourage you to go see his stuff).

Ordering by random reminded me of the common technique in SQL Server of ORDER BY newid(). So I wondered whether an obvious bias was present there. So I shuffled 100 items thousands of times and recreated the visualization of bias in a heat map (just like Mike did).

Here is the heatmap. If you can, try to identify any patterns.

Order By NewID Bias

Where:

    columns are the position before the shuffle,
    rows are the position after the shuffle,
    green is a positive bias and
    red is a negative bias.

I don’t think there is any bias here. The problem that introduces bias in Mike Bostock’s example is that his “random comparator” that he defined does not obey transitivity. His words. “A comparator must obey transitivity: if a > b and b > c, then a > c.”
But in SQL Server, because each row is assigned a newid(), ORDER BY newid() doesn’t have that flaw and so it doesn’t have that bias.

But Be Careful

Although the method is unbiased, ORDER BY newid() is still inefficient. It uses a sort which is an inefficient way of shuffling. There are alternative shuffle algorithms that are more efficient.
ORDER BY newid() is good for quick and dirty purposes. But if you value performance, shuffle in the app.

1 Comment »

  1. […] Michael J. Swart tests whether ORDER BY NEWID() produces a biased result: […]

    Pingback by Randomization With NEWID() – Curated SQL — June 18, 2018 @ 8:00 am

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress