Michael J. Swart

September 9, 2014

Take Care When Scripting Batches

Filed under: Miscelleaneous SQL,SQL Scripts,Technical Articles — Michael J. Swart @ 1:22 pm
100 Percent Online Deployments
How to deploy schema changes without scheduled downtime

Takeaway: When performing long-running modifications, I’m sure many of you enjoy using batches to increase concurrency. But I want to talk about a pitfall to be aware of. If you’re not careful, the method you use to implement batching can actually worsen concurrency.

... we don't need no stinkin' batches either

Why Use Batches?

Even without an explicit transaction, all SQL statements are atomic – changes are all or nothing. So when you have long-running modifications to make, locks on data can be held for the duration of your query and that can be too long. Especially if your changes are intended for live databases.

But you can make your modifications in several smaller chunks or batches. The hope is that each individual batch executes quickly and holds locks on resources for a short period of time.

But care is needed. I’m going to give an example to show what I mean. The example uses the FactOnlineSales table in the ContosoRetailDW database (available as a download here). The FactOnlineSales table has

  • one clustered index on OnlineSalesKey and no other indexes,
  • 12 million rows,
  • and 46 thousand database pages

Metrics to Use
In this example, I want to know how long each query takes because this should let me know roughly how long locks are held.
But instead of duration, I’m going to measure logical reads. It’s a little more consistent and in the examples below it’s nicely correlated with duration.

The Straight Query

Suppose we want to remove sales data from FactOnlineSales for the “Worcester Company” whose CustomerKey = 19036. That’s a simple delete statement:

DELETE FactOnlineSales WHERE CustomerKey = 19036;

This delete statement runs an unacceptably long time. It scans the clustered index and performs 46,650 logical reads and I’m worried about concurrency issues.

Naive Batching

So I try to delete 1,000 rows at a time. This implementation seems reasonable on the surface:

DECLARE	
	@RC INT = 1;
 
WHILE (@RC > 0)
BEGIN
 
  DELETE TOP (1000) FactOnlineSales
  WHERE CustomerKey = 19036;
 
  SET @RC = @@ROWCOUNT
 
END

Unfortunately, this method does poorly. It scans the clustered index in order to find 1,000 rows to delete. The first few batches complete quickly, but later batches gradually get slower as it takes longer and longer to scan the index to find rows to delete. By the time the script gets to the last batch, SQL Server has to delete rows near the very end of the clustered index and to find them, SQL Server has to scan the entire table.

In fact, this last batch performs 46,521 logical reads (just 100 fewer reads than the straight delete). And the entire script performed 1,486,285 logical reads in total. If concurrency is what I’m after, this script is actually worse than the simple DELETE statement.

Careful Batching

But I know something about the indexes on this table. I can make use of this knowledge by keeping track of my progress through the clustered index so that I can continue where I left off:

DECLARE
	@LargestKeyProcessed INT = -1,
	@NextBatchMax INT,
	@RC INT = 1;
 
WHILE (@RC > 0)
BEGIN
 
  SELECT TOP (1000) @NextBatchMax = OnlineSalesKey
  FROM FactOnlineSales
  WHERE OnlineSalesKey > @LargestKeyProcessed
    AND CustomerKey = 19036
  ORDER BY OnlineSalesKey ASC;
 
  DELETE FactOnlineSales
  WHERE CustomerKey = 19036
    AND OnlineSalesKey > @LargestKeyProcessed
    AND OnlineSalesKey <= @NextBatchMax;
 
  SET @RC = @@ROWCOUNT;
  SET @LargestKeyProcessed = @NextBatchMax;
 
END

The delete statements in this script performed 46,796 logical reads in total but no individual delete statement performed more than 6,363.

Graphically that looks like:

Logical Reads Per Delete Statement

Logical Reads Per Delete

The careful batching method runs in roughly the same time as the straight delete statement but ensures that locks are not held for long.
The naive batching method runs with an order of complexity (compared to the expected complexity of n) and can hold locks just as long as the straight delete statement.
This underlines the importance of testing for performance.

32 Comments »

  1. Ah yes, a lesson I’ve re-learned a few times. Usually it goes something like this: Start a batched operation, it’s running long so let it go over night (there’s lots of rows to go through after all), come back to work the next day to find it still running, realize I did something stupid, fix the batched query, completes in minutes.

    Comment by Aaron Cooper — September 9, 2014 @ 10:28 pm

  2. […] Take Care When Scripting Batches – Michael J. Swart (Blog|Twitter) […]

    Pingback by (SFTW) SQL Server Links 12/09/14 - John Sansom — September 12, 2014 @ 10:03 am

  3. Short and sweet. Thank you Michael.

    Comment by Srdjan — September 17, 2014 @ 2:20 pm

  4. Hi Michael,

    I really like this idea, and I’ve been applying it to a lot of my batched operations. Here’s a simplified version of an update I did. It ran through about 900k rows in 50 seconds. Let me know if you see a better way to do this.

    Thanks

    DECLARE @HighKey BIGINT = -1 ,
        @BatchMx BIGINT ,
        @Origin NVARCHAR(2000) = N'oldpath' ,
        @Destination NVARCHAR(2000) = N'newpath';
     
     
    WHILE EXISTS ( SELECT TOP 1
                            *
                   FROM     [FILE] WITH ( NOLOCK )
                   WHERE    Location LIKE @Origin + N'%' )
        BEGIN
     
            SELECT TOP ( 10000 )
                    @BatchMx = FileID
            FROM    [FILE] WITH ( NOLOCK )
            WHERE   Location LIKE @Origin + N'%'
            ORDER BY FileID;
     
            UPDATE  F
            SET     F.Location = REPLACE(F.Location, @Origin, @Destination)
            FROM    [FILE] F
            WHERE   F.FileID > @HighKey
                    AND F.FileID <= @BatchMx
                    AND Location LIKE @Origin + N'%';
     
            SET @HighKey = @BatchMx;
     
        END

    Comment by Erik — November 18, 2014 @ 2:15 pm

  5. Hi Erik,
    Thanks for stopping by.
    In each loop, when you’re selecting the next @BatchMx value, you don’t restrict FileID to be larger than the previous @HighKey.
    This means you may find yourself seeing the same poor behavior I described with the Naive batching method. But it’s actually pretty easy to measure.

    Your locks will be reduced on each batch (because you’re limiting the number of rows that get updated) but I wonder about the performance of each batch:
    Run the script on a test machine and collect the stats (IO/CPU) of each statement (using a trace). Compare it to the stats you get for the corresponding straight query:

    UPDATE  F
    SET     F.Location = REPLACE(F.Location, @Origin, @Destination)
    FROM    [FILE] F
    WHERE   Location LIKE @Origin + N'%';

    Comment by Michael J. Swart — November 20, 2014 @ 9:22 am

  6. Hi Michael,

    Are you saying where I select top (N), I should be filtering on the @BatchMx variable?

    I’ve reviewed lots of information about this vs. other methods using SQL Sentry monitoring tools. The only bottlenecks I’ve noticed are when I don’t/can’t have indexes that support other WHERE conditions I may use. For instance, the last time I ran this I was filtering on a bit column that I wasn’t allowed to put an index on.

    I’ll try your suggestion out today and follow up.

    Thanks,
    Erik

    Comment by Erik — November 20, 2014 @ 9:48 am

  7. No, not on the @BatchMx variable. Where you select top (n), I would add this filter:

    AND FileID > @HighKey

    So presumably, SQL Server can search for keys to update where it left off.

    I don’t know everything about your [FILE] table, for example, if there’s an index on Location, then the naive batching method will be very efficient.

    But the main authority here is SQL Server. Test it and see. If there are no bottlenecks with the method you’re using, then it’s good.

    (Also, if you don’t have an index on Location, the WHILE statement can become expensive too)

    Michael

    Comment by Michael J. Swart — November 20, 2014 @ 9:59 am

  8. Okay, that makes way more sense. I’ll give that a go.

    No index on Location. It’s NVARCHAR(2000) – a pretty long UNC file path – and part of a vendor product that I can’t really alter.

    Thanks,
    Erik

    Comment by Erik — November 20, 2014 @ 10:12 am

  9. Hi Michael,

    If we could add an index on CustomerKey, which one of the three approaches would have been the best?
    I guess that the first one, but it may depend on the number of records to be deleted.
    Maybe if the number of records for CustomerKey = 19036 we should take into consideration the log size required to run a single delete, and then run it batched. Would, in this case with the new index, the “naive batching” be good?

    I guess yes.

    Thanks

    Comment by Jaime — April 23, 2015 @ 2:36 am

  10. Hi Jaime,

    If we had added an index on CustomerKey, then that does change the story.
    If we add an index on CustomerKey, then that index creation has the largest impact on the log. But I’m not actually worried about that.

    I added the index you suggested and using the first method (the straight query), I see that the number of logical reads has jumped to 410,000! It’s not immediately obvious why. It has to do with the fact that now we have two indexes to delete from. So instead of

    1. Scan the whole clustered index
    2. Delete matching rows as you find them

    we have

    1. Look up the customer rows using the new index and for each row:
    2. Look up the corresponding clustered index row and delete it
    3. Delete the row from the new index

    Those lookups make all the difference

    As for the second method (the naive batching method with new index). Because of that index, it automatically performs equally as well as the third method (careful batching with new index).

    Comment by Michael J. Swart — April 27, 2015 @ 10:07 am

  11. Hi Michael,

    Thanks for your answer!

    We were experiencing the problem with the lookups and the “surprising” increase of logical reads, and we had determined exactly the same reason.
    But it is somehow disturbing that SQL performs worse with a non-clustered index in the field(s) that you use to filter the records to be deleted.
    In this scenario, the lookup will always be needed, and the bigger the amount of rows to delete is, the worse the performance will be. Am I right?
    Then SQL should determine that it would be a better execution plan if it chose the clustered index to delete.

    Jaime

    Comment by Jaime — April 27, 2015 @ 10:27 am

  12. Hi Jaime,
    SELECT statements and other queries can always perform better when appropriate indexes are added. But we pay for this performance gain by being obliged to maintain the indexes. This means that DELETE, UPDATE and INSERT statements must perform an extra operation for every extra index that’s involved. And the coordination of these actions can get expensive as you’ve discovered.

    You asked another question about the performance of a delete operation and the execution plan that SQL Server chooses. I’m not quite sure I understood that. SQL Server usually chooses pretty well.

    Comment by Michael J. Swart — April 27, 2015 @ 11:07 am

  13. Michael in message 4 is updating the lookup field, that’s probably why he isn’t using an increasing index column in the where clause:

    SET F.Location = REPLACE(F.Location, @Origin, @Destination)

    Of course, this only works if the updated value doesn’t match any longer…

    For the Original update query there is also a danger, which is not explicitly excluded:

    •one clustered index on OnlineSalesKey and no other indexes,
    •12 million rows,
    •and 46 thousand database pages

    DELETE FactOnlineSales
    WHERE CustomerKey = 19036
    AND OnlineSalesKey > @LargestKeyProcessed
    AND OnlineSalesKey @LargestKeyProcessed

    This may also skip records if duplicate values are allowed and present.

    I assume the that Clustered Index is also the Primary Key in this table, so this would not be a problem.
    But it may be if your search value isn’t unique for the column(s).

    Comment by SQL Server Performance Tune — December 13, 2016 @ 8:40 am

  14. Some lines have disappeared:

    DELETE FactOnlineSales
    WHERE CustomerKey = 19036
    AND OnlineSalesKey GT @LargestKeyProcessed
    AND OnlineSalesKey LTE @NextBatchMax

    Imagine @LargestKeyProcessed is equal to @NextBatchMax.
    No rows would be deleted and the while loop is quit, leaving records.

    WHERE OnlineSalesKey GT @LargestKeyProcessed

    If duplicate values are allowed, this could also skip and leave records.

    Comment by SQL Server Performance Tune — December 13, 2016 @ 8:46 am

  15. Hi there,
    It turns out that this technique also works for indexes that aren’t unique.
    In the original query, @LargestKeyProcessed can never be equal to @NextBatchMax because it’s set in this query:

    SELECT TOP (1000) @NextBatchMax = OnlineSalesKey
      FROM FactOnlineSales
      WHERE OnlineSalesKey > @LargestKeyProcessed
        AND CustomerKey = 19036
      ORDER BY OnlineSalesKey ASC;

    Try it out, when the indexes are not unique, the consequence is that the batches of rows that get processed may be larger than 1000, but none will get skipped.

    Comment by Michael J. Swart — December 13, 2016 @ 9:13 am

  16. Yes, that’s right. I forgot about that. I sometimes use a similar process to update or delete rows with non-unique identifiers by setting ROWCOUNT to the number of duplicates – 1. That is not the case here, but would be the danger if you think there are only 1000 records with that specific value.

    Thanks for correcting me.

    Comment by SQL Server Performance Tune — December 13, 2016 @ 9:27 am

  17. What if my original query looks like:

    DELETE addr FROM customer_address addr
    WHERE [BENEFICIARY_FK] IS NOT NULL
    OR [CONDUCTOR_FK] IS NOT NULL
    OR [THIRD_PARTY_FK] IS NOT NULL;

    Comment by MlbSQL — March 6, 2018 @ 1:47 pm

  18. Hello there, my question would be, what if my filters are different, would looks as shown below ?

    –Original query
    DELETE addr FROM customer_address addr
    WHERE [BENEFICIARY_FK] IS NOT NULL
    OR [CONDUCTOR_FK] IS NOT NULL
    OR [THIRD_PARTY_FK] IS NOT NULL;

    –Careful batching technique
    DECLARE
    @LargestKeyProcessed INT = -1,
    @NextBatchMax INT,
    @RC INT = 1; — rowcount

    WHILE (@RC > 0)
    BEGIN

    SELECT TOP (1000) @NextBatchMax = customer_address_key
    FROM customer_address
    WHERE customer_address_key > @LargestKeyProcessed
    and [BENEFICIARY_FK] IS NOT NULL
    OR [CONDUCTOR_FK] IS NOT NULL
    OR [THIRD_PARTY_FK] IS NOT NULL
    ORDER BY customer_address_key ASC;

    DELETE customer_address
    WHERE [BENEFICIARY_FK] IS NOT NULL
    OR [CONDUCTOR_FK] IS NOT NULL
    OR [THIRD_PARTY_FK] IS NOT NULL
    AND customer_address_key > @LargestKeyProcessed
    AND customer_address_key <= @NextBatchMax;

    SET @RC = @@ROWCOUNT;
    SET @LargestKeyProcessed = @NextBatchMax;

    END

    Comment by MlbSQL — March 6, 2018 @ 1:50 pm

  19. Hi MlbSQL,
    That’s certainly how I would do it. The IS NOT NULL filter shows up the exact same way as the CustomerKey filter shows up in my example.
    The way you’ve written it, presumably the customer_address_key is the first column of the clustered index.
    Make sure to measure on a test table so that the duration of any particular query is acceptable.

    I can’t by inspection tell whether any particular batching implementation is adequate.
    Test and measure yourself and then look at the numbers.

    Good luck

    Comment by Michael J. Swart — March 6, 2018 @ 2:12 pm

  20. Hello Michael J. Swart thanks for your answer, sounds good and I’m gonna check the timings and see how does it work, also.. do you think the same approach would work for and update statements as shown below ?

    DECLARE
    @LargestKeyProcessed INT = -1,
    @NextBatchMax INT,
    @RC INT = 1;

    WHILE (@RC > 0)
    BEGIN

    SELECT TOP (10000) @NextBatchMax = cust_infoKey
    FROM cust_info
    WHERE cust_infoKey > @LargestKeyProcessed
    and type=”
    ORDER BY cust_infoKey ASC;

    update cust_info
    set type=null
    where type=”
    AND cust_infoKey > @LargestKeyProcessed
    AND cust_infoKey <= @NextBatchMax;

    SET @RC = @@ROWCOUNT;
    SET @LargestKeyProcessed = @NextBatchMax;

    END

    Comment by MlbSQL — March 6, 2018 @ 2:49 pm

  21. It’s difficult to continue to review examples of your code. Have a colleague of yours review it.
    Good luck,
    Michael

    Comment by Michael J. Swart — March 6, 2018 @ 2:51 pm

  22. Hi Michael,
    I have a few very large (100 million rows+) tables which require purging of old data on a regular basis. The current table structure uses a compound key based on DATETIME and BIGINT IDENTITY. I delete in batches by the DATETIME column. I ran a small test and in fact the logical reads increased as it deleted more and more rows. How can I use your technique to improve the delete performance? Or should I rearchitect the table to just use the BIGINT IDENTITY a the key?

    Thanks in advance.

    Comment by Tony Green — May 10, 2018 @ 12:04 pm

  23. Great question Tony, but I think you’ve already learned the most important lesson which is to measure this.
    There’s no easy answer because every situation is different.

    The thing I would focus on is the indexes that get used to find the next set of rows to process.
    You say that the current structure uses a key that looks like (datetime, bigint).
    Some questions I would ask are: Is it the Primary Key? Is it the clustered index? Or is it both?
    Is the key exactly in that order?

    Because I would expect the technique to work IF the clustered index is on (datetime) or (datetime,bigint) in that order

    Comment by Michael J. Swart — May 10, 2018 @ 2:11 pm

  24. Hi Michael,

    Thank you for explaining that thoroughly. I have a table with 100 million rows and more. I am tasked to purge the data older than 9 months and for this I have created a script by using your technique (used identity column to process 10000 records). However, I am seeing excessive memory grant in my execution plan. As far as I understand, this is generally associated with varchar(x) or nvarchar(x) which are the data types of my tables. There are indexes that support my delete statement as well. Can you advise what should be looked at to get rid of that warning?

    Comment by Sid — June 6, 2018 @ 11:13 am

  25. Hi Sid,
    A small batch size (like 10K records) should only need to focus on processing a small number of records, so the warning is a surprise.
    But this really depends on context.

    My first thought is to explore the idea that an excessive memory grant warning actually indicates a problem.
    Double check (by measuring, not by inspection) that the execution plan isn’t somehow falling into the naive batching pattern.
    If you reduce the size of each batch, is there a batch-size that reduces the memory grant?
    Another idea is to maybe create a temporary index on the column that stores the date, and see if naive batching works (remember, naive batching only failed in my example because there was no index on CustomerKey)

    Write back and let me know what solution you come up with.

    Comment by Michael J. Swart — June 6, 2018 @ 11:34 am

  26. Thanks for replying Michael. So, let me give a brief background. I have few tables where the data older than 9 months needs to be deleted. What I have done is (sorry for not clear enough earlier) the table that contains metadata contains date column and identity column and others. So, I am getting the 10000th record from this metadata table and using this as MaxRun (in the other tables) to convert into careful batching. Therefore, I am ending up deleting close to 60k – 70K records that are between first record and 10000th record. I am only using date column where I am checking if a record exist that is 9 months old, if does then continue with 10000 chunk and not in my actual delete statements. I want to make sure that my other queries are not deprived of memory when this process is running and this will be a regular maintenance job (once I have deleted almost 2 years of data). However, what I have noticed is that my actual number of rows and estimated number of rows are different. At this time, I am still investigating.

    Comment by Sid — June 7, 2018 @ 8:29 am

  27. […] variables against values in the clustered index of the table we are updating. (Or deleting, as Michael J Swart showed in an example a few years […]

    Pingback by Batching data manipulation is great as long as you do it correctly | The Desert DBA — September 15, 2018 @ 4:04 pm

  28. Why do you / we not just select all PK-ID from the customer into a #temp-Table and run a DELETE using it:

    SELECT OnlineSalesKey
    INTO #tmp
    FROM FactOnlineSales
    WHERE CustomerKey = 19036
    ;

    DELETE FROM t
    FROM FactOnlineSales AS t
    WHERE EXISTS(SELECT * FROM #temp AS tmp where tmp.OnlineSalesKey = t.OnlineSalesKey)

    Comment by Thomas Franz — November 19, 2018 @ 4:26 am

  29. Hi Thomas, you could do that. But it wouldn’t be faster and would take some extra space. The initial SELECT INTO will do the same scan as naive batching. Because it takes such a long time, there’s a chance that new rows with CustomerKey = 19036 can get added after the SELECT statement and then get missed with the DELETE statement.
    It reminds me of another technique: create a filtered index (online!) and delete it after.

    Both of these methods may be simpler, but they don’t let us off the hook. We’d still have to measure the timing.

    There’s probably a lot of other ways to tackle this particular problem as it’s presented. But sometimes in other situations, batching is the right choice and the point of this article is to present a way to do batching safely and online (and to measure and demonstrate that it’s safe).

    Comment by Michael J. Swart — November 19, 2018 @ 9:00 am

  30. @Thomas Franz I agree, that’s pretty much how I do it.

    Get all the Clustered Index key column(s) [rather that PKey, which might not be Clustered], and also an IDENTITY, into a #TEMP table, ORDER BY the Clustered Index column(s).

    That needs just one query to find all the relevant rows, and they are ordered in #TEMP by ID

    Then JOIN the #TEMP table to the FactOnlineSales ON the Clustered Index column(s), and loop using a RANGE on the ID column in #TEMP. Use a WAITFOR between each delete loop iteration to allow other processes to run, and alter the batch size depending on the loop duration so that when the system gets busy the batch size reduces – if over a preset threshold reduce batch size by 50%, if under then increase batch size by 10%.

    Comment by Kris — November 19, 2018 @ 5:55 pm

  31. Would is be worth the effort to create a partition scheme, set the partition function to Customer Key and then drop the partition? Too much work? No benefit?. I’d heard that a sliding window partition was a good way to delete large blocks of data with minimal locking & blocking?

    Comment by SqlNightOwl — November 19, 2018 @ 7:00 pm

  32. When dropping partitions that way, all the indexes on the table need to be partitioned or “aligned”.
    It’s not too much work, but it comes with a lot of dependencies.
    Every scenario including partitioning is worth exploring. Always measure 🙂

    Comment by Michael J. Swart — November 19, 2018 @ 7:11 pm

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress