Michael J. Swart

September 11, 2017

Remove SQL Junk (Brackets and Other Clutter)

Filed under: Miscelleaneous SQL,SQL Scripts,Technical Articles — Michael J. Swart @ 10:42 am

“Above all else, show the data” says Edward Tufte. He’s the data visualization expert who promotes a high data-ink ratio in data visualizations. He describes data-ink as “the non-erasable core of a graphic”. In other words, avoid chartjunk.

SQL Junk

I buy that. Less is More. And we can apply that idea to SQL. If SQL is going to be maintained by a human, it’s best to use a style that is easy on the reader. There’s a lot of syntax in SQL that is redundant. By keeping only the non-erasable syntax in SQL statements, the SQL gets easier to understand and maintain.

Square Brackets

I know I’m not the only one who thinks so. I was watching Kendra Little’s presentation SSMS Shortcuts & Secrets when someone asked whether there was a shortcut for removing brackets. There isn’t. Most people search and replace “[” with “” and then do the same thing with “]”. Kendra wondered if there was a regular expression that allowed a user to do both. There is, but it’s awkward because brackets need to be escaped: [\[\]] (which is only slightly easier to remember than ¯\_(ツ)_/¯).

SQL Junk in SSMS Generated Scripts

SSMS’s scripting engine is great at taking an object like a table and giving you a CREATE statement that will perfectly recreate what you need. But it’s a huge contributor of SQL Junk. I’ve seen so much SSMS-generated SQL get checked in to repositories that the style is assumed to be a best practice.

Here’s something that shows just how much SQL Junk can be removed. I think the resulting SQL is so much more clear and so much easier to maintain.
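To make that concrete, here’s a hypothetical before-and-after (my own illustration, not the original post’s example). The first block is the style SSMS generates; the second is an equivalent statement with the junk stripped out:

/* SSMS-generated style */
CREATE TABLE [dbo].[Users](
    [UserId] [int] IDENTITY(1,1) NOT NULL,
    [DisplayName] [nvarchar](100) NOT NULL,
 CONSTRAINT [PK_Users] PRIMARY KEY CLUSTERED 
(
    [UserId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
 
/* The same table with the SQL Junk removed */
CREATE TABLE dbo.Users
(
    UserId int IDENTITY NOT NULL
        CONSTRAINT PK_Users PRIMARY KEY,
    DisplayName nvarchar(100) NOT NULL
);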

August 2, 2017

Problem With Too Many version_ghost_records

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 8:00 am

We encountered a CPU issue which took over a month to understand and fix. Since then, it’s been another couple months and so I think it may be time for a debrief.

The cause was identified as a growing number of ghost records that SQL Server would not clean up no matter what. Our fix was ultimately to restart SQL Server.

When there's somethin' strange in your neighborhood

Symptoms and Data Collection

Here’s what we found.

  • At time marked ‘A’ on the graph, we noticed that CPU increased dramatically. It was hard not to notice.
  • We used sp_whoisactive to identify a frequent query that was taking a large amount of CPU. That query had never been a problem before. It was a select against a queue table – a table whose purpose was to store data for an implementation of a work queue. This table had a lot of churn: many INSERTS and DELETES. But it was small, never more than 100 rows.
  • So next, we ran the query manually in Management Studio. SET STATISTICS IO, TIME ON gave us a surprise. A simple COUNT(*) of the table told us there were 30 rows in the table, but reading it took 800K logical page reads!
  • What pages could it be reading? It’s impossible for a table to be that fragmented; that would mean less than one row per page. To look at the physical stats, we ran:
    select 
           sum(record_count) as records,
           sum(ghost_record_count) as ghost_records,
           sum(version_ghost_record_count) as version_ghost_records
      from sys.dm_db_index_physical_stats(db_id(), object_id('<table_name>'), default, default, 'detailed')
     where index_id = 1
           and index_level = 0

    And that gave us these results:


    Interesting. The ghost records that remain are version_ghost_records, not ghost_records. Which sounds like we’re using some sort of snapshot isolation (which we’re not), or online reindexing (which we are), or something else that uses row versions.

  • Over time, those version_ghost_records would constantly accumulate. This ghost record accumulation was also present in all other tables, but it didn’t hurt as much as it did in the queue table, which had the most frequent deletes.

Mitigation – Rebuild Indexes

Does an index rebuild clean these up? In this case, yes. An index rebuild reduced the number of version ghost records for the table. At the time marked ‘B’ in the timeline, we saw that an index rebuild cleaned up these records and restored performance. But only temporarily. The version_ghost_records continued to accumulate gradually.

At time ‘C’ in the timeline, we created a job that ran every 15 minutes to rebuild the index. This restored performance to acceptable levels.
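For reference, the rebuild itself was nothing exotic. The job ran something along these lines (the object names here are placeholders, not the real ones):

-- Placeholder names; the real job targeted the queue table's clustered index
ALTER INDEX PK_WorkQueue ON dbo.WorkQueue
REBUILD WITH ( ONLINE = ON );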

More investigation online

Kendra Little – Why Is My Transaction Log Growing In My Availability Group?
This is a great video. Kendra describes a similar problem. Long running queries on secondary replicas can impact the primary server in an Availability Group (AG). But we found no long running queries on any replica. We looked using sp_whoisactive and DBCC OPENTRAN. We didn’t see any open transactions anywhere.

Amit Banerjee – Chasing the Ghost Cleanup in an Availability Group
Amit also mentions that log truncation would be prevented in the case of a long-running query on a replica. However, in our case, log truncation was occurring.

Uwe Ricken – Read Committed Snapshot Isolation and high number of version_ghost_record_count
Uwe Ricken also blogged recently about a growing number of version_ghost_records. He talked about looking for open transactions that use one of the snapshot isolation levels. Unfortunately it didn’t apply to our case.

Bitemo Erik Gergely – A very slow SQL Server
Another example of a long running query keeping version_ghost_records around.

dba.stackexchange – GHOST_CLEANUP Lots of IO and CPU Usage
This stackexchange question also describes a problem with lots of CPU and an inefficient, ineffective ghost cleanup task for databases in an AG. There’s an accepted answer there, but it’s not really a solution.

Calling Microsoft Support

So we called Microsoft support. We didn’t really get anywhere. We spoke with many people over a month. We generated memory dumps and PSSDiag sessions, and we conducted a couple of screen-sharing sessions. Everyone was equally stumped.

After much time and many diagnostic experiments, here’s what we did find.

  • Availability Groups with readable secondaries are necessary (but not sufficient) to see the problem. This is where the version_ghost_records come from in the first place. Readable secondaries make use of the version store.
  • We ran an extended event histogram on the ghost_cleanup event. There was a ridiculous number of events, like millions per minute, but they weren’t actually cleaning up anything (a query to read this histogram appears just after this list):
    CREATE EVENT SESSION ghostbusters ON SERVER 
    ADD EVENT sqlserver.ghost_cleanup( ACTION( sqlserver.database_id ) )
    ADD TARGET package0.histogram( SET filtering_event_name=N'sqlserver.ghost_cleanup', source=N'sqlserver.database_id' )
  • Microsoft let me down. They couldn’t figure out what was wrong. Each new person on the case had to be convinced that there were no open transactions. They couldn’t reproduce our problem. And on our server, we couldn’t avoid the problem.
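If you want to inspect that ghostbusters histogram yourself, a query along these lines works. It mirrors the recompile histogram query that shows up later on this page:

SELECT DB_NAME( shredded.database_id ) as database_name,
       shredded.event_count
  FROM sys.dm_xe_session_targets AS xet
  JOIN sys.dm_xe_sessions AS xe
       ON (xe.address = xet.event_session_address)
 CROSS APPLY ( SELECT CAST(xet.target_data as xml) ) as target_data_xml ([xml])
 CROSS APPLY target_data_xml.[xml].nodes('/HistogramTarget/Slot') AS nodes (slot_data)
 CROSS APPLY (
         SELECT nodes.slot_data.value('(value)[1]', 'int') AS database_id,
                nodes.slot_data.value('(@count)[1]', 'int') AS event_count
       ) as shredded
 WHERE xe.name = 'ghostbusters'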

Resolution

Ultimately the time came for an unrelated maintenance task. We had to do a rolling upgrade for some hardware driver update. We manually failed over the Availability Group and after that, no more problem!

It’s satisfying and not satisfying. Satisfying because the problem went away for us and we haven’t seen it since. After the amount of time I spent on this, I’m happy to leave this problem in the past.

But it’s not satisfying because we didn’t crack the mystery. And restarting SQL Server is an extreme solution for a problem associated with an “Always On” feature.

If you’re in the same boat as I was, go through the links in this post. Understand your environment. Look for long running queries on all replicas. And when you’ve exhausted other solutions, mitigate with frequent index rebuilds and go ahead and restart SQL Server when you can.

July 24, 2017

Solve this Math Puzzle using SQL

Filed under: Technical Articles — Michael J. Swart @ 9:08 am

IBM holds a monthly math/programming challenge called “Ponder This”.
This month’s problem (July 2017) is an interesting one because it can (mostly) be solved using SQL and high school math:


“Find a way to assign a unique digit to each letter such that the 863,172nd digit after the decimal point of sin(PONDER) and of sin(THIS) are not zeros, but more than a fifth of the first million digits of sin(PONDER)*sin(THIS) are zeros.

“The parameters are in degrees, and we count the digits from 1, so for example, sin(100) = 0.9848077530122… so the fifth digit after the decimal point of sin(100) is zero.”
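As a quick sanity check of that example (my own aside, not part of the puzzle), SQL Server agrees; note the cast to float, since RADIANS of a decimal keeps the decimal’s scale:

SELECT SIN( RADIANS( CAST( 100 as float ) ) ) as sin_100_degrees;  -- 0.984807753012208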

If you’re up to the challenge, use any technique you like, but you’ve only got a week. Next week (August 1st) after IBM publishes their solution, I’ll publish the SQL based solution that I used.

Update: My Solution

First I realized that PONDER and THIS represent integers greater than 360 degrees. That’s okay. All I need to do is find two numbers A and B between 1 and 360 with the desired properties and then take that solution and add 360 to either number until I get a six digit number and a four digit number with no shared digits.

So my guess is that SIN(A) * SIN(B) is a “regular” number, one whose decimal expansion terminates (which would make most of its first million digits zeros).
I can look for that using SQL:

with Nums as
(
    select top 360 row_number() over (order by (select 1)) as n
    from sys.syscolumns
)
select A.n as a, 
       B.n as b, 
       sinA, 
       sinB,
       sinA * sinB as sinA_x_sinB
from Nums A
cross join Nums B
cross apply 
      (
         select SIN(RADIANS(A.n + 0.0)) as sinA,
                SIN(RADIANS(B.n + 0.0)) as sinB
      ) calc
where right(format(sinA * sinB, '###.00000000000'), 4) = '0000'
  and right(format(sinA, '###.00000000000'), 4) <> '0000'
  and right(format(sinB, '###.00000000000'), 4) <> '0000'
  and A.n % 10 <> B.n % 10
order by A.n, B.n

Notice the one extra filter I snuck in there: A.n % 10 <> B.n % 10. If A and B share the last digit, then PONDER and THIS would share the last digit and we can’t have that.

I’m going to focus on one solution, A = 54, and B = 18.
Can I add 360 to each of these to get a 6-digit and a 4-digit solution for PONDER and THIS? Yes.

select TOP (27) 54 + 360 * row_number() over (order by (select 1)) as id
into #nums4
from sys.messages a;
delete #nums4 where id < 1000;
 
select TOP (2777) 18 + 360 * row_number() over (order by (select 1)) as id
into #nums6
from sys.messages a;
delete #nums6 where id < 100000;
 
select #nums4.id, #nums6.id 
from #nums4
cross join #nums6
where dbo.HasDuplicate(10000*#nums6.id+#nums4.id) = 0

Where HasDuplicate is a CLR function I wrote that returns whether a number has repeat digits.
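That CLR function isn’t shown here, but if you’d rather not deploy CLR, a pure T-SQL stand-in (my sketch, not the original function) could look like this:

CREATE FUNCTION dbo.HasDuplicate ( @n bigint )
  RETURNS bit
AS
BEGIN
    -- returns 1 if the decimal representation of @n contains a repeated digit, otherwise 0
    DECLARE @s varchar(20) = CAST( @n as varchar(20) );
    DECLARE @i int = 1;
    WHILE @i < LEN( @s )
    BEGIN
        IF CHARINDEX( SUBSTRING( @s, @i, 1 ), @s, @i + 1 ) > 0
            RETURN 1;
        SET @i += 1;
    END;
    RETURN 0;
END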
This gives me a few solutions, for example PONDER = 102978 and THIS = 3654.

I get a kick out of these puzzles and as an added bonus, I get to exercise my SQL skills.

Update 2

Stephen pointed out in the comments that I didn’t even check to see whether the 863,172nd digit of SIN(PONDER) or SIN(THIS) is nonzero. How did I do that? Well, I didn’t use SQL. I used google. Searching for “What’s the exact value of SIN(18)” gives me this page Exact Value of sin 18°. The internet’s amazing. So I trust that the exact value is sin(18) = ( -1 + sqrt(5) ) / 4 which has the same 863,172nd digit as sqrt(5) / 4. The same reasoning works for SIN(54) because sin(54) = ( 1 + sqrt(5) ) / 4.
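If you’d rather not take the internet’s word for it, those identities are easy to spot-check numerically (to float precision, anyway):

SELECT SIN( RADIANS( CAST( 18 as float ) ) ) as sin_18,
       ( SQRT( 5 ) - 1 ) / 4                 as check_18,
       SIN( RADIANS( CAST( 54 as float ) ) ) as sin_54,
       ( SQRT( 5 ) + 1 ) / 4                 as check_54;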

Rather than figure out how to calculate the 863,172nd digit of sqrt(5) / 4, I searched for and found the first million digits of sqrt(5). The internet’s amazing. I know that when you divide a number by four, any digit in that number usually has a limited scope of influence on the digits in the answer. I divided the digits around the 863172 place by 4 to convince myself that the 863172nd digit of sqrt(5) / 4 is not 0.

July 20, 2017

SQL Server UPSERT Patterns and Antipatterns

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 11:34 am
Upsert Patterns and Antipatterns
Good ways and bad ways to update or insert rows

In 2015, Postgres added support for ON CONFLICT DO UPDATE to their INSERT statements. Their docs say that using this syntax guarantees an atomic INSERT or UPDATE outcome; one of those two outcomes is guaranteed, even under high concurrency.

Now, that is one amazing feature, isn’t it? Postgres worked really hard to include that feature and kudos to them. I would trade ten features like Stretch DB for one feature like this. It makes a very common use case simple. And it takes away the awkwardness from the developer.

But if you work with SQL Server, the awkwardness remains and you have to take care of doing UPSERT correctly under high concurrency.

Easy things should be easy

I wrote a post in 2011 called Mythbusting: Concurrent Update/Insert Solutions. But since then, I learned new things, and people have suggested new UPSERT methods. I wanted to bring all those ideas together on one page. Something definitive that I can link to in the future.

The Setup

I’m going to consider multiple definitions of a procedure called s_AccountDetails_Upsert. The procedure will attempt to perform an UPDATE or INSERT to this table:

CREATE TABLE dbo.AccountDetails
(
    Email NVARCHAR(400) NOT NULL CONSTRAINT PK_AccountDetails PRIMARY KEY,
    Created DATETIME NOT NULL DEFAULT GETUTCDATE(),
    Etc NVARCHAR(MAX)
);

Then I’m going to test each definition by using a tool like SQL Query Stress to run this code

declare @rand float = rand() + DATEPART( second, getdate() );
declare @Email nvarchar(330) = cast( cast( @rand * 100 as int) as nvarchar(100) ) + '@somewhere.com';
declare @Etc nvarchar(max) = cast( newid() as nvarchar(max) );
exec dbo.s_AccountDetails_Upsert @Email, @Etc;

Antipattern: Vanilla Solution

If you don’t care about concurrency, this is a simple implementation. Nothing fancy here:

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  IF EXISTS ( SELECT * FROM dbo.AccountDetails WHERE Email = @Email )
 
    UPDATE dbo.AccountDetails
       SET Etc = @Etc
     WHERE Email = @Email;
 
  ELSE 
 
    INSERT dbo.AccountDetails ( Email, Etc )
    VALUES ( @Email, @Etc );

Unfortunately, under high concurrency, this procedure fails with primary key violations.

Antipattern: MERGE Statement

The MERGE statement still suffers from the same concurrency issues as the vanilla solution.

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
    MERGE dbo.AccountDetails AS myTarget
    USING (SELECT @Email Email, @Etc etc) AS mySource
        ON mySource.Email = myTarget.Email
    WHEN MATCHED THEN UPDATE
        SET etc = mySource.etc
    WHEN NOT MATCHED THEN 
        INSERT (Email, Etc) 
        VALUES (@Email, @Etc);

Primary Key violations are still generated under high concurrency. Even though it’s a single statement, it’s not isolated enough.

Antipattern: Inside a Transaction

If you try putting the vanilla solution inside a transaction, it makes the whole thing atomic, but still not isolated. You still get primary key violations:

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  BEGIN TRAN
    IF EXISTS ( SELECT * FROM dbo.AccountDetails WHERE Email = @Email )
 
      UPDATE dbo.AccountDetails
         SET Etc = @Etc
       WHERE Email = @Email;
 
    ELSE 
 
      INSERT dbo.AccountDetails ( Email, Etc )
      VALUES ( @Email, @Etc );
 
  COMMIT

Antipattern: Inside a Serializable Transaction

So this should be isolated enough, but now it’s vulnerable to deadlocks:

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
  BEGIN TRAN
    IF EXISTS ( SELECT * FROM dbo.AccountDetails WHERE Email = @Email )
 
      UPDATE dbo.AccountDetails
         SET Etc = @Etc
       WHERE Email = @Email;
 
    ELSE 
 
      INSERT dbo.AccountDetails ( Email, Etc )
      VALUES ( @Email, @Etc );
 
  COMMIT

Antipattern: Using IGNORE_DUP_KEY

I want to mention one more bad solution before I move on to the good solutions. It works and it’s interesting, but it’s a bad idea.

ALTER TABLE dbo.AccountDetails 
  DROP CONSTRAINT PK_AccountDetails;
 
ALTER TABLE dbo.AccountDetails
  ADD CONSTRAINT PK_AccountDetails 
  PRIMARY KEY ( Email ) WITH ( IGNORE_DUP_KEY = ON )
GO
 
CREATE PROCEDURE dbo.s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  -- This statement will not insert if there's already a duplicate row,
  -- But it won't raise an error either
  INSERT dbo.AccountDetails ( Email, Etc ) 
  VALUES ( @Email, @Etc );
 
  UPDATE dbo.AccountDetails 
     SET Etc = @Etc
   WHERE Email = @Email;

So what are some better solutions?

Pattern: Inside a Serializable Transaction With Lock Hints

Still my preferred solution.

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  BEGIN TRAN
 
    IF EXISTS ( SELECT * FROM dbo.AccountDetails WITH (UPDLOCK) WHERE Email = @Email )
 
      UPDATE dbo.AccountDetails
         SET Etc = @Etc
       WHERE Email = @Email;
 
    ELSE 
 
      INSERT dbo.AccountDetails ( Email, Etc )
      VALUES ( @Email, @Etc );
 
  COMMIT

This is a solid solution, but every implementation is different so every time you use this pattern, test for concurrency.

Pattern: MERGE Statement With Serializable Isolation

Nice, but be careful with MERGE statements.

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
 
    MERGE dbo.AccountDetails AS myTarget
    USING (SELECT @Email Email, @Etc etc) AS mySource
        ON mySource.Email = myTarget.Email
    WHEN MATCHED THEN UPDATE
        SET etc = mySource.etc
    WHEN NOT MATCHED THEN 
        INSERT (Email, Etc) 
        VALUES (@Email, @Etc);

For alternative syntax, skip the SET TRANSACTION ISOLATION LEVEL SERIALIZABLE and include a lock hint: MERGE dbo.AccountDetails WITH (HOLDLOCK) AS myTarget.
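In other words, something like this; it’s the same procedure as above, with the hint doing the work of the SET statement:

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
    MERGE dbo.AccountDetails WITH (HOLDLOCK) AS myTarget
    USING (SELECT @Email Email, @Etc etc) AS mySource
        ON mySource.Email = myTarget.Email
    WHEN MATCHED THEN UPDATE
        SET etc = mySource.etc
    WHEN NOT MATCHED THEN 
        INSERT (Email, Etc) 
        VALUES (@Email, @Etc);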

Pattern: Just Do It

Just try the INSERT, and catch and swallow any duplicate key exception:

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
 
BEGIN TRY
 
  INSERT INTO dbo.AccountDetails (Email, Etc) VALUES (@Email, @Etc);  
 
END TRY
 
BEGIN CATCH
 
  -- on a duplicate key error, update the existing row instead; re-throw anything else
  IF ERROR_NUMBER() IN (2601, 2627) 
    UPDATE dbo.AccountDetails
       SET Etc = @Etc
     WHERE Email = @Email;
  ELSE
    THROW;
 
END CATCH

I wrote a post about it in Ugly Pragmatism For The Win.
And I wrote about a rare problem with it in Case study: Troubleshooting Doomed Transactions.

I have no concurrency concerns here but because of this issue – not to mention performance concerns – it’s no longer my first preference.

Performance and Other Concerns

When describing each pattern so far, I haven’t paid attention to performance, just concurrency. There are many opportunities to improve the performance of these solutions. In particular, pay very close attention to your average use case. If your workload UPDATEs a row 99% of the time and INSERTs a row only 1% of the time, then the optimal implementation will look different than if the UPDATE and INSERT frequencies are reversed.
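For example, if the UPDATE really is the overwhelmingly common case, an update-first variation can skip the separate existence check. This is my own sketch, not one of the patterns above, so treat it the same way: test it for concurrency before trusting it.

CREATE PROCEDURE s_AccountDetails_Upsert ( @Email nvarchar(4000), @Etc nvarchar(max) )
AS 
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  BEGIN TRAN
 
    UPDATE dbo.AccountDetails WITH (UPDLOCK)
       SET Etc = @Etc
     WHERE Email = @Email;
 
    -- only insert if the update didn't find the row
    IF @@ROWCOUNT = 0
      INSERT dbo.AccountDetails ( Email, Etc )
      VALUES ( @Email, @Etc );
 
  COMMIT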

Check out these links (thanks Aaron Bertrand)

July 16, 2017

10 Things I Learned While Working At D2L

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication — Michael J. Swart @ 3:49 pm

Today marks my tenth anniversary working for D2L and it also marks ten years since I specialized in databases full time.
I’ve never worked anywhere this long. I always told myself that I’d stick around as long as there were new and interesting challenges to tackle and there’s no end in sight.

Recognition
So I thought I’d compile a list of things I didn’t know in 2007.

1. Database Performance Improvements Are Often Dramatic

A poorly performing database query can run orders of magnitude slower than a tuned query. Compare that to poorly performing application code. Improvements there are usually more modest.

But this can work out. It makes you look like a genius. You get to brag about how that report that once took 32 hours now takes 10 minutes. Even though that missing index suggestion in the query plan turned out to be right this time, take the credit, you deserve it.

2. The Community Has Your Back

Speaking of appearing like a genius, here are three resources I still use when solving problems.

3. People Love Pictures With Their Content

I’ve been writing here for nearly ten years. One day, I decided to include an illustration with each blog post. For some reason it makes each post just a little more compelling than a wall of text. Drawing has been a very rewarding hobby for me. I highly recommend it.

4. There Are Dozens of Walkable Lunch Options In Downtown Kitchener

Here are a few of my favorites

  • Kava Bean: Excellent all day breakfast
  • City Cafe: Awesome bagels in the morning.
  • Darlise: You have to hunt for it, but it’s great food
  • Holy Guacamole: Especially if you’re a fan of cilantro

If you’re in town, send me an email. I’ll introduce you to one of these places.

5. Don’t Overestimate Normal Forms

Knowing normal forms is a lot less useful than I thought it would be. Normal forms are an academic way of describing a data model and its ability to avoid redundant data. But knowing the difference between the different normal forms has never helped me avoid trouble.

Just use this rule of thumb: Avoid storing values in more than one place. That will get you most of the way. Do it right and you’ll never hear anyone say things like

  • We need a trigger to keep these values in sync
  • We need to write a script to correct bad data

6. “How Do You Know That?”

This is one of my favorite lines. I use it when troubleshooting a technical problem. It’s a line that helps me avoid red herrings, or find out whether the info reached me via a game of telephone. It’s kind of my version of Missouri’s “you’ve got to show me”. It might go something like this:

Say a colleague says “I/O is slow lately.”

So I’d ask “How do you know?”

They might say something like:

  • “Hinky McUnreliable thought that’s what it was. He told me in the hallway just now”
  • The error log reports 3 instances of I/O taking longer than 15 seconds
  • I measured it on the QA box.

7. Reduce Query Frequency

The impact of a particular procedure or query on a server is simply the frequency of a query multiplied by the cost of that query. To reduce the impact on the server it’s often more valuable to focus on reducing the number of calls to that query in addition to tuning it.

  • Look for duplicate queries. You’d be surprised how often a web page asks for the same information.
  • Cache the results of queries that are cache-able. I rely on this one heavily.
  • Consider techniques like lazy loading. Only request values the moment they’re needed.

8. Schema Drift Is Real

If you have multiple instances of a database then schema will drift over time. In other words, the definition of objects like tables, views and procedures will be out of alignment with what’s expected. The drift is inevitable unless you have an automated process to check for misalignment.

We wrote our own but tools like Redgate’s Schema Compare can help with this kind of effort.

9. Get Ahead Of Trouble

Software will change over time and it requires deliberate attention and maintenance to remain stable. Watch production and look for expensive queries before they impact end users.

When I type it out like that, it seems obvious. But the question really then becomes, how much time do you want to invest in preventative maintenance? My advice would be more.

Not enough time means that your team gets really good practice firefighting urgent problems. I work directly with a dream team of troubleshooters. They’re smart, they’re cool during a crisis and they’re awesome at what they do. Fixing issues can be thrilling at times, but it pulls us away from other projects.

And what do they say about people who can’t remember the past? Sometimes we have to relearn this lesson every few years.

10. D2L Is A Great Employer

D2L values learning of course and that includes employee professional development. They have always supported my blogging and speaking activities and they recognize the benefits of it.

The people I work with are great. I couldn’t ask for a better team.

And we’ve got a decent mission. I’m really glad to be part of a team that works so hard to transform the way the world learns. Thanks D2L!

July 6, 2017

Converting from DateTime to Ticks using SQL Server

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 9:55 am

A .NET tick is a duration of time lasting 0.1 microseconds. When you look at the Ticks property of DateTime, you’ll see that it represents the number of ticks since January 1st 0001.
But why 0.1 microseconds? According to stackoverflow user CodesInChaos “ticks are simply the smallest power-of-ten that doesn’t cause an Int64 to overflow when representing the year 9999”.

Even though it’s an interesting idea, just use one of the datetime data types; that’s what they’re there for. I avoid ticks whenever I can.

ticks

But sometimes things are out of your control. If you do want to convert between datetime2 and ticks inside SQL Server instead of letting the application handle it, then you need to do some date math. It can get a bit tricky avoiding arithmetic overflows and keeping things accurate:

DateTime2 to Ticks

CREATE FUNCTION dbo.ToTicks ( @DateTime datetime2 )
  RETURNS bigint
AS
BEGIN
 
    RETURN DATEDIFF_BIG( microsecond, '00010101', @DateTime ) * 10 +
           ( DATEPART( NANOSECOND, @DateTime ) % 1000 ) / 100;
END

DateTime2 to Ticks (Versions < SQL Server 2016)

CREATE FUNCTION dbo.ToTicks ( @DateTime datetime2 )
  RETURNS bigint
AS
BEGIN
    DECLARE @Days bigint = DATEDIFF( DAY, '00010101', cast( @DateTime as date ) );
    DECLARE @Seconds bigint = DATEDIFF( SECOND, '00:00', cast( @DateTime as time( 7 ) ) );
    DECLARE @Nanoseconds bigint = DATEPART( NANOSECOND, @DateTime );
    RETURN  @Days * 864000000000 + @Seconds * 10000000 + @Nanoseconds / 100;
END

Ticks to DateTime2

CREATE FUNCTION dbo.ToDateTime2 ( @Ticks bigint )
  RETURNS datetime2
AS
BEGIN
    DECLARE @DateTime datetime2 = '00010101';
    SET @DateTime = DATEADD( DAY, @Ticks / 864000000000, @DateTime );
    SET @DateTime = DATEADD( SECOND, ( @Ticks % 864000000000) / 10000000, @DateTime );
    RETURN DATEADD( NANOSECOND, ( @Ticks % 10000000 ) * 100, @DateTime );
END
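A quick round-trip check (my own example) to convince yourself the math lines up:

DECLARE @now datetime2 = SYSUTCDATETIME();
DECLARE @ticks bigint = dbo.ToTicks( @now );
 
SELECT @now                      as original_value,
       @ticks                    as ticks,
       dbo.ToDateTime2( @ticks ) as round_tripped; -- matches @now to the nearest 100 nanoseconds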

May 24, 2017

A Table Of Contents For the Data Industry

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 10:52 am

A review of Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann, published by O’Reilly.

Martin Kleppmann

In my time exploring the world of data, I often feel intrigued by the unfamiliar. But like most, I’m leery of buzzwords and maybe a little worried that I’m missing out on the thing that we’ll all need to learn to extend our careers beyond the next five years.

Adding to the frustration, I have difficulty evaluating new technologies when the only help I’ve got is a vendor’s brochure, or a blog titled “Why X is way way better than Y”. Balanced, unbiased help is hard to find. And I really value unbiased ideas. Personally, I try not to appear the Microsoft groupie. I value my MVP award, but I also like to stress the “independent” part of that.

I’m also wary of some academic researchers. The theory can sometimes drift too far away from the practical (dba.stackexchange only has 411 results for “normal form” and I suspect many of those are homework questions). Some academics almost seem offended whenever a vendor deviates from relational theory.

That’s why I was so thrilled (and a bit relieved) to discover Martin Kleppmann’s Designing Data-Intensive Applications. Martin is an amazing writer who approached his book with a really balanced style. He’s also a researcher with real-world experience that helps him focus on the practical.

Designing Data-Intensive Applications

First let me just say that the book has a really cool wild boar on the cover. The boar reminds me of Porcellino at the University of Waterloo.

Boar

Designing Data-Intensive Applications is a book that covers database systems very comprehensively. He covers both relational systems and distributed systems. He covers data models, fault tolerance strategies and so much more. In fact he covers so many topics that the whole book seems like a table of contents for our data industry.

Here’s the thing. When I read the parts I know, he’s a hundred percent right and that helps me trust him when he talks about the parts that I don’t know about.

Martin talks a lot about distributed systems, both the benefits and drawbacks, and even though you may have no plans to write a Map-Reduce job, you’ll be equipped to talk about it intelligently with those that do. The same goes for other new systems. For example, after reading Martin’s book, you’ll be able to read the spec sheet on Cosmos DB and feel more comfortable reasoning about its benefits (but that’s a post for another day).

Event Sourcing

Martin then goes on to write about event sourcing. Martin is a fan of event sourcing (as are several of my colleagues). In fact I first learned about event sourcing when a friend sent me Martin’s video Turning the Database Inside Out With Apache Samza.
A ridiculously simplified summary is this. Make the transaction log the source of truth and make everything else a materialized view. It allows for some really powerful data solutions and it simplifies some problems you and I have learned to live with. So I wondered whether his chapter on event sourcing would sound like a commercial. Nope, that chapter is still remarkably well balanced.

By the way, when the revolution comes it won’t bother me at all. There’s no longer any such thing as a one-size-fits-all data system, and the fascinating work involves fitting all the pieces together. I’m working in the right place to explore this brave new world, and I’m excited to learn the best way to move from here to there.

Some Other Notes I Made

  • This feels like the Red Book put together by Michael Stonebraker, but it deals with more kinds of systems. I hope Martin refreshes this book every few years with new editions as the industry changes.
  • Martin suggests that the C in ACID might have been introduced for the purpose of the acronym. I knew it!
  • Martin calls the CAP theorem unhelpful and explains why. He admits the CAP theorem “encouraged engineers to explore a wider design space” but it is probably better left behind.
  • My wife Leanne doesn’t like fantasy books and she won’t read a book that has a map in the front. My friend Paul won’t read a book without one. Martin is very professional, but his style shows and I love it. He’s got a map at the beginning of every chapter.
  • The quotes at the beginning of each chapter are really well chosen. They come from Douglas Adams, Terry Pratchett and others. But my favorite is from Thomas Aquinas “If the highest aim of a captain were to preserve his ship, he would keep it in port forever.”

Martin writes “The goal of this book is to help you navigate the diverse and fast-changing landscape of technologies for processing and storing data”. I believe he met that goal.

May 5, 2017

Drawing (again) with SQL Server

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Tongue In Cheek — Michael J. Swart @ 10:41 am

You have my permission to skip this post. This one’s just for me. So I’ve been drawing again with SQL Server’s spatial results tab; the first time I posted something like this was Botticelli’s Birth of Venus in More images from the Spatial Results Tab.

Why Michael??

Because it’s a stupid challenge and I wanted to see what I could do with it. The SSMS spatial tab is a lousy, crummy medium for images. It really is quite terrible, and using SSMS to draw imposes restrictions and rules. It’s fun to see what I can do by staying within that framework. It’s something to push against just because it’s challenging. Others do crosswords. This week, I did this.

Why Now?

I realized a couple things lately.

The Colors Seem Dull … But Don’t Have to Be
I used to think the spatial results tab uses lousy colors, pastel and dull. I realized that they’re not dull, they’re just transparent. I can overlap polygons inside a geometry collection to get more solid colors. Here are the top 100 colors without transparency.

Grid
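Here’s a tiny example of the trick (my own illustration, not one of the queries behind these pictures). Each row in the Spatial results tab gets its own transparent color, and stacking the same polygon several times inside one geometry collection makes that row’s color look solid:

-- Row 1: the same square three times in one collection renders nearly opaque
SELECT geometry::STGeomFromText(
  'GEOMETRYCOLLECTION(
     POLYGON((0 0, 10 0, 10 10, 0 10, 0 0)),
     POLYGON((0 0, 10 0, 10 10, 0 10, 0 0)),
     POLYGON((0 0, 10 0, 10 10, 0 10, 0 0)))', 0)
UNION ALL
-- Row 2: a single copy shows the usual washed-out transparency
SELECT geometry::STGeomFromText('POLYGON((12 0, 22 0, 22 10, 12 10, 12 0))', 0);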

The Colors Seem Arbitrary … But Don’t Have to Be
The palette that SSMS uses is terrible. It’s almost as if the nth color is chosen using something like Color.FromArgb(new Random(n).Next()); Notice that colors 6 and 7 (the beige colors on the left side of the grid) are almost indistinguishable from each other. But I can use that. I can overlap different colors to get the color I need. And I can write a program to pick the best combination of overlaps. Here’s a nice red and blue:

SSB

But black remains difficult.

Curves Are Supported Now
I can use arc segments called CIRCULARSTRING. SVG files mostly use Bézier curves which cannot be translated easily to arc segments.
Here’s a logo that I rebuilt using arcs instead of Bézier curves:
d2l

For some reason, if you begin to use CIRCULARSTRING, the transparent colors won’t blend with themselves (just with other colors).
Also, arc segments are rendered as several small line segments anyway, so for my purposes it’s not a super feature.
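For reference, an arc segment is as simple as this (my own example); paste it into SSMS and look at the Spatial results tab:

SELECT geometry::STGeomFromText('CIRCULARSTRING(0 0, 5 5, 10 0)', 0) as arc;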

Polly

One last picture/query of a scarlet macaw. Click on it or any other picture in this post to get the query that generated it.
Parrot

April 18, 2017

Find Your Dark Queries

Filed under: Miscelleaneous SQL,SQL Scripts,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 11:36 am

My friend Josh came up with the term “dark queries”. Just like “dark matter”, dark queries can’t be detected directly, but their effect can still be observed. He’s talking about queries that aren’t stored in cache. If your monitoring solution relies heavily on the statistics of cached queries, then you may not be capturing all the activity on your server.

Obi-Wan never told you what happened to those queries

Some of my favorite monitoring solutions rely on cached queries, but some queries will fall out of cache or never make it into cache at all. Those are the dark queries I’m interested in today. Let’s look at query recompiles to shed light on some of those dark queries that maybe we’re not measuring.

By the way, if you’re using SQL Server 2016’s query store then this post isn’t for you because Query Store is awesome. Query Store doesn’t rely on the cache. It captures all activity and stores queries separately – Truth in advertising!

High Recompile Rate?

If you work with a high-frequency transactional workload like I do, then you can’t afford the CPU required for frequent recompiles. If you have sustained recompiles larger than a few hundred per second, that’s probably too much. It’s easy to check. Use the performance monitor to take a look at the SQL Re-Compilations/sec counter, which is found under the SQLServer:SQL Statistics object.

Way too many recompiles

Drill Into Recompile Causes
You can drill into this a little further with an extended event session stored to a histogram like this:

CREATE EVENT SESSION Recompile_Histogram ON SERVER 
  ADD EVENT sqlserver.sql_statement_recompile
  ADD TARGET package0.histogram (
      SET filtering_event_name=N'sqlserver.sql_statement_recompile',
          source=N'recompile_cause',
          source_type=(0) );
 
ALTER EVENT SESSION Recompile_Histogram ON SERVER STATE = START;

Query it like this

SELECT sv.subclass_name as recompile_cause, 
       shredded.recompile_count
  FROM sys.dm_xe_session_targets AS xet  
  JOIN sys.dm_xe_sessions AS xe  
       ON (xe.address = xet.event_session_address)  
 CROSS APPLY ( SELECT CAST(xet.target_data as xml) ) as target_data_xml ([xml])
 CROSS APPLY target_data_xml.[xml].nodes('/HistogramTarget/Slot') AS nodes (slot_data)
 CROSS APPLY (
         SELECT nodes.slot_data.value('(value)[1]', 'int') AS recompile_cause,
                nodes.slot_data.value('(@count)[1]', 'int') AS recompile_count
       ) as shredded
  JOIN sys.trace_subclass_values AS sv
       ON shredded.recompile_cause = sv.subclass_value
 WHERE xe.name = 'Recompile_Histogram' 
   AND sv.trace_event_id = 37 -- SP:Recompile

To get results like this:
recompile_reasons

Infrequent Recompiles?

Even if you don’t have frequent recompiles, it’s important to know what’s going on in the server. On your OLTP database, I bet you’re reluctant to let those BI folks run their analytical queries whenever they want. They may try to hide their shenanigans with an OPTION(RECOMPILE) hint. This extended event session captures those query recompiles. Be sure to capture only a small sample by having the session run for a short amount of time. Or you can use the histogram above to make sure that the frequency is low.

CREATE EVENT SESSION [DarkQueries] ON SERVER
  ADD EVENT sqlserver.sql_statement_recompile(
    ACTION(sqlserver.database_id,sqlserver.sql_text)
    WHERE ([recompile_cause]=(11))) -- Option (RECOMPILE) Requested
  ADD TARGET package0.event_file(SET filename=N'DarkQueries');
ALTER EVENT SESSION [DarkQueries] ON SERVER STATE = START;
GO

Take a look at the dark queries by executing this:

SELECT DarkQueryData.eventDate,
       DB_NAME(DarkQueryData.database_id) as DatabaseName,
       DarkQueryData.object_type,
       COALESCE(DarkQueryData.sql_text, 
                OBJECT_NAME(DarkQueryData.object_id, DarkQueryData.database_id)) command,
       DarkQueryData.recompile_cause
  FROM sys.fn_xe_file_target_read_file ( 'DarkQueries*xel', null, null, null) event_file_value
 CROSS APPLY ( SELECT CAST(event_file_value.[event_data] as xml) ) event_file_value_xml ([xml])
 CROSS APPLY (
         SELECT event_file_value_xml.[xml].value('(event/@timestamp)[1]', 'datetime') as eventDate,
                event_file_value_xml.[xml].value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') as sql_text,
                event_file_value_xml.[xml].value('(event/data[@name="object_type"]/text)[1]', 'nvarchar(100)') as object_type,
                event_file_value_xml.[xml].value('(event/data[@name="object_id"]/value)[1]', 'bigint') as object_id,
                event_file_value_xml.[xml].value('(event/data[@name="source_database_id"]/value)[1]', 'bigint') as database_id,
                event_file_value_xml.[xml].value('(event/data[@name="recompile_cause"]/text)[1]', 'nvarchar(100)') as recompile_cause
       ) as DarkQueryData
 ORDER BY eventDate DESC

This gives you results that look something like the following contrived example:
darkqueries

March 24, 2017

SQL vs. NoSQL

Filed under: SQLServerPedia Syndication — Michael J. Swart @ 9:56 am

It’s difficult to say anything meaningful about this topic. I can’t think of another topic that requires the phrase “it depends” more. And it’s true for a number of reasons.

  • You can’t compare them without context. You can’t say one is better than the other. That would be like asking which is better, a dump truck or a Ferrari? It depends on what you need to do (go fast or haul dirt?) and whether you have trained people who can use those machines.
  • SQL and NoSQL aren’t really well-defined. They simply aren’t. In fact, I can’t even really tell you which vehicle represents SQL and which represents NoSQL.

But I’m going to give it a shot anyway.

I’m Presenting “SQL vs. NoSQL” March 28, 2017 (Next Tuesday) in Kitchener

I’m excited to be invited to the next KW Big Data Peer2Peer meetup.
I’ll be giving a presentation called “All about SQL vs. NoSQL: A comparison of relational and non-relational data”. If you’re in town I’d love to see you there!

We’ll explore what things typically characterize SQL:

  • Normalization
  • ACID properties
  • Declarative query language, Query optimizers, write-ahead logs
  • What do you get for free in SQL? What things get awkward?

And of course, by contrast, we’ll talk about non-relational data and other data stores that break those rules.

It should be fun! I hope to see you there.
