Michael J. Swart

October 23, 2020

In Memory OLTP Defeated Our Tempdb Problems

Filed under: Miscelleaneous SQL,Technical Articles — Michael J. Swart @ 10:29 am
Scaling SQL Server High
The beginning of the school year is behind us and what a semester start! 2020 has been tough on many of us and I’m fortunate to work for a company whose services are in such high demand. In fact we’ve seen some scaling challenges like we’ve never seen before. I want to talk about some of them.

At D2L, we’re the perfect candidate customer for In Memory OLTP features, but we’ve held off adopting those features for years. Our servers handle tons of super quick but super frequent queries and so we find ourselves trying to address the same scaling challenges we read about in Microsoft’s customer case studies.

But there’s only one In Memory feature in particular that I care about. It’s the Memory Optimized Table Types. Specifically, I’ve always wanted to use that feature to avoid tempdb object allocation contention. Recently I finally got my chance with a lot of success. So even though I could say I’m happy with In Memory features, I think it’s more accurate to say that I feel relieved at having finally squashed my tempdb issues.

Summary of Article

The Trouble With Tempdb

We use table valued parameters with our procedures a lot (like thousands a second). We’re lucky that the table variables are not created on each execution, they’re cached. We rely heavily on the reduced overhead that this gives us. It’s for that reason we much prefer table variables over temp tables.

But when we crank up the demand, we can still run into catastrophic trouble. When tempdb contention hits us, throughput doesn’t just plateau, it drops hard. This kind of contention we see is like a kind of traffic jam where anyone who needs to use tempdb (i.e. everyone) has to wait for it. These tempdb traffic jams are rough. We even created a lighter version of sp_whoisactive that avoids tempdb issues for times like those.

I won’t go on too long about our troubles (I’ve written about tempdb issues a few times already: 1, 2, 3, 4, 5, 6, 7). The usual advice is to increase the number of tempdb data files. We were using 48 data files and really looking hard for other options.

SQL Server 2019 has some promising options. In TEMPDB – Files and Trace Flags and Updates, Oh My! Pam Lahoud points out how SQL Server can use all the PFS pages in the tempdb data files, not just the first available one. But we couldn’t move to 2019 that quickly. So we looked at Memory Optimized Table Types to help us.

Memory Optimized Table Types Can Help

Improving temp table and table variable performance using memory optimization tells us how. Our main goal is to avoid tempdb contention and memory optimized table variables don’t use tempdb at all. As long as we can be sure that the number of rows stored in these table variables is small, it’s all pros and no cons.

But it wasn’t easy for us to implement. In 2017, I wrote about Postponing Our Use Of In Memory OLTP. There were just some challenges that we couldn’t overcome. We’re not alone in struggling with the limitations of In Memory features. But our challenges weren’t the usual limitations that folks talk about and so they’re worth exploring.

The Challenge of Sardines and Whales

We have one product that we deploy to many clients. Each client gets their own database. The big ones (whales) have their own servers but the small ones (sardines) get grouped together.

Sardines and Whales

So the overhead of enabling In Memory on all the sardines was going to cause issues. The In Memory OLTP filegroup requires up to 4GB of disk space which isn’t easy to handle with hundreds of sardines. So we’re left with this dilemma. We’d like to use In Memory on the biggest whales, but not on the sardines. We tackled that in two ways

  1. Decide that the choice to add the In Memory filegroup is configuration, not product. This still required some changes though. Our backup and restore processes needed to at least handle the new filegroups, but they couldn’t expect it.
  2. Add an exception to our processes that allow schema drift in the definition table types. Our plan was to manually alter the table types to be memory optimized, but only on the largest whales. Introducing schema drift is not ideal, but we made this choice deliberately.

This whole challenge could have been avoided if memory optimized table types didn’t require an In Memory OLTP file group. I get the sense that the memory optimized table types don’t actually use that folder because I noticed something interesting. SQL Server 2019 introduces memory optimized tempdb metadata tables without a memory optimized filegroup! How did they pull that off? I’m a bit jealous. I asked Pam Lahoud and it turns out that the In Memory filegroups are still required for memory optimized table types and will continue to be. It turns out that Microsoft can make certain assumptions about the tempdb metadata tables that they can’t with regular table types. 😟

Some Implementation Surprises

As we implemented our plan, we also encountered some interesting things during testing that might be useful for you if you’re considering In Memory features.

  • The default directory for storing database files should point to a folder that exists on the database server and on any secondary nodes in the same availability group. So if the default location is E:\SQLData then make sure there’s an E drive on every node. SQL Server will need to create an xtp folder in there.
  • When adding the In Memory OLTP file group, the folder that contains it should also exist on all secondary nodes.
  • In SQL Server 2014, I noticed that the addition of the memory optimized file group required up to 4 Gb of space. In SQL Server 2016, I see that that still happens, but the space isn’t taken until the first memory optimized table type I create. That’s also when the xtp folder gets created.
  • Adjusting the table types to be memory optimized was a challenge because we wanted the process to be online. I wrote about how we pulled that off earlier this week in How to Alter User Defined Table Types (Mostly) Online

Success!

Things worked out really really well for us. Our main goal was to avoid tempdb contention and we succeeded there. But there’s an additional performance boost. When you insert into a regular table variable, that data gets written to tempdb’s transaction log. But that’s not the case for memory optimized tables. So even though I really just care about avoiding contention, the boost in performance is significant and measurable and really nice.

In testing we were finally able to push a 96 CPU machine up to 100% CPU on every core and only then did throughput plateau. No tempdb contention in sight.

In production we also saw the same behavior and we were able to sustain over 200K batch requests per second. No tempdb contention in sight.

Those numbers are nice, tempdb contention has been such a thorn in my side for so long, it’s such a relief to squash that issue once and for all. I now get to focus on the next bottleneck and can leave tempdb contention in the past.

October 19, 2020

How to Alter User Defined Table Types (Mostly) Online

Filed under: Miscelleaneous SQL,SQL Scripts,Technical Articles — Michael J. Swart @ 2:41 pm

Last year, Aaron Bertrand tackled the question, How To Alter User Defined Table Types. Aaron points out that “There is no ALTER TYPE, and you can’t drop and re-create a type that is in use”. Aaron’s suggestion was to create a new type and then update all procedure to use the new type.

I think I’ve got a bit of improvement based on sp_rename and sp_refreshmodule. Something that works well with

  • blue-green deployments,
  • both ad-hoc queries and procedures,
  • imperfectly understood schemas, like schemas that may have suffered from a little bit of schema drift.

Example

Say I have… I don’t know, let’s pick an example out of thin air. Say I have a simple table type containing one BIGINT column that I want to make memory optimized:

What I’ve Got

CREATE TYPE dbo.BigIntSet 
  AS TABLE ( 
    Value BIGINT NOT NULL INDEX IX_BigIntSet );

What I Want

CREATE TYPE dbo.BigIntSet 
  AS TABLE ( 
    Value BIGINT NOT NULL INDEX IX_BigIntSet )
  WITH (MEMORY_OPTIMIZED=ON);

I can’t directly ALTER this table type, but I can do this three-card monte trick using sp_rename to put the BigIntSet in its place.

The Migration Script

IF NOT EXISTS (
  SELECT * 
  FROM sys.table_types
  WHERE name = 'BigIntSet' 
  AND is_memory_optimized = 1
)
BEGIN
 
  CREATE TYPE dbo.BigIntSet_MO 
    AS TABLE ( 
      Value bigint NOT NULL 
      INDEX IX_BigIntSet ) 
    WITH (MEMORY_OPTIMIZED=ON);
 
  -- the switcheroo!
  EXEC sp_rename 'dbo.BigIntSet', 'zz_BigIntSet';
  EXEC sp_rename 'dbo.BigIntSet_MO', 'BigIntSet';
 
  --refresh modules
  DECLARE @Refreshmodulescripts TABLE (script nvarchar(max));
 
  INSERT @Refreshmodulescripts (script)
  SELECT 'EXEC sp_refreshsqlmodule ''' + QUOTENAME(referencing_schema_name) + '.' + QUOTENAME(referencing_entity_name) + ''';'
  FROM sys.dm_sql_referencing_entities('dbo.BigIntSet', 'TYPE');
 
  DECLARE @SQL NVARCHAR(MAX) = N'';
  SELECT @SQL = @SQL + script FROM @Refreshmodulescripts;
 
  EXEC sp_executesql @SQL;
 
END

But Is This Online?

Mostly. All queries that are in progress (whether ad-hoc or via procedures), continue to execute with no issues. However, there may be an issue with other queries that begin their execution during this migration.

If someone sends a query that uses the table type in the split second between the two sp_rename statements, then the query may fail.

If someone executes a procedure in the time between the first sp_rename and when sp_executesql gets around to refreshing that procedure, then the procedure may fail.

In practice, even on a busy server, I saw no such errors the few times I’ve tried this method, but of course, that’s no guarantee. In my case, even when refreshing close to 300 modules, this script takes about one second with no issues.

I actually tried adding a transaction around this whole migration script, and I did in fact see issues on a busy server. The schema modification lock that needs to be taken and held on all 300 objects was too much. It caused excessive blocking and I had to abandon that approach. In practice, I avoided trouble by ditching the explicit transaction.

June 19, 2020

Problem Solving by Cheating

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 10:34 am

Solving real-world problems is different than answering interview questions or twitter polls. The biggest difference is that real problems aren’t always fair. There’s not always a right answer.

Answer this multiple choice question:

Which of the following SQL statements is used to modify existing data in a table?
A) SELECT
B) INSERT
C) DELETE

Give it some thought. Which option would you pick? The correct answer is UPDATE but it wasn’t one of the options listed and that’s not fair. But neither is real life. Many real problems don’t have an easy answer and some real problems are impossible to solve. That can be discouraging.

A blue shell from MarioKart is about to attack a runner. Sometimes life just isn't fair.

Real Problems Allow For Creativity

But if your problems are unfair, then maybe you’re allowed to cheat too.

“None of the above” is always an option. Understand the goal so that you can stretch or ignore requirements.

Example – Changing an INT to a BIGINT

I have a table that logs enrollments into courses. It’s append only and looks something like this:

CREATE TABLE dbo.LOG_ENROLL (
    LogId INT IDENTITY NOT NULL,  -- This identity column is running out of space
    UserId INT NOT NULL,
    CourseId INT NOT NULL,
    RoleId INT NULL,
    EnrollmentType INT NOT NULL,
    LogDate DATETIME NOT NULL DEFAULT GETUTCDATE(),
 
    INDEX      IX_LOG_ENROLL_CourseId    CLUSTERED    ( CourseId, UserId ),
    CONSTRAINT PK_LOG_ENROLL PRIMARY KEY NONCLUSTERED ( LogId ),
    INDEX      IX_LOG_ENROLL_UserId      NONCLUSTERED ( UserId, CourseId ),
    INDEX      IX_LOG_ENROLL_LogDate     NONCLUSTERED ( LogDate, LogId )
);

The table has over 2 billion rows and it looks like it’s going to run out of space soon because the LogId column is defined as an INT. I need to change this table so that it’s a BIGINT. But changing an INT to a BIGINT is known as a “size of data” operation. This means SQL Server has to process every row to expand the LogId column from 4 to 8 bytes. But it gets trickier than that.

The biggest challenge is that the table has to remain “online” (available for queries and inserts).

Compression?
Gianluca Sartori (spaghettidba) had the idea of enlarging the columns with no downtime using compression. It’s promising, but I discovered that for this to work, all indexes need to be compressed not just the ones that contain the changed column. Also, any indexes which use the column need to be disabled for this to work.

Cheating
I gave up on solving this problem in general and constrained my focus to the specific problem I was facing. There’s always some context that lets us bend the rules. In my case, here’s what I did.

Ahead of time:

  • I removed extra rows. I discovered that many of the rows were extraneous and could be removed. After thinning out the table, the number of rows went from 2 billion down to 300 million.
  • I compressed two of the indexes online (IX_LOG_ENROLL_UserId and IX_LOG_ENROLL_CourseId) because I still want to use the compression trick.

But I’m not ready yet. I still can’t modify the column because the other two columns depend on the LogId column. If I tried, I get this error message:


Msg 5074, Level 16, State 1, Line 22
The index ‘IX_LOG_ENROLL_LogDate’ is dependent on column ‘LogId’.
Msg 5074, Level 16, State 1, Line 22
The object ‘PK_LOG_ENROLL’ is dependent on column ‘LogId’.
Msg 4922, Level 16, State 9, Line 22
ALTER TABLE ALTER COLUMN LogId failed because one or more objects access this column.

So I temporarily drop those indexes!

  • Drop the constraint PK_LOG_ENROLL and the index IX_LOG_ENROLL_LogDate
  • Do the switch! ALTER TABLE LOG_ENROLL ALTER COLUMN LogId BIGINT NOT NULL; This step takes no time!
  • Recreate the indexes online that were dropped.

Hang on, that last step is a size of data operation. Anyone who needs those indexes won’t be able to use them while they’re being built.
Exactly, and this is where I cheat. It turns out those indexes were used for infrequent reports and I was able to co-ordinate my index rebuild around the reporting schedule.

You can’t always make an operation online, but with effort and creativity, you can get close enough. I have found that every real problem allows for a great degree of creativity when you’re allowed to bend the rules or question requirements.

January 28, 2020

What Tables Are Being Written To The Most?

Filed under: Miscelleaneous SQL,SQL Scripts,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 10:38 am

You have excessive WRITELOG waits (or HADR_SYNC_COMMIT waits) and among other things, you want to understand where.

Microsoft’s advice Diagnosing Transaction Log Performance Issues and Limits of the Log Manager remains a great resource. They tell you to use perfmon to look at the log bytes flushed/sec counter (in the SQL Server:Databases object) to see which database is being written to so much.

After identifying a database you’re curious about, you may want to drill down further. I wrote about this problem earlier in Tackle WRITELOG Waits Using the Transaction Log and Extended Events. The query I wrote for that post combines results of an extended events session with the transaction log in order to identify which procedures are doing the most writing.

But it’s a tricky kind of script. It takes a while to run on busy systems. There’s a faster way to drill into writes if you switch your focus from which queries are writing so much to which tables are being written to so much. Both methods of drilling down can be helpful, but the table approach is faster and doesn’t require an extended event session and it might be enough to point you in the right direction.

Use This Query

use [specify your databasename here]
 
-- get the latest lsn for current DB
declare @xact_seqno binary(10);
declare @xact_seqno_string_begin varchar(50);
exec sp_replincrementlsn @xact_seqno OUTPUT;
set @xact_seqno_string_begin = '0x' + CONVERT(varchar(50), @xact_seqno, 2);
set @xact_seqno_string_begin = stuff(@xact_seqno_string_begin, 11, 0, ':')
set @xact_seqno_string_begin = stuff(@xact_seqno_string_begin, 20, 0, ':');
 
-- wait a few seconds
waitfor delay '00:00:10'
 
-- get the latest lsn for current DB
declare @xact_seqno_string_end varchar(50);
exec sp_replincrementlsn @xact_seqno OUTPUT;
set @xact_seqno_string_end = '0x' + CONVERT(varchar(50), @xact_seqno, 2);
set @xact_seqno_string_end = stuff(@xact_seqno_string_end, 11, 0, ':')
set @xact_seqno_string_end = stuff(@xact_seqno_string_end, 20, 0, ':');
 
WITH [Log] AS
(
  SELECT Category, 
         SUM([Log Record Length]) as [Log Bytes]
  FROM   fn_dblog(@xact_seqno_string_begin, @xact_seqno_string_end)
  CROSS  APPLY (SELECT ISNULL(AllocUnitName, Operation)) AS C(Category)
  GROUP  BY Category
)
SELECT   Category, 
         [Log Bytes],
         100.0 * [Log Bytes] / SUM([Log Bytes]) OVER () AS [%]
FROM     [Log]
ORDER BY [Log Bytes] DESC;

Results look something like this (Your mileage may vary).
A screenshot of the results

Notes

  • Notice that some space in the transaction log is not actually about writing to tables. I’ve grouped them into their own categories and kept them in the results. For example LOP_BEGIN_XACT records information about the beginning of transactions.
  • I’m using sp_replincrementlsn to find the current last lsn. I could have used log_min_lsn from sys.dm_db_log_stats but that dmv is only available in 2016 SP2 and later.
  • This method is a little more direct measurement of transaction log activity than a similar query that uses sys.dm_db_index_operational_stats

January 20, 2020

Watching SQL Server Stuff From Performance Monitor

Taking a small break from my blogging sabbatical to post one script that I’ve found myself writing from scratch too often.
My hope is that the next time I need this, I’ll look it up here.

The User Settable Counter

Use this to monitor something that’s not already exposed as a performance counter. Like the progress of a custom task or whatever. If you can write a quick query, you can expose it to a counter that can be plotted by Performance Monitor.

Here’s the script (adjust SomeMeasurement and SomeTable to whatever makes sense and adjust the delay interval if 1 second is too short:

declare @deltaMeasurement int = 0;
declare @totalMeasurement int = 0;
 
while (1=1)
begin
 
  select @deltaMeasurement = SomeMeasurement - @totalMeasurement
  from SomeTable;
 
  set @totalMeasurement += @deltaMeasurement;
 
  exec sp_user_counter1 @deltaMeasurement;
  waitfor delay '00:00:01'
end

Monitoring

Now you can monitor “User Counter 1” in the object “SQLServer:User Settable” which will look like this:
Example of monitoring a performance counter using Performance Monitor

Don’t forget to stop the running query when you’re done.

April 3, 2019

Finding Tables with Few Dependencies

Filed under: SQL Scripts,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 10:00 am

A couple weeks ago, I wrote about how to find lonely tables in Sql Server. This is a follow up to that post. I’m now going to talk about small sets of tables that are joined to eachother, but no-one else.

It’s Not Just Me
It seems everyone’s talking about this.

So as I was writing this post and code I noticed an amazing coincidence. I saw the same ideas I was writing about being discussed on twitter by Kelly Sommers, Ben Johnson and others.

They discuss Uber’s microservice graph. When visualized, it’s a big mish-mash of dependencies. Kelly points out how hard it is to reason about and Ben points to a small decoupled piece of the system that he wants to work on.

Me too Ben! And I think that’s the value of that visualization. It can demonstrate to others how tangled your system is. It can also identify small components that are not connected to the main mess. When I tie it to my last post and consider this idea in the database world, I can expand my idea of lonely tables to small sets of tables that are never joined to other tables.

I want to find them because these tables are also good candidates for extraction but how do I find them? I start by visualizing tables and their joins.

Visualizing Table Joins

I started by looking for existing visualizations. I didn’t find exactly what I wanted so I coded my own visualization (with the help of the d3 library). It’s always fun to code your own physics engine.

Here’s what I found

A monolith with some smaller isolated satellites

An example that might be good to extract

That ball of mush in the middle is hard to look at, but the smaller disconnected bits aren’t! Just like Ben, I want to work on those smaller pieces too! And just like the lonely tables we looked at last week, these small isolated components are also good candidates for extracting from SQL Server.

Try It Yourself

I’ve made this visualization available here:

https://michaeljswart.com/show_graph/show_graph.html

There’s a query at the end of this post. When you run it, you’ll get pairs of table names and when you paste it into the Show Graph page, you’ll see a visualization of your database.

(This is all client-side code, I don’t collect any data).

The Query

use [your database name goes here];
 
select
    qs.query_hash,
    qs.plan_handle,
    cast(null as xml) as query_plan
into #myplans
from sys.dm_exec_query_stats qs
cross apply sys.dm_exec_plan_attributes(qs.plan_handle) pa
where pa.attribute = 'dbid'
and pa.value = db_id();
 
with duplicate_queries as
(
    select ROW_NUMBER() over (partition by query_hash order by (select 1)) r
    from #myplans
)
delete duplicate_queries
where r > 1;
 
update #myplans
set query_plan = qp.query_plan
from #myplans mp
cross apply sys.dm_exec_query_plan(mp.plan_handle) qp
 
;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan'),
mycte as
(
    select q.query_hash,
           obj.value('(@Schema)[1]', 'sysname') AS schema_name,
           obj.value('(@Table)[1]', 'sysname') AS table_name
    from #myplans q
    cross apply q.query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') as nodes(stmt)
    CROSS APPLY stmt.nodes('.//IndexScan/Object') AS index_object(obj)
)
select query_hash, schema_name, table_name
into #myExecutions
from mycte
where schema_name is not null
and object_id(schema_name + '.' + table_name) in (select object_id from sys.tables)
group by query_hash, schema_name, table_name;
 
select DISTINCT A.table_name as first_table,
       B.table_name as second_table
from #myExecutions A
join #myExecutions B
on A.query_hash = B.query_hash
where A.table_name < B.table_name;

March 12, 2019

Lonely Tables in SQL Server

Filed under: SQL Scripts,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 12:00 pm

Takeaway: I provide a script that looks at the procedure cache and reports tables that are never joined to other tables.

Recently, I’ve been working hard to reduce our use of SQL Server as much as possible. In other words, I’ve been doing some spring cleaning. I pick up a table in my hands and I look at it. If it doesn’t spark joy then I drop it.

If only it were that easy. That’s not quite the process I’m using. The specific goals I’m chasing are about reducing cost. I’m moving data to cheaper data stores when it makes sense.

So let’s get tidying. But where do I start?

Getting rid of SQL Server tables should accomplish a couple things. First, it should “move the needle”. If my goal is cost, then the tables I choose to remove should reduce my hardware or licensing costs in a tangible way. The second thing is that dropping the table is achievable without 10 years of effort. So I want to focus on “achievability” for a bit.

Achievable

What’s achievable? I want to identify tables to extract from the database that won’t take years. Large monolithic systems can have a lot of dependencies to unravel.

So what tables in the database have the least dependencies? How do I tell without a trustworthy data model? Is it the ones with the fewest foreign keys (in or out)? Maybe, but foreign keys aren’t always defined properly or they can be missing all together.

My thought is that if two tables are joined together in some query, then they’re related or connected in some fashion. So that’s my idea. I can look at the procedure cache of a database in production to see where the connections are. And when I know that, I can figure out what tables are not connected.

Lonely Tables

This script gives me set of tables that aren’t joined to any other table in any query in cache

use [your db name here];
 
SELECT qs.query_hash,
       qs.plan_handle,
       cast(null as xml) as query_plan
  INTO #myplans
  FROM sys.dm_exec_query_stats qs
 CROSS APPLY sys.dm_exec_plan_attributes(qs.plan_handle) pa
 WHERE pa.attribute = 'dbid'
   AND pa.value = db_id();
 
WITH duplicate_queries AS
(
  SELECT ROW_NUMBER() OVER (PARTITION BY query_hash ORDER BY (SELECT 1)) n
  FROM #myplans
)
DELETE duplicate_queries
 WHERE n > 1;
 
UPDATE #myplans
   SET query_plan = qp.query_plan
  FROM #myplans mp
 CROSS APPLY sys.dm_exec_query_plan(mp.plan_handle) qp;
 
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan'),
my_cte AS 
(
    SELECT q.query_hash,
           obj.value('(@Schema)[1]', 'sysname') AS [schema_name],
           obj.value('(@Table)[1]', 'sysname') AS table_name
      FROM #myplans q
     CROSS APPLY q.query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') as nodes(stmt)
     CROSS APPLY stmt.nodes('.//IndexScan/Object') AS index_object(obj)
)
SELECT query_hash, [schema_name], table_name
  INTO #myExecutions
  FROM my_cte
 WHERE [schema_name] IS NOT NULL
   AND OBJECT_ID([schema_name] + '.' + table_name) IN (SELECT object_id FROM sys.tables)
 GROUP BY query_hash, [schema_name], table_name;
 
WITH multi_table_queries AS
(
    SELECT query_hash
      FROM #myExecutions
     GROUP BY query_hash
    HAVING COUNT(*) > 1
),
lonely_tables as
(
    SELECT [schema_name], table_name
      FROM #myExecutions
    EXCEPT
    SELECT [schema_name], table_name
      FROM #myexecutions WHERE query_hash IN (SELECT query_hash FROM multi_table_queries)
)
SELECT l.*, ps.row_count
  FROM lonely_tables l
  JOIN sys.dm_db_partition_stats ps
       ON OBJECT_ID(l.[schema_name] + '.' + l.table_name) = ps.object_id
 WHERE ps.index_id in (0,1)
 ORDER BY ps.row_count DESC;

Caveats

So many caveats.
There are so many things that take away from the accuracy and utility of this script that I hesitated to even publish it.
Here’s the way I used the script. The list of tables was something that helped me begin an investigation. For me, I didn’t use it to give answers, but to generate questions. For example, taking each table in the list, I asked: “How hard would it be to get rid of table X and what would that save us?” I found it useful to consider those questions. Your mileage of course will vary.

October 26, 2018

Uncovering Hidden Complexity

Filed under: Miscelleaneous SQL,SQL Scripts,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 12:15 pm

The other day, Erin Stellato asked a question on twitter about the value of nested SPs. Here’s how I weighed in:

Hidden complexity has given me many problems in the past. SQL Server really really likes things simple and so it’s nice to be able to uncover that complexity. Andy Yun has tackled this problem for nested views with his sp_helpexpandview.

Here’s what I came up with for nested anything. It helps unravel a tree of dependencies based on information found in sys.triggers and sys.dm_sql_referenced_entities. With it, you can see what’s involved when interacting with objects. Here’s what things look like for Sales.SalesOrderDetail in AdventureWorks2014. A lot of the resulting rows can be ignored, but there can be surprises in there too.

A lot in there

DECLARE @object_name SYSNAME = 'Sales.SalesOrderDetail';
 
WITH dependencies AS
(
    SELECT @object_name AS [object_name],
           CAST(
             QUOTENAME(OBJECT_SCHEMA_NAME(OBJECT_ID(@object_name))) + '.' + 
             QUOTENAME(OBJECT_NAME(OBJECT_ID(@object_name)))
             as sysname) as [escaped_name],
           [type_desc],
           object_id(@object_name) AS [object_id],
           1 AS is_updated,
           CAST('/' + CAST(object_id(@object_name) % 10000 as VARCHAR(30)) + '/' AS hierarchyid) as tree,
           0 as trigger_parent_id
      FROM sys.objects 
     WHERE object_id = object_id(@object_name)
 
    UNION ALL
 
    SELECT CAST(OBJECT_SCHEMA_NAME(o.[object_id]) + '.' + OBJECT_NAME(o.[object_id]) as sysname),
           CAST(QUOTENAME(OBJECT_SCHEMA_NAME(o.[object_id])) + '.' + QUOTENAME(OBJECT_NAME(o.[object_id])) as sysname),
           o.[type_desc],
           o.[object_id],
           CASE o.[type] when 'U' then re.is_updated else 1 end,
           CAST(d.tree.ToString() + CAST(o.[object_id] % 10000 as VARCHAR(30)) + '/' AS hierarchyid),
           0 as trigger_parent_id
      FROM dependencies d
     CROSS APPLY sys.dm_sql_referenced_entities(d.[escaped_name], default) re
      JOIN sys.objects o
           ON o.object_id = isnull(re.referenced_id, object_id(ISNULL(re.referenced_schema_name,'dbo') + '.' + re.referenced_entity_name))
     WHERE tree.GetLevel() < 10
       AND re.referenced_minor_id = 0
       AND o.[object_id] <> d.trigger_parent_id
       AND CAST(d.tree.ToString() as varchar(1000)) not like '%' + CAST(o.[object_id] % 10000 as varchar(1000)) + '%'
 
     UNION ALL
 
     SELECT CAST(OBJECT_SCHEMA_NAME(t.[object_id]) + '.' + OBJECT_NAME(t.[object_id]) as sysname),
            CAST(QUOTENAME(OBJECT_SCHEMA_NAME(t.[object_id])) + '.' + QUOTENAME(OBJECT_NAME(t.[object_id])) as sysname),
            'SQL_TRIGGER',
            t.[object_id],
            0 AS is_updated,
            CAST(d.tree.ToString() + CAST(t.object_id % 10000 as VARCHAR(30)) + '/' AS hierarchyid),
            t.parent_id as trigger_parent_id
       FROM dependencies d
       JOIN sys.triggers t
            ON d.[object_id] = t.parent_id
      WHERE d.is_updated = 1
        AND tree.GetLevel() < 10
        AND CAST(d.tree.ToString() as varchar(1000)) not like '%' + cast(t.[object_id] % 10000 as varchar(1000)) + '%'
)
SELECT replicate('—', tree.GetLevel() - 1) + ' ' + [object_name], 
       [type_desc] as [type],
       tree.ToString() as dependencies       
  FROM dependencies
 ORDER BY tree

July 9, 2018

The Bare Minimum You Need To Know To Work With Git

Filed under: Technical Articles — Michael J. Swart @ 9:00 am

I don’t like using git for source control. It’s the worst source control system (except for all the others). My biggest beef is that many of the commands are unintuitive.

Look how tricky some of these common use cases can be: Top Voted Stackoverflow Questions tagged Git. The top 3 questions have over ten thousand votes! This shows a mismatch between how people want to use git and how git is designed to be used.

I want to show the set of commands that I use most. These commands cover 95% of my use of git.
stupid git

Initial Setup

One-time tasks include downloading git and signing up for github or bitbucket. My team uses github, but yours might use gitlab, bitbucket or something else.

Here’s my typical workflow. Say I want to work on some files in a project on a remote server:

Clone a Repository

My first step is to find the repository for the project. Assuming I’m not starting a project from scratch, I find and copy the location of the repository from a site like github or bitbucket. So the clone command looks like this:

git clone https://github.com/SomeProject/SomeRepo.git

This downloads all the files so I have my own copy to work with.

Create a Branch

Next I create a branch. Branches are “alternate timelines” for a repository. The real timeline or branch is called master. One branch can be checked out at a time, so after I create a branch, I check out that branch. In the diagram, I’ve indicated the checkout branch in bold. I like to immediately push that branch back to the remote server. I can always refer to the remote server as “origin”. All this is done with these commands:

git branch myBranch
git checkout myBranch 
git push -u origin myBranch

Change Stuff

Now it’s time to make changes. This has nothing to do with git but it’s part of my workflow. In my example here I’m adding a file B.txt.

Stage Changes

These changes aren’t part of the branch yet though! If I want them to be part of the branch. I have to commit my changes. That’s done in two parts. The first part is to specify the changes I want to commit. That’s called staging and it’s done with git add. I almost always want to commit everything, so the command becomes:

git add *

Commit Changes

The second part is to actually commit the files to the branch with a commit message:

git commit -m "my commit message"

Push Changes

I’m happy with the changes I made to my branch so I want to share them with the rest of the world starting with the remote server.

git push origin myBranch

Create a Pull Request and Merge to master

In fact I’m so happy with these changes, I want to include them in master, the real timeline. But not so fast! This is where collaboration and teamwork become important. I create a pull request and then if I get the approval of my teammates, I can merge.

It sounds like a chore, but luckily I don’t have to memorize any git commands for this step because of sites like github or bitbucket. They have a really nice web site and UI where teams can discuss changes before approving them. Once the changes are approved and merged, master now has the changes.

Once it’s merged. Just to complete the circle, I can pull the results of the merge back to my own computer with a pull

git pull
git checkout master

Other Use Cases

Where Am I?
To find out where I am in my workflow, I like to use:

git status

This one command can tell me what branch I’m on. Whether there are changes that can be pushed or pulled. What files have changed and what changes are staged.

Merge Conflicts
With small frequent changes, merge conflicts become rare. But they still happen. Merge conflicts are a pain in the neck and to this day I usually end up googling “resolving git merge conflicts”.

Can’t this Be Easier?

There are so many programs and utilities available whose only purpose is to make this stuff easier. But they don’t. They make some steps easy, and some steps impossible. Whenever I really screw things up, I delete everything and start from scratch at the cloning step. I find I have to do that more often when I use a tool that was supposed to make my life easier.

One Exception
The only exception to this rule is Visual Studio Code. It’s a real treat to use. I love it.

Maybe you like the command line. Maybe you have a favorite “git-helper” application. No matter how you use git, in every case, you still have to understand the workflow you’re using and that’s what I’ve tried to describe here.

Where To Next

If you want to really get good at this stuff. I recently learned of a great online resource (thanks Cressa!) at https://learngitbranching.js.org/. It’s a great interactive site that teaches more about branching. You will very quickly learn more than the bare minimum required. I recommend it.

June 15, 2018

ORDER BY newid() is an Unbiased Way To Randomize

Filed under: Miscelleaneous SQL,SQLServerPedia Syndication,Technical Articles — Michael J. Swart @ 9:47 am

Mike Bostock is a data-visualization specialist. And it really shows in his blog. Every article is really well designed (which makes sense… many of the articles are about design).

One of his articles, Visualizing Algorithms has some thoughts on shuffling at https://bost.ocks.org/mike/algorithms/#shuffling.

He says that sorting using a random comparator is a rotten way to shuffle things. Not only is it inefficient, but the resulting shuffle is really really biased. He goes on to visualize that bias (again, I really encourage you to go see his stuff).

Ordering by random reminded me of the common technique in SQL Server of ORDER BY newid(). So I wondered whether an obvious bias was present there. So I shuffled 100 items thousands of times and recreated the visualization of bias in a heat map (just like Mike did).

Here is the heatmap. If you can, try to identify any patterns.

Order By NewID Bias

Where:

    columns are the position before the shuffle,
    rows are the position after the shuffle,
    green is a positive bias and
    red is a negative bias.

I don’t think there is any bias here. The problem that introduces bias in Mike Bostock’s example is that his “random comparator” that he defined does not obey transitivity. His words. “A comparator must obey transitivity: if a > b and b > c, then a > c.”
But in SQL Server, because each row is assigned a newid(), ORDER BY newid() doesn’t have that flaw and so it doesn’t have that bias.

But Be Careful

Although the method is unbiased, ORDER BY newid() is still inefficient. It uses a sort which is an inefficient way of shuffling. There are alternative shuffle algorithms that are more efficient.
ORDER BY newid() is good for quick and dirty purposes. But if you value performance, shuffle in the app.

Older Posts »

Powered by WordPress