Michael J. Swart

June 25, 2014

Looking Back at 100 Illustrations

Filed under: Data Cartoons — Michael J. Swart @ 1:05 pm

So I recently took a look at the illustrations on this blog and realized that I’ve got 100 of them. I’ve even built a page to show them off. Since 100 is a nice round number, I’m going to take this opportunity to look at some trends. I’ve grouped some of my illustrations into categories:

Movie Franchises

the Princess Bride:

Star Wars:

Star Trek:

Lord of the Rings:

Myself

Turns out I’m also a bit of a narcissist:

Women

A little over half of my illustrations are of people, and a smaller fraction are of fictional characters. But only a tiny fraction of those are women. I’m a bit worried about that. Maybe it says something about pop culture. Maybe it says something about me. I’ll have to give that some extra thought.

My Favorites:

This illustration has the right level of snark and it just makes me laugh. In my head I’ve titled this one “Grumpy Ted Codd” and I have it on a mug. It was one of the first illustrations I ever published, and I’ve never been able to capture the same feeling of humour and relevance since.

I was proud of this one because the likeness turned out. It was one of the first feelings I had that I was getting the hang of this.

Looking back on this now, their heads are too shiny. But this illustration was used with one of my most popular articles and I find myself looking for this article at least once a month. I hit the Browse By Illustration page and start looking for the mythbusters.

I did this one for a guest post on SQL Brit’s site. There were a lot of details in the spaceship, and it’s a different kind of drawing than faces. The pun works and I got to reference cheesy 80’s sci-fi… classic.

In General

It’s been a ton of fun. I don’t know what I have in store for the future but I do feel like a change of format is due.

As always, the comment form below is open. I’d like to hear what you think.

March 19, 2013

Checking Out Amazon Redshift

In order to refresh my memory about what I learned in university, I’ve been watching a course that UC Berkeley made available on YouTube. It’s been a good course so far. Recently I got to the topic of logical fallacies, and so I’m reminded that the following is not a valid argument: Jimmy advises X; Jimmy profits if you follow advice X; therefore, not X. It’s not valid in the academic sense. In the practical sense, I have to remember not to distrust all commercials and marketing videos.

But it’s hard. Especially when Googling for “Big Data” or “Analytics”, I find it very difficult to find anything to read that’s both unbiased and useful. Maybe it’s because I’m still skeptical about any solution promoted by the people who stand to profit if you follow their advice (I’m trying not to discount their advice, I just squint at it a little).

So when Amazon came out with a publicly available beta for their new Redshift data warehouse service (accompanied by a slick marketing video), I decided to kick the tires. Here are some notes I made:

Amazon’s Redshift Commercial

Their commercial is here. And I have to say it reminded me of an infomercial, the kind where they try to sell fancy mop gadgets by showing how ridiculously clumsy people can be with regular mops. The Amazon commercial is light on technical details, but I don’t think it was meant for an audience like you and me. I made a conscious effort not to hold that against them.

Warehouse in the cloud

Having a warehouse in the cloud makes a lot of sense in some ways. Pay-as-you-go pricing is what you want when you prefer operational costs over capital costs, and many businesses that don’t have I.T. as a core competency will find this a pretty attractive alternative to running servers or an appliance themselves. But it can get expensive quickly. Amazon advertises less than $1,000 per terabyte per year, but with the upfront costs for reserved pricing and a two-terabyte minimum, the smallest commitment you can get away with is $3,000 per year for three years, and in practice it will likely be significantly more. I can’t see anyone choosing Redshift without doing due diligence on the pricing, but it’s probably not going to be dirt cheap.

Connecting to Redshift
Star Trek's Jean Luc Picard says "Very well Data, open a channel"
Connections are made to Redshift only through ODBC (or JDBC) using Postgres drivers. Some challenges:

  • Picking my way through 32 bit vs 64 bit was tricky. Getting and using the right driver took some care.
  • Uploading through this connection is slow, so don’t load data this way. It’s possible, but it’s simply not practical. I learned this lesson not by reading through Amazon’s docs, but by attempting to run an SSIS job to funnel data into an ODBC destination. It was actually working, but it was so slow as to be impractical.

Creating the warehouse schema wasn’t too hard: I scripted a subset of tables from an existing warehouse and stripped the indexes, constraints and identities. There were a couple of syntax differences (int becomes integer, bit becomes boolean, and GO becomes a semicolon) but it went pretty smoothly. For Redshift, in lieu of indexes, you choose columns on each table for the distribution key and the sort key. Under the covers, Redshift uses ParAccel, so if you’re familiar with that, you’ve got a great head start. Otherwise, Amazon’s Developer Guide is a good place to start. I’d like to hear more from others who know more about modeling for Redshift; it seems like a regular star schema will work well here.
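For illustration, here’s roughly what a translated table definition ends up looking like (a sketch only; the table and column names are made up, not from my actual warehouse):

```sql
-- Note integer/boolean instead of SQL Server's int/bit, and the
-- DISTKEY/SORTKEY choices that take the place of indexes.
CREATE TABLE fact_sales (
    sale_date    date          NOT NULL,
    customer_id  integer       NOT NULL,
    product_id   integer       NOT NULL,
    is_return    boolean       NOT NULL,
    amount       decimal(18,2) NOT NULL
)
DISTKEY (customer_id)
SORTKEY (sale_date);
```

The distribution key controls how rows are spread across nodes, and the sort key plays a role similar to a clustered index for range scans.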

Loading data is best through S3

I created an SSIS package that takes data from my relational warehouse and dumps it into delimited files (hat tip to @peschkaj for advice on delimiter values). Then I gzipped the whole thing, uploaded the gzipped files to Amazon S3, and loaded the data into Redshift from those files. Initially, I was wary of Amazon S3 (what? I have to sign up for and learn about another service?) but working with it turned out to be dead simple.
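The Redshift side of that load uses the COPY command. Here’s a sketch (the table name, bucket path, and credential placeholders are illustrative, and the delimiter should match whatever your export wrote):

```sql
-- Bulk-load gzipped delimited files from S3 into a Redshift table.
-- GZIP tells Redshift the files are compressed; DELIMITER matches
-- the character used when the files were written.
COPY fact_sales
FROM 's3://my-bucket/warehouse/fact_sales/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '|'
GZIP;
```

COPY loads in parallel across the cluster, which is why it’s so much faster than funnelling rows through an ODBC connection.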

Consuming data

I connected a Tableau client to Redshift using an ODBC connection. This Tableau discussion tells me that’s the current way to connect Tableau and Redshift. There are quite a few SQL limitations imposed by the ODBC protocol. So the Tableau experience was not too smooth. Tableau has a new Redshift connector coming out in a matter of weeks which should clear these limitations.
The point is that Amazon claims Redshift integrates seamlessly with a list of reporting apps (including Tableau). I believe it really will by the time Redshift is released, it’s just right now there’s a pretty big seam.

Next?

I’m going to get around to evaluating Google Big Query and of course Microsoft’s HDInsight. You might notice that lately, I find myself in a phase of learning (fun!) and so this post and the next few will be a description of my exploration.

January 17, 2012

Clearing out the closet…

Filed under: Data Cartoons,Tongue In Cheek — Michael J. Swart @ 2:31 pm

Hey fellow SQL bloggers,

Have you ever written an amazing blog post that just didn’t pan out because when you reread what you wrote, you found out it was crap? That happens to me sometimes and it’s frustrating. Here are some illustrations that I’ve created in the past that just never made the cut. So I’ll just leave these here.

Clean Your Data

SQLHomies

Just slightly hipper than SQLPeople.net

Love Your Data

Inspired by Karen Lopez’s favourite quote (Maybe this should have gone before the first picture).

Brent Ozar PLF

I admit it, I’m a Brent Ozar PLF fanboy. And once upon a time, Jeremiah Peschka and his beard reminded me of Yukon Cornelius:


June 1, 2011

A Round-up of Round-ups and other Meta Things

Filed under: Data Cartoons,Miscellaneous SQL,SQLServerPedia Syndication — Michael J. Swart @ 12:00 pm

This article is very meta.

No, I’m not talking about the friendly Mehta family a few doors down. I’m talking about the prefix Meta. In particular I’ve noticed a few SQL Blog post aggregator sites and I wanted to aggregate the aggregators, review the reviews and round up the round-ups.

But How Do I Use This Post?

I wrote this post because a friend of mine wanted to keep up with what’s new in the industry. When I showed him my list of RSS feeds, it was a bit overwhelming. So I wrote this post to be an easier and less overwhelming intro to the online SQL community. If you’re in the same boat, just follow these steps:

  1. Start by following the weekly round-ups (see below). They’re a very good place to begin.
  2. Then if you’re still curious and your free time allows, subscribe to some of the collection feeds (see below). This gives a large variety of decent quality content.
  3. Create a category called Elite in your RSS reader and promote or tag your favorite writers to that category.

Without further ado

Weekly SQL Article Roundups

These weekly recaps of what’s new in SQL Server are probably the highest concentration of quality links for the week. I’d recommend any one of them to anyone who has only 15 to 30 minutes to spare a week but wants to keep an eye on the industry.

  • Database Weekly (website | newsletter) A weekly newsletter put out by the folks at SQL Server Central (or Red Gate or Simple Talk; they’re all friends). It’s probably the most popular weekly round-up for SQL Server, and deservedly so. It’s got at least a couple dozen links a week. That seems like a lot, but they’re broken out into categories, so it’s manageable and very comprehensive.
  • Something For The Weekend (website | email | rss) By John Sansom (good guy). He usually has a dozen or so handpicked articles that he posts on Fridays. If you’re in the UK, he’ll also tell you about local user group events.
  • Things Brent, Jeremiah, Kendra and Tim Liked This Week (website | email) Another weekly newsletter that sprouted up about a month ago. It’s not as much of a round-up as the other two in this list; it’s four people selecting three or four of their favorite links weekly. It’s only been a little over a month, but these links are cool (concentrated cool) and don’t tend to overlap with the round-ups above.

Blog Collections

If you thought that Ulysses was a nice light read, remember Moby Dick as a quick fable, have accidentally called the phone book a pamphlet, and count fire-hose water as your favorite drink, then these are for you.

These are aggregate feeds of syndicated bloggers. Why subscribe to hundreds of SQL bloggers separately when you can subscribe to most of them in one feed? With these feeds, you’ve got tons of articles and likely won’t have enough time to read all of them. But if you eat/breathe/sleep SQL Server, these are for you.

  • SQLServerPedia (website | rss) Hosted by Quest Software, they’ve got about 100 bloggers and I’m proud to be one of them. It’s a syndicated list, which means each article actually lives on its author’s own site, but the articles are collected by SQLServerPedia’s website and RSS feed. It’s a very diverse group of bloggers, and together they average about 10 to 20 articles a day.
  • SQLBlog (website | rss) Run by Adam Machanic and Peter DeBetta. These blogs are popular and they’ve been around a while. They’ve got about 30 active bloggers under their wing and the quality is top notch. There are a number of things that make this site stand out.
    • Quality: You can trust the information in these articles.
    • Focus: Most of the articles are technical. Occasionally there are a few posts entitled I’ll be speaking at some user group you don’t care about, but those are easy to skip.
    • Active: The comment section is lively. Because the articles are hosted at SQLBlog, so are the comments.
  • SQLServerCentral (rss) This has a couple posts a day and includes articles from Simple Talk. The volume makes keeping up with this feed manageable. The quality of each article is first class. I believe these articles are really polished because they’re reviewed by tech editors. My only gripes are that …
    • The feed only includes a one-paragraph teaser, so you have to click through to read the articles.
    • Often the teaser gives no indication of who the author is, so when it’s something generic like Make the most of your Database, it’s tempting to simply mark it as read.

There are lots of others that I know I’m missing (SQLTeam, SQLskills, SSWUG, etc.); you can check them out if you’re a fire-hose-water connoisseur. (Holy cow, I just spelled connoisseur correctly without spell check!)

Michael, What Else Have You Got?

If none of the above information is new to you, good job, you’re well ahead of the curve. This post still has something new for you though: a “meta” joke (a joke about “meta”, not a joke about jokes).

The humour here is all Karen Lopez (@DataChick), I just supplied the pixels.

January 26, 2011

Ridiculously Unnormalized Database Schemas – Part Two

Filed under: Data Cartoons,SQLServerPedia Syndication,Technical Articles — Tags: , , , — Michael J. Swart @ 12:00 pm

Ridiculously Unnormalized Database Schemas

Second Normal Form (2NF)

I’ve seen 2NF defined this way:

“A table is considered 2NF if it is 1NF and if its non-prime attributes depend on the whole of every candidate key, not just part of it.”

It means that for tables that have multicolumn primary keys (or could have them), the other columns depend on all the columns in the key, not just a subset. If you follow this rule, you’ll see that a field in a record contains data that is a fact only about the record it belongs to, not about others. From another point of view, I’ve seen normalization defined as removing redundant data.

Get rid of redundant data (Star Trek's Data gets fired)

Things to Remember

  • For 2NF, you only need to look at the “non-prime attributes” or in other words the attributes that aren’t part of the primary key.
  • Look at these columns and ask whether they depend on the whole primary key.
  • Tables with a single-column primary key are automatically in 2NF.
  • BUT it’s not fair to make identity columns your primary keys on every table and call your job done (the definition of 2NF closes this loophole by mentioning candidate keys).

The Example

Take a look at the following table. It tracks reviews from the talent-search TV show “American Idol”. The primary key is defined on the columns (Date, Singer, Song, Reviewer). Look at each of the other columns and ask whether it depends on the whole key or just a subset of the key.

American Idol Reviews

| Date   | Singer         | Song               | Reviewer      | Age | Show Order | Show Theme | Result    | Review               |
|--------|----------------|--------------------|---------------|-----|------------|------------|-----------|----------------------|
| Aug-21 | Justin Guarini | Before Your Love   | Paula Abdul   | 23  | 1          | Finale     | Runner-Up | Beautiful            |
| Aug-21 | Justin Guarini | Before Your Love   | Randy Jackson | 23  | 1          | Finale     | Runner-Up | A little pitchy dawg |
| Aug-21 | Justin Guarini | Before Your Love   | Simon Cowell  | 23  | 1          | Finale     | Runner-Up | Rubbish              |
| Aug-21 | Kelly Clarkson | A Moment Like This | Paula Abdul   | 23  | 2          | Finale     | Winner    | Beautiful            |
| Aug-21 | Kelly Clarkson | A Moment Like This | Randy Jackson | 23  | 2          | Finale     | Winner    | A little pitchy dawg |
| Aug-21 | Kelly Clarkson | A Moment Like This | Simon Cowell  | 23  | 2          | Finale     | Winner    | Rubbish              |
| Aug-21 | Justin Guarini | Get Here           | Paula Abdul   | 20  | 3          | Finale     | Runner-Up | Beautiful            |
| Aug-21 | Justin Guarini | Get Here           | Randy Jackson | 20  | 3          | Finale     | Runner-Up | A little pitchy dawg |
| Aug-21 | Justin Guarini | Get Here           | Simon Cowell  | 20  | 3          | Finale     | Runner-Up | Rubbish              |
| Aug-21 | Kelly Clarkson | Respect            | Paula Abdul   | 23  | 4          | Finale     | Winner    | Beautiful            |
| Aug-21 | Kelly Clarkson | Respect            | Randy Jackson | 23  | 4          | Finale     | Winner    | A little pitchy dawg |
| Aug-21 | Kelly Clarkson | Respect            | Simon Cowell  | 23  | 4          | Finale     | Winner    | Rubbish              |
| Aug-21 | Justin Guarini | A Moment Like This | Paula Abdul   | 23  | 5          | Finale     | Runner-Up | Beautiful            |
| Aug-21 | Justin Guarini | A Moment Like This | Randy Jackson | 23  | 5          | Finale     | Runner-Up | A little pitchy dawg |
| Aug-21 | Justin Guarini | A Moment Like This | Simon Cowell  | 23  | 5          | Finale     | Runner-Up | Rubbish              |
| Aug-21 | Kelly Clarkson | Before Your Love   | Paula Abdul   | 23  | 6          | Finale     | Winner    | Beautiful            |
| Aug-21 | Kelly Clarkson | Before Your Love   | Randy Jackson | 23  | 6          | Finale     | Winner    | A little pitchy dawg |
| Aug-21 | Kelly Clarkson | Before Your Love   | Simon Cowell  | 23  | 6          | Finale     | Winner    | Rubbish              |

You can see that, with the exception of the Review column, the columns in this table do not depend on the whole key. We can pull these columns out into separate tables:

Normalized:

ShowContestants

| Singer         | Date   | Age | Result    |
|----------------|--------|-----|-----------|
| Justin Guarini | Aug-21 | 23  | Runner-Up |
| Kelly Clarkson | Aug-21 | 20  | Winner    |

Performances

| Date   | Singer         | Song               | Show Order |
|--------|----------------|--------------------|------------|
| Aug-21 | Justin Guarini | Before Your Love   | 1          |
| Aug-21 | Kelly Clarkson | A Moment Like This | 2          |
| Aug-21 | Justin Guarini | Get Here           | 3          |
| Aug-21 | Kelly Clarkson | Respect            | 4          |
| Aug-21 | Justin Guarini | A Moment Like This | 5          |
| Aug-21 | Kelly Clarkson | Before Your Love   | 6          |

Shows

| Date   | Show Theme |
|--------|------------|
| Aug-21 | Finale     |

We still have a table to hold all the reviews by the judges. Defined as:

    Reviews(Date, Singer, Song, Reviewer, Review)

But it’s still a point of debate whether or not the reviews depend on the whole primary key or a subset of the key (especially based on the example).
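For the curious, the normalized design can be sketched as table definitions. This is only a sketch: the column types and the ShowDate column name are my own assumptions, not part of the original article.

```sql
-- Each non-key column now depends on the whole of its table's key.
CREATE TABLE Shows (
    ShowDate  date        NOT NULL,
    ShowTheme varchar(50) NOT NULL,
    PRIMARY KEY (ShowDate)
);

CREATE TABLE ShowContestants (
    Singer   varchar(100) NOT NULL,
    ShowDate date         NOT NULL,
    Age      int          NOT NULL,
    Result   varchar(20)  NOT NULL,
    PRIMARY KEY (ShowDate, Singer)
);

CREATE TABLE Performances (
    ShowDate  date         NOT NULL,
    Singer    varchar(100) NOT NULL,
    Song      varchar(100) NOT NULL,
    ShowOrder int          NOT NULL,
    PRIMARY KEY (ShowDate, Singer, Song)
);

CREATE TABLE Reviews (
    ShowDate date          NOT NULL,
    Singer   varchar(100)  NOT NULL,
    Song     varchar(100)  NOT NULL,
    Reviewer varchar(100)  NOT NULL,
    Review   varchar(1000) NOT NULL,
    PRIMARY KEY (ShowDate, Singer, Song, Reviewer)
);
```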

October 5, 2010

DBA humour

Filed under: Data Cartoons,SQLServerPedia Syndication,Tongue In Cheek — Michael J. Swart @ 9:44 pm

Data (Bass)


(Pun so intended)

March 25, 2010

Transaction Phenomena – Part 2: The Non-Repeatable Read

SERIES: Concurrent Transaction Phenomena

In Books Online (BOL), Microsoft describes different kinds of transaction isolation levels in terms of phenomena that can occur during concurrent transactions. Specifically, they mention three kinds: dirty reads, non-repeatable reads, and phantom reads. You may have heard of these before but, correct me if I’m wrong, I just can’t find a good definition anywhere in BOL.

And that’s too bad, because these phenomena don’t just help describe the different transaction isolation levels. The phenomena actually define these levels.

These terms actually come from the ISO/ANSI standard for SQL-92. The terms themselves however do little to illuminate exactly the meaning of each. (What’s a phantom read?) When used by the ANSI/ISO standard, they mean something very specific and they’re key to understanding the different isolation levels.

In the next few days, I’d like to illustrate each phenomenon:

Part 2: The Non-Repeatable Read

The non-repeatable read occurs when a transaction re-reads a single record and finds that it has been changed or deleted.

From ISO/ANSI: Process P1 reads a row. Process P2 then modifies or deletes that row and commits the change. If P1 rereads the row, it receives the modified value or discovers that the row has been deleted.

Changing Data
I like to think of Non-Repeatable Reads as being about reading data that has changed.


(I apologize in advance to my friends at work whose reaction I know will be “Boo”.)

But non-repeatable reads are not always a bad thing. In fact, they’re often harmless or even required. When data has changed, you have to decide whether you want consistent data or current data. If you need to rely on consistent data, then a non-repeatable read is detrimental. Here is an example of a non-repeatable read that causes problems:

SET TRANSACTION ISOLATION LEVEL READ COMMITTED
BEGIN TRANSACTION
 
	IF EXISTS(
		SELECT 1
		FROM Theatre.AvailableSeats
		WHERE seat = 'B23')
	BEGIN
 
		-- imagine a concurrent transaction
		-- sells seat B23 here and removes the record
		-- from table Theatre.AvailableSeats
 
		-- then no rows are returned here:
		SELECT price
		FROM Theatre.AvailableSeats
		WHERE seat = 'B23'
 
		-- and zero rows are affected here:
		DELETE FROM Theatre.AvailableSeats
		WHERE seat = 'B23'
 
	END
COMMIT

Isolation Levels that Experience Non-Repeatable Reads

Just like dirty reads, this phenomenon is possible when using the READ UNCOMMITTED isolation level, but it is also possible when using READ COMMITTED. And in SQL Server, READ COMMITTED is the default isolation level for transactions.

My experience
To avoid this phenomenon, you could use the REPEATABLE READ isolation level. But often I find it’s possible to rewrite the transaction and stick with the default READ COMMITTED.
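For example, the ticket-selling transaction above can be collapsed into a single atomic statement (a sketch against the same hypothetical Theatre.AvailableSeats table), which removes the window between the read and the delete:

```sql
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
BEGIN TRANSACTION

	-- Read and delete in one atomic statement. The OUTPUT clause
	-- returns the price of the row we actually deleted, so a
	-- concurrent transaction can no longer sneak in between a
	-- separate SELECT and DELETE.
	DELETE FROM Theatre.AvailableSeats
	OUTPUT deleted.price
	WHERE seat = 'B23'

	-- If @@ROWCOUNT is 0 here, another transaction sold the seat first.

COMMIT
```

Because the existence check, the price lookup, and the delete are a single statement, there is no point at which another transaction can change the row between our reads.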
