Michael J. Swart

March 19, 2013

Checking Out Amazon Redshift

In order to refresh my memory about what I learned in university, I’ve been watching a course that UC Berkeley made available on YouTube. It’s been a good course so far. Recently I got to the topic of logical fallacies, and so I’m reminded that the following is not a valid argument: Jimmy advises X; Jimmy profits if you follow advice X; therefore, not X. It’s not valid in the academic sense. In the practical sense, I have to remind myself not to distrust all commercials and marketing videos.

But it’s hard, especially when Googling for “Big Data” or “Analytics”: I find it very difficult to find anything to read that’s both unbiased and useful. Maybe it’s because I’m still skeptical of any solution promoted by people who stand to profit if I follow their advice (I’m trying not to discount that advice, I just squint at it a little).

So when Amazon came out with a publicly available beta for their new Redshift data warehouse service (accompanied by a slick marketing video), I decided to kick the tires. Here are some notes I made:

Amazon’s Redshift Commercial

Their commercial is here. And I have to say it reminded me of an infomercial, the kind where they try to sell fancy mop gadgets by showing how ridiculously clumsy people can be with regular mops. The Amazon commercial is light on technical details, but I don’t think it was meant for an audience like you and me. I made a conscious effort not to hold that against them.

Warehouse in the cloud

Having a warehouse in the cloud makes a lot of sense in some ways. Pay-as-you-go pricing is what you want when you prefer operational costs over capital costs. Many businesses that don’t have I.T. as a core competency will find this a pretty attractive alternative to running servers or an appliance themselves. But it can get expensive quickly. Amazon advertises less than $1,000 per terabyte per year, but that rate requires upfront payment for reserved pricing and comes with a two-terabyte minimum. The smallest commitment you can get away with is $3,000 per year for three years, and in practice it will likely be significantly more. I can’t see anyone choosing Redshift without doing due diligence on the pricing, but it’s probably not going to be dirt cheap.

Connecting to Redshift
[Image: Star Trek’s Jean-Luc Picard says “Very well Data, open a channel”]
Connections are made to Redshift only through ODBC (or JDBC) using Postgres drivers. Some challenges:

  • Picking my way through 32-bit vs. 64-bit drivers was tricky. Getting and using the right driver took some care.
  • Uploading data through this connection is slow, so don’t load data that way. It’s possible, but it’s not recommended. I learned this lesson not by reading the Amazon docs but by attempting to run an SSIS job that funneled data into an ODBC destination. It actually worked, but it was so slow that it was impractical.
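To give a flavor of the driver wrangling: a DSN-less connection string for the 64-bit PostgreSQL ODBC driver might look something like the following Python sketch. The cluster endpoint, database name, and credentials here are all made up, and the exact driver name depends on which psqlODBC build you installed.

```python
def redshift_odbc_conn_str(host, database, user, password, port=5439):
    """Build a DSN-less ODBC connection string for a Redshift cluster.

    Redshift speaks the Postgres wire protocol, so a PostgreSQL ODBC
    driver is used; 5439 is Redshift's default port.  The driver name
    must match what's registered in your (64-bit) ODBC administrator.
    """
    return (
        "Driver={PostgreSQL Unicode(x64)};"
        f"Server={host};Port={port};Database={database};"
        f"UID={user};PWD={password};"
    )

# Hypothetical cluster endpoint and credentials:
conn_str = redshift_odbc_conn_str(
    "examplecluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="dev", user="masteruser", password="secret")

# With a matching 64-bit driver installed, pyodbc could then connect:
# import pyodbc
# conn = pyodbc.connect(conn_str)
```

The main gotcha is matching bitness end to end: a 64-bit client process needs a 64-bit driver registered in the 64-bit ODBC administrator.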

Creating the warehouse schema wasn’t too hard: I scripted a subset of tables from an existing warehouse and stripped out the indexes, constraints and identities. There were a couple of syntax differences (int becomes integer, bit becomes boolean, and GO becomes a semicolon) but it went pretty smoothly. For Redshift, in lieu of indexes, you choose columns on each table for the distribution key and the sort key. Under the covers, Redshift uses ParAccel, so if you’re familiar with that, you’ve got a great head start. Otherwise, Amazon’s Developer Guide is a good place to start. I’d like to hear more from others who know more about modeling for Redshift; it seems like a regular star schema will work well here.
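Those few substitutions are mechanical enough to script. Here’s a deliberately naive Python sketch that applies them to a scripted SQL Server table and tacks on a distribution key and sort key. The table and column names are invented, and real Redshift DDL also lets you declare DISTKEY/SORTKEY at the column level; treat this as a toy translator, not a general one.

```python
import re


def tsql_to_redshift(ddl, distkey=None, sortkey=None):
    """Naively translate scripted SQL Server DDL to Redshift DDL.

    Applies the substitutions from the post: int -> integer,
    bit -> boolean, and the GO batch separator -> semicolon.
    DISTKEY/SORTKEY stand in for what indexes would do on SQL Server.
    """
    out = re.sub(r"\bint\b", "integer", ddl)      # word-bounded: skips bigint
    out = re.sub(r"\bbit\b", "boolean", out)
    out = re.sub(r"(?im)^\s*GO\s*$", ";", out)     # GO on its own line
    out = out.rstrip().rstrip(";").rstrip()        # re-terminate at the end
    if distkey:
        out += f"\nDISTKEY({distkey})"
    if sortkey:
        out += f"\nSORTKEY({sortkey})"
    return out + ";"


# Hypothetical scripted table from SQL Server:
tsql = """CREATE TABLE FactSales (
    DateKey int NOT NULL,
    IsReturn bit NOT NULL
)
GO"""

redshift_ddl = tsql_to_redshift(tsql, distkey="DateKey", sortkey="DateKey")
print(redshift_ddl)
```

For a fact table, using the same column for both keys (here the date key) is a common starting point, since it spreads rows across slices while keeping range scans on dates cheap.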

Loading data is best through S3

I created an SSIS package that extracts data from my relational warehouse and dumps it into delimited files (hat tip to @peschkaj for advice on delimiter values). Then I gzipped those files, loaded them into Amazon S3, and loaded the data into Redshift from there. Initially I was wary of Amazon S3 (what, I have to sign up for and learn about yet another service?), but working with Amazon S3 was dead simple.
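The pipeline above (delimited files, gzipped, uploaded to S3, then a COPY into Redshift) can be sketched in a few lines of Python. The bucket, table, and credential placeholders below are hypothetical, and the actual S3 upload (boto, s3cmd, or the console) is left out; the DELIMITER and GZIP options are part of Redshift’s COPY syntax.

```python
import csv
import gzip
import io

# Toy extract: a few rows from a fact table.
rows = [(1, "2013-03-19", 42.5), (2, "2013-03-20", 17.25)]

# Dump rows to a pipe-delimited file in memory, then gzip it.
# Pipe is a handy delimiter when commas may appear inside the data.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="|")
writer.writerows(rows)
payload = gzip.compress(buf.getvalue().encode("utf-8"))
# ...upload payload as e.g. fact_sales/part-001.gz to S3...

# Once the .gz files sit in S3, one COPY statement loads them all in
# parallel.  Bucket, prefix, and credentials here are placeholders.
copy_sql = (
    "COPY fact_sales "
    "FROM 's3://my-bucket/exports/fact_sales/' "
    "CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>' "
    "DELIMITER '|' GZIP;"
)
```

The point of the detour through S3 is that COPY loads the files in parallel across the cluster’s slices, which is why it beats trickling rows in over ODBC.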

Consuming data

I connected a Tableau client to Redshift using an ODBC connection. This Tableau discussion tells me that’s the current way to connect Tableau and Redshift. Quite a few SQL limitations come with that generic ODBC connection, so the Tableau experience was not too smooth. Tableau has a new Redshift connector coming out in a matter of weeks which should clear up these limitations.
The point is that Amazon claims Redshift integrates seamlessly with a list of reporting apps (including Tableau). I believe it really will by the time Redshift is released; it’s just that right now there’s a pretty big seam.

Next?

I’m going to get around to evaluating Google BigQuery and of course Microsoft’s HDInsight. You might notice that lately I find myself in a phase of learning (fun!), so this post and the next few will describe my explorations.

6 Comments »

  1. [...] Checking Out Amazon Redshift - Yes I know it’s not your classic SQL Server post but it is for your own good, moulding you into a more robust data professional, courtesy of the experiences of Mr Michael J. Swart (Blog|Twitter). [...]

    Pingback by Something for the Weekend - SQL Server Links 22/03/13 • John Sansom — March 22, 2013 @ 9:22 am

  2. Not to derail any data-based conversation, but this Picard and the last Columbo are just really really good.

    Comment by Dave Swart — March 22, 2013 @ 11:28 am

  3. Thanks for the post, Michael. Like you, I’m suspicious of the pricing. Do you have any sense of what the bandwidth might add to the pricing for Redshift?

    Comment by Tim Benninghoff — March 22, 2013 @ 3:54 pm

  4. Hey Tim, Great question.

    I’m by no means the expert on pricing here, but I think the thing to reference is the Data Transfer section on Amazon Redshift Pricing.

    To get a definitive answer, you’d have to talk to someone at Amazon. My hunch is that data loads and unloads within Amazon (i.e. to s3 and back) are not charged (from s3 to the outside world is another story).

    And data transfers to redshift using their jdbc connection is so painfully slow, that you probably won’t be able to rack up any significant charges. (But don’t take my word for it).

    The whole point I believe is to let redshift do the slicing/dicing/processing/aggregating of data efficiently in the warehouse and report a relatively small set of results. These results shouldn’t be a significant size. That is, query results that use jdbc should be measured in KB rather than TB.

    Comment by Michael J. Swart — March 22, 2013 @ 4:16 pm

  5. Hi Michael – great post – I am working through something very similar at the moment. Did you ever do that next step and blog – evaluating Google Big Query ?

    Comment by Kelly Jol — March 19, 2014 @ 6:58 am

  6. Thanks for reading Kelly,
    I did hint that I would be taking a look at Google’s Big Query. The pricing model didn’t work for us (at work) and so it was dismissed pretty quickly without any analysis that would be worth sharing.

    I wish you good luck.

    Comment by Michael J. Swart — March 19, 2014 @ 8:22 am
