Do I Need SQL or Hadoop? A Flowchart

I read this blog post, thanks to @merv on Twitter: Counting Triangles (Vertica)

It's about how to count triangles in a graph, and it contrasts using Vertica with using Hadoop's MapReduce. Vertica was 22-40x faster than Hadoop on 1.3GB of data, and it took only 3 lines of SQL. So they've shown that on 1.3GB of data, Vertica is easier and faster. This result is not super interesting, though.
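To give a feel for why the SQL version can be so short, here is a minimal sketch of triangle counting as a three-way self-join. This is not the query from the Vertica post: it runs on Python's built-in sqlite3 module rather than Vertica, and the table name edges, the columns source and dest, and the helper count_triangles are all assumptions made up for illustration.

```python
import sqlite3


def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs.

    Hypothetical helper: the schema (edges.source, edges.dest) is an
    assumption for this sketch, not the schema used in the Vertica post.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (source INTEGER, dest INTEGER)")

    # Store each undirected edge exactly once with source < dest,
    # dropping self-loops and duplicate edges.
    canonical = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    conn.executemany("INSERT INTO edges VALUES (?, ?)", sorted(canonical))

    # A triangle a < b < c shows up as the edges (a, b), (b, c) and (a, c).
    # Because every stored edge is ordered, each triangle matches once.
    (count,) = conn.execute(
        """
        SELECT COUNT(*)
        FROM edges e1
        JOIN edges e2 ON e1.dest = e2.source
        JOIN edges e3 ON e3.source = e1.source AND e3.dest = e2.dest
        """
    ).fetchone()
    conn.close()
    return count
```

Calling count_triangles([(1, 2), (2, 3), (1, 3), (3, 4)]) finds the single triangle 1-2-3. The whole computation is the one SELECT statement; the rest is setup. Whether a join like this stays fast at much larger scales is a separate question.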

The effort in writing the jobs is vastly different: SQL is much easier in this case, but we all know this. Yes, SQL is easier than MapReduce. It's also true that MapReduce is way easier than writing your own distributed computation job. And it's also true that with MapReduce one can do things SQL can't, like image processing.

But benchmarking Vertica against Hadoop on 1.3GB of data is like saying "We're going to conduct a 50 meter race between a Boeing 737 and a DC-10". Such a race wouldn't even involve taking off. The same is true of this comparison. Neither of these technologies was designed to run on such small data sets.

Now, it is nice to have a scalable system that is also fast at small scales, but I don't think that's what this article was about. If the implication is that this performance difference will still hold at larger scales, that is not obvious at all, and really deserves to be proven.

To help people decide which technologies they should use based on their particular situation, I've constructed the following flowchart: