Dremel vs. Tenzing vs. Sawzall

Recent buzz surrounding Google's Dremel and the potential for an open source implementation caused me to wonder about a similar paper Google published about a system called Tenzing, about which there seems to be less buzz.

Turns out both technologies are in use in Google's YouTube Data Warehouse. The slides from XLDB that describe the system highlight the following tradeoffs, which may be specific to Google's implementation but may also reveal more fundamental tradeoffs between latency and query power. They also include Sawzall, a language for implementing MapReduce jobs.

The slides contain the following table. Note that 'high' = good for rows except latency:

              Sawzall   Tenzing   Dremel
Latency       high      med       low
Scalability   high      high      med
SQL           none      high      med
Power         high      med       low


Looking at this chart, there appears to be a bit of a continuum. Dremel provides the lowest (best) latency, apparently at the cost of query power (no joins?).

If more query power is required, Tenzing appears to handle 'medium complexity analysis' with strong SQL support (i.e. more of the SQL spec is implemented and it's likely more compatible with SQL-based systems). Tenzing sacrifices a bit of latency, but its scalability is actually considered better than Dremel's.

Finally, switching from declarative SQL-like queries to the procedural language Sawzall provides more query power and control at the cost of yet more latency. 

Open Source Options


Currently, Sawzall has been open sourced and can be found here. There is a proposal to create a Dremel implementation as an Apache Incubator project called Drill from the guys at MapR and some other companies. There's also a project called OpenDremel.

These projects are interesting since achieving higher scalability was considered to come at the cost of the interactivity and flexibility provided by SQL. Dremel demonstrates that low-latency, interactive SQL-like queries are possible at 'medium' scales.

I'd love to hear why the YTDW guys say Dremel doesn't scale as well as MapReduce as I didn't get that sense from the research paper. They quote 'trillions of rows in seconds', it runs on 'thousands of CPUs and petabytes of data', and processes 'quadrillions of records per month'. That's 'medium' scalability for Google. It's likely the case that Google's version of MapReduce scales to astronomical numbers and that Dremel can handle the biggest datasets that all but a few are likely to throw at it. 


Big Data Reading List

Since there's so much going on in the Big Data space these days, getting up to speed quickly is important for lots of technical decision makers.

This is a list of books and articles that might be helpful for learning about the major concepts and innovations in the Big Data space. It is by no means an attempt to be comprehensive or even an unbiased representation, just useful. I've organized the list according to what I feel are a few fundamental approaches to tackling the Big Data challenge, namely:

New Distributed Architectures

Distributed Architectures address the most basic problems related to Big Data - i.e. what does one do when the data no longer 'fits' on a single machine. Ostensibly one must store or stream the data and process it somehow.

Machine Learning

Machine learning, modeling, data mining, etc address the problem of understanding the data. Even if I can store and process the data, ultimately I need to gain some level of human understanding of the information contained therein. 

Machine learning can help solve this problem via modeling: either a model that is transparent, so a human can understand the fundamental processes that generated the data, or a model that can be used in place of human understanding to help make decisions. It can also reduce dimensionality and reveal structure.

Visualization

Visualization is a different approach to helping understand the data that leverages the considerable power of the human visual cortex to help find patterns and structure in the data. 


Sometimes these approaches combine. I think that perhaps all of the above approaches are coalescing into a new field that could be termed 'Data Science'.


New Distributed Architecture Concepts


Machine Learning and New Architectures

Machine Learning

Visualization

A Few Blogs / Sites




Ingest the Web into Accumulo at covert.io

A good friend and outstanding technologist behind the blog covert.io just published this guide to crawling the web using Nutch and storing the pages in Accumulo:
Accumulo, Nutch, and Gora

Accumulo on EC2

I've posted a guide to running Accumulo on Amazon's EC2. Accumulo has been deployed on hundreds of machines on EC2 and it works pretty well.

Accumulo is an implementation of Google's BigTable with additional features such as cell-level security labels and programmable server-side aggregation.
As we scaled the size of the cluster, we saw an 85% increase in the aggregate write rate each time we doubled the number of machines, reaching 1 million inserts per second at the 100 machine mark.
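To make the arithmetic behind that scaling figure concrete, here is a minimal sketch. It assumes the 85% gain per doubling holds as a simple rule in both directions, which is a simplification of mine and not something the benchmark claims. The function names are hypothetical.

```python
import math

# "85% increase ... each time we doubled" means each doubling multiplies
# the aggregate rate by 1.85 (a perfect doubling would multiply it by 2.0).
GAIN_PER_DOUBLING = 1.85


def projected_rate(known_rate, known_machines, target_machines,
                   gain=GAIN_PER_DOUBLING):
    """Extrapolate aggregate write rate from one known measurement.

    Assumes (hypothetically) that the same gain applies to every doubling.
    """
    doublings = math.log2(target_machines / known_machines)
    return known_rate * gain ** doublings


def per_machine_efficiency(known_machines, target_machines,
                           gain=GAIN_PER_DOUBLING):
    """Per-machine throughput at the target size, relative to the known size."""
    doublings = math.log2(target_machines / known_machines)
    return (gain / 2) ** doublings
```

Under this model, projected_rate(1_000_000, 100, 200) comes out to roughly 1.85 million inserts per second, and each machine does about 92.5% of the work it did before the doubling. That projection is an extrapolation, not a measured result.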

Netflix has shown similar results running Cassandra on EC2 on 100 machines in their benchmark.

Hadoop Genealogy

Came across this diagram of the evolution of Hadoop on the Apache Blog. Interesting that so much competition is going on down in the dirty details of features, like append. It's as if the commercial contention has visibly manifested itself in the code.



Online Data Availability Challenges

When deciding to share data with others, publish content, or create a new website or web application one must consider the problem of where the files and code must live. As the Internet Archive project illustrates, content on the web is unsettlingly ephemeral.

For those with an interest in creating some lasting content or presence on the web, and for those worried about data loss and high data availability, the following factors conspire to kill your data. I'll address each and hopefully a picture of a solution will emerge:

1. The majority of internet users are connected to the net via obfuscated means

Wouldn't it be nice if it were possible to make some content or data on your hard drive available to others without having to put it somewhere else? Alas, the nature of most internet users' connections means there's no good way for others to consistently find your machine, and the data you may wish to make available, from wherever they may be. First, the assignment of IP addresses to user machines is dynamic: you may get a different address every time you reconnect. The address might not change that often, but it can't be relied on not to change.

There are dynamic DNS updating services to help with this, but then you run into the next hurdle: many users are sharing a single IP address via a technology called NAT (network address translation) that only allows traffic coming back from an original user request to find its way to a user's machine, rather than unsolicited web requests from anywhere.

These are probably good for security, but completely destroy the option of making data hosted on your own machine highly available to others. Furthermore, upload speeds tend to be much lower than download, so only a few users could reach your data simultaneously.

This means you've got to find some service out there on the net that is set up for making data available consistently and to a large number of users.

2. Hard drives and other storage media fail fairly reliably

Conventional hard drives (not Solid State Disks) have moving parts and are often the first component of any computer to fail. It's just a matter of time. More expensive disks last longer but no disks last forever. This necessitates backing data up somewhere. But even then the backups are prone to fail. Writable CDs only last a few years, and other hard drives will also fail or may silently corrupt the data.

The safest option is to host several copies of data on several disks that are monitored for bit rot (not total drive failure, but just a few bits of data changing or going away) and failure, upon which new copies are created from the remaining good copies. This can be provided via RAID, or newer distributed filesystems such as HDFS, which is the only filesystem I know of that checks for bit rot.
[Update: as @Zooko points out, ZFS and Tahoe-LAFS also provide this feature]
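
The detect-and-repair idea can be sketched in a few lines. This is only an illustration and not how HDFS, ZFS, or Tahoe-LAFS actually implement it. The function names are hypothetical, and it assumes a trusted checksum was recorded when the data was first written.

```python
import hashlib


def checksum(data):
    """Fingerprint of the bytes, recorded at write time and trusted later."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas, expected):
    """Return a repaired list of replicas.

    Any copy whose checksum no longer matches `expected` is treated as
    bit-rotted and replaced with a copy that still verifies.
    """
    good = [r for r in replicas if checksum(r) == expected]
    if not good:
        # Every copy is damaged, so there is nothing left to repair from.
        raise ValueError("all replicas are corrupt")
    return [r if checksum(r) == expected else good[0] for r in replicas]


# Usage: flip a single bit in one of three copies, then scrub.
original = b"some data worth keeping"
expected = checksum(original)
rotted = bytearray(original)
rotted[0] ^= 0x01
replicas = [original, bytes(rotted), original]
repaired = scrub(replicas, expected)
```

The point is that a plain copy is not enough. Without the stored checksum, there is no way to tell which of two differing copies is the good one.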

This means we have to find a service that offers to make multiple copies of your data on multiple spinning disks, monitor the copies, and automatically re-replicate as necessary. I mentioned a few technologies, and these are either very available, such as RAID, or are becoming more so, as with hosted distributed filesystems.

3. Web hosting companies can come and go

This doesn't necessarily threaten data loss, as most companies will probably offer to let you get your data somehow should they fail, but this process is not usually very well known if it exists at all. Even the big web companies have reluctantly responded to the need to allow users to extract all their data.

It would be best for data availability to have your data hosted by a strong company, with advanced data centers, and enough scale to provide your data with several spinning disks.

4. Web technologies, such as servers and browsers, change

This one may be the hardest challenge to overcome, as there doesn't seem to be much demand for solving it. The best one can do is try to stick to widespread standards, formats and technologies to extend the usefulness of their data over time. I believe we need more work on creating lasting data and content delivery systems that have fewer dependencies than the systems we have today.

5. The networked nature of the web gives rise to spiky page view behavior

Even if your data is protected against loss, others must be able to get to it. Because of the interlinked nature of the web, content tends to be accessed in spiky bursts as word spreads about the content through user sharing and re-posting. If your data or content isn't in a highly scalable application, some users will be denied access to it.

Modern content hosting services provide a technology known as Content Delivery Networks, which are essentially geographically distributed servers that hold static content and direct requests to the nearest physical copy. For dynamic data, scale can be addressed by applications that can be distributed across a cluster of machines that can be grown and shrunk dynamically with no downtime, and requests load balanced across them.
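
As a rough illustration of load balancing across a pool that grows and shrinks, here is a minimal round-robin sketch. The class and server names are hypothetical. Real load balancers also handle health checks, connection draining, and weighting, none of which is shown.

```python
class RoundRobinPool:
    """Hand out servers in rotation; the pool can change between requests."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._next = 0  # running request counter

    def add(self, server):
        """Grow the pool, e.g. when traffic spikes."""
        self._servers.append(server)

    def remove(self, server):
        """Shrink the pool, e.g. when traffic subsides."""
        self._servers.remove(server)

    def pick(self):
        """Return the server that should handle the next request."""
        if not self._servers:
            raise RuntimeError("no servers available")
        server = self._servers[self._next % len(self._servers)]
        self._next += 1
        return server


# Usage: two servers share the load, then a third joins with no downtime.
pool = RoundRobinPool(["app-1", "app-2"])
before = [pool.pick() for _ in range(4)]
pool.add("app-3")
after = [pool.pick() for _ in range(3)]
```

New requests simply start landing on the added machine, which is what lets an application absorb a spike without being taken offline.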

This means we need to dynamically replicate content and applications across many machines to respond to user demand. Fortunately web technology has been trending towards this with the advent of dynamically scalable platforms.

At this point, a few services more or less fit this bill, namely some of the large 'cloud computing' offerings. Each offers some form of scalable, replicated content delivery. However, we have one more factor to address.

6. Controversial content may have a hard time finding a home and staying available

Content that threatens powerful interests, or that may be considered controversial, or that could be caught up in intellectual property disputes, is subject to being denied hosting if the affected interests can pressure content hosting companies. The unfortunate side effect of all the other factors, which point us towards large, thriving, and technologically advanced hosting companies, is that it makes it easier to have content made unavailable via legal or commercial pressure on one or two entities. 

Solutions to factors #6 and #3 seem to be at odds with each other.

So how does one achieve data replication, distribution, and scalability, but at the same time avoid centralized control? I believe we need more effort in this area, perhaps going back to challenge #1 of securely and consistently making content on user machines available. A few decentralized services exist to help avoid making content 'killable' through censorship efforts, and these may even get close to addressing the other requirements, but are not well adopted or integrated into the web. 

The perfect solution seems to be a decentralized, geographically distributed, widely addressable, dynamically scalable data replication and delivery service that is an essential component of several independent commercial endeavors, to ensure its longer-term support and viability. I don't think we have many candidates, but I'd love to be proven wrong.




Do I Need SQL or Hadoop? A Flowchart

I read this blog post, thanks to @merv on Twitter: Counting Triangles (Vertica)

It's about how to count triangles in a graph, and contrasts using Vertica with using Hadoop's MapReduce. Vertica was 22-40x faster than Hadoop on 1.3GB of data, and it only took 3 lines of SQL. They've shown that on 1.3GB of data, Vertica is easier and faster. This result is not super interesting though.
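
For readers unfamiliar with the problem, here is a minimal single-machine sketch of triangle counting. It is not the SQL or the MapReduce code from that post, just the underlying computation. It assumes an undirected graph given as pairs of comparable node ids.

```python
from collections import defaultdict


def count_triangles(edges):
    """Count distinct triangles in an undirected graph of (u, v) edge pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)

    count = 0
    for u in neighbors:
        for v in neighbors[u]:
            if u < v:
                # A triangle u < v < w exists for every common neighbor w
                # of u and v; requiring w > v counts each triangle once.
                count += sum(1 for w in neighbors[u] & neighbors[v] if w > v)
    return count


# Usage: a single triangle plus one dangling edge.
example_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
example_count = count_triangles(example_edges)
```

At 1.3GB an adjacency map like this may well fit in memory on one machine, which is part of why a benchmark at that size says little about distributed systems.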

The effort in writing the jobs is vastly different - SQL is much easier in this case, but we all know this. Yes, SQL is easier than MapReduce. It's also true that MapReduce is way easier than writing your own distributed computation job. And it's also true that with MapReduce one can do things SQL can't, like image processing.

But benchmarking Vertica or Hadoop on 1.3GB of data is like saying "We're going to conduct a 50 meter race between a Boeing 737 and a DC10".  Such a race wouldn't even involve taking off. The same is true of this comparison. Neither of these technologies was designed to run on such small data sets.

Now, it is nice to have a scalable system that is also fast at small scales, but I don't think that's what this article was about. If the implication is that this performance difference will still hold at larger scales, that is not obvious at all, and really deserves to be proven.

To help people decide which technologies they should use based on their particular situation, I've constructed the following flow chart: