Ten Principles for Good Data

Inspired by Dieter Rams’ Ten Principles for Good Design, I created the following list of ten principles for good data. These principles inform the products we build at Koverse to help organizations take full advantage of all of their data.


Good data varies in the level of structure. 

The structure of data does not determine its usefulness – unstructured text and imagery, semi-structured and nested data, and highly structured records can contain equally valuable information. Technologies such as flexible schemas, natural language processing, and computer vision help unlock the information in these data types.
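As a toy illustration of that point, here is a minimal Python sketch (with made-up field names and data) that pulls the same fact out of a fully structured record, a nested semi-structured document, and free text. The keyword match at the end is just a stand-in for real NLP or entity extraction.

```python
import json
import re

structured_row = {"customer_id": 42, "city": "Denver"}  # fully structured record
semi_structured = json.loads('{"customer": {"id": 42, "address": {"city": "Denver"}}}')
unstructured = "Customer 42 called from their Denver office today."

def city_from_structured(row):
    return row["city"]

def city_from_nested(doc):
    return doc["customer"]["address"]["city"]

def city_from_text(text, known_cities=("Denver", "Seattle")):
    # A stand-in for real entity extraction: a simple keyword match.
    return next((c for c in known_cities if re.search(rf"\b{c}\b", text)), None)

print(city_from_structured(structured_row),
      city_from_nested(semi_structured),
      city_from_text(unstructured))
```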

Good data is as much data as possible. 

We no longer need to try to predict the usefulness of data before storing it. Data storage and processing technology has advanced to the point where it is now possible to store data before particular use cases are identified, full of potential, ready to be combined and analyzed with other sources.

Good data is co-located. 

The volume of available data is increasing rapidly. As volumes grow, data becomes harder to move quickly. Not only are we now moving computation to the data, but different data sets should also be stored close together on physical media, so that combining them and asking questions across all of them becomes possible.

Good data is widely accessible. 

To be effective, data must be available to the right people at the right time. Not only must the proper access be granted to various groups of decision makers, but the data must be retrievable quickly, at the speed at which decisions must be made. Organizing the data via profiling and indexing makes it possible to ensure the right data can be queried, searched, and delivered in time.
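Here is a minimal sketch of the indexing idea (a generic illustration, not any particular product's API): a simple inverted index over field values lets the right records be found without scanning the whole collection.

```python
from collections import defaultdict

records = [
    {"id": 1, "type": "order",  "region": "west"},
    {"id": 2, "type": "return", "region": "east"},
    {"id": 3, "type": "order",  "region": "east"},
]

# Build an inverted index: (field, value) -> set of record ids.
index = defaultdict(set)
for rec in records:
    for field, value in rec.items():
        if field != "id":
            index[(field, value)].add(rec["id"])

# Query: which records are orders in the east region?
hits = index[("type", "order")] & index[("region", "east")]
print(hits)  # {3}
```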

Good data can be traced to its source. 

As data is combined, transformed, and summarized, the actions and relationships between source data and derived data sets must be recorded. Decisions that can be made from a data set are only as good as the sources and methods from which it was created. Being able to trace the lineage of data back to original sources is essential for making decisions with high confidence.
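A minimal sketch of what recording lineage might look like (the structure here is hypothetical): every derived data set records its sources and the method used, so any result can be walked back to the originals.

```python
from datetime import datetime, timezone

lineage = {}  # data set name -> provenance record

def register(name, sources, method):
    lineage[name] = {
        "sources": sources,
        "method": method,
        "created": datetime.now(timezone.utc).isoformat(),
    }

register("raw_web_logs", sources=[], method="ingest from web servers")
register("daily_sessions", sources=["raw_web_logs"], method="sessionize by user and day")
register("weekly_report", sources=["daily_sessions"], method="aggregate sessions per week")

def trace(name):
    """Walk back from a derived data set to its original sources."""
    for src in lineage[name]["sources"]:
        yield src
        yield from trace(src)

print(list(trace("weekly_report")))  # ['daily_sessions', 'raw_web_logs']
```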

Good data should have an original version. 

Often assumptions are made as to the structure and semantics of a data set. Ideally these assumptions are informed by the data itself, but when an assumption is discovered to be invalid, it must be possible to go back to the earliest form of the data and start over.
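As a small sketch of that principle (the record format below is invented for the example), keeping the original form of each record alongside any parsed version means a bad parsing assumption can be corrected simply by reprocessing from the originals.

```python
raw_records = ["2012-07-04,1532", "2012-07-05,0947"]  # original, untouched form

def parse_v1(raw):
    # First assumption: the second field is an integer count.
    date, value = raw.split(",")
    return {"date": date, "count": int(value)}

def parse_v2(raw):
    # Revised assumption: it is actually a 24-hour time, HHMM.
    date, value = raw.split(",")
    return {"date": date, "hour": int(value[:2]), "minute": int(value[2:])}

parsed = [parse_v1(r) for r in raw_records]
# Later the first assumption turns out to be wrong; because the originals
# were kept, we can simply start over:
parsed = [parse_v2(r) for r in raw_records]
print(parsed)
```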

Good data is distributed. 

The ability to work with data should be limited only by available resources and not by artificial technical hurdles. By distributing data onto multiple servers and using software designed to coordinate work across these servers, the time it takes to process data becomes a function of the number of servers available, and organizations can increase the number of servers to meet business needs. Many algorithms have already been parallelized to work in these environments, and more work is being done all the time.
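A minimal single-machine sketch of that pattern: Python's multiprocessing stands in for a cluster here, but the shape is the same one that frameworks like MapReduce coordinate across many servers: partition the data, process the partitions in parallel, then combine the partial results.

```python
from multiprocessing import Pool

def count_words(chunk):
    # Each worker processes only its own partition of the data.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["the quick brown fox"] * 100_000
    chunks = [lines[i::4] for i in range(4)]       # partition the data
    with Pool(processes=4) as pool:                # one worker per partition
        partials = pool.map(count_words, chunks)   # map step, in parallel
    total = sum(partials)                          # reduce step
    print(total)
```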

Good data is protected. 

Before data owners are comfortable contributing data to an analytical system, those responsible for protecting the data must be able to guarantee that access is granted only to groups authorized to read it. Powerful security features make it possible to overcome the obstacles to bringing data together for insight without violating its confidentiality.
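A minimal sketch of the label-based idea in generic Python (not any specific product's security model): each record carries a label, and a query only returns records whose labels are among the reader's authorizations.

```python
records = [
    {"value": "aggregate sales figures", "label": "public"},
    {"value": "customer contact list",   "label": "pii"},
    {"value": "merger plans",            "label": "executive"},
]

def visible_to(reader_authorizations, data):
    # Filter out anything the reader is not authorized to see.
    return [r for r in data if r["label"] in reader_authorizations]

analyst = {"public", "pii"}
print(visible_to(analyst, records))  # the 'executive' record is filtered out
```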

Good data can be understood. 

Billions of records and mounds of text are manageable from a storage and processing perspective, but at some point the data must be made to guide thinking and action within an organization. Because raw data is frequently overwhelming and low level, this often means transforming it into higher-level, usually smaller data that can be readily applied to a decision process. These transformations include aggregation, summarization, training statistical models, and visualization, to name a few.
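For example, a handful of aggregations can collapse a pile of raw events (made up here) into a summary small enough to act on:

```python
from collections import Counter

raw_events = [
    {"page": "/pricing", "ms": 420},
    {"page": "/pricing", "ms": 380},
    {"page": "/docs",    "ms": 150},
    {"page": "/pricing", "ms": 510},
]

# Aggregate: visits per page and average latency per page.
visits = Counter(e["page"] for e in raw_events)
avg_ms = {p: sum(e["ms"] for e in raw_events if e["page"] == p) / n
          for p, n in visits.items()}

print(visits.most_common(1))  # which page draws the most traffic
print(avg_ms)                 # and how it is performing
```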

Good data flows. 

Gathering and organizing data can take so long that by the time it is in a form that supports a decision, it is already too old. Often this is not due to hardware limitations but to a data flow that is a manual, labor-intensive process. Good data is frequently updated, and its derivatives are updated in time to be relevant to today's decisions.
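A minimal sketch of that idea: rather than rebuilding a summary in a slow batch step, update the derived view incrementally as each record arrives (the list below is a stand-in for a live feed).

```python
running_totals = {}

def on_new_record(record):
    """Update the derived summary incrementally as each record arrives."""
    key = record["region"]
    running_totals[key] = running_totals.get(key, 0) + record["amount"]

stream = [{"region": "west", "amount": 10},
          {"region": "east", "amount": 7},
          {"region": "west", "amount": 3}]

for rec in stream:           # in practice this would be a continuous feed
    on_new_record(rec)

print(running_totals)        # always reflects the latest data: {'west': 13, 'east': 7}
```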

(This post is also available at the Koverse blog)

Some of my Favorite Data Visualization Resources

Data Visualization is one pillar of the solution to the problem of overwhelming data. Data points that seem completely noisy and incoherent can resolve into immediately recognizable patterns when projected onto the right visualization.

Visualization is still more art than science, although some practitioners have done it enough to identify techniques that work well for particular purposes. Most successful visualizations depend more on the decision or task at hand than on the original structure of the data. For example, geographical information might not be best visualized on a map if the task at hand doesn't depend on comparing geographical distances or directionality.

Sometimes the distribution of values in the data calls for one specific visualization rather than another. A good example is illustrated in a recent post on Business Insider showing how a pie chart, while fine for distinguishing two or three data points with clearly different values, is terrible for comparing ten subtly different values, a task much better served by a bar chart (see the sketch below).
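A quick way to see this for yourself is the following matplotlib sketch, which plots ten made-up, subtly different values first as a pie and then as bars:

```python
import matplotlib.pyplot as plt

labels = [f"cat {i}" for i in range(1, 11)]
values = [9.0, 9.4, 9.8, 10.1, 10.3, 10.6, 10.9, 11.1, 11.4, 11.6]

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))
ax_pie.pie(values, labels=labels)
ax_pie.set_title("Pie: which slice is biggest?")
ax_bar.bar(labels, values)
ax_bar.set_title("Bar: the ordering is obvious")
ax_bar.tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()
```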

Here are a few great resources on data visualization. There are too many out there for me to do them all justice. These are just a few that I've found useful and use on a regular basis.

Books on Visualization

Visualize This
By Nathan Yau

This book is great because it discusses everything from how visualization can solve an information problem to creating the visualization using open tools to adding the final touches.


Visualizing Data
By Ben Fry

This is another great book offering a comprehensive look at creating visualizations. Using a seven-step pipeline, it covers everything from the conception of a good visualization all the way to completion. Ben Fry is also a co-creator of the interactive visualization framework Processing, which is used throughout the book.




R Graphics Cookbook
By Winston Chang

I recently picked this up because R is such a useful tool for doing so many things with data, and the visualizations R produces are fairly good out of the box. This cookbook helps you get to a good visualization quickly with examples for almost any visualization you might want.





Information is Beautiful
By David McCandless

This fascinating book is a collection of interesting visualizations, on a wide variety of subjects, created by author David McCandless. David describes himself as a data journalist and information designer. These are some very creative visualizations that go far beyond the classic charts and graphs.



The Visual Display of Quantitative Information
By Edward Tufte

This is a much older book than most of the others, yet it describes principles that apply to all good visualizations. In particular I like Tufte's emphasis on parsimonious designs that get out of the way and let the information tell the story.



Universal Principles of Design
By William Lidwell, Kritina Holden, and Jill Butler

This book contains an alphabetical list of good design principles that apply to all aspects of design, including visualizations and information design.

Sites for Visualizations and Design

Flowing Data - flowingdata.com
The site of Nathan Yau, author of Visualize This.

Visual Complexity - www.visualcomplexity.com/vc
A site containing many different ways to tackle visualizing graphs, one of the most difficult types of visualization.

Visual.ly - visual.ly
Visual.ly is like a hub for infographics. One could argue that too many things are being made into infographics, but there have been some amazing things posted at visual.ly that push the envelope in terms of making unfathomable things understandable, like this Perspective on Time.

The New York Times - nytimes.com
The Times has done some fantastic visualizations around individual stories in the past.

Wanken - blog.wanken.com
The blog of designer Shelby White, who manages to find consistently amazing design. He is also the author of the site Designspiration.


Tools

R - www.r-project.org
The reason I like R is that it produces pretty good visualizations out of the box, which can then be exported as SVG to another application for fine-tuning, styling, and labeling.

Python's Matplotlib - matplotlib.org
In conjunction with things like scikit-learn and NLTK, Python's matplotlib makes for a very well-rounded data processing and visualization toolkit.
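As a small illustration of that combination (synthetic data, with KMeans chosen simply as an example), scikit-learn can do the analysis and matplotlib can draw the result:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate some synthetic 2-D data and cluster it.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Color each point by the cluster scikit-learn assigned it.
plt.scatter(X[:, 0], X[:, 1], c=labels, s=15)
plt.title("Clusters found by scikit-learn, drawn by matplotlib")
plt.show()
```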

D3 - d3js.org
D3 is great for creating custom visualizations for the web that may need to be interactive. The only drawback is that it can be more difficult to use than a focused charting library; it's the classic customization-versus-convenience trade-off.

Highcharts - www.highcharts.com
Highcharts is a ready-to-use JavaScript charting library that looks great and is fantastic for showing live updates to data. Highcharts must be licensed for commercial use.

Chart.js - www.chartjs.org
Chart.js provides really good-looking animated HTML5 charts. They are so great that I recently built them into our product at Koverse for visualizing various aspects of data collections.



Processing - processing.org
Processing is a project that makes building interactive visualizations easy, using a scripting language built on Java and an integrated development environment. Processing has a lot of great libraries for doing things like simulating physics and even face detection in webcam video. Many a sophisticated experiential design and art project has been built using Processing.



Hopefully the above list proves useful in your own endeavors. Happy visualization!

Dremel vs. Tenzing vs. Sawzall

Recent buzz surrounding Google's Dremel and the potential for an open source implementation caused me to wonder about a similar paper Google published on a system called Tenzing, which seems to have generated less buzz.

It turns out both technologies are in use in Google's YouTube Data Warehouse. The slides from XLDB that describe the system highlight the following tradeoffs, which may be specific to Google's implementation but may also reveal more fundamental tradeoffs between latency and query power. They also include Sawzall, a language for implementing MapReduce jobs.

The slides contain the following table. Note that 'high' is good in every row except Latency:

              Sawzall   Tenzing   Dremel
Latency       high      med       low
Scalability   high      high      med
SQL           none      high      med
Power         high      med       low


Looking at this chart, there appears to be a bit of a continuum. Dremel provides the lowest (best) latency, apparently at the cost of query power (no joins?).

If more query power is required, Tenzing appears to handle 'medium complexity analysis' with strong SQL support (i.e., more of the SQL spec is implemented, and it is likely more compatible with SQL-based systems). Tenzing sacrifices some latency, but its scalability is actually considered better than Dremel's.

Finally, switching from declarative SQL-like queries to the procedural language Sawzall provides more query power and control at the cost of yet more latency. 

Open Source Options


Currently, Sawzall has been open sourced and can be found here. There is a proposal to create a Dremel implementation as an Apache Incubator project called Drill from the guys at MapR and some other companies. There's also a project called OpenDremel.

These projects are interesting since achieving higher scalability was considered to come at the cost of the interactivity and flexibility provided by SQL. Dremel demonstrates that low-latency, interactive SQL-like queries are possible at 'medium' scales.

I'd love to hear why the YTDW guys say Dremel doesn't scale as well as MapReduce, as I didn't get that sense from the research paper. It quotes 'trillions of rows in seconds', runs on 'thousands of CPUs and petabytes of data', and processes 'quadrillions of records per month'. That's 'medium' scalability for Google. It's likely the case that Google's version of MapReduce scales to astronomical numbers and that Dremel can handle the biggest datasets that all but a few are likely to throw at it.


Big Data Reading List

Since there's so much going on in the Big Data space these days, getting up to speed quickly is important for lots of technical decision makers.

This is a list of books and articles that might be helpful for learning about the major concepts and innovations in the Big Data space. It is by no means an attempt to be comprehensive or even an unbiased representation, just useful. I've organized the list according to what I feel are a few fundamental approaches to tackling the Big Data challenge, namely:

New Distributed Architectures

Distributed architectures address the most basic problem related to Big Data, i.e. what does one do when the data no longer 'fits' on a single machine? At a minimum, one must store or stream the data and process it somehow.

Machine Learning

Machine learning, modeling, data mining, etc. address the problem of understanding the data. Even if I can store and process the data, ultimately I need to gain some level of human understanding of the information contained therein.

Machine learning can help solve this problem via modeling, either with a model that is transparent enough for a human to understand the fundamental processes that generated the data, or with a model that can be used in place of human understanding to help make decisions. It can also reduce dimensionality and reveal structure.

Visualization

Visualization is a different approach to helping understand the data that leverages the considerable power of the human visual cortex to help find patterns and structure in the data. 


Sometimes these approaches combine. I think that perhaps all of the above approaches are coalescing into a new field that could be termed 'Data Science'.


New Distributed Architecture Concepts


Machine Learning and New Architectures

Machine Learning

Visualization

A Few Blogs / Sites




Ingest the Web into Accumulo at covert.io

A good friend and outstanding technologist behind the blog covert.io just published this guide to crawling the web using Nutch and storing the pages in Accumulo:
Accumulo, Nutch, and Gora

Accumulo on EC2

I've posted a guide to running Accumulo on Amazon's EC2. Accumulo has been deployed on hundreds of machines on EC2 and it works pretty well.

Accumulo is an implementation of Google's BigTable with additional features such as cell-level security labels and programmable server-side aggregation.
Scaling the size of the cluster, we saw an 85% increase in the aggregate write rate each time we doubled the number of machines, reaching 1 million inserts per second at the 100-machine mark.
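As a back-of-the-envelope check of what an 85% increase per doubling implies, the sketch below assumes a hypothetical 10-machine baseline of 125,000 inserts per second, which works out to roughly 1 million inserts per second at 100 machines:

```python
import math

# Assumed baseline for illustration only: 10 machines at 125k inserts/sec.
base_machines, base_rate = 10, 125_000
growth_per_doubling = 1.85  # the observed ~85% increase per doubling

def write_rate(machines):
    doublings = math.log2(machines / base_machines)
    return base_rate * growth_per_doubling ** doublings

print(f"{write_rate(100):,.0f} inserts/sec at 100 machines")  # ~965,000
```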

Netflix has shown similar results running Cassandra on EC2 on 100 machines in their benchmark.

Hadoop Genealogy

I came across this diagram of the evolution of Hadoop on the Apache Blog. It's interesting that so much competition is going on down in the dirty details of features, like append. It's as if the commercial contention has visibly manifested itself in the code.

