After hearing words like “Big Data”, “Hadoop”, “MapReduce” and “NoSQL” more and more recently, many friends and colleagues have decided that they need more than just a vague idea of what this stuff is all about. A couple of years ago, when I started down this same discovery path, information on the topic was far more fragmented. Recently, though, with the explosion of “big data” interest and companies, there are many great resources. In fact, Cloudera is a great starting point. First is an interview of Cloudera CEO Mike Olson (Cloudera is essentially a Red Hat for Hadoop) by Robert Scoble. Mike nicely peels back the onion step by step: first he walks through the computing/data trends that have driven this “big data” movement, which leads into who the key developers and contributors have been, and finally what the ecosystem looks like today.
The key takeaways that I like to focus on are:
- Data sets have gotten MUCH bigger. It used to be that data was human-scale; in other words, data was created by human-driven activities: things like credit card transactions, bank accounts, static HTML pages, and e-mail. This type of data consumed gigabytes, maybe even a few terabytes. Now we live in a world of machine-scale data: machines running code that produce new data every millisecond. We’re talking about things like sensors and data-driven web sites that generate, repurpose, or syndicate data. The Web today is described in petabytes (1 PB = 1,000 TB). Many large-scale services deal with datasets in the tens to hundreds of terabytes or larger and process terabytes of data daily.
- Physical I/O throughput is the bottleneck. The amount of data we can store on disk is on the order of 1,000x what we can hold in RAM as a working set: commodity hardware can handle tens of TB of hard disk storage but only tens of GB of RAM. So what does that mean? In order to process any arbitrary subset of a massive dataset, that data needs to be loaded into RAM. Even fast hard disks today top out at a theoretical maximum of roughly 100 MB/sec. Even at that rate, reading just 1 TB of data into RAM takes nearly 3 hours; 100 TB would take more than a week. When I worked at Ariba from 2001-2004, we basically told our clients to go out and buy the biggest Sun box they could afford and throw Oracle on it. Whatever the maximum throughput of the resulting DB server was essentially set an upper limit on overall application throughput. That was in a 1 TB world, and it’s just not good enough today for many applications. The Hadoop/MapReduce concept breaks this paradigm: instead of having one uber-expensive DB box, put small chunks of data onto large clusters of cheap, commodity hardware. Hadoop achieves massive parallelization and thus fights orders-of-magnitude-larger datasets with orders-of-magnitude-higher aggregate I/O.
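To make the chunks-plus-parallelism idea concrete, here is a minimal single-process sketch of the MapReduce pattern, using the classic word-count example (the function names and sample chunks are my own illustration, not Hadoop’s API). In a real cluster each chunk lives on a different machine, so every node reads its slice from local disk in parallel instead of funneling all the data through one database server’s I/O path:

```python
from collections import defaultdict

def map_phase(chunk):
    # Mapper: emit a (word, 1) pair for every word in this chunk.
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

# These chunks stand in for blocks of a huge file spread across a cluster;
# each map_phase call could run on a different cheap machine.
chunks = ["big data big clusters", "cheap commodity hardware", "big clusters"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(shuffle(mapped))
print(result["big"])  # 3
```

The point is that the map calls are independent, so adding machines adds I/O and compute in equal measure; only the (much smaller) intermediate results have to move across the network for the reduce step.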
- Finally, Hadoop is not a “replacement” for the RDBMS; it solves new kinds of data problems that SQL databases were never designed for. For financial systems, transactional data needs perfect integrity: if data is updated, then the very next query that references that data must see the updated data. On the other hand, if you add a friend on Facebook, it’s not the end of the world if a split second later your friend count still shows 467 instead of 468. Further, data like webpages, profiles, and tweets is not highly structured. Such data should be stored differently because it is accessed and processed differently. Furthermore, such data tends to be read orders of magnitude more often than it is written (think about a typical webpage; heck, a tweet can never be updated at all). So instead of making reads expensive, let’s make writes expensive for such applications.
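The “expensive writes, cheap reads” trade-off can be sketched in a few lines. This is a toy illustration of the friend-count example above (not Facebook’s actual design): the write does extra bookkeeping to maintain a precomputed count, so the far more frequent read never has to scan or count anything.

```python
class Profile:
    def __init__(self, name):
        self.name = name
        self.friends = set()
        self.friend_count = 0  # denormalized: maintained at write time

    def add_friend(self, other):
        # The write does extra work, updating a counter on both sides...
        if other.name not in self.friends:
            self.friends.add(other.name)
            self.friend_count += 1
        if self.name not in other.friends:
            other.friends.add(self.name)
            other.friend_count += 1

alice, bob = Profile("alice"), Profile("bob")
alice.add_friend(bob)
# ...so the read is a constant-time field lookup, not a count over rows.
print(alice.friend_count)  # 1
```

In an eventually consistent system, those two counter updates might land at slightly different times, which is exactly why your friend count can briefly read 467 instead of 468 without anything being broken.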
If this kind of stuff sounds interesting, take a few minutes and watch the video:
After watching this, if you’re ready to dive deeper, watch Cloudera’s Hadoop training video series, starting with this video on Thinking at Scale.