February 22, 2015 · hadoop distributed systems

The Fading of Hadoop

The MapReduce programming model was born at Google in 2004 and, thanks to the creation of Hadoop in 2006, it has long been the de facto model for processing big data. However, over the last few years new frameworks and projects have emerged that improve on its main weak points: complexity and slowness.


In 2009, to overcome the complexity of writing MapReduce jobs by hand, Yahoo built Pig and Facebook built Hive, and both projects are still widely used. They provide simple languages with a higher level of abstraction that compile down to MapReduce jobs, making it much easier to write applications and scripts, which can even be executed interactively. Cascading appeared some years later to simplify the development of MapReduce pipelines.
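To get a feel for how much these abstractions save, here is a minimal word-count sketch using Scalding, Twitter's Scala DSL built on top of Cascading. It is only an illustration: the "input" and "output" arguments are placeholder paths, and the same logic written as a raw MapReduce job would take considerably more code.

```scala
import com.twitter.scalding._

// A word count expressed as a Cascading flow through the Scalding DSL.
// The whole pipeline is compiled into MapReduce jobs on the cluster.
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))         // read lines from HDFS
    .flatMap { line => line.split("""\s+""") }    // split each line into words
    .filter { word => word.nonEmpty }
    .groupBy { word => word }                      // shuffle by word
    .size                                          // count occurrences per word
    .write(TypedTsv[(String, Long)](args("output")))
}
```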

The other main weakness of MapReduce is that it is too slow to power near real-time applications. To address that, many companies run a stream processing framework in parallel with a Hadoop cluster, in what is known as the Lambda Architecture. Others rely only on a stream processing framework backed by a distributed log, called the Kappa Architecture, though LinkedIn seems to be the only big company using it extensively.

But nowadays the cool kid in the distributed systems world is Spark. It surpasses Hadoop in speed by keeping computation in memory as much as possible, and it provides a high-level API, similar to Cascading's, that makes it much easier to develop on. It also lets you build stream processing applications, bridging the online and batch worlds.
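For comparison, here is a rough sketch of the same word count with Spark's Scala API of the time (the 1.x releases). The paths in args are placeholders; the cache() call is what keeps the computed counts in memory, so later actions reuse them instead of re-reading the input.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD operations such as reduceByKey

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    val counts = sc.textFile(args(0))      // lines from HDFS or local disk
      .flatMap(_.split("\\s+"))            // split each line into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                  // aggregate counts per word
      .cache()                             // keep the result in memory

    // Because counts is cached, both actions below reuse the in-memory RDD
    // instead of recomputing the whole pipeline from the input files.
    println(s"distinct words: ${counts.count()}")
    counts.saveAsTextFile(args(1))

    sc.stop()
  }
}
```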

All in all, it seems that the days of writing MapReduce jobs are coming to an end. Hadoop is being replaced by several tools and frameworks that complement and sometimes compete against each other. These are interesting times for building distributed applications.
