May 10, 2015 · hadoop data spark conferences

Strata + Hadoop World 2015, London

This week was the Strata + Hadoop World conference in London. With six simultaneous tracks I surely missed a lot of content, but I want to go over some of the main themes and the sessions I enjoyed most.


Getting Value out of Big Data

One of the biggest themes of the conference was how to get value out of big data. It seems that many companies have deployed Hadoop for their analytics systems but now struggle to get more value out of it. A few big companies also showed how they use Hadoop, but they're still at the beginning of the journey, doing ETL and joining data sets to get a complete view of the customer.

Some sessions that I enjoyed are It ain’t what you do to data, it’s what you do with it and The data strategy revolution: building an in-house data insights lab.

A number of talks highlighted the importance of the people and communication skills needed to break data silos, become a data-driven company and set up data teams.

Spark, Spark, Spark!

Many talks focused on Spark. Nobody doubts that it is Hadoop's successor, thanks to its more concise API, its speed and its ability to handle both streaming and batch problems. And thanks to it, many people are trying Scala for the first time.

Good sessions were Apache Spark: What's new; what's coming and Spark on Mesos, where Dean described the similarities and differences between YARN and Mesos.

Some Other Technical Bits

I also attended a few highly enjoyable technical sessions. In Deploying machine learning in production we were shown how to evaluate ML systems in production, along with some of the most common problems.

In What's there to know about A/B testing? Noel Welsh talked about running A/B tests, making it clear that it's much more complex than it seems: you must know how long you are going to run it, the size of the variants, and so on. Unfortunately it was a short 20-minute session.
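To get a feel for why run length and variant size matter so much, here is a rough sample-size calculation using the standard normal-approximation formula for comparing two conversion rates. This is my own sketch, not from the talk, and the rates in the example are made up:

```python
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a change in
    conversion rate from p1 to p2 (two-sided z-test, normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_b = NormalDist().inv_cdf(power)          # critical value for power
    p_bar = (p1 + p2) / 2
    n = ((z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
          + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
         / (p2 - p1) ** 2)
    return int(n) + 1

# Detecting a lift from a 5% to a 6% conversion rate:
n = sample_size_per_variant(0.05, 0.06)
```

For that modest one-point lift the formula asks for on the order of 8,000 visitors per variant, which is exactly why you have to decide the test's duration and traffic split before you start rather than peeking until the result looks good.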

Another enjoyable one was about using probabilistic data structures and algorithms like bloom filters and HyperLogLog to approximate metrics in near real-time using bounded space.
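As an illustration of the idea, here is a minimal Bloom filter in Python. It is only a sketch (the bit-array size and hash count are arbitrary, and real implementations use faster hashes than MD5), but it shows the core trade-off: fixed space, no false negatives, and a small chance of false positives:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: answers 'definitely not seen' or 'probably
    seen' in bounded space; it can yield false positives but never
    false negatives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Only 'probably seen' if every one of the k bits is set.
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))
```

A typical use is deduplication in a stream: `"user-123" in bf` after `bf.add("user-123")` is always True, while an item never added is almost certainly reported absent, all within a fixed 1 KB of memory here. HyperLogLog applies a similar hashing trick to estimate distinct counts rather than membership.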

Finally, in Information architecture for Apache Hadoop, Mark Samson from Cloudera described a logical information architecture that organises your batch data in layers, so analysts, developers and user-facing applications can use it in the form that is most useful to them.

Did you attend the conference? If so, which sessions did you find most interesting?
