Strata + Hadoop World 2016, London
A month ago I attended the Strata + Hadoop World conference in London. If I thought that last year's was big with 6 simultaneous tracks, imagine this year with 12! Here's my partial impression of the conference.
Spark Is the Name of the Game
It's hardly a surprise, but big data processing is nowadays dominated by Spark. There are still many companies using Hadoop, but no one starting today would choose it over Spark.
A session that generated a lot of interest was Spark 2.0: What’s next?. Most of the details can be found in this post about Spark 2.0 technical preview. To me the most interesting bits are the Dataset API, structured streaming and the performance improvements. They also showed what could come in Spark 2.1 and beyond:
Another talk I found interesting was Top five mistakes to avoid when using Apache Spark in production. Most of the recommendations are well known, but this is the kind of information that can save you a lot of pain when taking your cluster to production.
Streaming is Getting Big
The other big subject of the conference was streaming. In the message brokers area, Kafka continues to be the only option and there are no other contenders in sight. But the stream processing frameworks arena is getting interesting. Apart from Storm (which had no dedicated talks), we now have Spark Streaming, Kafka Streams, Apache Beam, Apache Flink and Google's proprietary Dataflow.
Beam, Flink and Dataflow are conceptually quite similar and have some benefits over Spark: they're true streaming frameworks (as opposed to mini-batches) and have built-in support for watermarks and late-arriving data. On the other hand, Spark currently has much more momentum and, as shown in the release plan slide above, will eventually get those features too. Kafka Streams tries to be much more lightweight and is suited to simpler use cases.
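To make the watermark idea concrete, here is a minimal plain-Python sketch (no framework involved, all names and numbers are made up for illustration): the watermark trails the highest event time seen so far by an allowed lateness, and events whose timestamp falls below it are dropped because their window is considered closed.

```python
from collections import defaultdict

def windowed_counts(events, window_size=10, allowed_lateness=5):
    """Count events per event-time window, discarding late arrivals.

    `events` is the stream in processing (arrival) order; each item is
    an event timestamp. The watermark trails the max event time seen
    by `allowed_lateness`; anything older than the watermark is late.
    """
    counts = defaultdict(int)
    dropped = []
    max_seen = float("-inf")
    for t in events:
        max_seen = max(max_seen, t)
        watermark = max_seen - allowed_lateness
        if t < watermark:
            dropped.append(t)  # too late: its window already closed
            continue
        window_start = (t // window_size) * window_size
        counts[window_start] += 1
    return dict(counts), dropped

# Events 2 and 4 arrive after the watermark has moved past them:
counts, dropped = windowed_counts([1, 3, 12, 14, 2, 25, 4])
# counts == {0: 2, 10: 2, 20: 1}, dropped == [2, 4]
```

In a true streaming engine the same bookkeeping also decides when a window's result can be emitted as final; a mini-batch system has to approximate this per batch.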
I enjoyed Tricia Wang's keynote about thick data and how context and stories are lost when aggregating data. I also liked the talk about the Parquet and Arrow columnar formats; I'm really looking forward to Arrow being adopted by the major frameworks. Apart from technical talks, there were quite a few about how companies used data to gain a competitive advantage. It's one thing to use the latest technologies, it's another thing to get value from them!
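A toy illustration of why columnar layouts matter (the records here are invented for the example): Parquet stores each column contiguously on disk and Arrow does the same in memory, so an aggregation over one field only has to touch that field's values instead of walking every record.

```python
# Row-oriented layout: one record after another; computing an average
# age still means visiting every record.
rows = [
    {"user": "a", "age": 31, "country": "UK"},
    {"user": "b", "age": 25, "country": "FR"},
    {"user": "c", "age": 40, "country": "UK"},
]

# Column-oriented layout: each field stored contiguously, which is the
# core idea behind Parquet (on disk) and Arrow (in memory).
columns = {
    "user": ["a", "b", "c"],
    "age": [31, 25, 40],
    "country": ["UK", "FR", "UK"],
}

# The aggregation reads only the "age" column:
avg_age = sum(columns["age"]) / len(columns["age"])  # 32.0
```

Contiguous columns also compress better (similar values sit together) and vectorize well, which is why an in-memory standard like Arrow would let frameworks exchange data without serialization overhead.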
I personally enjoyed the conference a lot, including the "Data After Dark" pub crawl. But from my point of view it could be improved by focusing more on quality rather than quantity. Some talks had overlapping content (there is only so much you can say about Spark), some had misleading or too generic abstracts and some felt a bit too much like a sales pitch. I understand all these technologies have companies behind them, but I believe the talks should give you some value even if you don't buy or use their product.
Did you attend the conference? Which sessions did you find most interesting?