October 19, 2014 · data quality, events, data

Data Quality in Event-Driven Architectures

With stream processing frameworks like Storm, Samza or Kinesis it's easier than ever to build near real-time, event-driven architectures that scale. It goes without saying that your code must be tested, but given that your code will be executed asynchronously by different systems, and that you might not even control the input of your pipeline... how do you make sure that the output is correct?

Obviously that will depend on the specifics of your data, but the effort you'll have to invest in ensuring its quality should be proportional to its value and to how long you have to keep it. Processing page views that will only be displayed on a screen is not the same as processing financial data that will be used for quarterly reports.

Keep the Raw Data

From collection to storage, events usually go through a series of processes that extract new features, filter them, join them with other bits of data, etc. All of these processes can (and will) have bugs at some point, so it's very important to keep the original raw data so that it can be reprocessed once the bugs have been fixed.

Along the same lines, in addition to keeping the raw data you should never modify it (you know, lowercasing this attribute, trimming that other one). Treat your data-processing pipeline as a function without side effects: it receives some data and outputs a different set of data, keeping the input unchanged.
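
For instance, a pipeline step could look like the following sketch. The event fields and the archive/store helpers are hypothetical stand-ins for your own storage:

```python
import copy
import json

def archive(payload: str) -> None:
    # hypothetical stand-in for durable raw storage (S3, HDFS, ...)
    print("archived:", payload)

def store(event: dict) -> None:
    # hypothetical stand-in for the downstream database write
    print("stored:", event)

def enrich(raw_event: dict) -> dict:
    """Return a new, enriched event; the raw input is never mutated."""
    event = copy.deepcopy(raw_event)                    # work on a copy
    event["email"] = event["email"].strip().lower()     # normalise the copy only
    event["is_mobile"] = "Mobi" in event.get("user_agent", "")
    return event

raw = {"id": "e-42", "email": " Alice@Example.com ", "user_agent": "Mozilla/5.0"}
archive(json.dumps(raw))      # keep the untouched raw payload
store(enrich(raw))            # only the derived copy goes downstream
assert raw["email"] == " Alice@Example.com "   # the input is left unchanged
```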

Use Idempotent Operations

Build your pipeline so that all of its operations can be repeated without the data ending up duplicated or corrupted. That way you can, for example, reprocess your raw data after finding a bug in your pipeline without ending up with duplicated rows in your database.

This is also a requirement if you use a message broker like Kafka or RabbitMQ with acknowledgements: they have "at least once" delivery semantics, meaning that occasionally you will receive a message more than once.

To do so, make sure that all events have an ID generated at the source, not during the ETL process. That way duplicates in your database become impossible, or at least easy to spot.
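
A minimal sketch of the idea with SQLite (the table and field names are just an illustration): using the source-generated ID as the primary key means a redelivered or reprocessed event simply overwrites its own row.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (event_id TEXT PRIMARY KEY, amount REAL)")

def process(event: dict) -> None:
    # Upsert keyed on the source-generated ID: replaying the same event
    # (redelivery, reprocessing) overwrites the row instead of duplicating it.
    db.execute(
        "INSERT OR REPLACE INTO sales (event_id, amount) VALUES (?, ?)",
        (event["id"], event["amount"]),
    )

event = {"id": "sale-7f3a", "amount": 99.0}   # hypothetical event, ID minted by the producer
process(event)
process(event)                                # delivered twice: still a single row
assert db.execute("SELECT COUNT(*) FROM sales").fetchone()[0] == 1
```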

Make the Data Traceable

To help debug and fix data issues, it's important to know where an event comes from and which transformations have been applied to it. So, in addition to a unique ID, an event should carry a schema version, the name and version of the library that produced it, and the same for each library that transformed it.
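
An event envelope along these lines could look like this (the field names are only an illustration):

```python
# Hypothetical envelope; adapt the fields to your own domain.
event = {
    "id": "b6f1c2e0-5a70-4f6b-9d2e-3c8a1f4e9b21",   # unique ID minted by the producer
    "schema_version": "2.1",                         # version of the event schema
    "producer": {"name": "js-tracker", "version": "0.9.4"},
    "transformations": [                             # appended by each step of the pipeline
        {"name": "geo-enricher", "version": "1.3.0"},
        {"name": "sessionizer", "version": "0.7.2"},
    ],
    "payload": {"type": "page_view", "url": "/pricing"},
}
```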

Using this approach, when you find an incomplete event it's easy to see that it was generated months ago by a specific version of a JavaScript tracker, and then you can go to your source code repository to understand how that happened. Moreover, you will be able to find all the events that were generated with that version of the tracker and fix them.

Don't Derive Data...

that the producer already has. For instance, if a "sale event" of more than X is considered premium by the web app, it would be a mistake to have a rule in the ETL process that flags the event as premium. First, because you would have the logic split across two different places, and second, because if X changes you wouldn't be able to reprocess past events without checking when each of them occurred.

This applies to most of an app's business logic. It is much easier for the producer, which already has this information, to add it to the event.
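
As a sketch, the producer would attach the flag at emit time; the threshold and field names here are hypothetical:

```python
PREMIUM_THRESHOLD = 500.0   # "X" lives in the web app, next to the rest of the business logic

def build_sale_event(order: dict) -> dict:
    # The producer attaches the derived flag when the event is emitted, so the
    # ETL never has to know (or guess) what the threshold was at that time.
    return {
        "id": order["id"],
        "amount": order["amount"],
        "premium": order["amount"] > PREMIUM_THRESHOLD,
    }

print(build_sale_event({"id": "sale-1", "amount": 750.0}))   # {'id': 'sale-1', 'amount': 750.0, 'premium': True}
```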

Monitoring and Alerts

No matter how careful we are, eventually machines will fail or we'll introduce bugs in our systems. When this happens, you want to be able to detect it as soon as possible or, even better, to have alerts that notify you.

There are many tools out there that you can integrate with your system, but I would recommend Elasticsearch with Kibana. It allows you to search your events, and you can also create sophisticated dashboards that take their structure into account. For instance, you could have a graph that tracks page views, sales and deploys to production so you can spot regressions in near real-time.
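
A minimal sketch of how events might be indexed, assuming the official elasticsearch Python client (exact argument names vary between client versions, and the index name and fields here are hypothetical):

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch   # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

event = {
    "id": "sale-7f3a",
    "type": "sale",
    "amount": 750.0,
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# Using the source-generated ID as the document ID also keeps indexing idempotent:
# re-sending the same event overwrites the same document instead of creating a new one.
es.index(index="events", id=event["id"], body=event)
```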

Communication & KISS

As in any other software project, having a clear, shared and documented domain model across the organisation helps a lot to reduce bugs, misunderstandings and frustration. After all, data is not of much value if it isn't easy to understand and put to use!

