Data Quality in Event-Driven Architectures
With stream processing frameworks like Storm, Samza or Kinesis, it's easier than ever to build near real-time, event-driven architectures that scale. It goes without saying that your code must be tested, but given that it will be executed asynchronously by different systems, and that you might not even control the input of your pipeline... how do you make sure that the output is correct?
Obviously that will depend on the specifics of your data, but the effort you'll have to invest in ensuring its quality will be proportional to its value and to how long you have to keep it. Processing page views that will only ever be displayed on a screen is not the same as processing financial data that will be used for quarterly reports.
Keep the Raw Data
From collection until storage, events usually go through a series of processes that extract new features, filter them, join them with other bits of data, and so on. All of these processes can (and will) have bugs at some point, so it's very important to keep the original raw data in order to reprocess it once the bugs have been fixed.
Along the same lines, in addition to keeping the raw data, you should never modify it (you know, lowercasing this attribute, trimming that one). Treat your data-processing pipeline as a function without side effects: it receives some data and outputs a different set of data, leaving the input unchanged.
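As a minimal sketch of that idea (the email-normalisation step here is a hypothetical transformation, not from the original text), a pipeline stage can build and return new events while the raw input stays byte-for-byte intact:

```python
import copy

def normalize(events):
    """Pure transformation step: returns new events, never mutates the input."""
    output = []
    for event in events:
        new_event = copy.deepcopy(event)  # work on a copy, not the original
        # hypothetical cleanup: trim and lowercase an email attribute
        new_event["email"] = event["email"].strip().lower()
        output.append(new_event)
    return output

raw = [{"id": "e1", "email": "  Alice@Example.COM "}]
clean = normalize(raw)

assert raw[0]["email"] == "  Alice@Example.COM "  # raw data untouched
assert clean[0]["email"] == "alice@example.com"   # cleaned copy for downstream
```

Because `raw` is never modified, a bug found in `normalize` later can be fixed and the same input replayed through the corrected function.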
Use Idempotent Operations
Build your pipeline so that all operations can be repeated without the data ending up duplicated or corrupted. That way, for example, you can reprocess your raw data after fixing a bug in your pipeline without ending up with duplicated rows in your database.
This is also a requirement if you use a message broker like Kafka or RabbitMQ with acknowledgements: they have "at least once" delivery semantics, meaning that occasionally you will receive the same message more than once.
To achieve this, make sure that all events have an ID generated at the source, not during the ETL process. That way it will be impossible to end up with duplicates in your database, or at least they will be easy to spot.
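The pattern can be sketched with a toy store (a stand-in for a real database upsert keyed on the event ID): writing the same event twice, as an at-least-once broker may deliver it, leaves exactly one row.

```python
class EventStore:
    """Toy store illustrating idempotent writes: the source-generated
    event ID is the primary key, so replays and redeliveries are no-ops."""

    def __init__(self):
        self._rows = {}

    def upsert(self, event):
        # keyed by the ID minted at the source, not during ETL
        self._rows[event["id"]] = event

    def count(self):
        return len(self._rows)

store = EventStore()
event = {"id": "evt-42", "type": "sale", "amount": 10}
store.upsert(event)
store.upsert(event)  # redelivery from an "at least once" broker
assert store.count() == 1  # still a single row
```

In a real database the same effect comes from an upsert (e.g. `INSERT ... ON CONFLICT` in PostgreSQL) on the event ID column.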
Make the Data Traceable
To help debug and fix data issues, it's important to know where an event comes from and which transformations have been applied to it. So, in addition to a unique ID, an event should carry a schema version, the name and version of the library that produced it, and the same for each library that transformed it.
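One possible shape for such an envelope (the field names here are illustrative, not a standard): the producer stamps the event at creation, and each pipeline stage appends its own name and version without mutating the event it received.

```python
import uuid

def make_event(payload, producer, producer_version, schema_version):
    """Wrap a payload in an envelope carrying trace metadata."""
    return {
        "id": str(uuid.uuid4()),  # unique ID minted at the source
        "schema_version": schema_version,
        "producer": {"name": producer, "version": producer_version},
        "transformations": [],    # one entry appended per pipeline stage
        "payload": payload,
    }

def record_transformation(event, name, version):
    """Return a new event stamped with the stage that processed it."""
    stamped = dict(event)
    stamped["transformations"] = event["transformations"] + [
        {"name": name, "version": version}
    ]
    return stamped

event = make_event({"page": "/home"}, "webapp", "2.3.1", "1")
event = record_transformation(event, "geo-enricher", "0.9.0")

assert event["producer"]["name"] == "webapp"
assert event["transformations"] == [{"name": "geo-enricher", "version": "0.9.0"}]
```

When a bad row shows up downstream, this trail tells you which producer and which version of which transformation touched it.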
Don't Derive Data...
that the producer already has. For instance, if a "sale event" of more than X is considered premium by the web app, it would be a mistake to have a rule in the ETL process that flags the event as premium: first, because the logic would be split across two different places; and second, because if X changes you wouldn't be able to reprocess past events without checking which threshold applied when each event occurred.
This applies to most of the business logic of an app. It is much easier for the producer, which already has this information, to add it to the event.
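A sketch of the right split (threshold value and field names are hypothetical): the producer evaluates its own business rule and ships the result inside the event, so the ETL stores the flag instead of re-deriving it.

```python
PREMIUM_THRESHOLD = 1000  # "X": a business rule owned by the producer

def build_sale_event(amount):
    """The producer, which owns the rule, flags the sale itself.
    The flag travels with the event, so even if the threshold changes
    later, past events keep the value that was true when they occurred."""
    return {
        "type": "sale",
        "amount": amount,
        "premium": amount > PREMIUM_THRESHOLD,  # derived at the source
    }

assert build_sale_event(1500)["premium"] is True
assert build_sale_event(200)["premium"] is False
```

The ETL process then treats `premium` as plain data, and reprocessing old events never requires knowing the historical thresholds.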
Monitoring and Alerts
No matter how careful we are, eventually machines will fail or we'll introduce bugs into our systems. When this happens, you want to be able to detect it as soon as possible or, even better, to have alerts that notify you.
There are many tools out there that you can integrate with your system, but I would recommend Elasticsearch with Kibana. It lets you search your events, and you can also create sophisticated dashboards that take their structure into account. For instance, you could have a graph that tracks page views, sales and deploys to production, so you can spot regressions in near real-time.
Communication & KISS
As in any other software project, having a clear, shared and documented domain model across the organisation helps a lot to reduce bugs, misunderstandings and frustration. After all, data is not of much value if it isn't easy to understand or to put to use!