April 12, 2015 · data quality · events · data

Semantic Anomaly Detection

Data streams and event-driven architectures are quite common, but ensuring the quality of the data in the events is hard. And it gets worse the more event types you have, especially if you also ingest events from third-party providers.

One technique that is becoming popular for detecting changes in the event bus as soon as possible is anomaly detection. The good news is that there are a number of open source tools for it, built by big Internet companies. But those frameworks only take into account the number of events: if the volume suddenly increases we might need to spin up more machines, and if it drops something might be broken. That information is valuable, particularly for ops, but we should also be concerned about the content of the events, which can change at any point and make the consumers misbehave.

[Image: anomaly detection in a time series]
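At their simplest, those volume checks boil down to comparing the current event count against recent history. A minimal sketch of the idea, assuming events have already been bucketed into per-minute counts (the window size and threshold are arbitrary choices for illustration):

from collections import deque
from statistics import mean, stdev

WINDOW = 60       # how many recent buckets to compare against
THRESHOLD = 3.0   # how many standard deviations count as anomalous

history = deque(maxlen=WINDOW)

def is_volume_anomaly(count):
    """Return True if this bucket's count is far from the recent average."""
    if len(history) < WINDOW:
        history.append(count)  # still warming up, never alert
        return False
    avg, std = mean(history), stdev(history)
    history.append(count)
    if std == 0:
        return count != avg
    return abs(count - avg) / std > THRESHOLD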

The main way of validating the content of events is with schemas: XML schemas (urgh!) or serialization protocols that use schemas, like Avro or Protocol Buffers. These define the contract that consumers and producers agree on in order to communicate. But schemas cannot express all the restrictions and patterns that might hold in an event stream.
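For instance, a hypothetical Avro record schema for the events below could pin down field names, types and nullability, but nothing about the values themselves:

{
  "type": "record",
  "name": "item_purchased",
  "fields": [
    {"name": "item_id", "type": "string"},
    {"name": "purchase_counter", "type": "long"},
    {"name": "customer_country", "type": "string"},
    {"name": "customer_phone_prefix", "type": "string"},
    {"name": "customer_requests", "type": ["null", "string"]}
  ]
}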

Let's see some examples of those patterns using item_purchased events:

{
  "schema": "item_purchased/1-0-0",
  "data": {
    "item_id": "D6KML",
    "purchase_counter": 11,
    "customer_country": "UK",
    "customer_phone_prefix": "44",
    "customer_requests": null
  }
}

{
  "schema": "item_purchased/1-0-0",
  "data": {
    "item_id": "9NDP4",
    "purchase_counter": 12,
    "customer_country": "US",
    "customer_phone_prefix": "1",
    "customer_requests": null
  }
}

{
  "schema": "item_purchased/1-0-0",
  "data": {
    "item_id": "HL34J",
    "purchase_counter": 13,
    "customer_country": "UK",
    "customer_phone_prefix": "44",
    "customer_requests": "Gift wrap it please!"
  }
}

Some of the restrictions and patterns in this stream that cannot be represented using schemas are:

- purchase_counter increases monotonically, one unit per event.
- customer_country and customer_phone_prefix are correlated: UK always comes with 44, and US with 1.
- customer_requests is null most of the time.
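To make the second one concrete, here is a naive sketch of checking the correlation by learning the mapping from the stream itself (field names come from the examples above; a real detector would need a smarter policy for legitimately new values):

from collections import defaultdict

# Phone prefixes seen so far with each country.
prefixes_by_country = defaultdict(set)

def check_correlation(event):
    """Alert when a country appears with a prefix it has never had before."""
    data = event["data"]
    country = data["customer_country"]
    prefix = data["customer_phone_prefix"]
    known = prefixes_by_country[country]
    if known and prefix not in known:  # no alert on a country's first sighting
        print("anomaly: %s seen with new phone prefix %s" % (country, prefix))
    known.add(prefix)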

And even if some restrictions can be specified, they might be unknown at the time of defining the schema, like patterns in individual fields: item_id has length 5, customer_country follows the regular expression [A-Z]{2}, and customer_phone_prefix is a positive integer, stringly typed.

I'm prototyping a tool that automatically detects these patterns in event streams and sends an alert as soon as it detects a change. Searching online, I haven't found any project or tool that does something similar, only a paper [pdf] from 2002 that focuses on correlating numeric data across different streams.
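To make the idea concrete, here is a minimal sketch of how such a tool could learn per-field patterns and flag violations. This is illustrative code, not the actual prototype, and it only includes the three candidate patterns mentioned above:

import re

# Candidate per-field patterns; a real tool would generate many more.
CANDIDATES = {
    "fixed_length_5": lambda v: isinstance(v, str) and len(v) == 5,
    "two_letter_code": lambda v: isinstance(v, str)
        and re.fullmatch(r"[A-Z]{2}", v) is not None,
    "positive_int_string": lambda v: isinstance(v, str)
        and v.isdigit() and int(v) > 0,
}

def learn_patterns(events):
    """Keep, for each field, only the candidates every non-null value satisfies."""
    patterns = {}
    for event in events:
        for field, value in event["data"].items():
            if value is None:
                continue
            current = patterns.get(field, set(CANDIDATES))
            patterns[field] = {n for n in current if CANDIDATES[n](value)}
    return patterns

def violations(event, patterns):
    """Yield (field, pattern) pairs that a new event breaks; candidates for an alert."""
    for field, value in event["data"].items():
        if value is None:
            continue
        for name in patterns.get(field, ()):
            if not CANDIDATES[name](value):
                yield field, name

Trained on the three events above, it learns fixed_length_5 for item_id, two_letter_code for customer_country and positive_int_string for customer_phone_prefix, so an event with, say, a six-character item_id would then be reported.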

I would love to get some feedback! Do you know of any tool or project that does this? Would you find such a project valuable? Did I miss any other kind of semantic anomaly?
