October 11, 2015 · aws distributed systems

Building Kinesis Apps with KCL

Let's see how to build applications that consume records from Kinesis using the Amazon's Kinesis Consumer Library (KCL).

Kinesis Abstractions

Before starting, we need to understand the main Kinesis concepts:

And the KCL abstractions:

Tying It Together: An Example

Let's say that we send raw page view events to a stream. We have an app that counts them and every five minutes stores the number of page views in a database. We also have another app that reads the raw events and enriches them, resolving the user geolocation from the IP. It then writes those enriched events to another stream so other apps can process them further.

Each app will read from the raw stream independently, at its own speed.

Simplified view

How does that view change if we know that the raw stream has two shards? Each app would need two workers, one per shard. And because we need two shards for the raw stream, the enriched stream will also need two shards to handle the volume of records.

The workers of the enricher app will write to both shard 1 and 2 of the enriched stream, depending on the value of the partition key of the record they write.

View with shards

Physical Architecture

You have to create your own cluster of machines to run the application processes, perhaps using an auto scaling group. In our example, we could have two instances to maintain availability and run both apps in each. KCL will assign shards to a worker in each app, regardless of the instance in which they run. It will also detect if a worker becomes irresponsive and will assign its shard to an available worker of the same app.

Physical view

Although not in the diagram, workers need to have access to DynamoDB. There they store the shard that they're processing and the processed offset. This way, if a worker goes down another one can continue processing the shard from the last saved offset.

Writing the App

KCL apps can be written in a number of languages, including Java, Python, Ruby and Scala. That said, all of them follow the same principles. The main logic of the application resides in a RecordProcessor class, which has three methods:

Besides implementing your RecordProcessor, you also have to configure the app. Things like how often do you want to check for new records, the name of the stream you are reading from, etc.

Sum Up

Writing Kinesis apps it's quite simple, but it has some important differences compared to other stream processing frameworks like Storm or Samza.

Kinesis apps should be fully featured by themselves, not parts of an app (like Storm's bolts). So avoid creating a complex graph of apps, because it's quite tedious to create streams and configure the apps to read and write to them. Imagine if you want to test them locally or have different environments.

The other big difference is that with KCL you have to take care of where and how to deploy the apps. Amazon should make it easy to create Hadoop-like clusters, where you could submit the apps and the cluster would deploy and run the apps automatically.

Comments powered by Disqus