Kafka vs. Kinesis
Kafka and Kinesis are message brokers designed as distributed logs: you can only append entries at the end of the log and read them sequentially. You cannot remove or update entries, nor insert new ones in the middle of the log.
This simple design gives distributed logs a really interesting set of characteristics. Because reads and writes are sequential, they perform much better than other message brokers. And because the log is persistent, you can reprocess it as many times as you need.
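The append-only contract described above can be sketched in a few lines of Python. This is a toy illustration of the abstraction, not either system's actual API; all names here are made up:

```python
class AppendOnlyLog:
    """Toy distributed-log abstraction: append at the end, read sequentially."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        """Writes are only allowed at the end; returns the entry's offset."""
        self._entries.append(entry)
        return len(self._entries) - 1

    def read_from(self, offset):
        """Sequentially read every entry from a given offset onwards."""
        yield from self._entries[offset:]

    # Deliberately no update(), delete(), or insert_at(): the log is immutable.


log = AppendOnlyLog()
log.append("user-signed-up")
log.append("user-clicked")
assert list(log.read_from(0)) == ["user-signed-up", "user-clicked"]
# Because the log is persistent, re-reading (reprocessing) from any offset
# yields the same entries again.
assert list(log.read_from(1)) == ["user-clicked"]
```

The absence of update and delete operations is what makes sequential I/O and cheap reprocessing possible in the first place.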
But even though Kafka and Kinesis are very similar (Kinesis was, to put it mildly, inspired by Kafka), they differ in many aspects.
Configuration & Features
Although both tools are conceptually simple and don't have many features, Kinesis couldn't be simpler. And that's a really good thing. The only options you can tune are the number of shards (throughput) and the number of days you want to keep the data (7 at most).
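To make that small tuning surface concrete, here is a sketch of everything there is to configure. The function name is hypothetical; with boto3 these two values would map to the `create_stream` and `increase_stream_retention_period` calls:

```python
def kinesis_stream_settings(shard_count, retention_hours=24):
    """The entire Kinesis tuning surface: shard count and retention.

    With boto3, ShardCount is passed to create_stream, and retention is
    raised from the 24-hour default with increase_stream_retention_period.
    """
    if shard_count < 1:
        raise ValueError("a stream needs at least one shard")
    if not 24 <= retention_hours <= 7 * 24:
        raise ValueError("retention must be between 1 and 7 days (24-168 hours)")
    return {"ShardCount": shard_count, "RetentionPeriodHours": retention_hours}


settings = kinesis_stream_settings(shard_count=4, retention_hours=168)
assert settings == {"ShardCount": 4, "RetentionPeriodHours": 168}
```

That's it: throughput and retention. Compare that with the per-broker and per-topic configuration Kafka requires.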
Kafka, simply because you host it yourself, requires more configuration. You'll need to configure each node, define where the data is stored, have a ZooKeeper cluster up and running, etc. The documentation is great, but you'll have to spend time understanding the details and implications of the options.
In terms of features, Kafka has a log compaction mode that keeps only the latest entry for each key: appending a new entry with the same key effectively updates the previous one. This is useful when you want to keep the last value for each key and don't care about the previous ones.
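The idea behind compaction can be illustrated with a toy sketch: given a keyed log, keep only the latest value per key. This mirrors the concept, not Kafka's actual implementation:

```python
def compact(log_entries):
    """Keep only the last value seen for each key, discarding earlier ones.

    log_entries is an iterable of (key, value) pairs in log order.
    """
    latest = {}
    for key, value in log_entries:
        latest[key] = value  # a later entry for the same key wins
    return latest


entries = [("user:1", "alice"), ("user:2", "bob"), ("user:1", "alicia")]
assert compact(entries) == {"user:1": "alicia", "user:2": "bob"}
```

In Kafka the compacted log is still a log (the surviving entries keep their offsets); the dictionary here only captures the "last value per key wins" semantics.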
Reliability
In Kafka you can configure, per topic, the replication factor and how many replicas have to acknowledge a message before it is considered successful. The number of nodes and where you run them is also up to you. So you can definitely make Kafka highly available, but you'll have to make sure that it is.
In contrast, Kinesis writes every message synchronously to 3 different data centers (availability zones) before the write is considered successful. Amazon guarantees that you won't lose data, but that guarantee comes at a performance cost.
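The acknowledgment trade-off can be sketched as follows (the function name is hypothetical): Kafka lets you choose how many acknowledgments a write needs, while Kinesis effectively fixes it at all three availability zones:

```python
def write_succeeds(replica_acks, required_acks):
    """A write is successful once enough replicas have acknowledged it.

    replica_acks: list of booleans, one per replica, True if it acked.
    required_acks: how many acknowledgments the configuration demands.
    """
    return sum(replica_acks) >= required_acks


# Kafka-style: 3 replicas, but the topic is configured to require only 2 acks,
# so a write can succeed even while one replica lags behind.
assert write_succeeds([True, True, False], required_acks=2)

# Kinesis-style: all 3 availability zones must ack before the write succeeds.
assert not write_succeeds([True, True, False], required_acks=3)
assert write_succeeds([True, True, True], required_acks=3)
```

Requiring fewer acknowledgments lowers write latency but widens the window in which an unacknowledged replica can lose data; requiring all of them does the opposite, which is the performance cost mentioned above.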
Performance
There are several benchmarks online comparing Kafka and Kinesis, but the result is always the same: you'll have a hard time replicating Kafka's performance in Kinesis, at least at a reasonable price.
This is partly because Kafka is insanely fast, but also because Kinesis writes each message synchronously to 3 different machines, which is quite costly in terms of latency and throughput.
Integrations
Kafka is one of the preferred options for Apache stream-processing frameworks such as Storm, Samza, and Spark Streaming. And thanks to Kafka Connect, it's also quite easy to add connectors that move data in and out of Kafka.
Unsurprisingly, Kinesis is really well integrated with other AWS services. Using Kinesis Firehose you can automatically persist data from Kinesis into S3 or Redshift. It also has adapters for Storm and Spark Streaming, as well as Amazon's Kinesis Client Library and Lambda for building consumer applications.
Conclusions
Despite the similarities, it's clear that Kafka and Kinesis should be used in different contexts. Kafka's main selling points are performance and integration with other big data tools. The downside is that, although Kafka is very stable, you'll need someone who knows the tool in depth in case something goes wrong. You shouldn't take risks with infrastructure that stores data, such as databases and distributed logs.
Kinesis' strengths are simplicity and built-in reliability: you just create a stream and let Amazon make sure the data won't be lost. The downside is that streams can only store records for 7 days, so you'll have to copy them somewhere else if you want to keep them for longer. Throughput is also not as good as Kafka's. These disadvantages make Kinesis a poor choice if you want to build a Kappa architecture.
In a nutshell, Kafka is a better option if:
- You have the in-house knowledge to maintain Kafka and ZooKeeper
- You need to process more than a few thousand events per second
- You don't want to integrate it with AWS services
Kinesis works best if:
- You don't have the in-house knowledge to maintain Kafka
- You process a few thousand events per second at most
- You stream data into S3 or Redshift
- You don't want to build a Kappa architecture