The Rise of Event-Driven Stream Processing

Event-driven architectures have become a popular design pattern amid increasing demands for resiliency and scalability. The traditional approach to communication between services has been to use REST APIs, but this introduces undesirable side effects. The client must wait for the server to return a response or time out. This tightly couples the distributed services, since the client must know which server will process its request. There is also no failover: a service that goes down misses all incoming requests, and retrying those requests with exponential backoff is expensive. An event-driven architecture instead uses a messaging system, where producers send events without being aware of consumers. The result is a decoupled system in which multiple consuming services can read a single message, and a consumer that goes down can receive the missed messages once it comes back up.

This article examines and compares the performance of two popular messaging systems: Kafka and Google Cloud Pub/Sub. Both are stream processing frameworks that enable data to be shared across applications based on events. Every effort was made to ensure as fair a comparison as possible, but the fundamental differences between the two technologies mean that some limitations are inevitable. Pub/Sub is a managed service, so it scales up and down with demand, whereas Kafka cluster configurations are typically self-managed and therefore static. This means a performance test whose load overwhelms the Kafka cluster will likely produce results that favor Pub/Sub, unless you automate the deployment and scaling of your Kafka cluster(s), for example through Kubernetes or a managed Kafka offering. Keep this caveat in mind when reading the test results; we nonetheless believe this article and its results can inform your choice of stream processing framework.

The Testing Environment

Performance will be measured according to two metrics: 1) throughput and 2) consumer lag. Throughput is the volume of messages published per second. Consumer lag is the time delay between the production and consumption of a message; it can also be measured as the number of messages by which the consumer trails the producer at a given instant (for example, if the producer has written 10,000 messages to the topic and the consumers have processed 9,000, the lag is 1,000 messages). Together, these metrics give an idea of the load each framework can handle and how quickly it delivers messages to consumers.

For Kafka, the test setup is a single-instance Java producer/consumer application deployed to Pivotal Cloud Foundry (PCF). It exposes an endpoint from which messages are published to Kafka via the Spring Cloud Stream library, and a consumer class subscribes to the topic, which has 3 partitions and a consumer group of 3 consumers. The Java application for testing Pub/Sub uses Spring Integration channel adapters, where the producer and consumer are defined as beans instead of classes. Illustrative sketches of both wirings are shown below.
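The Kafka side might be wired roughly like the following minimal sketch, using the Spring Cloud Stream functional model; the endpoint path, binding and topic names, and class names are illustrative placeholders rather than the actual test application's code.

```java
// Sketch of the Kafka producer/consumer wiring with Spring Cloud Stream.
// "loadtest-topic" and "/publish" are placeholders; the consumer binding's
// destination and group are set in application.yml, e.g.
// spring.cloud.stream.bindings.messageConsumer-in-0.destination=loadtest-topic
import java.util.function.Consumer;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.stream.function.StreamBridge;
import org.springframework.context.annotation.Bean;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
public class KafkaLoadTestApp {

    public static void main(String[] args) {
        SpringApplication.run(KafkaLoadTestApp.class, args);
    }

    // Consumer bound to the topic (3 partitions, consumer group of 3).
    @Bean
    public Consumer<String> messageConsumer() {
        return message -> System.out.println("Consumed: " + message);
    }
}

@RestController
class PublishController {

    private final StreamBridge streamBridge;

    PublishController(StreamBridge streamBridge) {
        this.streamBridge = streamBridge;
    }

    // Endpoint hit by JMeter; forwards the JSON payload to the Kafka topic.
    @PostMapping("/publish")
    public ResponseEntity<Void> publish(@RequestBody String payload) {
        streamBridge.send("loadtest-topic", payload);
        return ResponseEntity.accepted().build();
    }
}
```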

Figure 1: Architecture of Java producer/consumer application
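On the Pub/Sub side, the channel adapter wiring might look roughly as follows, a sketch assuming the Spring Cloud GCP integration module; the topic and subscription names are placeholders.

```java
// Sketch of Pub/Sub producer/consumer beans using Spring Integration channel
// adapters from Spring Cloud GCP. "loadtest-topic" and "loadtest-sub" are
// placeholder names.
import com.google.cloud.spring.pubsub.core.PubSubTemplate;
import com.google.cloud.spring.pubsub.integration.inbound.PubSubInboundChannelAdapter;
import com.google.cloud.spring.pubsub.integration.outbound.PubSubMessageHandler;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.annotation.MessagingGateway;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.integration.channel.DirectChannel;
import org.springframework.messaging.MessageChannel;
import org.springframework.messaging.MessageHandler;

@Configuration
public class PubSubConfig {

    // Channel carrying messages pulled from the subscription.
    @Bean
    public MessageChannel pubsubInputChannel() {
        return new DirectChannel();
    }

    // Inbound adapter: pulls from the subscription and forwards messages
    // onto the input channel.
    @Bean
    public PubSubInboundChannelAdapter inboundAdapter(
            PubSubTemplate pubSubTemplate, MessageChannel pubsubInputChannel) {
        PubSubInboundChannelAdapter adapter =
                new PubSubInboundChannelAdapter(pubSubTemplate, "loadtest-sub");
        adapter.setOutputChannel(pubsubInputChannel);
        return adapter;
    }

    // Consumer bean: handles each message delivered to the input channel.
    @Bean
    @ServiceActivator(inputChannel = "pubsubInputChannel")
    public MessageHandler messageReceiver() {
        return message -> System.out.println(
                "Consumed: " + new String((byte[]) message.getPayload()));
    }

    // Producer bean: publishes anything on the output channel to the topic.
    @Bean
    @ServiceActivator(inputChannel = "pubsubOutputChannel")
    public MessageHandler messageSender(PubSubTemplate pubSubTemplate) {
        return new PubSubMessageHandler(pubSubTemplate, "loadtest-topic");
    }
}

// Gateway injected into the REST controller to drop payloads onto the
// output channel.
@MessagingGateway(defaultRequestChannel = "pubsubOutputChannel")
interface PubSubGateway {
    void publish(String payload);
}
```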

The Test and Results

The load is provided by JMeter running on a developer machine. The test uses 1000 threads and hits the controller endpoint with a small JSON message of around 250 bytes for 5 minutes. The JMeter setup is shown below. Note that it is important to check the “Same user on each iteration” box so that the cache is not cleared on each thread group iteration; otherwise, the performance of the REST endpoint degrades, limiting the throughput observed on the messaging framework.

Figure 2: JMeter test parameters

The Kafka control centre provides graphs of production and consumption throughput over time, in bytes per second.

Figure 3: Kafka load test, production throughput

Figure 4: Kafka load test, consumption throughput

After the initial ramp-up, the throughput peaked at 466.614 KB/s.

Pub/Sub metrics can be gathered from Cloud Monitoring. Throughput is represented by the metrics topic/byte_cost and subscription/byte_cost.
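Besides viewing these series in the console, they could also be pulled programmatically; a minimal sketch, assuming the google-cloud-monitoring Java client and a placeholder project id:

```java
// Sketch: reading the Pub/Sub publisher-throughput metric from Cloud
// Monitoring. "my-project" is a placeholder project id.
import com.google.cloud.monitoring.v3.MetricServiceClient;
import com.google.monitoring.v3.ListTimeSeriesRequest;
import com.google.monitoring.v3.ProjectName;
import com.google.monitoring.v3.TimeInterval;
import com.google.monitoring.v3.TimeSeries;
import com.google.protobuf.util.Timestamps;

public class PubSubThroughput {

    public static void main(String[] args) throws Exception {
        long now = System.currentTimeMillis();
        try (MetricServiceClient client = MetricServiceClient.create()) {
            ListTimeSeriesRequest request = ListTimeSeriesRequest.newBuilder()
                    .setName(ProjectName.of("my-project").toString())
                    .setFilter("metric.type=\"pubsub.googleapis.com/topic/byte_cost\"")
                    // Look at the five-minute load-test window.
                    .setInterval(TimeInterval.newBuilder()
                            .setStartTime(Timestamps.fromMillis(now - 5 * 60 * 1000))
                            .setEndTime(Timestamps.fromMillis(now))
                            .build())
                    .setView(ListTimeSeriesRequest.TimeSeriesView.FULL)
                    .build();
            for (TimeSeries series : client.listTimeSeries(request).iterateAll()) {
                series.getPointsList().forEach(point ->
                        System.out.printf("%s -> %d bytes%n",
                                Timestamps.toString(point.getInterval().getEndTime()),
                                point.getValue().getInt64Value()));
            }
        }
    }
}
```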

Figure 5: Pub/Sub load test, publisher throughput

Figure 6: Pub/Sub load test, subscriber throughput

The throughput on the topic peaked at 430.95 KB/s, and the throughput on the pull subscription (the pink line in Figure 6) was slightly lower, likely due to some latency in delivering messages to the subscription. There does not appear to be any significant difference in throughput between the two frameworks. This is likely because throughput was limited on the client side: the developer machine running JMeter does not have enough threads to send 1000 requests at once and must time-slice the available threads to simulate the load. This test is certainly not pushing the envelope of the maximum throughput supported by Kafka and Pub/Sub; in fact, GCP claims Pub/Sub can reach 2 GB/s of publisher throughput in large regions. It appears both Kafka and Pub/Sub can easily handle the load used in this test.

The Kafka control centre also monitors consumer lag in its metrics section, as a single value that updates in real time.

Figure 7: Consumer lag in Kafka control centre
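The value in Figure 7 can also be sampled programmatically, for example with the Kafka AdminClient; a minimal sketch, assuming a placeholder broker address and consumer group id:

```java
// Sketch: computing consumer lag (log-end offset minus committed offset)
// per partition with the Kafka AdminClient. "localhost:9092" and
// "loadtest-group" are placeholders.
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagSampler {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the consumer group has committed so far.
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("loadtest-group")
                    .partitionsToOffsetAndMetadata()
                    .get();
            // Latest (log-end) offsets for the same partitions.
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest = admin
                    .listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                    .all()
                    .get();
            // Lag per partition = log-end offset - committed offset.
            committed.forEach((tp, meta) -> System.out.printf("%s lag=%d%n",
                    tp, latest.get(tp).offset() - meta.offset()));
        }
    }
}
```

Polling this in a loop during the test would yield the lag-over-time series that the control centre itself does not provide.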

Unfortunately, the control centre does not maintain a graph of consumer lag over time, so the peak value has to be found by inspection. Over the course of the five-minute test, the maximum value reached 180,000 messages, indicating that the consumers were struggling to keep up with the flow of messages. For Pub/Sub, consumer lag is available as a graph, since Cloud Monitoring is much richer than the Kafka control centre; it is captured in the metric topic/num_unacked_messages_by_region.

Figure 8: Pub/Sub consumer lag by number of messages

The peak value was 1,428 messages, with a mean of 917.67 over the five-minute window. This shows that Pub/Sub has much lower consumer lag than Kafka, which is confirmed by observing the lag in time units, captured by the metric topic/oldest_unacked_message_age_by_region.

Figure 9: Pub/Sub consumer lag by time

At no point during the test did a message remain unacknowledged by the subscriber for more than one second; the subscriber received messages almost as soon as they were sent.

The results of the load tests are summarized in the table below.

Metric                                   Kafka     Pub/Sub
Peak Throughput (KB/s)                   466.614   430.95
Peak Consumer Lag (number of messages)   180,000   1,428

Conclusion

There was no meaningful difference in the observed throughput, but Pub/Sub had much lower consumer lag. Based on this, it can be concluded that Pub/Sub performed better than Kafka for the given load. Ultimately, the choice of framework should be determined by the message load. A Kafka cluster that is too small for the load will exhibit high consumer lag, as seen in the previous section, while a cluster that is too large wastes money. This is not a concern with Pub/Sub, because you only pay for the resources you use. Therefore, if the load is predictable and consistent, enabling you to configure an appropriately sized Kafka cluster, there should be no significant difference in performance between the two frameworks. If the load is highly variable, Pub/Sub is the better choice.