September 30, 2024

Solving Complex Kafka Issues: Enterprise Case Studies

Apache Kafka

Apache Kafka issues, especially for enterprises running Kafka at scale, can escalate quickly and bring operations to a halt. The open source community may be able to offer assistance, but in some situations, you need a resolution fast. 

While some organizations partner with OpenLogic for ongoing, SLA-backed Kafka support, our Professional Services team gets involved when a customer who does not have a support contract needs a consultation or help troubleshooting an issue with their Kafka deployments. These engagements can last anywhere from a few days to a few weeks, depending on the scope and complexity of the project. 

In this blog, we present four Kafka case studies, detailing the issue each customer faced and how OpenLogic solved it.

Case Study #1: Large Internet Marketing Firm

Background: This customer was tracking clickstream events to measure ad campaign success. Their large bare-metal implementation contained 48 nodes and was processing roughly 5.8 million messages per second with 1-2 second end-to-end latency.

The Issue: LeaderAndIsr requests were failing during rolling restarts, resulting in multiple leader epochs with stale zkVersions.

The Solution: OpenLogic identified an existing bug, unfixed in the Kafka version the customer was running, that was more likely to surface during resource contention on the ZooKeeper instances co-located on five of the Kafka nodes. They recommended upgrading the Kafka cluster and running Kafka and ZooKeeper on independent nodes, which fixed the issue. 
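
For illustration, a broker configuration along these lines points Kafka at a ZooKeeper ensemble running on its own dedicated hosts; the hostnames, ports, and paths below are placeholders rather than the customer's actual topology:

# server.properties on each Kafka broker (illustrative sketch)
broker.id=1
listeners=PLAINTEXT://broker1.example.com:9092
log.dirs=/var/lib/kafka/data
# Point at a ZooKeeper ensemble on dedicated nodes instead of
# co-locating ZooKeeper with the Kafka brokers.
zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
zookeeper.connection.timeout.ms=18000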

Length of Engagement: 5 days 

Case Study #2: Large South American Bank

Background: This customer was using IBM MQ and not hitting their desired performance metrics. They needed to handle large messages at high volume.

The Issue: End-to-end latency and total throughput suffered with large messages, so the customer wanted to move to Kafka for a streaming-focused messaging bus.

The Solution: OpenLogic provided an architecture using the Saga pattern with Apache Kafka and Apache Camel for managing long-running transactions, such as crediting a payment on a loan from cash deposited at a branch. They also provided architectures for using Kafka with log shipping and the ELK stack, and for bridging events from IBM API Connect Cloud to an Elasticsearch index behind the firewall using Apache Kafka. Finally, OpenLogic delivered a 5-day Apache Camel training for a team of 15 people so they could learn how to create Kafka consumers and producers.
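
As a rough sketch of the kind of consumer/producer route covered in that training (not the customer's actual implementation), the Camel route below consumes deposit events from one Kafka topic and publishes credit events to another. The class name, topic names, consumer group, and broker address are placeholders:

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.main.Main;

public class LoanCreditRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Consume branch deposit events and republish them as loan credit events.
        // Topic names, group id, and broker address are illustrative placeholders.
        from("kafka:branch-deposits?brokers=localhost:9092&groupId=loan-credit-service")
            .log("Received deposit event: ${body}")
            // Validation and transformation of the deposit payload would go here.
            .to("kafka:loan-credits?brokers=localhost:9092");
    }

    public static void main(String[] args) throws Exception {
        // Run the route standalone with Camel Main.
        Main main = new Main();
        main.configure().addRoutesBuilder(new LoanCreditRoute());
        main.run(args);
    }
}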

Length of Engagement: 27 days 

Related Video: Apache Kafka Best Practices 

Case Study #3: U.S. Aerospace Firm

Background: Originally, this customer wanted help with Rancher and with moving off a VM-based Kafka cluster. They were using a WebSocket server to collect satellite location data in real time. The WebSocket server could not talk directly to Kafka, so they had developed a Camel-based relay for their original Kafka cluster. They were not collecting any metrics on the existing cluster and could not identify the root cause of message delays and consumer lag. 

The Issue: Performance issues with a pub/sub relay application that consumed WebSocket data from a domain-specific appliance and published it to Kafka topics.

The Solution: OpenLogic implemented Rancher clusters dedicated to running the Strimzi operator and serving Kafka clusters. They were also able to improve throughput dramatically by moving the existing Java code to Apache Camel with the Vert.x component. 

OpenLogic instrumented both the Camel WebSocket relay application and the Kafka brokers with Prometheus and Grafana to measure replication and processing lag, and put monitoring in place to alert on topics that didn't meet SLAs. Once metrics collection was in place, existing bottlenecks became identifiable, and addressing them drastically improved end-to-end performance.
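
A minimal sketch of that relay pattern, assuming the appliance exposes a WebSocket feed and using Camel's vertx-websocket component in client mode; the class name, host, path, topic, and broker address are placeholders rather than the customer's actual endpoints:

import org.apache.camel.builder.RouteBuilder;

public class TelemetryRelayRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Connect to the appliance's WebSocket feed as a client and relay each
        // message straight to a Kafka topic. Endpoint details are placeholders.
        from("vertx-websocket:appliance.example.com:8443/telemetry?consumeAsClient=true")
            .to("kafka:satellite-positions?brokers=localhost:9092");
    }
}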

Length of Engagement: 3 days 

Case Study #4: Global Financial Services Company

Background: The customer came to OpenLogic with a security issue in Kafka Connect that violated PCI compliance requirements as well as their internal security standards.

The Issue: Sensitive information was being included in Kafka Connect stack traces.

The Solution: OpenLogic created a sanitized test harness, containing no customer information, that reproduced the bug. They filed a bug against the project, attached the test harness, and wrote the code that resolved it. OpenLogic then submitted the code to the community and worked with them to revise the PR to meet the community's standards. Finally, they informed the customer when the fix was accepted and estimated which release was likely to include it. A Kafka Improvement Proposal (KIP) was also produced from the engagement.

Length of Engagement: 20 days 

Final Thoughts

Apache Kafka is an extremely powerful event streaming platform, but when things go wrong, they go wrong at scale. These Kafka case studies illustrate the benefits of having direct access to Enterprise Architects with deep Kafka expertise in those moments when every minute counts. 

Don't Waste Development Hours Troubleshooting Kafka 

OpenLogic experts have hands-on experience engineering, modernizing, and optimizing enterprise-scale Kafka implementations. If you're looking to migrate from batch-driven to streaming data, integrate systems, train your team, or get technical support, we can help.

Kafka Support and Services
