Dominik,

your setup looks to be:

Test data producer* -> Kafka cluster -> KafkaSpout [Storm] -> KafkaBolt

(*You didn't say what exactly you used for this, but presumably something
based on Kafka's producer client.)

Here, the KafkaSpout internally uses Kafka's consumer client to read data
from Kafka (i.e. your 100M test messages), and then uses Storm's internal
messaging layer (based on Netty) to forward this data to the KafkaBolt.

One reason you see a discrepancy between the fast "native" Kafka
performance (e.g. your 650k msg/s for the KafkaProducer) vs. the slower
Storm performance (206k msg/s for KafkaBolt) is the difference between the
two messaging layers (i.e. Kafka's messaging layer vs. Storm's internal
messaging layer, which is based on Netty).  Kafka is simply much more
performant on that front because the Kafka project originally started out
as a messaging system, so its messaging layer is really good.  Other
factors may include the implementation/versions of Storm/KafkaSpout/...,
any serialization/deserialization you're doing, configuration settings (you
mentioned you kept the defaults), and so on.
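
To put the two numbers side by side, here is a rough back-of-the-envelope
sketch in Python.  It uses only the figures from your mail (100M messages of
100 bytes each, 650k msg/s vs. 206k msg/s); the names below are made up for
illustration and are not part of any Kafka or Storm API.

```python
# Figures quoted in this thread; the names are illustrative only.
TOTAL_MESSAGES = 100_000_000  # messages produced into Kafka
MESSAGE_BYTES = 100           # size of each message in bytes
RATES = {
    "KafkaProducer (standalone)": 650_000,  # msg/s, roughly uniform
    "KafkaBolt (peak)": 206_000,            # msg/s, reported as "at most"
}


def summarize(rate_msgs_per_s):
    """Return (MB/s, seconds needed to push the full test set) for a rate."""
    mb_per_s = rate_msgs_per_s * MESSAGE_BYTES / 1_000_000
    seconds = TOTAL_MESSAGES / rate_msgs_per_s
    return mb_per_s, seconds


for name, rate in RATES.items():
    mb_per_s, seconds = summarize(rate)
    print(f"{name}: {mb_per_s:.1f} MB/s, {seconds:.0f} s for the full run")

gap = RATES["KafkaProducer (standalone)"] / RATES["KafkaBolt (peak)"]
print(f"gap: {gap:.2f}x")
```

Keep in mind that 206k msg/s is the peak you reported for the KafkaBolt, so
with the zero-throughput seconds you describe, the actual run would take
longer than this estimate suggests.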

It would help, though, if you shared the exact versions of Storm and Kafka
that you have been using for your experiments.  Historically, the Kafka
integration (KafkaSpout) was only so-so; it has improved since, although --
to my knowledge -- even the latest versions still trail native Kafka
performance.

-Michael




On Thu, Sep 29, 2016 at 8:36 PM, Dominik Safaric <[email protected]>
wrote:

> Hi Everyone,
>
> In the past few days, I’ve been benchmarking Storm using a simple topology
> consisting of a KafkaSpout and KafkaBolt. For the benchmark, I’ve produced
> 100.000.000 messages into Kafka, where each message was 100 bytes in
> size. The configuration of Kafka, Zookeeper and Storm was intentionally
> left default.
>
> An interesting observation I’ve made is in regard to the KafkaBolt
> throughput. Namely, when running the KafkaProducer standalone, it has a
> uniform throughput of approximately 650.000 messages per second, whereas
> in the case of the KafkaBolt the throughput is at most 206.000 messages
> per second, with a skewed distribution where subsequent seconds may have
> *zero throughput*, i.e. no tuples emitted. For an overview of the
> distribution while running the benchmark on a cluster, take a look at the
> graph below.
>
> Now, my question is - why does the KafkaBolt have such a decreased
> throughput when compared to a standalone KafkaProducer? What factors in
> your experience influence its throughput?
>
> I’ve measured the same under various configurations, such as changing
> topology.executor.(receive | send).buffer.size, disabling
> acknowledgements, et cetera. Although the result improved in some cases,
> the throughput remained skewed throughout the benchmark.
>
> Thanks in advance for sharing your experience and advice!
>
> Dominik
>
>
>
