Hi,

We are using Apache Storm 0.9.2 and storm-kafka (version 0.9.0-wip16a-scala292), which supports Kafka 0.7.
I am trying to understand the failure handling of the Kafka spout in a particular scenario. I have 4 workers: one running a single executor of the Kafka spout, one running a single executor of Bolt1, and two each running a single executor of Bolt2. The topology is Spout -> Bolt1 -> Bolt2, with shuffle grouping configured between the spout and Bolt1, and between Bolt1 and Bolt2. Messages are injected into Kafka in a controlled manner. At one point, I inject a failure into Bolt2 that results in an exception, and the worker dies. At this stage, I notice the Kafka spout remains stuck at the offset last committed into Zookeeper. This continues even after the condition causing the exception is removed and the failed worker is restarted by Storm.

I have some questions based on this observation:

1. I expected that once the worker restarts and the condition causing the exception is resolved, the Kafka spout would resume sending new messages for processing, starting from the last committed offset, and everything would continue to work fine. From the Kafka spout, I do see new messages being fetched, through lines like these:

2014-09-29 15:39:39 s.k.PartitionManager [INFO] Added 2 messages from Kafka: localhost:0 to internal buffers
2014-09-29 15:39:41 s.k.PartitionManager [INFO] Committing offset for localhost:9092:0

However, I don't see any new messages being processed in the downstream bolts, nor the Kafka offset being incremented. In other words, the spout stays stuck at the last committed offset without making progress. Is this the expected behaviour?

2. I see that the other executors (Bolt1 and Bolt2) which did not have any errors also seem to reinitialise after the erroneous executor of Bolt2 dies. Is this expected? I don't see any messages reflecting an error in either of these executors.

3. In general, what is the recommended way to handle errors such as transient downstream failures when Kafka is involved, i.e., when we want messages to be reprocessed once the downstream system recovers?
Say Bolt2 is writing to a DB that is temporarily unavailable. Some messages on this forum seemed to indicate that just throwing an exception and letting the worker restart could lead to recovery, with messages getting reprocessed from where they left off. Is this the right thing to do, or should we use a different error handling mechanism? I stumbled on this thread: http://grokbase.com/t/gg/storm-user/136gj7fz18/kafka-spout-fail-retry - which seems to imply that we should catch exceptions rather than rely on retries. Can someone provide guidance on the right path?

Thanks,
Hemanth
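P.S. To make question 3 concrete: by "catch exceptions" I mean something like the sketch below, where the bolt catches a transient DB error and fails the tuple (so the spout can replay it) instead of letting the exception kill the worker. The Collector interface, DbWriterBolt, and TransientDbException here are only illustrative stand-ins I made up so the snippet is self-contained; a real bolt would use Storm's OutputCollector.ack/fail on backtype.storm.tuple.Tuple instead.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Storm's OutputCollector, just for illustration.
interface Collector {
    void ack(String tuple);
    void fail(String tuple);
}

// Stand-in for a transient downstream error (e.g. DB temporarily down).
class TransientDbException extends RuntimeException {}

// Sketch of the "catch and fail" pattern: on a transient failure we fail the
// tuple so it gets replayed later, rather than crashing the worker.
class DbWriterBolt {
    private final Collector collector;
    private final boolean dbAvailable; // stands in for a real availability check

    DbWriterBolt(Collector collector, boolean dbAvailable) {
        this.collector = collector;
        this.dbAvailable = dbAvailable;
    }

    void execute(String tuple) {
        try {
            writeToDb(tuple);
            collector.ack(tuple);  // success: the spout can move the offset forward
        } catch (TransientDbException e) {
            collector.fail(tuple); // transient failure: ask the spout for a replay
        }
    }

    private void writeToDb(String tuple) {
        if (!dbAvailable) throw new TransientDbException();
        // ... the real DB write would go here ...
    }
}

public class Main {
    public static void main(String[] args) {
        final List<String> acked = new ArrayList<>();
        final List<String> failed = new ArrayList<>();
        Collector collector = new Collector() {
            public void ack(String t) { acked.add(t); }
            public void fail(String t) { failed.add(t); }
        };

        new DbWriterBolt(collector, false).execute("msg-1"); // DB down -> fail
        new DbWriterBolt(collector, true).execute("msg-2");  // DB up   -> ack

        System.out.println("failed=" + failed + " acked=" + acked);
        // prints: failed=[msg-1] acked=[msg-2]
    }
}
```

Is this roughly the shape of what the thread above recommends, as opposed to rethrowing and relying on worker restarts?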
