Hi,

We are using Apache Storm 0.9.2 and storm-kafka (version 0.9.0-wip16a-scala292), which supports Kafka 0.7.
I am trying to understand the failure handling of the Kafka spout in a particular scenario. I have 4 workers: one running a single executor of the Kafka spout, one running a single executor of Bolt1, and two each running a single executor of Bolt2. The topology is Spout -> Bolt1 -> Bolt2, with shuffle grouping configured between the spout and Bolt1, and between Bolt1 and Bolt2. Messages are injected into Kafka in a controlled manner. At one point, I inject a failure into Bolt2 that results in an exception, and the worker dies. At this stage, I notice the Kafka spout remains stuck at the offset last committed into Zookeeper. This continues even after the condition causing the exception is removed and the failed worker is restarted by Storm.

I have some questions based on this observation:

1. I expected that once the worker restarts and the condition causing the exception is resolved, the Kafka spout would resume sending new messages for processing, starting from the last committed offset, and everything would continue to work fine. From the Kafka spout, I do see new messages being fetched, through lines like these:

2014-09-29 15:39:39 s.k.PartitionManager [INFO] Added 2 messages from Kafka: localhost:0 to internal buffers
2014-09-29 15:39:41 s.k.PartitionManager [INFO] Committing offset for localhost:9092:0

However, I don't see any new messages being processed in the downstream bolts, nor the Kafka offset being incremented. In other words, the spout stays stuck at the last committed offset without making progress. Is this the expected behaviour?

2. I see that the other executors (Bolt1 and Bolt2) which did not have any errors also seem to reinitialise after the erroneous executor of Bolt2 dies. Is this expected? I don't see any messages reflecting an error in either of these executors.

3. In general, what is the recommended way to handle errors such as transient downstream failures when Kafka is involved, i.e., when we want messages to be reprocessed once the downstream system recovers?
Say Bolt2 is writing to a DB that is temporarily unavailable. Some messages on this forum seemed to indicate that just throwing an exception and letting the worker restart could lead to recovery, with messages getting reprocessed from where they left off. Is this the right thing to do, or should we use a different error handling mechanism? I stumbled on this thread: http://grokbase.com/t/gg/storm-user/136gj7fz18/kafka-spout-fail-retry - which seems to imply that we should catch exceptions rather than rely on retries. Can someone provide guidance on the right path?

Thanks,
Hemanth
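P.S. To make question 3 concrete: by "catch exceptions" I mean something like the sketch below, where the bolt catches a transient DB error and fails the tuple (so the spout can replay it) instead of letting the exception kill the worker. The Collector interface, DbWriterBolt, and TransientDbException here are only illustrative stand-ins I made up so the snippet is self-contained; a real bolt would use Storm's OutputCollector.ack/fail on backtype.storm.tuple.Tuple instead.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Storm's OutputCollector, just for illustration.
interface Collector {
    void ack(String tuple);
    void fail(String tuple);
}

// Stand-in for a transient downstream error (e.g. DB temporarily down).
class TransientDbException extends RuntimeException {}

// Sketch of the "catch and fail" pattern: on a transient failure we fail the
// tuple so it gets replayed later, rather than crashing the worker.
class DbWriterBolt {
    private final Collector collector;
    private final boolean dbAvailable; // stands in for a real availability check

    DbWriterBolt(Collector collector, boolean dbAvailable) {
        this.collector = collector;
        this.dbAvailable = dbAvailable;
    }

    void execute(String tuple) {
        try {
            writeToDb(tuple);
            collector.ack(tuple);  // success: the spout can move the offset forward
        } catch (TransientDbException e) {
            collector.fail(tuple); // transient failure: ask the spout for a replay
        }
    }

    private void writeToDb(String tuple) {
        if (!dbAvailable) throw new TransientDbException();
        // ... the real DB write would go here ...
    }
}

public class Main {
    public static void main(String[] args) {
        final List<String> acked = new ArrayList<>();
        final List<String> failed = new ArrayList<>();
        Collector collector = new Collector() {
            public void ack(String t) { acked.add(t); }
            public void fail(String t) { failed.add(t); }
        };

        new DbWriterBolt(collector, false).execute("msg-1"); // DB down -> fail
        new DbWriterBolt(collector, true).execute("msg-2");  // DB up   -> ack

        System.out.println("failed=" + failed + " acked=" + acked);
        // prints: failed=[msg-1] acked=[msg-2]
    }
}
```

Is this roughly the shape of what the thread above recommends, as opposed to rethrowing and relying on worker restarts?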
