Hello,

I was hoping for guidance on an issue I've been seeing. I have a topology
that uses `org.apache.storm.kafka.spout.trident.KafkaTridentSpoutOpaque` to
read from a Kafka topic with a first poll offset strategy of
UNCOMMITTED_EARLIEST. Since upgrading from 1.0.5 to 1.2.3, the offsets that
the topology stores in Zookeeper are periodically lost for some partitions,
and the spout ends up restarting from the beginning of those partitions.
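For reference, the spout is configured roughly as below. This is a
simplified sketch from memory rather than our exact production code; the
broker address and topic name are placeholders, and the string
deserializers are just the builder defaults:

    import org.apache.storm.kafka.spout.KafkaSpoutConfig;
    import org.apache.storm.kafka.spout.trident.KafkaTridentSpoutOpaque;

    // Placeholder broker/topic; real values differ.
    KafkaSpoutConfig<String, String> spoutConfig = KafkaSpoutConfig
            .builder("kafka-broker:9092", "my-topic")
            .setFirstPollOffsetStrategy(
                    KafkaSpoutConfig.FirstPollOffsetStrategy.UNCOMMITTED_EARLIEST)
            .build();

    KafkaTridentSpoutOpaque<String, String> spout =
            new KafkaTridentSpoutOpaque<>(spoutConfig);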

The issue seems to occur when workers die. Originally the topology was a
bit flaky and would frequently throw exceptions in workers; at that point I
was seeing the offset reset issue several times a day. Having made the
topology more stable, I'm now seeing the reset a couple of times a week.
The topology reads ~80k messages per hour from the topic using 12 workers
and a parallelism hint of 384. The topic also has 384 partitions.
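The wiring is roughly the following (again a simplified sketch; the stream
and topology names are placeholders and MyProcessingFunction stands in for
our actual Trident function):

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.trident.TridentTopology;
    import org.apache.storm.tuple.Fields;

    TridentTopology topology = new TridentTopology();
    topology.newStream("kafka-spout", spout)  // spout built as in the snippet above
            .parallelismHint(384)
            .each(new Fields("value"), new MyProcessingFunction(), new Fields("processed"));

    Config conf = new Config();
    conf.setNumWorkers(12);
    StormSubmitter.submitTopology("my-topology", conf, topology.build());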

There are a couple of spots in OpaquePartitionedTridentSpoutExecutor where
it removes state from ZK, namely:
https://github.com/apache/storm/blob/v1.2.3/storm-core/src/jvm/org/apache/storm/trident/spout/OpaquePartitionedTridentSpoutExecutor.java#L138
https://github.com/apache/storm/blob/v1.2.3/storm-core/src/jvm/org/apache/storm/trident/spout/OpaquePartitionedTridentSpoutExecutor.java#L153
https://github.com/apache/storm/blob/v1.2.3/storm-core/src/jvm/org/apache/storm/trident/spout/OpaquePartitionedTridentSpoutExecutor.java#L177

But each of those looks safe on its own, and I can't see how any of them
would leave ZK in a state with no transaction data at all for a given
partition's node. Given how rarely the issue occurs, it seems like one or
more workers have to fail in just the right way for the state to be lost.
Any pointers to where this issue might be coming from would be appreciated.

Thanks!
