Hello, I was hoping for guidance on an issue I've been seeing. I have a topology that uses `org.apache.storm.kafka.spout.trident.KafkaTridentSpoutOpaque` to read from a Kafka topic with a first-poll offset strategy of UNCOMMITTED_EARLIEST. Since upgrading from 1.0.5 to 1.2.3, the offsets that the topology stores in Zookeeper are periodically lost for some partitions, and the spout ends up restarting from the beginning of those partitions.
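
For reference, here is roughly how the spout and stream are wired up (simplified sketch; the broker addresses, topic name, and stream id are placeholders, and the downstream processing is elided):

```java
import org.apache.storm.generated.StormTopology;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.kafka.spout.trident.KafkaTridentSpoutOpaque;
import org.apache.storm.trident.TridentTopology;

public class OffsetResetTopology {
    public static StormTopology build() {
        // Spout config: broker list and topic name are placeholders.
        KafkaSpoutConfig<String, String> spoutConfig = KafkaSpoutConfig
                .builder("kafka-1:9092,kafka-2:9092", "my-topic")
                .setFirstPollOffsetStrategy(
                        KafkaSpoutConfig.FirstPollOffsetStrategy.UNCOMMITTED_EARLIEST)
                .build();

        KafkaTridentSpoutOpaque<String, String> spout =
                new KafkaTridentSpoutOpaque<>(spoutConfig);

        TridentTopology topology = new TridentTopology();
        // "kafka-stream" is the stream id under which the spout's metadata is kept in ZK.
        topology.newStream("kafka-stream", spout)
                .parallelismHint(384);
        // ... downstream processing elided ...

        return topology.build();
    }
}
```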
The issue seems to occur when workers die. Originally the topology was a bit flaky and workers would frequently die with exceptions; at that point I was seeing the offset reset happen several times a day. Having made the topology more stable, I'm now seeing the reset happen a couple of times a week. The topology reads ~80k messages per hour from the topic using 12 workers and a parallelism hint of 384; the topic also has 384 partitions.

There are a couple of spots in OpaquePartitionedTridentSpoutExecutor where it removes state from ZK, namely:

https://github.com/apache/storm/blob/v1.2.3/storm-core/src/jvm/org/apache/storm/trident/spout/OpaquePartitionedTridentSpoutExecutor.java#L138
https://github.com/apache/storm/blob/v1.2.3/storm-core/src/jvm/org/apache/storm/trident/spout/OpaquePartitionedTridentSpoutExecutor.java#L153
https://github.com/apache/storm/blob/v1.2.3/storm-core/src/jvm/org/apache/storm/trident/spout/OpaquePartitionedTridentSpoutExecutor.java#L177

Each of those looks safe on its own, though, and I can't see how any of them would leave ZK in a state where there is no transaction data at all under a given partition's node. Given the rarity of the issue, it seems like one or more workers have to fail in just the right way for the state to be lost. Any pointers on where this issue might be coming from would be appreciated. Thanks!
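
In case it helps narrow things down, I've been meaning to dump the relevant znodes whenever a reset happens, along the lines of the sketch below. It assumes the per-partition metadata lives under `<transactional.zookeeper.root>/<stream id>/user/<partition>` with one child per recent txid, which is my reading of the code rather than something I've verified, and the ZK connect string and stream id are placeholders for my cluster.

```java
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class DumpTridentSpoutState {
    public static void main(String[] args) throws Exception {
        // ZK connect string and the path layout below are assumptions about my setup.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-1:2181,zk-2:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            String userRoot = "/transactional/kafka-stream/user";
            for (String partition : client.getChildren().forPath(userRoot)) {
                // Each partition node should hold one child per recent txid;
                // an empty or missing node is what I'd expect to see when the reset happens.
                List<String> txids = client.getChildren().forPath(userRoot + "/" + partition);
                System.out.println(partition + " -> " + txids);
            }
        } finally {
            client.close();
        }
    }
}
```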
