Hi,

For the last few days we have been seeing an issue in one of our topologies: even though all the bolts acknowledge a message, the same message ID is sometimes not acknowledged by the storm-kinesis-spout. It happens in the following scenario.

We have two bolts in our topology. The first bolt emits multiple tuples to the second bolt, but only about 1 in 100 of the messages emitted from the spout have to go through the second bolt, based on a specific condition. The second bolt is also about 100x slower than the first bolt. In such cases we see acknowledgements from both bolts, but the spout intermittently does not acknowledge the tuple that goes through the second bolt. We have disabled message timeouts (topology.enable.message.timeouts is False).

The storm-kinesis-spout has properties to throttle the spout on backpressure. They are set as follows:

zkCommitIntervalMs = 10000
zkMaxUncommittedRecords = 100000
kinesisRecordsLimit = 10000

The current configuration of our topology is as follows:

topology.max.spout.pending 20
storm.messaging.netty.buffer_size 5242880
topology.executor.receive.buffer.size 16384
topology.executor.send.buffer.size 16384
topology.transfer.buffer.size 1024
backpressure.znode.timeout.secs 30
topology.enable.message.timeouts False
topology.disruptor.batch.timeout.millis 1
topology.disruptor.wait.timeout.millis 1000
topology.acker.executors 3

Screenshots from the Storm UI with the bolt and spout stats are attached.

The problem we are facing is that after about an hour of running, either we hit max spout pending, which causes the topology to hang, or we hit zkMaxUncommittedRecords. The latter means there is at least one message that has not completed its DAG even though many messages behind it have already been acknowledged. Since we want an at-least-once processing guarantee, the spout is not able to commit past that message.
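To illustrate what we believe is happening, here is a minimal sketch of the commit bookkeeping as we understand it. This is not the storm-kinesis-spout code: the class CommitTracker and its methods are hypothetical names, and it assumes records are emitted with consecutive sequence numbers. It only shows why a single un-acked record holds back the checkpoint while the count of acked but uncommitted records keeps growing.

```java
import java.util.TreeSet;

// Hypothetical model of at-least-once checkpointing; NOT the real spout code.
// Assumption: records are emitted with consecutive sequence numbers.
class CommitTracker {
    // Sequence numbers that were emitted but not yet acked by the spout.
    private final TreeSet<Long> pending = new TreeSet<>();
    private long lastEmitted = -1;

    void emit(long seq) {
        pending.add(seq);
        lastEmitted = Math.max(lastEmitted, seq);
    }

    void ack(long seq) {
        pending.remove(seq);
    }

    // Highest sequence number that is safe to commit: every record up to
    // and including it has been acked. Returns -1 if nothing can be committed.
    long committable() {
        return pending.isEmpty() ? lastEmitted : pending.first() - 1;
    }

    // Records that are acked but sit behind the oldest un-acked record,
    // so they cannot be committed yet.
    long ackedButBlocked() {
        if (pending.isEmpty()) {
            return 0;
        }
        long emittedAfterOldest = lastEmitted - pending.first();
        long stillPendingAfterOldest = pending.size() - 1;
        return emittedAfterOldest - stillPendingAfterOldest;
    }

    public static void main(String[] args) {
        CommitTracker tracker = new CommitTracker();
        long stuck = 1000; // the one record whose ack never reaches the spout
        for (long seq = 0; seq < 100_000; seq++) {
            tracker.emit(seq);
        }
        for (long seq = 0; seq < 100_000; seq++) {
            if (seq != stuck) {
                tracker.ack(seq);
            }
        }
        System.out.println("committable up to: " + tracker.committable());
        System.out.println("acked but blocked: " + tracker.ackedButBlocked());
    }
}
```

In this model the checkpoint cannot move past the record just before the stuck one, no matter how many later records are acked, which matches what we observe once zkMaxUncommittedRecords is reached.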
Digging further, we realized that the acknowledged stream actually has about 99000 message IDs that cannot be committed, because the emitted stream has one message that is much older and not yet acknowledged by the spout. Once it is stuck at that point, it does not move for hours.

So the question is: is there any other timeout on the spout side that may be causing the spout to never acknowledge the message, while the bolts have already acknowledged it? What is the role of topology.disruptor.wait.timeout.millis?

Regards,
Shatabdi
