Hi,

For the last few days we have been experiencing an issue in one of our
topologies: even though all the bolts acknowledge a message, the same
message ID is intermittently not acknowledged at the storm-kinesis-spout.
This happens in the following scenario.

We have two bolts in our topology. The first bolt can emit multiple tuples
to the second bolt, but only about 1 in 100 messages emitted from the spout
has to go through the second bolt, based on a specific condition. The
second bolt is also about 100x slower than the first bolt. In such cases we
see acknowledgements from both bolts, but the spout intermittently does not
acknowledge the tuple that goes through the second bolt. We have disabled
message timeouts (topology.enable.message.timeouts is false). The
storm-kinesis-spout has properties to throttle the spout under
backpressure; they are set as follows:
zkCommitIntervalMs = 10000
zkMaxUncommittedRecords = 100000
kinesisRecordsLimit = 10000

The current configuration of our topology is as follows:



topology.max.spout.pending = 20
storm.messaging.netty.buffer_size = 5242880
topology.executor.receive.buffer.size = 16384
topology.executor.send.buffer.size = 16384
topology.transfer.buffer.size = 1024
backpressure.znode.timeout.secs = 30
topology.enable.message.timeouts = False
topology.disruptor.batch.timeout.millis = 1
topology.disruptor.wait.timeout.millis = 1000
topology.acker.executors = 3

Screenshots of the bolt and spout stats from the Storm UI are attached.

The problem we are facing is that after about an hour of running, we hit
one of two limits. Either we reach topology.max.spout.pending, which causes
the topology to hang, or we reach zkMaxUncommittedRecords, which means
there is at least one message that has not completed its DAG even though
many messages behind it have already been acknowledged. Because we want an
at-least-once processing guarantee, the spout cannot commit past that
message. Digging further, we found that the acknowledged stream holds about
99000 message IDs that cannot be committed, because the emitted stream
still contains one much older message that the spout has not yet
acknowledged. Once it is stuck at that point, it stays there for hours and
does not move.

So the question is: is there any other timeout at the spout end that could
cause the spout to never acknowledge the message even though the bolts have
already acknowledged it? And what is the role of
topology.disruptor.wait.timeout.millis?
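
To make the stall concrete, here is a minimal Python sketch of the commit
behaviour as described above. It is not the storm-kinesis-spout code. The
function names committable_checkpoint and uncommitted_count and the
sequence numbers are hypothetical and chosen only for illustration. It
assumes the checkpoint can only advance over a contiguous run of
acknowledged records, which is what an at-least-once guarantee implies.

```python
def committable_checkpoint(emitted, acked):
    """Return the last sequence number that is safe to commit, or None.

    emitted: sequence numbers in the order the spout emitted them.
    acked:   set of sequence numbers acknowledged back to the spout.
    Assumption: a record can be committed only if it and every record
    emitted before it have been acknowledged.
    """
    checkpoint = None
    for seq in emitted:
        if seq not in acked:
            break  # the first un-acked record blocks everything after it
        checkpoint = seq
    return checkpoint


def uncommitted_count(emitted, acked):
    """Count acknowledged records that still cannot be committed."""
    checkpoint = committable_checkpoint(emitted, acked)
    return sum(1 for seq in acked if checkpoint is None or seq > checkpoint)


# Hypothetical run: 100000 records emitted, record 1000 never acked.
emitted = list(range(1, 100_001))
stuck = 1_000
acked = set(emitted) - {stuck}

print(committable_checkpoint(emitted, acked))  # 999
print(uncommitted_count(emitted, acked))       # 99000

# As soon as the stuck record is acked, everything becomes committable.
acked.add(stuck)
print(committable_checkpoint(emitted, acked))  # 100000
print(uncommitted_count(emitted, acked))       # 0
```

In this sketch a single un-acked record leaves 99000 acknowledged records
uncommitted, which matches what we observe once the limit is reached.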

Regards,
Shatabdi
