Hello Ravi,

First of all, thank you for your input. I have checked the Nimbus and supervisor logs, and at some point Nimbus decides that my executor is not alive. I am issuing neither a rebalance nor a kill command; Nimbus simply picks a busy executor and re-assigns it. There is no specific pattern in which executor gets restarted, but it is always a busy one (at least in terms of input tuples). I am trying to implement a fault-tolerance mechanism for when this happens.
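
The behaviour described above (a busy executor being declared "not alive") is consistent with timeout-based liveness detection. The following is only a toy sketch of that idea, not Storm's actual implementation: the class `HeartbeatMonitor`, its methods, and the executor names are hypothetical, and the 30-second value is an illustrative assumption standing in for a setting such as `nimbus.task.timeout.secs`.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Toy model of timeout-based liveness detection (hypothetical, not Storm code)."""
    # Assumption: plays the role of a setting like nimbus.task.timeout.secs.
    timeout_secs: float
    # Maps executor id -> timestamp (seconds) of the last heartbeat received.
    last_beat: dict = field(default_factory=dict)

    def heartbeat(self, executor_id: str, now: float) -> None:
        """Record that an executor reported in at time `now`."""
        self.last_beat[executor_id] = now

    def dead_executors(self, now: float) -> list:
        """Return executors whose last heartbeat is older than the timeout."""
        return sorted(
            e for e, t in self.last_beat.items() if now - t > self.timeout_secs
        )


monitor = HeartbeatMonitor(timeout_secs=30.0)
monitor.heartbeat("idle-executor", now=0.0)
monitor.heartbeat("busy-executor", now=0.0)

# The idle executor keeps reporting; the busy one stalls (for example, because
# its thread is saturated or the JVM is in a long GC pause) and misses its beats.
for t in (10.0, 20.0, 30.0, 40.0):
    monitor.heartbeat("idle-executor", now=t)

print(monitor.dead_executors(now=40.0))  # ['busy-executor']
```

In this model the busy executor is flagged even though it never threw an exception, which would match the absence of errors in the worker logs. If that is what is happening, raising the relevant timeout or reducing the load per executor are the two obvious things to try.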
I am pretty sure that ACK-ing tuples is not the cause in my case, since the number of ACK-ed tuples equals the number of input tuples.

Thank you again for your time.

Regards,
Nick

2015-06-30 15:26 GMT-04:00 Ravi Tandon <ravi.tan...@microsoft.com>:

> I would suggest that you also look at your Nimbus and supervisor logs at the same time. They will help paint the full picture.
>
> Nimbus not getting a heartbeat back from the worker can lead to shutdown of the port as it tries to shift the worker to another free slot (assuming no kill or rebalance was issued that forced this on your topology).
>
> I have not seen a case where Netty causes this; others can chime in on that.
>
> Key things to consider:
>
> 1. Your topology continues to work after this. If it does not, then there is an issue.
> 2. You do not ack the tuples until they are completely processed, so that when the task re-spawns your tuples are replayed.
>
> http://storm.apache.org/documentation/Fault-tolerance.html
> http://storm.apache.org/documentation/FAQ.html
>
> *From:* Nick R. Katsipoulakis [mailto:nick.kat...@gmail.com]
> *Sent:* Thursday, June 25, 2015 12:18 PM
> *To:* user@storm.apache.org
> *Subject:* When is a task considered dead?
>
> Hello,
>
> I have a problem: at some point in a running topology, one of the running tasks gets restarted by Storm. Under which circumstances can this happen? Can it happen because of Netty (the input rate of tuples is higher than the processing rate)?
>
> I do not understand why this is happening, and it is definitely not a problem in my code, because I cannot find any exceptions in the worker log files.
>
> Any ideas/hints?
>
> Thanks,
> Nick

--
Nikolaos Romanos Katsipoulakis,
University of Pittsburgh, PhD candidate