Hello Ravi,

First of all, thank you for your input. I have checked the Nimbus and
supervisor logs, and at some point Nimbus decides that my executor is not
alive. I am issuing neither a rebalance nor a kill command; Nimbus just
picks a busy executor and reassigns it. There is no specific pattern to
which executor gets restarted, but it is always a busy one (at least in
terms of input tuples). I am trying to implement a fault-tolerance
mechanism for when this happens.
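
To check that I understand the mechanism you describe: as far as I can
tell, Nimbus treats an executor as dead when it has not seen a heartbeat
from it within a timeout (I believe the setting is
nimbus.task.timeout.secs, 30 seconds by default in the 0.9.x line). Below
is only a minimal sketch of that rule in plain Java, not Storm's actual
code. HeartbeatMonitor and its methods are hypothetical names I made up
for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of timeout-based liveness detection.
// This is NOT Storm's implementation; all names here are assumptions.
class HeartbeatMonitor {
    private final long timeoutMillis;
    // Last heartbeat timestamp (in milliseconds) seen per executor id.
    private final Map<String, Long> lastBeat = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record that an executor reported in at the given time.
    void heartbeat(String executorId, long nowMillis) {
        lastBeat.put(executorId, nowMillis);
    }

    // Return the executors whose last heartbeat is older than the timeout.
    List<String> deadExecutors(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastBeat.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }
}
```

If this reading is right, an executor that is too busy or stalled to get
a heartbeat out in time would look dead under this rule even though it
never threw an exception, which would match what I see in the logs.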

I am pretty sure that acking is not the cause in my case, since the
number of acked tuples equals the number of input tuples.

Thank you again for your time.

Regards,
Nick

2015-06-30 15:26 GMT-04:00 Ravi Tandon <ravi.tan...@microsoft.com>:

> I would suggest that you also look at your Nimbus and supervisor logs
> at the same time. They will help paint the full picture.
>
>
>
> Nimbus not getting a heartbeat back from the worker can lead to a shutdown
> of the port as it tries to shift the worker to another free slot (assuming
> no kill or rebalance was issued that forced this on your topology).
>
>
>
> I have not seen a case where Netty causes this; others can chime in on
> that.
>
>
>
> Key things to consider:
>
> 1. Your topology continues to work after this. If it does not, then
> there is an issue.
>
> 2. You do not ack the tuples until they are completely processed, so
> when the task re-spawns your tuples are replayed.
>
>
>
> http://storm.apache.org/documentation/Fault-tolerance.html
>
> http://storm.apache.org/documentation/FAQ.html
>
>
>
>
>
> *From:* Nick R. Katsipoulakis [mailto:nick.kat...@gmail.com]
> *Sent:* Thursday, June 25, 2015 12:18 PM
> *To:* user@storm.apache.org
> *Subject:* When is a task considered dead?
>
>
>
> Hello,
>
>
>
> I have the problem that at some point in a running topology, one of the
> running tasks gets restarted by Storm. Under which circumstances can this
> happen? Can it happen because of Netty (the input rate of tuples is
> higher than the processing rate)?
>
>
>
> I do not understand why this is happening, and it is definitely not a
> problem in my code, because I cannot find any exceptions in the worker log
> files.
>
>
>
> Any ideas/hints?
>
>
>
> Thanks,
>
> Nick
>
>
>



-- 
Nikolaos Romanos Katsipoulakis,
University of Pittsburgh, PhD candidate
