Re: Storm kill fails with exit code 143

2019-05-06 Thread Stig Rohde Døssing
Ah, sorry, I got off on the wrong track because of the linked issue, which
discusses worker JVM exit codes.

On Mon, 6 May 2019 at 19:34, Mitchell Rathbun (BLOOMBERG/ 731 LEX) <
mrathb...@bloomberg.net> wrote:

> Sorry if my initial question was misleading. The "storm kill" command
> returned 143; there was no exit code from our topology. Our topology never
> shut down and never received a shutdown command. As far as I can tell,
> Nimbus never received a request from running "storm kill" in this case, so
> the process created to carry out the kill command was the one terminated.
> As Derek mentioned, it seems like something killed that process. I am
> wondering whether, since so many topologies were being brought down at
> once, the process took a long time to communicate with Nimbus and timed
> out or was terminated. Is something like this possible? As far as I can
> tell, there was no external command at the time to kill the process.
>
> From: user@storm.apache.org At: 05/06/19 13:14:02
> To: user@storm.apache.org
> Subject: Re: Storm kill fails with exit code 143
>
> I would assume that what actually happened is that most of your workers
> didn't manage to finish shutting down gracefully, and so exited with code
> 20 due to the 1 second time limit imposed by the shutdown hook. One of
> your workers happened to run the entire shutdown sequence within the 1
> second limit, and so returned 143.
>
> Basically what is happening is that the supervisor sends SIGTERM to the
> worker to get it to shut down. The worker then runs its shutdown sequence
> to shut down gracefully. Before starting the shutdown sequence, the worker
> sets up a new thread that sleeps for 1 second, then halts the JVM with exit
> code 20. If the shutdown exceeds the time limit, you get exit code 20. If
> the shutdown finishes within the time limit, you get 143 in response to
> the original SIGTERM.
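The watchdog pattern described above (Storm's worker implements it in Java; the sketch below is a hypothetical Python analogue, with all names and constants invented for illustration) looks roughly like this:

```python
import os
import signal
import threading
import time

HARD_KILL_EXIT_CODE = 20  # assumed constant, mirroring the code 20 above
GRACE_PERIOD_SECS = 1.0   # the 1 second limit the shutdown hook imposes

def shut_down_gracefully():
    time.sleep(0.1)  # stand-in for real cleanup (close connections, flush state)

def on_sigterm(signum, frame):
    # Before graceful shutdown starts, arm a daemon thread that hard-halts
    # the process with code 20 once the grace period expires.
    watchdog = threading.Timer(GRACE_PERIOD_SECS,
                               lambda: os._exit(HARD_KILL_EXIT_CODE))
    watchdog.daemon = True
    watchdog.start()
    shut_down_gracefully()
    # Cleanup finished in time: restore the default SIGTERM disposition and
    # re-deliver the signal, so the process dies with the usual SIGTERM
    # status (which a shell reports as 128 + 15 = 143).
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    os.kill(os.getpid(), signal.SIGTERM)

signal.signal(signal.SIGTERM, on_sigterm)
```

The two exit codes fall out of the race between the watchdog thread and the cleanup: whichever finishes first determines whether you see 20 or the SIGTERM status.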
>
> On Mon, 6 May 2019 at 18:22, Derek Dagit wrote:
>
>> An exit code of 143 indicates a SIGTERM was received. (143 - 128 = 15).
>>
>> It seems like something killed the shutdown script.
>>
>> https://www.tldp.org/LDP/abs/html/exitcodes.html
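Derek's 128 + signal arithmetic can be observed directly from any signal-terminated process; here is a quick check using Python's standard `subprocess` module (note Python reports death-by-signal as a negative return code, while a shell folds it into 128 + signal number):

```python
import signal
import subprocess
import time

# Start a long-running child, send it SIGTERM, and inspect its exit status.
child = subprocess.Popen(["sleep", "30"])
time.sleep(0.2)  # give the child a moment to start
child.send_signal(signal.SIGTERM)
child.wait()

print(child.returncode)        # -15: Python's convention for "killed by signal 15"
print(128 - child.returncode)  # 143: the shell-style code Derek refers to
```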
>>
>> On Sun, May 5, 2019 at 8:19 PM JF Chen wrote:
>>
>>> Do you run your Storm application on YARN?
>>>
>>> Regards,
>>> Junfeng Chen
>>>
>>>
>>> On Mon, May 6, 2019 at 4:53 AM Mitchell Rathbun (BLOOMBERG/ 731 LEX) <
>>> mrathb...@bloomberg.net> wrote:
>>>
>>>> Recently our shutdown script failed when calling storm kill with a
>>>> return code of 143. Typically this means that SIGTERM was received and the
>>>> process was terminated. I see in
>>>> https://issues.apache.org/jira/browse/STORM-2176 that it is possible
>>>> to get this exit code if a topology takes too long to come down. However,
>>>> we are running version 1.2.1 of Storm, which should have the fix mentioned
>>>> in the issue. Is it possible that we have the same cause for our error?
>>>> When this occurred, many topologies were brought down at once, but only
>>>> this one topology seemed to have an issue.
>>>>
>>>
>

