Re: Storm kill fails with exit code 143
Ah, sorry, I got off on the wrong track because of the linked issue, which is about worker JVM exit codes.

On Mon, May 6, 2019 at 19:34, Mitchell Rathbun (BLOOMBERG/ 731 LEX) <mrathb...@bloomberg.net> wrote:
> [...]
Re: Storm kill fails with exit code 143
Sorry if my initial question was misleading. The "Storm kill" command itself returned 143; there was no exit code from our topology. Our topology was never shut down and never received a command to shut down. As far as I can tell, Nimbus never received a command from running Storm kill in this case, so the process created to carry out the kill command was the one that was terminated. As Derek mentioned, it seems like something killed the process. I am wondering whether, since so many topologies were being brought down at once, the process took a long time to communicate with Nimbus and timed out or was terminated. Is something like this possible? As far as I can tell, there was no external command at the time to kill the process.

From: user@storm.apache.org At: 05/06/19 13:14:02
To: user@storm.apache.org
Subject: Re: Storm kill fails with exit code 143
> [...]
Re: Storm kill fails with exit code 143
I would assume that what actually happened is that most of your workers don't manage to finish shutting down gracefully, and so exit with code 20 due to the 1 second time limit imposed by the shutdown hook. One of your workers happened to run the entire shutdown sequence within the 1 second time limit, and so returned 143.

Basically what is happening is that the supervisor sends SIGTERM to the worker to get it to shut down. The worker then runs its shutdown sequence to shut down gracefully. Before starting the shutdown sequence, the worker sets up a new thread that sleeps for 1 second, then halts the JVM with exit code 20. If the shutdown exceeds the time limit, you get exit code 20. If the shutdown finishes within the time limit, you get 143 in response to the original SIGTERM.

On Mon, May 6, 2019 at 18:22, Derek Dagit wrote:
> [...]
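The race between the shutdown sequence and the 1 second limit described in this message can be sketched roughly as below. This is only a Python analogy on a POSIX system, not Storm's actual worker code: `CHILD`, `run_worker` and the `shutdown_seconds` argument are hypothetical names, the sleep stands in for the real shutdown sequence, and the explicit `sys.exit(128 + signum)` imitates the JVM reporting 143 after a SIGTERM.

```python
import signal
import subprocess
import sys
import textwrap

# Hypothetical stand-in for a worker process (not Storm code).
CHILD = textwrap.dedent("""
    import os, signal, sys, threading, time

    shutdown_seconds = float(sys.argv[1])  # how long the fake cleanup takes

    def on_sigterm(signum, frame):
        # Watchdog: hard-exit with code 20 if cleanup is still running after 1 second.
        watchdog = threading.Timer(1.0, os._exit, args=(20,))
        watchdog.daemon = True
        watchdog.start()
        time.sleep(shutdown_seconds)  # stand-in for the graceful shutdown sequence
        sys.exit(128 + signum)        # finished in time: report 143 for SIGTERM

    signal.signal(signal.SIGTERM, on_sigterm)
    print("ready", flush=True)
    while True:
        time.sleep(0.1)
""")


def run_worker(shutdown_seconds: float) -> int:
    """Start the fake worker, send it SIGTERM, and return its exit code."""
    cmd = [sys.executable, "-c", CHILD, str(shutdown_seconds)]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        proc.stdout.readline()            # wait until the SIGTERM handler is installed
        proc.send_signal(signal.SIGTERM)  # what the supervisor is described as doing
        return proc.wait()


print(run_worker(0.1))  # cleanup beats the watchdog
print(run_worker(3.0))  # watchdog fires first
```

With a fast cleanup the fake worker should exit with 143, and with a slow one the watchdog should win and the exit code should be 20, which matches the two outcomes described above.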
Re: Storm kill fails with exit code 143
An exit code of 143 indicates that a SIGTERM was received (143 - 128 = 15).

It seems like something killed the shutdown script.

https://www.tldp.org/LDP/abs/html/exitcodes.html

On Sun, May 5, 2019 at 8:19 PM JF Chen wrote:
> [...]
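The arithmetic above (143 - 128 = 15) follows the common shell convention that an exit status of 128 + N means "terminated by signal N". A minimal sketch of that decoding, assuming a POSIX system; `describe_exit_code` is a hypothetical helper name and not part of Storm:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status, where 128 + N means 'killed by signal N'."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited normally with status {code}"


print(describe_exit_code(143))  # terminated by signal 15 (SIGTERM)
```

A shutdown script could log this description next to the raw status to make it obvious that the command was killed rather than failing on its own.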
Re: Storm kill fails with exit code 143
Do you run your Storm application on YARN?

Regards,
Junfeng Chen

On Mon, May 6, 2019 at 4:53 AM Mitchell Rathbun (BLOOMBERG/ 731 LEX) <mrathb...@bloomberg.net> wrote:
> [...]
Storm kill fails with exit code 143
Recently our shutdown script failed when calling storm kill with a return code of 143. Typically this means that SIGTERM was received and the process was terminated. I see in https://issues.apache.org/jira/browse/STORM-2176 that it is possible to get this exit code if a topology takes too long to come down. However, we are running version 1.2.1 of Storm, which should have the fix mentioned in the issue. Is it possible that we have the same cause for our error? When this occurred, many topologies were brought down at once, but only this one topology seemed to have an issue.