Ah, sorry, got off on the wrong track due to the linked issue, which is talking about worker JVM exit codes.

On Mon, May 6, 2019 at 19:34, Mitchell Rathbun (BLOOMBERG/ 731 LEX) <[email protected]> wrote:

> Sorry if my initial question was misleading. The "Storm kill" command
> returned 143; there was no exit code from our topology. Our topology was
> never shut down and never received a command to shut down. As far as I can
> tell, Nimbus never received a command from running Storm kill in this case.
> So the process created to carry out the kill command was the one
> terminated. As Derek mentioned, it seems like something killed the process.
> I am wondering if, since so many topologies were being brought down at once,
> the process took a long time to communicate with Nimbus and timed out/was
> terminated. Is something like this possible? As far as I can tell, there
> was no external command at the time to kill the process.
>
> From: [email protected] At: 05/06/19 13:14:02
> To: [email protected]
> Subject: Re: Storm kill fails with exit code 143
>
> I would assume that what actually happened is that most of your workers
> don't manage to finish shutting down the worker gracefully, and so exit
> with code 20 due to the 1 second time limit imposed by the shutdown hook.
> One of your workers happened to run the entire shutdown sequence within the
> 1 second time limit, and so returns 143.
>
> Basically what is happening is that the supervisor sends SIGTERM to the
> worker to get it to shut down. The worker then runs its shutdown sequence
> to shut down gracefully. Before starting the shutdown sequence, the worker
> sets up a new thread that sleeps for 1 second, then halts the JVM with exit
> code 20. If the shutdown exceeds the time limit, you get exit code 20. If
> the shutdown is finished within the time limit, you get 143 in response to
> the original SIGTERM.
>
> On Mon, May 6, 2019 at 18:22, Derek Dagit <[email protected]> wrote:
>
>> An exit code of 143 indicates a SIGTERM was received. (143 - 128 = 15).
>>
>> It seems like something killed the shutdown script.
>>
>> https://www.tldp.org/LDP/abs/html/exitcodes.html
>>
>> On Sun, May 5, 2019 at 8:19 PM JF Chen <[email protected]> wrote:
>>
>>> Do you run your storm application on yarn?
>>>
>>> Regards,
>>> Junfeng Chen
>>>
>>>
>>> On Mon, May 6, 2019 at 4:53 AM Mitchell Rathbun (BLOOMBERG/ 731 LEX) <
>>> [email protected]> wrote:
>>>
>>>> Recently our shutdown script failed when calling storm kill with a
>>>> return code of 143. Typically this means that SIGTERM was received and the
>>>> process was terminated. I see in
>>>> https://issues.apache.org/jira/browse/STORM-2176 that it is possible
>>>> to get this exit code if a topology takes too long to come down. However,
>>>> we are running version 1.2.1 of Storm, which should have the fix mentioned
>>>> in the issue. Is it possible that we have the same cause for our error?
>>>> When this occurred, many topologies were brought down at once, but only
>>>> this one topology seemed to have an issue.
>>>>
>>>
>
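
For anyone checking the arithmetic quoted above (143 - 128 = 15), here is a minimal Python sketch of the shell convention that an exit status above 128 usually means "terminated by signal N", where N = status - 128. The function name describe_exit_status is made up for illustration and is not part of Storm. Note that this is only a convention: a process can also choose to exit with such a code itself, so the result is a hint rather than proof.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-reported exit status using the 128 + N convention.

    Hypothetical helper for illustration only; not part of Storm.
    """
    if status > 128:
        signum = status - 128
        try:
            # Look up the symbolic name for this signal number, if the
            # platform defines one.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"exit status {status}: consistent with signal {signum} ({name})"
    return f"exit code {status} (no signal implied)"


if __name__ == "__main__":
    # 143 is the status returned by "storm kill" in this thread.
    print(describe_exit_status(143))
    # 20 is the code the worker's shutdown timer is described as using.
    print(describe_exit_status(20))
```

In a shutdown script, the same check could be applied to the status of the storm kill call to log whether the client process itself appears to have been signalled, as opposed to the topology failing to stop.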
