Sorry if my initial question was misleading. The "storm kill" command itself returned 143; there was no exit code from our topology. Our topology was never shut down and never received a command to shut down. As far as I can tell, Nimbus never received the kill request in this case, so the process created to carry out the kill command was the one that was terminated. As Derek mentioned, it seems like something killed that process. Since so many topologies were being brought down at once, I am wondering whether the process took a long time to communicate with Nimbus and timed out or was terminated. Is something like this possible? As far as I can tell, nothing external issued a command to kill the process at the time.
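For anyone trying to reproduce this, here is a minimal sketch of how a shutdown script could record whether the command it launched exited on its own or was killed by a signal. It is not taken from our script: `describe_exit` and `run_and_describe` are hypothetical helper names, and the topology name in the usage note is a placeholder. It assumes a POSIX system, where Python's `subprocess` reports death by signal N as a return code of -N, while a shell reports the same event as 128 + N.

```python
import signal
import subprocess
import sys
from typing import Sequence


def _signal_name(number: int) -> str:
    """Return the symbolic name for a signal number, or a generic label."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"signal {number}"


def describe_exit(returncode: int) -> str:
    """Turn a process return code into a short human-readable description."""
    if returncode < 0:
        # subprocess convention on POSIX: -N means "killed by signal N".
        return f"killed by {_signal_name(-returncode)}"
    if returncode > 128:
        # Shell convention: 128 + N means "killed by signal N".
        return f"killed by {_signal_name(returncode - 128)} (shell-style code)"
    return f"exited with code {returncode}"


def run_and_describe(cmd: Sequence[str]) -> str:
    """Run a command, wait for it to finish, and describe how it ended."""
    result = subprocess.run(list(cmd))
    return describe_exit(result.returncode)


if __name__ == "__main__":
    # Harmless stand-in command; a real script would pass its own command line.
    print(run_and_describe([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

In a real script the command would be something like `["storm", "kill", "my-topology"]`, where `my-topology` is a placeholder. Logging the description next to a timestamp would at least show whether the client process was signalled and when.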
From: [email protected]
At: 05/06/19 13:14:02
To: [email protected]
Subject: Re: Storm kill fails with exit code 143

I would assume that what actually happened is that most of your workers don't manage to finish shutting down gracefully, and so exit with code 20 due to the 1 second time limit imposed by the shutdown hook. One of your workers happened to run the entire shutdown sequence within the 1 second time limit, and so returns 143.

Basically what is happening is that the supervisor sends SIGTERM to the worker to get it to shut down. The worker then runs its shutdown sequence to shut down gracefully. Before starting the shutdown sequence, the worker sets up a new thread that sleeps for 1 second, then halts the JVM with exit code 20. If the shutdown exceeds the time limit, you get exit code 20. If the shutdown is finished within the time limit, you get 143 in response to the original SIGTERM.

On Mon, May 6, 2019 at 18:22, Derek Dagit <[email protected]> wrote:

An exit code of 143 indicates a SIGTERM was received. (143 - 128 = 15). It seems like something killed the shutdown script.

https://www.tldp.org/LDP/abs/html/exitcodes.html

On Sun, May 5, 2019 at 8:19 PM JF Chen <[email protected]> wrote:

Do you run your storm application on yarn?

Regards,
Junfeng Chen

On Mon, May 6, 2019 at 4:53 AM Mitchell Rathbun (BLOOMBERG/ 731 LEX) <[email protected]> wrote:

Recently our shutdown script failed when calling storm kill with a return code of 143. Typically this means that SIGTERM was received and the process was terminated. I see in https://issues.apache.org/jira/browse/STORM-2176 that it is possible to get this exit code if a topology takes too long to come down. However, we are running version 1.2.1 of Storm, which should have the fix mentioned in the issue. Is it possible that we have the same cause for our error? When this occurred, many topologies were brought down at once, but only this one topology seemed to have an issue.
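The worker behaviour described earlier in the thread (exit code 20 when the graceful shutdown overruns the limit, 143 when it finishes in time) can be written down as a small model. This is only a sketch of the rule as stated in the thread, not Storm's actual code: the function name and constants are made up for illustration, and the 1 second limit and exit code 20 are taken from the message above rather than checked against the Storm source.

```python
SIGTERM_NUMBER = 15          # POSIX signal number for SIGTERM
SIGNAL_EXIT_BASE = 128       # "128 + N" convention for death by signal N
FORCED_HALT_CODE = 20        # exit code quoted in the thread for a forced halt
SHUTDOWN_LIMIT_SECS = 1.0    # time limit quoted in the thread


def expected_worker_exit_code(shutdown_secs: float,
                              limit_secs: float = SHUTDOWN_LIMIT_SECS) -> int:
    """Model the exit code of a worker that received SIGTERM.

    shutdown_secs is how long the graceful shutdown sequence would take.
    """
    if shutdown_secs > limit_secs:
        # The watchdog thread wakes up first and halts the process.
        return FORCED_HALT_CODE
    # Shutdown finished in time, so the process ends as a result of the
    # original SIGTERM: 128 + 15 = 143.
    return SIGNAL_EXIT_BASE + SIGTERM_NUMBER


if __name__ == "__main__":
    for secs in (0.2, 5.0):
        print(secs, expected_worker_exit_code(secs))
```

Under this model both 20 and 143 are expected outcomes for a worker that was asked to stop, which is a different situation from the kill client itself returning 143.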
