Hi Don,
what you report below is definitely a bug; please find attached
Patch #1: cleans up stopped child processes on salloc exit
I have tested your scripts under several conditions. What happens is that
the shell spawned via salloc from within a bash shell script is suspended and
therefore keeps sending itself SIGTTIN signals. The problem was that the
salloc code evaluating the child status did not catch this case.
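To illustrate the clean-up idea outside of salloc, here is a minimal sketch.
It assumes a `sleep` child as a hypothetical stand-in for the spawned shell
and uses SIGSTOP to put it into the stopped state; it is not the salloc code
itself, only a demonstration that SIGKILL ends a stopped child so that the
parent can reap it.

```shell
#!/bin/bash
# Hypothetical stand-in for the shell that salloc spawns.
sleep 30 &
child=$!

# Put the child into the stopped state (SIGTTIN would have the same effect
# on a background process that tries to read from the terminal).
kill -s STOP "$child"

# SIGKILL is delivered even though the child is stopped.
kill -s KILL "$child"

# Reap the child; the shell reports death by signal N as 128+N.
wait "$child"
status=$?
echo "child exit status: $status"
```

With SIGKILL being signal 9, the reported status should be 137.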
With regard to the scripts, I am sorry, but this is not the intended mode for
salloc. We had an earlier discussion involving scripts in which salloc was
invoked to run jobs like this:
    #!/bin/sh
    salloc -N1 ... ./my_exe1 &
    salloc -N2 ... ./my_exe2 &
That discussion is what triggered the changes that distinguish between
interactive and non-interactive mode.
Both of your scripts use a mode that lies between running salloc
non-interactively in a shell script and running it interactively from the
command line. Both scripts will work if you instead use
    . script.sh (or "source script.sh")
but the mode your scripts use is not supported. I spent some time considering
whether it should be, but alternative formulations exist for these use cases:
 * shell functions, e.g.
     function _salloc() {
         if [ $# -eq 0 ]; then salloc -N2; else salloc -w "$1"; fi
     }
 * sourcing the file directly,
 * shell aliases.
Since catering for these half-interactive modes also makes things more
difficult, I can not see the point in taking this further.
Thank you for pointing out the bug.
Gerrit
On Tue, 31 May 2011 15:02:57 -0700 [email protected] wrote:
> I do not think it is as simple as just the executable program terminating
> too soon. One of our testers encountered this problem with a simple
> script that attempts to invoke a default shell and allow requesting either
> a specific node or just a count of nodes, i.e.,
>
> #! /bin/bash
> # usage: salloc
> # salloc host
> #
> if [ $# -eq 0 ]; then
> salloc -N2
> else
> salloc -w $1
> fi
>
> When this script is executed (on SLURM 2.2.3), it immediately terminates:
>
> [jouvin@xna0 slurmtest]$ test.sh
> salloc: Granted job allocation 556
> salloc: Relinquishing job allocation 556
> salloc: Job allocation 556 has been revoked.
>
> whereas if the "salloc -N2" is executed directly at the command prompt,
> it makes the allocation and invokes the shell, as intended.
>
> I reproduced the problem on SLURM 2.2.5 with an even simpler script:
>
> #! /bin/bash
> salloc -N1 /bin/bash
>
> This script seems to terminate immediately, as above, but doing a "ps j"
> reveals that there is a copy of "/bin/bash" in execution, with a parent
> pid of "1":
>
> [stag] (dalbert) dalbert> ps j
> PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
> 8778 8785 8785 8785 pts/1 28118 Ss 605 0:00 -bash
> 27154 27161 27161 27161 pts/5 27161 Ss+ 605 0:00 -bash
> 1 28090 28090 27161 pts/5 27161 R 605 0:04 /bin/bash
> 8785 28118 28118 8785 pts/1 28118 R+ 605 0:00 ps j
>
> The "top" monitor shows that this pid is consuming 100% of a processor.
> Attaching to the pid with gdb reveals:
>
> [stag] (dalbert) 314778> gdb attach 28090
> GNU gdb Fedora (6.8-27.el5)
> (gdb) bt
> #0 0x00000034198306a7 in kill () from /lib64/libc.so.6
> #1 0x0000000000436f53 in initialize_job_control ()
> #2 0x000000000041a7b8 in main ()
>
> I added "strace" to the script:
>
> #! /bin/bash
> salloc -N1 strace /bin/bash
>
> When I run the script with strace, it loops, displaying on the terminal
> the following, until the terminal is disconnected. Ctrl-C or Ctrl-D have
> no effect:
>
> [stag] (dalbert) 314778> ./test.sh
> salloc: Granted job allocation 87
> execve("/bin/bash", ["/bin/bash"], [/* 59 vars */]) = 0
> <<< lines deleted for brevity >>>
> munmap(0x2b97ad742000, 4096) = 0
> stat("/home/dalbert/test/314778", {st_mode=S_IFDIR|0775, st_size=4096,
> ...}) = 0
> stat(".", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> getpid() = 22714
> getppid() = 22713
> getpgrp() = 22713
> dup(2) = 4
> getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=64*1024}) = 0
> fcntl(255, F_GETFD) = -1 EBADF (Bad file descriptor)
> dup2(4, 255) = 255
> close(4) = 0
> ioctl(255, TIOCGPGRP, [22709]) = 0
> rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, {0x1, [],
> SA_RESTORER, 0x3419830280}, 8) = 0
> kill(0, SIGTTIN) = 0
> --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> rt_sigaction(SIGTTIN, {0x1, [], SA_RESTORER, 0x3419830280}, {SIG_DFL, [],
> SA_RESTORER, 0x3419830280}, 8) = 0
> ioctl(255, TIOCGPGRP, [22709]) = 0
> rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, {0x1, [],
> SA_RESTORER, 0x3419830280}, 8) = 0
> kill(0, SIGTTIN) = 0
> --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> <<< at this point the traces repeat, until the terminal is
> disconnected >>>
>
>
> Could this be an unintended consequence of the "job control" changes to
> salloc that were introduced in SLURM 2.2.2 and further modified in SLURM
> 2.2.3?
>
> -Don Albert-
>
>
> [email protected] wrote on 03/07/2011 04:34:56 PM:
>
> > I'll add that salloc will revoke the allocation as soon as its
> > executable program
> > terminates. If your program 1grabslurm starts background processes and
> exits,
> > those background processes could persist after the 1grabslurm
> > program terminates
> > and the job allocation has been revoked.
> > ________________________________________
> > From: [email protected] [[email protected].
> > gov] On Behalf Of Gerrit Renker [[email protected]]
> > Sent: Saturday, March 05, 2011 2:10 AM
> > To: [email protected]
> > Cc: [email protected]
> > Subject: Re: [slurm-dev] Running Salloc From Bash Fails Under 2.2.3?
> >
> > I just tried something similar under v2.3 pre3 (whose salloc is
> > almost identical
> > with that of 2.2.3) and it worked.
> >
> > Looking at your output it seems that the allocation (#38) is revoked
> > immediately
> > after it was granted, which could happen if the script '1grabslurm'
> > exits very soon.
> >
> > I can not see a problem with the script below, but there may well be
> > something that
> > needs to be checked in the '1grabslurm' script -- please send more
> > information, you
> > can also send this privately.
> >
> > Alternatively, since it seems that salloc is used to generate an
> > allocation, you
> > could consider the --no-shell mode.
> >
> > On Fri, 4 Mar 2011 18:12:23 -0800 [email protected] wrote:
> > > I have a few shell scripts that are basically things like:
> > > #!/bin/bash
> > > salloc -N1 -n1 -p pubint 1grabslurm 1
> > >
> > > At least under our ancient 1.3.x installation, salloc worked from a
> > > script in exactly the same way as from the command line.
> > >
> > > Under 2.2.3, the salloc line works fine when typed directly, but when
> > > invoked from a script I instead see:
> > >
> > > salloc: Granted job allocation 38
> > > salloc: Relinquishing job allocation 38
> > > salloc: Job allocation 38 has been revoked.
> > > /root # srun: forcing job termination
> > > srun: got SIGCONT
> > > srun: error: Task launch for 38.0 failed on node hyraxD01: Job
> > > credential revoked
> > > srun: error: Application launch failed: Job credential revoked
> > > srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> > > srun: error: Timed out waiting for job step to complete
> > > tcsetattr: Input/output error
> > >
> > > Thanks in advance for any advice,
> > > Jeff Katcher
> >
salloc: clean up stopped child processes
This fixes a bug reported by Don Albert.
The problem is that whenever salloc exits with a child process in stopped state
(suspended or stopped on terminal input/output), an orphaned process is left
behind, since this case is not caught by the code evaluating the child status.
This patch adds the missing case. It uses SIGKILL, which is the only signal
that terminates a process while it remains stopped. It was decided not to try
to re-awaken the process using SIGCONT, since (a) this happens during session
clean-up and (b) if the condition is due to SIGTTIN, the process immediately
becomes stopped again.
---
src/salloc/salloc.c | 3 +++
1 file changed, 3 insertions(+)
--- a/src/salloc/salloc.c
+++ b/src/salloc/salloc.c
@@ -476,6 +476,9 @@ relinquish:
if (WIFEXITED(status)) {
rc = WEXITSTATUS(status);
+ } else if (WIFSTOPPED(status)) {
+ /* Terminate stopped child process */
+ _forward_signal(SIGKILL);
} else if (WIFSIGNALED(status)) {
verbose("Command \"%s\" was terminated by signal %d",
command_argv[0], WTERMSIG(status));