I do not think it is as simple as just the executable program terminating
too soon. One of our testers encountered this problem with a simple
script that attempts to invoke a default shell and allow requesting either
a specific node or just a count of nodes, i.e.,
#! /bin/bash
# usage: salloc
#        salloc host
#
if [ $# -eq 0 ]; then
    salloc -N2
else
    salloc -w $1
fi
When this script is executed (on SLURM 2.2.3), it immediately terminates:
[jouvin@xna0 slurmtest]$ test.sh
salloc: Granted job allocation 556
salloc: Relinquishing job allocation 556
salloc: Job allocation 556 has been revoked.
whereas if the "salloc -N2" is executed directly at the command prompt,
it makes the allocation and invokes the shell, as intended.
I reproduced the problem on SLURM 2.2.5 with an even simpler script:
#! /bin/bash
salloc -N1 /bin/bash
This script seems to terminate immediately, as above, but doing a "ps j"
reveals that there is a copy of "/bin/bash" in execution, with a parent
pid of "1":
[stag] (dalbert) dalbert> ps j
PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
8778 8785 8785 8785 pts/1 28118 Ss 605 0:00 -bash
27154 27161 27161 27161 pts/5 27161 Ss+ 605 0:00 -bash
1 28090 28090 27161 pts/5 27161 R 605 0:04 /bin/bash
8785 28118 28118 8785 pts/1 28118 R+ 605 0:00 ps j
The "top" monitor shows that this pid is consuming 100% of a processor.
Attaching to the pid with gdb reveals:
[stag] (dalbert) 314778> gdb attach 28090
GNU gdb Fedora (6.8-27.el5)
(gdb) bt
#0 0x00000034198306a7 in kill () from /lib64/libc.so.6
#1 0x0000000000436f53 in initialize_job_control ()
#2 0x000000000041a7b8 in main ()
I added "strace" to the script:
#! /bin/bash
salloc -N1 strace /bin/bash
When I run the script with strace, it loops, displaying on the terminal
the following, until the terminal is disconnected. Ctrl-C or Ctrl-D have
no effect:
[stag] (dalbert) 314778> ./test.sh
salloc: Granted job allocation 87
execve("/bin/bash", ["/bin/bash"], [/* 59 vars */]) = 0
<<< lines deleted for brevity >>>
munmap(0x2b97ad742000, 4096) = 0
stat("/home/dalbert/test/314778", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
stat(".", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
getpid() = 22714
getppid() = 22713
getpgrp() = 22713
dup(2) = 4
getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=64*1024}) = 0
fcntl(255, F_GETFD) = -1 EBADF (Bad file descriptor)
dup2(4, 255) = 255
close(4) = 0
ioctl(255, TIOCGPGRP, [22709]) = 0
rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, {0x1, [], SA_RESTORER, 0x3419830280}, 8) = 0
kill(0, SIGTTIN) = 0
--- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
--- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
rt_sigaction(SIGTTIN, {0x1, [], SA_RESTORER, 0x3419830280}, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, 8) = 0
ioctl(255, TIOCGPGRP, [22709]) = 0
rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, {0x1, [], SA_RESTORER, 0x3419830280}, 8) = 0
kill(0, SIGTTIN) = 0
--- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
--- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
<<< at this point the traces repeat, until the terminal is
disconnected >>>
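Both traces seem to point at the same condition: the shell is not in the
terminal's foreground process group. In the "ps j" output above, pid 28090
has PGID 28090 but TPGID 27161, and in the strace getpgrp() returns 22713
while TIOCGPGRP reports 22709. As far as I can tell, bash keeps sending
itself SIGTTIN during job control initialization until those two values
match. Below is a minimal sketch of a check that could be run against a
suspect pid. The function names fg_state and check_pid are made up for
illustration, and it assumes a procps-style ps that accepts
"-o pgid= -o tpgid=".

```bash
#!/bin/bash
# Hypothetical helper: classify a process as foreground or background by
# comparing its process group (PGID) with the terminal's foreground
# process group (TPGID), the same two columns that "ps j" prints.
fg_state() {
    local pgid=$1 tpgid=$2
    if [ "$tpgid" -lt 0 ]; then
        # ps reports a negative TPGID when there is no controlling terminal
        echo "no-tty"
    elif [ "$pgid" -eq "$tpgid" ]; then
        echo "foreground"
    else
        echo "background"
    fi
}

# Look up PGID and TPGID for a pid (default: this shell) and classify it.
# Assumption: ps supports "-o pgid= -o tpgid=" (procps does).
check_pid() {
    local pid=${1:-$$} pgid tpgid
    read -r pgid tpgid < <(ps -o pgid= -o tpgid= -p "$pid") || return 1
    fg_state "$pgid" "$tpgid"
}
```

For example, "check_pid 28090" on the system above would be expected to
report "background", since 28090 and 27161 differ.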
Could this be an unintended consequence of the "job control" changes to
salloc that were introduced in SLURM 2.2.2 and further modified in SLURM
2.2.3?
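In the meantime, the --no-shell mode that Gerrit mentions in the quoted
message below might avoid the job control path altogether, at the cost of
not getting an interactive shell. The following is only a sketch and has
not been tried here. The function names parse_jobid and run_in_allocation
are hypothetical, and it assumes that salloc still prints its
"Granted job allocation" message (on stderr) when --no-shell is given.

```bash
#!/bin/bash
# Extract the job id from salloc's "Granted job allocation <id>" message.
parse_jobid() {
    sed -n 's/^salloc: Granted job allocation \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Create an allocation without spawning a shell, run one command inside it
# with srun, then release the allocation with scancel.
# Assumption: salloc, srun and scancel are on PATH.
run_in_allocation() {
    local jobid rc
    jobid=$(salloc -N1 --no-shell 2>&1 | parse_jobid)
    if [ -z "$jobid" ]; then
        echo "no allocation granted" >&2
        return 1
    fi
    srun --jobid="$jobid" "$@"
    rc=$?
    scancel "$jobid"
    return $rc
}
```

Usage would be something like "run_in_allocation hostname" from a script.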
-Don Albert-
[email protected] wrote on 03/07/2011 04:34:56 PM:
> I'll add that salloc will revoke the allocation as soon as its
> executable program terminates. If your program 1grabslurm starts
> background processes and exits, those background processes could
> persist after the 1grabslurm program terminates and the job
> allocation has been revoked.
> ________________________________________
> From: [email protected] [[email protected].
> gov] On Behalf Of Gerrit Renker [[email protected]]
> Sent: Saturday, March 05, 2011 2:10 AM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: [slurm-dev] Running Salloc From Bash Fails Under 2.2.3?
>
> I just tried something similar under v2.3 pre3 (whose salloc is
> almost identical with that of 2.2.3) and it worked.
>
> Looking at your output it seems that the allocation (#38) is revoked
> immediately after it was granted, which could happen if the script
> '1grabslurm' exits very soon.
>
> I can not see a problem with the script below, but there may well be
> something that needs to be checked in the '1grabslurm' script --
> please send more information, you can also send this privately.
>
> Alternatively, since it seems that salloc is used to generate an
> allocation, you could consider the --no-shell mode.
>
> On Fri, 4 Mar 2011 18:12:23 -0800 [email protected] wrote:
> > I have a few shell scripts that are basically things like:
> > #!/bin/bash
> > salloc -N1 -n1 -p pubint 1grabslurm 1
> >
> > At least under our ancient 1.3.x installation, salloc worked from a
> > script in exactly the same way as from the command line.
> >
> > Under 2.2.3, the salloc line works fine when typed directly, but when
> > invoked from a script I instead see:
> >
> > salloc: Granted job allocation 38
> > salloc: Relinquishing job allocation 38
> > salloc: Job allocation 38 has been revoked.
> > /root # srun: forcing job termination
> > srun: got SIGCONT
> > srun: error: Task launch for 38.0 failed on node hyraxD01: Job credential revoked
> > srun: error: Application launch failed: Job credential revoked
> > srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> > srun: error: Timed out waiting for job step to complete
> > tcsetattr: Input/output error
> >
> > Thanks in advance for any advice,
> > Jeff Katcher
>