I do not think it is as simple as the executable program terminating too 
soon. One of our testers encountered this problem with a simple script 
that invokes a default shell and allows requesting either a specific node 
or just a count of nodes, i.e.,

#! /bin/bash
#  usage: salloc
#         salloc host
#
if [ $# -eq 0 ]; then
   salloc -N2
else
   salloc -w "$1"
fi

When this script is executed (on SLURM 2.2.3), it terminates immediately:

[jouvin@xna0 slurmtest]$ test.sh
salloc: Granted job allocation 556
salloc: Relinquishing job allocation 556
salloc: Job allocation 556 has been revoked.

whereas if "salloc -N2" is executed directly at the command prompt, 
it makes the allocation and invokes the shell, as intended.

I reproduced the problem on SLURM 2.2.5 with an even simpler script:

#! /bin/bash
salloc -N1 /bin/bash

This script also seems to terminate immediately, as above, but a "ps j" 
reveals that a copy of "/bin/bash" is still running, with a parent pid 
of "1":

[stag] (dalbert) dalbert> ps j
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
 8778  8785  8785  8785 pts/1    28118 Ss     605   0:00 -bash
27154 27161 27161 27161 pts/5    27161 Ss+    605   0:00 -bash
    1 28090 28090 27161 pts/5    27161 R      605   0:04 /bin/bash
 8785 28118 28118  8785 pts/1    28118 R+     605   0:00 ps j
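Note the third line: the runaway /bin/bash has PPID 1 (it has been 
re-parented to init) and sits in process group 28090, while the terminal's 
foreground group is 27161, so it is a background process in an orphaned 
process group. A small diagnostic sketch (field names per procps; run it 
in the shell you want to inspect) prints the fields that matter for job 
control:

```shell
#!/bin/bash
# Print the job-control state of the current shell: a PGID that differs
# from TPGID means we are not the terminal's foreground job; PPID 1 with
# no parent left in our session means the process group is orphaned.
ps -o pid,ppid,pgid,sid,tpgid,stat,comm -p $$
```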

The "top" monitor shows this process consuming 100% of a processor. 
Attaching to it with gdb reveals:

[stag] (dalbert) 314778> gdb attach 28090
GNU gdb Fedora (6.8-27.el5)
(gdb) bt
#0  0x00000034198306a7 in kill () from /lib64/libc.so.6
#1  0x0000000000436f53 in initialize_job_control ()
#2  0x000000000041a7b8 in main ()

I added "strace" to the script:

#! /bin/bash
salloc -N1 strace /bin/bash

When I run the script with strace, it loops, displaying the following on 
the terminal until the terminal is disconnected. Neither Ctrl-C nor 
Ctrl-D has any effect:

[stag] (dalbert) 314778> ./test.sh
salloc: Granted job allocation 87
execve("/bin/bash", ["/bin/bash"], [/* 59 vars */]) = 0
     <<< lines deleted for brevity >>>
munmap(0x2b97ad742000, 4096)            = 0
stat("/home/dalbert/test/314778", {st_mode=S_IFDIR|0775, st_size=4096, 
...}) = 0
stat(".", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
getpid()                                = 22714
getppid()                               = 22713
getpgrp()                               = 22713
dup(2)                                  = 4
getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=64*1024}) = 0
fcntl(255, F_GETFD)                     = -1 EBADF (Bad file descriptor)
dup2(4, 255)                            = 255
close(4)                                = 0
ioctl(255, TIOCGPGRP, [22709])          = 0
rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, {0x1, [], 
SA_RESTORER, 0x3419830280}, 8) = 0
kill(0, SIGTTIN)                        = 0
--- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
--- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
rt_sigaction(SIGTTIN, {0x1, [], SA_RESTORER, 0x3419830280}, {SIG_DFL, [], 
SA_RESTORER, 0x3419830280}, 8) = 0
ioctl(255, TIOCGPGRP, [22709])          = 0
rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, {0x1, [], 
SA_RESTORER, 0x3419830280}, 8) = 0
kill(0, SIGTTIN)                        = 0
--- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
--- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
   <<< at this point the traces repeat, until the terminal is 
disconnected >>>
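The trace shows what is looping: bash restores the default action for 
SIGTTIN, sends SIGTTIN to its own process group with kill(0, SIGTTIN), 
expecting to be stopped until it is handed the foreground, then finds 
itself still in the background (the TIOCGPGRP ioctl keeps returning 
22709, not its own group 22713) and retries. Because the process group is 
orphaned, the kernel discards the stop signal, so bash never stops and 
spins at 100% CPU. The discard behavior can be demonstrated with a small 
sketch (assuming the util-linux setsid command is available):

```shell
#!/bin/bash
# setsid runs the child in a new session, so the child's process group is
# orphaned (its parent is in a different session). Stop signals such as
# SIGTTIN sent to an orphaned process group are discarded by the kernel,
# so the child is not stopped and the echo still runs.
setsid bash -c 'kill -TTIN 0; echo "not stopped"'
```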


Could this be an unintended consequence of the "job control" changes to 
salloc that were introduced in SLURM 2.2.2 and further modified in SLURM 
2.2.3?
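In case it helps others hitting this: until the behavior is explained, 
the --no-shell mode that Gerrit mentions below may serve as a workaround 
when salloc is only used to create the allocation, since no interactive 
shell then needs the terminal. A sketch (the job id and srun usage are 
illustrative, not taken from the failing case):

```
#!/bin/bash
# Workaround sketch: create the allocation without spawning a shell,
# then run steps against it explicitly and release it when done.
salloc -N1 --no-shell
# srun --jobid=<jobid> <command>   # jobid from salloc's output
# scancel <jobid>                  # release the allocation
```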

   -Don Albert-


[email protected] wrote on 03/07/2011 04:34:56 PM:

> I'll add that salloc will revoke the allocation as soon as its 
> executable program
> terminates. If your program 1grabslurm starts background processes and 
exits,
> those background processes could persist after the 1grabslurm 
> program terminates
> and the job allocation has been revoked.
> ________________________________________
> From: [email protected] [[email protected].
> gov] On Behalf Of Gerrit Renker [[email protected]]
> Sent: Saturday, March 05, 2011 2:10 AM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: [slurm-dev] Running Salloc From Bash Fails Under 2.2.3?
> 
> I just tried something similar under v2.3 pre3 (whose salloc is 
> almost identical
> with that of 2.2.3) and it worked.
> 
> Looking at your output it seems that the allocation (#38) is revoked
> immediately
> after it was granted, which could happen if the script '1grabslurm' 
> exits very soon.
> 
> I cannot see a problem with the script below, but there may well be
> something that
> needs to be checked in the '1grabslurm' script -- please send more 
> information, you
> can also send this privately.
> 
> Alternatively, since it seems that salloc is used to generate an 
> allocation, you
> could consider the --no-shell mode.
> 
> On Fri, 4 Mar 2011 18:12:23 -0800 [email protected] wrote:
> > I have a few shell scripts that are basically things like:
> > #!/bin/bash
> > salloc -N1 -n1 -p pubint 1grabslurm 1
> >
> > At least under our ancient 1.3.x installation, salloc worked from a
> > script in exactly the same way as from the command line.
> >
> > Under 2.2.3, the salloc line works fine when typed directly, but when
> > invoked from a script I instead see:
> >
> > salloc: Granted job allocation 38
> > salloc: Relinquishing job allocation 38
> > salloc: Job allocation 38 has been revoked.
> > /root # srun: forcing job termination
> > srun: got SIGCONT
> > srun: error: Task launch for 38.0 failed on node hyraxD01: Job
> > credential revoked
> > srun: error: Application launch failed: Job credential revoked
> > srun: Job step aborted: Waiting up to 2 seconds for job step to 
finish.
> > srun: error: Timed out waiting for job step to complete
> > tcsetattr: Input/output error
> >
> > Thanks in advance for any advice,
> > Jeff Katcher
> 
