This change will be in the next SLURM releases: v2.2.7 and v2.3.0-pre6.

Thank you!
________________________________________
From: [email protected] [[email protected]] On Behalf 
Of Gerrit Renker [[email protected]]
Sent: Wednesday, June 01, 2011 5:43 AM
To: [email protected]
Cc: [email protected]
Subject: Re: [slurm-dev] Running Salloc From Bash Fails Under 2.2.3?

Hi Don,

What you report below is definitely a bug; please find attached

  Patch #1: cleans up stopped child processes on salloc exit

I have tested your scripts under several conditions. What happens is that the
shell spawned by salloc from within a bash script is not in the terminal's
foreground process group, so it is suspended and continually sends itself
SIGTTIN signals. The problem was that salloc's code did not catch this case.
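
To illustrate the stopped-child cleanup the patch performs, here is a minimal
shell sketch (not SLURM source; SIGSTOP stands in for SIGTTIN, since both stop
the process by default, but SIGSTOP cannot be discarded the way SIGTTIN is in
an orphaned process group):

```shell
#!/bin/sh
# Sketch of the problem Patch #1 addresses: a stopped child lingers in
# state T until some process continues or kills it.
sleep 60 &
pid=$!
kill -STOP "$pid"                 # stand-in for the self-sent SIGTTIN
sleep 1                           # let the state change settle
# Linux-specific: field 3 of /proc/<pid>/stat is the process state letter
state=$(cut -d' ' -f3 "/proc/$pid/stat")
echo "child state: $state"        # T = stopped
# what the fixed salloc does on exit: continue, terminate, and reap the child
kill -CONT "$pid"
kill -TERM "$pid"
wait "$pid" 2>/dev/null
echo "cleaned up"
```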

With regard to the scripts, I am sorry, but this is not the intended mode for
salloc. We had an earlier discussion involving scripts where salloc was
invoked to run jobs like

 #!/bin/sh
 salloc -N1 ... ./my_exe1 &
 salloc -N2 ... ./my_exe2 &

which was what triggered these changes that distinguished between interactive
and non-interactive mode.
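
For reference, the intended non-interactive pattern is sketched end to end
below; here 'salloc' is stubbed with a shell function (an assumption, so the
sketch runs without a SLURM installation), and the my_exe* programs by echo:

```shell
#!/bin/sh
# Sketch of the intended non-interactive mode; the real script would
# contain e.g. 'salloc -N1 ... ./my_exe1 &'.
salloc() { shift; "$@"; }         # stand-in: drop the -N option, run the rest
salloc -N1 echo "my_exe1 finished" &
salloc -N2 echo "my_exe2 finished" &
wait                              # keep the script alive until both return
echo "all jobs done"
```

Note the final wait: without it the script exits while the backgrounded
commands are still running.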

Both of your scripts use a mode that falls between running salloc
non-interactively in a shell script and running it interactively from the
command line. Both will work if you instead use

 . script.sh  (or "source script.sh")

but this mode is not supported. I spent some time considering whether it
should be, but alternative formulations exist for these use cases:
 * a shell function, e.g.
   _salloc() {
         if [ $# -eq 0 ]; then salloc -N2; else salloc -w "$1"; fi
   }
 * sourcing the script directly,
 * a shell alias,

and since catering for these half-interactive modes would complicate the code,
I cannot see the point in taking this further.
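
As an aside, sourcing behaves differently because '. script.sh' runs the
commands in the calling shell, so salloc inherits the interactive session. A
quick sketch (the /tmp/pidtest.sh helper is invented for illustration) makes
the PID difference visible:

```shell
#!/bin/sh
# Sketch: sourcing runs in the calling shell (same PID), while executing
# the script forks a child shell.
cat > /tmp/pidtest.sh <<'EOF'
echo $$
EOF
me=$$
sourced=$(. /tmp/pidtest.sh)      # same shell: $$ is still the caller's PID
executed=$(sh /tmp/pidtest.sh)    # child shell: a fresh PID
echo "caller=$me sourced=$sourced executed=$executed"
rm -f /tmp/pidtest.sh
```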

Thank you for pointing out the bug.

Gerrit


On Tue, 31 May 2011 15:02:57 -0700 [email protected] wrote:
> I do not think it is as simple as just the executable program terminating
> too soon.   One of our testers encountered this problem with a simple
> script that attempts to invoke a default shell and allow requesting either
> a specific node or just a count of nodes,  i.e.,
>
> #! /bin/bash
> #  usage: salloc
> #         salloc host
> #
> if [ $# -eq 0 ]; then
>    salloc -N2
> else
>    salloc -w $1
> fi
>
> When this script is executed (on SLURM 2.2.3),  it immediately terminates:
>
> [jouvin@xna0 slurmtest]$ test.sh
> salloc: Granted job allocation 556
> salloc: Relinquishing job allocation 556
> salloc: Job allocation 556 has been revoked.
>
> whereas if the "salloc  -N2" is executed directly at the command prompt,
> it makes the allocation and invokes the shell, as intended.
>
> I reproduced the problem on SLURM 2.2.5 with an even simpler script:
>
> #! /bin/bash
> salloc -N1 /bin/bash
>
> This script seems to terminate immediately,  as above, but doing a "ps j"
> reveals that there is a copy of "/bin/bash" in execution, with a parent
> pid of "1":
>
> [stag] (dalbert) dalbert> ps j
>  PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
>  8778  8785  8785  8785 pts/1    28118 Ss     605   0:00 -bash
> 27154 27161 27161 27161 pts/5    27161 Ss+    605   0:00 -bash
>     1 28090 28090 27161 pts/5    27161 R      605   0:04 /bin/bash
>  8785 28118 28118  8785 pts/1    28118 R+     605   0:00 ps j
>
> The "top" monitor shows that this pid is consuming 100% of a processor.
> Attaching to the pid with gdb reveals:
>
> [stag] (dalbert) 314778> gdb attach 28090
> GNU gdb Fedora (6.8-27.el5)
> (gdb) bt
> #0  0x00000034198306a7 in kill () from /lib64/libc.so.6
> #1  0x0000000000436f53 in initialize_job_control ()
> #2  0x000000000041a7b8 in main ()
>
> I added "strace" to the script:
>
> #! /bin/bash
> salloc -N1 strace /bin/bash
>
> When I run the script with strace,  it loops, displaying on the terminal
> the following,  until the terminal is disconnected.  Ctrl-C or Ctrl-D have
> no effect:
>
> [stag] (dalbert) 314778> ./test.sh
> salloc: Granted job allocation 87
> execve("/bin/bash", ["/bin/bash"], [/* 59 vars */]) = 0
>      <<< lines deleted for brevity >>>
> munmap(0x2b97ad742000, 4096)            = 0
> stat("/home/dalbert/test/314778", {st_mode=S_IFDIR|0775, st_size=4096,
> ...}) = 0
> stat(".", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> getpid()                                = 22714
> getppid()                               = 22713
> getpgrp()                               = 22713
> dup(2)                                  = 4
> getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=64*1024}) = 0
> fcntl(255, F_GETFD)                     = -1 EBADF (Bad file descriptor)
> dup2(4, 255)                            = 255
> close(4)                                = 0
> ioctl(255, TIOCGPGRP, [22709])          = 0
> rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, {0x1, [],
> SA_RESTORER, 0x3419830280}, 8) = 0
> kill(0, SIGTTIN)                        = 0
> --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> rt_sigaction(SIGTTIN, {0x1, [], SA_RESTORER, 0x3419830280}, {SIG_DFL, [],
> SA_RESTORER, 0x3419830280}, 8) = 0
> ioctl(255, TIOCGPGRP, [22709])          = 0
> rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, {0x1, [],
> SA_RESTORER, 0x3419830280}, 8) = 0
> kill(0, SIGTTIN)                        = 0
> --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
>    <<< at this point the traces repeat,  until the terminal is
> disconnected >>>
>
>
> Could this be an unintended consequence of the "job control" changes to
> salloc that were introduced in SLURM 2.2.2 and further modified in SLURM
> 2.2.3?
>
>    -Don Albert-
>
>
> [email protected] wrote on 03/07/2011 04:34:56 PM:
>
> > I'll add that salloc will revoke the allocation as soon as its
> > executable program
> > terminates. If your program 1grabslurm starts background processes and
> exits,
> > those background processes could persist after the 1grabslurm
> > program terminates
> > and the job allocation has been revoked.
> > ________________________________________
> > From: [email protected] [[email protected].
> > gov] On Behalf Of Gerrit Renker [[email protected]]
> > Sent: Saturday, March 05, 2011 2:10 AM
> > To: [email protected]
> > Cc: [email protected]
> > Subject: Re: [slurm-dev] Running Salloc From Bash Fails Under 2.2.3?
> >
> > I just tried something similar under v2.3 pre3 (whose salloc is
> > almost identical
> > with that of 2.2.3) and it worked.
> >
> > Looking at your output it seems that the allocation (#38) is revoked
> > immediately
> > after it was granted, which could happen if the script '1grabslurm'
> > exits very soon.
> >
> > I can not see a problem with the script below, but there may well be
> > something that
> > needs to be checked in the '1grabslurm' script -- please send more
> > information, you
> > can also send this privately.
> >
> > Alternatively, since it seems that salloc is used to generate an
> > allocation, you
> > could consider the --no-shell mode.
> >
> > On Fri, 4 Mar 2011 18:12:23 -0800 [email protected] wrote:
> > > I have a few shell scripts that are basically things like:
> > > #!/bin/bash
> > > salloc -N1 -n1 -p pubint 1grabslurm 1
> > >
> > > At least under our ancient 1.3.x installation, salloc worked from a
> > > script in exactly the same way as from the command line.
> > >
> > > Under 2.2.3, the salloc line works fine when typed directly, but when
> > > invoked from a script I instead see:
> > >
> > > salloc: Granted job allocation 38
> > > salloc: Relinquishing job allocation 38
> > > salloc: Job allocation 38 has been revoked.
> > > /root # srun: forcing job termination
> > > srun: got SIGCONT
> > > srun: error: Task launch for 38.0 failed on node hyraxD01: Job
> > > credential revoked
> > > srun: error: Application launch failed: Job credential revoked
> > > srun: Job step aborted: Waiting up to 2 seconds for job step to  finish.
> > > srun: error: Timed out waiting for job step to complete
> > > tcsetattr: Input/output error
> > >
> > > Thanks in advance for any advice,
> > > Jeff Katcher
> >
