Gerrit,
Thank you for the patch that cleans up the hung process. But...
Perhaps I was not clear in my description of the problem, but the patch
you supplied most emphatically does *not* solve the problem: the bash
shell still gets stuck in the SIGTTIN loop before ever issuing
its own prompt to the user! The "/bin/bash" command should have
executed the shell and allowed the user to enter commands, not
terminated immediately.
You seem to imply that it is somehow illegitimate to execute the "salloc"
within a script. I submit that it is almost second nature for
Linux/Unix programmers to create various "wrapper" scripts to invoke
commands (including "salloc") with certain fixed parameters, while
allowing easy substitution of other parameters. The "salloc" command
itself is essentially a wrapper which allows a user to invoke a specific
command or shell after calling SLURM to reserve some resources. I see
no reason that salloc should not be executable from within a script.
If it cannot be, that is a defect, not a feature.
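To make the point concrete, here is a minimal sketch of such a wrapper (the _salloc name, the node count, and the SALLOC override for dry runs are all illustrative, not taken from our actual scripts):

```shell
# Minimal wrapper sketch: fix some salloc parameters, let the caller
# substitute others.  SALLOC may be overridden (e.g. SALLOC="echo salloc")
# for a dry run that only prints the command line that would be issued.
SALLOC=${SALLOC:-salloc}

_salloc() {
    if [ $# -eq 0 ]; then
        $SALLOC -N2          # no argument: request two nodes
    else
        $SALLOC -w "$1"      # argument given: request that specific node
    fi
}
```

Dropped into a script or a ~/.bashrc, this is exactly the "fixed parameters plus easy substitution" pattern described above.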
It seems to me that these job control features were added to "salloc"
in an attempt to solve a perceived "problem" with occasionally leaving
jobs stranded that need to be manually killed. And now we are piling
patch upon patch to fix things that were broken by these changes. In
this case it is the ability to invoke "salloc" in a script that is broken,
and previously it was the ability to run "salloc" as a background job
that was broken.
In my opinion, all these job control changes should be stripped out of
"salloc", letting it return to being a simple "wrapper" for running a
command or set of commands via a shell after obtaining a set of SLURM
resources.
-Don Albert-
"Jette, Moe" <[email protected]> wrote on 06/01/2011 09:55:04 AM:
> This change will be in the next SLURM releases: v2.2.7 and 2.3.0-pre6.
>
> Thank you!
> ________________________________________
> From: [email protected] [[email protected].
> gov] On Behalf Of Gerrit Renker [[email protected]]
> Sent: Wednesday, June 01, 2011 5:43 AM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: [slurm-dev] Running Salloc From Bash Fails Under 2.2.3?
>
> Hi Don,
>
> What you report below is definitely a bug; please find attached
>
> Patch #1: cleans up stopped child processes on salloc exit
>
> I have tested this and your scripts under several conditions. What
> happens is that the shell spawned via salloc in a bash shell script
> is suspended, and therefore it continually sends itself SIGTTIN signals.
> The problem was that this case was not caught by the code.
>
> With regard to the scripts, I am sorry, but this is not the intended mode for
> salloc. We had an earlier discussion which involved scripts where salloc was
> invoked to run jobs in a manner like
>
> #!/bin/sh
> salloc -N1 ... ./my_exe1 &
> salloc -N2 ... ./my_exe2 &
>
> which was what triggered these changes that distinguished between interactive
> and non-interactive mode.
>
> Both of your scripts use a different mode, which is between running salloc
> non-interactively in a shell script and running it interactively from the
> command line. Both scripts will work if you instead use
>
> . script.sh (or "source script.sh")
>
> but this mode is not supported. I spent some time working on whether it should
> be supported, but since alternative formulations exist for these use cases:
>
> * shell functions, e.g.
>
>     function _salloc() {
>         if [ $# -eq 0 ]; then salloc -N2; else salloc -w "$1"; fi
>     }
>
> * sourcing the file directly,
> * shell aliases,
>
> and since catering for these half-interactive modes makes the implementation
> more difficult, I cannot see the point in taking this further.
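[For illustration, a sketch of the shell-alias alternative mentioned above; the alias name salloc2 and its options are hypothetical, not from the original mail:]

```shell
# Sketch of the shell-alias workaround: an alias is expanded by the
# interactive shell itself, so salloc is still launched directly from
# the interactive session rather than from a separate script process.
# The alias name "salloc2" is illustrative.
alias salloc2='salloc -N2'
```

Typing salloc2 at an interactive prompt then behaves like typing salloc -N2 by hand.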
>
> Thank you for pointing out the bug.
>
> Gerrit
>
>
> On Tue, 31 May 2011 15:02:57 -0700 [email protected] wrote:
> > I do not think it is as simple as just the executable program terminating
> > too soon. One of our testers encountered this problem with a simple
> > script that attempts to invoke a default shell and allow requesting either
> > a specific node or just a count of nodes, i.e.,
> >
> > #! /bin/bash
> > # usage: salloc
> > # salloc host
> > #
> > if [ $# -eq 0 ]; then
> > salloc -N2
> > else
> > salloc -w $1
> > fi
> >
> > When this script is executed (on SLURM 2.2.3), it immediately terminates:
> >
> > [jouvin@xna0 slurmtest]$ test.sh
> > salloc: Granted job allocation 556
> > salloc: Relinquishing job allocation 556
> > salloc: Job allocation 556 has been revoked.
> >
> > whereas if the "salloc -N2" is executed directly at the command prompt,
> > it makes the allocation and invokes the shell, as intended.
> >
> > I reproduced the problem on SLURM 2.2.5 with an even simpler script:
> >
> > #! /bin/bash
> > salloc -N1 /bin/bash
> >
> > This script seems to terminate immediately, as above, but doing a "ps j"
> > reveals that there is a copy of "/bin/bash" in execution, with a parent
> > pid of "1":
> >
> > [stag] (dalbert) dalbert> ps j
> > PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
> > 8778 8785 8785 8785 pts/1 28118 Ss 605 0:00 -bash
> > 27154 27161 27161 27161 pts/5 27161 Ss+ 605 0:00 -bash
> > 1 28090 28090 27161 pts/5 27161 R 605 0:04 /bin/bash
> > 8785 28118 28118 8785 pts/1 28118 R+ 605 0:00 ps j
> >
> > The "top" monitor shows that this pid is consuming 100% of a
processor.
> > Attaching to the pid with gdb reveals:
> >
> > [stag] (dalbert) 314778> gdb attach 28090
> > GNU gdb Fedora (6.8-27.el5)
> > (gdb) bt
> > #0 0x00000034198306a7 in kill () from /lib64/libc.so.6
> > #1 0x0000000000436f53 in initialize_job_control ()
> > #2 0x000000000041a7b8 in main ()
> >
> > I added "strace" to the script:
> >
> > #! /bin/bash
> > salloc -N1 strace /bin/bash
> >
> > When I run the script with strace, it loops, displaying the following on
> > the terminal until the terminal is disconnected. Ctrl-C and Ctrl-D have
> > no effect:
> >
> > [stag] (dalbert) 314778> ./test.sh
> > salloc: Granted job allocation 87
> > execve("/bin/bash", ["/bin/bash"], [/* 59 vars */]) = 0
> > <<< lines deleted for brevity >>>
> > munmap(0x2b97ad742000, 4096) = 0
> > stat("/home/dalbert/test/314778", {st_mode=S_IFDIR|0775, st_size=4096,
> > ...}) = 0
> > stat(".", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> > getpid() = 22714
> > getppid() = 22713
> > getpgrp() = 22713
> > dup(2) = 4
> > getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=64*1024}) = 0
> > fcntl(255, F_GETFD) = -1 EBADF (Bad file descriptor)
> > dup2(4, 255) = 255
> > close(4) = 0
> > ioctl(255, TIOCGPGRP, [22709]) = 0
> > rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, {0x1, [], SA_RESTORER, 0x3419830280}, 8) = 0
> > kill(0, SIGTTIN) = 0
> > --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> > --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> > rt_sigaction(SIGTTIN, {0x1, [], SA_RESTORER, 0x3419830280}, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, 8) = 0
> > ioctl(255, TIOCGPGRP, [22709]) = 0
> > rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, {0x1, [], SA_RESTORER, 0x3419830280}, 8) = 0
> > kill(0, SIGTTIN) = 0
> > --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> > --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> > <<< at this point the traces repeat, until the terminal is disconnected >>>
> >
> >
> > Could this be an unintended consequence of the "job control" changes to
> > salloc that were introduced in SLURM 2.2.2 and further modified in SLURM
> > 2.2.3?
> >
> > -Don Albert-
> >
> >
> > [email protected] wrote on 03/07/2011 04:34:56 PM:
> >
> > > I'll add that salloc will revoke the allocation as soon as its
> > > executable program terminates. If your program 1grabslurm starts
> > > background processes and exits, those background processes could
> > > persist after the 1grabslurm program terminates and the job
> > > allocation has been revoked.
> > > ________________________________________
> > > From: [email protected] [[email protected].
> > > gov] On Behalf Of Gerrit Renker [[email protected]]
> > > Sent: Saturday, March 05, 2011 2:10 AM
> > > To: [email protected]
> > > Cc: [email protected]
> > > Subject: Re: [slurm-dev] Running Salloc From Bash Fails Under 2.2.3?
> > >
> > > I just tried something similar under v2.3 pre3 (whose salloc is
> > > almost identical
> > > with that of 2.2.3) and it worked.
> > >
> > > Looking at your output it seems that the allocation (#38) is revoked
> > > immediately
> > > after it was granted, which could happen if the script '1grabslurm'
> > > exits very soon.
> > >
> > > I can not see a problem with the script below, but there may well be
> > > something that
> > > needs to be checked in the '1grabslurm' script -- please send more
> > > information, you
> > > can also send this privately.
> > >
> > > Alternatively, since it seems that salloc is used to generate an
> > > allocation, you
> > > could consider the --no-shell mode.
> > >
> > > On Fri, 4 Mar 2011 18:12:23 -0800 [email protected] wrote:
> > > > I have a few shell scripts that are basically things like:
> > > > #!/bin/bash
> > > > salloc -N1 -n1 -p pubint 1grabslurm 1
> > > >
> > > > At least under our ancient 1.3.x installation, salloc worked from a
> > > > script in exactly the same way as from the command line.
> > > >
> > > > Under 2.2.3, the salloc line works fine when typed directly, but when
> > > > invoked from a script I instead see:
> > > >
> > > > salloc: Granted job allocation 38
> > > > salloc: Relinquishing job allocation 38
> > > > salloc: Job allocation 38 has been revoked.
> > > > /root # srun: forcing job termination
> > > > srun: got SIGCONT
> > > > srun: error: Task launch for 38.0 failed on node hyraxD01: Job
> > > > credential revoked
> > > > srun: error: Application launch failed: Job credential revoked
> > > > srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> > > > srun: error: Timed out waiting for job step to complete
> > > > tcsetattr: Input/output error
> > > >
> > > > Thanks in advance for any advice,
> > > > Jeff Katcher
> > >