Gerrit,

Thank you for the patch that cleans up the hung process.  But...

Perhaps I was not clear in my description of the problem,  but the patch 
you supplied most emphatically does *not*  solve the problem that the bash 
shell is crashing and getting into the SIGTTIN loop before ever issuing 
its own prompt to the user!    The "/bin/bash" command should have 
executed the shell and allowed the user to enter commands,  and not 
immediately terminated.

You seem to imply that it is somehow illegitimate to execute the "salloc" 
within a script.    I submit that it is almost second nature for 
Linux/Unix programmers to create various "wrapper" scripts to invoke 
commands (including "salloc") with certain fixed parameters, while 
allowing easy substitution of other parameters.  The "salloc" command 
itself is essentially a wrapper which allows a user to invoke a specific 
command or shell after calling SLURM to reserve some resources. I see no 
reason why salloc should not be executable from within a script; if it 
cannot be, that is a defect, not a feature.
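A minimal sketch of the kind of wrapper I mean (the SALLOC_CMD override and the node name are purely illustrative, so the argument logic can be dry-run tested; they are not SLURM features):

```shell
# Sketch of a salloc wrapper as a shell function. SALLOC_CMD is a
# hypothetical override (e.g. "echo" for a dry run); in real use it
# defaults to the actual salloc command.
SALLOC_CMD=${SALLOC_CMD:-salloc}

alloc() {
    if [ $# -eq 0 ]; then
        # Fixed parameter: ask for two nodes and start a shell.
        $SALLOC_CMD -N2 /bin/bash
    else
        # Substituted parameter: ask for the named node(s) instead.
        $SALLOC_CMD -w "$1" /bin/bash
    fi
}
```

Calling "alloc" or "alloc somehost" should then behave exactly as if the salloc line were typed at the prompt.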

It seems to me that these job control features were added to "salloc" 
in an attempt to solve a perceived "problem" with occasionally leaving 
jobs stranded that need to be manually killed.   And now we are piling 
patch upon patch to fix things that were broken by these changes.   In 
this case it is the ability to invoke "salloc" in a script that is broken, 
 and previously it was the ability to run "salloc" as a background job 
that was broken.

In my opinion, all these job control changes should be stripped out of 
"salloc", letting it go back to being a simple "wrapper" that runs a 
command or set of commands via a shell after obtaining a set of SLURM 
resources.

        -Don Albert-


"Jette, Moe" <[email protected]> wrote on 06/01/2011 09:55:04 AM:

> This change will be in the next SLURM releases: v2.2.7 and 2.3.0-pre6.
> 
> Thank you!
> ________________________________________
> From: [email protected] [[email protected].
> gov] On Behalf Of Gerrit Renker [[email protected]]
> Sent: Wednesday, June 01, 2011 5:43 AM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: [slurm-dev] Running Salloc From Bash Fails Under 2.2.3?
> 
> Hi Don,
> 
> what you report below is definitely a bug; please find attached
> 
>   Patch #1: cleans up stopped child processes on salloc exit
> 
> I have tested this condition and your scripts under several conditions. What
> happens is that the shell, which is spawned via salloc in a bash shell script,
> is suspended, therefore it continually sends itself SIGTTIN signals. The
> problem was that this case was not caught by the code.
> 
> With regard to the scripts, I am sorry but this is not the intended mode for
> salloc. We had an earlier discussion which involved scripts where salloc was
> invoked to run jobs in a manner like
> 
>  #!/bin/sh
>  salloc -N1 ... ./my_exe1 &
>  salloc -N2 ... ./my_exe2 &
> 
> which was what triggered these changes that distinguished between interactive
> and non-interactive mode.
> 
> Both of your scripts use a different mode which is between running salloc
> non-interactively in a shell script, and running it interactively from the
> commandline. Both scripts will work if you instead use
> 
>  . script.sh  (or "source script.sh")
> 
> but this mode is not supported. I spent some time working on whether it should
> be supported, but since for these use cases alternative formulations exist:
>  * shell functions, e.g.
>    function _salloc() {
>          if [ $# -eq 0 ];then salloc -N2; else salloc -w $1; fi
>    },
>  * sourcing the file directly,
>  * shell aliases,
> 
> and since catering for these half-interactive modes makes it more difficult, I
> can not see the point in taking this further.
> 
> Thank you for pointing out the bug.
> 
> Gerrit
> 
> 
> On Tue, 31 May 2011 15:02:57 -0700 [email protected] wrote:
> > I do not think it is as simple as just the executable program terminating
> > too soon.   One of our testers encountered this problem with a simple
> > script that attempts to invoke a default shell and allow requesting either
> > a specific node or just a count of nodes,  i.e.,
> >
> > #! /bin/bash
> > #  usage: salloc
> > #         salloc host
> > #
> > if [ $# -eq 0 ]; then
> >    salloc -N2
> > else
> >    salloc -w $1
> > fi
> >
> > When this script is executed (on SLURM 2.2.3), it immediately terminates:
> >
> > [jouvin@xna0 slurmtest]$ test.sh
> > salloc: Granted job allocation 556
> > salloc: Relinquishing job allocation 556
> > salloc: Job allocation 556 has been revoked.
> >
> > whereas if the "salloc -N2" is executed directly at the command prompt,
> > it makes the allocation and invokes the shell, as intended.
> >
> > I reproduced the problem on SLURM 2.2.5 with an even simpler script:
> >
> > #! /bin/bash
> > salloc -N1 /bin/bash
> >
> > This script seems to terminate immediately, as above, but doing a "ps j"
> > reveals that there is a copy of "/bin/bash" in execution, with a parent
> > pid of "1":
> >
> > [stag] (dalbert) dalbert> ps j
> >  PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
> >  8778  8785  8785  8785 pts/1    28118 Ss     605   0:00 -bash
> > 27154 27161 27161 27161 pts/5    27161 Ss+    605   0:00 -bash
> >     1 28090 28090 27161 pts/5    27161 R      605   0:04 /bin/bash
> >  8785 28118 28118  8785 pts/1    28118 R+     605   0:00 ps j
> >
> > The "top" monitor shows that this pid is consuming 100% of a processor.
> > Attaching to the pid with gdb reveals:
> >
> > [stag] (dalbert) 314778> gdb attach 28090
> > GNU gdb Fedora (6.8-27.el5)
> > (gdb) bt
> > #0  0x00000034198306a7 in kill () from /lib64/libc.so.6
> > #1  0x0000000000436f53 in initialize_job_control ()
> > #2  0x000000000041a7b8 in main ()
> >
> > I added "strace" to the script:
> >
> > #! /bin/bash
> > salloc -N1 strace /bin/bash
> >
> > When I run the script with strace, it loops, displaying on the terminal
> > the following, until the terminal is disconnected. Ctrl-C or Ctrl-D have
> > no effect:
> >
> > [stag] (dalbert) 314778> ./test.sh
> > salloc: Granted job allocation 87
> > execve("/bin/bash", ["/bin/bash"], [/* 59 vars */]) = 0
> >      <<< lines deleted for brevity >>>
> > munmap(0x2b97ad742000, 4096)            = 0
> > stat("/home/dalbert/test/314778", {st_mode=S_IFDIR|0775, st_size=4096,
> > ...}) = 0
> > stat(".", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> > getpid()                                = 22714
> > getppid()                               = 22713
> > getpgrp()                               = 22713
> > dup(2)                                  = 4
> > getrlimit(RLIMIT_NOFILE, {rlim_cur=4*1024, rlim_max=64*1024}) = 0
> > fcntl(255, F_GETFD)                     = -1 EBADF (Bad file descriptor)
> > dup2(4, 255)                            = 255
> > close(4)                                = 0
> > ioctl(255, TIOCGPGRP, [22709])          = 0
> > rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, {0x1, [],
> > SA_RESTORER, 0x3419830280}, 8) = 0
> > kill(0, SIGTTIN)                        = 0
> > --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> > --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> > rt_sigaction(SIGTTIN, {0x1, [], SA_RESTORER, 0x3419830280}, {SIG_DFL, [],
> > SA_RESTORER, 0x3419830280}, 8) = 0
> > ioctl(255, TIOCGPGRP, [22709])          = 0
> > rt_sigaction(SIGTTIN, {SIG_DFL, [], SA_RESTORER, 0x3419830280}, {0x1, [],
> > SA_RESTORER, 0x3419830280}, 8) = 0
> > kill(0, SIGTTIN)                        = 0
> > --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> > --- SIGTTIN (Stopped (tty input)) @ 0 (0) ---
> >    <<< at this point the traces repeat,  until the terminal is
> > disconnected >>>
> >
> >
> > Could this be an unintended consequence of the "job control" changes to
> > salloc that were introduced in SLURM 2.2.2 and further modified in SLURM
> > 2.2.3?
> >
> >    -Don Albert-
> >
> >
> > [email protected] wrote on 03/07/2011 04:34:56 PM:
> >
> > > I'll add that salloc will revoke the allocation as soon as its
> > > executable program
> > > terminates. If your program 1grabslurm starts background processes and exits,
> > > those background processes could persist after the 1grabslurm
> > > program terminates
> > > and the job allocation has been revoked.
> > > ________________________________________
> > > From: [email protected] [[email protected].
> > > gov] On Behalf Of Gerrit Renker [[email protected]]
> > > Sent: Saturday, March 05, 2011 2:10 AM
> > > To: [email protected]
> > > Cc: [email protected]
> > > Subject: Re: [slurm-dev] Running Salloc From Bash Fails Under 2.2.3?
> > >
> > > I just tried something similar under v2.3 pre3 (whose salloc is
> > > almost identical
> > > with that of 2.2.3) and it worked.
> > >
> > > Looking at your output it seems that the allocation (#38) is revoked
> > > immediately
> > > after it was granted, which could happen if the script '1grabslurm'
> > > exits very soon.
> > >
> > > I can not see a problem with the script below, but there may well be
> > > something that
> > > needs to be checked in the '1grabslurm' script -- please send more
> > > information, you
> > > can also send this privately.
> > >
> > > Alternatively, since it seems that salloc is used to generate an
> > > allocation, you
> > > could consider the --no-shell mode.
> > >
> > > On Fri, 4 Mar 2011 18:12:23 -0800 [email protected] wrote:
> > > > I have a few shell scripts that are basically things like:
> > > > #!/bin/bash
> > > > salloc -N1 -n1 -p pubint 1grabslurm 1
> > > >
> > > > At least under our ancient 1.3.x installation, salloc worked from a
> > > > script in exactly the same way as from the command line.
> > > >
> > > > Under 2.2.3, the salloc line works fine when typed directly, but when
> > > > invoked from a script I instead see:
> > > >
> > > > salloc: Granted job allocation 38
> > > > salloc: Relinquishing job allocation 38
> > > > salloc: Job allocation 38 has been revoked.
> > > > /root # srun: forcing job termination
> > > > srun: got SIGCONT
> > > > srun: error: Task launch for 38.0 failed on node hyraxD01: Job
> > > > credential revoked
> > > > srun: error: Application launch failed: Job credential revoked
> > > > srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> > > > srun: error: Timed out waiting for job step to complete
> > > > tcsetattr: Input/output error
> > > >
> > > > Thanks in advance for any advice,
> > > > Jeff Katcher
> > >
