Gerrit,
This report came to me from a customer site. From what I gather, they
run many test jobs from scripts that launch "salloc" in the background,
in the form:
    salloc -n 16 -N 1 mpirun -n 16 <some-mpi-job> &
using SLURM to create an allocation and mpirun to run the job. As such,
I don't think salloc needs control of the terminal, as it would if
they were launching a shell under an allocation and running jobs
interactively.
Perhaps this could have been done using "sbatch" instead of "salloc", but
the fact remains that this change in the latest update broke their testing
procedures for regression tests on the new release.
-Don Albert-
Gerrit Renker <[email protected]> wrote on 02/03/2011 11:15:30 PM:
> Hi Don,
>
> I submitted the patch and can give an account why it is necessary.
> We had lots of problems with salloc due to the absence of job control
> (meaning those jobs that were spawned by salloc as child processes).
>
> This is not the only change, it needs to be seen in the context of
> the others. The loop is used in order to gain control over the
> terminal. As long as salloc runs in the background, it is not in
> control of the terminal.
>
> This piece of code is comparable to running
> prompt> bash &
> [1] 3291
> prompt> jobs -l
> [1]+ 3291 Stopped (tty input) bash
>
> bash is doing the same thing - as long as it is not the foreground
> process in control of the terminal, it receives SIGTTIN to stop itself.
>
> Further below in salloc, once it is in the foreground, it makes
> itself the controlling process (tpgid), and then hands this over to
> the child.
>
> Why would you want to start salloc in the background if, once you
> use it, it needs to run in the foreground?
>
> Gerrit
>
> On Thu, 3 Feb 2011 11:27:50 -0700 you wrote:
> > There appears to have been a change in "salloc.c" sometime between
> > SLURM 2.2.0-RC1 and the final release of SLURM 2.2.0, involving
> > signal handling and whether "salloc" is running in the foreground or
> > background. In particular, the lines:
> >
> >     is_interactive = isatty(STDIN_FILENO);
> >     if (is_interactive) {
> >         bool sent_msg = false;
> >         /* Wait as long as we are running in the background */
> >         while (tcgetpgrp(STDIN_FILENO) != (pid = getpgrp())) {
> >             if (!sent_msg) {
> >                 error("Waiting for program to be placed in "
> >                       "the foreground");
> >                 sent_msg = true;
> >             }
> >             killpg(pid, SIGTTIN);
> >         }
> >
> >         /*
> >          * Save tty attributes and reset at exit, in case a child
> >          * process died before properly resetting terminal.
> >          */
> >         tcgetattr (STDIN_FILENO, &saved_tty_attributes);
> >         atexit (_reset_input_mode);
> >     }
> >
> > were added right at the beginning of the "main" function in
> > "salloc.c". There are no comments to indicate the rationale for this
> > change. I don't recall any discussion of such a change in the
> > "slurm-dev" list, but I could have missed it.
> >
> > This change seems to prevent salloc from being launched as a
> > background process. An salloc with a simple command like "sleep" gets:
> >
> > [stag] (dalbert) dja-slurm> salloc -n 2 sleep 10 &
> > [2] 30235
> > [stag] (dalbert) dja-slurm> salloc: error: Waiting for program to be placed in the foreground
> >
> > and the job sits until you bring it to the foreground. Can someone
> > comment on the reason for this change?
> >
> > -Don Albert-