Don,

Thanks for the detailed analysis and patch. It will be in version 2.3.0-rc2, which should be available later today.

Moe

Quoting [email protected]:

Yet another subtle problem has come to light with the changes that were
made to "salloc" some time ago that attempted to add job control features.
 It seems that when "salloc" is executed in a script, and completes, it
leaves the terminal foreground process group set to its own process group
value. If the "salloc" is not in a script, or if the script immediately
terminates after the "salloc" completes, then the original shell seems to
take care of this, and resets the foreground process group to itself.

However, if there are more commands in the script after the "salloc", then
the script continues to execute, but because the terminal foreground
process group is set to a now non-existent process group, the session is
effectively running in the "background" with no foreground group
designated.  If the commands remaining in the script are simple commands
such as "pdsh" or "ps", then they work, but if the script starts a command
that requires interactive input, such as a new shell (e.g., /bin/sh) or a
command interpreter such as python (e.g., /usr/bin/python), then the new
command is "stopped", because it is not in the terminal's foreground
process group.
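
The foreground/background state described here can be checked directly
from a process. The following is only a minimal diagnostic sketch and is
not part of salloc; the helper names in_foreground and
foreground_group_alive are hypothetical. It relies only on the POSIX calls
tcgetpgrp(), getpgrp() and kill().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the caller is in the foreground
 * process group of the terminal open on fd, 0 if it is in a background
 * group, and -1 if fd is invalid or is not the caller's controlling
 * terminal. */
int in_foreground(int fd)
{
    pid_t fg = tcgetpgrp(fd);

    if (fg < 0)
        return -1;
    return fg == getpgrp();
}

/* Hypothetical helper: returns 1 if the terminal's foreground process
 * group still has at least one member, 0 if the recorded group no longer
 * exists (the stale TPGID case), and -1 if fd cannot be queried. */
int foreground_group_alive(int fd)
{
    pid_t fg = tcgetpgrp(fd);

    if (fg < 0)
        return -1;
    /* Signal 0 performs only the existence and permission checks;
     * EPERM means the group exists but belongs to another user. */
    if (kill(-fg, 0) == 0 || errno == EPERM)
        return 1;
    return 0;
}
```

In the failing case, a command started by the script after "salloc" would
be expected to get 0 from both helpers when called with STDIN_FILENO.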

The following simple script illustrates the problem.  The script would
normally issue some command like "mpirun" within the salloc, but for
diagnostic purposes I have just executed "ps j" to display the various
process group values.

    #!/bin/bash
    ps j
    salloc -N1 ps j
    ps j
    /bin/sh
    ps j

Running the script results in the following output:

    [sulu] (dalbert) 315432> ./sc10.sh
     PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    26088 26089 26089 26089 pts/1    27036 Ss     605   0:00 -bash
    26089 27036 27036 26089 pts/1    27036 S+     605   0:00 /bin/bash ./sc10.sh
    27036 27037 27036 26089 pts/1    27036 R+     605   0:00 ps j
    salloc: Granted job allocation 289
     PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    26088 26089 26089 26089 pts/1    27041 Ss     605   0:00 -bash
    26089 27036 27036 26089 pts/1    27041 S      605   0:00 /bin/bash ./sc10.sh
    27036 27038 27038 26089 pts/1    27041 Sl     605   0:00 salloc -N1 ps j
    27038 27041 27041 26089 pts/1    27041 R+     605   0:00 ps j
    salloc: Relinquishing job allocation 289
    salloc: Job allocation 289 has been revoked.
     PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    26088 26089 26089 26089 pts/1    27038 Ss     605   0:00 -bash
    26089 27036 27036 26089 pts/1    27038 S      605   0:00 /bin/bash ./sc10.sh
    27036 27048 27036 26089 pts/1    27038 R      605   0:00 ps j

    [1]+  Stopped                 ./sc10.sh
    [sulu] (dalbert) 315432> ps j
     PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    26088 26089 26089 26089 pts/1    27140 Ss     605   0:00 -bash
    26089 27036 27036 26089 pts/1    27140 T      605   0:00 /bin/bash ./sc10.sh
    27036 27051 27036 26089 pts/1    27140 T      605   0:00 /bin/sh
    26089 27140 27140 26089 pts/1    27140 R+     605   0:00 ps j

The "ps j" immediately after the termination of "salloc" shows that the
"salloc" has left the "TPGID" value set to its own process group (27038).
Since this process group no longer exists, this effectively makes the
shell and its script a background process.  When the script attempts to
start another shell (/bin/sh), which is normally a perfectly legitimate
thing to do, it ends up being stopped, along with the new shell process.

I have made a change to "salloc" to cause it to reset the terminal
foreground process group to the process group of the "parent" of the
"salloc" pid just prior to exiting the process.  This appears to correct
the problem and allow subsequent commands within the script to run.
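
As a standalone illustration of that idea (a sketch under stated
assumptions, not the salloc code itself), the reset could look like the
following. The helper name restore_parent_foreground is hypothetical, and
the sketch assumes the parent process is still alive and shares the
caller's session. SIGTTOU is blocked around the call because tcsetpgrp()
issued from a background process group would otherwise stop the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: hand the terminal on fd back to the process group
 * of the caller's parent. Returns 0 on success, -1 on failure. */
int restore_parent_foreground(int fd)
{
    sigset_t block, saved;
    pid_t parent_pgid;
    int rc;

    /* Nothing to do when fd is not a terminal (e.g. batch use). */
    if (!isatty(fd))
        return -1;

    parent_pgid = getpgid(getppid());
    if (parent_pgid < 0)
        return -1;

    /* With SIGTTOU blocked, a background caller may still change the
     * foreground process group and no signal is generated. */
    sigemptyset(&block);
    sigaddset(&block, SIGTTOU);
    if (sigprocmask(SIG_BLOCK, &block, &saved) < 0)
        return -1;

    rc = tcsetpgrp(fd, parent_pgid);

    /* Restore the caller's original signal mask. */
    sigprocmask(SIG_SETMASK, &saved, NULL);
    return rc;
}
```

A caller would invoke restore_parent_foreground(STDIN_FILENO) once, just
before exiting, and only when it had taken the foreground itself.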

Here is the patch, against 2.3.0-rc1:

[stag] (dalbert) salloc> cvs diff -u -r 1.1.1.49.2.1 -r 1.1.1.49.2.2 salloc.c
Index: salloc.c
===================================================================
RCS file: /cvsroot/slurm/slurm/src/salloc/salloc.c,v
retrieving revision 1.1.1.49.2.1
retrieving revision 1.1.1.49.2.2
diff -u -r1.1.1.49.2.1 -r1.1.1.49.2.2
--- salloc.c    29 Jul 2011 17:23:32 -0000      1.1.1.49.2.1
+++ salloc.c    22 Aug 2011 22:14:46 -0000      1.1.1.49.2.2
@@ -147,6 +147,10 @@
        int sig_block[] = { SIGTTOU, SIGTTIN, 0 };
        xsignal_block (sig_block);
        tcsetattr (STDIN_FILENO, TCSANOW, &saved_tty_attributes);
+       /* If salloc was run as interactive, with job control, reset the foreground process
+          group of the terminal to the process group of the parent pid before exiting  */
+       if (is_interactive)
+         tcsetpgrp(STDIN_FILENO, getpgid(getppid()));
 }

 int main(int argc, char *argv[])


