There seems to be another anomaly with the changes to salloc in SLURM 
versions 2.2.0 and 2.2.1. One of our testers reported a bug in which a 
script containing multiple salloc commands runs only the first one. The 
second one gets the "salloc: error: Waiting for program to be placed in 
the foreground" message and is left stopped in the background, even 
though the salloc command lines do not contain an "&". Bringing the 
script to the foreground with "fg" allows the salloc to continue.

Here is the basic script:

    [stag] (dalbert) test> cat /tmp/salloc
    #!/bin/bash
    salloc -N1 hostname
    echo after first salloc
    sleep 2
    salloc -N1 hostname
    echo after second salloc

Running this script results in:

    [sulu] (slurm) slurm> /tmp/salloc
    salloc: Granted job allocation 29073
    sulu
    salloc: Relinquishing job allocation 29073
    salloc: Job allocation 29073 has been revoked.
    after first salloc
    salloc: error: Waiting for program to be placed in the foreground

    [1]+  Stopped                 /tmp/salloc

Entering "fg" at the terminal allows it to proceed:

    [sulu] (slurm) slurm> fg
    /tmp/salloc
    salloc: Granted job allocation 29074
    sulu
    salloc: Relinquishing job allocation 29074
    salloc: Job allocation 29074 has been revoked.
    after second salloc

To try to get more information, I changed the command submitted under the 
salloc to "ps j", and got the following results.

Here is the script:

    [sulu] (slurm) slurm> cat >/tmp/salloc2
    #!/bin/bash
    salloc -N1 ps j
    echo after first alloc
    sleep 10
    salloc -N1 ps j
    echo after second alloc

Here is the run:

    [sulu] (slurm) slurm> /tmp/salloc2
    salloc: Granted job allocation 29068
     PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    27155 27339 27339 26711 pts/11   31573 S      200   0:00 -bash
    27339 31565 31565 26711 pts/11   31573 S      200   0:00 /bin/bash /tmp/salloc2
    31565 31567 31567 26711 pts/11   31573 Sl     200   0:00 salloc -N1 ps j
    31567 31573 31573 26711 pts/11   31573 R+     200   0:00 ps j
    salloc: Relinquishing job allocation 29068
    salloc: Job allocation 29068 has been revoked.
    after first alloc
    salloc: error: Waiting for program to be placed in the foreground

    [1]+  Stopped                 /tmp/salloc2

When the script went to the background, I entered another "ps j" at the 
terminal:

    [sulu] (slurm) slurm> ps j
     PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    27155 27339 27339 26711 pts/11   32504 S      200   0:00 -bash
    27339 31565 31565 26711 pts/11   32504 T      200   0:00 /bin/bash /tmp/salloc2
    31565 32081 31565 26711 pts/11   32504 T      200   0:00 salloc -N1 ps j
    27339 32504 32504 26711 pts/11   32504 R+     200   0:00 ps j

Bringing the script back to the foreground:

    [sulu] (slurm) slurm> fg
    /tmp/salloc2
    salloc: Granted job allocation 29069
     PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
    32081 18159 18159 26711 pts/11   18159 R+     200   0:00 ps j
    27155 27339 27339 26711 pts/11   18159 S      200   0:00 -bash
    27339 31565 31565 26711 pts/11   18159 S      200   0:00 /bin/bash /tmp/salloc2
    31565 32081 32081 26711 pts/11   18159 Sl     200   0:00 salloc -N1 ps j
    salloc: Relinquishing job allocation 29069
    salloc: Job allocation 29069 has been revoked.
    after second alloc

I haven't figured out just what is going on yet, but note that for the 
second instance of "salloc" in the script, the PID and PGID don't match 
while it is stopped in the background (PID 32081, PGID 31565, which is 
the PGID of the script itself), whereas they do match for the first 
instance (PID and PGID both 31567). Once the job is brought to the 
foreground, the PGID changes to 32081 and matches the PID.
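
As an aid to reading the "ps j" listings above, here is a minimal sketch 
that classifies a row from its PID, PGID and TPGID columns. It assumes 
the usual job-control reading: a process leads its process group when 
PID equals PGID, and that group is the terminal's foreground group when 
PGID equals TPGID. The helper name describe_pgrp is made up for this 
illustration and is not part of SLURM or ps; it only restates the 
numbers already shown and does not explain why salloc behaves this way.

```shell
#!/bin/bash
# describe_pgrp PID PGID TPGID
# Hypothetical helper: prints "<leader|member> <foreground|background>"
# for one row of "ps j" output.
describe_pgrp() {
    local pid=$1 pgid=$2 tpgid=$3
    local role=member place=background
    # A process whose PID equals its PGID is the process group leader.
    [ "$pid" = "$pgid" ] && role=leader
    # A group whose PGID equals the terminal's TPGID is in the foreground.
    [ "$pgid" = "$tpgid" ] && place=foreground
    echo "$role $place"
}

# Rows copied from the listings above.
describe_pgrp 31567 31567 31573   # first salloc while its "ps j" runs
describe_pgrp 32081 31565 32504   # second salloc while the script is stopped
describe_pgrp 32081 32081 18159   # second salloc after "fg"
describe_pgrp 18159 18159 18159   # the "ps j" child after "fg"
```

Only the stopped second salloc comes out as "member": it is still in 
the script's process group (31565) rather than one of its own. In every 
case where the "ps j" child is running, that child's group, not 
salloc's, is the terminal's foreground group.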

  -Don Albert-
