There seems to be another anomaly with the changes to salloc in SLURM
versions 2.2.0 and 2.2.1. One of our testers submitted a bug in which a
script containing multiple salloc commands executes only the first one.
The second one gets the "salloc: error: Waiting for program to be placed
in the foreground" message and the script is stopped in the background,
even though the salloc command lines do not contain an "&". Bringing it
to the foreground with "fg" allows the salloc to continue.
Here is the basic script:
[stag] (dalbert) test> cat /tmp/salloc
#!/bin/bash
salloc -N1 hostname
echo after first salloc
sleep 2
salloc -N1 hostname
echo after second salloc
Running this script results in:
[sulu] (slurm) slurm> /tmp/salloc
salloc: Granted job allocation 29073
sulu
salloc: Relinquishing job allocation 29073
salloc: Job allocation 29073 has been revoked.
after first salloc
salloc: error: Waiting for program to be placed in the foreground
[1]+ Stopped /tmp/salloc
Entering "fg" at the terminal allows it to proceed:
[sulu] (slurm) slurm> fg
/tmp/salloc
salloc: Granted job allocation 29074
sulu
salloc: Relinquishing job allocation 29074
salloc: Job allocation 29074 has been revoked.
after second salloc
To try to get more information, I changed the command submitted under the
salloc to "ps j", and got the following results.
Here is the script:
[sulu] (slurm) slurm> cat >/tmp/salloc2
#!/bin/bash
salloc -N1 ps j
echo after first alloc
sleep 10
salloc -N1 ps j
echo after second alloc
Here is the run:
[sulu] (slurm) slurm> /tmp/salloc2
salloc: Granted job allocation 29068
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
27155 27339 27339 26711 pts/11   31573 S      200   0:00 -bash
27339 31565 31565 26711 pts/11   31573 S      200   0:00 /bin/bash /tmp/salloc2
31565 31567 31567 26711 pts/11   31573 Sl     200   0:00 salloc -N1 ps j
31567 31573 31573 26711 pts/11   31573 R+     200   0:00 ps j
salloc: Relinquishing job allocation 29068
salloc: Job allocation 29068 has been revoked.
after first alloc
salloc: error: Waiting for program to be placed in the foreground
[1]+ Stopped /tmp/salloc2
When the script went to the background, I entered another "ps j" at the
terminal:
[sulu] (slurm) slurm> ps j
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
27155 27339 27339 26711 pts/11   32504 S      200   0:00 -bash
27339 31565 31565 26711 pts/11   32504 T      200   0:00 /bin/bash /tmp/salloc2
31565 32081 31565 26711 pts/11   32504 T      200   0:00 salloc -N1 ps j
27339 32504 32504 26711 pts/11   32504 R+     200   0:00 ps j
Now bringing the script back to foreground:
[sulu] (slurm) slurm> fg
/tmp/salloc2
salloc: Granted job allocation 29069
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
32081 18159 18159 26711 pts/11   18159 R+     200   0:00 ps j
27155 27339 27339 26711 pts/11   18159 S      200   0:00 -bash
27339 31565 31565 26711 pts/11   18159 S      200   0:00 /bin/bash /tmp/salloc2
31565 32081 32081 26711 pts/11   18159 Sl     200   0:00 salloc -N1 ps j
salloc: Relinquishing job allocation 29069
salloc: Job allocation 29069 has been revoked.
after second alloc
I haven't figured out just what is going on yet, but note that for the
second instance of "salloc" in the script, the PID and PGID don't match
while it is stopped in the background (PID 32081, PGID 31565, which is
the script's own process group), whereas they do match for the first
instance (PID and PGID both 31567). Once the job is brought to the
foreground, the PGID changes to 32081, matching the PID.
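
For anyone who wants to check the same thing without reading through a
full "ps j" listing, here is a minimal sketch of a shell helper that
compares a process's PGID with the TPGID of its terminal. The function
name fg_state is made up for illustration and is not part of SLURM. It
assumes a ps that accepts the "pgid" and "tpgid" output keywords, as the
Linux procps version does.

```bash
#!/bin/bash
# Hypothetical helper, not part of SLURM: report whether a process's
# process group is the foreground process group of its terminal.
# Assumes ps supports the "pgid" and "tpgid" output keywords.
fg_state() {
    _pid=${1:-$$}
    # An empty header ("pgid=") on every column suppresses the header line,
    # so the output is just the two numbers.
    set -- $(ps -o pgid= -o tpgid= -p "$_pid" 2>/dev/null)
    # No output means the process does not exist (or ps failed).
    [ $# -eq 2 ] || return 1
    case "$2" in
        # Empty, 0, or non-numeric (e.g. -1): no controlling terminal.
        ''|0|*[!0-9]*) echo "no-tty" ;;
        # PGID equals TPGID: the process group owns the terminal.
        "$1")          echo "foreground" ;;
        *)             echo "background" ;;
    esac
}
```

Calling "fg_state $$" in the script just before each salloc line, or
"fg_state <pid>" from the terminal for the stopped salloc, would show
which state the process group is in at that moment. Keep in mind that in
every listing above the TPGID is the process group of the "ps j" that is
currently running, so the answer depends on when the check is made.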
-Don Albert-