If you linked with SLURM's PMI and used srun to launch the tasks then
SLURM would directly signal each of the spawned tasks. Your current
environment relies upon mpiexec to signal the tasks that it spawns,
which are outside of SLURM's control.
Quoting Chen Shen <[email protected]>:
Hi Jette,
Thank you very much.
> What MPI implementation/version are you using?
> Many MPI implementations launch their tasks through SLURM, so this problem
> should not exist in that case. More information about how various MPI
> implementations work with SLURM is available here:
> https://computing.llnl.gov/linux/slurm/mpi_guide.html
I was using MPICH2 1.2.1 in this case. We also use MVAPICH2, which is
similar.
I am trying to avoid linking to Slurm's PMI. Otherwise we will have to
distribute separate executable binaries for slurm and non-slurm versions.
On the other hand, I wonder how linking to Slurm's PMI would help with
suspend/resume.
Does Slurm send the TSTP signal to the individual computation processes
launched by MPICH2?
I guess this is the only way to make sure the actual computation processes
are suspended?
> What SLURM plugin are you using for process tracking?
> Run "scontrol show config | grep Proctrack" to see.
$ scontrol show config | grep Proctrack
ProctrackType = proctrack/pgid
Regards,
Shen Chen
On Wed, Jul 6, 2011 at 3:32 AM, <[email protected]> wrote:
What MPI implementation/version are you using?
Many MPI implementations launch their tasks through SLURM, so this problem
should not exist in that case. More information about how various MPI
implementations work with SLURM is available here:
https://computing.llnl.gov/linux/slurm/mpi_guide.html
What SLURM plugin are you using for process tracking?
Run "scontrol show config | grep Proctrack" to see.
If your MPI implementation launches tasks outside of SLURM control, you
may just need to increase the sleep time. I don't believe there will be a
general solution available for all configurations.
Quoting hash <[email protected]>:
Hi all,
In src/slurmd/slurmstepd/req.c, we learned that SLURM suspends a job by
sending SIGTSTP, calling sleep(1), and then sending SIGSTOP.
This is a very important feature to us, as we have two partitions for
high/low priority jobs, and low priority jobs get suspended when
resources aren't enough.
However, the 1-sec sleep doesn't seem to be sufficient in some cases.
Our jobs are launched with MPICH2's mpiexec, e.g.
$ srun -c 8 mpiexec -n 8 /path/to/prog
The process IDs are:
mpiexec: 100
prog: 101-108
We issue the following command in terminal:
$ kill -SIGTSTP 100 && sleep 1 && kill -SIGSTOP 100
In half of the cases, the mpiexec process (100) is stopped, but the
underlying prog processes (101-108) are still running. Apparently, MPICH2
has not had enough time to handle the TSTP signal before STOP arrives,
and STOP cannot be caught or handled.
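One way to confirm which processes actually stopped is to read each
process's state after the signals have been sent. The sketch below is
Linux-only and assumes /proc is mounted; show_state is a hypothetical
helper name, not part of SLURM or MPICH2. A state of T means stopped,
while R or S means the process is still running or sleeping.

```shell
# show_state PID...: print "PID STATE" for each given process, taken from
# the "State:" line of /proc/PID/status.
# Assumptions: Linux with /proc mounted; the helper name is hypothetical.
show_state() {
    for pid in "$@"; do
        state=gone                      # reported if the process does not exist
        if [ -r "/proc/$pid/status" ]; then
            while read -r key value rest; do
                if [ "$key" = "State:" ]; then
                    state=$value        # single letter, e.g. R, S, T, Z
                    break
                fi
            done < "/proc/$pid/status"
        fi
        echo "$pid $state"
    done
}
```

With the process IDs from the example above, this could be called as
"show_state 100 101 102" right after the kill sequence.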
As a result, squeue reports that the low priority job is suspended,
and the high priority job starts running, which overloads the workstation
with more processes than processors.
If we change the sleep time to 2 seconds, both mpiexec and prog
processes are correctly stopped, at least for my 10 consecutive
tests.
We could certainly change the 1-second delay to a larger value in
req.c, but I'm not sure whether that would work for larger jobs (more
memory, involving more nodes). I wonder if there can be a better
solution to the problem. Thank you!
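As an illustration of one possible alternative to a fixed delay, the
sketch below sends SIGTSTP, polls until the launcher's child processes
report a stopped state (or a timeout expires), and only then sends
SIGSTOP. This is not how req.c works; it is a shell sketch under several
assumptions: Linux with /proc mounted, computation processes that are
direct children of the launcher on the same node, and hypothetical
function names (proc_field, children_stopped, suspend_launcher).

```shell
# Sketch only: replace the fixed "sleep 1" with a bounded poll.
# Linux only (/proc); all function names here are hypothetical.

# proc_field PID FIELD: print the value of "FIELD:" from /proc/PID/status.
proc_field() {
    [ -r "/proc/$1/status" ] || return 1
    while read -r key value rest; do
        if [ "$key" = "$2:" ]; then
            echo "$value"
            return 0
        fi
    done < "/proc/$1/status"
    return 1
}

# children_stopped PID: succeed only if every direct child of PID is in
# state T (stopped). Succeeds trivially if PID has no children.
children_stopped() {
    for dir in /proc/[0-9]*; do
        child=${dir#/proc/}
        if [ "$(proc_field "$child" PPid)" = "$1" ]; then
            [ "$(proc_field "$child" State)" = "T" ] || return 1
        fi
    done
    return 0
}

# suspend_launcher PID [TIMEOUT]: send SIGTSTP, poll once per second for up
# to TIMEOUT seconds (default 10), then send SIGSTOP to the launcher.
suspend_launcher() {
    launcher=$1
    timeout=${2:-10}
    kill -TSTP "$launcher" || return 1
    waited=0
    until children_stopped "$launcher" || [ "$waited" -ge "$timeout" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    kill -STOP "$launcher" || return 1
    children_stopped "$launcher"        # non-zero if a child is still running
}
```

A non-zero return from suspend_launcher means at least one child was
still running when SIGSTOP was sent, which is the failure case described
above. The timeout still bounds the wait, so this does not remove the
tuning problem for large jobs, and it does not cover tasks on other nodes.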
Regards,
Shen Chen
Cogenda Pte Ltd
http://www.cogenda.com