Hello SLURM-DEV
I have a problem with slurm, openmpi, and "scontrol suspend".
My setup is:
96-node cluster with IB, running rhel 6.8
slurm 17.02.1
openmpi 2.0.0 (built using Intel 2016 compiler)
I am running an application (HPL in this particular case) using a batch script
similar to:
-----------------------------
#!/bin/bash
#SBATCH --partition=standard
#SBATCH -N 10
#SBATCH --ntasks-per-node=16
mpirun -np 160 xhpl | tee LOG
-----------------------------
So I am running it on 160 cores across 10 nodes.
Once job is submitted to the queue and is running I suspend it using
~# scontrol suspend JOBID
I see that indeed my job has stopped producing output. I then go to each of
the 10 nodes assigned to my job and check whether the xhpl processes are
still running there with:
~# for i in {10..19}; do ssh node$i "top -b -n 1 | head -n 50 | grep xhpl | wc -l"; done
I expect this little script to return 0 from every node (because suspend sends
SIGSTOP, so the processes shouldn't show up among the top CPU consumers).
However, I see that the processes are reliably suspended only on node10. I get:
0
16
16
…
16
So 9 out of 10 nodes still have 16 MPI processes of my xhpl application
running at 100% CPU.
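A side note for anyone reproducing this: a stopped process still appears in top's full process list, just in state "T" with 0% CPU, so counting lines in the first 50 rows is only an indirect check. A more direct check is to ask ps for the process state. A minimal local sketch of the idea, with a dummy "sleep" standing in for xhpl so it runs anywhere:

```shell
# Verify SIGSTOP delivery directly via the ps state column ("T" = stopped).
# A dummy "sleep" process stands in for xhpl here.
sleep 300 &
pid=$!
kill -STOP "$pid"        # the signal "scontrol suspend" is supposed to deliver
ps -o state= -p "$pid"   # should print "T" while the process is stopped
kill -CONT "$pid"
kill "$pid"
```

On the cluster, the same idea per node would be something like
for i in {10..19}; do ssh node$i 'ps -C xhpl -o state= | sort | uniq -c'; done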
If I run "scontrol resume JOBID" and then suspend the job again, I see that
(sometimes) more nodes have their xhpl processes properly suspended. Every
time I resume and suspend the job, different nodes return 0 in my
"ssh-run-top" script.
So altogether it looks like the suspend mechanism doesn't work properly in
SLURM with OpenMPI. I've also tried compiling OpenMPI with "--with-slurm
--with-pmi=/path/to/my/slurm" and observed the same behavior.
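One check that should narrow this down (I haven't run it on the cluster yet, so treat it as a sketch): send SIGSTOP by hand on one of the nodes where xhpl kept running, bypassing slurmd entirely. If that stops the processes, signal delivery itself is fine and the problem is in how SLURM propagates the suspend:

```shell
# Run directly on an affected node: stop the xhpl processes by hand,
# confirm their state, then resume them.
pkill -STOP -x xhpl       # SIGSTOP to all processes named exactly "xhpl"
ps -C xhpl -o pid,state   # state "T" means stopped
pkill -CONT -x xhpl       # let them run again
```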
I would appreciate any help.
Thanks,
Eugene.