Hi Ralph,

thanks for the reply. I've just tried upgrading to OMPI 2.1.1. The same problem… :-\ Could you point me to some discussion of this?

Thanks,
Eugene.

> On Jul 11, 2017, at 6:17 AM, [email protected] wrote:
>
> There is an issue with how the signal is forwarded. This has been fixed in
> the latest OMPI release, so you might want to upgrade.
>
> Ralph
>
> Sent from my iPad
>
>> On Jul 11, 2017, at 2:53 AM, Dennis Tants <[email protected]> wrote:
>>
>> Hello Eugene,
>>
>> It is just a wild guess, but could you try "srun --mpi=pmi2" (you said
>> you built OMPI with PMI support) instead of "mpirun"?
>> srun is built in and, I think, the preferred way of launching parallel
>> processes. Maybe scontrol is able to suspend the job this way.
>>
>> Regards,
>> Dennis
>>
>>> On 10.07.2017 at 22:20, Eugene Dedits wrote:
>>> Hello SLURM-DEV,
>>>
>>> I have a problem with SLURM, OpenMPI, and "scontrol suspend".
>>>
>>> My setup is:
>>> 96-node cluster with IB, running RHEL 6.8
>>> SLURM 17.02.1
>>> OpenMPI 2.0.0 (built using the Intel 2016 compiler)
>>>
>>> I am running an application (HPL in this particular case) using a batch
>>> script similar to:
>>> -----------------------------
>>> #!/bin/bash
>>> #SBATCH --partition=standard
>>> #SBATCH -N 10
>>> #SBATCH --ntasks-per-node=16
>>>
>>> mpirun -np 160 xhpl | tee LOG
>>> -----------------------------
>>>
>>> So I am running it on 160 cores across 10 nodes.
>>>
>>> Once the job is submitted to the queue and running, I suspend it with:
>>>
>>> ~# scontrol suspend JOBID
>>>
>>> I see that my job has indeed stopped producing output. I then go to each
>>> of the 10 nodes assigned to my job and check whether the xhpl processes
>>> are still running there:
>>>
>>> ~# for i in {10..19}; do ssh node$i "top -b -n 1 | head -n 50 | grep xhpl | wc -l"; done
>>>
>>> I expect this little script to return 0 from every node (because suspend
>>> sent SIGSTOP, so the processes shouldn't show up in top).
>>> However, I see that the processes are reliably suspended only on node10. I get:
>>>
>>> 0
>>> 16
>>> 16
>>> …
>>> 16
>>>
>>> So 9 out of 10 nodes still have 16 MPI processes of my xhpl application
>>> running at 100%.
>>>
>>> If I run "scontrol resume JOBID" and then suspend it again, I see that
>>> (sometimes) more nodes have the "xhpl" processes properly suspended.
>>> Every time I resume and suspend the job, I see different nodes returning
>>> 0 in my "ssh-run-top" script.
>>>
>>> So altogether it looks like the suspend mechanism doesn't work properly
>>> in SLURM with OpenMPI. I have tried compiling OpenMPI with "--with-slurm
>>> --with-pmi=/path/to/my/slurm" and observed the same behavior.
>>>
>>> I would appreciate any help.
>>>
>>> Thanks,
>>> Eugene.
>>
>> --
>> Dennis Tants
>> Auszubildender: Fachinformatiker für Systemintegration
>>
>> ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
>> ZARM - Center of Applied Space Technology and Microgravity
>>
>> Universität Bremen
>> Am Fallturm
>> 28359 Bremen, Germany
>>
>> Telefon: 0421 218 57940
>> E-Mail: [email protected]
>>
>> www.zarm.uni-bremen.de
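[Editor's note] As an aside, a more robust way to check the suspend state than grepping `top` output is to read the process state field from `ps`: a SIGSTOP-ed process is reported with state "T" (it still appears in `top`, just not consuming CPU, so it can fall out of a `head -n 50` window). A minimal sketch, where the `count_running` helper is hypothetical and the `xhpl` name and `node10..node19` hostnames are taken from the post above:

```shell
#!/bin/bash
# Count processes with a given command name that are NOT in the
# stopped ("T...") state. Reads "STAT COMMAND" lines on stdin.
count_running() {
    awk -v name="$1" '$2 == name && $1 !~ /^T/ { n++ } END { print n+0 }'
}

# On each node, "ps -C xhpl -o stat=,comm=" emits one "STAT COMMAND"
# line per xhpl process, so a fully suspended node prints 0:
#
#   for i in {10..19}; do
#       ssh node$i "ps -C xhpl -o stat=,comm=" | count_running xhpl
#   done
```

Unlike the `top | head | grep | wc` pipeline, this counts only processes that are genuinely still runnable, regardless of how many other processes happen to rank above them in `top`.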

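[Editor's note] For reference, Dennis's suggestion of launching through `srun` instead of `mpirun` would look roughly like this in the batch script quoted above (a sketch, assuming the OpenMPI build has PMI2 support; partition name and task counts are from the original script). The relevant difference is that with `srun` each rank is started by slurmstepd on its own node, so `scontrol suspend` can deliver SIGSTOP/SIGCONT to every rank directly instead of relying on `mpirun` to forward the signal to remote nodes:

```shell
#!/bin/bash
#SBATCH --partition=standard
#SBATCH -N 10
#SBATCH --ntasks-per-node=16

# srun launches all 160 tasks itself; each rank is then a direct child
# of slurmstepd on its node and is signaled by SLURM on suspend/resume.
srun --mpi=pmi2 -n 160 xhpl | tee LOG
```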