Odd - I'm on travel this week but can look at it next week. One possibility - have you tried hitting us with SIGTSTOP instead of SIGSTOP? There is a difference in our ability to trap and forward the two: SIGSTOP cannot be caught, so mpirun never gets a chance to forward it to the remote ranks, while SIGTSTOP can be trapped and forwarded.
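If you want to poke at that by hand while the job is running, something along these lines should do it (untested sketch; adjust the node names, user, and job id to your setup):

-----------------------------
# send SIGTSTOP to mpirun on the node running the batch script (node10 here)
ssh node10 'kill -TSTP $(pgrep -u $USER -x mpirun)'

# check the state of the xhpl ranks on every node ("T" = stopped, "R" = running)
for i in {10..19}; do ssh node$i 'ps -C xhpl -o state= | sort | uniq -c'; done

# let them run again
ssh node10 'kill -CONT $(pgrep -u $USER -x mpirun)'
-----------------------------

scancel -s TSTP JOBID may also deliver the signal, depending on which processes scancel ends up signaling on your setup.

Two more sketches are at the bottom of this mail: a quick ompi_info check for the forwarding param, and the srun --mpi=pmi2 variant Dennis suggested below.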
Sent from my iPad

> On Jul 11, 2017, at 9:29 AM, Eugene Dedits <[email protected]> wrote:
>
> I’ve just tried 3.0.0rc1 and the problem still persists there…
>
> Thanks,
> E.
>
>> On Jul 11, 2017, at 10:20 AM, [email protected] wrote:
>>
>> Just checked the planning board and saw that my PR to bring that change to
>> 2.1.2 is pending and not yet in the release branch. I’ll try to make that
>> happen soon.
>>
>> Sent from my iPad
>>
>>> On Jul 11, 2017, at 8:03 AM, "[email protected]" <[email protected]> wrote:
>>>
>>> There is an mca param ess_base_forward_signals that controls which signals
>>> to forward. However, I just looked at the source code and see that it
>>> wasn't backported. Sigh.
>>>
>>> You could try the 3.0.0 branch as it is in release candidate and should go
>>> out within a week. I'd suggest just cloning that branch of the OMPI repo to
>>> get the latest state. The fix is definitely there.
>>>
>>> Sent from my iPad
>>>
>>>> On Jul 11, 2017, at 7:45 AM, Eugene Dedits <[email protected]> wrote:
>>>>
>>>> Hi Ralph,
>>>>
>>>> thanks for the reply. I’ve just tried upgrading to OMPI 2.1.1. The same
>>>> problem… :-\
>>>> Could you point me to some discussion of this?
>>>>
>>>> Thanks,
>>>> Eugene.
>>>>
>>>>> On Jul 11, 2017, at 6:17 AM, [email protected] wrote:
>>>>>
>>>>> There is an issue with how the signal is forwarded. This has been fixed
>>>>> in the latest OMPI release, so you might want to upgrade.
>>>>>
>>>>> Ralph
>>>>>
>>>>> Sent from my iPad
>>>>>
>>>>>> On Jul 11, 2017, at 2:53 AM, Dennis Tants
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hello Eugene,
>>>>>>
>>>>>> it is just a wild guess, but could you try "srun --mpi=pmi2" (you said
>>>>>> you built OMPI with PMI support) instead of "mpirun"?
>>>>>> srun is built in and, I think, the preferred way of running parallel
>>>>>> processes. Maybe scontrol is able to suspend it this way.
>>>>>>
>>>>>> Regards,
>>>>>> Dennis
>>>>>>
>>>>>>> On 10.07.2017 at 22:20, Eugene Dedits wrote:
>>>>>>> Hello SLURM-DEV,
>>>>>>>
>>>>>>> I have a problem with SLURM, OpenMPI, and “scontrol suspend”.
>>>>>>>
>>>>>>> My setup is:
>>>>>>> 96-node cluster with IB, running RHEL 6.8
>>>>>>> slurm 17.02.1
>>>>>>> openmpi 2.0.0 (built using the Intel 2016 compiler)
>>>>>>>
>>>>>>> I am running an application (HPL in this particular case) using a batch
>>>>>>> script similar to:
>>>>>>> -----------------------------
>>>>>>> #!/bin/bash
>>>>>>> #SBATCH --partition=standard
>>>>>>> #SBATCH -N 10
>>>>>>> #SBATCH --ntasks-per-node=16
>>>>>>>
>>>>>>> mpirun -np 160 xhpl | tee LOG
>>>>>>> -----------------------------
>>>>>>>
>>>>>>> So I am running it on 160 cores across 10 nodes.
>>>>>>>
>>>>>>> Once the job is submitted to the queue and running, I suspend it using
>>>>>>> ~# scontrol suspend JOBID
>>>>>>>
>>>>>>> I see that my job has indeed stopped producing output. I then go to
>>>>>>> each of the 10 nodes assigned to my job and check whether the xhpl
>>>>>>> processes are still running there with:
>>>>>>>
>>>>>>> ~# for i in {10..19}; do ssh node$i “top -b -n 1 | head -n 50 | grep xhpl
>>>>>>> | wc -l”; done
>>>>>>>
>>>>>>> I expect this little script to return 0 from every node (because
>>>>>>> suspend sent SIGSTOP and the processes shouldn’t show up in top).
>>>>>>> However, I see that processes are reliably suspended only on node10.
>>>>>>> I get:
>>>>>>> 0
>>>>>>> 16
>>>>>>> 16
>>>>>>> …
>>>>>>> 16
>>>>>>>
>>>>>>> So 9 out of 10 nodes still have 16 MPI processes of my xhpl application
>>>>>>> running at 100%.
>>>>>>>
>>>>>>> If I run “scontrol resume JOBID” and then suspend it again, I see that
>>>>>>> (sometimes) more nodes have the “xhpl” processes properly suspended.
>>>>>>> Every time I resume and suspend the job, I see different nodes
>>>>>>> returning 0 in my “ssh-run-top” script.
>>>>>>>
>>>>>>> So altogether it looks like the suspend mechanism doesn’t work properly
>>>>>>> in SLURM with OpenMPI. I’ve tried compiling OpenMPI with “--with-slurm
>>>>>>> --with-pmi=/path/to/my/slurm”. I’ve observed the same behavior.
>>>>>>>
>>>>>>> I would appreciate any help.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Eugene.
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dennis Tants
>>>>>> Trainee: IT Specialist for System Integration
>>>>>>
>>>>>> ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
>>>>>> ZARM - Center of Applied Space Technology and Microgravity
>>>>>>
>>>>>> Universität Bremen
>>>>>> Am Fallturm
>>>>>> 28359 Bremen, Germany
>>>>>>
>>>>>> Telefon: 0421 218 57940
>>>>>> E-Mail: [email protected]
>>>>>>
>>>>>> www.zarm.uni-bremen.de
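For reference, a quick way to check whether a given OMPI install actually has the signal-forwarding support mentioned above is to look for the MCA param itself (sketch; the accepted value syntax for the param depends on the OMPI version, so it is left as a placeholder here):

-----------------------------
# if the param is listed, the build knows how to forward extra signals
ompi_info --all | grep ess_base_forward_signals

# it can then be set for a run, e.g.
#   mpirun --mca ess_base_forward_signals <signals> -np 160 xhpl | tee LOG
-----------------------------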

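And the srun route Dennis suggested would look roughly like this (sketch; it assumes OMPI was configured against the same PMI library that SLURM uses, i.e. --with-pmi pointing at the SLURM install):

-----------------------------
#!/bin/bash
#SBATCH --partition=standard
#SBATCH -N 10
#SBATCH --ntasks-per-node=16

# slurmstepd launches and owns every rank directly, so scontrol suspend
# can signal each xhpl process itself instead of relying on mpirun to forward
srun --mpi=pmi2 xhpl | tee LOG
-----------------------------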