Odd - I'm on travel this week but can look at it next week. One possibility - 
have you tried hitting us with SIGTSTP instead of SIGSTOP? SIGTSTP can be 
trapped and forwarded, while SIGSTOP can't be caught at all.
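
One quick way to test that from the Slurm side, assuming your scancel supports 
the --signal option, is to deliver SIGTSTP to the job's steps yourself:

~# scancel --signal=TSTP JOBID   # send SIGTSTP instead of scontrol suspend's SIGSTOP

If the ranks stop everywhere with SIGTSTP but not with SIGSTOP, the forwarding 
path is the likely culprit.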

> On Jul 11, 2017, at 9:29 AM, Eugene Dedits <[email protected]> wrote:
> 
> 
> I’ve just tried 3.0.0rc1 and the problem still persists there… 
> 
> Thanks,
> E. 
> 
> 
> 
>> On Jul 11, 2017, at 10:20 AM, [email protected] wrote:
>> 
>> 
>> Just checked the planning board and saw that my PR to bring that change to 
>> 2.1.2 is pending and not yet in the release branch. I’ll try to make that 
>> happen soon.
>> 
>>> On Jul 11, 2017, at 8:03 AM, "[email protected]" <[email protected]> wrote:
>>> 
>>> 
>>> There is an MCA param, ess_base_forward_signals, that controls which signals 
>>> to forward. However, I just looked at the source code and see that it wasn't 
>>> backported. Sigh.
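>>> 
>>> Where the param does exist, setting it would look something like this (the 
>>> comma-separated list of signal names is an assumption on my part; check 
>>> ompi_info --param ess base --level 9 on your build for the exact format):
>>> 
>>> mpirun --mca ess_base_forward_signals "SIGTSTP,SIGUSR1" -np 160 xhpl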
>>> 
>>> You could try the 3.0.0 branch as it is at release candidate and should go 
>>> out within a week. I'd suggest just cloning that branch of the OMPI repo to 
>>> get the latest state. The fix is definitely there.
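>>> 
>>> Roughly, assuming the usual release-branch naming on GitHub:
>>> 
>>> git clone -b v3.0.x https://github.com/open-mpi/ompi.git
>>> cd ompi
>>> ./autogen.pl && ./configure --with-slurm --with-pmi=/path/to/my/slurm
>>> make install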
>>> 
>>>> On Jul 11, 2017, at 7:45 AM, Eugene Dedits <[email protected]> wrote:
>>>> 
>>>> 
>>>> Hi Ralph, 
>>>> 
>>>> 
>>>> thanks for the reply. I’ve just tried upgrading to OMPI 2.1.1. The same 
>>>> problem… :-\
>>>> Could you point me to some discussion of this? 
>>>> 
>>>> Thanks,
>>>> Eugene. 
>>>> 
>>>>> On Jul 11, 2017, at 6:17 AM, [email protected] wrote:
>>>>> 
>>>>> 
>>>>> There is an issue with how the signal is forwarded. This has been fixed 
>>>>> in the latest OMPI release, so you might want to upgrade.
>>>>> 
>>>>> Ralph
>>>>> 
>>>>>> On Jul 11, 2017, at 2:53 AM, Dennis Tants 
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> 
>>>>>> Hello Eugene,
>>>>>> 
>>>>>> it is just a wild guess, but could you try "srun --mpi=pmi2" (you said
>>>>>> you built OMPI with PMI support) instead of "mpirun"?
>>>>>> srun is built-in and, I think, the preferred way of running parallel
>>>>>> processes. Maybe scontrol is able to suspend it this way.
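>>>>>> 
>>>>>> As a rough sketch, reusing the task count from your batch script:
>>>>>> 
>>>>>> srun --mpi=pmi2 -n 160 xhpl | tee LOG
>>>>>> 
>>>>>> With srun, slurmstepd launches and signals the ranks directly on every 
>>>>>> node, so there is no mpirun in the middle that has to forward the stop 
>>>>>> signal.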
>>>>>> 
>>>>>> Regards,
>>>>>> Dennis
>>>>>> 
>>>>>>> On 10.07.2017 at 22:20, Eugene Dedits wrote:
>>>>>>> Hello SLURM-DEV
>>>>>>> 
>>>>>>> 
>>>>>>> I have a problem with SLURM, OpenMPI, and “scontrol suspend”. 
>>>>>>> 
>>>>>>> My setup is:
>>>>>>> 96-node cluster with IB, running RHEL 6.8
>>>>>>> SLURM 17.02.1
>>>>>>> OpenMPI 2.0.0 (built using the Intel 2016 compiler)
>>>>>>> 
>>>>>>> 
>>>>>>> I am running an application (HPL in this particular case) using a batch 
>>>>>>> script similar to:
>>>>>>> -----------------------------
>>>>>>> #!/bin/bash
>>>>>>> #SBATCH --partition=standard
>>>>>>> #SBATCH -N 10
>>>>>>> #SBATCH --ntasks-per-node=16
>>>>>>> 
>>>>>>> mpirun -np 160 xhpl | tee LOG
>>>>>>> -----------------------------
>>>>>>> 
>>>>>>> So I am running it on 160 cores across 10 nodes. 
>>>>>>> 
>>>>>>> Once the job is submitted to the queue and running, I suspend it using:
>>>>>>> ~# scontrol suspend JOBID
>>>>>>> 
>>>>>>> I see that indeed my job stopped producing output. I go to each of the 10 
>>>>>>> nodes that were assigned to my job and check whether the xhpl processes 
>>>>>>> are still running there with:
>>>>>>> 
>>>>>>> ~# for i in {10..19}; do ssh node$i "top -b -n 1 | head -n 50 | grep xhpl | wc -l"; done
>>>>>>> 
>>>>>>> I expect this little script to return 0 from every node (because suspend 
>>>>>>> sent SIGSTOP, so the processes should no longer show up among top's 
>>>>>>> busiest entries). However, I see that processes are reliably suspended 
>>>>>>> only on node10. I get:
>>>>>>> 0
>>>>>>> 16
>>>>>>> 16
>>>>>>> …
>>>>>>> 16
>>>>>>> 
>>>>>>> So 9 out of 10 nodes still have 16 MPI processes of my xhpl application 
>>>>>>> running at 100%. 
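>>>>>>> 
>>>>>>> A more direct check, assuming ps -C is available on the nodes, is to look 
>>>>>>> at the process state; properly suspended processes show "T" (stopped):
>>>>>>> 
>>>>>>> ~# for i in {10..19}; do ssh node$i "ps -C xhpl -o stat="; done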
>>>>>>> 
>>>>>>> If I run “scontrol resume JOBID” and then suspend it again, I see that 
>>>>>>> (sometimes) more nodes have “xhpl” processes properly suspended. Every 
>>>>>>> time I resume and suspend the job, I see different nodes returning 0 in 
>>>>>>> my “ssh-run-top” script. 
>>>>>>> 
>>>>>>> So altogether it looks like the suspend mechanism doesn’t work properly 
>>>>>>> in SLURM with OpenMPI. I’ve tried compiling OpenMPI with "--with-slurm 
>>>>>>> --with-pmi=/path/to/my/slurm" and observed the same behavior. 
>>>>>>> 
>>>>>>> I would appreciate any help.   
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Eugene. 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> Dennis Tants
>>>>>> Trainee: IT Specialist for System Integration
>>>>>> 
>>>>>> ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
>>>>>> ZARM - Center of Applied Space Technology and Microgravity
>>>>>> 
>>>>>> Universität Bremen
>>>>>> Am Fallturm
>>>>>> 28359 Bremen, Germany
>>>>>> 
>>>>>> Phone: 0421 218 57940
>>>>>> E-Mail: [email protected]
>>>>>> 
>>>>>> www.zarm.uni-bremen.de
