Just checked the planning board and saw that my PR to bring that change to 
2.1.2 is pending and not yet in the release branch. I’ll try to make that 
happen soon.

Sent from my iPad

> On Jul 11, 2017, at 8:03 AM, "[email protected]" <[email protected]> wrote:
> 
> 
> There is an mca param ess_base_forward_signals that controls which signals to 
> forward. However, I just looked at source code and see that it wasn't 
> backported. Sigh.
> 
> You could try the 3.0.0 branch as it is in release candidate and should go 
> out within a week. I'd suggest just cloning that branch of the OMPI repo to 
> get the latest state. The fix is definitely there.
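For reference, here is a sketch of how that parameter might be used on a build that has it. The parameter name comes from the message above; the comma-separated signal-list value syntax is an assumption and may differ between OMPI versions, so check `ompi_info` on your install first.

```shell
# Verify the parameter exists in this OMPI install before relying on it:
ompi_info --param ess base --level 9 | grep forward_signals

# Hypothetical invocation: ask the runtime to forward stop/continue
# signals to the MPI ranks (value syntax may vary by OMPI version):
mpirun --mca ess_base_forward_signals SIGTSTP,SIGCONT -np 160 xhpl
```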
> 
> Sent from my iPad
> 
>> On Jul 11, 2017, at 7:45 AM, Eugene Dedits <[email protected]> wrote:
>> 
>> 
>> Hi Ralph, 
>> 
>> 
>> thanks for the reply. I’ve just tried upgrading to OMPI 2.1.1. The same 
>> problem… :-\
>> Could you point me to some discussion of this? 
>> 
>> Thanks,
>> Eugene. 
>> 
>>> On Jul 11, 2017, at 6:17 AM, [email protected] wrote:
>>> 
>>> 
>>> There is an issue with how the signal is forwarded. This has been fixed in 
>>> the latest OMPI release, so you might want to upgrade.
>>> 
>>> Ralph
>>> 
>>> Sent from my iPad
>>> 
>>>> On Jul 11, 2017, at 2:53 AM, Dennis Tants 
>>>> <[email protected]> wrote:
>>>> 
>>>> 
>>>> Hello Eugene,
>>>> 
>>>> it is just a wild guess, but could you try "srun --mpi=pmi2" (you said
>>>> you built OMPI with PMI support) instead of "mpirun"?
>>>> srun is built in and, I think, the preferred way of running parallel
>>>> processes. Maybe scontrol is able to suspend it this way.
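To illustrate the suggestion above, the batch script from the original report could be rewritten to launch through SLURM directly. This is a sketch assuming OMPI was built with PMI2 support; srun derives the task count from the #SBATCH directives, so no explicit -np is needed.

```shell
#!/bin/bash
#SBATCH --partition=standard
#SBATCH -N 10
#SBATCH --ntasks-per-node=16

# srun starts one task per allocated slot (10 nodes x 16 = 160 ranks),
# so slurmstepd manages each rank and can deliver SIGSTOP/SIGCONT to
# all of them directly when the job is suspended or resumed.
srun --mpi=pmi2 xhpl | tee LOG
```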
>>>> 
>>>> Regards,
>>>> Dennis
>>>> 
>>>>> Am 10.07.2017 um 22:20 schrieb Eugene Dedits:
>>>>> Hello SLURM-DEV
>>>>> 
>>>>> 
>>>>> I have a problem with slurm, openmpi, and “scontrol suspend”. 
>>>>> 
>>>>> My setup is:
>>>>> 96-node cluster with IB, running rhel 6.8
>>>>> slurm 17.02.1
>>>>> openmpi 2.0.0 (built using Intel 2016 compiler)
>>>>> 
>>>>> 
>>>>> I am running some application (HPL in this particular case) using a batch 
>>>>> script similar to:
>>>>> -----------------------------
>>>>> #!/bin/bash
>>>>> #SBATCH --partition=standard
>>>>> #SBATCH -N 10
>>>>> #SBATCH --ntasks-per-node=16
>>>>> 
>>>>> mpirun -np 160 xhpl | tee LOG
>>>>> -----------------------------
>>>>> 
>>>>> So I am running it on 160 cores across 10 nodes. 
>>>>> 
>>>>> Once job is submitted to the queue and is running I suspend it using
>>>>> ~# scontrol suspend JOBID
>>>>> 
>>>>> I see that indeed my job stopped producing output. I go to each of the 10
>>>>> nodes that were assigned to my job and check whether the xhpl processes are
>>>>> still running there with:
>>>>> 
>>>>> ~# for i in {10..19}; do ssh node$i "top -b -n 1 | head -n 50 | grep xhpl | 
>>>>> wc -l"; done
>>>>> 
>>>>> I expect this little script to return 0 from every node (because suspend 
>>>>> sent the
>>>>> SIGSTOP and they shouldn’t show up in top). However, I see that processes 
>>>>> are reliably suspended only on node10. I get:
>>>>> 0
>>>>> 16
>>>>> 16
>>>>> …
>>>>> 16
>>>>> 
>>>>> So 9 out of 10 nodes still have 16 MPI processes of my xhpl application 
>>>>> running at 100%. 
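As an aside, process state can be checked more directly than by counting lines of top output: ps reports stopped processes with state T, so a per-node count of non-stopped xhpl processes (assuming the same node$i naming as above) might look like:

```shell
# Count xhpl processes on each node that are NOT in the stopped (T) state;
# every line should read 0 once the job is fully suspended.
for i in {10..19}; do
  ssh node$i 'ps -C xhpl -o stat= | grep -vc "^T"'
done
```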
>>>>> 
>>>>> If I run “scontrol resume JOBID” and then suspend it again I see that 
>>>>> (sometimes) more
>>>>> nodes have “xhpl” processes properly suspended. Every time I resume and 
>>>>> suspend the
>>>>> job, I see different nodes returning 0 in my “ssh-run-top” script. 
>>>>> 
>>>>> So altogether it looks like the suspend mechanism doesn’t work properly 
>>>>> in SLURM with 
>>>>> OpenMPI. I’ve tried compiling OpenMPI with “--with-slurm 
>>>>> --with-pmi=/path/to/my/slurm”. 
>>>>> I’ve observed the same behavior. 
>>>>> 
>>>>> I would appreciate any help.   
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Eugene. 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> -- 
>>>> Dennis Tants
>>>> Auszubildender: Fachinformatiker für Systemintegration
>>>> 
>>>> ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
>>>> ZARM - Center of Applied Space Technology and Microgravity
>>>> 
>>>> Universität Bremen
>>>> Am Fallturm
>>>> 28359 Bremen, Germany
>>>> 
>>>> Telefon: 0421 218 57940
>>>> E-Mail: [email protected]
>>>> 
>>>> www.zarm.uni-bremen.de
