Okay, I tracked it down and have a fix pending for OMPI master: 
https://github.com/open-mpi/ompi/pull/3930

Once that cycles through, I’ll create a PR for the 3.0 release. I’m not sure about
taking it back to v2.x; I’ll have to check with those release managers.

> On Jul 18, 2017, at 7:33 AM, [email protected] wrote:
> 
> Just looking at it today...
> 
>> On Jul 18, 2017, at 7:25 AM, Eugene Dedits <[email protected]> wrote:
>> 
>> Hi Ralph, 
>> 
>> 
>> did you have a chance to take a look at this problem? 
>> 
>> Thanks!
>> Eugene.
>> 
>> 
>> 
>> 
>> On Tue, Jul 11, 2017 at 12:51 PM, Eugene Dedits <[email protected]> wrote:
>> Thanks! I really appreciate your help.
>> In the meantime I’ve tried experimenting with 1.8.3. Here is what I’ve
>> noticed.
>> 
>> 1. Running the job with “sbatch ./my_script”, where my_script calls
>> mpirun -np 160 -mca orte_forward_job_control 1 ./xhpl
>> 
>> and then suspending the job with “scontrol suspend JOBID”
>> does not work. Of the 10 nodes assigned to my job, 4 are still running
>> 16 MPI processes of xhpl.
>> 
>> 2. Running exactly the same job and then sending SIGTSTP to the mpirun process
>> does work: all 10 nodes show that the xhpl processes are stopped. Resuming
>> them with SIGCONT also works.
>> 
>> Again, this is with OpenMPI 1.8.3
>> 
>> Once again, thank you for all the help. 
>> 
>> Cheers,
>> Eugene. 
>> 
>> 
>> 
>> 
>>> On Jul 11, 2017, at 12:08 PM, [email protected] wrote:
>>> 
>>> Very odd - let me explore when I get back. Sorry for the delay.
>>> 
>>> Sent from my iPad
>>> 
>>> On Jul 11, 2017, at 10:59 AM, Eugene Dedits <[email protected]> wrote:
>>> 
>>>> Ralph, 
>>>> 
>>>> 
>>>> Are you suggesting doing something similar to this:
>>>> https://www.open-mpi.org/faq/?category=sge#sge-suspend-resume
>>>> 
>>>> If yes, here is what I’ve done:
>>>> - start a job using slurm and “mpirun -mca orte_forward_job_control 1 -np
>>>> 160 xhpl”
>>>> - ssh to the node where mpirun is launched
>>>> - “kill -STOP PID”, where PID is the mpirun pid
>>>> - “kill -TSTP PID”
>>>> 
>>>> In both cases (STOP and TSTP) I observed that there were 16 MPI processes
>>>> running at 100% on all 10 nodes where the job was started.
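The STOP/TSTP experiment above boils down to standard POSIX job-control signals, and the stop/resume behavior can be sketched locally in miniature, with `sleep` standing in for xhpl (the cluster, node names, and mpirun itself are not involved; names here are illustrative only):

```shell
# Start a stand-in worker (sleep here, instead of xhpl).
sleep 60 &
pid=$!

# SIGTSTP (terminal stop) is catchable and forwardable; SIGSTOP is not.
kill -TSTP "$pid"
sleep 1
ps -o stat= -p "$pid"   # state starts with "T": stopped

# SIGCONT resumes the process.
kill -CONT "$pid"
sleep 1
ps -o stat= -p "$pid"   # no longer "T"

kill "$pid" 2>/dev/null
```

A process stopped this way stays in the process table; only its state changes, which matters for any check that merely greps for the process name.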
>>>> 
>>>> Thanks,
>>>> Eugene. 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Jul 11, 2017, at 10:35 AM, [email protected] wrote:
>>>>> 
>>>>> 
>>>>> Odd - I'm on travel this week but can look at it next week. One
>>>>> possibility - have you tried hitting us with SIGTSTP instead of SIGSTOP?
>>>>> There is a difference in the ability to trap and forward them: SIGSTOP
>>>>> cannot be caught, while SIGTSTP can.
>>>>> 
>>>>> Sent from my iPad
>>>>> 
>>>>>> On Jul 11, 2017, at 9:29 AM, Eugene Dedits <[email protected]> wrote:
>>>>>> 
>>>>>> 
>>>>>> I’ve just tried 3.0.0rc1 and the problem still persists there… 
>>>>>> 
>>>>>> Thanks,
>>>>>> E. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Jul 11, 2017, at 10:20 AM, [email protected] wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Just checked the planning board and saw that my PR to bring that change 
>>>>>>> to 2.1.2 is pending and not yet in the release branch. I’ll try to make 
>>>>>>> that happen soon
>>>>>>> 
>>>>>>> Sent from my iPad
>>>>>>> 
>>>>>>>> On Jul 11, 2017, at 8:03 AM, [email protected] wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> There is an mca param ess_base_forward_signals that controls which 
>>>>>>>> signals to forward. However, I just looked at source code and see that 
>>>>>>>> it wasn't backported. Sigh.
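For reference, an MCA parameter like the one named above can be set either on the mpirun command line or through Open MPI's `OMPI_MCA_<name>` environment convention. The signal list below is a hypothetical example value, not a verified format for this parameter; `ompi_info --level 9` on the installed build shows the real parameters and accepted values.

```shell
# Environment form of an MCA parameter (OMPI maps OMPI_MCA_<name> to <name>).
# "SIGTSTP,SIGCONT" is a hypothetical example value, not a verified format.
export OMPI_MCA_ess_base_forward_signals="SIGTSTP,SIGCONT"
echo "$OMPI_MCA_ess_base_forward_signals"

# Command-line form of the same setting (sketch):
#   mpirun -mca ess_base_forward_signals SIGTSTP,SIGCONT -np 160 ./xhpl
```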
>>>>>>>> 
>>>>>>>> You could try the 3.0.0 branch as it is in release candidate and 
>>>>>>>> should go out within a week. I'd suggest just cloning that branch of 
>>>>>>>> the OMPI repo to get the latest state. The fix is definitely there 
>>>>>>>> 
>>>>>>>> Sent from my iPad
>>>>>>>> 
>>>>>>>>> On Jul 11, 2017, at 7:45 AM, Eugene Dedits <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Hi Ralph, 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> thanks for the reply. I’ve just tried upgrading to OMPI 2.1.1. The same 
>>>>>>>>> problem… :-\
>>>>>>>>> Could you point me to some discussion of this? 
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Eugene. 
>>>>>>>>> 
>>>>>>>>>> On Jul 11, 2017, at 6:17 AM, [email protected] wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> There is an issue with how the signal is forwarded. This has been 
>>>>>>>>>> fixed in the latest OMPI release, so you might want to upgrade.
>>>>>>>>>> 
>>>>>>>>>> Ralph
>>>>>>>>>> 
>>>>>>>>>> Sent from my iPad
>>>>>>>>>> 
>>>>>>>>>>> On Jul 11, 2017, at 2:53 AM, Dennis Tants 
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hello Eugene,
>>>>>>>>>>> 
>>>>>>>>>>> it is just a wild guess, but could you try "srun --mpi=pmi2" (you said
>>>>>>>>>>> you built OMPI with PMI support) instead of "mpirun"?
>>>>>>>>>>> srun is built-in and, I think, the preferred way of running parallel
>>>>>>>>>>> processes. Maybe scontrol is able to suspend it this way.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Dennis
>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 10, 2017, at 22:20, Eugene Dedits wrote:
>>>>>>>>>>>> Hello SLURM-DEV
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I have a problem with slurm, openmpi, and “scontrol suspend”. 
>>>>>>>>>>>> 
>>>>>>>>>>>> My setup is:
>>>>>>>>>>>> 96-node cluster with IB, running RHEL 6.8
>>>>>>>>>>>> SLURM 17.02.1
>>>>>>>>>>>> OpenMPI 2.0.0 (built using the Intel 2016 compiler)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I am running an application (HPL in this particular case) using a 
>>>>>>>>>>>> batch script similar to:
>>>>>>>>>>>> -----------------------------
>>>>>>>>>>>> #!/bin/bash
>>>>>>>>>>>> #SBATCH --partition=standard
>>>>>>>>>>>> #SBATCH -N 10
>>>>>>>>>>>> #SBATCH --ntasks-per-node=16
>>>>>>>>>>>> 
>>>>>>>>>>>> mpirun -np 160 xhpl | tee LOG
>>>>>>>>>>>> -----------------------------
>>>>>>>>>>>> 
>>>>>>>>>>>> So I am running it on 160 cores across 10 nodes. 
>>>>>>>>>>>> 
>>>>>>>>>>>> Once the job is submitted to the queue and running, I suspend it 
>>>>>>>>>>>> using
>>>>>>>>>>>> ~# scontrol suspend JOBID
>>>>>>>>>>>> 
>>>>>>>>>>>> I see that my job has indeed stopped producing output. I then go to 
>>>>>>>>>>>> each of the 10 nodes assigned to my job and check whether the xhpl 
>>>>>>>>>>>> processes are still running there with:
>>>>>>>>>>>> 
>>>>>>>>>>>> ~# for i in {10..19}; do ssh node$i "top -b -n 1 | head -n 50 | grep 
>>>>>>>>>>>> xhpl | wc -l"; done
>>>>>>>>>>>> 
>>>>>>>>>>>> I expect this little script to return 0 from every node (because 
>>>>>>>>>>>> suspend sent SIGSTOP, so the processes shouldn’t show up among the 
>>>>>>>>>>>> top CPU consumers). However, I see that processes 
>>>>>>>>>>>> are reliably suspended only on node10. I get:
>>>>>>>>>>>> 0
>>>>>>>>>>>> 16
>>>>>>>>>>>> 16
>>>>>>>>>>>> …
>>>>>>>>>>>> 16
>>>>>>>>>>>> 
>>>>>>>>>>>> So 9 out of 10 nodes still have 16 MPI processes of my xhpl 
>>>>>>>>>>>> application running at 100%. 
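One caveat about the check above: a SIGSTOPped process does not leave the process table, it just moves to state "T", so grepping the first 50 lines of top only works because a stopped process stops consuming CPU and falls out of the sort. Counting by process state is more direct. A local sketch, with `sleep` standing in for one xhpl rank (node names from the original loop are omitted):

```shell
# Stand-in for one MPI rank; on the cluster this would be an xhpl process.
sleep 60 &
pid=$!

kill -STOP "$pid"
sleep 1

# A suspended process is still listed by ps, but its state is "T";
# count only runnable/sleeping states (R/S) to detect "still running".
runnable=$(ps -o stat= -p "$pid" | grep -c '^[RS]' || true)
echo "runnable: $runnable"   # 0 when properly suspended

kill -CONT "$pid"
kill "$pid" 2>/dev/null
```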
>>>>>>>>>>>> 
>>>>>>>>>>>> If I run “scontrol resume JOBID” and then suspend the job again, I 
>>>>>>>>>>>> see that (sometimes) more nodes have the “xhpl” processes properly 
>>>>>>>>>>>> suspended. Every time I resume and suspend the
>>>>>>>>>>>> job, I see different nodes returning 0 in my “ssh-run-top” script. 
>>>>>>>>>>>> 
>>>>>>>>>>>> So altogether it looks like the suspend mechanism doesn’t 
>>>>>>>>>>>> work properly in SLURM with 
>>>>>>>>>>>> OpenMPI. I’ve tried compiling OpenMPI with “--with-slurm 
>>>>>>>>>>>> --with-pmi=/path/to/my/slurm” 
>>>>>>>>>>>> and observed the same behavior. 
>>>>>>>>>>>> 
>>>>>>>>>>>> I would appreciate any help.   
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Eugene. 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> -- 
>>>>>>>>>>> Dennis Tants
>>>>>>>>>>> Trainee: IT Specialist for System Integration (Fachinformatiker für Systemintegration)
>>>>>>>>>>> 
>>>>>>>>>>> ZARM - Zentrum für angewandte Raumfahrttechnologie und 
>>>>>>>>>>> Mikrogravitation
>>>>>>>>>>> ZARM - Center of Applied Space Technology and Microgravity
>>>>>>>>>>> 
>>>>>>>>>>> Universität Bremen
>>>>>>>>>>> Am Fallturm
>>>>>>>>>>> 28359 Bremen, Germany
>>>>>>>>>>> 
>>>>>>>>>>> Telefon: 0421 218 57940
>>>>>>>>>>> E-Mail: [email protected]
>>>>>>>>>>> 
>>>>>>>>>>> www.zarm.uni-bremen.de
>>>> 
>> 
>> 
> 
