Good to hear! I think it’ll be out pretty soon (i.e., next few weeks)
> On Jul 19, 2017, at 11:22 AM, Eugene Dedits <eugene.ded...@gmail.com> wrote:
>
> Ralph,
>
> it seems to work now. Thanks a bunch!
> When do you think we should expect 3.0.0 release?
>
> Best,
> Eugene.
>
>
>
>
> On Tue, Jul 18, 2017 at 1:07 PM, r...@open-mpi.org wrote:
> Okay, I tracked it down and have a fix pending for OMPI master:
> https://github.com/open-mpi/ompi/pull/3930
>
> Once that cycles thru, I’ll create a PR for the 3.0 release. I’m not sure
> about taking it back to v2.x - I’ll have to check with those release managers.
>
>> On Jul 18, 2017, at 7:33 AM, r...@open-mpi.org wrote:
>>
>> Just looking at it today...
>>
>>> On Jul 18, 2017, at 7:25 AM, Eugene Dedits <eugene.ded...@gmail.com> wrote:
>>>
>>> Hi Ralph,
>>>
>>>
>>> did you have a chance to take a look at this problem?
>>>
>>> Thanks!
>>> Eugene.
>>>
>>>
>>>
>>>
>>> On Tue, Jul 11, 2017 at 12:51 PM, Eugene Dedits <eugene.ded...@gmail.com> wrote:
>>> Thanks! I really appreciate your help.
>>> In the meantime I’ve tried experimenting with OpenMPI 1.8.3. Here is what I’ve
>>> noticed.
>>>
>>> 1. Running the job with “sbatch ./my_script”, where the script calls
>>>    mpirun -np 160 -mca orte_forward_job_control 1 ./xhpl
>>>
>>>    and then suspending the job with “scontrol suspend JOBID”,
>>>    does not work. Of the 10 nodes assigned to my job, 4 are still running
>>>    16 MPI processes of xhpl.
>>>
>>> 2. Running exactly the same job and then sending SIGTSTP to the mpirun process
>>>    does work: all 10 nodes show that the xhpl processes are stopped. Resuming
>>>    them with SIGCONT also works.
>>>
>>> Again, this is with OpenMPI 1.8.3
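>>>
>>> In other words, the workaround in item 2, spelled out as commands run from a
>>> login node (a minimal sketch; $JOBID is the SLURM job ID, and it assumes a
>>> single mpirun per user on the batch host):
>>>
>>> BATCH_HOST=$(scontrol show job "$JOBID" | sed -n 's/.*BatchHost=\([^ ]*\).*/\1/p')
>>> ssh "$BATCH_HOST" 'kill -TSTP $(pgrep -u $USER -x mpirun)'   # suspend
>>> ssh "$BATCH_HOST" 'kill -CONT $(pgrep -u $USER -x mpirun)'   # resume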
>>>
>>> Once again, thank you for all the help.
>>>
>>> Cheers,
>>> Eugene.
>>>
>>>
>>>
>>>
>>>> On Jul 11, 2017, at 12:08 PM, r...@open-mpi.org wrote:
>>>>
>>>> Very odd - let me explore when I get back. Sorry for delay
>>>>
>>>> Sent from my iPad
>>>>
>>>> On Jul 11, 2017, at 10:59 AM, Eugene Dedits <eugene.ded...@gmail.com> wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>>
>>>>> Are you suggesting doing something similar to this:
>>>>> https://www.open-mpi.org/faq/?category=sge#sge-suspend-resume
>>>>>
>>>>> If yes, here is what I’ve done:
>>>>> - start a job using SLURM and “mpirun -mca orte_forward_job_control 1 -np 160 xhpl”
>>>>> - ssh to the node where mpirun is launched
>>>>> - “kill -STOP PID”, where PID is the mpirun PID
>>>>> - “kill -TSTP PID”
>>>>>
>>>>> In both cases (STOP and TSTP) I observed that there were 16 MPI processes
>>>>> running at 100% on all 10 nodes where the job was started.
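>>>>>
>>>>> For completeness, the same steps written out as shell commands (a sketch;
>>>>> the pgrep assumes only one mpirun per user on that node):
>>>>>
>>>>> MPIRUN_PID=$(pgrep -u "$USER" -x mpirun)
>>>>> kill -STOP "$MPIRUN_PID"   # first attempt
>>>>> kill -CONT "$MPIRUN_PID"   # resume so the next signal is actually handled
>>>>> kill -TSTP "$MPIRUN_PID"   # second attempt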
>>>>>
>>>>> Thanks,
>>>>> Eugene.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On Jul 11, 2017, at 10:35 AM, r...@open-mpi.org wrote:
>>>>>>
>>>>>>
>>>>>> Odd - I’m on travel this week but can look at it next week. One
>>>>>> possibility - have you tried hitting mpirun with SIGTSTP instead of
>>>>>> SIGSTOP? There’s a difference in the ability to trap and forward them.
>>>>>>
>>>>>> Sent from my iPad
>>>>>>
>>>>>>> On Jul 11, 2017, at 9:29 AM, Eugene Dedits <eugene.ded...@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> I’ve just tried 3.0.0rc1 and the problem still persists there…
>>>>>>>
>>>>>>> Thanks,
>>>>>>> E.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Jul 11, 2017, at 10:20 AM, r...@open-mpi.org wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Just checked the planning board and saw that my PR to bring that
>>>>>>>> change to 2.1.2 is pending and not yet in the release branch. I’ll try
>>>>>>>> to make that happen soon
>>>>>>>>
>>>>>>>> Sent from my iPad
>>>>>>>>
>>>>>>>>> On Jul 11, 2017, at 8:03 AM, r...@open-mpi.org wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There is an MCA param, ess_base_forward_signals, that controls which
>>>>>>>>> signals to forward. However, I just looked at the source code and see
>>>>>>>>> that it wasn’t backported. Sigh.
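>>>>>>>>>
>>>>>>>>> On a build that does have it, the param would be set like any other MCA
>>>>>>>>> param; the exact value syntax is best checked with ompi_info first (this
>>>>>>>>> is only a sketch, assuming a comma-delimited list of signal names):
>>>>>>>>>
>>>>>>>>> ompi_info --param ess base --level 9 | grep forward_signals
>>>>>>>>> mpirun -mca ess_base_forward_signals "SIGTSTP,SIGCONT" -np 160 xhpl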
>>>>>>>>>
>>>>>>>>> You could try the 3.0.0 branch, as it is at the release-candidate stage and
>>>>>>>>> should go out within a week. I'd suggest just cloning that branch of
>>>>>>>>> the OMPI repo to get the latest state. The fix is definitely there.
>>>>>>>>>
>>>>>>>>> Sent from my iPad
>>>>>>>>>
>>>>>>>>>> On Jul 11, 2017, at 7:45 AM, Eugene Dedits <eugene.ded...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Ralph,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> thanks for the reply. I’ve just tried upgrading to OMPI 2.1.1. The same
>>>>>>>>>> problem… :-\
>>>>>>>>>> Could you point me to some discussion of this?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Eugene.
>>>>>>>>>>
>>>>>>>>>>> On Jul 11, 2017, at 6:17 AM, r...@open-mpi.org wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> There is an issue with how the signal is forwarded. This has been
>>>>>>>>>>> fixed in the latest OMPI release, so you might want to upgrade.
>>>>>>>>>>>
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>> Sent from my iPad
>>>>>>>>>>>
>>>>>>>>>>>> On Jul 11, 2017, at 2:53 AM, Dennis Tants <dennis.ta...@zarm.uni-bremen.de> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hello Eugene,
>>>>>>>>>>>>
>>>>>>>>>>>> it is just a wild guess, but could you try "srun --mpi=pmi2" (you said
>>>>>>>>>>>> you built OMPI with PMI support) instead of "mpirun"?
>>>>>>>>>>>> srun is built-in and, I think, the preferred way of running parallel
>>>>>>>>>>>> processes. Maybe scontrol is able to suspend it this way; see the sketch
>>>>>>>>>>>> below.
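>>>>>>>>>>>>
>>>>>>>>>>>> For example, the batch script from your first mail would then look roughly
>>>>>>>>>>>> like this (an untested sketch):
>>>>>>>>>>>>
>>>>>>>>>>>> -----------------------------
>>>>>>>>>>>> #!/bin/bash
>>>>>>>>>>>> #SBATCH --partition=standard
>>>>>>>>>>>> #SBATCH -N 10
>>>>>>>>>>>> #SBATCH --ntasks-per-node=16
>>>>>>>>>>>>
>>>>>>>>>>>> srun --mpi=pmi2 xhpl | tee LOG
>>>>>>>>>>>> -----------------------------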
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Dennis
>>>>>>>>>>>>
>>>>>>>>>>>>> On 10.07.2017 at 22:20, Eugene Dedits wrote:
>>>>>>>>>>>>> Hello SLURM-DEV
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a problem with SLURM, OpenMPI, and “scontrol suspend”.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My setup is:
>>>>>>>>>>>>> 96-node cluster with IB, running RHEL 6.8
>>>>>>>>>>>>> SLURM 17.02.1
>>>>>>>>>>>>> OpenMPI 2.0.0 (built using the Intel 2016 compiler)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am running some application (HPL in this particular case) using
>>>>>>>>>>>>> a batch script similar to:
>>>>>>>>>>>>> -----------------------------
>>>>>>>>>>>>> #!/bin/bash
>>>>>>>>>>>>> #SBATCH --partition=standard
>>>>>>>>>>>>> #SBATCH -N 10
>>>>>>>>>>>>> #SBATCH --ntasks-per-node=16
>>>>>>>>>>>>>
>>>>>>>>>>>>> mpirun -np 160 xhpl | tee LOG
>>>>>>>>>>>>> -----------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> So I am running it on 160 cores across 10 nodes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Once the job is submitted to the queue and is running, I suspend it using
>>>>>>>>>>>>> ~# scontrol suspend JOBID
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see that indeed my job stopped producing output. I then go to each of the
>>>>>>>>>>>>> 10 nodes that were assigned to my job and check whether the xhpl processes
>>>>>>>>>>>>> are still running there with:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ~# for i in {10..19}; do ssh node$i "top -b -n 1 | head -n 50 | grep xhpl | wc -l"; done
>>>>>>>>>>>>>
>>>>>>>>>>>>> I expect this little script to return 0 from every node (because suspend
>>>>>>>>>>>>> should have sent SIGSTOP, so the xhpl processes should be stopped and no
>>>>>>>>>>>>> longer appear near the top of the CPU-sorted output). However, I see that
>>>>>>>>>>>>> the processes are reliably suspended only on node10. I get:
>>>>>>>>>>>>> 0
>>>>>>>>>>>>> 16
>>>>>>>>>>>>> 16
>>>>>>>>>>>>> …
>>>>>>>>>>>>> 16
>>>>>>>>>>>>>
>>>>>>>>>>>>> So 9 out of 10 nodes still have 16 MPI processes of my xhpl
>>>>>>>>>>>>> application running at 100%.
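>>>>>>>>>>>>>
>>>>>>>>>>>>> A more direct check (just a sketch) is to look at the process state rather
>>>>>>>>>>>>> than CPU usage: count the xhpl processes on each node that are NOT in the
>>>>>>>>>>>>> stopped state “T”; a fully suspended node should report 0:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ~# for i in {10..19}; do ssh node$i "ps -C xhpl -o stat= | grep -cv '^T'"; done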
>>>>>>>>>>>>>
>>>>>>>>>>>>> If I run “scontrol resume JOBID” and then suspend it again I see
>>>>>>>>>>>>> that (sometimes) more
>>>>>>>>>>>>> nodes have “xhpl” processes properly suspended. Every time I
>>>>>>>>>>>>> resume and suspend the
>>>>>>>>>>>>> job, I see different nodes returning 0 in my “ssh-run-top”
>>>>>>>>>>>>> script.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So altogether it looks like the suspend mechanism doesn’t work
>>>>>>>>>>>>> properly in SLURM with
>>>>>>>>>>>>> OpenMPI. I’ve tried compiling OpenMPI with “--with-slurm
>>>>>>>>>>>>> --with-pmi=/path/to/my/slurm”.
>>>>>>>>>>>>> I’ve observed the same behavior.
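>>>>>>>>>>>>>
>>>>>>>>>>>>> For reference, the configure invocation was roughly along these lines (the
>>>>>>>>>>>>> install prefix here is just a placeholder):
>>>>>>>>>>>>>
>>>>>>>>>>>>> ./configure --prefix=/opt/openmpi-2.0.0 \
>>>>>>>>>>>>>             --with-slurm \
>>>>>>>>>>>>>             --with-pmi=/path/to/my/slurm
>>>>>>>>>>>>> make -j && make install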
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would appreciate any help.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Eugene.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Dennis Tants
>>>>>>>>>>>> Auszubildender: Fachinformatiker für Systemintegration
>>>>>>>>>>>>
>>>>>>>>>>>> ZARM - Zentrum für angewandte Raumfahrttechnologie und
>>>>>>>>>>>> Mikrogravitation
>>>>>>>>>>>> ZARM - Center of Applied Space Technology and Microgravity
>>>>>>>>>>>>
>>>>>>>>>>>> Universität Bremen
>>>>>>>>>>>> Am Fallturm
>>>>>>>>>>>> 28359 Bremen, Germany
>>>>>>>>>>>>
>>>>>>>>>>>> Telefon: 0421 218 57940
>>>>>>>>>>>> E-Mail: ta...@zarm.uni-bremen.de
>>>>>>>>>>>>
>>>>>>>>>>>> www.zarm.uni-bremen.de
>>>>>
>>>
>>>
>>
>
>