Thanks! I really appreciate your help. In the meantime I've tried experimenting with 1.8.3. Here is what I've noticed:
1. Running the job with "sbatch ./my_script", where the script calls "mpirun -np 160 -mca orte_forward_job_control 1 ./xhpl", and then suspending the job with "scontrol suspend JOBID" does not work: of the 10 nodes assigned to my job, 4 are still running 16 MPI threads of xhpl.

2. Running exactly the same job and then sending TSTP to the mpirun process does work: all 10 nodes show that the xhpl processes are stopped. Resuming them with -CONT also works.

Again, this is with OpenMPI 1.8.3.
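For the record, here is the manual workaround condensed into commands (node10 is just where mpirun happened to land for my job; adjust accordingly):

-----------------------------
# On the node where mpirun is running, stop the whole MPI job by hand.
# TSTP can be caught and forwarded by mpirun to all ranks; plain STOP cannot.
# Note: the launcher may show up as "orterun" instead of "mpirun" in some builds.
ssh node10 'kill -TSTP $(pgrep -u $USER -x mpirun)'

# ...and resume it later:
ssh node10 'kill -CONT $(pgrep -u $USER -x mpirun)'
-----------------------------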
Once again, thank you for all the help.

Cheers,
Eugene.

On Jul 11, 2017, at 12:08 PM, [email protected] wrote:

Very odd - let me explore when I get back. Sorry for the delay.

On Jul 11, 2017, at 10:59 AM, Eugene Dedits <[email protected]> wrote:

Ralph,

are you suggesting doing something similar to this:
https://www.open-mpi.org/faq/?category=sge#sge-suspend-resume

If yes, here is what I've done:
- start a job using slurm and "mpirun -mca orte_forward_job_control 1 -np 160 xhpl"
- ssh to the node where mpirun is launched
- "kill -STOP PID", where PID is the mpirun pid
- "kill -TSTP PID"

In both cases (STOP and TSTP) I observed 16 MPI processes running at 100% on all 10 nodes where the job was started.

Thanks,
Eugene.

On Jul 11, 2017, at 10:35 AM, [email protected] wrote:

Odd - I'm on travel this week but can look at it next week. One possibility: have you tried hitting us with SIGTSTP instead of SIGSTOP? There is a difference in the ability to trap and forward them.

On Jul 11, 2017, at 9:29 AM, Eugene Dedits <[email protected]> wrote:

I've just tried 3.0.0rc1 and the problem still persists there...

Thanks,
E.

On Jul 11, 2017, at 10:20 AM, [email protected] wrote:

Just checked the planning board and saw that my PR to bring that change to 2.1.2 is pending and not yet in the release branch. I'll try to make that happen soon.

On Jul 11, 2017, at 8:03 AM, [email protected] wrote:

There is an MCA param, ess_base_forward_signals, that controls which signals to forward. However, I just looked at the source code and see that it wasn't backported. Sigh.

You could try the 3.0.0 branch, as it is a release candidate and should go out within a week. I'd suggest just cloning that branch of the OMPI repo to get the latest state. The fix is definitely there.

On Jul 11, 2017, at 7:45 AM, Eugene Dedits <[email protected]> wrote:

Hi Ralph,

thanks for the reply. I've just tried upgrading to OMPI 2.1.1. The same problem... :-\
Could you point me to some discussion of this?

Thanks,
Eugene.

On Jul 11, 2017, at 6:17 AM, [email protected] wrote:

There is an issue with how the signal is forwarded. This has been fixed in the latest OMPI release, so you might want to upgrade.

Ralph

On Jul 11, 2017, at 2:53 AM, Dennis Tants <[email protected]> wrote:

Hello Eugene,

it is just a wild guess, but could you try "srun --mpi=pmi2" instead of "mpirun" (you said you built OMPI with PMI support)? srun is built in, and I think it is the preferred way of launching parallel processes. Maybe scontrol is able to suspend the job this way.
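Something like this, just as a sketch based on your original script (I have not tried it with xhpl myself):

-----------------------------
#!/bin/bash
#SBATCH --partition=standard
#SBATCH -N 10
#SBATCH --ntasks-per-node=16

# Let slurmd launch the tasks directly; no "-np" needed, since the task
# count (10 nodes x 16 tasks) comes from the allocation.
srun --mpi=pmi2 ./xhpl | tee LOG
-----------------------------

Since slurmd is then the parent of every task, "scontrol suspend" can signal the processes on all nodes directly instead of relying on mpirun to forward anything.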
Regards,
Dennis

--
Dennis Tants
Trainee (IT specialist for system integration)
ZARM - Center of Applied Space Technology and Microgravity
Universität Bremen, Am Fallturm, 28359 Bremen, Germany
Phone: 0421 218 57940
E-Mail: [email protected]
www.zarm.uni-bremen.de

On 10.07.2017 at 22:20, Eugene Dedits wrote:

Hello SLURM-DEV,

I have a problem with slurm, openmpi, and "scontrol suspend".

My setup is:
- 96-node cluster with IB, running RHEL 6.8
- slurm 17.02.1
- openmpi 2.0.0 (built using the Intel 2016 compiler)

I am running an application (HPL in this particular case) using a batch script similar to:
-----------------------------
#!/bin/bash
#SBATCH --partition=standard
#SBATCH -N 10
#SBATCH --ntasks-per-node=16

mpirun -np 160 xhpl | tee LOG
-----------------------------

So I am running it on 160 cores across 10 nodes.

Once the job is submitted to the queue and running, I suspend it using:

~# scontrol suspend JOBID

I see that my job has indeed stopped producing output. I then go to each of the 10 nodes assigned to my job and check whether the xhpl processes are still running there:

~# for i in {10..19}; do ssh node$i "top -b -n 1 | head -n 50 | grep xhpl | wc -l"; done

I expect this little script to return 0 from every node (because suspend sent SIGSTOP and the processes shouldn't show up in top). However, I see that the processes are reliably suspended only on node10. I get:

0
16
16
...
16

So 9 out of 10 nodes still have 16 MPI threads of my xhpl application running at 100%.

If I run "scontrol resume JOBID" and then suspend the job again, I see that (sometimes) more nodes have the xhpl processes properly suspended. Every time I resume and suspend the job, different nodes return 0 in my "ssh-run-top" script.

So, all together, it looks like the suspend mechanism doesn't work properly in SLURM with OpenMPI. I've tried compiling OpenMPI with "--with-slurm --with-pmi=/path/to/my/slurm" and observed the same behavior.

I would appreciate any help.

Thanks,
Eugene.
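P.S. In case anyone wants to reproduce the check: a stopped process is reported with state "T" by ps, which is a bit more direct than grepping top output. Roughly (node names as above):

-----------------------------
# Count the xhpl processes per node that are NOT stopped (state "T");
# after a successful suspend, every node should print 0.
for i in {10..19}; do
    ssh node$i 'ps -C xhpl -o stat= | grep -cv "^T"'
done
-----------------------------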
