Ralph, it seems to work now. Thanks a bunch! When do you think we should expect the 3.0.0 release?
Best,
Eugene.

On Tue, Jul 18, 2017 at 1:07 PM, [email protected] <[email protected]> wrote:

> Okay, I tracked it down and have a fix pending for OMPI master:
> https://github.com/open-mpi/ompi/pull/3930
>
> Once that cycles through, I'll create a PR for the 3.0 release. I'm not sure about taking it back to v2.x - I'll have to check with those release managers.
>
> On Jul 18, 2017, at 7:33 AM, [email protected] wrote:
>
> Just looking at it today...
>
> On Jul 18, 2017, at 7:25 AM, Eugene Dedits <[email protected]> wrote:
>
> Hi Ralph,
>
> did you have a chance to take a look at this problem?
>
> Thanks!
> Eugene.
>
> On Tue, Jul 11, 2017 at 12:51 PM, Eugene Dedits <[email protected]> wrote:
>
>> Thanks! I really appreciate your help.
>> In the meantime I've tried experimenting with 1.8.3. Here is what I've noticed.
>>
>> 1. Running the job with "sbatch ./my_script", where my script calls
>>    mpirun -np 160 -mca orte_forward_job_control 1 ./xhpl
>> and then suspending the job with "scontrol suspend JOBID" does not work. Of the 10 nodes assigned to my job, 4 are still running 16 MPI processes of xhpl.
>>
>> 2. Running exactly the same job and then sending TSTP to the mpirun process does work: all 10 nodes show that the xhpl processes are stopped. Resuming them with -CONT also works.
>>
>> Again, this is with OpenMPI 1.8.3.
>>
>> Once again, thank you for all the help.
>>
>> Cheers,
>> Eugene.
>>
>> On Jul 11, 2017, at 12:08 PM, [email protected] wrote:
>>
>> Very odd - let me explore when I get back. Sorry for the delay.
>>
>> Sent from my iPad
>>
>> On Jul 11, 2017, at 10:59 AM, Eugene Dedits <[email protected]> wrote:
>>
>> Ralph,
>>
>> Are you suggesting doing something similar to this:
>> https://www.open-mpi.org/faq/?category=sge#sge-suspend-resume
>>
>> If yes, here is what I've done:
>> - start a job using Slurm and "mpirun -mca orte_forward_job_control 1 -np 160 xhpl"
>> - ssh to the node where mpirun is launched
>> - "kill -STOP PID", where PID is the mpirun PID
>> - "kill -TSTP PID"
>>
>> In both cases (STOP and TSTP) I observed that there were 16 MPI processes running at 100% on all 10 nodes where the job was started.
>>
>> Thanks,
>> Eugene.
>>
>> On Jul 11, 2017, at 10:35 AM, [email protected] wrote:
>>
>> Odd - I'm on travel this week but can look at it next week. One possibility - have you tried hitting us with SIGTSTP instead of SIGSTOP? There is a difference in the ability to trap and forward them.
>>
>> Sent from my iPad
>>
>> On Jul 11, 2017, at 9:29 AM, Eugene Dedits <[email protected]> wrote:
>>
>> I've just tried 3.0.0rc1 and the problem still persists there...
>>
>> Thanks,
>> E.
>>
>> On Jul 11, 2017, at 10:20 AM, [email protected] wrote:
>>
>> Just checked the planning board and saw that my PR to bring that change to 2.1.2 is pending and not yet in the release branch. I'll try to make that happen soon.
>>
>> Sent from my iPad
>>
>> On Jul 11, 2017, at 8:03 AM, "[email protected]" <[email protected]> wrote:
>>
>> There is an MCA param ess_base_forward_signals that controls which signals to forward. However, I just looked at the source code and see that it wasn't backported. Sigh.
>>
>> You could try the 3.0.0 branch, as it is at the release-candidate stage and should go out within a week. I'd suggest just cloning that branch of the OMPI repo to get the latest state. The fix is definitely there.
>>
>> Sent from my iPad
>>
>> On Jul 11, 2017, at 7:45 AM, Eugene Dedits <[email protected]> wrote:
>>
>> Hi Ralph,
>>
>> thanks for the reply. I've just tried upgrading to OMPI 2.1.1. The same problem... :-\
>> Could you point me to some discussion of this?
>>
>> Thanks,
>> Eugene.
>>
>> On Jul 11, 2017, at 6:17 AM, [email protected] wrote:
>>
>> There is an issue with how the signal is forwarded. This has been fixed in the latest OMPI release, so you might want to upgrade.
>>
>> Ralph
>>
>> Sent from my iPad
>>
>> On Jul 11, 2017, at 2:53 AM, Dennis Tants <[email protected]> wrote:
>>
>> Hello Eugene,
>>
>> it is just a wild guess, but could you try "srun --mpi=pmi2" (you said you built OMPI with PMI support) instead of "mpirun"?
>> srun is built in and, I think, the preferred way of running parallel processes. Maybe scontrol is able to suspend it this way.
>>
>> Regards,
>> Dennis
>>
>> On 10.07.2017 at 22:20, Eugene Dedits wrote:
>> Hello SLURM-DEV,
>>
>> I have a problem with Slurm, OpenMPI, and "scontrol suspend".
>>
>> My setup is:
>> 96-node cluster with IB, running RHEL 6.8
>> Slurm 17.02.1
>> OpenMPI 2.0.0 (built using the Intel 2016 compiler)
>>
>> I am running an application (HPL in this particular case) using a batch script similar to:
>> -----------------------------
>> #!/bin/bash
>> #SBATCH --partition=standard
>> #SBATCH -N 10
>> #SBATCH --ntasks-per-node=16
>>
>> mpirun -np 160 xhpl | tee LOG
>> -----------------------------
>>
>> So I am running it on 160 cores, 10 nodes.
>>
>> Once the job is submitted to the queue and is running, I suspend it using
>> ~# scontrol suspend JOBID
>>
>> I see that indeed my job stopped producing output. I then go to each of the 10 nodes that were assigned to my job and check whether the xhpl processes are still running there with:
>>
>> ~# for i in {10..19}; do ssh node$i "top -b -n 1 | head -n 50 | grep xhpl | wc -l"; done
>>
>> I expect this little script to return 0 from every node (because suspend sent SIGSTOP and the processes shouldn't show up in top). However, I see that the processes are reliably suspended only on node10. I get:
>> 0
>> 16
>> 16
>> ...
>> 16
>>
>> So 9 out of 10 nodes still have 16 MPI processes of my xhpl application running at 100%.
>>
>> If I run "scontrol resume JOBID" and then suspend it again, I see that (sometimes) more nodes have their xhpl processes properly suspended. Every time I resume and suspend the job, I see different nodes returning 0 in my "ssh-run-top" script.
>>
>> So altogether it looks like the suspend mechanism doesn't work properly in Slurm with OpenMPI. I've tried compiling OpenMPI with "--with-slurm --with-pmi=/path/to/my/slurm" and observed the same behavior.
>>
>> I would appreciate any help.
>>
>> Thanks,
>> Eugene.
>>
>> --
>> Dennis Tants
>> Trainee: IT Specialist for System Integration
>>
>> ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
>> ZARM - Center of Applied Space Technology and Microgravity
>>
>> Universität Bremen
>> Am Fallturm
>> 28359 Bremen, Germany
>>
>> Phone: 0421 218 57940
>> E-Mail: [email protected]
>>
>> www.zarm.uni-bremen.de
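
For anyone finding this thread later, here is a minimal sketch of the workaround Eugene describes above: signalling mpirun directly with SIGTSTP/SIGCONT instead of relying on "scontrol suspend" (which delivers SIGSTOP). It assumes mpirun was launched with "-mca orte_forward_job_control 1"; the JOBID argument, the squeue lookup of the batch host, and the pgrep call are illustrative assumptions about a typical setup, and Eugene reports the underlying behavior working only with OpenMPI 1.8.3.

-----------------------------
#!/bin/bash
# Hypothetical helper: suspend or resume an MPI job by signalling mpirun
# directly, as discussed in the thread above.
# Usage: ./mpi_suspend.sh JOBID suspend|resume

JOBID=$1
ACTION=${2:-suspend}

# Batch host on which the job script (and therefore mpirun) is running.
NODE=$(squeue -h -j "$JOBID" -o %B)

# PID of the newest mpirun owned by the current user on that host.
PID=$(ssh "$NODE" pgrep -n -u "$USER" mpirun)

if [ "$ACTION" = "suspend" ]; then
    # mpirun can trap SIGTSTP and forward job control to the remote ranks
    # (requires -mca orte_forward_job_control 1 at launch time).
    ssh "$NODE" kill -TSTP "$PID"
else
    # SIGCONT resumes the forwarded ranks.
    ssh "$NODE" kill -CONT "$PID"
fi
-----------------------------

Note that this only stops the processes; Slurm still considers the job running, unlike "scontrol suspend", which also marks the job as suspended in the scheduler.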
