On Sunday, 9 November 2014, Ralph Castain <[email protected]> wrote:
> What stop signal is being sent, and where? We will catch and suspend the
> job on receipt of a SIGTSTP signal by mpirun.
>
> > On Nov 9, 2014, at 6:47 AM, Jason Bacon <[email protected]> wrote:
> >
> > Does anyone have SUSPEND,GANG working with openmpi via mpirun?
> >
> > I've set up a low-priority queue, which seems to be working, except
> > that for openmpi jobs, only the processes on the MPI root node seem
> > to be getting the stop signal.
> >
> > From slurm.conf:
> >
> > SelectType=select/cons_res
> > SelectTypeParameters=CR_Core_Memory
> > PreemptMode=SUSPEND,GANG
> > PreemptType=preempt/partition_prio
> >
> > MpiDefault=none
> >
> > I've also tried --mca orte_forward_job_control 1, but it had no
> > apparent effect.
> >
> > Thanks,
> >
> > Jason

Hi Jason,

I've been testing this setup with the freezer cgroup. It works fine with
MPI jobs, but intensive tests have shown that, for multi-node jobs, the job
that should be suspended sometimes keeps using CPUs on one of the nodes.
This happens in only a small percentage of tests. I'm currently not able to
fully reproduce the issue, so I'm allowing only single-node jobs on the
lower-priority partition.

Cheers,
Marcin
