On Sunday, 9 November 2014, Ralph Castain <[email protected]> wrote:

>
> What stop signal is being sent, and where? We will catch and suspend the
> job on receipt of a SIGTSTP signal by mpirun.
>
>
> > On Nov 9, 2014, at 6:47 AM, Jason Bacon <[email protected]> wrote:
> >
> >
> >
> > Does anyone have SUSPEND,GANG working with openmpi via mpirun?
> >
> > I've set up a low-priority queue, which seems to be working, except that
> > for openmpi jobs, only the processes on the MPI root node seem to be
> > getting the stop signal.
> >
> > From slurm.conf:
> >
> > SelectType=select/cons_res
> > SelectTypeParameters=CR_Core_Memory
> > PreemptMode=SUSPEND,GANG
> > PreemptType=preempt/partition_prio
> >
> > MpiDefault=none
> >
> > I've also tried --mca orte_forward_job_control 1, but it had no apparent
> > effect.
> >
> > Thanks,
> >
> >   Jason
>
Hi Jason,

I've been testing this setup with the freezer cgroup. It works fine with MPI
jobs, but intensive testing showed that for multi-node jobs the job that
should be suspended sometimes keeps using CPUs on one of the nodes. This
happens in only a small percentage of tests.
I'm currently unable to fully reproduce the issue, so I'm allowing only
single-node jobs on the lower-priority partition.
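
One way to catch the intermittent case described above would be to check, on
each compute node, whether any process of the supposedly suspended job is
still runnable. The following is only a minimal sketch under stated
assumptions: it is not part of Slurm or Open MPI, the helper names
parse_state and running_pids are hypothetical, and it assumes a Linux /proc
filesystem where a process in state R is running or runnable, while a stopped
or frozen process reports a different state.

```python
import os


def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' before reading the state.
    rest = stat_line.rsplit(")", 1)[1]
    return rest.split()[0]


def running_pids(pids, proc_root="/proc"):
    """Return the subset of pids whose state is 'R' (running or runnable)."""
    busy = []
    for pid in pids:
        try:
            with open(os.path.join(proc_root, str(pid), "stat")) as fh:
                if parse_state(fh.read()) == "R":
                    busy.append(pid)
        except OSError:
            # The process exited (or is not readable); skip it.
            continue
    return busy
```

The list of PIDs has to come from somewhere else, for example from the output
of scontrol listpids for the job on that node. A non-empty result from
running_pids while the job is supposed to be suspended would point to the
node where the suspend did not take effect.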

Cheers,
Marcin
