Load average gets high if the job spawns more processes/threads than
allocated CPUs, but we haven't seen any problem with node instability. We
did have to remove np_load_avg from load_thresholds, though, to keep our
users from DoS'ing the cluster...
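
To make the load-average point concrete: np_load_avg is the load average divided by the number of CPUs, so a job that runs more active threads than it has CPUs pushes the value above 1 per oversubscribed core. Below is a minimal Python sketch of that arithmetic. The function names and the threshold argument are illustrative assumptions, not part of Grid Engine.

```python
import os


def normalized_load(load_avg, n_cpus):
    """Load average divided by CPU count (the idea behind np_load_avg)."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return load_avg / n_cpus


def exceeds_threshold(load_avg, n_cpus, threshold):
    """True if the per-CPU load is above the given threshold."""
    return normalized_load(load_avg, n_cpus) > threshold


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    one_minute, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"per-CPU load: {normalized_load(one_minute, cpus):.2f}")
```

With a threshold on this value in load_thresholds, a single oversubscribed job can put a whole queue instance into alarm state, which is the effect described above.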

On Thu, Aug 29, 2019 at 05:27:36AM -0400, Mike Serkov wrote:
> Also, something to keep in mind: cgroups will not solve this issue
> completely. It is just affinity enforcement. If the job spawns multiple
> threads and they are all active, the load average will still grow, along
> with some other side effects, regardless of the affinity setting. On big SMP
> boxes it may actually cause more instability. Anyway, jobs should be
> configured to use the exact number of threads they request, and this should
> be monitored.
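
One way a job could match its thread count to its request is to derive the usual thread-count environment variables from NSLOTS, which Grid Engine sets in the job environment. The sketch below is an illustration only; the helper name thread_limits and the choice of variables (OpenMP, OpenBLAS, MKL) are assumptions, and other libraries may need their own settings.

```python
import os

# Environment variables read by common threaded runtimes.
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def thread_limits(environ):
    """Map each thread-count variable to the slot count in NSLOTS.

    Falls back to 1 when NSLOTS is absent (e.g. outside a batch job).
    """
    nslots = int(environ.get("NSLOTS", "1"))
    if nslots < 1:
        raise ValueError("NSLOTS must be at least 1")
    return {var: str(nslots) for var in THREAD_VARS}


if __name__ == "__main__":
    # Apply the limits before importing any threaded numeric library,
    # since most of them read these variables only at load time.
    os.environ.update(thread_limits(os.environ))
    print({var: os.environ[var] for var in THREAD_VARS})
```

This only helps with cooperative programs; a job that ignores these variables still needs affinity or cgroup enforcement.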
> 
> Best regards,
> Mikhail Serkov 
> 
> > On Aug 29, 2019, at 4:16 AM, Ondrej Valousek 
> > <ondrej.valou...@adestotech.com> wrote:
> > 
> > Also a quick note: cgroups is the way to _enforce_ CPU affinity.
> > For the vast majority of jobs, I would say just a simple taskset
> > configuration (i.e. something like "-l binding linear") would do
> > as well.
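
To check from inside a job whether a binding actually took effect, the process can inspect its own CPU affinity mask. The following Python sketch assumes a Linux exec host, where os.sched_getaffinity is available; the helper names are hypothetical, and the fallback to the full CPU count on other platforms is an assumption made only so the example runs everywhere.

```python
import os


def allowed_cpus(pid=0):
    """Return the sorted CPU ids the process may run on.

    Uses the Linux affinity mask when available; otherwise falls back
    to all CPUs reported by the OS (no affinity information).
    """
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(pid))
    return list(range(os.cpu_count() or 1))


def oversubscribed(n_threads, cpus):
    """True if more threads are planned than CPUs are available."""
    return n_threads > len(cpus)


if __name__ == "__main__":
    cpus = allowed_cpus()
    print(f"bound to {len(cpus)} CPU(s): {cpus}")
```

A job bound to one core that starts eight busy threads would report oversubscribed(8, cpus) as true: the threads share the core, and the load average rises even though the binding holds.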
> >  
> >  
> > From: Dietmar Rieder <dietmar.rie...@i-med.ac.at> 
> > Sent: Thursday, August 29, 2019 9:37 AM
> > To: users@gridengine.org; Ondrej Valousek <ondrej.valou...@adestotech.com>; 
> > users <users@gridengine.org>
> > Subject: Re: [gridengine users] limit CPU/slot resource to the number of 
> > reserved slots
> >  
> > Great, thanks so much!
> > 
> > Dietmar
> > 
> > On 29 August 2019 09:05:35 CEST, Ondrej Valousek
> > <ondrej.valou...@adestotech.com> wrote:
> > Nope,
> > SoGE (as of 8.1.9) supports cgroups without any code changes: just add
> > "USE_CGROUPS=yes" to the exec parameter list to make the shepherd use
> > the cgroup cpuset controller.
> > My patch only extends it to support systemd, and hence the possibility to
> > hard-enforce memory/CPU limits, etc.
> > Hth,
> > Ondrej
> >  
> > From: Daniel Povey <dpo...@gmail.com> 
> > Sent: Monday, August 26, 2019 10:12 PM
> > To: Dietmar Rieder <dietmar.rie...@i-med.ac.at>; Ondrej Valousek 
> > <ondrej.valou...@adestotech.com>; users <users@gridengine.org>
> > Subject: Re: [gridengine users] limit CPU/slot resource to the number of 
> > reserved slots
> >  
> > I don't think it's supported in Son of GridEngine.  Ondrej Valousek (cc'd) 
> > described in the first thread here
> > http://arc.liv.ac.uk/pipermail/sge-discuss/2019-August/thread.html
> > how he was able to implement it, but it required code changes, i.e. you 
> > would need to figure out how to build and install SGE from source, which is 
> > a task in itself.
> >  
> > Dan
> >  
> >  
> > On Mon, Aug 26, 2019 at 12:46 PM Dietmar Rieder 
> > <dietmar.rie...@i-med.ac.at> wrote:
> > Hi,
> > 
> > thanks for your reply. This sounds promising.
> > We are using Son of Grid Engine though. Can you point me to the right
> > docs to get cgroups enabled on the exec host (CentOS 7)? I must admit I
> > have no experience with cgroups.
> > 
> > Thanks again
> >   Dietmar
> > 
> > On 8/26/19 4:03 PM, Skylar Thompson wrote:
> > > At least for UGE, you will want to use the CPU set integration, which will
> > > assign the job to a cgroup that has one CPU per requested slot. Once you
> > > have cgroups enabled in the exec host OS, you can then set these options
> > > in sge_conf:
> > > 
> > > cgroup_path=/cgroup
> > > cpuset=1
> > > 
> > > You can use this mechanism to have the m_mem_free request enforced as well.
> > > 
> > > On Mon, Aug 26, 2019 at 02:15:22PM +0200, Dietmar Rieder wrote:
> > >> Hi,
> > >>
> > >> maybe this is a stupid question, but I'd like to limit the used/usable
> > >> number of cores to the number of slots that were reserved for a job.
> > >>
> > >> We often see that people reserve 1 slot, e.g. "qsub -pe smp 1 [...]",
> > >> but their program then runs in parallel on multiple cores. How can
> > >> this be prevented? Is it possible to ensure that a process which
> > >> reserved only one slot cannot use more than that?
> > >>
> > >> I was told that this should be possible in Slurm (which we don't have,
> > >> and to which we don't want to switch currently).
> > >>
> > >> Thanks
> > >>   Dietmar
> > > 
> > 
> > 
> > -- 
> > _________________________________________
> > D i e t m a r  R i e d e r, Mag.Dr.
> > Innsbruck Medical University
> > Biocenter - Institute of Bioinformatics
> > Email: dietmar.rie...@i-med.ac.at
> > Web:   http://www.icbi.at
> > 
> > 
> > _______________________________________________
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> > 
> > --
> > D i e t m a r R i e d e r, Mag.Dr.
> > Innsbruck Medical University
> > Biocenter - Institute of Bioinformatics
> > Innrain 80, 6020 Innsbruck
> > Phone: +43 512 9003 71402
> > Fax: +43 512 9003 73100
> > Email: dietmar.rie...@i-med.ac.at
> > Web: http://www.icbi.at



-- 
-- Skylar Thompson (skyl...@u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine