Load average gets high if the job spawns more processes/threads than allocated CPUs, but we haven't seen any problem with node instability. We did have to remove np_load_avg from load_thresholds, though, to keep our users from DoS'ing the cluster...
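
To illustrate why np_load_avg became a problem: as far as I know it is the host load average divided by the number of processors, and a queue instance goes into alarm state once that value reaches its load_thresholds entry. Below is a minimal Python sketch of that arithmetic. The function names are made up for illustration, the ">=" comparison is an assumption, and 1.75 is only the value commonly found in a default queue configuration, so check your own.

```python
import os


def normalized_load(load_avg: float, num_cpus: int) -> float:
    """Return the load average divided by the processor count.

    This mirrors what np_load_avg is understood to report; it is an
    illustration, not Grid Engine code.
    """
    if num_cpus <= 0:
        raise ValueError("num_cpus must be positive")
    return load_avg / num_cpus


def exceeds_threshold(load_avg: float, num_cpus: int, threshold: float = 1.75) -> bool:
    """Return True if the normalized load reaches the threshold.

    The default of 1.75 and the >= comparison are assumptions; check
    load_thresholds in your own queue configuration.
    """
    return normalized_load(load_avg, num_cpus) >= threshold


def main() -> None:
    # Report the figures for the local host (Unix only: os.getloadavg).
    one_minute, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"load average (1 min): {one_minute:.2f}")
    print(f"processors:           {cpus}")
    print(f"normalized load:      {normalized_load(one_minute, cpus):.2f}")
    print(f"over threshold:       {exceeds_threshold(one_minute, cpus)}")


if __name__ == "__main__":
    main()
```

For example, on a 16-core host where 16 single-slot jobs each run 4 busy threads, the load average approaches 64 and the normalized value is 4.0, so the queue instance would stop accepting jobs even though every slot was scheduled legitimately.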

On Thu, Aug 29, 2019 at 05:27:36AM -0400, Mike Serkov wrote:
> Also, something to keep in mind: cgroups will not solve this issue
> completely. It is just affinity enforcement. If the job spawns multiple
> threads and they are all active, the load average will still grow, along
> with some other side effects, regardless of the affinity setting. On big
> SMP boxes it may actually cause more instability. Anyway, jobs should be
> configured to use exactly the number of threads they request, and that
> should be monitored.
>
> Best regards,
> Mikhail Serkov
>
> > On Aug 29, 2019, at 4:16 AM, Ondrej Valousek
> > <ondrej.valou...@adestotech.com> wrote:
> >
> > Also a quick note: cgroups is the way to _enforce_ CPU affinity.
> > For the vast majority of jobs, I would say just a simple taskset
> > configuration (i.e. something like "-l binding linear") would do
> > as well.
> >
> >
> > From: Dietmar Rieder <dietmar.rie...@i-med.ac.at>
> > Sent: Thursday, August 29, 2019 9:37 AM
> > To: users@gridengine.org; Ondrej Valousek <ondrej.valou...@adestotech.com>;
> > users <users@gridengine.org>
> > Subject: Re: [gridengine users] limit CPU/slot resource to the number of
> > reserved slots
> >
> > Great, thanks so much!
> >
> > Dietmar
> >
> > On 29 August 2019 09:05:35 CEST, Ondrej Valousek
> > <ondrej.valou...@adestotech.com> wrote:
> > Nope,
> > SoGE (as of 8.1.9) supports CGROUPS w/o any code changes, just add
> > "USE_CGROUPS=yes" to the exec parameter list to make shepherd use
> > CGroup saveset controller.
> > My patch only extends it to support systemd and hence the possibility to
> > hard enforce memory/cpu limits, etc.
> > Hth,
> > Ondrej
> >
> > From: Daniel Povey <dpo...@gmail.com>
> > Sent: Monday, August 26, 2019 10:12 PM
> > To: Dietmar Rieder <dietmar.rie...@i-med.ac.at>; Ondrej Valousek
> > <ondrej.valou...@adestotech.com>; users <users@gridengine.org>
> > Subject: Re: [gridengine users] limit CPU/slot resource to the number of
> > reserved slots
> >
> > I don't think it's supported in Son of GridEngine. Ondrej Valousek (cc'd)
> > described in the first thread here
> > http://arc.liv.ac.uk/pipermail/sge-discuss/2019-August/thread.html
> > how he was able to implement it, but it required code changes, i.e. you
> > would need to figure out how to build and install SGE from source, which is
> > a task in itself.
> >
> > Dan
> >
> >
> > On Mon, Aug 26, 2019 at 12:46 PM Dietmar Rieder
> > <dietmar.rie...@i-med.ac.at> wrote:
> > Hi,
> >
> > thanks for your reply. This sounds promising.
> > We are using Son of Grid Engine though. Can you point me to the right
> > docs to get cgroup enabled in the exec host (CentOS 7). I must admit I
> > have no experience with cgroups.
> >
> > Thanks again
> > Dietmar
> >
> > On 8/26/19 4:03 PM, Skylar Thompson wrote:
> > > At least for UGE, you will want to use the CPU set integration, which will
> > > assign the job to a cgroup that has one CPU per requested slot. Once you
> > > have cgroups enabled in the exec host OS, you can then set these options
> > > in sge_conf:
> > >
> > > cgroup_path=/cgroup
> > > cpuset=1
> > >
> > > You can use this mechanism to have the m_mem_free request enforced as
> > > well.
> > >
> > > On Mon, Aug 26, 2019 at 02:15:22PM +0200, Dietmar Rieder wrote:
> > >> Hi,
> > >>
> > >> may be this is a stupid question, but I'd like to limit the used/usable
> > >> number of cores to the number of slots that were reserved for a job.
> > >>
> > >> We often see that people reserve 1 slot, e.g. "qsub -pe smp 1 [...]"
> > >> but their program is then running in parallel on multiple cores. How can
> > >> this be prevented?
> > >> Is it possible that, with only one slot reserved, a
> > >> process cannot utilize more than this?
> > >>
> > >> I was told that this should be possible in Slurm (which we don't have,
> > >> and to which we don't want to switch currently).
> > >>
> > >> Thanks
> > >> Dietmar
> > >
> >
> >
> > --
> > _________________________________________
> > D i e t m a r R i e d e r, Mag.Dr.
> > Innsbruck Medical University
> > Biocenter - Institute of Bioinformatics
> > Email: dietmar.rie...@i-med.ac.at
> > Web: http://www.icbi.at
> >
> >
> > _______________________________________________
> > users mailing list
> > users@gridengine.org
> > https://gridengine.org/mailman/listinfo/users
> >
> > --
> > D i e t m a r R i e d e r, Mag.Dr.
> > Innsbruck Medical University
> > Biocenter - Institute of Bioinformatics
> > Innrain 80, 6020 Innsbruck
> > Phone: +43 512 9003 71402
> > Fax: +43 512 9003 73100
> > Email: dietmar.rie...@i-med.ac.at
> > Web: http://www.icbi.at

--
-- Skylar Thompson (skyl...@u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine