Re: [gridengine users] sge_execd: corrupted mem or cpu at startup and buggy resource usage report

Korzennik, Sylvain Wed, 14 Oct 2015 19:21:20 -0700

1- qconf -sconf | grep gid_range returned
gid_range                    20000-20100


I changed it to
gid_range                    25000-29000

2- Indeed, we are using GIDs 20001-20023 in /etc/group, which fall inside the old range.
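For anyone wanting to repeat the check in point 2, here is a minimal sketch, assuming a single low-high gid_range and the standard colon-separated /etc/group layout. The function names and the sample group entries are made up for illustration; only the two ranges and the 20001/20023 GIDs come from this thread.

```python
def parse_gid_range(spec):
    """Parse a gid_range value such as '20000-20100' into (low, high)."""
    low, high = spec.strip().split("-")
    return int(low), int(high)


def groups_in_range(group_text, low, high):
    """Return (name, gid) for every group entry whose GID lies in [low, high]."""
    hits = []
    for line in group_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")  # name:password:gid:members
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        gid = int(fields[2])
        if low <= gid <= high:
            hits.append((fields[0], gid))
    return hits


# Hypothetical sample entries; on a real host, read /etc/group instead.
sample = """\
root:x:0:
projA:x:20001:alice,bob
projB:x:20023:carol
"""

old_hits = groups_in_range(sample, *parse_gid_range("20000-20100"))
new_hits = groups_in_range(sample, *parse_gid_range("25000-29000"))
print("old range conflicts:", old_hits)
print("new range conflicts:", new_hits)
```

An empty list for the new range is what one would hope to see before restarting the execd daemons.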

3- We have ~80 compute nodes with 12 to 64 CPUs (slots) each, for ~3000
CPUs/slots in total.
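To put the two ranges next to the node sizes, a rough back-of-the-envelope sketch follows. It assumes one job per slot, one additional GID per running job, and that all 23 GIDs from 20001 to 20023 are held by real groups; the helper names are hypothetical.

```python
def gid_range_size(spec):
    """Number of GIDs in an inclusive range such as '20000-20100'."""
    low, high = (int(part) for part in spec.split("-"))
    return high - low + 1


def free_gids(spec, taken_by_real_groups=0):
    """GIDs left for jobs once GIDs held by real groups are set aside."""
    return gid_range_size(spec) - taken_by_real_groups


MAX_SLOTS_PER_HOST = 64  # largest node size mentioned above

# Old range: 20000-20100, with 20001-20023 assumed to be real groups.
old_free = free_gids("20000-20100", taken_by_real_groups=23)
# New range: 25000-29000, assumed to contain no real groups.
new_free = free_gids("25000-29000")

for label, free in (("old", old_free), ("new", new_free)):
    spare = free - MAX_SLOTS_PER_HOST
    print(label, "range:", free, "free GIDs,", spare, "spare on a full 64-slot host")
```

Under these assumptions the old range still covers a full 64-slot node, but with little headroom before a GID has to be handed out again; the new range leaves far more.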

  Cheers,
    Sylvain
--


On Wed, Oct 14, 2015 at 3:56 PM, Reuti <[email protected]> wrote:

> Hi,
>
> On 14.10.2015 at 19:53, Korzennik, Sylvain wrote:
>
> > We found that after starting a large number of jobs (in the thousands),
> > approx. 1% of new jobs fail because they get marked, after 1-3 sec of
> > execution time, as having exceeded either the CPU time limit or the memory
> > limit. Neither condition is correct, as the jobs have barely started. My
> > guess is that the part of the code that keeps track of CPU time and/or
> > memory gets corrupted. We fix the problem by restarting the sge_execd
> > daemons on all the compute nodes.
> >
> > On a possibly related issue, I also discovered that the usage line
> > returned by qstat -j <jid>, i.e.:
> >
> > usage 1: cpu=91:18:14:27, mem=1818707.42063 GBs, io=1.35317, vmem=1.551G, maxvmem=1.553G
> >
> > also gets corrupted and is at times meaningless.
> >
> > In order to keep track of resource usage (esp. memory), I run qstat -j
> > <jid> on all the jobs in a specific set of queues every 5 minutes and log
> > the results. I found the following inconsistencies:
> >  - glitches in the value of cpu= (it should increase monotonically)
> >  - jumps in the value of mem=; at some point it drops by a large amount,
> >    as if a counter had overflowed
> >  - jumps in the value of maxvmem=; it starts high (10.89G), then drops
> >    to 1.553G, which does not make sense for a maximum. That drop appears
> >    at the same time as the mem= glitch
> >
> > Some jobs on our cluster run for 60 to 90 days. I also found
> > inconsistencies in the accounting file.
>
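The logged usage lines described above could be screened for such glitches automatically. This is only a sketch, assuming the usage line format quoted in this thread, a cpu= field of the form [[dd:]hh:]mm:ss, and binary (1024-based) size suffixes; the function names are hypothetical, and the first sample line is made up apart from the 10.89G value reported above.

```python
import re

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def cpu_seconds(text):
    """Convert '[[dd:]hh:]mm:ss' into seconds."""
    parts = [int(p) for p in text.split(":")]
    weights = [1, 60, 3600, 86400]  # seconds, minutes, hours, days
    return sum(p * w for p, w in zip(reversed(parts), weights))


def size_bytes(text):
    """Convert a value such as '1.553G' into bytes (None if unparseable)."""
    m = re.fullmatch(r"([\d.]+)([KMGT]?)B?", text)
    if m is None:
        return None
    return float(m.group(1)) * UNITS.get(m.group(2), 1)


def parse_usage(line):
    """Extract cpu (seconds) and maxvmem (bytes) from one usage line."""
    fields = dict(re.findall(r"(\w+)=([^,\s]+)", line))
    return {"cpu": cpu_seconds(fields["cpu"]),
            "maxvmem": size_bytes(fields["maxvmem"])}


def find_regressions(samples):
    """List (index, key, previous, current) where a value went down."""
    issues = []
    for i in range(1, len(samples)):
        for key in ("cpu", "maxvmem"):
            prev, cur = samples[i - 1][key], samples[i][key]
            if prev is not None and cur is not None and cur < prev:
                issues.append((i, key, prev, cur))
    return issues


log = [
    # Made-up earlier sample, using the 10.89G maxvmem mentioned above.
    "usage 1: cpu=91:18:09:27, mem=1818000.0 GBs, io=1.35, vmem=1.551G, maxvmem=10.89G",
    # The line quoted in this thread.
    "usage 1: cpu=91:18:14:27, mem=1818707.42063 GBs, io=1.35317, vmem=1.551G, maxvmem=1.553G",
]
samples = [parse_usage(line) for line in log]
issues = find_regressions(samples)
for index, key, prev, cur in issues:
    print("sample", index, key, "dropped from", prev, "to", cur)
```

Both cpu= and maxvmem= should never decrease over a job's lifetime, so any hit from such a check points at the accounting rather than at the job.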
> Maybe the additional group ID, which SGE uses to keep track of the
> resource consumption of jobs, is getting reused too quickly. What range did
> you specify when you installed SGE? How many jobs run at the same time on
> each exechost?
>
> $ qconf -sconf | grep gid_range
>
> Are real groups occupying the same range, and do processes outside of SGE
> use them too?
>
> -- Reuti
>
>
> >
> >
> > We run OGS/GE 2011.11p1 under Rocks 6.1.1. We had the same problem
> > (sge_execd) with Rocks 5.x.
> >
> > Any pointers/hints/suggestions/etc. welcome.
> >
> >   Cheers,
> >     Sylvain
> > --
> >
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
>
>

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
