We found that after starting a large number of jobs (in the thousands),
approximately 1% of new jobs fail because they get marked, after 1-3 seconds
of execution time, as having exceeded either the CPU time limit or the
memory limit. Neither condition is correct, as the jobs have barely started.
My guess is that the part of the code that keeps track of CPU time and/or
memory gets corrupted. We fix the problem by restarting the sge_execd
daemons on all the compute nodes.
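
For what it is worth, the workaround amounts to something like the following
(a minimal sketch in Python; the compute node names and the init-script path
are assumptions about our Rocks setup, not anything dictated by SGE):

  import subprocess

  # Restart sge_execd on every compute node (the workaround described above).
  compute_nodes = ["compute-0-%d" % i for i in range(32)]  # assumed Rocks-style hostnames
  restart_cmd = "/etc/init.d/sgeexecd restart"             # assumed init-script path

  for node in compute_nodes:
      # requires passwordless root ssh from the head node
      subprocess.call(["ssh", node, restart_cmd])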

On a possibly related issue, I also discovered that the usage line returned
by qstat -j <jid>, i.e.:

usage 1: cpu=91:18:14:27, mem=1818707.42063 GBs, io=1.35317, vmem=1.551G,
maxvmem=1.553G

also gets corrupted and is at times meaningless.
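
For reference, this is roughly how I pull numbers out of that line so it can
be logged and compared across samples (a sketch; the field layout is assumed
to match the example above):

  import re

  def cpu_to_seconds(value):
      # cpu= is reported as [days:]hours:minutes:seconds
      parts = [int(p) for p in value.split(":")]
      if len(parts) == 3:
          parts.insert(0, 0)                 # no days field
      d, h, m, s = parts
      return ((d * 24 + h) * 60 + m) * 60 + s

  def to_float(value):
      # drop a trailing unit such as G, GB or GBs
      m = re.match(r"[-+0-9.eE]+", value)
      return float(m.group(0)) if m else float("nan")

  def parse_usage(line):
      # turn "usage 1: cpu=..., mem=... GBs, ..." into a dict of numbers
      fields = {}
      for key, value in re.findall(r"(\w+)=([^,]+)", line):
          if key == "cpu":
              fields["cpu_seconds"] = cpu_to_seconds(value.strip())
          else:
              fields[key] = to_float(value.strip())
      return fields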

To keep track of resource usage (especially memory), I run qstat -j <jid>
on all the jobs in a specific set of queues every 5 minutes and log the
results (a sketch of that loop is included after the list below). I found
the following inconsistencies:
 - glitches in the value of cpu= (it should increase monotonically)
 - jumps in the value of mem=: at some point it drops by a large amount, as
if a counter had overflowed
 - jumps in the value of maxvmem=: again, it starts high (10.89G), then
drops to 1.553G, which does not make sense. That drop appears at the same
time as the mem= glitch
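
The logging/checking loop itself is nothing fancy; roughly (a sketch that
assumes the parse_usage() helper above and a hypothetical get_job_ids()
that returns the job IDs currently in the queues of interest):

  import subprocess, time

  def usage_line(jid):
      # return the "usage" line of "qstat -j <jid>", or None if the job is gone
      try:
          out = subprocess.check_output(["qstat", "-j", str(jid)])
      except subprocess.CalledProcessError:
          return None
      for line in out.decode("utf-8", "replace").splitlines():
          if line.startswith("usage"):
              return line
      return None

  last = {}                                  # jid -> fields seen on the previous pass
  while True:
      for jid in get_job_ids():              # hypothetical helper, see note above
          line = usage_line(jid)
          if line is None:
              continue
          cur = parse_usage(line)
          prev = last.get(jid)
          if prev:
              # cpu should only go up; mem and maxvmem should never shrink
              for key in ("cpu_seconds", "mem", "maxvmem"):
                  if key in cur and key in prev and cur[key] < prev[key]:
                      print("job %s: %s went backwards: %s -> %s"
                            % (jid, key, prev[key], cur[key]))
          last[jid] = cur
      time.sleep(300)                        # poll every 5 minutes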

Some jobs on our cluster run for 60 to 90 days. I also found
inconsistencies in the accounting file.


We run OGS/GE 2011.11p1 under Rocks 6.1.1. We had the same sge_execd
problem with Rocks 5.x.

Any pointers/hints/suggestions/etc. are welcome.

  Cheers,
    Sylvain