On Mon, 14 Mar 2011, Esztermann, Ansgar wrote:

Hi List,

can anyone give me a hint as to what scheduler performance to expect, and what would typically be the bottleneck? We have 6.2u5 running here, and one scheduler run takes about 5 minutes (with 600 jobs and 800 nodes).

From what I've seen with params monitor=1 and strace, the scheduler[1] has a list of running jobs almost instantaneously, then spends about four minutes at 100% CPU writing nothing to common/schedule (and making no system calls other than futex() and write() to stdout). During that time, it spews a lot of diagnostic messages about resource utilization to stdout (see below[2]). Finally, reservations are made (they take about four seconds each, which is not exactly fast, but quite manageable), and jobs are started (very quickly).

Is such a long delay between the :RUNNING: and :RESERVING: lines normal? I had thought our disk might be at fault here, since /var is often maxed out in terms of bandwidth. But then again, the thread at 100% CPU doesn't make any read() calls.
...

You're running at a bigger scale than we are (~420 hosts) but...

I/O on the $SGE_ROOT directory can certainly cause the problems you report. I would take a look at what your disks are doing with "iostat -x" if I were you. You might see a large number of small I/O requests: we certainly did.

* If $SGE_ROOT is not local to the qmaster, MONITOR=1 can itself generate a large number of small I/Os and be a significant contributor to the problem. Replacing common/schedule with a symlink to a disk local to the qmaster resolved many "slow running" problems for us.

* Do your compute nodes spool to local disk, or to an NFS share?
("qconf -sconf | grep execd_spool_dir")

* Is $SGE_ROOT local to the qmaster?

* Are you using classic or BDB spooling?
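
As a rough sketch of two of the points above (which spooling method is in use, and moving common/schedule onto a local disk), something along these lines might do. The helper names spooling_method and relocate_to_local and the directory /var/local/sge are made up for illustration, and the paths in the usage comments assume the usual layout under $SGE_ROOT/$SGE_CELL/common. It is probably safest to do the move while the qmaster is stopped.

```shell
#!/bin/sh
# Sketch only: helper names and /var/local/sge are hypothetical.

# Print the value of the "spooling_method" key from a bootstrap file.
# Assumes whitespace-separated "key value" lines.
spooling_method() {
    awk '$1 == "spooling_method" { print $2 }' "$1"
}

# Move a file onto another directory (e.g. a disk local to the qmaster)
# and leave a symlink at the old path.
relocate_to_local() {
    src=$1        # file to move, e.g. .../common/schedule
    local_dir=$2  # target directory on local disk (caller's choice)

    # Nothing to do if the path is already a symlink.
    if [ -L "$src" ]; then
        echo "already a symlink: $src" >&2
        return 0
    fi

    mkdir -p "$local_dir" || return 1
    dest=$local_dir/$(basename "$src")

    if [ -e "$src" ]; then
        mv "$src" "$dest" || return 1
    else
        # No file yet: create an empty one for the link to point at.
        : > "$dest" || return 1
    fi
    ln -s "$dest" "$src"
}

# Example usage (commented out; adjust paths to your installation):
# spooling_method "$SGE_ROOT/$SGE_CELL/common/bootstrap"
# relocate_to_local "$SGE_ROOT/$SGE_CELL/common/schedule" /var/local/sge
```

The symlink keeps the original path valid, so nothing that reads common/schedule needs to be reconfigured.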

Mark
--
-----------------------------------------------------------------
Mark Dixon                       Email    : [email protected]
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------