By design, the scheduler (the scheduler thread in 6.2) is CPU-bound, while the qmaster (excluding the scheduler thread) is mostly I/O-bound (disk, network, etc.). If a thread sits at 100% CPU for 4 minutes with only a few random I/O operations, then it is very likely the scheduler thread.
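To check which thread is the busy one, per-thread CPU usage can be inspected from the shell. A minimal sketch, Linux-specific and assuming procps `ps`; the fallback to the current shell's PID is only so the snippet runs for demonstration when no qmaster is present:

```shell
# List per-thread CPU usage for a process, busiest threads first.
# Pass the sge_qmaster PID as $1, e.g.:  sh threads.sh "$(pgrep -x sge_qmaster)"
# (Linux-specific; falls back to the current shell's PID for demonstration.)
pid=${1:-$$}
ps -L -o tid,pcpu,comm -p "$pid" | sort -k2 -rn | head
```

If one TID shows ~100% CPU while the rest are near zero, that thread is the likely scheduler thread.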
Can you turn off qmaster profiling and turn on scheduler profiling? You can enable it by setting "PROFILE=TRUE" or "PROFILE=1" in the scheduler configuration. You will then get the time spent in each stage, something like:

PROF: job-order calculation took 0.020 s

You can get more info from doc/devel/rfe/profiling.txt if you have the source, or online at the Grid Scheduler homepage:
http://gridscheduler.svn.sourceforge.net/viewvc/gridscheduler/trunk/doc/devel/rfe/profiling.txt?revision=9&view=markup

Rayson

On Mon, Mar 14, 2011 at 1:25 PM, Esztermann, Ansgar <[email protected]> wrote:
>> I/O on the $SGE_ROOT directory can certainly cause the problems you
>> report. I would take a look at what your disks are doing with "iostat -x"
>> if I were you. You might see a large number of small I/O requests: we
>> certainly did.
>
> There are many small requests, but they seem to be on /var, not $SGE_ROOT.
> Of course, this might be caused by some process apart from SGE. Our cluster
> management software uses MySQL, and that's using /var as well.
>
>> * If $SGE_ROOT is not local to the qmaster, MONITOR=1 can itself generate
>> a large number of small I/Os and be a significant contributor to the
>> problem. Replacing common/schedule with a symlink to a disk local to the
>> qmaster resolved many "slow running" problems for us.
>>
>> * Do your compute nodes spool to local disk, or to an NFS share?
>> ("qconf -sconf | grep execd_spool_dir")
>
> Local.
>
>> * Is $SGE_ROOT local to the qmaster?
>
> I was about to write "yes", but that's not entirely true. It's on drbd.
>
>> * Are you using classic or BDB spooling?
>
> Classic.
>
>
> A.
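[The PROF lines described above end up in the qmaster messages file. A short pipeline can pull out the slowest stages; a sketch, where the messages path is an assumption (default classic-spooling layout) and may need adjusting for your site:]

```shell
# Show the slowest scheduler profiling stages recorded in the qmaster
# messages file, largest time first. The path below is an assumption
# (default classic-spooling layout); adjust $SGE_ROOT/$SGE_CELL as needed.
MSGS="$SGE_ROOT/$SGE_CELL/spool/qmaster/messages"
grep 'PROF:' "$MSGS" \
  | awk '{ for (i = 1; i <= NF; i++) if ($i == "took") print $(i+1), $0 }' \
  | sort -rn | head
```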
>
> --
> Ansgar Esztermann
> DV-Systemadministration
> Max-Planck-Institut für biophysikalische Chemie, Abteilung 105
>
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
> _______________________________________________
