Turn on "schedd_job_info" in the scheduler configuration, then run qstat -j on one of the pending jobs to see why the scheduler is not dispatching it.

http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
http://gridscheduler.sourceforge.net/howto/troubleshooting.html

Rayson

On Thu, Mar 10, 2011 at 2:04 PM, Lane Schwartz <[email protected]> wrote:
> Hi,
>
> Lately I've noticed that many of my jobs take much longer than
> expected (sometimes up to half an hour) to go from pending to
> running, even when there are numerous nodes with sufficient resources
> available. Right now, for example, I've got a couple dozen jobs in
> pending, and 38 nodes where no jobs are running.
>
> I was wondering if anyone might be able to shed some light on why this
> might be. As I said, there are plenty of nodes with sufficient
> resources available to run the pending jobs, but they sometimes take a
> long time to go from pending to running.
>
> For reference, mem_free is set to consumable, and my jobs use the
> default value of 4GB for their requested mem_free. There are some
> other users' jobs which request more memory than that.
>
> The only clue I've been able to find is from examining the qmaster
> messages log file. It has lots of lines that look like the errors
> below:
>
> 03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded:
> capacity is 66765959168.262146, job 495795 requests additional
> 68719476736.000000
> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
> resources have changed during a scheduling run
> 03/10/2011 13:56:00|worker|t3n2|W|Skipping 108 remaining orders
> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
> resources have changed during a scheduling run
>
> Any tips or pointers would be appreciated.
>
> Thanks,
> Lane
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
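
As a side note on the "mem_free exceeded" line in the quoted log: the numbers in it are bytes, and putting them side by side shows that job 495795 asks for 64 GiB while the host reports slightly less than that. The following is only a rough Python sketch for pulling those figures out of such a line. The regular expression is an assumption based on the single message quoted above (it is not taken from Grid Engine documentation), and parse_exceeded is a hypothetical helper name.

```python
import re

# Assumed message layout, inferred only from the log line quoted in this thread.
EXCEEDED = re.compile(
    r'host load value "(?P<resource>[^"]+)" exceeded:\s*'
    r'capacity is (?P<capacity>[\d.]+),\s*'
    r'job (?P<job>\d+) requests additional (?P<request>[\d.]+)'
)

GIB = 1024 ** 3  # bytes per GiB


def parse_exceeded(message):
    """Return the figures from an 'exceeded' message as a dict, or None if no match."""
    m = EXCEEDED.search(message)
    if m is None:
        return None
    capacity = float(m.group("capacity"))
    request = float(m.group("request"))
    return {
        "resource": m.group("resource"),
        "job": int(m.group("job")),
        "capacity": capacity,             # bytes the host reports
        "request": request,               # bytes the job asks for
        "shortfall": request - capacity,  # bytes missing on this host
    }


# The first log line from the quoted message, with its wrapped pieces rejoined.
line = ('03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded: '
        'capacity is 66765959168.262146, job 495795 requests additional '
        '68719476736.000000')

info = parse_exceeded(line)
print(info["job"], info["request"] / GIB, round(info["shortfall"] / GIB, 2))
```

The log line alone does not say why this job requests 64 GiB rather than the 4GB default mentioned in the question; the qstat -j output for that job should show what it actually requested.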
