Hi,

On 10.03.2011 at 20:04, Lane Schwartz wrote:
> Lately I've noticed that many of my jobs take much longer than
> expected (sometimes up to half an hour) to go from pending to
> running, even when there are numerous nodes with sufficient resources
> available. Right now, for example, I've got a couple dozen jobs in
> pending, and 38 nodes where no jobs are running.
>
> I was wondering if anyone might be able to shed some light on why this
> might be. As I said, there are plenty of nodes with sufficient
> resources available to run the pending jobs, but they sometimes take a
> long time to go from pending to running.
>
> For reference, mem_free is set to consumable, and my jobs use the
> default value of 4GB for their requested mem_free. There are some
> other users' jobs which request more memory than that.
>
> The only clue I've been able to find is from examining the qmaster
> messages log file. It has lots of lines that look like the errors
> below:
>
> 03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded:
> capacity is 66765959168.262146, job 495795 requests additional
> 68719476736.000000
> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
> resources have changed during a scheduling run
> 03/10/2011 13:56:00|worker|t3n2|W|Skipping 108 remaining orders
> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
> resources have changed during a scheduling run

- are these serial or parallel jobs?

- do you use resource reservation for the mem_free request, as otherwise smaller ones with a lower request may slip in all the time?

-- Reuti

> Any tips or pointers would be appreciated.
>
> Thanks,
> Lane
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
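
As a side note on the qmaster log lines quoted above, the two byte counts are easier to compare once converted to GiB. The short Python sketch below only does that arithmetic on the numbers from the log; the variable and function names are made up for illustration. The request of 68719476736 bytes is exactly 64 GiB, which is more than the reported mem_free capacity of the host. Whether that 64 GiB comes from a parallel job with several slots each requesting the 4GB default is an assumption the thread does not confirm.

```python
# Arithmetic on the two values from the quoted qmaster log line.
# Names here are illustrative only; they are not Grid Engine identifiers.
GIB = 1024 ** 3  # bytes per GiB


def to_gib(n_bytes: float) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / GIB


capacity_bytes = 66765959168.262146  # "capacity is ..." from the log
request_bytes = 68719476736.0        # "requests additional ..." from the log

capacity_gib = to_gib(capacity_bytes)
request_gib = to_gib(request_bytes)
shortfall_gib = request_gib - capacity_gib  # positive means the request does not fit

print(f"capacity : {capacity_gib:.2f} GiB")
print(f"request  : {request_gib:.2f} GiB")
print(f"shortfall: {shortfall_gib:.2f} GiB")
```

A positive shortfall matches the "exceeded" message in the log: the job asks for more mem_free than the host reports at the moment the order is applied.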
