Hi,

Lately I've noticed that many of my jobs take much longer than expected (sometimes up to half an hour) to go from pending to running, even when there are numerous nodes with sufficient resources available. Right now, for example, I've got a couple dozen jobs in pending, and 38 nodes where no jobs are running.
I was wondering if anyone could shed some light on why this happens. For reference, mem_free is set to consumable, and my jobs use the default value of 4GB for their requested mem_free. Some other users' jobs request more memory than that.

The only clue I've been able to find comes from the qmaster messages log file. It has lots of lines that look like the errors below:

03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded: capacity is 66765959168.262146, job 495795 requests additional 68719476736.000000
03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as resources have changed during a scheduling run
03/10/2011 13:56:00|worker|t3n2|W|Skipping 108 remaining orders
03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as resources have changed during a scheduling run

Any tips or pointers would be appreciated.

Thanks,
Lane

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
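
P.S. In case it helps anyone reading the numbers in those log lines: the values are in bytes, so the request in the first error line works out to exactly 64 GiB against roughly 62.2 GiB of reported mem_free capacity. Below is a minimal Python sketch for pulling those figures out of the messages file. The function name parse_exceeded and the regular expression are my own and assume the line format matches the sample quoted above; I don't know whether this mismatch is actually what is delaying my jobs.

```python
import re

# Assumption: "host load value ... exceeded" lines follow the format of the
# sample line quoted in this post. The pattern is illustrative, not official.
PATTERN = re.compile(
    r'host load value "(?P<resource>[^"]+)" exceeded: '
    r'capacity is (?P<capacity>[\d.]+), '
    r'job (?P<job>\d+) requests additional (?P<request>[\d.]+)'
)

GIB = 1024 ** 3  # bytes per GiB


def parse_exceeded(line):
    """Return capacity and request in GiB for one log line, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    capacity = float(match.group("capacity"))
    request = float(match.group("request"))
    return {
        "resource": match.group("resource"),
        "job": int(match.group("job")),
        "capacity_gib": capacity / GIB,
        "request_gib": request / GIB,
        "shortfall_gib": (request - capacity) / GIB,
    }


sample = (
    '03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded: '
    'capacity is 66765959168.262146, job 495795 requests additional '
    '68719476736.000000'
)

info = parse_exceeded(sample)
print(
    f"job {info['job']}: requests {info['request_gib']:.2f} GiB of "
    f"{info['resource']}, capacity {info['capacity_gib']:.2f} GiB, "
    f"short by {info['shortfall_gib']:.2f} GiB"
)
```

Running it over every line of the messages file (and skipping the None results) would show which job IDs keep hitting this error and how far over capacity each request is.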
