Re: [gridengine users] Long delay starting jobs, even when compute nodes are empty

Reuti Thu, 10 Mar 2011 13:04:23 -0800

Hi,

On 10.03.2011 at 20:04, Lane Schwartz wrote:


> Lately I've noticed that many of my jobs take much longer than
> expected (sometimes up to half an hour) to go from pending to
> running, even when there are numerous nodes with sufficient resources
> available. Right now, for example, I've got a couple dozen jobs in
> pending, and 38 nodes where no jobs are running.
> 
> I was wondering if anyone might be able to shed some light on why this
> might be. As I said, there are plenty of nodes with sufficient
> resources available to run the pending jobs, but they sometimes take a
> long time to go from pending to running.
> 
> For reference, mem_free is set to consumable, and my jobs use the
> default value of 4GB for their requested mem_free. There are some
> other users' jobs which request more memory than that.
> 
> The only clue I've been able to find is from examining the qmaster
> messages log file. It has lots of lines that look like the errors
> below:
> 
> 03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded:
> capacity is 66765959168.262146, job 495795 requests additional
> 68719476736.000000
> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
> resources have changed during a scheduling run
> 03/10/2011 13:56:00|worker|t3n2|W|Skipping 108 remaining orders
> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
> resources have changed during a scheduling run

- Are these serial or parallel jobs?

- Do you use resource reservation for the mem_free request? Otherwise,
smaller jobs with a lower request may slip in all the time.
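
As a side note on the first quoted log line: the two numbers are easier to compare in GiB. The snippet below is only a rough sketch for doing that conversion. It assumes the values are bytes, and the regular expression and the helper name parse_exceeded are hypothetical, written against the single message quoted above rather than any documented log format.

```python
import re

GIB = 1024 ** 3  # bytes per GiB (assumes the logged values are bytes)

# Message copied from the quoted qmaster log, joined onto one line.
LOG_LINE = (
    '03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded: '
    "capacity is 66765959168.262146, job 495795 requests additional "
    "68719476736.000000"
)

# Hypothetical pattern: matches only the message wording shown above.
PATTERN = re.compile(
    r'host load value "(?P<resource>\w+)" exceeded: '
    r"capacity is (?P<capacity>[\d.]+), "
    r"job (?P<job>\d+) requests additional (?P<request>[\d.]+)"
)


def parse_exceeded(line):
    """Return capacity, request and shortfall in GiB, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    capacity = float(m.group("capacity"))
    request = float(m.group("request"))
    return {
        "resource": m.group("resource"),
        "job": int(m.group("job")),
        "capacity_gib": capacity / GIB,
        "request_gib": request / GIB,
        "shortfall_gib": (request - capacity) / GIB,
    }


if __name__ == "__main__":
    info = parse_exceeded(LOG_LINE)
    print(
        "job {job}: requests {request_gib:.2f} GiB of {resource}, "
        "capacity {capacity_gib:.2f} GiB, short by {shortfall_gib:.2f} GiB".format(**info)
    )
```

Under that assumption, job 495795 asks for 64 GiB in total, a little more than the roughly 62 GiB the log reports as capacity. The log alone does not say whether that is one large request or a per-slot request summed over several slots, which is why the serial/parallel question matters.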

-- Reuti


> Any tips or pointers would be appreciated.
> 
> Thanks,
> Lane
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


