On 10.03.2011 at 22:18, Lane Schwartz wrote:

> Some of the jobs are array jobs, others are individually submitted. We
> don't use resource reservations at all.
As the resource consumption changes over time, maybe the load sensor takes over because the jobs consume more memory than granted. For these default complexes, the tighter of the two restrictions, from either the consumable or the load sensor, is used. Can you check what's really left on the machines with:

$ qhost -F mem_free

-- Reuti

> On Thu, Mar 10, 2011 at 4:03 PM, Reuti <[email protected]> wrote:
>> Hi,
>>
>> On 10.03.2011 at 20:04, Lane Schwartz wrote:
>>
>>> Lately I've noticed that many of my jobs take much longer than
>>> expected (sometimes up to half an hour) to go from pending to
>>> running, even when there are numerous nodes with sufficient
>>> resources available. Right now, for example, I've got a couple
>>> dozen jobs pending, and 38 nodes where no jobs are running.
>>>
>>> I was wondering if anyone might be able to shed some light on why
>>> this might be. As I said, there are plenty of nodes with sufficient
>>> resources available to run the pending jobs, but they sometimes
>>> take a long time to go from pending to running.
>>>
>>> For reference, mem_free is set to consumable, and my jobs use the
>>> default value of 4GB for their requested mem_free. There are some
>>> other users' jobs which request more memory than that.
>>>
>>> The only clue I've been able to find is from examining the qmaster
>>> messages log file. It has lots of lines that look like the errors
>>> below:
>>>
>>> 03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded:
>>> capacity is 66765959168.262146, job 495795 requests additional
>>> 68719476736.000000
>>> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
>>> resources have changed during a scheduling run
>>> 03/10/2011 13:56:00|worker|t3n2|W|Skipping 108 remaining orders
>>> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
>>> resources have changed during a scheduling run
>>
>> - are these serial or parallel jobs?
>>
>> - do you use resource reservation for the mem_free request, as
>>   otherwise smaller ones with a lower request may slip in all the
>>   time?
>>
>> -- Reuti
>>
>>> Any tips or pointers would be appreciated.
>>>
>>> Thanks,
>>> Lane
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away. It is time to go elsewhere. The best thing about space travel
> is that it made it possible to go elsewhere.
>   -- R.A. Heinlein, "Time Enough For Love"
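The byte counts in the quoted qmaster log are hard to eyeball; converting them to GiB makes the mismatch concrete. A quick sketch using the exact figures from the log:

```shell
#!/bin/sh
# Figures copied from the qmaster log quoted above, in bytes.
capacity=66765959168      # mem_free capacity the host currently reports
requested=68719476736     # what job 495795 requests (exactly 64 GiB)

# Convert both to GiB (1 GiB = 1073741824 bytes) and print them.
awk -v c="$capacity" -v r="$requested" 'BEGIN {
    printf "capacity:  %.2f GiB\n", c / 1073741824
    printf "requested: %.2f GiB\n", r / 1073741824
}'
```

So the node has roughly 62.18 GiB of mem_free left while the job asks for exactly 64 GiB, consistent with the "resources have changed during a scheduling run" messages: the scheduler picks the host, but the load value then vetoes the dispatch.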
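To follow up on the `qhost -F mem_free` suggestion across many hosts, a small filter can flag nodes that could not hold a given request. This is only a sketch: it assumes `qhost -F mem_free` prints hostnames in column 1 and indented attribute lines of the form `hl:mem_free=62.180G` (or `hc:` for the consumable) beneath each host, and the `list_low_mem` name is made up for illustration.

```shell
#!/bin/sh
# list_low_mem: hypothetical helper listing hosts whose reported
# mem_free is below a threshold in GiB (default 64, i.e. the 64 GiB
# request from the log above).
threshold_gib=${1:-64}

qhost -F mem_free | awk -v min="$threshold_gib" '
    /^[^ ]/ { host = $1 }                    # hostname lines start in column 1
    /mem_free=/ {
        split($0, kv, "=")                   # e.g. "    hl:mem_free=62.180G"
        val = kv[2]
        unit = substr(val, length(val), 1)   # trailing unit suffix, G or M
        num = substr(val, 1, length(val) - 1)
        if (unit == "M") num /= 1024         # normalize MiB values to GiB
        if (num + 0 < min)
            printf "%s: only %.2f GiB mem_free\n", host, num
    }'
```

Running it with the threshold set to the pending job's request shows at a glance which hosts the scheduler would have to skip.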
