On Fri, Mar 11, 2011 at 8:30 AM, Reuti <[email protected]> wrote:
> On 10.03.2011 at 22:18, Lane Schwartz wrote:
>
>> Some of the jobs are array jobs, others are individually submitted. We
>> don't use resource reservations at all.
>
> As the resource consumption is changing, maybe the load sensor takes over
> and the jobs consume more memory than granted. For these default complexes
> the tighter restriction, whether from the consumable or the load sensor,
> will be used.
>
> Can you check with:
>
> $ qhost -F mem_free
Reuti,

At the moment I have a couple dozen jobs queued and waiting, each of which
requested mem_free=3072M.

There's one machine with no jobs running on it. When I run qhost -F mem_free,
that machine reports:

    hc:mem_free=29.000G
    LOAD: 0.33  MEM_TOT: 31.4G  MEM_USE: 971.7M
    SWAP_TOT: 59.6G  SWAP_USE: 506.5M

Lane

> what's really left on the machines.
>
>> On Thu, Mar 10, 2011 at 4:03 PM, Reuti <[email protected]> wrote:
>>> Hi,
>>>
>>> On 10.03.2011 at 20:04, Lane Schwartz wrote:
>>>
>>>> Lately I've noticed that many of my jobs take much longer than
>>>> expected (sometimes up to half an hour) to go from pending to
>>>> running, even when there are numerous nodes with sufficient resources
>>>> available. Right now, for example, I've got a couple dozen jobs in
>>>> pending, and 38 nodes where no jobs are running.
>>>>
>>>> I was wondering if anyone might be able to shed some light on why this
>>>> might be. As I said, there are plenty of nodes with sufficient
>>>> resources available to run the pending jobs, but they sometimes take a
>>>> long time to go from pending to running.
>>>>
>>>> For reference, mem_free is set to consumable, and my jobs use the
>>>> default value of 4GB for their requested mem_free. There are some
>>>> other users' jobs which request more memory than that.
>>>>
>>>> The only clue I've been able to find is from examining the qmaster
>>>> messages log file. It has lots of lines that look like the errors
>>>> below:
>>>>
>>>> 03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded:
>>>> capacity is 66765959168.262146, job 495795 requests additional
>>>> 68719476736.000000
>>>> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
>>>> resources have changed during a scheduling run
>>>> 03/10/2011 13:56:00|worker|t3n2|W|Skipping 108 remaining orders
>>>> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
>>>> resources have changed during a scheduling run
>>>
>>> - are these serial or parallel jobs?
>>>
>>> - do you use resource reservation for the mem_free request, as otherwise
>>> smaller ones with a lower request may slip in all the time?
>>>
>>> -- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
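[Editor's note: Reuti's question refers to Grid Engine's resource reservation feature, requested per job with `qsub -R y`. A minimal sketch of what that might look like — the script name and the 3072M figure are illustrative, taken from Lane's stated request; note that reservations take effect only if max_reservation is non-zero in the scheduler configuration:]

```
# Request a reservation so a large mem_free job is not perpetually
# starved by smaller jobs backfilling ahead of it.
# "my_job.sh" is a hypothetical job script.
qsub -R y -l mem_free=3072M my_job.sh

# Reservations are honored only when the scheduler allows them;
# inspect max_reservation (0 disables reservations):
qconf -ssconf | grep max_reservation
```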
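[Editor's note: converting the raw byte counts from the qmaster log excerpt above into GiB makes the mismatch concrete. This is a quick sketch in plain Python (nothing Grid Engine-specific); the two figures are copied verbatim from the log.]

```python
# Convert the byte values from the qmaster "mem_free exceeded" log line
# into GiB to see why the dispatch was rejected.
GIB = 2 ** 30  # bytes per GiB

capacity_bytes = 66765959168.262146  # "capacity is ..." (host load value for mem_free)
request_bytes = 68719476736.0        # "job 495795 requests additional ..."

capacity_gib = capacity_bytes / GIB
request_gib = request_bytes / GIB

print(f"capacity:  {capacity_gib:.2f} GiB")            # ~62.18 GiB
print(f"request:   {request_gib:.2f} GiB")             # exactly 64 GiB
print(f"shortfall: {request_gib - capacity_gib:.2f} GiB")
```

So the job's request (64 GiB) exceeds the load-sensor value (~62.2 GiB) by roughly 1.8 GiB, which matches Reuti's point that the tighter of the consumable and the load sensor wins.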
