Re: [gridengine users] Long delay starting jobs, even when compute nodes are empty

Reuti Fri, 11 Mar 2011 06:12:23 -0800

On 11.03.2011 at 14:58, Reuti wrote:

> On 11.03.2011 at 14:49, Lane Schwartz wrote:
> 
>> Rayson,
>> 
>> Thanks for the pointer. In the qmon scheduler configuration, I have
>> "Job Scheduling Information" set to true. I assume that's the same
>> setting you're referring to?
>> 
>> With this setting enabled, I still don't get very much info. When I
>> run qstat -j on my jobs, the only thing it tells me is that a queue
>> instance for a particular node is dropped because that node is
>> disabled.
> 
> Because "disabled"? Did someone use `qmon` to disable the node or set up any 
> calendar?


That should read `qmod`, but both can be used.

-- Reuti


> 
> -- Reuti
> 
> 
>> Thanks,
>> Lane
>> 
>> On Thu, Mar 10, 2011 at 4:28 PM, Rayson Ho <[email protected]> wrote:
>>> Turn on "schedd_job_info", and run qstat -j to see why the scheduler
>>> is not assigning jobs.
>>> 
>>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
>>> http://gridscheduler.sourceforge.net/howto/troubleshooting.html
>>> 
>>> Rayson
>>> 
>>> 
>>> 
>>> On Thu, Mar 10, 2011 at 2:04 PM, Lane Schwartz <[email protected]> wrote:
>>>> Hi,
>>>> 
>>>> Lately I've noticed that many of my jobs take much longer than
>>>> expected (sometimes up to half an hour) to go from pending to
>>>> running, even when there are numerous nodes with sufficient resources
>>>> available. Right now, for example, I've got a couple dozen jobs in
>>>> pending, and 38 nodes where no jobs are running.
>>>> 
>>>> I was wondering if anyone might be able to shed some light on why this
>>>> might be. As I said, there are plenty of nodes with sufficient
>>>> resources available to run the pending jobs, but they sometimes take a
>>>> long time to go from pending to running.
>>>> 
>>>> For reference, mem_free is set to consumable, and my jobs use the
>>>> default value of 4GB for their requested mem_free. There are some
>>>> other users' jobs which request more memory than that.
>>>> 
>>>> The only clue I've been able to find is from examining the qmaster
>>>> messages log file. It has lots of lines that look like the errors
>>>> below:
>>>> 
>>>> 03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded:
>>>> capacity is 66765959168.262146, job 495795 requests additional
>>>> 68719476736.000000
>>>> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
>>>> resources have changed during a scheduling run
>>>> 03/10/2011 13:56:00|worker|t3n2|W|Skipping 108 remaining orders
>>>> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
>>>> resources have changed during a scheduling run
>>>> 
>>>> Any tips or pointers would be appreciated.
>>>> 
>>>> Thanks,
>>>> Lane
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> https://gridengine.org/mailman/listinfo/users
>>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> When a place gets crowded enough to require ID's, social collapse is not
>> far away.  It is time to go elsewhere.  The best thing about space travel
>> is that it made it possible to go elsewhere.
>>                -- R.A. Heinlein, "Time Enough For Love"
>> 
>> 
> 
> 

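A side note on the qmaster log lines quoted above: the two numbers in the "mem_free exceeded" message are easier to compare once converted to GiB. The following is only a minimal sketch, assuming the values are plain byte counts; the function name `parse_exceeded` and the regular expression are illustrative and not part of Grid Engine. Under that assumption, job 495795 asks for exactly 64 GiB, while the reported capacity is about 62.2 GiB.

```python
import re

# Log line copied from the quoted qmaster messages file, with the
# email line wraps rejoined into a single string.
LOG_LINE = (
    '03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded: '
    'capacity is 66765959168.262146, job 495795 requests additional '
    '68719476736.000000'
)

# Hypothetical pattern written for this one message format only.
PATTERN = re.compile(
    r'host load value "(?P<resource>[^"]+)" exceeded: '
    r'capacity is (?P<capacity>[\d.]+), '
    r'job (?P<job>\d+) requests additional (?P<request>[\d.]+)'
)

GIB = 1024 ** 3  # assumption: the logged values are bytes


def parse_exceeded(line):
    """Return resource, job id and sizes in GiB, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    capacity = float(match.group("capacity"))
    request = float(match.group("request"))
    return {
        "resource": match.group("resource"),
        "job": int(match.group("job")),
        "capacity_gib": capacity / GIB,
        "request_gib": request / GIB,
        "shortfall_gib": (request - capacity) / GIB,
    }


info = parse_exceeded(LOG_LINE)
print(
    f"job {info['job']}: requests {info['request_gib']:.2f} GiB of "
    f"{info['resource']}, capacity {info['capacity_gib']:.2f} GiB, "
    f"short by {info['shortfall_gib']:.2f} GiB"
)
```

If that reading is right, the request in the log is far larger than the 4GB default mentioned in the original post, so it may be worth checking which job 495795 is and what it requests.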

