On 11.03.2011 at 14:49, Lane Schwartz wrote:

> Rayson,
> 
> Thanks for the pointer. In the qmon scheduler configuration, I have
> "Job Scheduling Information" set to true. I assume that's the same
> setting you're referring to?
> 
> With this setting enabled, I still don't get very much info. When I
> run qstat -j on my jobs, the only thing it tells me is that a queue
> instance for a particular node is dropped because that node is
> disabled.

Why "disabled"? Did someone use `qmon` to disable the node, or set up a
calendar?
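
Regarding the qmaster messages further down: the numbers there are byte
counts. 68719476736 bytes is exactly 64 GiB, while the capacity reported
for the host is about 62.18 GiB, so job 495795 is not one of your 4GB jobs
but one asking for far more. To see which jobs hit this and by how much, a
minimal Python sketch like the following could scan the messages file. The
regular expression is an assumption modelled only on the lines you posted,
and `find_shortfalls` and `Shortfall` are made-up names for illustration:

```python
import re
from typing import Iterator, NamedTuple

GIB = 2**30  # bytes per GiB

# Hypothetical pattern, modelled only on the "host load value ... exceeded"
# lines quoted in this thread; other qmaster message formats are not handled.
# \s* at the breaks also tolerates the line wrapping seen in the mail.
EXCEEDED = re.compile(
    r'host load value "(?P<res>[^"]+)" exceeded:\s*'
    r'capacity is (?P<cap>[\d.]+), job (?P<job>\d+) requests additional\s*'
    r'(?P<req>[\d.]+)'
)


class Shortfall(NamedTuple):
    job: int
    resource: str
    capacity_gib: float
    request_gib: float

    @property
    def missing_gib(self) -> float:
        # How far the request exceeds the reported capacity.
        return self.request_gib - self.capacity_gib


def find_shortfalls(text: str) -> Iterator[Shortfall]:
    """Yield one Shortfall per 'exceeded' message found in text."""
    for m in EXCEEDED.finditer(text):
        yield Shortfall(
            job=int(m["job"]),
            resource=m["res"],
            capacity_gib=float(m["cap"]) / GIB,
            request_gib=float(m["req"]) / GIB,
        )


if __name__ == "__main__":
    # The message from the thread, wrapped across lines as in the mail.
    sample = (
        '03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded:\n'
        'capacity is 66765959168.262146, job 495795 requests additional\n'
        '68719476736.000000\n'
    )
    for s in find_shortfalls(sample):
        print(f"job {s.job}: {s.resource} requested {s.request_gib:.2f} GiB, "
              f"capacity {s.capacity_gib:.2f} GiB, "
              f"short by {s.missing_gib:.2f} GiB")
```

For the sample this prints "job 495795: mem_free requested 64.00 GiB,
capacity 62.18 GiB, short by 1.82 GiB". To run it over the real log, pass
the file contents instead, e.g. `find_shortfalls(open(path).read())`, with
`path` pointing at wherever your qmaster messages file lives.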

-- Reuti


> Thanks,
> Lane
> 
> On Thu, Mar 10, 2011 at 4:28 PM, Rayson Ho <[email protected]> wrote:
>> Turn on "schedd_job_info", and run qstat -j to see why the scheduler
>> is not assigning jobs.
>> 
>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
>> http://gridscheduler.sourceforge.net/howto/troubleshooting.html
>> 
>> Rayson
>> 
>> 
>> 
>> On Thu, Mar 10, 2011 at 2:04 PM, Lane Schwartz <[email protected]> wrote:
>>> Hi,
>>> 
>>> Lately I've noticed that many of my jobs take much longer than
>>> expected (sometimes up to half an hour) to go from pending to
>>> running, even when there are numerous nodes with sufficient resources
>>> available. Right now, for example, I've got a couple dozen jobs
>>> pending, and 38 nodes where no jobs are running.
>>> 
>>> I was wondering if anyone might be able to shed some light on why this
>>> happens. As I said, there are plenty of nodes with sufficient
>>> resources available to run the pending jobs, but they sometimes take a
>>> long time to go from pending to running.
>>> 
>>> For reference, mem_free is set to consumable, and my jobs use the
>>> default value of 4GB for their requested mem_free. There are some
>>> other users' jobs which request more memory than that.
>>> 
>>> The only clue I've been able to find is from examining the qmaster
>>> messages log file. It has lots of lines that look like the errors
>>> below:
>>> 
>>> 03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded:
>>> capacity is 66765959168.262146, job 495795 requests additional
>>> 68719476736.000000
>>> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
>>> resources have changed during a scheduling run
>>> 03/10/2011 13:56:00|worker|t3n2|W|Skipping 108 remaining orders
>>> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
>>> resources have changed during a scheduling run
>>> 
>>> Any tips or pointers would be appreciated.
>>> 
>>> Thanks,
>>> Lane
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>>> 
>> 
> 
> 
> 
> -- 
> When a place gets crowded enough to require ID's, social collapse is not
> far away.  It is time to go elsewhere.  The best thing about space travel
> is that it made it possible to go elsewhere.
>                 -- R.A. Heinlein, "Time Enough For Love"
> 
