This may be related:
Multiplied Resource Requests Versus Non-Multiplied Resource
Requests
By default Sun Grid Engine performs multiplied resource
requests, which means that a consumable resource request is
multiplied by the number of slots allocated to a parallel job. The
configuration for multiplied resource requests is designated by a
YES flag in the consumable column of the job
row in the complex definition.
Consider the following multiplied resource request:
qsub -l mem=100M -pe make 8
Sun Grid Engine multiplies the consumable resource request (100M)
by the number of slots allocated for the parallel job (8).
The consumable usage is split across the queues and hosts on which
the job runs. If four tasks run on host A and four tasks run on
host B, the job consumes 400 Mbytes on each host.
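
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the description in the text, not Grid Engine code; the function names and the host dictionary are made up for the example.

```python
# Sketch of how a consumable request is debited, following the text above.
# Function and variable names are hypothetical, not Grid Engine APIs.

def debit_per_slot(request_mb, slots_by_host):
    """consumable = YES: the request is multiplied by the slots on each host."""
    return {host: request_mb * slots for host, slots in slots_by_host.items()}

def debit_per_job(request_mb):
    """consumable = JOB: the request is debited once, whatever the slot count."""
    return request_mb

# A 100M request with 8 slots, four on host A and four on host B.
slots_by_host = {"A": 4, "B": 4}
per_host = debit_per_slot(100, slots_by_host)
total = sum(per_host.values())

print(per_host, total)        # 400 MB on each host, 800 MB overall
print(debit_per_job(100))     # 100 MB for the whole job
```

With a per-slot consumable the total debit grows with the slot count, which is why a modest per-slot request can add up to more than a single host has available.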
While multiplied resource requests typically work well, in the
case of software licenses, it is more practical to make a per job
request, or a non-multiplied resource request, which
debits the exact amount requested. Starting in Sun Grid Engine
6.2u2, you can configure the complex to accept non-multiplied
resource requests by changing the jobs consumable flag
from YES to JOB, as shown below:
#name  shortcut  type  relop  requestable  consumable  default  urgency
#-----------------------------------------------------------------------------
jobs   j         INT   <=     YES          JOB         0        0
For more on the complex configuration, see the complex(5)
man page.
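
As a rough check of whether this applies to the log lines quoted below, here is a small Python sketch using the two numbers from the "mem_free exceeded" message. It assumes that job 495795 uses the 4GB default mem_free request and that mem_free is a per-slot (YES) consumable; neither is confirmed by the log itself.

```python
# Values copied from the qmaster log lines quoted below.
capacity_bytes = 66765959168.262146   # host load value "mem_free"
requested_bytes = 68719476736.0       # what job 495795 "requests additional"

GIB = 1024 ** 3
per_slot_request = 4 * GIB            # assumed: the 4GB default, as 4 * 1024^3 bytes

# Assumption: the request was multiplied by the slot count (consumable = YES).
implied_slots = requested_bytes / per_slot_request
shortfall_gib = (requested_bytes - capacity_bytes) / GIB

print(f"requested: {requested_bytes / GIB:.2f} GiB")
print(f"capacity:  {capacity_bytes / GIB:.2f} GiB")
print(f"implied slots at 4G each: {implied_slots:g}")
print(f"shortfall: {shortfall_gib:.2f} GiB")
```

Under those assumptions the request works out to a whole number of 4GB slots and is slightly larger than the mem_free the host reported, which would be consistent with a multiplied request for a parallel job.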
On 3/10/2011 2:04 PM, Lane Schwartz wrote:
Hi,
Lately I've noticed that many of my jobs take much longer than
expected (sometimes up to half an hour) to go from pending to
running, even when there are numerous nodes with sufficient resources
available. Right now, for example, I've got a couple dozen jobs in
pending, and 38 nodes where no jobs are running.
I was wondering if anyone might be able to shed some light on why this
might be. As I said, there are plenty of nodes with sufficient
resources available to run the pending jobs, but they sometimes take a
long time to go from pending to running.
For reference, mem_free is set to consumable, and my jobs use the
default value of 4GB for their requested mem_free. There are some
other users' jobs which request more memory than that.
The only clue I've been able to find is from examining the qmaster
messages log file. It has lots of lines that look like the errors
below:
03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded:
capacity is 66765959168.262146, job 495795 requests additional
68719476736.000000
03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
resources have changed during a scheduling run
03/10/2011 13:56:00|worker|t3n2|W|Skipping 108 remaining orders
03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
resources have changed during a scheduling run
Any tips or pointers would be appreciated.
Thanks,
Lane