Re: [gridengine users] Long delay starting jobs, even when compute nodes are empty

Lane Schwartz Thu, 10 Mar 2011 13:20:41 -0800

Thanks for the tips.

We don't ever use the -pe flag. I assume that means all of our jobs are
serial.


Lane

On Thu, Mar 10, 2011 at 2:30 PM, "Hung-Sheng Tsao (Lao Tsao 老曹) Ph. D." <
[email protected]> wrote:

>
> This may be related to the following:
> Multiplied Resource Requests Versus Non-Multiplied Resource Requests
>
> By default Sun Grid Engine performs *multiplied resource requests*, which
> means that a consumable resource request is multiplied by the number of
> slots allocated to a parallel job. The configuration for multiplied resource
> requests is designated by a YES flag in the consumable column of the jobs row
> in the complex definition.
>
> The following multiplied resource request is explained below:
>
> qsub -l mem=100M -pe make 8
>
> Sun Grid Engine multiplies the consumable resource request (100M) by the
> number of slots allocated for the parallel job (8). The consumable usage
> is split across the queues and hosts on which the job runs. If four tasks
> run on host A and four tasks run on host B, the job consumes 400 Mbytes on
> each host.
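
The arithmetic in the quoted example can be sketched in Python. This is only an illustration of the multiplication rule as described above, not Grid Engine code; the function name per_host_usage and the host names hostA and hostB are made up for the example.

```python
def per_host_usage(request_mb, slots_per_host):
    """Return the consumable amount debited on each host for a multiplied request.

    request_mb: the per-slot request in megabytes (100 for "-l mem=100M")
    slots_per_host: hypothetical mapping of host name -> slots granted on that host
    """
    return {host: request_mb * slots for host, slots in slots_per_host.items()}


# Eight slots split evenly across two hosts, as in the quoted example.
usage = per_host_usage(100, {"hostA": 4, "hostB": 4})
total = sum(usage.values())
print(usage, total)
```

A serial job (no -pe request) occupies a single slot, so under this rule its request would be debited once, unmultiplied.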
>
> While multiplied resource requests typically work well, in the case of
> software licenses, it is more practical to make a per-job request, or a
> *non-multiplied resource request*, which debits the exact amount requested.
> Starting in Sun Grid Engine 6.2u2, you can configure the complex to accept
> non-multiplied resource requests by changing the jobs consumable flag from
> YES to JOB, as shown below:
>
> #name   shortcut   type   relop   requestable   consumable   default   urgency
> #-----------------------------------------------------------------------------
> jobs       j        INT    <=          YES           JOB        0        0
>
> For more on the complex configuration, see the
> queue_conf(5) <http://gridengine.sunsource.net/manpages.html> man page.
>
> On 3/10/2011 2:04 PM, Lane Schwartz wrote:
>
> Hi,
>
> Lately I've noticed that many of my jobs take much longer than
> expected (sometimes up to half an hour) to go from pending to
> running, even when there are numerous nodes with sufficient resources
> available. Right now, for example, I've got a couple dozen jobs in
> pending, and 38 nodes where no jobs are running.
>
> I was wondering if anyone might be able to shed some light on why this
> might be. As I said, there are plenty of nodes with sufficient
> resources available to run the pending jobs, but they sometimes take a
> long time to go from pending to running.
>
> For reference, mem_free is set to consumable, and my jobs use the
> default value of 4GB for their requested mem_free. There are some
> other users' jobs which request more memory than that.
>
> The only clue I've been able to find is from examining the qmaster
> messages log file. It has lots of lines that look like the errors
> below:
>
> 03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded:
> capacity is 66765959168.262146, job 495795 requests additional
> 68719476736.000000
> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
> resources have changed during a scheduling run
> 03/10/2011 13:56:00|worker|t3n2|W|Skipping 108 remaining orders
> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
> resources have changed during a scheduling run
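
To make the numbers in the first log line easier to read, here is a small Python sketch that pulls out the capacity and the request and converts them to GiB. It assumes the values are in bytes, which the log does not state, and the regular expression is written against this one excerpt only, not against any documented qmaster log format.

```python
import re

# Log line copied from the excerpt above, with the wrapped lines rejoined.
line = ('03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded: '
        'capacity is 66765959168.262146, job 495795 requests additional '
        '68719476736.000000')

# Pattern written for this excerpt only; other qmaster messages may differ.
pattern = re.compile(
    r'host load value "(?P<resource>[^"]+)" exceeded: '
    r'capacity is (?P<capacity>[\d.]+), '
    r'job (?P<job>\d+) requests additional (?P<request>[\d.]+)'
)

m = pattern.search(line)
GIB = 2 ** 30  # assumption: the logged values are byte counts

capacity_gib = float(m["capacity"]) / GIB
request_gib = float(m["request"]) / GIB
print(f'job {m["job"]}: {m["resource"]} capacity {capacity_gib:.2f} GiB, '
      f'request {request_gib:.2f} GiB')
```

Under the bytes assumption, the request in this line works out to 64 GiB against a capacity of a little over 62 GiB, so the rejected request is much larger than the 4GB default mentioned above.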
>
> Any tips or pointers would be appreciated.
>
> Thanks,
> Lane
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>
>


-- 
When a place gets crowded enough to require ID's, social collapse is not
far away.  It is time to go elsewhere.  The best thing about space travel
is that it made it possible to go elsewhere.
                -- R.A. Heinlein, "Time Enough For Love"
