Allison,
I love Grid Engine but this is the one feature I truly miss from Torque:
-l nodes=x:ppn=[count]
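For reference, the nearest Grid Engine analogue I know of uses a parallel
environment with a fixed per-host allocation rule (the PE name "mpi8" below is
just an assumption, not something from this thread):

```shell
# Hedged sketch: approximating Torque's  -l nodes=4:ppn=8  in SGE.
# Assumes a hypothetical PE "mpi8" configured with:  allocation_rule 8
# (a fixed allocation_rule packs exactly that many slots per host).
qsub -pe mpi8 32 job.sh    # 32 slots total, 8 per host => spans 4 hosts
```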
Reuti,
We have a complex setup trying to accomplish this same thing, and it mostly
works, but we have an issue with jobs not starting while jobs are running in a
subordinate queue.
First, here is our setup:
qconf -sc | egrep "#|exclu"
#name        shortcut   type   relop   requestable   consumable   default   urgency
#----------------------------------------------------------------------------------
exclusive    excl       BOOL   EXCL    YES           YES          FALSE     1000
Our MPI PE has:
$ qconf -sp mpi
pe_name mpi
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task TRUE
urgency_slots min
accounting_summary TRUE
qsort_args NONE
Our two queues:
$ qconf -sq free64 | grep sub
subordinate_list NONE
$ qconf -sq pub64 | grep sub
subordinate_list free64=1
When we submit our MPI jobs to pub64 with:
#!/bin/bash
#$ -q pub64
#$ -pe mpi 256
#$ -l exclusive=true
the MPI job will NOT suspend jobs on the "free64" queue. Instead it waits until the
free64 jobs are done, then runs and correctly grabs entire nodes using the
"exclusive" consumable.
Is there a fix for this, so that jobs on free64 ARE suspended when using "-l
exclusive=true" together with the "mpi" PE on our pub64 queue?
Using another PE such as openmp works just fine and jobs are suspended correctly,
so the problem occurs only with this combination.
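One thing that might be worth checking (a hedged suggestion, not verified
against this setup): classic subordination like free64=1 only suspends free64
on a host once the slot threshold is reached in pub64 on that same host. SGE
6.2u5 and later also offer slot-wise subordination, which suspends individual
free64 tasks as pub64 slots fill, roughly:

```
# Hypothetical alternative in the pub64 queue config (qconf -mq pub64);
# slot-wise preemption syntax, where "sr" = suspend shortest-running task:
subordinate_list   slots=64(free64:0:sr)
```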
Joseph
On 01/15/2014 02:58 PM, Reuti wrote:
On 15.01.2014 at 23:28, Allison Walters wrote:
We have OpenMP jobs that need a user-defined (usually more than one but less
than all) number of cores on a single node for each job. In addition to
running these jobs, our program has an interface to the cluster so they can
submit jobs through a custom GUI (and we build the qsub command in the
background for the submission). I'm trying to find a way for the job to
request those multiple cores that does not depend on the cluster to be
configured a certain way, since we have no control as to whether the client has
a parallel environment created, how it's named, etc...
This doesn't fit SGE's paradigm. What you can do is create a consumable complex,
attach it to each exec host, and have every job request the correct amount, even
serial ones (with a default of 1). But in that case memory requests (and other
per-slot requests) won't be multiplied, as SGE still treats the job as serial. In
effect you replace the custom PE with a custom complex.
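A rough sketch of the above (the complex name "cores" and the host name
"node01" are my illustrations, not part of the thread):

```shell
# 1) Define a host-level consumable via "qconf -mc", adding a line like:
#      cores   cr   INT   <=   YES   YES   1   0
# 2) Attach a capacity to each exec host:
qconf -mattr exechost complex_values cores=16 node01
# 3) Jobs request their core count; serial jobs consume the default of 1:
qsub -l cores=4 job.sh
```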
Basically, I'm just looking for the equivalent of -l nodes=[count]
Wouldn't it be: -l nodes=1:ppn=[count]
For -l nodes=[count] it's like SGE's allocation_rule $round_robin or $fill_up,
depending on a setting somewhere in Torque (i.e. the same rule is applied to all
job types all the time). It could span more than one node in either case.
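And for the -l nodes=1:ppn=[count] case, a single-node PE sketch (the PE name
"smp" is an assumption; $pe_slots forces all granted slots onto one host):

```
pe_name            smp
slots              9999
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
# then:  qsub -pe smp 8 job.sh   -> 8 cores on a single node
```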
-- Reuti
in PBS/Torque, or -n [count] in LSF, etc... The program will use the correct
number of cores we pass to it, but we need to pass that parameter to the
cluster as well to ensure it only gets sent to a node with the correct amount
of cores available. This works fine in the other clusters we support but I'm
completely at a loss as to how to do it in Grid Engine. I feel like I must be
missing something! :-)
Thank you.
-Allison
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users