Allison,
I love Grid Engine but this is the one feature I truly miss from Torque:
-l nodes=x:ppn=[count]
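For reference, the nearest Grid Engine analogue I know of uses a parallel
environment with a fixed per-host allocation rule (the PE name "mpi8" below is
just an assumption, not something from this thread):

```shell
# Hedged sketch: approximating Torque's  -l nodes=4:ppn=8  in SGE.
# Assumes a hypothetical PE "mpi8" configured with:  allocation_rule 8
# (a fixed allocation_rule packs exactly that many slots per host).
qsub -pe mpi8 32 job.sh    # 32 slots total, 8 per host => spans 4 hosts
```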
Reuti,
We have a complex setup trying to accomplish this same thing, and it mostly
works, but we have an issue with jobs not starting while jobs are running in a
subordinate queue.
First, here is our setup:
qconf -sc | egrep "#|exclu"
#name        shortcut   type   relop   requestable   consumable   default   urgency
#----------------------------------------------------------------------------------
exclusive    excl       BOOL   EXCL    YES           YES          FALSE     1000
Our MPI PE has:
$ qconf -sp mpi
pe_name mpi
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task TRUE
urgency_slots min
accounting_summary TRUE
qsort_args NONE
Our two queues:
$ qconf -sq free64 | grep sub
subordinate_list NONE
$ qconf -sq pub64 | grep sub
subordinate_list free64=1
When we submit our MPI jobs to pub64 with:
#!/bin/bash
#$ -q pub64
#$ -pe mpi 256
#$ -l exclusive=true
the MPI job will NOT suspend jobs on the "free64" queue. Instead it waits until the
free64 jobs are done, then runs and correctly grabs entire nodes using the
"exclusive" consumable.
Is there a fix for this, so that jobs on free64 ARE suspended when using "-l
exclusive=true" together with the "mpi" PE on our pub64 queue?
Using another PE such as openmp works just fine and jobs are suspended correctly,
so the problem occurs only with this combination.
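One thing that might be worth checking (a hedged suggestion, not verified
against this setup): classic subordination like free64=1 only suspends free64
on a host once the slot threshold is reached in pub64 on that same host. SGE
6.2u5 and later also offer slot-wise subordination, which suspends individual
free64 tasks as pub64 slots fill, roughly:

```
# Hypothetical alternative in the pub64 queue config (qconf -mq pub64);
# slot-wise preemption syntax, where "sr" = suspend shortest-running task:
subordinate_list   slots=64(free64:0:sr)
```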
Joseph
On 01/15/2014 02:58 PM, Reuti wrote:
On 15.01.2014 at 23:28, Allison Walters wrote:
We have OpenMP jobs that need a user-defined (usually more than one but less
than all) number of cores on a single node for each job. In addition to
running these jobs, our program has an interface to the cluster so they can
submit jobs through a custom GUI (and we build the qsub command in the
background for the submission). I'm trying to find a way for the job to
request those multiple cores that does not depend on the cluster to be
configured a certain way, since we have no control as to whether the client has
a parallel environment created, how it's named, etc...
This doesn't fit SGE's paradigm. What you can do is create a consumable complex,
attach it to each exec host, and have every job request the correct amount, even
serial ones (with a default of 1). But in that case memory requests (and other
per-slot requests) won't be multiplied, as SGE still treats the job as serial. In
effect you replace the custom PE with a custom complex.
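A rough sketch of the above (the complex name "cores" and the host name
"node01" are my illustrations, not part of the thread):

```shell
# 1) Define a host-level consumable via "qconf -mc", adding a line like:
#      cores   cr   INT   <=   YES   YES   1   0
# 2) Attach a capacity to each exec host:
qconf -mattr exechost complex_values cores=16 node01
# 3) Jobs request their core count; serial jobs consume the default of 1:
qsub -l cores=4 job.sh
```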
Basically, I'm just looking for the equivalent of -l nodes=[count]
Wouldn't it be: -l nodes=1:ppn=[count]
For -l nodes=[count] it's like SGE's allocation_rule $round_robin or $fill_up,
depending on a setting somewhere in Torque (i.e. the same rule is applied to all
job types all the time). It could span more than one node in either case.
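And for the -l nodes=1:ppn=[count] case, a single-node PE sketch (the PE name
"smp" is an assumption; $pe_slots forces all granted slots onto one host):

```
pe_name            smp
slots              9999
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
# then:  qsub -pe smp 8 job.sh   -> 8 cores on a single node
```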
-- Reuti
in PBS/Torque, or -n [count] in LSF, etc... The program will use the correct
number of cores we pass to it, but we need to pass that parameter to the
cluster as well to ensure it only gets sent to a node with the correct amount
of cores available. This works fine in the other clusters we support but I'm
completely at a loss as to how to do it in Grid Engine. I feel like I must be
missing something! :-)
Thank you.
-Allison
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users