On 16.01.2014 at 00:24, Joseph Farran wrote:

> Allison,
>
> I love Grid Engine but this is the one feature I truly miss from Torque:
>
> -l nodes=x:ppn=[count]
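The closest SGE counterpart to the request above is a parallel environment with allocation_rule $pe_slots, which pins all granted slots to a single host. A sketch; the PE name "smp" is a site-specific assumption and must be created by the admin and attached to the relevant queues:

```shell
# Sketch of a single-host PE definition (shown by "qconf -sp smp" once an
# admin has created it with "qconf -ap smp"):
#
#   pe_name            smp
#   slots              999
#   allocation_rule    $pe_slots
#   control_slaves     FALSE
#   job_is_first_task  TRUE

# Request 8 cores on one node - roughly "-l nodes=1:ppn=8" in Torque:
qsub -pe smp 8 job.sh
```

It is $pe_slots that forces all requested slots onto one execution host; the PE must exist and be attached to a queue before the qsub request can be granted.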
An RFE could be the setting "allocation_rule $user_defined" in a PE, which would still need to be requested during job submission.

> <snip>
> The MPI job will NOT suspend jobs on the "free64" queue. The job waits
> until free64 jobs are done and then the job runs and grabs the entire
> nodes correctly using the "exclusive" consumable.
>
> Is there a fix to this? So that jobs on free64 ARE suspended when using
> "-l exclusive=true" and pe "mpi" on our pub64 queue?

The suspension is the result of a job having started on this particular node; for a fraction of a second the node is oversubscribed. This is also the reason why you may need more slots defined at the exechost level, as otherwise the limited slot count wouldn't allow the superordinate job to start.

What you can try: as the jobs in free64 are suspended anyway, attach the "exclusive" attribute at the queue level, i.e. attach it to pub64 instead of to the exechost. The job will then start in pub64 and suspend the free64 jobs.

> Using other pe like openmp works just fine and jobs are suspended
> correctly. So it's only with this combo.

Interesting - which version of SGE are you running?

-- Reuti

> Joseph
>
>
> On 01/15/2014 02:58 PM, Reuti wrote:
>> On 15.01.2014 at 23:28, Allison Walters wrote:
>>
>>> We have OpenMP jobs that need a user-defined (usually more than one but
>>> less than all) number of cores on a single node for each job. In
>>> addition to running these jobs, our program has an interface to the
>>> cluster so they can submit jobs through a custom GUI (and we build the
>>> qsub command in the background for the submission). I'm trying to find
>>> a way for the job to request those multiple cores that does not depend
>>> on the cluster being configured a certain way, since we have no control
>>> over whether the client has a parallel environment created, how it's
>>> named, etc...
>>
>> This is not in the paradigm of SGE.
>> You can only create a consumable complex, attach it to each exechost,
>> and request the correct amount for each job, even serial ones (with a
>> default of 1). But in this case the memory requests (or others) won't
>> be multiplied, as SGE always thinks it's a serial job. You then replace
>> the custom PE with a custom complex.
>>
>>> Basically, I'm just looking for the equivalent of -l nodes=[count]
>>
>> Wouldn't it be: -l nodes=1:ppn=[count]
>>
>> For -l nodes=[count] it's like SGE's allocation_rule $round_robin or
>> $fill_up - depending on a setting somewhere in Torque (i.e. the same
>> rule is applied to all types of job all the time). It could span more
>> than one node in either case.
>>
>> -- Reuti
>>
>>> in PBS/Torque, or -n [count] in LSF, etc... The program will use the
>>> correct number of cores we pass to it, but we need to pass that
>>> parameter to the cluster as well to ensure it only gets sent to a node
>>> with the correct number of cores available. This works fine in the
>>> other clusters we support but I'm completely at a loss as to how to do
>>> it in Grid Engine. I feel like I must be missing something! :-)
>>>
>>> Thank you.
>>>
>>> -Allison

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
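The consumable-complex route described in the reply above can be sketched as follows; the complex name "cores", the host name "node01", and the unit counts are placeholder assumptions, and the qconf steps need admin rights:

```shell
# 1. Add a consumable to the complex list ("qconf -mc" opens an editor);
#    the column order is: name shortcut type relop requestable consumable
#    default urgency. A default of 1 makes every serial job consume one unit.
#
#   cores  cores  INT  <=  YES  YES  1  0

# 2. Attach the consumable to each exechost, e.g. 64 units on node01:
qconf -aattr exechost complex_values cores=64 node01

# 3. Request the desired number of cores at submission time - no PE needed:
qsub -l cores=4 job.sh
```

As the reply notes, SGE still treats such a job as serial, so per-slot requests like memory are not multiplied by the requested core count.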
