Hi,

On 06.02.2012 at 22:25, Lane Schwartz wrote:

> I have a large number of jobs that I need to run. Each of these jobs
> kicks off a number of child jobs. The child jobs do most of the actual
> work - the parent jobs mostly sit and wait until the child jobs have
> completed.
> 
> Ideally, I would like to kick off all of my parent jobs, and let them
> spawn off all of their respective child jobs, and wait until
> everything finishes. But there's a problem with this. If I kick off
> all of the parent jobs, then the parent jobs take up lots of slots in
> my grid, and it takes far longer than it should for the grid to work
> through all of the child jobs, because the parent jobs are taking up
> so many compute slots.
> 
> To solve this problem, it occurred to me that it would be nice if I
> could specify (perhaps by job name) a maximum number of parent jobs
> that can simultaneously be executing.
> 
> The way I'm currently working around this problem is the following. I
> launch one or two parent jobs, then wait until they have spawned their
> child jobs. At this point all of the slots in my grid have been
> filled. I then launch the rest of my parent jobs, which don't run,
> because no slots are available. I then use qmon to lower the priority
> of my waiting parent jobs. This works OK, but later on I still
> sometimes end up with too many parent jobs running simultaneously.
> 
> I've looked through the documentation to try to find a better
> solution. The closest thing I've found is the -tc flag to qsub, which
> allows me to limit the number of concurrent array jobs executing.
> Unfortunately, the parent jobs are not themselves array jobs, and
> while I suppose I could try to rewrite the parent launch scripts to
> launch as an array job, this would be less than ideal.
> 
> I was wondering if anyone has any other ideas on how to specify that
> no more than n instances of jobs with a specified name should be able
> to run simultaneously. I'd be open to other mechanisms, too.

As the parent jobs are not doing any real work, a dedicated parent.q would
do. Attach a forced boolean complex to it, so that only jobs which explicitly
request that complex (i.e. the parent jobs) can get in. You could even set an
h_cpu limit on this queue to avoid abuse: a job that burns more than 5 minutes
or so of CPU time there would get killed. The overall number of slots used in
this cluster queue can then be limited with an RQS (resource quota set).
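
A rough sketch of how these pieces could fit together. The complex name
"parent", its shortcut "par", the RQS name "max_parent_slots", the script
name parent_job.sh and the limit of 4 slots are only placeholders, and the
queue definition shows just the relevant attributes, so adjust it to your
setup:

```
# 1) Forced boolean complex, added as a new line via: qconf -mc
#name    shortcut  type  relop  requestable  consumable  default  urgency
parent   par       BOOL  ==     FORCED       NO          0        0

# 2) Relevant attributes of the dedicated queue, edited via: qconf -mq parent.q
#    (create the queue first with: qconf -aq parent.q)
qname           parent.q
complex_values  parent=TRUE
h_cpu           0:5:0

# 3) Resource quota set capping the slots in parent.q, added via: qconf -arqs
{
   name         max_parent_slots
   description  "Limit the number of concurrently running parent jobs"
   enabled      TRUE
   limit        queues parent.q to slots=4
}

# 4) Parent jobs request the complex at submission time (placeholder script):
#    qsub -l parent=TRUE parent_job.sh
```

Because the complex is FORCED, jobs that don't request it (the child jobs)
will never be scheduled into parent.q, and the parent jobs no longer occupy
slots in the queues where the child jobs run.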

-- Reuti