Hi,

On 06.02.2012 at 22:25, Lane Schwartz wrote:

> I have a large number of jobs that I need to run. Each of these jobs
> kicks off a number of child jobs. The child jobs do most of the actual
> work - the parent jobs mostly sit and wait until the child jobs have
> completed.
>
> Ideally, I would like to kick off all of my parent jobs, and let them
> spawn off all of their respective child jobs, and wait until
> everything finishes. But there's a problem with this. If I kick off
> all of the parent jobs, then the parent jobs take up lots of slots in
> my grid, and it takes far longer than it should for the grid to work
> through all of the child jobs, because the parent jobs are taking up
> so many compute slots.
>
> To solve this problem, it occurred to me that it would be nice if I
> could specify (perhaps by job name) a maximum number of parent jobs
> that can simultaneously be executing.
>
> The way I'm currently working around this problem is the following. I
> launch one or two parent jobs, then wait until they have spawned their
> child jobs. At this point all of the slots in my grid have been
> filled. I then launch the rest of my parent jobs, which don't run,
> because no slots are available. I then use qmon to lower the priority
> of my waiting parent jobs. This works OK, but later on I still
> sometimes end up with too many parent jobs running simultaneously.
>
> I've looked through the documentation to try to find a better
> solution. The closest thing I've found is the -tc flag to qsub, which
> allows me to limit the number of concurrent array jobs executing.
> Unfortunately, the parent jobs are not themselves array jobs, and
> while I suppose I could try to rewrite the parent launch scripts to
> launch as an array job, this would be less than ideal.
>
> I was wondering if anyone has any other ideas on how to specify that
> no more than n instances of jobs with a specified name should be able
> to run simultaneously. I'd be open to other mechanisms, too.

As the parent jobs are not doing any work, a special parent.q would do. It has to be requested via a forced boolean complex, so that only parent jobs can get in. You could even set an h_cpu limit on this queue to avoid abuse: since real parent jobs only wait and use almost no CPU time, any job abusing this queue would get killed after 5 minutes or so of CPU time. The overall slot count used in this cluster queue can be limited in an RQS (resource quota set).

-- Reuti
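
A rough sketch of how the pieces described above could fit together. The complex name "parent", the RQS name, the limit of 4 slots and the script name parent.sh are placeholders I made up for illustration; only parent.q and the 5-minute h_cpu idea come from the text above. Adjust everything to your own cluster.

```
# 1) Complex entry, added with "qconf -mc" (the name "parent" is an assumption):
#name    shortcut  type  relop  requestable  consumable  default  urgency
parent   parent    BOOL  ==     FORCED       NO          0        0

# 2) Relevant lines of the queue, set with "qconf -aq parent.q"
#    (or "qconf -mq parent.q" if the queue already exists):
complex_values   parent=TRUE
h_cpu            0:5:0

# 3) Resource quota set, added with "qconf -arqs"
#    (the cap of 4 slots is a placeholder for your own n):
{
   name         max_parent_slots
   description  "Cap on slots used in parent.q"
   enabled      TRUE
   limit        queues parent.q to slots=4
}

# 4) Submitting a parent job (parent.sh is a hypothetical script name):
qsub -l parent=TRUE parent.sh
```

Because the complex is FORCED, jobs that do not request it (the child jobs, for instance) are never scheduled into parent.q, and the RQS caps how many parent jobs can run there at once.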
