On 2/6/12 4:25 PM, "Lane Schwartz" <[email protected]> wrote:

>Hi all,
>
>I have a large number of jobs that I need to run. Each of these jobs
>kicks off a number of child jobs. The child jobs do most of the actual
>work - the parent jobs mostly sit and wait until the child jobs have
>completed.
>
>Ideally, I would like to kick off all of my parent jobs, and let them
>spawn off all of their respective child jobs, and wait until
>everything finishes. But there's a problem with this. If I kick off
>all of the parent jobs, then the parent jobs take up lots of slots in
>my grid, and it takes far longer than it should for the grid to work
>through all of the child jobs, because the parent jobs are taking up
>so many compute slots.
>
>To solve this problem, it occurred to me that it would be nice if I
>could specify (perhaps by job name) a maximum number of parent jobs
>that can simultaneously be executing.
>
>The way I'm currently working around this problem is the following. I
>launch one or two parent jobs, then wait until they have spawned their
>child jobs. At this point all of the slots in my grid have been
>filled. I then launch the rest of my parent jobs, which don't run,
>because no slots are available. I then use qmon to lower the priority
>of my waiting parent jobs. This works OK, but later on I still
>sometimes end up with too many parent jobs running simultaneously.
>
>I've looked through the documentation to try to find a better
>solution. The closest thing I've found is the -tc flag to qsub, which
>allows me to limit the number of tasks of an array job that execute
>concurrently.
>Unfortunately, the parent jobs are not themselves array jobs, and
>while I suppose I could try to rewrite the parent launch scripts to
>launch as an array job, this would be less than ideal.
>
>I was wondering if anyone has any other ideas on how to specify that
>no more than n instances of jobs with a specified name should be able
>to run simultaneously. I'd be open to other mechanisms, too.
>

We do something similar, but accomplish it differently. A script on the
submit host determines how many chunks the input dataset will be divided
into, then submits an array job with one task per chunk. The array job is
submitted with a name (using '-N <name>') that the script generates. The
script then submits an 'accumulation' job that assembles the results of the
array job. That job is submitted with 'qsub -hold_jid <name>', so it waits
in the queue, taking up no slot, until all tasks of the array job have
finished. Of course, if your child jobs have to talk to your parent job
periodically, this won't do you much good.
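
A rough sketch of that submit-host script, in Python, might look like the
following. It is an illustration under assumptions, not our actual script:
the function names, the job-name scheme, and the script names
process_chunk.sh and accumulate.sh are all made up for the example. Only the
qsub options (-N, -t, -tc, -hold_jid) are real.

```python
import subprocess
import time


def build_commands(n_chunks, worker_script, merge_script, name=None, max_running=None):
    """Return two qsub command lines: the array job and the held accumulation job."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    if name is None:
        # Hypothetical naming scheme; any name unique to this run would do.
        name = "chunks_%d" % int(time.time())
    # One array task per chunk, all sharing the generated job name.
    array_cmd = ["qsub", "-N", name, "-t", "1-%d" % n_chunks]
    if max_running is not None:
        # -tc caps how many tasks of this array job run at the same time.
        array_cmd += ["-tc", str(max_running)]
    array_cmd.append(worker_script)
    # -hold_jid keeps the accumulation job pending until the named job is done.
    merge_cmd = ["qsub", "-N", name + "_merge", "-hold_jid", name, merge_script]
    return array_cmd, merge_cmd


def submit(commands):
    """Run each qsub command in order; only works on a Grid Engine submit host."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands instead of submitting, so this runs anywhere.
    for cmd in build_commands(40, "process_chunk.sh", "accumulate.sh", max_running=10):
        print(" ".join(cmd))
```

Inside the worker script, each task can read the SGE_TASK_ID environment
variable to decide which chunk to process. Calling submit() on the returned
pair would do the actual submission.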

John



_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
