>> On 19.01.2012 at 01:40, Fernanda Foertter wrote:
>>
>>> Thanks guys. I got this process to work using holds and dependencies
>>> using a sample script... but:
>>>
>>> The user's code used system calls to run executables, within main code.
>>
>> is the idea to spread (i.e. fork) all the necessary processes, so that in
>> the essence the first job is like a starter method (controller) of all
>> the others and you want to put this scheme into SGE?
>>
>> So, instead of using the -hold_jid you prefer to release the hold by
>> hand, which means that the node must be a submit host as you state
>> correctly.
>>
>> Does the first job run a long time? I could also imagine to use in a loop
>> `qrsh -inherit ...` according to the list of granted slots (i.e. the
>> first job is already submitted as a parallel one) and you start all the
>> follow up tasks this way (replacing the system calls). This would fit
>> perfectly into SGE and the nodes don't need to be submit hosts.
>>
>> -- Reuti
>
> We do something similar, using qsub -sync y rather than qrsh. The downside
> is you're stuck with a process that is controlling the qsub or qrsh
> running on the submit node, and if the submit node goes down, then your
> workflow breaks. Or am I missing something?

Correct. Therefore it might be better to avoid `qsub -sync y ...` and put
everything into an enclosing job script which is run inside the cluster.
This way it might avoid an additional point of failure on the submit host,
as the `qrsh -inherit ...` runs inside the cluster and not on the submit
host.

It's the adaptation of a parallel job for running many serial jobs inside
it. This assumes that the serial jobs (or parallel jobs using a subset of
all granted slots) have almost the same run-time, to avoid blocking slots
from the pool of granted ones for a long time.

-- Reuti
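
For illustration, an enclosing job script along these lines could look roughly like the sketch below. It is only a sketch under assumptions, not a tested production setup: the job is assumed to be submitted as a parallel job into a PE that permits `qrsh -inherit` (i.e. `control_slaves TRUE`), `./task.sh` is a hypothetical placeholder for the serial executable, and the `LAUNCH` and `TASK` variables are made up here so the loop can be tried outside the cluster. It reads `$PE_HOSTFILE`, starts one task per granted slot and waits for all of them.

```shell
#!/bin/sh
# Sketch of an enclosing job script, submitted as a parallel job, e.g.
#   qsub -pe <your_pe> 8 enclosing_job.sh     (the PE name is site-specific)
# SGE sets $PE_HOSTFILE for parallel jobs; each line starts with
# "<hostname> <slots> ...".

# Print one hostname per granted slot.
expand_slots() {
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Start one task per granted slot and wait for all of them.
# LAUNCH and TASK are hypothetical knobs for this sketch;
# the defaults are "qrsh -inherit" and a placeholder ./task.sh.
run_tasks() {
    n=0
    for host in $(expand_slots "$1"); do
        n=$((n + 1))
        # LAUNCH is left unquoted on purpose so it splits into words.
        ${LAUNCH:-qrsh -inherit} "$host" "${TASK:-./task.sh}" "$n" &
    done
    wait   # block until every started task has finished
}

# Only dispatch when running inside a parallel SGE job.
if [ -n "${PE_HOSTFILE:-}" ]; then
    run_tasks "$PE_HOSTFILE"
fi
```

Setting `LAUNCH=echo` prints the commands which would be started instead of running them, which is handy for a dry run against a hand-written hostfile.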
