>> On 19.01.2012 at 01:40, Fernanda Foertter wrote:
>> 
>>> Thanks guys.  I got this process to work using holds and dependencies
>>> using a sample script... but:
>>> 
>>> The user's code used system calls to run executables, within main code.
>> 
>> Is the idea to spawn (i.e. fork) all the necessary processes, so that in
>> essence the first job acts as a starter method (controller) for all
>> the others, and you want to map this scheme onto SGE?
>> 
>> So, instead of using the -hold_jid you prefer to release the hold by
>> hand, which means that the node must be a submit host as you state
>> correctly.
>> 
>> Does the first job run for a long time? I could also imagine using
>> `qrsh -inherit ...` in a loop over the list of granted slots (i.e. the
>> first job is already submitted as a parallel one) and starting all the
>> follow-up tasks this way (replacing the system calls). This would fit
>> perfectly into SGE, and the nodes would not need to be submit hosts.
>> 
>> -- Reuti
> 
> We do something similar, using qsub -sync y rather than qrsh. The downside
> is you're stuck with a process that is controlling the qsub or qrsh
> running on the submit node, and if the submit node goes down, then your
> workflow breaks. Or am I missing something?

Correct.

Therefore it might be better to avoid `qsub -sync y ...` and put everything in
an enclosing job script which runs inside the cluster. This may avoid an
additional point of failure on the submit host, as the `qrsh -inherit ...`
runs inside the cluster and not on the submit host. It's the adaptation of a
parallel job to run many serial jobs inside it. This assumes that the
serial jobs (or parallel jobs using a subset of all granted slots) have almost
the same run time, so that slots from the granted pool are not left blocked
for a long time.
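
A rough sketch of such an enclosing job script, in case it helps. It is only
an outline under assumptions: the PE name "serialfarm", the slot count and
the task script `task.sh` are made up for illustration, the PE is assumed to
be tightly integrated (control_slaves TRUE) so that `qrsh -inherit` is
allowed, and the working directory is assumed to be on a shared file system.

```shell
#!/bin/sh
# Sketch only: PE name, slot count and task.sh are hypothetical placeholders.
#$ -pe serialfarm 4
#$ -cwd

# Print each granted host once per slot, reading a file in $PE_HOSTFILE
# format (columns: host, slots, queue, processor range).
expand_slots() {
    while read -r host slots _rest; do
        i=0
        while [ "$i" -lt "$slots" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done < "$1"
}

# Launch tasks only when SGE has started us as a parallel job.
if [ -n "${PE_HOSTFILE:-}" ]; then
    n=0
    for host in $(expand_slots "$PE_HOSTFILE"); do
        n=$((n + 1))
        # One serial task per granted slot, started in the background.
        # Assumes task.sh lives in the submit directory on a shared file system.
        qrsh -inherit "$host" "$SGE_O_WORKDIR/task.sh" "$n" &
    done
    wait    # block until every background task has finished
fi
```

The script would be submitted once with a plain `qsub`; after that nothing
has to stay alive on the submit host. The `wait` at the end is also where the
run-time assumption shows up: the slots stay allocated until the slowest task
returns.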

-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
