Hi,

On 19.01.2012, at 01:40, Fernanda Foertter wrote:

> Thanks guys.  I got this process to work using holds and dependencies using a 
> sample script... but:
> 
> The user's code used system calls to run executables from within the main code.

Is the idea to spawn (i.e. fork) all the necessary processes, so that in 
essence the first job acts as a starter (controller) for all the others, and 
you want to put this scheme into SGE?

So, instead of using -hold_jid you prefer to release the hold by hand, 
which means that the node must be a submit host, as you correctly state.
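
A minimal sketch of that by-hand release (assuming an SGE version where qsub 
supports -terse; the job and script names are just placeholders):

    # Submit the follow-up job with a user hold; -terse prints only the job id.
    JOBID=$(qsub -terse -h -N follow_up follow_up.sh)

    # Later, release the user hold by hand - this is the call that requires
    # the calling host to be a submit host:
    qalter -h U $JOBID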

Does the first job run for a long time? I could also imagine calling `qrsh 
-inherit ...` in a loop over the list of granted slots (i.e. the first job is 
already submitted as a parallel one) and starting all the follow-up tasks this 
way (replacing the system calls), as in the sketch below. This would fit 
perfectly into SGE, and the nodes don't need to be submit hosts.
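
A rough sketch of such a master job script (the PE name in the comment and 
/path/to/task.sh are placeholders):

    #!/bin/sh
    # Submitted e.g. as: qsub -pe mype 18 master.sh
    # Start one task per granted slot via qrsh -inherit instead of plain
    # system() calls, so that SGE starts and accounts for them:
    while read host nslots rest; do
        i=0
        while [ "$i" -lt "$nslots" ]; do
            qrsh -inherit -nostdin "$host" /path/to/task.sh &
            i=$((i + 1))
        done
    done < "$PE_HOSTFILE"
    wait    # collect all remote tasks before the master job ends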

-- Reuti


> I planned to replace the original executable calls with qalter.
> Unfortunately, qalter -h U requires the node to be a submit node... again, I 
> want to avoid this.
> 
> Any suggestions?
> 
> Thanks in advance!
> 
> On Jan 3, 2012, at 2:54 PM, Reuti wrote:
> 
>> Hi,
>> 
>> On 03.01.2012, at 19:20, Fernanda Foertter wrote:
>> 
>>>     I have this complicated job sequence that requires the "spawning" of 
>>> many other jobs related to the original one.  It's a bunch of serial jobs 
>>> spread over a large cluster, but for accounting purposes, I'd like to keep 
>>> track of them all.
>>> 
>>>     The basic sequence is:
>>> 
>>> Main Job [spawns]
>>> |
>>> \/
>>> (18)Layer_1_jobs [each_spawn]---->(20)Iterative_jobs [collect all results]---->(1)Serial_job [sends 20 results back to Layer_1_Job]
>>> |
>>> \/
>>> (1) serial_job [final calcs]
>>> |
>>> \/
>>> Ends Main Job after collecting all 18 datasets.
>>> 
>>> The reason for this complicated zoo is that even if one of the iterative 
>>> jobs dies, we can still proceed with the whole process... so robustness 
>>> (completeness) is the key here.
>>> 
>>> So I wonder if I shouldn't just code a controller using DRMAA instead of 
>>> doing Job arrays?
>> 
>> where are you using arrays above?
>> 
>> Why not submit one job, and for the 20 jobs put a hold on the job 
>> number/job name of the initial one? For the final job you again use a hold 
>> on the job numbers/job names of the 20 jobs. SGE judges a job as completed 
>> as soon as it has left the system - whether it was successful or not 
>> doesn't matter. So everything can be submitted on the master node as usual.
>> 
>> It's just necessary to have unique job names for each workflow; the 20 jobs 
>> you could name job_a1, job_a2, ... and wait for them with: -hold_jid 
>> "job_a*" (see the sketch below). It could also be an array job with just 
>> one job number which you have to wait for.
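>> 
>> A minimal sketch of that submission order for one workflow "a" (the job and 
>> script names are just placeholders):
>> 
>>     # the initial job
>>     qsub -N main_a main.sh
>>     # the 20 follow-up jobs, each held until main_a has left the system
>>     for i in $(seq 1 20); do
>>         qsub -N job_a$i -hold_jid main_a worker.sh $i
>>     done
>>     # the final job, held until all job_a* jobs have left the system
>>     qsub -N final_a -hold_jid "job_a*" final.sh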
>> 
>> 
>>> For development the easy hack was to make a couple of compute nodes 
>>> submit_hosts, so that the 18 jobs go to those two nodes and from there 
>>> each one spawns its 20-job load to the rest of the cluster.  This was fine 
>>> for testing, but now there'll be multiple runs and I don't want to make my 
>>> nodes submit nodes.
>> 
>> DRMAA (v1) doesn't offer much in this area, I fear; v2 has a JobSession to 
>> which you could reconnect. So for now, if your "workflow supervisor" 
>> crashes, the workflow can't be restarted.
>> 
>> ==
>> 
>> Another option could be Wildfire:
>> 
>> http://wildfire.bii.a-star.edu.sg/screens.php
>> 
>> Although the project looks dead, it can still be used and will allow you to 
>> create a workflow. It also works without the GUI, just by supplying a file 
>> with the dependencies, loops and cases.
>> 
>> -- Reuti
>> 
>> 
>>> I welcome your input.
>>> 
>>> Fernanda
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>> 
> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
