Hi all,
I have a complicated job sequence that requires "spawning" many other
jobs related to the original one. It's a bunch of serial jobs spread
over a large cluster, but for accounting purposes I'd like to keep track
of them all.
The basic sequence is:
Main Job [spawns]
  |
  \/
(18) Layer_1_jobs [each spawns]
        ----> (20) Iterative_jobs [collect all results]
        ----> (1) Serial_job [sends 20 results back to Layer_1_job]
  |
  \/
(1) Serial_job [final calcs]
  |
  \/
Main Job ends after collecting all 18 datasets.
The reason for this complicated zoo is that even if one of the iterative
jobs dies, we can still proceed with the whole process... so robustness
(completeness) is the key here.
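For the collection step, all that robustness really requires is a
tolerant wait over the spawned job ids. A minimal sketch with the Python
drmaa bindings (collect() and job_ids are just placeholder names) would
be:

import drmaa

def collect(session, job_ids):
    """Wait on every spawned job; keep whatever finished cleanly."""
    done = []
    for jid in job_ids:
        try:
            info = session.wait(jid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
            if info.hasExited and info.exitStatus == 0:
                done.append(jid)    # clean exit: result is usable
            # jobs that crashed or were killed are simply skipped
        except drmaa.errors.DrmaaException:
            pass                    # e.g. job already reaped; keep going
    return done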
So I wonder if I shouldn't just code a controller along those lines
using DRMAA instead of doing job arrays? For development, the easy hack
was to make a couple of compute nodes submit hosts, so that the 18 jobs
go to those two nodes and from there each one spawns its 20-job load to
the rest of the cluster. This was fine for testing, but now there will
be multiple runs and I don't want to make my compute nodes submit hosts.
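With a DRMAA controller, the whole tree could instead be driven from a
single session on one proper submit host. Roughly what I have in mind
(script names are placeholders for my real jobs, it reuses the collect()
helper above, and the 18 layers are shown sequentially for brevity,
though they could run concurrently):

import drmaa

def run_layer(s, layer):
    # spawn the 20 iterative jobs for this Layer_1 dataset
    jt = s.createJobTemplate()
    jt.remoteCommand = "./iterative_job.sh"   # placeholder script
    # PARAMETRIC_INDEX expands to the task index (1..20)
    jt.args = [str(layer), drmaa.JobTemplate.PARAMETRIC_INDEX]
    ids = s.runBulkJobs(jt, 1, 20, 1)
    s.deleteJobTemplate(jt)
    good = collect(s, ids)          # tolerant wait from the sketch above
    # one serial job per layer, fed whatever results survived
    st = s.createJobTemplate()
    st.remoteCommand = "./serial_job.sh"      # placeholder script
    st.args = [str(layer)] + good
    jid = s.runJob(st)
    s.wait(jid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    s.deleteJobTemplate(st)

with drmaa.Session() as s:
    for layer in range(1, 19):      # the 18 Layer_1 datasets
        run_layer(s, layer)
    # final calcs once all layers are in
    ft = s.createJobTemplate()
    ft.remoteCommand = "./final_calcs.sh"     # placeholder script
    fid = s.runJob(ft)
    s.wait(fid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    s.deleteJobTemplate(ft)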
I welcome your input.
Fernanda