Hello everybody,
we are having serious trouble controlling the execution of several
sbatch jobs from a shell script running on one of our submit hosts.
I.) What we want to achieve:
1.) We want to have a controlling shell script, let's call it
control_submits.sh
2.) This control_submits.sh should run several
sbatch submit_job1.sh
sbatch submit_job2.sh
...
3.) Each of the submit_jobN.sh invokes mpiexec.
The point is that we want job2 to start executing only *after* job1 has
finished, and so on. Thus, we need some way to do a blocking wait for
job completion in control_submits.sh.
We can't use SLURM's --dependency facility because the existing
setup is actually far more complicated: we have more than one
control_submits.sh, and these do extra work between the job submissions.
It is important that the submit_jobN.sh scripts themselves run on some
of the allocated resources, not on the submit host.
I would like to know whether I am missing an easy way to get the
desired effect and, if not, what the best way to achieve it would be.
II.) Possible solutions:
II.1)
Replace sbatch by mysubmit.pl which parses the submit_jobN.sh's inlined
"#SBATCH" options, creates a wrapped_submit_jobN.sh and invokes
srun <options I got out of #SBATCH> wrapped_submit_jobN.sh
where wrapped_submit_jobN.sh exits immediately if $SLURM_PROCID != 0;
otherwise, it runs the contents of the original submit_jobN.sh,
i.e. mpiexec among other things.
Question: Would the srun calls invoked within the submit_jobN.sh scripts
(through mpiexec) create a new allocation, or would they create their
job steps under the allocation of the original srun call?
A first quick test suggests the latter, but I'm not sure.
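To make II.1 more concrete, here is a rough shell sketch of the two pieces (mysubmit.pl itself would be Perl; this only illustrates the idea). The function names sbatch_opts and is_lead_task are made up for this example, and the sketch assumes that every "#SBATCH" line holds options that srun also accepts, which is not true for all sbatch options.

```shell
#!/bin/bash
# Sketch only, not a tested replacement for sbatch.
# sbatch_opts and is_lead_task are hypothetical helper names.

# Print the options from the inline "#SBATCH" lines of a job script,
# one line of options per directive.
sbatch_opts() {
    sed -n 's/^#SBATCH[[:space:]]\{1,\}//p' "$1"
}

# True only for the task with SLURM_PROCID 0 (or when the variable is
# unset, e.g. when the script is run outside of srun).
is_lead_task() {
    [ "${SLURM_PROCID:-0}" -eq 0 ]
}

# Intended use (commented out so the sketch has no side effects):
#
#   In the submitting script; the unquoted expansion relies on word
#   splitting and breaks on options that contain spaces or quotes:
#     srun $(sbatch_opts submit_job1.sh) wrapped_submit_job1.sh
#
#   At the top of the generated wrapped_submit_job1.sh:
#     is_lead_task || exit 0
#     ... original contents of submit_job1.sh, e.g. mpiexec ...
```

Since srun blocks until all of its tasks have exited, control_submits.sh would get the blocking behaviour for free with this approach.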
II.2)
Submit the submit_jobN.sh scripts with sbatch as before and poll with
scontrol show job <jobid>
for completion from within control_submits.sh.
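A minimal sketch of how the polling in II.2 might look. The names job_state, wait_for_job and POLL_INTERVAL are made up for this example, and the list of "still in flight" states is my assumption about which JobState values matter here. Note also that scontrol forgets a job some time after it ends, so an empty state is treated as a failure below.

```shell
#!/bin/bash
# Sketch only: block until a job has left the queue by polling scontrol.
# job_state, wait_for_job and POLL_INTERVAL are hypothetical names.

POLL_INTERVAL=${POLL_INTERVAL:-30}   # seconds between polls (assumed value)

# Print the JobState field of "scontrol show job <jobid>", or nothing
# if the job is unknown to the controller.
job_state() {
    scontrol show job "$1" 2>/dev/null |
        sed -n 's/.*JobState=\([A-Z_]*\).*/\1/p' | head -n 1
}

# Wait until the job is no longer pending or running.
# Returns 0 if it ended as COMPLETED, 1 for anything else.
wait_for_job() {
    local jobid=$1 state
    while :; do
        state=$(job_state "$jobid")
        case "$state" in
            PENDING|CONFIGURING|RUNNING|SUSPENDED|COMPLETING)
                sleep "$POLL_INTERVAL" ;;
            COMPLETED)
                return 0 ;;
            *)
                # FAILED, CANCELLED, TIMEOUT, ..., or no record at all
                return 1 ;;
        esac
    done
}

# Intended use in control_submits.sh (commented out here); --parsable
# makes sbatch print "jobid" or "jobid;cluster":
#   jobid=$(sbatch --parsable submit_job1.sh) || exit 1
#   wait_for_job "${jobid%%;*}" || exit 1
#   ... extra work, then submit_job2.sh ...
```

The obvious drawbacks are the latency of up to one polling interval per job and the extra load on the controller if the interval is chosen too small.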
II.3)
Submit the submit_jobN.sh scripts with sbatch and implement some
wait_for_sbatch helper that somehow speaks the SLURM protocol
directly. I'm not sure whether this is actually possible, or where
to start.
Thank you very much for your input!
Best,
Nicolai