Hello everybody,

We are having serious trouble controlling the execution of several
sbatch jobs from a shell script running on one of our submit hosts.

I.) What we want to achieve:
1.) We want to have a controlling shell script, let's call it
    control_submits.sh
2.) This control_submits.sh should run several
    sbatch submit_job1.sh
    sbatch submit_job2.sh
    ...
3.) Each submit_jobN.sh invokes mpiexec.

The point is that we want job2 to start execution only *after* job1 has
finished, and so on. Thus, we need some way to do a blocking wait for
job completion in control_submits.sh.

We can't use SLURM's --dependency facility because the existing
setup is actually far more complicated: we have more than one
control_submits.sh, and these do extra work between the job submissions.

It is important that the submit_jobN.sh scripts themselves run on some
of the allocated resources, not on the submit host.

I would like to know whether I am missing something that makes the
desired behavior easy to get and, if not, what the best way to
achieve it would be.

II.) Possible solutions:
II.1)
Replace sbatch with mysubmit.pl, which parses the "#SBATCH" options
inlined in submit_jobN.sh, creates a wrapped_submit_jobN.sh and invokes
srun <options I got out of #SBATCH> wrapped_submit_jobN.sh
where wrapped_submit_jobN.sh exits immediately if $SLURM_PROCID != 0;
otherwise, it runs the contents of the original submit_jobN.sh,
i.e. mpiexec among other things.
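
To make II.1 more concrete, here is a rough sketch of the two pieces,
written in bash rather than Perl for brevity. The function names
sbatch_options and run_on_rank0 are made up for illustration; the only
SLURM-specific thing used is the SLURM_PROCID variable that srun sets
for each task. Treating an unset SLURM_PROCID as rank 0 is my own
assumption.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of SLURM.

# Print the options from the "#SBATCH" lines of a batch script, one per line.
# Like sbatch, stop looking once the first real command line is reached.
sbatch_options() {
    awk '/^#SBATCH[ \t]/    { sub(/^#SBATCH[ \t]+/, ""); print; next }
         /^#/ || /^[ \t]*$/ { next }   # other comments and blank lines
                            { exit }   # first command: directives end here
        ' "$1"
}

# Run the given command only in the task whose SLURM_PROCID is 0.
# All other tasks return immediately with success.
# Assumption: if SLURM_PROCID is unset (not started by srun), act as rank 0.
run_on_rank0() {
    if [ "${SLURM_PROCID:-0}" -ne 0 ]; then
        return 0
    fi
    "$@"
}
```

An invocation could then look roughly like
srun $(sbatch_options submit_job1.sh) ./wrapped_submit_job1.sh
keeping in mind that the unquoted expansion splits on whitespace and
that I have not checked whether every sbatch option is accepted by srun.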

Question: Would the srun calls invoked within the submit_jobN.sh scripts
(through mpiexec) create a new allocation, or would they create their
steps under the allocation of the original srun call?

A first quick test suggests the latter, but I'm not sure.

II.2)
Submit the submit_jobN.sh scripts with sbatch as before and poll with
scontrol show job <jobid>
for completion from within control_submits.sh.
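
A minimal sketch of what the polling in II.2 might look like follows.
The function names submit_job and wait_for_job and the 30-second
default interval are made up. I am assuming that sbatch prints
"Submitted batch job <id>" and that the scontrol output contains a
"JobState=<STATE>" field. If the job record has already been purged by
the time we poll, the sketch cannot tell how the job ended.

```shell
#!/bin/bash
# Sketch only: submit_job and wait_for_job are hypothetical helpers.

# Submit a batch script and print the job id.
# Assumes sbatch prints "Submitted batch job <id>" on success.
submit_job() {
    local out
    out=$(sbatch "$@") || return 1
    # The job id is the last whitespace-separated word of that line.
    echo "${out##* }"
}

# Block until the job reaches a final state.
# Returns 0 if it COMPLETED, 1 for any other final state,
# and 2 if no job state could be read at all.
wait_for_job() {
    local jobid=$1 interval=${2:-30} state
    while :; do
        state=$(scontrol show job "$jobid" 2>/dev/null |
                sed -n 's/.*JobState=\([A-Z_]*\).*/\1/p')
        case $state in
            COMPLETED) return 0 ;;
            PENDING|CONFIGURING|RUNNING|SUSPENDED|COMPLETING)
                sleep "$interval" ;;   # not finished yet, poll again
            "") return 2 ;;            # unknown job or record already purged
            *)  return 1 ;;            # FAILED, CANCELLED, TIMEOUT, ...
        esac
    done
}
```

In control_submits.sh this would be used along the lines of
jobid=$(submit_job submit_job1.sh) && wait_for_job "$jobid"
followed by the extra work and then the next submission.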

II.3)
Submit the submit_jobN.sh scripts with sbatch and
implement some fancy wait_for_sbatch that somehow uses the SLURM
protocol directly. I'm not sure whether this is actually possible or
where to start, though.

Thank you very much for your input!

Best,

Nicolai
