On Thu, 15 Sep 2011 18:11:27 +0200
Carles Fenoy <[email protected]> wrote:
> I would suggest using srun inside the job, and not sbatch.
>
> If you submit a job with sbatch and inside use srun you can easily use
> dependencies between stages.
>
> So you will have
>
> $ sbatch --job-name=stage1 --ntasks=6 ./script_stage1.sh
> Submitted batch job 1
>
> script_stage1.sh:
>
> srun -n1 process_data.exe 1 &
> srun -n1 process_data.exe 2 &
> srun -n1 process_data.exe 3 &
> srun -n1 process_data.exe 4 &
> ...
> # You should control the number of srun running at the same time
>
> wait
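(Aside: the "control the number of srun running at the same time" comment above can be done with plain bash job control. In the sketch below, 'sleep' is only a stand-in for the 'srun -n1 process_data.exe "$i"' line, and the cap of 4 is arbitrary:)

```shell
#!/bin/bash
# Throttle background steps to at most MAX_CONCURRENT at a time.
# 'sleep 0.1' is a stand-in for: srun -n1 process_data.exe "$i"
MAX_CONCURRENT=4
OUT=$(mktemp)

for i in $(seq 1 10); do
    # Block until a slot frees up before launching the next step.
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_CONCURRENT" ]; do
        sleep 0.05
    done
    { sleep 0.1; echo "$i" >> "$OUT"; } &
done
wait   # block until every step has finished
echo "steps completed: $(wc -l < "$OUT")"
```

The `jobs -rp | wc -l` check is the usual bash idiom here: it counts the background jobs of the current shell that are still running.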
Well, yes, but that's very kludgy.
Before trying SLURM (and Torque, if that matters) I first used a set of
scripts which basically managed subjobs manually in the shell, waited for
instances to finish, etc. It kind of worked, but scheduling was far from
optimal, it did not scale, and it required too much coordination between users.
Then I switched to a script which queued jobs with "pexec" in its hypervisor
mode. Scheduling improved, but it still required too much planning.
I want to be able to dynamically scale the partition. Also, I want job
scheduling to be as simple as "queuing-command ./job". Handling parallelism
inside the job itself, as you propose, is a step backward in my opinion.
I mean, the easiest way would be
$ sbatch ./script
# ./script
while ...; do
    for x in $(seq 1 Y); do
        srun ... &
    done
    wait
done
but this does not take advantage of the full parallelism unless multiple
instances of the same script are run at the same time (the "wait" here
creates usage gaps). Not to mention that you have to provision for Y steps,
where Y must be determined manually.
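For what it's worth, bash (4.3 and later) has "wait -n", which returns as soon as *any* background job exits, so a slot can be refilled immediately instead of draining the whole batch of Y. That shrinks the gaps, though it still means managing parallelism inside the job. A sketch, again with 'sleep' standing in for the srun line:

```shell
#!/bin/bash
# Keep SLOTS steps in flight at all times; refill as soon as any one exits.
# Requires bash >= 4.3 for 'wait -n'. 'sleep' stands in for 'srun -n1 ...'.
SLOTS=4
running=0
DONE=$(mktemp)

for step in $(seq 1 12); do
    if [ "$running" -ge "$SLOTS" ]; then
        wait -n                 # returns when *any* running step finishes
        running=$((running - 1))
    fi
    { sleep 0.1; echo "$step" >> "$DONE"; } &
    running=$((running + 1))
done
wait                            # drain the remaining steps
echo "finished $(wc -l < "$DONE") steps"
```

With a plain "wait" after each batch, all SLOTS steps must finish before the next batch starts; "wait -n" keeps every slot busy until the work runs out.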
> That's the way I'll do it, but don't know if there is any other way to do
> it.
> You should also have a control of which steps in every stage have finished
> in order to be able to resume the execution in case of any node failure.
The way I see it, I would like a "swait" command. This would allow me to queue
steps, then wait for these steps to finish.
sbatch stage1.sh
# stage1.sh
for ....; do
sbatch --jobid $SLURM_JOB_ID stage1-step.sh
done
swait $SLURM_JOB_ID
#####
When "swait" is invoked inside the same job as the id given on the command
line, it should simply wait for all of that job's steps to finish, without
counting the allocation itself as a step. Better yet, if SLURM_JOB_ID is
defined, it should use it directly.
That way "swait" would map *perfectly* onto the "wait" built-in.
This way I could also implement my multi-stage workflow very easily:
sbatch multi-stage.sh
# multi-stage.sh
for ...; do
sbatch --jobid $SLURM_JOB_ID stage1-step.sh
done
swait
for ...; do
sbatch --jobid $SLURM_JOB_ID stage2-step.sh
done
swait
echo "finished"
#####
There! The dependency problem is resolved without using dependencies in the
first place. Also, managing the queue becomes *much* easier.
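Until a real "swait" exists, I suppose one could fake it in userland: record the ids of whatever was launched, then poll until none is left alive. The sketch below uses plain process ids and 'kill -0' so it runs anywhere; with SLURM one would collect job ids from "sbatch --parsable" and test liveness with "squeue -h -j <id>" instead (those options exist, but the polling would hammer slurmctld, which is exactly why a server-side swait would be nicer):

```shell
#!/bin/bash
# Userland approximation of "swait": launch steps, record their ids,
# then poll until every recorded id is gone.
# Here the ids are process ids and 'sleep' stands in for sbatch'ed steps;
# with SLURM, collect ids via 'sbatch --parsable' and test liveness with
# 'squeue -h -j "$id"' instead of 'kill -0'.
ids=()
for i in 1 2 3; do
    sleep 0.2 &                 # stand-in for: sbatch ... stage1-step.sh
    ids+=("$!")
done

swait() {
    local id
    for id in "$@"; do
        while kill -0 "$id" 2>/dev/null; do
            sleep 0.05          # poll until this step has finished
        done
    done
}

swait "${ids[@]}"
echo "all steps finished"
```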
I'm actually prepared to implement this (I'm unsure whether I could use the
API or would have to modify the slurmctld sources for it to work efficiently).
Does anybody have any suggestion/idea/comment on this? Is the idea sound?
Any recommendation as to where to start?
Thanks again.