Hi all,
I'm still puzzled by the expected behaviour of the following:
$ sbatch --hold fakejob.sh
Submitted batch job 25909273
$ sbatch --hold fakejob.sh
Submitted batch job 25909274
$ sbatch --hold fakejob.sh
Submitted batch job 25909275
$ scontrol update jobid=25909273 Dependency=singleton
$ scontrol update jobid=25909274 Dependency=singleton,after:25909275
$ scontrol update jobid=25909275 Dependency=singleton,after:25909273
$ scontrol release 25909273 25909274 25909275
I expected these to run in the order 25909273, 25909275, 25909274. However,
singletons seem to be executed in order of submission, so this setup ends up
with a circular dependency: 25909274 depends on 25909275 because of "after",
and 25909275 depends on 25909274 because of "singleton" plus submission order.
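To make the cycle concrete, here is a rough Python sketch of the "order of submission" reading. It is my own toy model, not Slurm's internal dependency code, and `build_graph` and `find_cycle` are hypothetical names. The assumption is that each singleton job implicitly waits on every earlier-submitted job with the same name and user, in addition to its explicit "after:" targets.

```python
# Toy model of the submission-order reading of "singleton" (hypothetical,
# not Slurm's implementation). Job IDs are taken from the transcript above.

# Submission order: all three jobs share the same name and user.
JOBS = [25909273, 25909274, 25909275]

# Explicit "after:" dependencies set via scontrol update.
AFTER = {
    25909273: [],
    25909274: [25909275],
    25909275: [25909273],
}


def build_graph(jobs, after):
    """Map each job to the set of jobs it waits for. Assumption: a singleton
    waits on every earlier-submitted job with the same name/user."""
    graph = {}
    for i, job in enumerate(jobs):
        waits = set(after[job])
        waits.update(jobs[:i])  # implicit singleton edges to earlier jobs
        graph[job] = waits
    return graph


def find_cycle(graph):
    """Depth-first search with three colours; returns one cycle as a list
    of job IDs (first ID repeated at the end), or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {job: WHITE for job in graph}
    path = []

    def visit(job):
        colour[job] = GREY
        path.append(job)
        for dep in sorted(graph[job]):
            if colour[dep] == GREY:  # back edge: dep is on the current path
                return path[path.index(dep):] + [dep]
            if colour[dep] == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        colour[job] = BLACK
        return None

    for job in sorted(graph):
        if colour[job] == WHITE:
            cycle = visit(job)
            if cycle:
                return cycle
    return None


print(find_cycle(build_graph(JOBS, AFTER)))
```

Under this model the search reports 25909274 -> 25909275 -> 25909274, which matches the hang I'm seeing; dropping the "after:25909275" on 25909274 leaves the graph acyclic.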
From the man page for sbatch, that wasn't really clear to me:

    singleton
        This job can begin execution after any previously launched jobs
        sharing the same job name and user have terminated.
I'm interested in writing a patch for this, but before I look into it, I need
to know what the expected behaviour is.
If "launched" means submitted to the queue, with submission order preserved,
then I should focus on detecting the circular dependency.
If "launched" means entered the running state, regardless of submission order,
then I should focus on how the dependencies are resolved.
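For comparison, a toy model of the second reading. It assumes that "after:" is satisfied once the target job has started (as the sbatch man page describes "after"), and that "singleton" only prevents two jobs with the same name and user from running at the same time. `run_order` is a hypothetical helper for illustration, not Slurm code.

```python
# Toy model of the "entered the running state" reading of "singleton"
# (hypothetical, not Slurm's scheduler). Job IDs are from the transcript.

JOBS = [25909273, 25909274, 25909275]  # submission order, same name/user
AFTER = {
    25909273: [],
    25909274: [25909275],
    25909275: [25909273],
}


def run_order(jobs, after):
    """Return the order in which jobs start. Assumptions: "after" needs the
    target to have started; singleton lets at most one same-name job run,
    so each started job runs to completion before the next one starts.
    Submission order is only used as a tie-break between eligible jobs."""
    started = set()
    order = []
    pending = list(jobs)
    while pending:
        for job in pending:
            if all(target in started for target in after[job]):
                break  # first eligible job in submission order
        else:
            raise RuntimeError(f"deadlock: no eligible job among {pending}")
        pending.remove(job)
        started.add(job)
        order.append(job)
    return order


print(run_order(JOBS, AFTER))
```

With this reading the same three jobs start as 25909273, 25909275, 25909274, which is the order I originally expected.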
Any thoughts on this?
Thanks,
Jarno
Jarno van der Kolk, PhD Phys.
Analyste principal en informatique scientifique | Senior Scientific Computing Specialist
Solutions TI | IT Solutions
Université d’Ottawa | University of Ottawa