On Wed, 18 Jan 2012 12:02:11 -0700 Moe Jette <je...@schedmd.com> wrote:
> We have held some discussions on this subject and it isn't simple to
> resolve. The best way to do this would probably be to establish
> finer-grained locking so there can be more parallelism, say by locking
> individual job records rather than the entire job list. That would
> impact quite a few sub-systems, for example how we preserve job state.
>
> If you could submit a smaller number of jobs that each have many job
> steps, that could address your problem today (say submitting 1000 jobs
> each with 1000 steps).

The problem is (but correct me if I'm wrong) that if I submit a job with 1000 steps, I have to manage the concurrency among the steps manually. I think we already discussed this on the list, when I was talking about implementing "swait" to group jobs and why, in the end, it wasn't a good solution given the way steps are actually run.

Also, I would have to request a specific CPU/node count (which basically means: the entire cluster), which I would like to avoid. Once the whole cluster is allocated to that job, all the other features such as priority would become useless.
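To make the concurrency concern concrete, here is a rough sketch of what the suggested "one job with many steps" workaround looks like as a batch script. The `--ntasks` count, the `./work` program, and the step count are all hypothetical placeholders; the point is that the script itself, not the scheduler, has to throttle how many steps run at once, and the allocation must be sized up front:

```shell
#!/bin/bash
#SBATCH --ntasks=1000        # hypothetical: the allocation must cover all steps up front

# Each srun below launches one job step inside the allocation.
# --exclusive asks SLURM not to share a task's resources between steps.
for i in $(seq 1 1000); do
    srun -n1 --exclusive ./work "$i" &   # ./work is a placeholder workload
done

# The script must wait for its own steps; SLURM's job priority,
# backfill, etc. no longer apply to the individual units of work.
wait
```

With this layout the cluster-wide allocation is held for the lifetime of the whole script, which is exactly the drawback raised above.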