On Wed, 14 Sep 2011 10:44:36 -0700, Danny Auble wrote:
Have you had a look at the HTC documentation?
http://schedmd.com/slurmdocs/high_throughput.html
Yes, I have. I was able to improve the scheduling speed by tuning the
configuration (before that, I couldn't even queue 65k jobs before
getting timeouts and abysmal performance). Meanwhile, I will update to
2.2 to get larger job counts, but still that doesn't address all my
concerns. Please be patient :)
Without knowing what your real objective is it is hard to prescribe a
real solution.
From your description it seems strange you would have the script
sbatch is calling call sbatch once again. What are you trying to
accomplish there?
Wouldn't it just be easier to run this script outside of an
allocation?
Ok, I will restate my problem in more practical terms. Please ask if
anything is unclear, or if you have any idea on how to improve the
behavior.
I'm running bioinformatics batches of various kinds on genetic data. A
typical analysis involves running a short job (~ 10 minutes) once for
each polymorphism we have (roughly 100k times in the smallest case). A
perfect candidate for distribution, since every step within a single
stage is independent.
Analyses are usually multi-stage:
- we run "stage 1" (first 100k jobs)
- collect and aggregate data (a single job depending on "stage 1")
- run "stage 2" using collected data (another 100k jobs)
- (repeat)
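For concreteness, the staged workflow above can be sketched with plain
sbatch calls plus a singleton dependency. This is only a hedged sketch:
script names like stage1.sh and collect.sh are made up, and
--dependency=singleton assumes a SLURM version that supports it. The
submit wrapper just echoes, so this is a dry run; swap the echo for the
real sbatch on a cluster:

```shell
#!/bin/sh
# Dry-run wrapper: prints the sbatch command instead of running it.
# Replace "echo sbatch" with plain "sbatch" on a real cluster.
submit() { echo sbatch "$@"; }

# Stage 1: one independent job per polymorphism, all sharing one name.
for i in 1 2 3; do                 # 1..100000 in the real case
    submit --job-name=stage1 stage1.sh "$i"
done

# Aggregation: --dependency=singleton holds this job until every other
# job with the same name and user has terminated.
submit --job-name=stage1 --dependency=singleton collect.sh
```

The same pattern repeats for each further stage, with a new shared name.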
Let's assume queuing ~200k jobs is not a problem with 2.2.
First issue: "squeue" takes forever with more than 5000 jobs. If more
than one user is scheduling a workflow like this, it becomes impossible
to use at all. Managing the queue itself also becomes impossible (e.g.
killing just the "stage 1" jobs). I would like to group the first
100k jobs under a single "id", so that I know that jobs 1-100k belong to
"stage 1".
My impression from reading the docs is that I can create an allocation
and run "steps" to achieve this behavior. sbatch or salloc is the
easiest way, but since queuing that many jobs is itself time-consuming,
running the queuing script on the queue itself seemed a perfect solution
(hence sbatch --jobid within sbatch). This method (using salloc or
sbatch) also seems to work fine if I put in a fat "sleep" to keep the
allocation alive.
Also, consider that eventually I will need to queue jobs from within a
script anyway (the final step of "stage 1" might schedule "stage 2"
itself).
Second issue: job dependencies. If I can use a single job with steps, I
can easily put the dependencies for "stage 2" on a single id and
schedule everything "outside" of slurm. If this is not possible, then I
need a barrier (like "wait" in a script, as you suggested) so that as
soon as "stage 1" finishes I can schedule the next stages from within
the batch itself.
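Failing steps, the same barrier can be built from plain job ids: capture
each submission's id and hand the whole list to --dependency=afterok. A
sketch, assuming sbatch's usual "Submitted batch job N" output line (the
fake echo lines below stand in for real submissions, so it dry-runs):

```shell
#!/bin/sh
# parse_id extracts the numeric id from "Submitted batch job N".
parse_id() { awk '{print $4}'; }

deps=""
for i in 1 2 3; do
    # dry run: a fake submission line stands in for real sbatch output
    jid=$(echo "Submitted batch job 10$i" | parse_id)
    deps="$deps:$jid"
done

echo sbatch --dependency="afterok$deps" collect.sh
# prints: sbatch --dependency=afterok:101:102:103 collect.sh
```

With 100k jobs the dependency string gets long, which is one more reason
the singleton-by-name form may be preferable.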
Right now, to work around these issues, I'm artificially limiting the
job count by scheduling N/Z jobs, where each resulting job runs Z steps
sequentially. This limits parallelism, however. To work around the
dependency issues, I'm looping in a script around "squeue" to see if
a pre-determined stage has finished. Ugly, but having people wait to
schedule more jobs (and thus letting the machines idle) is worse.
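For what it's worth, that polling loop can at least be contained in one
small helper that filters by job name (standard squeue options only; the
60-second interval is a guess, tune to taste). On a machine where squeue
is not in PATH the command substitution is empty and the function
returns immediately, which also makes it easy to dry-run:

```shell
#!/bin/sh
# Block until no job named "$1" (for the current user) remains queued
# or running. -h drops the header, -o %i prints only job ids.
wait_for_stage() {
    while [ -n "$(squeue -h -o %i -n "$1" -u "$USER" 2>/dev/null)" ]; do
        sleep 60
    done
}

# Usage as a barrier between stages:
# wait_for_stage stage1
# echo "stage 1 drained, submitting stage 2"
```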