On Wed, 14 Sep 2011 10:44:36 -0700, Danny Auble wrote:
Have you had a look at  the HTC documentation?

http://schedmd.com/slurmdocs/high_throughput.html

Yes, I have. I was able to improve the scheduling speed by tuning the configuration (before that, I couldn't even queue 65k jobs before getting timeouts and abysmal performance). Meanwhile, I will update to 2.2 to get larger job counts, but still that doesn't address all my concerns. Please be patient :)

Without knowing what your real objective is it is hard to prescribe a
real solution.

From your description it seems strange you would have the script
sbatch is calling call sbatch once again.  What are you trying to
accomplish there?
Wouldn't it just be easier to run this script outside of an allocation?

Ok, I will restate my problem in more practical terms. Please ask if anything is unclear, and any idea on how to improve the behavior is welcome.

I'm running bioinformatic batches of various kinds on genetic data. A typical analysis involves running a short job (~10 minutes) once for each polymorphism we have (roughly 100k times in the smallest case). It's a perfect candidate for distribution, since every job within a single stage is independent.

Analyses are usually multi-stage:

- we run "stage 1" (first 100k jobs)
- collect and aggregate data (a single job depending on "stage 1")
- run "stage 2" using collected data (another 100k jobs)
- (repeat)
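For concreteness, here is a minimal sketch of how I could drive this pipeline with plain sbatch dependencies (the script names are placeholders, and the colon-separated id list becomes unwieldy at 100k jobs, which is exactly part of my problem):

```shell
#!/bin/sh
# Hypothetical driver; stage1.sh, aggregate.sh, stage2.sh stand in for the
# real batch scripts.
deps=""
for snp in $(seq 1 100000); do
    # sbatch prints "Submitted batch job <id>"; grab the id for the dependency list
    id=$(sbatch stage1.sh "$snp" | awk '{print $4}')
    deps="$deps:$id"
done
# The aggregation job runs only after every stage-1 job completed successfully
agg=$(sbatch --dependency=afterok$deps aggregate.sh | awk '{print $4}')
# Stage 2 fans out again, each job waiting on the aggregation job
for snp in $(seq 1 100000); do
    sbatch --dependency=afterok:$agg stage2.sh "$snp"
done
```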

Let's assume queuing ~200k jobs is not a problem with 2.2.

First issue: "squeue" takes forever with more than 5,000 jobs. If more than one user schedules a workflow like this, it becomes impossible to use at all. Managing the queue itself is also a problem: killing just "stage 1", for instance, is impossible. I would like to group the first 100k jobs under a single "id", so that I know that jobs 1-100k belong to "stage 1".
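One partial workaround I'm considering (assuming name-based filtering holds up at this scale) is tagging every job in a stage with a common job name and using that name as the grouping handle:

```shell
# Submit every stage-1 job under one common name...
sbatch --job-name=stage1 stage1.sh "$snp"
# ...then list or kill the whole stage by name instead of by individual id
squeue -u "$USER" -n stage1
scancel -u "$USER" -n stage1
```

This still doesn't make squeue itself any faster, of course; it only makes the per-stage management possible.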

My impression from reading the docs is that I can create an allocation and run "steps" within it to achieve this behavior. salloc or sbatch is the easiest way to get one, but since queuing that many jobs is itself time-consuming, running the queuing script on the queue seemed a perfect solution (hence sbatch --jobid within sbatch). This method (using salloc or sbatch) also seems to work fine if I put a fat "sleep" in to keep the allocation alive.
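In other words, something along these lines, where "wait" would replace the fat sleep by keeping the allocation alive exactly as long as its steps run (the task count is made up):

```shell
#!/bin/sh
#SBATCH --ntasks=64
# Everything below shares one job id; each unit of work is a step, not a job.
for snp in $(seq 1 100000); do
    # --exclusive makes srun schedule each step on free tasks inside the
    # allocation, queuing the rest until tasks free up
    srun --ntasks=1 --exclusive ./stage1_step.sh "$snp" &
done
wait  # the allocation ends when the last step does; no sleep required
```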

Also, consider that eventually I will need to queue jobs within a script anyway (the ending step of "stage 1" might be scheduling "stage 2" itself).

Second issue: job dependencies. If I can use a single job with steps, I can easily put a dependency for "stage 2" on a single id and schedule everything "outside" of slurm. If this is not possible, then I need a barrier (like "wait" in a script, as you suggested) so that as soon as "stage 1" finishes I can schedule the next stages from within the batch itself.

Right now, to work around these issues, I'm artificially limiting the job count by scheduling N/Z jobs, where each resulting job runs Z steps sequentially. This limits parallelism, however. To work around the dependency issue, I'm looping in a script around "squeue" to see whether a pre-determined stage has finished. Ugly, but having people wait to schedule more jobs (and thus letting the machines idle) is worse.
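The bundling looks roughly like this (emit_bundles is a hypothetical helper; the real submitted script loops sequentially over the range it is given):

```shell
#!/bin/sh
# Sketch of the current N/Z bundling: print the "first last" range that each
# bundled job should cover, clamping the last bundle to N.
emit_bundles() {
    n=$1 z=$2 i=0
    while [ "$i" -lt "$n" ]; do
        first=$((i + 1))
        last=$((i + z))
        [ "$last" -gt "$n" ] && last=$n
        echo "$first $last"
        i=$((i + z))
    done
}

# Each submitted script then works sequentially from $1 to $2:
# emit_bundles 100000 100 | while read first last; do
#     sbatch stage1_bundle.sh "$first" "$last"
# done
```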
