I'm not sure if this will be useful or not, but your use case reminded
me of a project by Jim Garlick from a while back called "industrial
strength pipes" (ISP). This project lets you set up a chain of
dependent tasks, much like a UNIX pipeline, and it has some support
for spawning the tasks in the pipeline with srun(1). It might not map
exactly to your use case, but I thought I'd mention it nonetheless.
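I don't remember ISP's actual interface, so don't take this as its
syntax, but the basic idea is roughly what you'd get from stringing
srun invocations together in an ordinary shell pipeline ("generate",
"filter" and "aggregate" here are just placeholder programs):

    #!/bin/sh
    # Each pipeline stage runs as its own SLURM task; the shell
    # pipe carries the data between stages.
    srun --ntasks=1 ./generate input.dat \
        | srun --ntasks=1 ./filter \
        | srun --ntasks=1 ./aggregate > result.dat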
Another project that this discussion reminded me of was a set of
scripts I wrote a while back to run a personal instance of SLURM as a
SLURM job. When this nested SLURM instance was launched, commands
running within the job saw what appeared to be a full SLURM cluster of
however many nodes were in the job. You could then submit multiple
batch jobs to this nested instance (even another request for a nested
SLURM). The solution was kind of kludgy, though, and a proper
implementation was never accepted into SLURM proper, so unfortunately
no such support exists today.

mark

On Wed, 14 Sep 2011 15:09:47 -0700, Yuri D'Elia <[email protected]> wrote:
> On Wed, 14 Sep 2011 10:44:36 -0700, Danny Auble wrote:
> > Have you had a look at the HTC documentation?
> >
> > http://schedmd.com/slurmdocs/high_throughput.html
>
> Yes, I have. I was able to improve the scheduling speed by tuning the
> configuration (before that, I couldn't even queue 65k jobs before
> getting timeouts and abysmal performance). Meanwhile, I will update
> to 2.2 to get larger job counts, but still that doesn't address all
> my concerns. Please be patient :)
>
> > Without knowing what your real objective is, it is hard to
> > prescribe a real solution.
> >
> > From your description it seems strange you would have the script
> > sbatch is calling call sbatch once again. What are you trying to
> > accomplish there? Wouldn't it just be easier to run this script
> > outside of an allocation?
>
> Ok, I will restate my problem in more practical terms. Please ask if
> there's any question, or share any idea on how to improve the
> behavior.
>
> I'm running bioinformatic batches of various kinds on genetic data.
> A typical analysis involves running a short batch (~10 minutes)
> repeated for each polymorphism we have (roughly 100k times in the
> smallest case). It's a perfect candidate for distribution, since
> every step in a single stage is independent.
>
> Analyses are usually multi-stage:
>
> - we run "stage 1" (the first 100k jobs)
> - collect and aggregate the data (a single job depending on "stage 1")
> - run "stage 2" using the collected data (another 100k jobs)
> - (repeat)
>
> Let's assume queuing ~200k jobs is not a problem with 2.2.
>
> First issue: "squeue" takes forever with more than ~5000 jobs. If
> more than one user is scheduling a workflow like this, it becomes
> impossible to use at all. Also, managing the queue itself (managing
> jobs, killing just "stage 1") is impossible. I would like to group
> the first 100k jobs under a single "id", so that I know that jobs
> 1-100k belong to "stage 1".
>
> My impression from reading the docs is that I can create an
> allocation and run "steps" to achieve this behavior. sbatch or
> salloc is the easiest way, but since queuing that many jobs is also
> time-consuming, running the queuing script on the queue itself
> seemed like a perfect solution (hence sbatch --jobid within sbatch).
> This method (using salloc or sbatch) also seems to work fine if I
> put in a fat "sleep" to keep the allocation alive.
>
> Also, consider that eventually I will need to queue jobs from within
> a script anyway (the final step of "stage 1" might be scheduling
> "stage 2" itself).
>
> Second issue: job dependencies. If I can use a single job with
> steps, I can easily put dependencies for "stage 2" on a single id
> and schedule everything "outside" of slurm.
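For what it's worth, the "single job with steps" approach might look
roughly like the sketch below: one sbatch job stands in for a whole
stage, and the individual work units run as job steps inside it, so
the entire stage shares a single job id (and a single scancel kills
all of it). This is only a sketch; "analyze" is a placeholder for
your per-polymorphism program, and in practice you'd want to throttle
how many backgrounded sruns are in flight rather than forking 100k
at once:

    #!/bin/sh
    #SBATCH --ntasks=64
    # "stage 1" as a single job: each work unit is a job step.
    # srun --exclusive makes each step take its own CPUs out of
    # the allocation, deferring steps while all CPUs are busy.
    for i in $(seq 1 100000); do
        srun --ntasks=1 --exclusive ./analyze "$i" &
    done
    wait    # barrier: the job ends only when every step is done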
> If this is not possible, then I need a barrier (like "wait" in a
> script, as you suggested) so that as soon as "stage 1" finishes I
> can schedule the next stages from within the batch itself.
>
> Right now, to work around these issues, I'm artificially limiting
> the job count by scheduling N/Z jobs, where each resulting job runs
> Z steps sequentially. This limits parallelism, however. To work
> around the dependency issue, I'm looping with a script around
> "squeue" to see if a pre-determined stage has finished. Ugly, but
> having people wait to schedule more jobs (and thus letting the
> machines idle) is worse.
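And rather than looping around "squeue" to detect when a stage has
finished, you might be able to lean on sbatch's dependency support.
A minimal sketch, assuming each stage is wrapped up as a single job
as above (stage1.sh, collect.sh and stage2.sh are placeholder batch
scripts):

    #!/bin/sh
    # Chain the stages with job dependencies instead of polling
    # squeue; each later job stays pending until its predecessor
    # completes successfully. sbatch prints "Submitted batch job
    # <id>", so awk grabs the job id from the last field.
    stage1=$(sbatch stage1.sh | awk '{print $NF}')
    collect=$(sbatch --dependency=afterok:$stage1 collect.sh | awk '{print $NF}')
    sbatch --dependency=afterok:$collect stage2.sh

This also means the submitting script can exit immediately instead of
babysitting the queue, since slurm itself enforces the ordering.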
