I would suggest using srun inside the job, rather than sbatch. If you submit a job with sbatch and use srun inside it, you can easily set up dependencies between the stages.
So you will have:

$ sbatch --job-name=stage1 --ntasks=6 ./script_stage1.sh
Submitted batch job 1

script_stage1.sh:

    srun -n1 process_data.exe 1 &
    srun -n1 process_data.exe 2 &
    srun -n1 process_data.exe 3 &
    srun -n1 process_data.exe 4 &
    ...
    # You should control the number of sruns running at the same time
    wait

$ sbatch --job-name=collect_and_aggregate --dependency=afterok:1 ./collect_and_aggregate.sh
Submitted batch job 2

$ sbatch --job-name=stage2 --ntasks=6 --dependency=afterok:2 ./script_stage2.sh

That's the way I would do it, but I don't know if there is any other way. You should also keep track of which steps in each stage have finished, in order to be able to resume the execution in case of a node failure.

Regards,
Carles Fenoy
Barcelona Supercomputing Center

On Thu, Sep 15, 2011 at 12:09 AM, Yuri D'Elia <[email protected]> wrote:
> On Wed, 14 Sep 2011 10:44:36 -0700, Danny Auble wrote:
>
>> Have you had a look at the HTC documentation?
>>
>> http://schedmd.com/slurmdocs/high_throughput.html
>>
>
> Yes, I have. I was able to improve the scheduling speed by tuning the
> configuration (before that, I couldn't even queue 65k jobs before getting
> timeouts and abysmal performance). Meanwhile, I will update to 2.2 to get
> larger job counts, but still that doesn't address all my concerns. Please be
> patient :)
>
>> Without knowing what your real objective is it is hard to prescribe a
>> real solution.
>>
>> From your description it seems strange you would have the script
>> sbatch is calling call sbatch once again. What are you trying to
>> accomplish there?
>> Wouldn't it just be easier to run this script outside of an allocation?
>>
>
> Ok, I will restate my problem in a more practical manner. Please ask if
> there's any question or any idea on how to improve the behavior.
>
> I'm running bioinformatic batches of various kinds on genetic data.
> A typical analysis will involve running a short batch (~10 minutes),
> multiplied for each polymorphism we have (roughly 100k times in the smallest
> case). A perfect candidate for distribution, since every step in a single
> stage is independent.
>
> Analyses are usually multi-stage:
>
> - we run "stage 1" (first 100k jobs)
> - collect and aggregate data (a single job depending on "stage 1")
> - run "stage 2" using the collected data (another 100k jobs)
> - (repeat)
>
> Let's assume queuing ~200k jobs is not a problem with 2.2.
>
> First issue: "squeue" takes forever with more than ~5000 jobs. If more
> than one user is scheduling a workflow like this, it becomes impossible to
> use it at all. Managing the queue itself is also impossible (e.g. killing
> just "stage 1"). I would like to group the first 100k jobs under a single
> "id", so that I know that jobs 1-100k belong to "stage 1".
>
> My impression from reading the docs is that I can create an allocation and
> run "steps" to achieve this behavior. sbatch or salloc is the easiest way,
> but since queuing that many jobs is also time-consuming, running the queuing
> script on the queue itself seemed a perfect solution (hence sbatch --jobid
> within sbatch). This method (using salloc or sbatch) also seems to work fine
> if I put in a fat "sleep" to keep the allocation alive.
>
> Also, consider that eventually I will need to queue jobs from within a
> script anyway (the ending step of "stage 1" might be scheduling "stage 2"
> itself).
>
> Second issue: job dependencies. If I can use a single job with steps, I can
> easily put dependencies for "stage 2" on a single id and schedule everything
> "outside" of slurm. If this is not possible, then I need a barrier (like the
> "wait" in a script like you suggested) so that as soon as "stage 1" finishes
> I can schedule the next stages within the batch itself.
>
> Right now, to work around these issues, I'm artificially limiting the jobs
> by scheduling N/Z jobs, where each resulting job runs Z steps sequentially.
> This limits parallelism, however. To work around the dependency issues, I'm
> looping in a script around "squeue" to see whether a pre-determined stage has
> finished. Ugly, but having people wait to schedule more jobs (and thus
> letting the machines idle) is worse.

--
Carles Fenoy
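PS: regarding the "control the number of sruns running at the same time" comment in script_stage1.sh: here is a minimal bash sketch of such a throttle. It assumes bash >= 4.3 (for `wait -n`); the runner command is parameterized, so the helper name `run_throttled` and the step counts are just illustrative.

```shell
#!/bin/bash
# run_throttled MAXPAR N CMD...: launch "CMD i" for i = 1..N in the
# background, keeping at most MAXPAR of them running at any time.
# Sketch only -- assumes bash >= 4.3 for `wait -n`.
run_throttled() {
    local maxpar=$1 n=$2 i; shift 2
    for ((i = 1; i <= n; i++)); do
        "$@" "$i" &
        # Throttle: once maxpar steps are in flight, wait for one to exit.
        while (( $(jobs -rp | wc -l) >= maxpar )); do
            wait -n
        done
    done
    wait    # barrier: block until every step of this stage has finished
}

# In script_stage1.sh this would be, e.g.:
#   run_throttled 6 100000 srun -n1 process_data.exe
```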
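PPS: about killing just "stage 1" and the loop around "squeue": if every job of a stage is submitted with the same --job-name, the whole stage can be addressed as a unit via the --name filters of squeue and scancel. A sketch of the polling barrier (the stage names and the 60-second interval are just illustrative):

```shell
#!/bin/bash
# Sketch: manage a whole stage through its job name.
#   scancel --name=stage1   would kill just stage 1
#   squeue  --name=stage1   lists only stage 1's jobs

# wait_for_stage STAGE: poll squeue until no job of STAGE remains,
# i.e. the "loop around squeue" workaround described above.
wait_for_stage() {
    local stage=$1
    while squeue --noheader --name="$stage" | grep -q .; do
        sleep 60
    done
}

# Usage (on a live cluster):
#   wait_for_stage stage1 && sbatch --job-name=stage2 --ntasks=6 ./script_stage2.sh
```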
