I'm not sure if this will be useful or not, but your use case reminded
me of a project by Jim Garlick a while back called "industrial strength
pipes" (ISP). This project allows you to set up a chain of dependent
tasks much like a UNIX pipeline, and it has some support for spawning
the tasks in the pipeline with srun(1). It might not map exactly to
your use case, but I thought I'd mention it nonetheless.

Another project that this discussion reminded me of was a set of
scripts I wrote a while back to run a personal instance of SLURM
as a SLURM job. When this nested SLURM instance was launched,
commands running within the job saw what appeared to be a full SLURM
cluster made up of however many nodes were in the job. You could
then submit multiple batch jobs to this nested instance (even
another request for a nested SLURM).

The solution was kind of kludgy though, and a proper implementation
was never accepted into SLURM proper, so unfortunately no such
support exists today.

mark



On Wed, 14 Sep 2011 15:09:47 -0700, Yuri D'Elia <[email protected]> wrote:
>  On Wed, 14 Sep 2011 10:44:36 -0700, Danny Auble wrote:
> > Have you had a look at  the HTC documentation?
> >
> > http://schedmd.com/slurmdocs/high_throughput.html
> 
>  Yes, I have. I was able to improve the scheduling speed by tuning the 
>  configuration (before that, I couldn't even queue 65k jobs without 
>  getting timeouts and abysmal performance). Meanwhile, I will update to 
>  2.2 to get larger job counts, but that still doesn't address all my 
>  concerns. Please be patient :)
> 
> > Without knowing what your real objective is it is hard to prescribe a
> > real solution.
> >
> > From your description it seems strange you would have the script
> > sbatch is calling call sbatch once again.  What are you trying to
> > accomplish there?
> > Wouldn't it just be easier to run this script outside of an 
> > allocation?
> 
>  Ok, I will restate my problem in more practical terms. Please ask if 
>  anything is unclear, or if you have any ideas on how to improve the 
>  behavior.
> 
>  I'm running bioinformatic batches of various kinds on genetic data. A 
>  typical analysis involves running a short job (~10 minutes) once for 
>  each polymorphism we have (roughly 100k times in the smallest case). A 
>  perfect candidate for distribution, since every step within a single 
>  stage is independent.
> 
>  Analyses are usually multi-stage:
> 
>  - we run "stage 1" (first 100k jobs)
>  - collect and aggregate data (a single job depending on "stage 1")
>  - run "stage 2" using collected data (another 100k jobs)
>  - (repeat)
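For what it's worth, this stage structure maps fairly directly onto
sbatch's --dependency=afterok option. A rough sketch follows; stage1.sh,
collect.sh and stage2.sh are hypothetical placeholders for the real batch
scripts, and the submit wrapper just prints the command and returns a
fake id when sbatch is not on PATH (e.g. reading this off-cluster):

```shell
#!/bin/sh
# Sketch: chain the stages with job dependencies.
submit() {
    if command -v sbatch >/dev/null 2>&1; then
        sbatch "$@" | awk '{print $NF}'   # sbatch prints "Submitted batch job <id>"
    else
        echo "sbatch $*" >&2              # dry run: show the command
        echo 12345                        # fake job id
    fi
}

# Stage 1: submit the (many) independent jobs, remembering their ids.
ids=""
for i in 1 2 3; do                        # ~100k in reality
    id=$(submit stage1.sh "$i")
    ids="$ids:$id"
done

# The aggregation step starts only after every stage-1 job succeeded.
agg=$(submit --dependency=afterok"$ids" collect.sh)

# Stage 2 depends on the aggregation job alone.
submit --dependency=afterok:"$agg" stage2.sh >/dev/null
```

With 100k jobs a colon-separated id list becomes unwieldy (and may run
into command-line length limits); giving all stage-1 jobs the same name
and using --dependency=singleton is one alternative worth looking at.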
> 
>  Let's assume queuing ~200k jobs is not a problem with 2.2.
> 
>  First issue: "squeue" takes forever with more than 5000 jobs. If more 
>  than one user is scheduling a workflow like this, it becomes impossible 
>  to use at all. Managing the queue itself is also impossible (for 
>  example, killing just the "stage 1" jobs). I would like to group the 
>  first 100k jobs under a single "id", so that I know that jobs 1-100k 
>  belong to "stage 1".
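For the grouping problem specifically, this is essentially what job
arrays later became: sbatch --array did not exist at the time of this
thread but arrived in a later SLURM release (2.6, as I recall). A sketch
assuming a version that has them; stage1.sh is a placeholder script, and
the wrapper just prints the command when sbatch is unavailable:

```shell
# Sketch assuming a SLURM release with job-array support (sbatch --array).
# stage1.sh would pick its polymorphism via $SLURM_ARRAY_TASK_ID.
submit() {
    if command -v sbatch >/dev/null 2>&1; then
        sbatch "$@"
    else
        echo "sbatch $*"       # dry run when no cluster is available
    fi
}
submit --array=1-100000 stage1.sh
# All 100k tasks then share one base job id: squeue collapses the pending
# ones into a single line, and "scancel <base-id>" kills the whole stage.
```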
> 
>  My impression from reading the docs is that I can create an allocation 
>  and run "steps" to achieve this behavior. sbatch or salloc is the 
>  easiest way, but since queuing that many jobs is also time-consuming, 
>  running the queuing script on the queue itself seemed a perfect 
>  solution (hence sbatch --jobid within sbatch). This method (using 
>  salloc or sbatch) also seems to work fine if I put a fat "sleep" in to 
>  keep the allocation alive.
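The "sleep" hack shouldn't be necessary: the batch script itself can
launch the steps in the background and use wait as the barrier, so the
allocation lives exactly as long as the steps do. A sketch (the --ntasks
value is made up, echo stands in for the real ~10-minute task, and a
shim substitutes for srun so the sketch stays runnable off-cluster):

```shell
#!/bin/sh
#SBATCH --ntasks=100        # hypothetical allocation size
# Each srun below becomes a step (jobid.stepid) of this single job, so
# the whole of "stage 1" appears in squeue under one job id.
# Shim: when srun is not on PATH, strip its two flags and run locally.
command -v srun >/dev/null 2>&1 || srun() { shift 2; "$@"; }

for i in $(seq 1 4); do     # ~100k in reality
    srun -n1 --exclusive echo "task $i" &
done
wait    # barrier: the script (and the job) ends when every step is done
```

Here --exclusive asks each step for dedicated CPUs within the
allocation, so SLURM queues further steps until resources free up.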
> 
>  Also, consider that eventually I will need to queue jobs from within 
>  a script anyway (the final step of "stage 1" might schedule "stage 2" 
>  itself).
> 
>  Second issue: job dependencies. If I can use a single job with steps, 
>  I can easily make "stage 2" depend on a single id and schedule 
>  everything "outside" of slurm. If this is not possible, then I need a 
>  barrier (like "wait" in a script, as you suggested) so that as soon as 
>  "stage 1" finishes I can schedule the next stages within the batch 
>  itself.
> 
>  Right now, to work around these issues, I'm artificially limiting the 
>  number of jobs by scheduling N/Z jobs, where each resulting job runs Z 
>  steps sequentially. This limits parallelism, however. To work around 
>  the dependency issues, I'm looping with a script around "squeue" to 
>  see if a pre-determined stage has finished. Ugly, but having people 
>  wait to schedule more jobs (and thus letting the machines idle) is 
>  worse.
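That polling loop around squeue might look something like the sketch
below (the job name "stage1" and the 60-second interval are assumptions;
with no squeue on PATH the loop simply falls through, which is also why
it is safe to read as a dry run):

```shell
#!/bin/sh
# Block until no jobs named "stage1" remain for the current user;
# only then is it safe to submit the next stage.
while squeue -h -u "$USER" -n stage1 2>/dev/null | grep -q .; do
    sleep 60
done
stage1_done=1   # reached only once the queue is drained of stage-1 jobs
```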
> 
