Hi Matt,

The current sbatch functionality can get you part of the way there. You can
submit with --partition=<limited>,<default> and the job will run in whichever
partition can start it first. In your particular case (at NCCS), however,
your --qos=high is only available in the default partition, so specifying it
would leave the job unable to run in the <limited> partition.
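
As a rough illustration of the multi-partition submission, here is a small
Python sketch that only builds the sbatch command line. The partition names
"limited" and "default" and the script name job.sh are placeholders, not the
real NCCS names, and the helper build_sbatch_command is hypothetical.

```python
import shlex


def build_sbatch_command(script, partitions, qos=None):
    """Return an sbatch argv list that names several candidate partitions."""
    if not partitions:
        raise ValueError("at least one partition is required")
    # sbatch accepts a comma-separated list of partitions for --partition.
    argv = ["sbatch", "--partition=" + ",".join(partitions)]
    if qos is not None:
        # A QOS that is not available in one of the listed partitions keeps
        # the job from running there, so it is left unset by default.
        argv.append("--qos=" + qos)
    argv.append(script)
    return argv


# Placeholder names; substitute the site's real partitions and job script.
cmd = build_sbatch_command("job.sh", ["limited", "default"])
print(shlex.join(cmd))  # sbatch --partition=limited,default job.sh
```

The command is printed rather than executed; passing the list to
subprocess.run would submit it on a machine where sbatch is installed.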

Regards,
Lyn

On Fri, Apr 29, 2016 at 7:56 AM, Thompson, Matt[SCIENCE SYSTEMS AND
APPLICATIONS INC] <[email protected]> wrote:

>
> SLURM Devs,
>
> This is probably a FAQ whose answer is "nope" but my search-fu has failed
> me. We recently had a need to think about something. This is going to be a
> generic experiment because I don't want to have to remember all the details
> of the real names of qos, etc.
>
> Namely, on our cluster, let's say we have three ways to run:
>
>   1. --partition=limited
>   2. --qos=high
>   3. Default
>
> Number one is a partition that not many can submit to, is a dedicated
> chunk of the cluster, but one can only run 3 jobs in it.
>
> Number two is a qos with a high priority in the "general" "default"
> partition of the machine. This might have a limit on number of jobs (let's
> say 6, though I don't know if there is a limit) so people don't abuse it.
>
> Number three is when you just sbatch and get whatever the default is.
>
>
> Obviously, #1 is the gold standard, run until you limit out; #2 is next
> best, and #3 is least attractive.
>
> Now, we have a situation where an experiment needs to run, say 12 jobs
> that take 3 hours each. If we had our druthers, we'd submit all 12 to #1
> and all 12 would launch at once. Can't do that. You get only 3 in. So now
> go to #2, only get 6 in (assuming the general cluster partition isn't
> full). If you limit out of #2, then fall over to #3.
>
> I think you get what I want. I'd love to have a single sbatch call that
> says:
>
>   Take this job and submit such that it runs under #1,  #2,  #3, and
>   whatever can take it first wins.
>
> In our case, I can see 3 perhaps getting in right away into #1, a few more
> a bit later in #2 and then the next ones maybe when #1 is free again, or
> perhaps #3... I know the --constraint has a nice OR operator, but I'm not
> sure anything else does.
>
>
> Now, one way we can think to do this (since I don't know if you can do the
> above) is to submit 12 jobs to *each* queue-config possibility and then
> underneath, have a lockfile-managed script that holds a MasterList of all
> the possible jobs. If someone manages to get an allocation, that one pops a
> job off the MasterList, now there are 11 left, and so on.
>
> Once the MasterList is empty (aka all jobs run or running), you could then
> clean up all the queued jobs that never will run anything useful (and if
> they get an allocation, the empty MasterList would just return the
> allocation immediately).
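
A minimal sketch of the lock-protected MasterList idea described above, not
the actual script used at the site. It assumes the list is a plain text file
with one job name per line, that each job that gets an allocation calls
pop_job() before doing any work, and that flock-style locking behaves
correctly on the filesystem holding the list (worth verifying on a shared
filesystem). The function and file names are hypothetical.

```python
import fcntl
import os
import tempfile


def pop_job(master_list_path):
    """Atomically remove and return the first job in the list, or None."""
    with open(master_list_path, "r+") as f:
        # Block until this process holds an exclusive lock on the list.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            jobs = [line for line in f.read().splitlines() if line.strip()]
            if not jobs:
                return None  # list is empty: caller should exit right away
            # Rewrite the file in place without the job being claimed.
            f.seek(0)
            f.write("".join(job + "\n" for job in jobs[1:]))
            f.truncate()
            f.flush()
            return jobs[0]
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


# Small demonstration with a throwaway list of two placeholder job names.
with tempfile.TemporaryDirectory() as workdir:
    demo_list = os.path.join(workdir, "masterlist.txt")
    with open(demo_list, "w") as f:
        f.write("exp01\nexp02\n")
    print(pop_job(demo_list))  # exp01
    print(pop_job(demo_list))  # exp02
    print(pop_job(demo_list))  # None
```

A queued job that receives None has nothing left to run and can return its
allocation immediately, which matches the cleanup behaviour described above.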
>
> We have experience with this lock and masterlist (for other purposes), so
> we can do it, but as I said, it'd be nice if we could make one big meta
> sbatch call. Because it's nice to only have 12 jobs in the queue instead of
> 36 :)
>
> Matt
> --
> Matt Thompson, SSAI, Sr Scientific Programmer/Analyst
> NASA GSFC,    Global Modeling and Assimilation Office
> Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
> Phone: 301-614-6712                 Fax: 301-614-6246
> http://science.gsfc.nasa.gov/sed/bio/matthew.thompson
>
