Hi Matt,

The current sbatch functionality can get you part of the way there. You can submit with --partition=<limited>,<default> and the job will run in whichever of those partitions can start it first. In your particular case (at NCCS), --qos=high is only available in the default partition, so specifying it would render the job unable to run in the <limited> partition.
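As a rough illustration of the multi-partition submission, here is a minimal Python sketch that only builds the sbatch command line. The partition names "limited" and "general" are the placeholders from your message, and the script name run_experiment.sh and the helper build_sbatch_command are hypothetical. Only the --partition and --qos options themselves are real sbatch flags.

```python
import shlex


def build_sbatch_command(script, partitions, qos=None):
    """Build an sbatch argv listing every partition the job may start in."""
    if not partitions:
        raise ValueError("at least one partition is required")
    # A comma-separated list lets the job start in whichever partition is free first.
    cmd = ["sbatch", "--partition=" + ",".join(partitions)]
    if qos is not None:
        # The QOS applies to the job as a whole, so a QOS that is not
        # permitted in one of the listed partitions rules that partition out.
        cmd.append("--qos=" + qos)
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Hypothetical names: omit --qos so both partitions remain eligible.
    cmd = build_sbatch_command("run_experiment.sh", ["limited", "general"])
    print(shlex.join(cmd))
```

Leaving qos unset is the point of the example: it keeps the limited partition eligible, at the cost of running at default priority when the job lands in the general partition.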
Regards,
Lyn

On Fri, Apr 29, 2016 at 7:56 AM, Thompson, Matt[SCIENCE SYSTEMS AND APPLICATIONS INC] <[email protected]> wrote:

> SLURM Devs,
>
> This is probably a FAQ whose answer is "nope" but my search-fu has failed
> me. We recently had a need to think about something. This is going to be a
> generic experiment because I don't want to have to remember all the details
> of the real names of qos, etc.
>
> Namely, on our cluster, lets say we have three ways to run:
>
> 1. --partition=limited
> 2. --qos=high
> 3. Default
>
> Number one is a partition that not many can submit to, is a dedicated
> chunk of the cluster, but one can only run 3 jobs in it.
>
> Number two is a qos with a high priority in the "general" "default"
> partition of the machine. This might have a limit on number of jobs (let's
> say 6, though I don't know if there is a limit) so people don't abuse it.
>
> Number three is when you just sbatch and get whatever the default is.
>
> Obviously, #1 is the gold standard, run until you limit out; #2 is better,
> and #3 is least attractive.
>
> Now, we have a situation where an experiment needs to run, say 12 jobs
> that take 3 hours each. If we had our druthers, we'd submit all 12 to #1
> and all 12 would launch at once. Can't do that. You get only 3 in. So now
> go to #2, only get 6 in (assuming the general cluster partition isn't
> full). If you limit out of #2, then fall over to #3.
>
> I think you get what I want. I'd love to have a single sbatch call that
> says:
>
> Take this job and submit such that it runs under #1, #2, #3, and
> whatever can take it first wins.
>
> In our case, I can see 3 perhaps getting in right away into #1, a few more
> a bit later in #2 and then the next ones maybe when #1 is free again, or
> perhaps #3... I know the --constraint has a nice OR operator, but I'm not
> sure anything else does.
>
> Now, one way we can think to do this (since I don't know if you can do the
> above) is to submit 12 jobs to *each* queue-config possibility and then
> underneath, have a lockfile-managed script that holds a MasterList of all
> the possible jobs. If someone manages to get an allocation, that one pops a
> job off the MasterList, now there are 11 left, and so on.
>
> Once the MasterList is empty (aka all jobs run or running), you could then
> clean up all the queued jobs that never will run anything useful (and if
> they get an allocation, the empty MasterList would just return the
> allocation immediately).
>
> We have experience with this lock and masterlist (for other purposes), so
> we can do it, but as I said, it'd be nice if we could make one big meta
> sbatch call. Because it's nice to only have 12 jobs in the queue instead of
> 36 :)
>
> Matt
> --
> Matt Thompson, SSAI, Sr Scientific Programmer/Analyst
> NASA GSFC, Global Modeling and Assimilation Office
> Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
> Phone: 301-614-6712 Fax: 301-614-6246
> http://science.gsfc.nasa.gov/sed/bio/matthew.thompson
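For reference, the lockfile-managed MasterList described in the quoted message could look roughly like the Python sketch below. Everything here is an assumption for illustration: the MasterList is taken to be a plain text file with one job name per line, pop_job is a hypothetical helper, and locking uses fcntl.flock, whose behaviour on shared filesystems (NFS, Lustre) depends on how they are mounted and should be checked before relying on it.

```python
import fcntl
import os
import tempfile


def pop_job(masterlist_path):
    """Remove and return the first job in the list, or None if it is empty."""
    with open(masterlist_path, "r+") as f:
        # Exclusive lock: only one allocation at a time may edit the list.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            lines = [ln for ln in f.read().splitlines() if ln.strip()]
            if not lines:
                return None  # nothing left; the caller should exit right away
            job, remaining = lines[0], lines[1:]
            # Rewrite the file in place with the remaining jobs.
            f.seek(0)
            f.truncate()
            f.writelines(ln + "\n" for ln in remaining)
            f.flush()
            os.fsync(f.fileno())
            return job
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


if __name__ == "__main__":
    # Demo with a throwaway list of three hypothetical job names.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("job01\njob02\njob03\n")
    while (job := pop_job(tmp.name)) is not None:
        print("would run", job)
    os.unlink(tmp.name)
```

Each queued job would call pop_job once it gets an allocation and exit immediately on None, which hands the allocation straight back as described. The leftover queued jobs could then be removed with scancel.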
