On 3 October 2016 at 23:26, Douglas Jacobsen <dmjacob...@lbl.gov> wrote:
> Hi Lachlan,
>
> You mentioned your slurm.conf has:
>
> AccountingStorageEnforce=qos
>
> The "qos" restriction only enforces that a user is authorized to use a
> particular qos (in the qos string of the association in the slurm
> database). To enforce limits, you need to also use "limits". If you want
> to prevent partial jobs from running and potentially being killed when a
> resource runs out (only applicable for certain limits), you might also
> consider setting "safe", e.g.,
>
> AccountingStorageEnforce=limits,safe,qos
>
> http://slurm.schedmd.com/slurm.conf.html#OPT_AccountingStorageEnforce
>
> I hope that helps,
> Doug

OH! OK. I was using, rightly or wrongly, the Resource Limits page
(http://slurm.schedmd.com/resource_limits.html) for guidance on
AccountingStorageEnforce.

While I now understand, the wording under Configuration -> limits says "This
will enforce limits set to associations". I feel it could say "This will
enforce limits set to associations or QOS", or something to that effect.
Basically, I don't think the Resource Limits page goes far enough in making
explicit that setting "qos" will *only* enforce that a QOS is applied, not
that a limit assigned to a QOS will be enforced.
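For the archives, my reading of Doug's suggestion is a one-line change to our
slurm.conf (a sketch only; not yet applied as I write this):

    # Enforce the limits attached to associations and QOS, not just QOS
    # membership. "safe" additionally avoids starting jobs that a group
    # limit could later kill (per Doug's note above).
    AccountingStorageEnforce=limits,safe,qos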
Thanks, much appreciated.

Cheers
L.

------
The most dangerous phrase in the language is, "We've always done it this way."

- Grace Hopper

> ----
> Doug Jacobsen, Ph.D.
> NERSC Computer Systems Engineer
> National Energy Research Scientific Computing Center
> <http://www.nersc.gov>
> dmjacob...@lbl.gov
>
> ------------- __o
> ---------- _ '\<,_
> ----------(_)/ (_)__________________________
>
>
> On Sun, Oct 2, 2016 at 9:08 PM, Lachlan Musicman <data...@gmail.com> wrote:
>
>> I started a thread on understanding QOS, but quickly realised I had made
>> a fundamental error in my configuration. I fixed that problem last week.
>> (ref: https://groups.google.com/forum/#!msg/slurm-devel/dqL30WwmrmU/SoOMHmRVDAAJ)
>>
>> Despite those changes, the issue remains, so I would like to ask again,
>> with more background information and more analysis.
>>
>> Desired scenario: any one user can only ever have jobs adding up to 90
>> CPUs running at a time. They can submit requests for more than this, but
>> their running jobs will max out at 90 and the rest of their jobs will be
>> queued. A "CPU" here is a thread, on systems that have 2 sockets, each
>> with 10 cores, each core with 2 threads (i.e. cat /proc/cpuinfo on any
>> node reports 40 CPUs, so we configured Slurm to use 40 CPUs per node).
>>
>> Current scenario: users are getting every CPU they have requested,
>> blocking other users from the partitions.
>>
>> Our users are able to use 40 CPUs per node, so we know that every thread
>> is available as a consumable resource, as we wanted.
>>
>> When I use sinfo -o %C, the per-CPU utilization figures confirm that the
>> thread is being used as the CPU measure.
>>
>> Yet, as noted above, when I run squeue, I see that users have jobs
>> running with more than 90 CPUs in total.
>>
>> Here is an squeue that shows allocated CPUs. Note that both running users
>> have more than 90 CPUs (threads) each:
>>
>> $ squeue -o "%.4C %8q %.8i %.9P %.8j %.8u %.8T %.10M %.9l"
>> CPUS QOS       JOBID PARTITION     NAME     USER    STATE       TIME  TIME_LIMI
>>    8 normal   193424      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>>    8 normal   193423      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>>    8 normal   193422      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>>
>>   20 normal   189360      prod MuVd_WGS lij@pete  RUNNING   23:49:15 6-00:00:00
>>   20 normal   189353      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>>   20 normal   189354      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>>   20 normal   189356      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>>   20 normal   189358      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>>    8 normal   193417      prod    Halo3 kamarasi  RUNNING       0:01 1-00:00:00
>>    8 normal   193416      prod    Halo3 kamarasi  RUNNING       0:18 1-00:00:00
>>    8 normal   193415      prod    Halo3 kamarasi  RUNNING       0:19 1-00:00:00
>>    8 normal   193414      prod    Halo3 kamarasi  RUNNING       0:47 1-00:00:00
>>    8 normal   193413      prod    Halo3 kamarasi  RUNNING       2:08 1-00:00:00
>>    8 normal   193412      prod    Halo3 kamarasi  RUNNING       2:09 1-00:00:00
>>    8 normal   193411      prod    Halo3 kamarasi  RUNNING       3:24 1-00:00:00
>>    8 normal   193410      prod    Halo3 kamarasi  RUNNING       5:04 1-00:00:00
>>    8 normal   193409      prod    Halo3 kamarasi  RUNNING       5:06 1-00:00:00
>>    8 normal   193408      prod    Halo3 kamarasi  RUNNING       7:40 1-00:00:00
>>    8 normal   193407      prod    Halo3 kamarasi  RUNNING      10:48 1-00:00:00
>>    8 normal   193406      prod    Halo3 kamarasi  RUNNING      10:50 1-00:00:00
>>    8 normal   193405      prod    Halo3 kamarasi  RUNNING      11:34 1-00:00:00
>>    8 normal   193404      prod    Halo3 kamarasi  RUNNING      12:00 1-00:00:00
>>    8 normal   193403      prod    Halo3 kamarasi  RUNNING      12:10 1-00:00:00
>>    8 normal   193402      prod    Halo3 kamarasi  RUNNING      12:21 1-00:00:00
>>    8 normal   193401      prod    Halo3 kamarasi  RUNNING      12:40 1-00:00:00
>>    8 normal   193400      prod    Halo3 kamarasi  RUNNING      17:02 1-00:00:00
>>    8 normal   193399      prod    Halo3 kamarasi  RUNNING      21:03 1-00:00:00
>>    8 normal   193396      prod    Halo3 kamarasi  RUNNING      22:01 1-00:00:00
>>    8 normal   193394      prod    Halo3 kamarasi  RUNNING      23:40 1-00:00:00
>>    8 normal   193393      prod    Halo3 kamarasi  RUNNING      25:21 1-00:00:00
>>    8 normal   193390      prod    Halo3 kamarasi  RUNNING      25:58 1-00:00:00
>>
>> Yet when I run an squeue that shows Sockets/Cores/Threads as S:C:T:
>>
>> $ squeue -o "%z %q %.8i %.9P %.8j %.8u %.8T %.10M %.9l"
>> S:C:T QOS       JOBID PARTITION     NAME     USER    STATE       TIME  TIME_LIMI
>> *:*:* normal   193441      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>> *:*:* normal   193440      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>> *:*:* normal   193439      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>> ....
>>
>> i.e., no CPUs ("threads") have been requested?
>>
>> How can this be?
>>
>> The sbatch files in question look like:
>>
>> #!/bin/bash
>> #SBATCH --nodes=1
>> #SBATCH --ntasks=8
>> srun -n 1 <command>
>>
>> and
>>
>> #!/bin/bash
>> #SBATCH --nodes=1
>> #SBATCH --ntasks=20
>> srun -n 1 <command>
>>
>> Ah. Is this the problem? Neither user has requested any CPUs, only tasks.
>> The docs for sbatch and srun don't mention a way to explicitly ask for
>> threads-as-CPUs, but there is a --cpus-per-task option, which we've never
>> used because the default is 1, which is what we wanted. So the
>> accounting/priority/scheduling system hasn't accounted for that?
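(Adding a note here on reply, for the archives: the scripts above with the
defaults written out would look like the sketch below. As far as I can tell
nothing changes, since --cpus-per-task already defaults to 1 and, under our
CR_CPU setup, each allocated "CPU" is a hardware thread.)

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=8          # 8 tasks
    #SBATCH --cpus-per-task=1   # the default: 1 CPU (here, 1 thread) per task -> 8 CPUs allocated
    srun -n 1 <command>         # <command> is a placeholder, as in the original scripts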
>> Nope. I ran four tests:
>>
>> 1. #SBATCH --cpus-per-task=1
>> 2. srun -n 1 -c 1 <command>
>> 3. #SBATCH --cpus-per-task=1 AND srun -n 1 -c 1 <command>
>> 4. Setting the environment variable SLURM_CPUS_PER_TASK=1
>>
>> None of them returned any values for S:C:T. I didn't continue with the
>> permutations because I was getting the feeling that this wasn't the
>> problem.
>>
>> Now I'm at a loss. Is using Slurm with threads as CPUs the problem - is
>> it not designed to work like that?
>>
>> So the question remains: how do I effectively limit people from running
>> more than X CPUs' worth of jobs simultaneously? Or, alternatively, what
>> have I done wrong in setting up QOS such that this can happen?
>>
>> The relevant configuration details are below.
>>
>> slurm.conf defines:
>>
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU
>>
>> AccountingStorageEnforce=qos
>>
>> NodeName=stpr-res-compute[01-02] CPUs=40 RealMemory=385000 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
>> NodeName=papr-res-compute[01-09] CPUs=40 RealMemory=385000 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
>>
>> NOTES: we chose "qos" because MaxTRESPerUser isn't available on the
>> Account object, which would otherwise have let us use "limits". Assigning
>> GrpTRES on a per-association basis would require touching/managing each
>> association - not impossible, but clunky compared with using a QOS on
>> partitions.
>>
>> sacctmgr defines:
>>
>> All human users belong to QOS normal; sacctmgr show qos:
>>
>> $ sacctmgr show qos format=Name,Priority,PreemptMode,UsageFactor,MaxTRESPerUser
>>       Name   Priority PreemptMode UsageFactor     MaxTRESPU
>> ---------- ---------- ----------- ----------- -------------
>>     normal         10     cluster    1.000000        cpu=90
>> firstclass        100     cluster    1.000000
>>
>> sinfo shows:
>>
>> $ sinfo -o "%18n %9P %.11T %.4c %.8z %.6m %C"
>> HOSTNAMES          PARTITION       STATE CPUS    S:C:T MEMORY CPUS(A/I/O/T)
>> papr-res-compute08 pipeline         idle   40   2:10:2 385000 0/40/0/40
>> papr-res-compute09 pipeline         idle   40   2:10:2 385000 0/40/0/40
>> papr-res-compute08 bcl2fastq        idle   40   2:10:2 385000 0/40/0/40
>> papr-res-compute08 pathology        idle   40   2:10:2 385000 0/40/0/40
>> papr-res-compute09 pathology        idle   40   2:10:2 385000 0/40/0/40
>> papr-res-compute02 prod*           mixed   40   2:10:2 385000 36/4/0/40
>> papr-res-compute03 prod*           mixed   40   2:10:2 385000 36/4/0/40
>> papr-res-compute04 prod*           mixed   40   2:10:2 385000 36/4/0/40
>> papr-res-compute05 prod*           mixed   40   2:10:2 385000 36/4/0/40
>> papr-res-compute01 prod*       allocated   40   2:10:2 385000 40/0/0/40
>> papr-res-compute06 prod*       allocated   40   2:10:2 385000 40/0/0/40
>> papr-res-compute07 prod*       allocated   40   2:10:2 385000 40/0/0/40
>> stpr-res-compute01 debug            idle   40   2:10:2 385000 0/40/0/40
>> stpr-res-compute02 debug            idle   40   2:10:2 385000 0/40/0/40
>>
>> ------
>> The most dangerous phrase in the language is, "We've always done it this
>> way."
>>
>> - Grace Hopper
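PS - for completeness, a sketch of the two ways of expressing that 90-CPU cap
discussed in the NOTES above (the user name below is only illustrative). Either
way, per Doug's reply, the limit only takes effect once AccountingStorageEnforce
includes "limits":

    # the cap via the QOS all users share (what we have; see the sacctmgr output above)
    sacctmgr modify qos where name=normal set MaxTRESPerUser=cpu=90

    # the per-association alternative: a GrpTRES cap set on every user's association
    sacctmgr modify user where name=someuser set GrpTRES=cpu=90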