On 3 October 2016 at 23:26, Douglas Jacobsen <dmjacob...@lbl.gov> wrote:
> Hi Lachlan,
>
> You mentioned your slurm.conf has:
>
> AccountingStorageEnforce=qos
>
> The "qos" restriction only enforces that a user is authorized to use a
> particular qos (in the qos string of the association in the slurm
> database). To enforce limits, you need to also use "limits". If you want
> to prevent partial jobs from running and potentially being killed when a
> resource runs out (only applicable for certain limits), you might also
> consider setting "safe", e.g.,
>
> AccountingStorageEnforce=limits,safe,qos
>
> http://slurm.schedmd.com/slurm.conf.html#OPT_AccountingStorageEnforce
>
> I hope that helps,
> Doug

OH! OK. I was using, rightly or wrongly, the Resource Limits page
(http://slurm.schedmd.com/resource_limits.html) for guidance on
AccountingStorageEnforce.

While I now understand, the wording under Configuration -> limits says "This
will enforce limits set to associations". I feel it could say "This will
enforce limits set to associations or QOS", or something to that effect.
Basically, I don't think the Resource Limits page goes far enough in making
explicit that setting "qos" will *only* enforce that a QOS is applied, not
that a limit assigned to a QOS will be enforced.
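For the archives, my reading of Doug's suggestion is a one-line change to our
slurm.conf (a sketch only; not yet applied as I write this):

    # Enforce the limits attached to associations and QOS, not just QOS
    # membership. "safe" additionally avoids starting jobs that a group
    # limit could later kill (per Doug's note above).
    AccountingStorageEnforce=limits,safe,qos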
Thanks, much appreciated.

Cheers
L.

------
The most dangerous phrase in the language is, "We've always done it this way."

- Grace Hopper

> ----
> Doug Jacobsen, Ph.D.
> NERSC Computer Systems Engineer
> National Energy Research Scientific Computing Center
> <http://www.nersc.gov>
> dmjacob...@lbl.gov
>
> ------------- __o
> ---------- _ '\<,_
> ----------(_)/ (_)__________________________
>
>
> On Sun, Oct 2, 2016 at 9:08 PM, Lachlan Musicman <data...@gmail.com> wrote:
>
>> I started a thread on understanding QOS, but quickly realised I had made
>> a fundamental error in my configuration. I fixed that problem last week.
>> (ref: https://groups.google.com/forum/#!msg/slurm-devel/dqL30WwmrmU/SoOMHmRVDAAJ)
>>
>> Despite those changes, the issue remains, so I would like to ask again,
>> with more background information and more analysis.
>>
>> Desired scenario: any one user can only ever have jobs adding up to 90
>> CPUs running at a time. They can submit requests for more than this, but
>> their running jobs will max out at 90 and the rest of their jobs will be
>> queued. A "CPU" here is a thread, on systems that have 2 sockets, each
>> with 10 cores, each core with 2 threads (i.e. cat /proc/cpuinfo on any
>> node reports 40 CPUs, so we configured Slurm to use 40 CPUs per node).
>>
>> Current scenario: users are getting every CPU they have requested,
>> blocking other users from the partitions.
>>
>> Our users are able to use 40 CPUs per node, so we know that every thread
>> is available as a consumable resource, as we wanted.
>>
>> When I use sinfo -o %C, the per-CPU utilization figures confirm that the
>> thread is being used as the CPU measure.
>>
>> Yet, as noted above, when I run squeue, I see that users have jobs
>> running with more than 90 CPUs in total.
>>
>> Here is an squeue that shows allocated CPUs. Note that both running users
>> have more than 90 CPUs (threads) each:
>>
>> $ squeue -o "%.4C %8q %.8i %.9P %.8j %.8u %.8T %.10M %.9l"
>> CPUS QOS       JOBID PARTITION     NAME     USER    STATE       TIME  TIME_LIMI
>>    8 normal   193424      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>>    8 normal   193423      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>>    8 normal   193422      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>>
>>   20 normal   189360      prod MuVd_WGS lij@pete  RUNNING   23:49:15 6-00:00:00
>>   20 normal   189353      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>>   20 normal   189354      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>>   20 normal   189356      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>>   20 normal   189358      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>>    8 normal   193417      prod    Halo3 kamarasi  RUNNING       0:01 1-00:00:00
>>    8 normal   193416      prod    Halo3 kamarasi  RUNNING       0:18 1-00:00:00
>>    8 normal   193415      prod    Halo3 kamarasi  RUNNING       0:19 1-00:00:00
>>    8 normal   193414      prod    Halo3 kamarasi  RUNNING       0:47 1-00:00:00
>>    8 normal   193413      prod    Halo3 kamarasi  RUNNING       2:08 1-00:00:00
>>    8 normal   193412      prod    Halo3 kamarasi  RUNNING       2:09 1-00:00:00
>>    8 normal   193411      prod    Halo3 kamarasi  RUNNING       3:24 1-00:00:00
>>    8 normal   193410      prod    Halo3 kamarasi  RUNNING       5:04 1-00:00:00
>>    8 normal   193409      prod    Halo3 kamarasi  RUNNING       5:06 1-00:00:00
>>    8 normal   193408      prod    Halo3 kamarasi  RUNNING       7:40 1-00:00:00
>>    8 normal   193407      prod    Halo3 kamarasi  RUNNING      10:48 1-00:00:00
>>    8 normal   193406      prod    Halo3 kamarasi  RUNNING      10:50 1-00:00:00
>>    8 normal   193405      prod    Halo3 kamarasi  RUNNING      11:34 1-00:00:00
>>    8 normal   193404      prod    Halo3 kamarasi  RUNNING      12:00 1-00:00:00
>>    8 normal   193403      prod    Halo3 kamarasi  RUNNING      12:10 1-00:00:00
>>    8 normal   193402      prod    Halo3 kamarasi  RUNNING      12:21 1-00:00:00
>>    8 normal   193401      prod    Halo3 kamarasi  RUNNING      12:40 1-00:00:00
>>    8 normal   193400      prod    Halo3 kamarasi  RUNNING      17:02 1-00:00:00
>>    8 normal   193399      prod    Halo3 kamarasi  RUNNING      21:03 1-00:00:00
>>    8 normal   193396      prod    Halo3 kamarasi  RUNNING      22:01 1-00:00:00
>>    8 normal   193394      prod    Halo3 kamarasi  RUNNING      23:40 1-00:00:00
>>    8 normal   193393      prod    Halo3 kamarasi  RUNNING      25:21 1-00:00:00
>>    8 normal   193390      prod    Halo3 kamarasi  RUNNING      25:58 1-00:00:00
>>
>> Yet when I run an squeue that shows Sockets/Cores/Threads as S:C:T:
>>
>> $ squeue -o "%z %q %.8i %.9P %.8j %.8u %.8T %.10M %.9l"
>> S:C:T QOS       JOBID PARTITION     NAME     USER    STATE       TIME  TIME_LIMI
>> *:*:* normal   193441      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>> *:*:* normal   193440      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>> *:*:* normal   193439      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>> ....
>>
>> i.e., no CPUs ("threads") have been requested?
>>
>> How can this be?
>>
>> The sbatch files in question look like:
>>
>> #!/bin/bash
>> #SBATCH --nodes=1
>> #SBATCH --ntasks=8
>> srun -n 1 <command>
>>
>> and
>>
>> #!/bin/bash
>> #SBATCH --nodes=1
>> #SBATCH --ntasks=20
>> srun -n 1 <command>
>>
>> Ah. Is this the problem? Neither user has requested any CPUs, only tasks.
>> The docs for sbatch and srun don't mention a way to explicitly ask for
>> threads-as-CPUs, but there is a --cpus-per-task option, which we've never
>> used because the default is 1, which is what we wanted. So the
>> accounting/priority/scheduling system hasn't accounted for that?
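(Adding a note here on reply, for the archives: the scripts above with the
defaults written out would look like the sketch below. As far as I can tell
nothing changes, since --cpus-per-task already defaults to 1 and, under our
CR_CPU setup, each allocated "CPU" is a hardware thread.)

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=8          # 8 tasks
    #SBATCH --cpus-per-task=1   # the default: 1 CPU (here, 1 thread) per task -> 8 CPUs allocated
    srun -n 1 <command>         # <command> is a placeholder, as in the original scripts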
>> Nope. I ran four tests:
>>
>> 1. #SBATCH --cpus-per-task=1
>> 2. srun -n 1 -c 1 <command>
>> 3. #SBATCH --cpus-per-task=1 AND srun -n 1 -c 1 <command>
>> 4. Setting the environment variable SLURM_CPUS_PER_TASK=1
>>
>> None of them returned any values for S:C:T. I didn't continue with the
>> permutations because I was getting the feeling that this wasn't the
>> problem.
>>
>> Now I'm at a loss. Is using Slurm with threads as CPUs the problem - is
>> it not designed to work like that?
>>
>> So the question remains: how do I effectively limit people from running
>> more than X CPUs' worth of jobs simultaneously? Or, alternatively, what
>> have I done wrong in setting up QOS such that this can happen?
>>
>> The relevant configuration details are below.
>>
>> slurm.conf defines:
>>
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU
>>
>> AccountingStorageEnforce=qos
>>
>> NodeName=stpr-res-compute[01-02] CPUs=40 RealMemory=385000 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
>> NodeName=papr-res-compute[01-09] CPUs=40 RealMemory=385000 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
>>
>> NOTES: we chose "qos" because MaxTRESPerUser isn't available on the
>> Account object, which would otherwise have let us use "limits". Assigning
>> GrpTRES on a per-association basis would require touching/managing each
>> association - not impossible, but clunky compared with using a QOS on
>> partitions.
>>
>> sacctmgr defines:
>>
>> All human users belong to QOS normal; sacctmgr show qos:
>>
>> $ sacctmgr show qos format=Name,Priority,PreemptMode,UsageFactor,MaxTRESPerUser
>>       Name   Priority PreemptMode UsageFactor     MaxTRESPU
>> ---------- ---------- ----------- ----------- -------------
>>     normal         10     cluster    1.000000        cpu=90
>> firstclass        100     cluster    1.000000
>>
>> sinfo shows:
>>
>> $ sinfo -o "%18n %9P %.11T %.4c %.8z %.6m %C"
>> HOSTNAMES          PARTITION       STATE CPUS    S:C:T MEMORY CPUS(A/I/O/T)
>> papr-res-compute08 pipeline         idle   40   2:10:2 385000 0/40/0/40
>> papr-res-compute09 pipeline         idle   40   2:10:2 385000 0/40/0/40
>> papr-res-compute08 bcl2fastq        idle   40   2:10:2 385000 0/40/0/40
>> papr-res-compute08 pathology        idle   40   2:10:2 385000 0/40/0/40
>> papr-res-compute09 pathology        idle   40   2:10:2 385000 0/40/0/40
>> papr-res-compute02 prod*           mixed   40   2:10:2 385000 36/4/0/40
>> papr-res-compute03 prod*           mixed   40   2:10:2 385000 36/4/0/40
>> papr-res-compute04 prod*           mixed   40   2:10:2 385000 36/4/0/40
>> papr-res-compute05 prod*           mixed   40   2:10:2 385000 36/4/0/40
>> papr-res-compute01 prod*       allocated   40   2:10:2 385000 40/0/0/40
>> papr-res-compute06 prod*       allocated   40   2:10:2 385000 40/0/0/40
>> papr-res-compute07 prod*       allocated   40   2:10:2 385000 40/0/0/40
>> stpr-res-compute01 debug            idle   40   2:10:2 385000 0/40/0/40
>> stpr-res-compute02 debug            idle   40   2:10:2 385000 0/40/0/40
>>
>> ------
>> The most dangerous phrase in the language is, "We've always done it this
>> way."
>>
>> - Grace Hopper
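PS - for completeness, a sketch of the two ways of expressing that 90-CPU cap
discussed in the NOTES above (the user name below is only illustrative). Either
way, per Doug's reply, the limit only takes effect once AccountingStorageEnforce
includes "limits":

    # the cap via the QOS all users share (what we have; see the sacctmgr output above)
    sacctmgr modify qos where name=normal set MaxTRESPerUser=cpu=90

    # the per-association alternative: a GrpTRES cap set on every user's association
    sacctmgr modify user where name=someuser set GrpTRES=cpu=90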