[slurm-dev] Re: QOS, Limits, CPUs and threads - something is wrong?

2016-10-03 Thread Lachlan Musicman
On 3 October 2016 at 23:26, Douglas Jacobsen  wrote:

> Hi Lachlan,
>
> You mentioned your slurm.conf has:
> AccountingStorageEnforce=qos
>
> The "qos" restriction only enforces that a user is authorized to use a
> particular QOS (via the qos string of the association in the Slurm
> database).  To enforce the limits themselves, you also need the "limits"
> flag.  If you want to prevent jobs from starting only to be killed partway
> through when a resource runs out (only applicable for certain limits), you
> might also consider setting "safe", e.g.,
>
> AccountingStorageEnforce=limits,safe,qos
>
> http://slurm.schedmd.com/slurm.conf.html#OPT_AccountingStorageEnforce
>
> I hope that helps,
> Doug
>


OH!

OK. I was using, rightly or wrongly, the Resource Limits page (
http://slurm.schedmd.com/resource_limits.html ) for guidance on
AccountingStorageEnforce. While I understand now, the wording under
Configuration -> limits states "This will enforce limits set to
associations". I feel it could say "This will enforce limits set to
associations or QOSs", or something to that effect. Basically, the Resource
Limits page doesn't go far enough to make explicit that setting "qos" will
*only* enforce that a QOS is applied, not that a limit assigned to that QOS
will be applied.
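For anyone landing here later: the limit itself also has to exist on the QOS. A sketch of how that might look with sacctmgr (the QOS name "normal" and the TRES syntax are assumptions on my part; older Slurm releases used MaxCpusPerUser instead):

```shell
# Sketch only: attach a 90-CPU per-user cap to a QOS named "normal".
# Assumes slurmdbd is running and the QOS already exists.
sacctmgr modify qos normal set MaxTRESPerUser=cpu=90

# Verify the limit is recorded on the QOS:
sacctmgr show qos normal format=Name,MaxTRESPerUser
```

With AccountingStorageEnforce=limits,safe,qos in slurm.conf, this should be what actually caps a user's running jobs.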

Thanks, much appreciated.

Cheers
L.


--
The most dangerous phrase in the language is, "We've always done it this
way."

- Grace Hopper




> 
> Doug Jacobsen, Ph.D.
> NERSC Computer Systems Engineer
> National Energy Research Scientific Computing Center
> 
> dmjacob...@lbl.gov
>
> - __o
> -- _ '\<,_
> --(_)/  (_)__
>
>
> On Sun, Oct 2, 2016 at 9:08 PM, Lachlan Musicman 
> wrote:
>
>> I started a thread on understanding QOS, but quickly realised I had made
>> a fundamental error in my configuration. I fixed that problem last week.
>> (ref: https://groups.google.com/forum/#!msg/slurm-devel/dqL30WwmrmU/SoOMHmRVDAAJ )
>>
>> Despite these changes, the issue remains, so I would like to ask again,
>> with more background information and more analysis.
>>
>>
>> Desired scenario: any one user can only ever have running jobs totalling
>> 90 CPUs at a time. They can submit requests for more than this, but their
>> running jobs will max out at 90 CPUs and the rest will wait in the queue.
>> A CPU here means a thread: each node has 2 sockets, each with 10 cores,
>> and each core has 2 threads (i.e., cat /proc/cpuinfo on any node reports
>> 40 CPUs, so we configured Slurm to utilize 40 CPUs per node).
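[Editorial aside: the thread-as-CPU setup described above would correspond to something like the following slurm.conf fragment; the node names and select plugin lines are illustrative assumptions, not the poster's actual config.]

```
# Hypothetical slurm.conf fragment: count every hardware thread as a CPU.
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=node[01-10] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 CPUs=40
```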
>>
>> Current scenario: users are getting every CPU they have requested,
>> blocking other users from the partitions.
>>
>> Our users are able to use 40 CPUs per node, so we know that every thread
>> is available as a consumable resource, as we wanted.
>>
>> When I use sinfo -o %C, the per-CPU utilization figures confirm that
>> threads are being counted as CPUs.
>>
>> Yet, as noted above, when I run squeue, I see that users have jobs
>> running with more than 90 CPUs in total.
>>
>> Here is an squeue listing showing allocated CPUs. Note that both running
>> users have more than 90 CPUs (threads) each:
>>
>> $ squeue -o"%.4C %8q %.8i %.9P %.8j %.8u %.8T %.10M %.9l"
>> CPUS QOS        JOBID PARTITION     NAME     USER    STATE       TIME  TIME_LIMI
>>    8 normal    193424      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>>    8 normal    193423      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>>    8 normal    193422      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>>
>>   20 normal    189360      prod MuVd_WGS lij@pete  RUNNING   23:49:15 6-00:00:00
>>   20 normal    189353      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>>   20 normal    189354      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>>   20 normal    189356      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>>   20 normal    189358      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>>    8 normal    193417      prod    Halo3 kamarasi  RUNNING       0:01 1-00:00:00
>>    8 normal    193416      prod    Halo3 kamarasi  RUNNING       0:18 1-00:00:00
>>    8 normal    193415      prod    Halo3 kamarasi  RUNNING       0:19 1-00:00:00
>>    8 normal    193414      prod    Halo3 kamarasi  RUNNING       0:47 1-00:00:00
>>    8 normal    193413      prod    Halo3 kamarasi  RUNNING       2:08 1-00:00:00
>>    8 normal    193412      prod    Halo3 kamarasi  RUNNING       2:09 1-00:00:00
>>    8 normal    193411      prod    Halo3 kamarasi  RUNNING       3:24 1-00:00:00
>>    8 normal    193410      prod    Halo3 kamarasi  RUNNING       5:04 1-00:00:00
>>    8 normal    193409      prod    Halo3 kamarasi  RUNNING       5:06 1-00:00:00
>>    8 normal    193408      prod    Halo3 kamarasi  RUNNING       7:40 1-00:00:00
>>    8 normal
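[Editorial aside: a quick way to check per-user totals against a cap like 90 CPUs is to sum squeue output. A minimal sketch, assuming `squeue -h -o "%u %C %T"` input, where `%u`/`%C`/`%T` are squeue's user/CPU-count/state format codes:]

```python
from collections import defaultdict


def cpus_per_user(squeue_lines):
    """Sum allocated CPUs of RUNNING jobs per user.

    Expects lines shaped like 'user cpus state', as produced by
    `squeue -h -o "%u %C %T"`.
    """
    totals = defaultdict(int)
    for line in squeue_lines:
        parts = line.split()
        if len(parts) == 3 and parts[2] == "RUNNING":
            user, cpus = parts[0], int(parts[1])
            totals[user] += cpus
    return dict(totals)


sample = [
    "kamarasi 8 RUNNING",
    "kamarasi 8 RUNNING",
    "lij@pete 20 RUNNING",
    "kamarasi 8 PENDING",  # pending jobs don't count against the cap
]
print(cpus_per_user(sample))  # {'kamarasi': 16, 'lij@pete': 20}
```

Feeding it the listing above would show both running users well over 90 CPUs, which is the symptom described.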
