[slurm-dev] Slurmdb access outside of slurm commands

2016-10-18 Thread Christopher Benjamin Coffey
Hi,

We are building a webapp which will utilize data stored in the Slurm MySQL db.
Is there anything wrong with adding a read-only user that the app can use
indirectly to cache statistics? I don’t see any issue with it, but I’m curious
whether it would get in the way of any normal Slurm operations.  Any considerations
you can think of to prevent degradation of normal Slurm performance?  Thank you!
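
One way to keep such a webapp from competing with slurmdbd for the database is to query rarely and serve everything else from a cache. Below is a minimal sketch of that idea in Python, not a tested integration: `TTLCache` and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever read-only query the app would run. No Slurm table or column names are assumed.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Serve cached results and re-run the query only after ttl_seconds."""

    def __init__(self, fetch: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical read-only DB query
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]            # still fresh: no database access
        value = self._fetch(key)     # stale or missing: query once
        self._store[key] = (now, value)
        return value
```

With a TTL of a few minutes, the database would see at most one query per statistic per interval, regardless of how much traffic the webapp receives.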

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167



[slurm-dev] set maximum CPU usage per user

2016-10-18 Thread Steven Lo



Hi,

We are trying to limit CPU usage to 300 CPUs per user in our cluster.

We have tried:

sacctmgr modify qos normal set Grpcpus=300

and

sacctmgr modify user username set GrpCPUs=300


Both still allow a job asking for 308 CPUs to run.
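
For reference, the requirement amounts to the check sketched below: a job may start only if the CPUs already allocated to its user plus the CPUs it requests stay within the cap. This is only an illustration of the intended rule in Python, not Slurm's implementation; `cpus_in_use`, `fits_user_cap`, and the `(user, cpus)` input format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cpus_in_use(running: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum allocated CPUs per user from (user, cpus) pairs of running jobs."""
    usage: Dict[str, int] = defaultdict(int)
    for user, cpus in running:
        usage[user] += cpus
    return dict(usage)


def fits_user_cap(running: Iterable[Tuple[str, int]], user: str,
                  requested: int, cap: int = 300) -> bool:
    """True if `user` may start a `requested`-CPU job without exceeding `cap`."""
    return cpus_in_use(running).get(user, 0) + requested <= cap
```

Under this rule a single 308-CPU job could never start with a cap of 300, whatever else is running.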


Is there another way to implement this requirement?


Thanks in advance for your suggestions.


Steven.


[slurm-dev] nodes hang at CP status

2016-10-18 Thread Benedikt Schaefer
Hi,

I have a few nodes in the cluster on which every job hangs in the complete
state; these nodes will not return to idle.
I cannot find out why.
All nodes are running same OS (diskless image).
From the log I only see:
--
[2016-10-12T08:44:08.133] error: we don't have select plugin type 102
[2016-10-12T08:44:08.133] error: select_g_select_jobinfo_unpack: unpack error
[2016-10-12T08:44:08.133] error: Malformed RPC of type REQUEST_TERMINATE_JOB(6011) received
[2016-10-12T08:44:08.133] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2016-10-12T08:44:08.143] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
-- 

The first two lines I see on all nodes.

I have a cluster with ~550 nodes and about 5-10 nodes have this problem,
on almost every job.

Any idea?

slurm version: slurm-15.08.12-1.el7.centos.x86_64
kernel version : 3.10.0-327.13.1.el7.x86_64

Thanks.

Best regards
Benedikt

~ Benedikt Schaefer
benedikt.schae...@emea.nec.com 
~
~ Senior System Analyst
~
~ NEC Deutschland GmbH
~
~ HPCE Division
~
~ Raiffeisenstr.14, 70771 Leinfelden-Echterdingen, Germany
~
~ Tel:+49  711 780 55 21  Mobile: +49 152 22851542  Fax:+49 711 780 55 25
~ 

~ NEC Deutschland GmbH, Prinzenallee 11, D-40549 Duesseldorf
~
~ Geschaeftsfuehrer: Yuichi Kojima
~
~ Handelsregister Duesseldorf HRB 57941; VAT ID DE129424743
~







[slurm-dev] Re: How to manage priorities without a DB?

2016-10-18 Thread Nathan Smith
On 10/18/2016 10:07 AM, cfernanrodri . wrote:
> I am managing a small machine, but I am not a sysadmin.
>
> For the moment Slurm is working fine, but jobs are running with FIFO
> scheduling. I would like to implement priorities per user, without a DB.
>
> Is it possible to do it with txt accounting? If not, can I at least set up
> fair-sharing with *only* txt accounting?
>
> What do I have to add to my slurm.conf? There is not much documentation
> online.
>
> Thanks for your help

See http://slurm.schedmd.com/priority_multifactor.html#fairshare

 > Note: Computing the fair-share factor requires the installation and
 > operation of the Slurm Accounting Database to provide the assigned
 > shares and the consumed, computing resources described below.

You can use multifactor without the DB, but the fair-share factor 
requires it.
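
To make that concrete: the multifactor plugin computes a job's priority as a weighted sum of factors, each between 0.0 and 1.0, so a factor whose weight is 0 simply contributes nothing while the others still apply. The Python sketch below is a simplified illustration of that sum (it leaves out the TRES and nice terms). The weight values are made-up examples, not recommendations; `Weights` and `job_priority` are hypothetical names; and setting the fair-share and QOS weights to 0 when there is no database is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Illustrative stand-ins for the PriorityWeight* settings in slurm.conf."""
    age: int = 1000
    fairshare: int = 0   # assumption: left at 0 because there is no accounting DB
    job_size: int = 500
    partition: int = 2000
    qos: int = 0         # assumption: QOS definitions also need the accounting DB


def job_priority(w: Weights, age: float, fairshare: float, job_size: float,
                 partition: float, qos: float) -> int:
    """Weighted sum of factors; every factor must lie in [0.0, 1.0]."""
    factors = {"age": age, "fairshare": fairshare, "job_size": job_size,
               "partition": partition, "qos": qos}
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} factor {value} is outside [0.0, 1.0]")
    total = (w.age * age + w.fairshare * fairshare + w.job_size * job_size
             + w.partition * partition + w.qos * qos)
    return round(total)
```

In this sketch, jobs can still be ranked by age, size, and partition, but nothing distinguishes one user from another, which is the part that needs the database.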

-- 
Nathan Smith
Research Systems Engineer
Advanced Computing Center
Oregon Health & Science University

[slurm-dev] How to manage priorities without a DB?

2016-10-18 Thread cfernanrodri .
I am managing a small machine, but I am not a sysadmin.

For the moment Slurm is working fine, but jobs are running with FIFO
scheduling. I would like to implement priorities per user, without a DB.

Is it possible to do it with txt accounting? If not, can I at least set up
fair-sharing with *only* txt accounting?

What do I have to add to my slurm.conf? There is not much documentation
online.

Thanks for your help


[slurm-dev] Re: Reserved column on UserUtilizationByAccount sreports

2016-10-18 Thread Albert Gil Moreno
Hi!


> For the same time period, what does the Reserved column say for "sreport -T
> CPU -t MinPer cluster utilization"?  Meaning, not by account.  If that
> Reserved is 0 for the cluster overall, then that does explain why it's also
> zero for all accounts.  If there is a discrepancy, then perhaps there
> may be something to investigate.
>
In fact I've just submitted a bug for the cluster overall sreport, but just
for the GRES-Reserved values:
https://bugs.schedmd.com/show_bug.cgi?id=3187

The CPU Reserved is non-zero (so there is a discrepancy, right?):

$ sreport -T CPU,GRES/gpu -t HourPer cluster utilization \
    Format=TresName,Allocated,Reserved,Idle,Down \
    Start=`date -d "last month" +%D` End=now
Cluster Utilization 2016-09-18T00:00:00 - 2016-10-18T13:59:59
Use reported in TRES Hours/Percentage of Total

TRES Name      Allocated       Reserved           Idle           Down
--------- -------------- -------------- -------------- --------------
      cpu  10718(23.47%)     148(0.33%)  26881(58.87%)   7914(17.33%)
 gres/gpu   2882(36.72%)       0(0.00%)   4570(58.24%)     395(5.04%)


I expected the sum of the by-account Reserved values for CPU to equal the
overall CPU Reserved (as it does for Allocated).
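
That expectation can be checked mechanically from the two reports. A minimal Python sketch, assuming each cell has the `value(percent%)` shape that sreport prints in this thread; `parse_cell` and `reserved_gap` are hypothetical helper names, and the comparison is only meaningful when both reports use the same time unit.

```python
import re
from typing import Iterable, Tuple

# Matches cells such as "148(0.33%)" or "0(0.00%)".
CELL = re.compile(r"^\s*(\d+)\(([\d.]+)%\)\s*$")


def parse_cell(cell: str) -> Tuple[int, float]:
    """Split a 'value(percent%)' cell into (value, percent)."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    return int(match.group(1)), float(match.group(2))


def reserved_gap(cluster_cell: str, account_cells: Iterable[str]) -> int:
    """Cluster-wide Reserved minus the sum of the per-account Reserved values."""
    cluster_value, _ = parse_cell(cluster_cell)
    account_total = sum(parse_cell(cell)[0] for cell in account_cells)
    return cluster_value - account_total
```

With the cells reported in this thread, `reserved_gap("148(0.33%)", ["0(0.00%)"] * 3)` leaves the whole cluster-wide value unaccounted for, since every per-account cell is zero.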

The GRES/GPU Reserved value is 0 in all cases, so it's consistent... but to
me it looked like a bug, so I reported it there. Was that OK?

> Reserved time in sreport is time nodes are held idle (by the backfill
> scheduler) to start the job.  If you aren't using backfill,
>
We are using SchedulerType=sched/backfill.


> or if all job submissions request about the same quantity of hardware
> resources then it may always be zero.  If there were some users submitting
> large jobs and some small, then I would expect there to be some non-zero
> time.
>
Users are requesting different amounts of resources, but I'm not sure about
the value of the "variance"... ;-)
Anyway, I need to read and think about why it should be 0 if all the users
ask for the same number of CPUs... even if they wait in the queue?
I'm sure I'm missing something here...


Thanks!


Albert

-- 
_

OOO Albert Gil Moreno 
OOO Image Processing Group 
OOO Universitat Politècnica de Catalunya 
_


[slurm-dev] Re: Reserved column on UserUtilizationByAccount sreports

2016-10-18 Thread Douglas Jacobsen
Reserved time in sreport is time nodes are held idle (by the backfill 
scheduler) to start the job.  If you aren't using backfill, or if all 
job submissions request about the same quantity of hardware resources 
then it may always be zero.  If there were some users submitting large 
jobs and some small, then I would expect there to be some non-zero time.


For the same time period, what does the Reserved column say for "sreport 
-T CPU -t MinPer cluster utilization"?  Meaning, not by account.  If 
that Reserved is 0 for the cluster overall, then that does explain why 
it's also zero for all accounts.  If there is a discrepancy, then 
perhaps there may be something to investigate.


-Doug

On 10/18/16 2:36 AM, Albert Gil Moreno wrote:

Hi,

It seems that right now (or at least in version 15.08.9) the Reserved 
column in a UserUtilizationByAccount sreport is always 0, like Idle 
and Down.


For example:

sreport -T CPU -t MinPer cluster UserUtilizationByAccount \
    Format=TresName%4,Login,Used,Reserved,Idle,Down \
    Start=`date -d "last month" +%D` End=now

Cluster/User/Account Utilization 2016-09-18T00:00:00 - 2016-10-18T09:59:59 (2628000 secs)
Use reported in TRES Minutes/Percentage of Total

TRES     Login           Used   Reserved       Idle       Down
---- --------- -------------- ---------- ---------- ----------
 cpu   pbellot  266612(9.77%)   0(0.00%)   0(0.00%)   0(0.00%)
 cpu    mpomar  157124(5.76%)   0(0.00%)   0(0.00%)   0(0.00%)
 cpu  mbellver   61747(2.26%)   0(0.00%)   0(0.00%)   0(0.00%)




To me it's clear that Down and Idle are values that make no sense to 
query "ByAccount", but Reserved could be seen as the time that an 
account has reserved resources but not yet been allocated them; that 
is, its in-queue time?


Does that make sense to you?
Is it possible to implement?

Thanks!


Albert





[slurm-dev] WallClock time limit updated when scontrol update job xxx qos=yyy

2016-10-18 Thread Felip Moll

Hi,

I have a pending job with a time limit of 2 days. It is assigned by
default to the "normal" qos, which has a limit of 1 day.

When I realized that it was PENDING with
Reason=QOSMaxWallDurationPerJobLimit, I moved it to the lowprio qos:

$] scontrol update job 2767745 qos=lowprio

Then I checked the wall duration limit and it had switched to
1-00:00:00. I wanted to maintain the original 2 days, since I moved
from a 1-day qos to a 7-day qos.

[fmoll@head1 submission-scripts]$ sacctmgr show qos -pn
normal|100|00:00:00||cluster|||1.00||1-00:00:00|
lowprio|10|00:00:00||cluster|||1.00||7-00:00:00|
highprio|1000|00:00:00||cluster|||1.00||1-00:00:00|
machine|1|00:00:00||cluster|||1.00|||


If I switch an 8-day job from lowprio to normal, it is switched to 1 day.
If I launch an 8-day job in the "machine" qos and move it to lowprio, the
wallclock limit is switched to 7 days.


In my opinion, the time limit should not be changed when updating a
job's qos unless explicitly requested.
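
Until the behaviour is clarified, one possible workaround is to restate the desired limit in the same update. The Python sketch below only builds such a command line. It assumes that passing `TimeLimit=` together with `QOS=` keeps the limit, which has not been verified here, and `qos_update_command` is a hypothetical helper. Note that raising a job's time limit normally requires operator or administrator privileges.

```python
from typing import List, Optional


def qos_update_command(job_id: int, qos: str,
                       time_limit: Optional[str] = None) -> List[str]:
    """Build an `scontrol update` argument list that changes a job's QOS.

    If time_limit is given (for example "2-00:00:00"), it is restated
    explicitly in the same update rather than left to the scheduler.
    """
    args = ["scontrol", "update", f"JobId={job_id}", f"QOS={qos}"]
    if time_limit is not None:
        args.append(f"TimeLimit={time_limit}")
    return args
```

For the job above, `" ".join(qos_update_command(2767745, "lowprio", "2-00:00:00"))` gives a command that states both the new qos and the original 2-day limit.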


Is this the correct behaviour in Slurm 15.08.10?

--
Felip Moll Marquès
Computer Science Engineer
E-Mail - lip...@gmail.com
WebPage - http://lipix.ciutadella.es

[slurm-dev] Reserved column on UserUtilizationByAccount sreports

2016-10-18 Thread Albert Gil Moreno
Hi,

It seems that right now (or at least in version 15.08.9) the Reserved column
in a UserUtilizationByAccount sreport is always 0, like Idle and Down.

For example:

sreport -T CPU -t MinPer cluster UserUtilizationByAccount \
    Format=TresName%4,Login,Used,Reserved,Idle,Down \
    Start=`date -d "last month" +%D` End=now

Cluster/User/Account Utilization 2016-09-18T00:00:00 - 2016-10-18T09:59:59 (2628000 secs)
Use reported in TRES Minutes/Percentage of Total

TRES     Login           Used   Reserved       Idle       Down
---- --------- -------------- ---------- ---------- ----------
 cpu   pbellot  266612(9.77%)   0(0.00%)   0(0.00%)   0(0.00%)
 cpu    mpomar  157124(5.76%)   0(0.00%)   0(0.00%)   0(0.00%)
 cpu  mbellver   61747(2.26%)   0(0.00%)   0(0.00%)   0(0.00%)



To me it's clear that Down and Idle are values that make no sense to
query "ByAccount", but Reserved could be seen as the time that an
account has reserved resources but not yet been allocated them; that is,
its in-queue time?

Does that make sense to you?
Is it possible to implement?


Thanks!


Albert