[slurm-dev] Re: jobs canceled by other user than owner

2015-11-11 Thread Dr. Markus Stöhr


Dear James,

No, it can be done by any user. We have actually defined only root as an
administrative user.
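
For reference, this is roughly how we verify that (just a sketch; the exact
column layout may differ between sacctmgr versions):

# list only the users whose admin level is something other than "None"
sacctmgr -n -P list users format=User,DefaultAccount,Admin | awk -F'|' '$3 != "None"'

On our cluster this only returns root.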


br
Markus

On 11/11/2015 09:29 AM, James Oguya wrote:

Is the user '71187' a slurm administrator user?

You can confirm that by listing all users:
sacctmgr list users format=User,Account,DefaultAccount,Admin where
User=71187

On Wed, Nov 11, 2015 at 11:13 AM, Dr. Markus Stöhr wrote:


Dear all,

sorry for the duplicate post. I will continue posting with this one.

In addition: it seems that pending array jobs can be canceled by
any other user. Is this bug also found in other versions?

br
Markus


On 11/10/2015 11:30 PM, Dr. Markus Stöhr wrote:


Dear all,

we are experiencing some strange job cancellations from time to
time.
Some of our users report that their jobs have been cancelled,
but they
or an admin user did not execute such a command.

If I do a command like the following:

sacct -s CA -S 2015-04-01 -E 2015-11-30 -o
JobID%20,User,State%30 | grep
CANC x |grep by | awk '{ if ($2 != $5 ) print $1,$2,$5}'   |grep -v
CANCELLE |grep -v " 0"

I get the following output:

JobID   owner cancelled_by
1730636 70531 70497
1730647 70531 71187
1730648 70531 71187
1740541 70531 71187
1740548 70531 71187
1741050 70531 71187
1741051 70531 71187
1742205 70531 71187
1742212 70032 71187


When switching to user '71187' and executing 'scancel' on a job from
user '70032' (similar to the last entry of the output above), it is
impossible to get the job cancelled.

Is this a known issue? We are running slurm version 14.11.8.

best regards,
Markus



--
=
Dr. Markus Stöhr
Zentraler Informatikdienst BOKU Wien / TU Wien
Wiedner Hauptstraße 8-10
1040 Wien

Tel. +43-1-58801-420754 
Fax +43-1-58801-9420754 

Email: markus.sto...@tuwien.ac.at 
=




--
James Oguya



--
=
Dr. Markus Stöhr
Zentraler Informatikdienst BOKU Wien / TU Wien
Wiedner Hauptstraße 8-10
1040 Wien

Tel. +43-1-58801-420754
Fax  +43-1-58801-9420754

Email: markus.sto...@tuwien.ac.at
=


[slurm-dev] Re: jobs canceled by other user than owner

2015-11-11 Thread James Oguya
Is the user '71187' a slurm administrator user?

You can confirm that by listing all users:
sacctmgr list users format=User,Account,DefaultAccount,Admin where
User=71187

On Wed, Nov 11, 2015 at 11:13 AM, Dr. Markus Stöhr <
markus.sto...@tuwien.ac.at> wrote:

>
> Dear all,
>
> sorry for the duplicate post. I will continue posting with this one.
>
> In addition: it seems that pending array jobs can be canceled by any
> other user. Is this bug also found in other versions?
>
> br
> Markus
>
>
> On 11/10/2015 11:30 PM, Dr. Markus Stöhr wrote:
>
>>
>> Dear all,
>>
>> we are experiencing some strange job cancellations from time to time.
>> Some of our users report that their jobs have been cancelled, but they
>> or an admin user did not execute such a command.
>>
>> If I do a command like the following:
>>
>> sacct -s CA -S 2015-04-01 -E 2015-11-30 -o JobID%20,User,State%30 | grep
>> CANC x |grep by | awk '{ if ($2 != $5 ) print $1,$2,$5}'   |grep -v
>> CANCELLE |grep -v " 0"
>>
>> I get the following output:
>>
>> JobID   owner cancelled_by
>> 1730636 70531 70497
>> 1730647 70531 71187
>> 1730648 70531 71187
>> 1740541 70531 71187
>> 1740548 70531 71187
>> 1741050 70531 71187
>> 1741051 70531 71187
>> 1742205 70531 71187
>> 1742212 70032 71187
>>
>>
>> When switching to user '71187' and executing 'scancel' on a job from
>> user '70032' (similar to the last entry of the output above), it is
>> impossible to get the job cancelled.
>>
>> Is this a known issue? We are running slurm version 14.11.8.
>>
>> best regards,
>> Markus
>>
>>
>
> --
> =
> Dr. Markus Stöhr
> Zentraler Informatikdienst BOKU Wien / TU Wien
> Wiedner Hauptstraße 8-10
> 1040 Wien
>
> Tel. +43-1-58801-420754
> Fax  +43-1-58801-9420754
>
> Email: markus.sto...@tuwien.ac.at
> =
>



-- 
James Oguya


[slurm-dev] Re: Partition QoS

2015-11-11 Thread Bruce Roberts
I believe the method James describes only allows that one qos to be used
by jobs on the partition; it does not set up a Partition QOS as Paul is
looking for.


Create the QOS as James describes, but instead of AllowQOS use the QOS
option, as Danny alluded to and as documented here:
http://slurm.schedmd.com/slurm.conf.html#OPT_QOS


PartitionName=batch Nodes=compute[01-05] QOS=vip Default=NO
MaxTime=UNLIMITED State=UP


This will give you a partition QOS that isn't associated with the job,
as described in the SLUG15 presentation:
http://slurm.schedmd.com/SLUG15/Partition_QOS.pdf
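
Putting the pieces from this thread together, a minimal sketch of the whole
sequence (QOS name and limits taken from James's and Paul's examples above;
adjust as needed):

# 1. create the QOS and give it the per-user limits (these live in slurmdbd):
sacctmgr add qos Name=vip MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 MaxTRESPerUser=cpu=128

# 2. attach it as the partition QOS in slurm.conf (QOS=, not AllowQOS=):
PartitionName=batch Nodes=compute[01-05] QOS=vip Default=NO MaxTime=UNLIMITED State=UP

# 3. have slurmctld re-read slurm.conf:
scontrol reconfigure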


On 11/11/15 08:09, Paul Edmon wrote:

Thanks for the insight.  I will try it out on my end.

-Paul Edmon-

On 11/11/2015 3:31 AM, Dennis Mungai wrote:

Re: [slurm-dev] Re: Partition QoS

Thanks a lot, James. Helped a bunch ;-)

From: James Oguya [mailto:oguyaja...@gmail.com]
Sent: Wednesday, November 11, 2015 11:21 AM
To: slurm-dev
Subject: [slurm-dev] Re: Partition QoS

I was able to do this on slurm 15.08.3 on a test environment. Here 
are my notes:


- create qos

sacctmgr add qos Name=vip MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 
MaxTRESPerUser=cpu=128


- create partition in slurm.conf

PartitionName=batch Nodes=compute[01-05] AllowQOS=vip Default=NO 
MaxTime=UNLIMITED State=UP


I believe the key here is to specify the QoS name when creating
the partition; in my case, I used AllowQOS=vip


James

On Wed, Nov 11, 2015 at 4:16 AM, Danny Auble wrote:


I'm guessing you also added the qos to your partition line in
your slurm.conf as well.

On November 10, 2015 5:08:24 PM PST, Paul Edmon wrote:

I did that but it didn't pick it up.  I must need to
reconfigure again after I made the qos.  I will have to try
it again.  Let you know how it goes.

-Paul Edmon-

On 11/10/2015 5:40 PM, Douglas Jacobsen wrote:

Hi Paul,

I did this by creating the qos, e.g. sacctmgr create qos
part_whatever

Then in slurm.conf setting qos=part_whatever in the
"whatever" partition definition.

scontrol reconfigure

finally, set the limits on the qos:

sacctmgr modify qos set MaxJobsPerUser=5 where
name=part_whatever

...

-Doug




Doug Jacobsen, Ph.D.

NERSC Computer Systems Engineer

National Energy Research Scientific Computing Center


- __o
-- _ '\<,_
--(_)/  (_)__

On Tue, Nov 10, 2015 at 2:29 PM, Paul Edmon wrote:


In 15.08 you are able to set QoS limits directly on a
partition. So how do you actually accomplish this?
I've tried a couple of ways, but no luck. I haven't
seen a demo of how to do this anywhere either. My
goal is to set up a partition with the following QoS
parameters:

MaxJobsPerUser=5
MaxSubmitJobsPerUser=5
MaxCPUsPerUser=128

Thanks for the info.

-Paul Edmon-



--

James Oguya








[slurm-dev] Re: Partition QoS

2015-11-11 Thread Paul Edmon

Thanks for the insight.  I will try it out on my end.

-Paul Edmon-

On 11/11/2015 3:31 AM, Dennis Mungai wrote:

Re: [slurm-dev] Re: Partition QoS

Thanks a lot, James. Helped a bunch ;-)

From: James Oguya [mailto:oguyaja...@gmail.com]
Sent: Wednesday, November 11, 2015 11:21 AM
To: slurm-dev
Subject: [slurm-dev] Re: Partition QoS

I was able to do this on slurm 15.08.3 on a test environment. Here are 
my notes:


- create qos

sacctmgr add qos Name=vip MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 
MaxTRESPerUser=cpu=128


- create partition in slurm.conf

PartitionName=batch Nodes=compute[01-05] AllowQOS=vip Default=NO 
MaxTime=UNLIMITED State=UP


I believe the key here is to specify the QoS name when creating
the partition; in my case, I used AllowQOS=vip


James

On Wed, Nov 11, 2015 at 4:16 AM, Danny Auble wrote:


I'm guessing you also added the qos to your partition line in your
slurm.conf as well.

On November 10, 2015 5:08:24 PM PST, Paul Edmon wrote:

I did that but it didn't pick it up.  I must need to
reconfigure again after I made the qos.  I will have to try it
again.  Let you know how it goes.

-Paul Edmon-

On 11/10/2015 5:40 PM, Douglas Jacobsen wrote:

Hi Paul,

I did this by creating the qos, e.g. sacctmgr create qos
part_whatever

Then in slurm.conf setting qos=part_whatever in the
"whatever" partition definition.

scontrol reconfigure

finally, set the limits on the qos:

sacctmgr modify qos set MaxJobsPerUser=5 where
name=part_whatever

...

-Doug




Doug Jacobsen, Ph.D.

NERSC Computer Systems Engineer

National Energy Research Scientific Computing Center


- __o
-- _ '\<,_
--(_)/  (_)__

On Tue, Nov 10, 2015 at 2:29 PM, Paul Edmon wrote:


In 15.08 you are able to set QoS limits directly on a
partition. So how do you actually accomplish this?
I've tried a couple of ways, but no luck. I haven't
seen a demo of how to do this anywhere either. My goal
is to set up a partition with the following QoS
parameters:

MaxJobsPerUser=5
MaxSubmitJobsPerUser=5
MaxCPUsPerUser=128

Thanks for the info.

-Paul Edmon-



--

James Oguya






[slurm-dev] Re: Partition QoS

2015-11-11 Thread Dennis Mungai
Thanks a lot, James. Helped a bunch ;-)

From: James Oguya [mailto:oguyaja...@gmail.com]
Sent: Wednesday, November 11, 2015 11:21 AM
To: slurm-dev 
Subject: [slurm-dev] Re: Partition QoS

I was able to do this on slurm 15.08.3 on a test environment. Here are my notes:

- create qos
sacctmgr add qos Name=vip MaxJobsPerUser=5 MaxSubmitJobsPerUser=5 
MaxTRESPerUser=cpu=128

- create partition in slurm.conf
PartitionName=batch Nodes=compute[01-05] AllowQOS=vip Default=NO 
MaxTime=UNLIMITED State=UP

I believe the key here is to specify the QoS name when creating the
partition; in my case, I used AllowQOS=vip


James


On Wed, Nov 11, 2015 at 4:16 AM, Danny Auble wrote:
I'm guessing you also added the qos to your partition line in your slurm.conf 
as well.

On November 10, 2015 5:08:24 PM PST, Paul Edmon wrote:
I did that but it didn't pick it up.  I must need to reconfigure again after I 
made the qos.  I will have to try it again.  Let you know how it goes.

-Paul Edmon-
On 11/10/2015 5:40 PM, Douglas Jacobsen wrote:
Hi Paul,

I did this by creating the qos, e.g. sacctmgr create qos part_whatever
Then in slurm.conf setting qos=part_whatever in the "whatever" partition 
definition.

scontrol reconfigure

finally, set the limits on the qos:
sacctmgr modify qos set MaxJobsPerUser=5 where name=part_whatever
...

-Doug


Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center

- __o
-- _ '\<,_
--(_)/  (_)__


On Tue, Nov 10, 2015 at 2:29 PM, Paul Edmon wrote:

In 15.08 you are able to set QoS limits directly on a partition.  So how do you 
actually accomplish this?  I've tried a couple of ways, but no luck.  I haven't 
seen a demo of how to do this anywhere either.  My goal is to set up a 
partition with the following QoS parameters:

MaxJobsPerUser=5
MaxSubmitJobsPerUser=5
MaxCPUsPerUser=128

Thanks for the info.

-Paul Edmon-





--
James Oguya



[slurm-dev] Re: Big NHC Announcement and 1.4.2 Release

2015-11-11 Thread Michael Jennings
On Nov 11, 2015 8:52 PM, "Eliot Eshelman"  wrote:

> I'm certainly thrilled to hear this news!

:-)

> Hosting NHC on github will make patch submissions much easier for me.
> Hopefully others agree.

I hope so too!  It will definitely be easier on me; I've been using Git
(via git-svn) for my development work for a very long time now, but I could
never truly take advantage of its capabilities (or GitHub's either).  It's
been a huge amount of work to migrate, but it will be worth every bit of it
just to finally be able to accept PRs!  ;-)

> NHC and SLURM have been playing together nicely for us.

Us too!  Almost all of our systems are SLURM now, and we use NHC heavily
for monitoring, patrolling, service management, enforcement, and cleanup.
I've also been working on a mechanism that uses squeue to generate a
node-specific $NHC_AUTH_USERS for enforcement/cleanup purposes which I'll
share when ready!
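
In the meantime, the rough shape of it is something like this (only a sketch,
not the final code; the real version will handle corner cases better):

# node-local whitelist: root plus every user that currently owns a job on this node
NHC_AUTH_USERS="root $(squeue -h -w "$(hostname -s)" -o '%u' | sort -u | tr '\n' ' ')"
export NHC_AUTH_USERS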

Thanks as always; I look forward to your first Pull Request! ;-)

Michael


[slurm-dev] Re: Big NHC Announcement and 1.4.2 Release

2015-11-11 Thread Eliot Eshelman
I'm certainly thrilled to hear this news!

Hosting NHC on github will make patch submissions much easier for me.
Hopefully others agree.

NHC and SLURM have been playing together nicely for us.

Eliot

On Wed, Nov 11, 2015 at 9:36 PM, Michael Jennings  wrote:

>
> (I know this is the SLURM list, but many of the folks here use NHC
> with SLURM, so I'm hoping it's not a problem.  If it is, please accept
> my humble apologies!)
>
>
> To all users of Warewulf NHC:
>
> [TL;DR:  NHC is now its own project with a new name, new Git repository
> (GitHub AND BitBucket), new mailing lists, and new real-time chat
> resources!  New version 1.4.2 has been released as well.  See below
> for details and URLs!]
>
>
> Since its initial public release in early 2012, our work on the Node
> Health Check (NHC) tools has been performed and published under the
> umbrella of the Warewulf Project (http://warewulf.lbl.gov/).  This was
> done for a number of reasons -- sharing of resources, mutual
> promotion, etc.  However, this created one very large and unexpected
> downside:  user confusion.
>
> You see, many users interpreted "Warewulf Node Health Check" to mean
> that it was only intended/suitable for use on cluster nodes managed by
> Warewulf, or that it required Warewulf in order to function properly,
> or any number of other misunderstandings, the end result of which was
> that potential users opted not to give it a try!
>
> Our #1 primary goal since the very start was to build a community of
> users around NHC to share and exchange ideas, health checks, tools,
> and other code to maximize our collective ability to deliver excellent
> service and unsurpassed availability to our customers -- the
> scientists and researchers tasked with no less than literally changing
> and saving our world, and anything that gets in the way of that goal
> is a problem.  A big problem.
>
> So today I am thrilled to announce the creation of an independent
> project:  Lawrence Berkeley National Laboratory (LBNL) Node Health
> Check.  All development work on what used to be Warewulf NHC has now
> moved over to LBNL NHC and will continue under that identity going
> forward.  As a result, NHC will no longer be making use of Warewulf
> project resources for its primary development activities.
>
> As such, new discussion forums have been created, and users interested
> in following or discussing the ongoing development of LBNL Node Health
> Check will want to join these groups, either by subscribing to them as
> mailing lists or by using the online web forums.  Users of NHC should
> join the Users' list (nhc+subscr...@lbl.gov or
> https://groups.google.com/a/lbl.gov/forum/?hl=en#!forum/nhc), and
> those interested in following development or contributing code should
> also join the Developers' list (nhc-devel+subscr...@lbl.gov or
> https://groups.google.com/a/lbl.gov/forum/?hl=en#!forum/nhc-devel).
> The Users' list will likely be very low-traffic; the Developers' list
> receives development activity notifications, and will therefore see a
> bit more traffic, but still no more than a handful of messages each
> day.
>
> Perhaps the most significant and most user-visible change is that the
> source code repository has been moved over to GitHub.  The front page
> for the repository (which also doubles as the project home page and
> documentation page) is now at https://github.com/mej/nhc.  (For the
> repository URL, just add ".git" on the end, or use
> git+ssh://g...@github.com/mej/nhc.git if you're a GitHub user.)  For
> those who prefer Atlassian's Bitbucket instead, we have an equivalent
> site for NHC at https://bitbucket.org/mej0/nhc (git repo
> https://bitbucket.org/mej0/nhc.git or
> git+ssh://g...@bitbucket.org:mej0/nhc.git).  This means anyone wishing
> to contribute to NHC development -- by fixing bugs, adding new checks,
> updating documentation, writing unit tests, or even enhancing the
> default example configuration file -- can now do so quickly and easily
> using the facilities of Git and GitHub/Bitbucket rather than having to
> e-mail patches around everywhere.  We can finally take advantage of
> modern development technologies and collaboration features such as
> Issues, Pull Requests, Forks, and more...and I hope many of you will!
>
> Last, but certainly not least, we will be offering multiple options
> for those who like to use real-time chat for communications, Q&A, and
> troubleshooting assistance.  For the old-school folks who prefer
> strictly text-based chat, traditional IRC will continue to be
> available; we have created the channel #lbnl-nhc on irc.freenode.net.
> Users may also elect to use Gitter instead; GitHub users can access
> https://gitter.im/mej/nhc with their GitHub credentials.  For those
> familiar with Slack, invitations are available to the NHC Slack
> instance (at https://mej.slack.com/messages/nhc/) by contacting me
> privately.  And finally we are testing out Ryver (ryver.com) as a
> possible 

[slurm-dev] Big NHC Announcement and 1.4.2 Release

2015-11-11 Thread Michael Jennings

(I know this is the SLURM list, but many of the folks here use NHC
with SLURM, so I'm hoping it's not a problem.  If it is, please accept
my humble apologies!)


To all users of Warewulf NHC:

[TL;DR:  NHC is now its own project with a new name, new Git repository
(GitHub AND BitBucket), new mailing lists, and new real-time chat
resources!  New version 1.4.2 has been released as well.  See below
for details and URLs!]


Since its initial public release in early 2012, our work on the Node
Health Check (NHC) tools has been performed and published under the
umbrella of the Warewulf Project (http://warewulf.lbl.gov/).  This was
done for a number of reasons -- sharing of resources, mutual
promotion, etc.  However, this created one very large and unexpected
downside:  user confusion.

You see, many users interpreted "Warewulf Node Health Check" to mean
that it was only intended/suitable for use on cluster nodes managed by
Warewulf, or that it required Warewulf in order to function properly,
or any number of other misunderstandings, the end result of which was
that potential users opted not to give it a try!

Our #1 primary goal since the very start was to build a community of
users around NHC to share and exchange ideas, health checks, tools,
and other code to maximize our collective ability to deliver excellent
service and unsurpassed availability to our customers -- the
scientists and researchers tasked with no less than literally changing
and saving our world, and anything that gets in the way of that goal
is a problem.  A big problem.

So today I am thrilled to announce the creation of an independent
project:  Lawrence Berkeley National Laboratory (LBNL) Node Health
Check.  All development work on what used to be Warewulf NHC has now
moved over to LBNL NHC and will continue under that identity going
forward.  As a result, NHC will no longer be making use of Warewulf
project resources for its primary development activities.

As such, new discussion forums have been created, and users interested
in following or discussing the ongoing development of LBNL Node Health
Check will want to join these groups, either by subscribing to them as
mailing lists or by using the online web forums.  Users of NHC should
join the Users' list (nhc+subscr...@lbl.gov or
https://groups.google.com/a/lbl.gov/forum/?hl=en#!forum/nhc), and
those interested in following development or contributing code should
also join the Developers' list (nhc-devel+subscr...@lbl.gov or
https://groups.google.com/a/lbl.gov/forum/?hl=en#!forum/nhc-devel).
The Users' list will likely be very low-traffic; the Developers' list
receives development activity notifications, and will therefore see a
bit more traffic, but still no more than a handful of messages each
day.

Perhaps the most significant and most user-visible change is that the
source code repository has been moved over to GitHub.  The front page
for the repository (which also doubles as the project home page and
documentation page) is now at https://github.com/mej/nhc.  (For the
repository URL, just add ".git" on the end, or use
git+ssh://g...@github.com/mej/nhc.git if you're a GitHub user.)  For
those who prefer Atlassian's Bitbucket instead, we have an equivalent
site for NHC at https://bitbucket.org/mej0/nhc (git repo
https://bitbucket.org/mej0/nhc.git or
git+ssh://g...@bitbucket.org:mej0/nhc.git).  This means anyone wishing
to contribute to NHC development -- by fixing bugs, adding new checks,
updating documentation, writing unit tests, or even enhancing the
default example configuration file -- can now do so quickly and easily
using the facilities of Git and GitHub/Bitbucket rather than having to
e-mail patches around everywhere.  We can finally take advantage of
modern development technologies and collaboration features such as
Issues, Pull Requests, Forks, and more...and I hope many of you will!

Last, but certainly not least, we will be offering multiple options
for those who like to use real-time chat for communications, Q&A, and
troubleshooting assistance.  For the old-school folks who prefer
strictly text-based chat, traditional IRC will continue to be
available; we have created the channel #lbnl-nhc on irc.freenode.net.
Users may also elect to use Gitter instead; GitHub users can access
https://gitter.im/mej/nhc with their GitHub credentials.  For those
familiar with Slack, invitations are available to the NHC Slack
instance (at https://mej.slack.com/messages/nhc/) by contacting me
privately.  And finally we are testing out Ryver (ryver.com) as a
possible alternative to Slack, so if you're interested in
using/helping test our Ryver instance (at
https://lbnl.ryver.com/index.html#channel/4), let me know!

To top it all off, we've released version 1.4.2 of LBNL Node Health
Check to kick things off right!  This release offers a couple new
checks, new features for some of the existing checks, and of course
completely updated/refactored MarkDown-based documentation for the
move to GitHub.  

[slurm-dev] Re: jobs canceled by other user than owner

2015-11-11 Thread Dr. Markus Stöhr


Dear all,

sorry for the duplicate post. I will continue posting with this one.

In addition: it seems that pending array jobs can be canceled by any
other user. Is this bug also found in other versions?
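
If anyone wants to check their own installation, a minimal way to reproduce
(sketch only; replace the placeholder with the job ID that sbatch reports):

# as user A: submit a job array that stays pending for a while
sbatch --array=1-10 --wrap "sleep 3600"

# as an unrelated, non-admin user B: try to cancel user A's array
scancel <jobid_of_A>    # expected: a permission-denied error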


br
Markus

On 11/10/2015 11:30 PM, Dr. Markus Stöhr wrote:


Dear all,

we are experiencing some strange job cancellations from time to time.
Some of our users report that their jobs have been cancelled, but they
or an admin user did not execute such a command.

If I do a command like the following:

sacct -s CA -S 2015-04-01 -E 2015-11-30 -o JobID%20,User,State%30 | grep
CANC x |grep by | awk '{ if ($2 != $5 ) print $1,$2,$5}'   |grep -v
CANCELLE |grep -v " 0"

I get the following output:

JobID   owner cancelled_by
1730636 70531 70497
1730647 70531 71187
1730648 70531 71187
1740541 70531 71187
1740548 70531 71187
1741050 70531 71187
1741051 70531 71187
1742205 70531 71187
1742212 70032 71187


When switching to user '71187' and executing 'scancel' on a job from
user '70032' (similar to the last entry of the output above), it is
impossible to get the job cancelled.
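
For the record, a somewhat more robust variant of the query above (a sketch;
it relies on our usernames being numeric and matching the UID that sacct
prints after 'CANCELLED by'):

sacct -X -n -P -s CA -S 2015-04-01 -E 2015-11-30 -o JobID,User,State \
  | awk -F'|' '$3 ~ /^CANCELLED by/ { split($3, s, " "); if (s[3] != $2 && s[3] != "0") print $1, $2, s[3] }'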

Is this a known issue? We are running slurm version 14.11.8.

best regards,
Markus




--
=
Dr. Markus Stöhr
Zentraler Informatikdienst BOKU Wien / TU Wien
Wiedner Hauptstraße 8-10
1040 Wien

Tel. +43-1-58801-420754
Fax  +43-1-58801-9420754

Email: markus.sto...@tuwien.ac.at
=


[slurm-dev] Re: Partition QoS

2015-11-11 Thread James Oguya
I was able to do this on slurm 15.08.3 on a test environment. Here are my
notes:

- create qos
sacctmgr add qos Name=vip MaxJobsPerUser=5 MaxSubmitJobsPerUser=5
MaxTRESPerUser=cpu=128

- create partition in slurm.conf
PartitionName=batch Nodes=compute[01-05] AllowQOS=vip Default=NO
MaxTime=UNLIMITED State=UP

I believe the key here is to specify the QoS name when creating the
partition; in my case, I used AllowQOS=vip


James


On Wed, Nov 11, 2015 at 4:16 AM, Danny Auble  wrote:

> I'm guessing you also added the qos to your partition line in your
> slurm.conf as well.
>
> On November 10, 2015 5:08:24 PM PST, Paul Edmon 
> wrote:
>>
>> I did that but it didn't pick it up.  I must need to reconfigure again
>> after I made the qos.  I will have to try it again.  Let you know how it
>> goes.
>>
>> -Paul Edmon-
>>
>> On 11/10/2015 5:40 PM, Douglas Jacobsen wrote:
>>
>> Hi Paul,
>>
>> I did this by creating the qos, e.g. sacctmgr create qos part_whatever
>> Then in slurm.conf setting qos=part_whatever in the "whatever" partition
>> definition.
>>
>> scontrol reconfigure
>>
>> finally, set the limits on the qos:
>> sacctmgr modify qos set MaxJobsPerUser=5 where name=part_whatever
>> ...
>>
>> -Doug
>>
>> 
>> Doug Jacobsen, Ph.D.
>> NERSC Computer Systems Engineer
>> National Energy Research Scientific Computing Center
>> 
>>
>> - __o
>> -- _ '\<,_
>> --(_)/  (_)__
>>
>>
>> On Tue, Nov 10, 2015 at 2:29 PM, Paul Edmon 
>> wrote:
>>
>>>
>>> In 15.08 you are able to set QoS limits directly on a partition.  So how
>>> do you actually accomplish this?  I've tried a couple of ways, but no
>>> luck.  I haven't seen a demo of how to do this anywhere either.  My goal is
>>> to set up a partition with the following QoS parameters:
>>>
>>> MaxJobsPerUser=5
>>> MaxSubmitJobsPerUser=5
>>> MaxCPUsPerUser=128
>>>
>>> Thanks for the info.
>>>
>>> -Paul Edmon-
>>>
>>
>>
>>


-- 
James Oguya