[slurm-users] Re: need to set From: address for slurm

2024-06-07 Thread Paul Edmon via slurm-users
There is no way to do it in slurm. You have to do it in the mail program 
you are using to send mail. In our case we use postfix and we set 
smtp_generic_maps to accomplish this.
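
For reference, a minimal sketch of that Postfix setup (the map path and 
addresses here are made-up examples, not our actual config):

# /etc/postfix/main.cf
smtp_generic_maps = hash:/etc/postfix/generic

# /etc/postfix/generic
slurm@node01.cluster.example.edu    slurm-node01@example.edu

# rebuild the lookup table and reload
postmap /etc/postfix/generic
postfix reload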


-Paul Edmon-

On 6/7/2024 3:33 PM, Vanhorn, Mike via slurm-users wrote:


All,

When the slurm daemon is sending out emails, they are coming from 
“sl...@servername.subdomain.domain.edu”. This has worked okay in the 
past, but due to a recent mail server change (over which I have no 
control whatsoever) this will no longer work. Now, the From: address 
is going to have to be something like “slurm-servern...@domain.edu”, 
or at least something that ends in “@domain.edu” (the subdomain being 
present will cause it to get rejected by the mail server).


I am not seeing in the documentation how to change the “From:” address 
that slurm uses. Is there a way to do this and I’m just missing it?


---

Mike VanHorn

Senior Computer Systems Administrator

College of Engineering and Computer Science

Wright State University

265 Russ Engineering Center

937-775-5157

michael.vanh...@wright.edu


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: dynamical configuration || meta configuration mgmt

2024-05-29 Thread Paul Edmon via slurm-users
Many parameters in slurm can be changed via scontrol and sacctmgr 
commands without updating the conf itself. The thing is that scontrol 
changes are not durable across restarts. sacctmgr, though, updates the 
slurmdbd database and thus will be sticky.


So that's what I would do: if you are using a QoS to manage 
this (which I am assuming you are), I would use sacctmgr.
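
For example, something like the following makes the change persistent in 
the database, whereas an scontrol update would be lost at the next restart 
(the QoS name and value here are just placeholders):

sacctmgr -i modify qos normal set MaxJobsPerUser=200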


As for a framework that does the state inspection, I'm not aware of one. 
You could do it via cron and batch scripts. I don't know if someone has 
something more sophisticated, though.
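
As a rough sketch of the cron approach (the idle-node threshold, QoS name, 
and limits are all assumptions):

#!/bin/bash
# raise the per-user job limit when plenty of nodes are idle, otherwise restore it
idle=$(sinfo -h -t idle -o "%D" | awk '{s+=$1} END {print s+0}')
if [ "$idle" -ge 20 ]; then
    sacctmgr -i modify qos normal set MaxJobsPerUser=-1
else
    sacctmgr -i modify qos normal set MaxJobsPerUser=200
fi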


-Paul Edmon-

On 5/29/2024 11:05 AM, Heckes, Frank via slurm-users wrote:


Hello all,

I’m sorry if this has been asked and answered before, but I couldn’t 
find anything related.


Does anyone know whether a framework of sorts exists that allows changing 
certain SLURM configuration parameters when some conditions in the batch 
system's state are detected, and of course reverts them once the state 
changes back again?


(To be more concrete: we would like to raise or unset MaxJobsPU to run as 
many small jobs as possible and allocate all nodes as soon as a certain 
threshold of free nodes is available, plus some other scenarios.)


Many thanks in advance.

Cheers,

-Frank

Max-Planck-Institut

für Sonnensystemforschung

Justus-von-Liebig-Weg 3

D-37077 Göttingen

Phone: [+49] 551 – 384 979 320

E-Mail: hec...@mps.mpg.de <mailto:hec...@mps.mpg.de>


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] HPC Principal System Engineer at the Broad

2024-04-25 Thread Paul Edmon via slurm-users
A friend asked me to pass this along. Figured some folks on this list 
might be interested.


https://broadinstitute.avature.net/en_US/careers/JobDetail/HPC-Principal-System-Engineer/17773

-Paul Edmon-


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Paul Edmon via slurm-users
Usually to clear jobs like this you have to reboot the node they are on. 
That will then force the scheduler to clear them.
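
If RebootProgram is configured, something like this (the node name is just 
an example) will do it from the controller:

scontrol reboot ASAP nextstate=resume node01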


-Paul Edmon-

On 4/10/2024 2:56 AM, archisman.pathak--- via slurm-users wrote:

We are running a slurm cluster with version `slurm 22.05.8`. One of our users 
has reported that their jobs have been stuck at the completion stage for a long 
time. Referring to Slurm Workload Manager - Slurm Troubleshooting Guide we 
found that indeed the batchhost for the job was removed from the cluster, 
perhaps without draining it first.

How do we cancel/delete the jobs?

* We tried scancel on the batch and individual job ids from both the user and 
from SlurmUser



--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Avoiding fragmentation

2024-04-09 Thread Paul Edmon via slurm-users
I wrote a little blog post on this topic a few years back: 
https://www.rc.fas.harvard.edu/blog/cluster-fragmentation/



It's a vexing problem, but as noted by the other responders it is 
something that depends on your cluster policy and job performance needs. 
Well written MPI code should be able to scale well even when given 
non-optimal topologies.



You might also look at Node Weights 
(https://slurm.schedmd.com/slurm.conf.html#OPT_Weight). We use them on 
mosaic partitions so that the latest hardware is left available for 
larger jobs needing more performance. You can also use them to force jobs 
to one side of the partition, though generally the scheduler does this 
automatically.
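
As a sketch, that looks something like the following in slurm.conf (node 
names, counts, and weights are made up); Slurm fills the lowest-weight 
nodes first, leaving the higher-weight hardware free:

NodeName=oldgen[01-32] CPUs=64  RealMemory=256000 Weight=10
NodeName=newgen[01-16] CPUs=128 RealMemory=512000 Weight=100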



-Paul Edmon-


On 4/9/24 6:45 AM, Cutts, Tim via slurm-users wrote:
Agree with that.   Plus, of course, even if the jobs run a bit slower 
by not having all the cores on a single node, they will be scheduled 
sooner, so the overall turnaround time for the user will be better, 
and ultimately that's what they care about. I've always been of the 
view, for any scheduler, that the less you try to constrain it the 
better.  It really depends on what you're trying to optimise for, but 
generally speaking I try to optimise for maximum utilisation and 
throughput, unless I have a specific business case that needs to 
prioritise particular workloads, and then I'll compromise on 
throughput to get the urgent workload through sooner.


Tim

*From:* Loris Bennett via slurm-users 
*Sent:* 09 April 2024 06:51
*To:* slurm-users@lists.schedmd.com 
*Cc:* Gerhard Strangar 
*Subject:* [slurm-users] Re: Avoiding fragmentation
Hi Gerhard,

Gerhard Strangar via slurm-users  writes:

> Hi,
>
> I'm trying to figure out how to deal with a mix of few- and many-cpu
> jobs. By that I mean most jobs use 128 cpus, but sometimes there are
> jobs with only 16. As soon as that job with only 16 is running, the
> scheduler splits the next 128 cpu jobs into 96+16 each, instead of
> assigning a full 128 cpu node to them. Is there a way for the
> administrator to achieve preferring full nodes?
> The existence of pack_serial_at_end makes me believe there is not,
> because that basically is what I needed, apart from my serial jobs using
> 16 cpus instead of 1.
>
> Gerhard

This may well not be relevant for your case, but we actively discourage
the use of full nodes for the following reasons:

  - When the cluster is full, which is most of the time, MPI jobs in
    general will start much faster if they don't specify the number of
    nodes and certainly don't request full nodes.  The overhead due to
    the jobs being scattered across nodes is often much lower than the
    additional waiting time incurred by requesting whole nodes.

  - When all the cores of a node are requested, all the memory of the
    node becomes unavailable to other jobs, regardless of how much
    memory is requested or indeed how much is actually used.  This holds
    up jobs with low CPU but high memory requirements and thus reduces
    the total throughput of the system.

These factors are important for us because we have a large number of
single core jobs and almost all the users, whether doing MPI or not,
significantly overestimate the memory requirements of their jobs.

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
FUB-IT (ex-ZEDAT), Freie Universität Berlin

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com





-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: FairShare priority questions

2024-03-27 Thread Paul Edmon via slurm-users
For this use case you probably want to go with Classic Fairshare 
(https://slurm.schedmd.com/classic_fair_share.html) rather than 
Fair Tree. Classic Fairshare behaves in a way similar to what you 
describe: you can set up different bins for fairshare and then the users 
can pull from them. So that would be my recommendation. This is how we 
handle fairshare at FASRC, since we use Classic Fairshare: 
https://docs.rc.fas.harvard.edu/kb/fairshare/ You will need to enable 
NO_FAIR_TREE (https://slurm.schedmd.com/slurm.conf.html#OPT_NO_FAIR_TREE), 
as Fair Tree is on by default.
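
As a rough sketch, the relevant slurm.conf pieces would be something like 
this (the weight is an arbitrary example value):

PriorityType=priority/multifactor
PriorityFlags=NO_FAIR_TREE
PriorityWeightFairshare=10000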


-Paul Edmon-

On 3/27/2024 9:22 AM, Long, Daniel S. via slurm-users wrote:


Hi,

I’m trying to set up multifactor priority on our cluster and am having 
some trouble getting it to behave the way I’d like. My main issues 
seem to revolve around FairShare.


We have multiple projects on our cluster and multiple users in those 
projects (and some users are in multiple projects, of course). I would 
like the FairShare to be based only on the project associated with the 
job; if user A and user B both submit jobs on project C, the FairShare 
should be identical. However, it looks like the FairShare is based on 
both the project and the user. Is there a way to get the behavior I’m 
looking for?


Thanks for any help you can provide.


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Slurm Utilities

2024-03-13 Thread Paul Edmon via slurm-users
Just wanted to share some slurm utilities that we've written at Harvard 
FASRC that may be useful to the community.


seff-account: https://github.com/fasrc/seff-account  Creates job 
statistics summaries for users and accounts, similar to what seff and 
seff-array do.


showq: https://github.com/fasrc/slurm_showq  A slurm version of the Moab 
showq command


lsload: https://github.com/fasrc/lsload  A slurm version of the LSF 
lsload command


scalc: https://github.com/fasrc/scalc  A calculator for various 
fairshare related things


spart: https://github.com/fasrc/spart  A simplified output for slurm 
partition information


stdg: https://github.com/fasrc/stdg Slurm test deck generator

prometheus-slurm-exporter: 
https://github.com/fasrc/prometheus-slurm-exporter  Slurm exporters for 
prometheus


Hopefully people find these useful. Pull requests are always appreciated.

-Paul Edmon-


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Edmon via slurm-users
He's talking about recent versions of Slurm which now have this option: 
https://slurm.schedmd.com/slurm.conf.html#OPT_use_interactive_step
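
In slurm.conf that is a LaunchParameters flag, roughly (fold it into 
whatever LaunchParameters you already have set):

LaunchParameters=use_interactive_step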


-Paul Edmon-

On 2/28/2024 10:46 AM, Paul Raines wrote:


What do you mean "operate via the normal command line"?  When
you salloc, you are still on the login node.

$ salloc -p rtx6000 -A sysadm -N 1 --ntasks-per-node=1 --mem=20G 
--time=1-10:00:00 --gpus=2 --cpus-per-task=2 /bin/bash

salloc: Pending job allocation 3798364
salloc: job 3798364 queued and waiting for resources
salloc: job 3798364 has been allocated resources
salloc: Granted job allocation 3798364
salloc: Waiting for resource configuration
salloc: Nodes rtx-02 are ready for job
mesg: cannot open /dev/pts/91: Permission denied
mlsc-login[0]:~$ hostname
mlsc-login.nmr.mgh.harvard.edu
mlsc-login[0]:~$ printenv | grep SLURM_JOB_NODELIST
SLURM_JOB_NODELIST=rtx-02

Seems you MUST use srun


-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Wed, 28 Feb 2024 10:25am, Paul Edmon via slurm-users wrote:

salloc is the currently recommended way for interactive sessions. srun 
is now intended for launching steps or MPI applications. So properly you 
would salloc and then srun inside the salloc.


As you've noticed, with srun you tend to lose control of your shell as it 
takes over, so you have to background the process unless it is the main 
process. We've hit this before when people use srun to subschedule in 
a salloc.


You can also just launch the salloc and then operate via the normal 
command line reserving srun for things like launching MPI.


The reason they changed from srun to salloc is that you can't srun 
inside a srun. So if you were a user who started a srun interactive 
session and then you tried to invoke MPI it would get weird as you 
would be invoking another srun. By using salloc you avoid this issue.


We used to use srun for interactive sessions as well but swapped to 
salloc a few years back and haven't had any issues.


-Paul Edmon-

On 2/28/2024 10:17 AM, wdennis--- via slurm-users wrote:

 Hi list,

 In our institution, our instructions to users who want to spawn an
 interactive job (for us, a bash shell) have always been to do "srun ..."
 from the login node, which has always been working well for us. But when
 we had a recent Slurm training, the SchedMD folks advised us to use
 "salloc" and then "srun" to do interactive jobs. I tried this today:
 "salloc" gave me a shell on a server, the same as srun does, but then when
 I tried to "srun [programname]" it hung there with no output. Of course,
 when I tried "srun [programname] &" it spawned the background job and
 gave me back a prompt. Either time I had to Ctrl-C the running srun job,
 and got no output other than the srun/slurmstepd termination output.

 I think I read somewhere that directly invoking srun creates an
 allocation; why then would I want to do an initial salloc, and then srun?
 (in the case that I want a foreground program, such as a bash shell)

 I have surveyed some other institutions' Slurm interactive job
 documentation for users, and I see examples both of advice to run srun
 directly and of salloc and then srun.

 Please help me to understand how this is intended to work, and if we are
 "doing it wrong" :)

 Thanks,
 Will



--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com






--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Edmon via slurm-users
salloc is the currently recommended way for interactive sessions. srun 
is now intended for launching steps or MPI applications. So properly you 
would salloc and then srun inside the salloc.
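
A minimal example of that pattern (resource numbers and the program name 
are placeholders):

salloc -N 2 --ntasks-per-node=8 -t 2:00:00
# once the allocation is granted you get a shell; launch steps inside it:
srun ./my_mpi_app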


As you've noticed, with srun you tend to lose control of your shell as it 
takes over, so you have to background the process unless it is the main 
process. We've hit this before when people use srun to subschedule in a 
salloc.


You can also just launch the salloc and then operate via the normal 
command line reserving srun for things like launching MPI.


The reason they changed from srun to salloc is that you can't srun 
inside a srun. So if you were a user who started a srun interactive 
session and then you tried to invoke MPI it would get weird as you would 
be invoking another srun. By using salloc you avoid this issue.


We used to use srun for interactive sessions as well but swapped to 
salloc a few years back and haven't had any issues.


-Paul Edmon-

On 2/28/2024 10:17 AM, wdennis--- via slurm-users wrote:

Hi list,

In our institution, our instructions to users who want to spawn an interactive job (for us, a bash shell) have always been to do "srun 
..." from the login node, which has always been working well for us. But when we had a recent Slurm training, the SchedMD folks advised us 
to use "salloc" and then "srun" to do interactive jobs. I tried this today, "salloc" gave me a shell on a server, 
the same as srun does, but then when I tried to "srun [programname]" it hung there with no output. Of course when I tried "srun 
[programname] &" it spawned the background job, and gave me back a prompt. Either time I had to Ctrl-C the running srun job, and got 
no output other than the srun/slurmstepd termination output.

I think I read somewhere that directly invoking srun creates an allocation; why 
then would I want to do an initial salloc, and then srun? (in the case that I 
want a foreground program, such as a bash shell)

I have surveyed some other institutions' Slurm interactive job documentation 
for users, and I see examples both of advice to run srun directly and of 
salloc and then srun.

Please help me to understand how this is intended to work, and if we are "doing it 
wrong" :)

Thanks,
Will



--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Question about IB and Ethernet networks

2024-02-26 Thread Paul Edmon via slurm-users
I concur with what folks have written so far: it really depends on your 
use case. For instance, if you are looking at a cluster with GPUs and 
intend to do some serious computing there, you are going to need RDMA of 
some sort. But it all depends on what you end up needing for your workflows.


For us we put most of our network traffic over the IB using IPoIB 
combined with aliasing all the nodes to their IB address. Thus all the 
internode network traffic spans the IB fabric rather than the ethernet. 
We then have 1GbE for our ethernet backend which we mainly use for 
management purposes. So we haven't heavily invested in a high speed 
ethernet backbone but instead invested in IB.
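
As a sketch of what that aliasing can look like (addresses and names are 
made up), either via /etc/hosts or via NodeAddr in slurm.conf:

# /etc/hosts
192.168.10.101   node01-eth
10.10.10.101     node01 node01-ib    # short hostname resolves to the IPoIB address

# or in slurm.conf
NodeName=node01 NodeAddr=10.10.10.101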


To invest in both seems to me to be overkill; you should focus on one or 
the other unless you have the cash to spend and a good use case.


-Paul Edmon-

On 2/26/24 7:07 AM, Dan Healy via slurm-users wrote:
I’m very appreciative for each person who’s provided some feedback, 
especially the lengthy replies.


Sounds like RoCE capable Ethernet backbone may be the default way to 
go /unless/ the end users have some specific requirements that might 
need IB. At this point, we wouldn’t be interested in anything slower 
than 200Gbps. So perhaps Eth and IB are equivalent in terms of latency 
and RDMA capabilities, except one is an open standard.


Thanks,

Daniel Healy


On Mon, Feb 26, 2024 at 3:40 AM Cutts, Tim  
wrote:


My view is that it depends entirely on the workload, and the
systems with which your compute needs to interact.  A few things
I’ve experienced before.

 1. Modern ethernet networks have pretty good latency these days,
and so MPI codes can run over them.   Whether IB is worth the
money is a cost/benefit calculation for the codes you want to
run.  The ethernet network we put in at Sanger in 2016 or so
we measured as having similar latency, in practice, as FDR
infiniband, if I remember correctly.  So it wasn’t as good as
state-of-the-art IB at the time, but not bad. Certainly good
enough for our purposes, and we gained a lot of flexibility
through software-defined networking, important if you have
workloads which require better security boundaries than just a
big shared network.
 2. If your workload is predominantly single node, embarrassingly
parallel, you might do better to go with ethernet and invest
the saved money in more compute nodes.
 3. If you only have ethernet, your cluster will be simpler, and
require less specialised expertise to run
 4. If your parallel filesystem is Lustre, IB seems to be the more
well-worn path than ethernet.  We encountered a few Lustre
bugs early on because of that.
 5. On the other hand, if you need to talk to Weka, ethernet is
the well-worn path.  Weka’s IB implementation requires the
dedication of some cores on every client node, so you lose
some compute capacity, which you don’t need to do if you’re
using ethernet.

So, as any lawyer would say “it depends”.  Most of my career has
been in genomics, where IB definitely wasn’t necessary.  Now that
I’m in pharma, there’s more MPI code, so there’s more of a case
for it.

Ultimately, I think you need to run the real benchmarks with real
code, and as Jason says, work out whether the additional
complexity and cost of the IB network is worth it for your
particular workload.  I don’t think the mantra “It’s HPC so it has
to be Infiniband” is a given.

Tim

-- 


*Tim Cutts*

Scientific Computing Platform Lead

AstraZeneca

Find out more about R IT Data, Analytics & AI and how we can
support you by visiting our Service Catalogue
<https://azcollaboration.sharepoint.com/sites/CMU993>

*From: *Jason Simms via slurm-users 
*Date: *Monday, 26 February 2024 at 01:13
*To: *Dan Healy 
*Cc: *slurm-users@lists.schedmd.com 
*Subject: *[slurm-users] Re: Question about IB and Ethernet networks

Hello Daniel,

In my experience, if you have a high-speed interconnect such as
IB, you would do IPoIB. You would likely still have a "regular"
Ethernet connection for management purposes, and yes that means
both an IB switch and an Ethernet switch, but that switch doesn't
have to be anything special. Any "real" traffic is routed over IB,
everything is mounted via IB, etc. That's how the last two
clusters I've worked with have been configured, and the next one
will be the same (but will use Omnipath rather than IB). We
likewise use BeeGFS.

These next comments are perhaps more likely to encounter
differences of opinion, but I would say that sufficiently fast
Ethernet is often "good enough" for most workloads (e.g., MPI).
I'd wager that for all but the most demanding of workloads, it's
entirely acc

[slurm-users] Re: Recover Batch Script Error

2024-02-16 Thread Paul Edmon via slurm-users
Are you using the job_script storage option? If so then you should be 
able to get at it by doing:


sacct -B -j JOBID

https://slurm.schedmd.com/sacct.html#OPT_batch-script

-Paul Edmon-

On 2/16/2024 2:41 PM, Jason Simms via slurm-users wrote:

Hello all,

I've used the "scontrol write batch_script" command to output the job 
submission script from completed jobs in the past, but for some 
reason, no matter which job I specify, it tells me it is invalid. Any 
way to troubleshoot this? Alternatively, is there another way - even 
if a manual database query - to recover the job script, assuming it 
exists in the database?


sacct --jobs=38960
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
38960        amr_run_v+ tsmith2lab tsmith2lab         72  COMPLETED      0:0
38960.batch       batch            tsmith2lab         40  COMPLETED      0:0
38960.extern     extern            tsmith2lab         72  COMPLETED      0:0
38960.0      hydra_pmi+            tsmith2lab         72  COMPLETED      0:0


scontrol write batch_script 38960
job script retrieval failed: Invalid job id specified

Warmest regards,
Jason

--
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Naive SLURM question: equivalent to LSF pre-exec

2024-02-14 Thread Paul Edmon via slurm-users
You probably want the Prolog option: 
https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog along with: 
https://slurm.schedmd.com/slurm.conf.html#OPT_ForceRequeueOnFail
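
A rough sketch of such a Prolog, using the check from your example (the 
script path is an assumption; point Prolog= at wherever you install it):

#!/bin/bash
# /etc/slurm/prolog.sh
# a failed Prolog (non-zero exit) drains the node and requeues the job
if [ ! -f /nfs/someplace/file_I_know_exists ]; then
    exit 1
fi
exit 0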


-Paul Edmon-

On 2/14/2024 8:38 AM, Cutts, Tim via slurm-users wrote:


Hi, I apologise if I’ve failed to find this in the documentation (and 
am happy to be told to RTFM) but a recent issue for one of my users 
resulted in a question I couldn’t answer.


LSF has a feature called a Pre-Exec where a script executes to check 
whether a node is ready to run a task.  So, you can run arbitrary 
checks and go back to the queue if they fail.


For example, if I have some automounted filesystems, and I want to be 
able to check for failure of the automounted, in an LSF world, I can do:


  bsub -E “test -f /nfs/someplace/file_I_know_exists” my_job.sh

What’s the equivalent in SLURM?

Thanks,

Tim

--

*Tim Cutts*

Scientific Computing Platform Lead

AstraZeneca

Find out more about R IT Data, Analytics & AI and how we can support 
you by visiting ourService Catalogue 
<https://azcollaboration.sharepoint.com/sites/CMU993>|







-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


Re: [slurm-users] Two jobs each with a different partition running on same node?

2024-01-29 Thread Paul Edmon
That certainly isn't the case in our configuration. We have multiple 
overlapping partitions and our nodes have a mix of jobs from all 
different partitions.  So the default behavior is to have a mixing of 
partitions on a node governed by the Priority Tier of the partition. 
Namely the highest priority tier always goes first but jobs from the 
lower tiers can fill in the gaps on a node.


Having multiple partitions and then having only one of them own a node if it 
happens to have a job running isn't a standard option to my knowledge. 
You can accomplish this with MCS, which I know can lock down nodes 
to specific users and groups. But what you describe sounds more like you 
are locking down based on partition, not on user or group, which I'm not 
sure how to accomplish in the current version of slurm.


That doesn't mean it's not possible; I just don't know how unless it is some 
obscure option.


-Paul Edmon-

On 1/29/2024 9:25 AM, Loris Bennett wrote:

Hi,

I seem to remember that in the past, if a node was configured to be in
two partitions, the actual partition of the node was determined by the
partition associated with the jobs running on it.  Moreover, at any
instance where the node was running one or more jobs, the node could
only actually be in a single partition.

Was this indeed the case and is it still the case with version Slurm
23.02.7?

Cheers,

Loris





Re: [slurm-users] preemptable queue

2024-01-12 Thread Paul Edmon
My concern was your config inadvertently having that line commented out 
and then seeing problems. If it wasn't, then no worries at this point.


We run using preempt/partition_prio on our cluster and have a mix of 
partitions using PreemptMode=OFF and PreemptMode=REQUEUE. So I know that 
combination works. I would be surprised if PreemptMode=CANCEL did not 
work as that's a valid option.


Something we do have set though is what the default mode is. We have set:

### Governs the default preemption behavior
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

So you might try setting that default of PreemptMode=CANCEL and then set 
specific PreemptModes for all your partitions. That's what we do and it 
works for us.
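
Applied to the partitions in your example, that would look roughly like 
this (untested sketch):

PreemptType=preempt/partition_prio
PreemptMode=CANCEL
PartitionName=regular DefMemPerCPU=4580 Default=True Nodes=node[01-12] State=UP PreemptMode=OFF PriorityTier=200
PartitionName=All DefMemPerCPU=4580 Nodes=node[01-36] State=UP PreemptMode=OFF PriorityTier=500
PartitionName=lowpriority DefMemPerCPU=4580 Nodes=node[01-36] State=UP PreemptMode=CANCEL PriorityTier=100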


-Paul Edmon-

On 1/12/2024 10:33 AM, Davide DelVento wrote:

Thanks Paul,

I don't understand what you mean by having a typo somewhere. I mean, 
that configuration works just fine right now, whereas if I add the 
commented out line any slurm command will just abort with the error 
"PreemptType and PreemptMode values incompatible". So, assuming there 
is a typo, it should be in the commented line right? Or are you saying 
that having that line makes slurm sensitive to a typo somewhere else 
that would be otherwise ignored? Obviously I can't exclude that 
option, but it seems unlikely to me. Also because it does say these 
two things are incompatible.


It would obviously be much better if the error said what EXACTLY is 
incompatible with what, but in the documentation at 
https://slurm.schedmd.com/preempt.html I see many clues as to what that 
could be, and hence I am asking people here who may have already deployed 
preemption on their system. Some excerpts from that URL:



*PreemptType*: Specifies the plugin used to identify which jobs can be 
preempted in order to start a pending job.


  * /preempt/none/: Job preemption is disabled (default).
  * /preempt/partition_prio/: Job preemption is based upon partition
/PriorityTier/. Jobs in higher PriorityTier partitions may preempt
jobs from lower PriorityTier partitions. This is not compatible
with /PreemptMode=OFF/.


which somewhat makes it sound like all partitions should have 
preemption set and not only some? I obviously have some "off" 
partitions. However, elsewhere in that document it says


*PreemptMode*: Mechanism used to preempt jobs or enable gang 
scheduling. When the /PreemptType/ parameter is set to enable 
preemption, the /PreemptMode/ in the main section of slurm.conf 
selects the default mechanism used to preempt the preemptable jobs for 
the cluster.
/PreemptMode/ may be specified on a per partition basis to override 
this default value if /PreemptType=preempt/partition_prio/.


which kind of sounds like it should be okay (unless it means 
**everything** must be different than OFF). Yet still elsewhere in 
that same page it says


On the other hand, if you want to use 
/PreemptType=preempt/partition_prio/ to allow jobs from higher 
PriorityTier partitions to Suspend jobs from lower PriorityTier 
partitions, then you will need overlapping partitions, and 
/PreemptMode=SUSPEND,GANG/ to use Gang scheduler to resume the 
suspended job(s). In either case, time-slicing won't happen between 
jobs on different partitions.


Which somewhat sounds like only suspend and gang can be used as 
preemption modes, and not cancel (my preference) or requeue (perhaps 
acceptable, if I jump through some hoops).


So to me the documentation is highly confusing about what can or 
cannot be used together with what else, and the examples at the bottom 
of the page are nice, but they do not specify the full settings. 
Particularly this one https://slurm.schedmd.com/preempt.html#example2 
is close enough to mine, but it does not tell what PreemptType has 
been chosen (nor if "cancel" would be allowed or not in that setup).


Thanks again!

On Fri, Jan 12, 2024 at 7:22 AM Paul Edmon  wrote:

At least in the example you are showing you have PreemptType
commented out, which means it will return the default. PreemptMode
Cancel should work, I don't see anything in the documentation that
indicates it wouldn't. So I suspect you have a typo somewhere in
    your conf.

-Paul Edmon-

On 1/11/2024 6:01 PM, Davide DelVento wrote:

I would like to add a preemptable queue to our cluster. Actually
I already have. We simply want jobs submitted to that queue be
preempted if there are no resources available for jobs in other
(high priority) queues. Conceptually very simple, no
conditionals, no choices, just what I wrote.
However it does not work as desired.

This is the relevant part:

grep -i Preemp /opt/slurm/slurm.conf
#PreemptType = preempt/partition_prio
PartitionName=regular DefMemPerCPU=4580 Default=True
Nodes=node[01-12] State=UP PreemptMode=off PriorityTier=200
PartitionName=All DefMemPerCPU=4580 Nodes=node[01-36] State=UP
PreemptMode=off Prio

Re: [slurm-users] preemptable queue

2024-01-12 Thread Paul Edmon
At least in the example you are showing, you have PreemptType commented 
out, which means it will return the default. PreemptMode Cancel should 
work; I don't see anything in the documentation that indicates it 
wouldn't. So I suspect you have a typo somewhere in your conf.


-Paul Edmon-

On 1/11/2024 6:01 PM, Davide DelVento wrote:
I would like to add a preemptable queue to our cluster. Actually I 
already have. We simply want jobs submitted to that queue to be preempted 
if there are no resources available for jobs in other (high priority) 
queues. Conceptually very simple, no conditionals, no choices, just 
what I wrote.

However it does not work as desired.

This is the relevant part:

grep -i Preemp /opt/slurm/slurm.conf
#PreemptType = preempt/partition_prio
PartitionName=regular DefMemPerCPU=4580 Default=True Nodes=node[01-12] 
State=UP PreemptMode=off PriorityTier=200
PartitionName=All DefMemPerCPU=4580 Nodes=node[01-36] State=UP 
PreemptMode=off PriorityTier=500
PartitionName=lowpriority DefMemPerCPU=4580 Nodes=node[01-36] State=UP 
PreemptMode=cancel PriorityTier=100



That PreemptType setting (now commented) fully breaks slurm, 
everything refuses to run with errors like


$ squeue
squeue: error: PreemptType and PreemptMode values incompatible
squeue: fatal: Unable to process configuration file

If I understand correctly the documentation at 
https://slurm.schedmd.com/preempt.html that is because preemption 
cannot cancel jobs based on partition priority, which (if true) is 
really unfortunate. I understand that allowing 
cross-partition time-slicing could be tricky and so I understand why 
that isn't allowed, but cancelling? Anyway, I have a few questions:


1) is that correct and so should I avoid using either partition 
priority or cancelling?
2) is there an easy way to trick slurm into requeing and then have 
those jobs cancelled instead?
3) I guess the cleanest option would be to implement QoS, but I've 
never done it and we don't really need it for anything else other than 
this. The documentation looks complicated, but is it? The great Ole's 
website is unavailable at the moment...


Thanks!!

Re: [slurm-users] Beginner admin question: Prioritization within a partition based on time limit

2024-01-09 Thread Paul Edmon
Yeah, that's sort of the job of the backfill scheduler, as smaller jobs 
will fit better into the gaps. There are several options with in the 
priority framework that you can use to dial in which jobs get which 
priority. I recommend reading through all those and finding the options 
that will work best for the policy you want to implement.


-Paul Edmon-

On 1/9/2024 10:43 AM, Kenneth Chiu wrote:
I'm just learning about slurm. I understand that different 
partitions can be prioritized separately and can have different max 
time limits. I was wondering whether or not there was a way to have a 
finer-grained prioritization based on the time limit specified by a 
job, within a single partition. Or perhaps this is already happening 
by default? Would the backfill scheduler be best for this?




Re: [slurm-users] GPU Card Reservation?

2023-12-15 Thread Paul Edmon
I believe the 23.11 version of slurm will allow you to reserve specific 
cards as part of a Reservation.  That won't do preemption though as a 
reservation just takes the card and dedicates it to the user.  I don't 
know if a QoS could pull that off, I haven't experimented with it. A 
partition would be all or nothing for a node so that would not work.


-Paul Edmon-

On 12/15/23 12:16 PM, Jason Simms wrote:

Hello all,

At least at one point, I understood that it was not particularly 
possible, or at least not elegant, to provide priority preempt access 
to a specific GPU card. So, if a node has 4 GPUs, a researcher can 
preempt as needed one or more of them.


Is this still the case? Or is there a reasonable way to facilitate this?

Warmest regards,
Jason




Re: [slurm-users] Disabling SWAP space will it effect SLURM working

2023-12-11 Thread Paul Edmon
We've been running without swap for years with no issues. You may 
want to set MemSpecLimit in your config to reserve memory for the OS, so 
that you don't OOM the system with user jobs: 
https://slurm.schedmd.com/slurm.conf.html#OPT_MemSpecLimit
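
For example, on 256 GB nodes you might reserve a few GB for the OS and 
daemons (node names and numbers below are placeholders):

NodeName=node[01-04] RealMemory=253952 MemSpecLimit=8192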


-Paul Edmon-

On 12/11/2023 11:19 AM, Davide DelVento wrote:
A little late here, but yes everything Hans said is correct and if you 
are worried about slurm (or other critical system software) getting 
killed by OOM, you can workaround it by properly configuring cgroup.


On Wed, Dec 6, 2023 at 2:06 AM Hans van Schoot  wrote:

Hi Joseph,

This might depend on the rest of your configuration, but in
general swap should not be needed for anything on Linux.
BUT: you might get OOM killer messages in your system logs, and
SLURM might fall victim to the OOM killer (OOM = Out Of Memory) if
you run applications on the compute node that eat up all your RAM.
Swap does not prevent against this, but makes it less likely to
happen. I've seen OOM kill slurm daemon processes on compute nodes
with swap, usually slurm recovers just fine after the application
that ate up all the RAM ends up getting killed by the OOM killer.
My compute nodes are not configured to monitor memory usage of
jobs. If you have memory configured as a managed resource in your
SLURM setup, and you leave a bit of headroom for the OS itself
(e.g. only hand our a maximum of 250GB RAM to jobs on your 256GB
RAM nodes), you should be fine.

cheers,
Hans


ps. I'm just a happy slurm user/admin, not an expert, so I might
be wrong about everything :-)



On 06-12-2023 05:57, John Joseph wrote:

Dear All,
Good morning
We have a 4-node [256 GB RAM in each node] SLURM instance which we
installed and it is working fine.
We have 2 GB of SWAP space on each node. To make full use of the
system, we want to disable the SWAP memory.

We would like to know: if I disable the SWAP partition, will it affect
SLURM functionality?

Advice requested
Thanks
Joseph John



Re: [slurm-users] enabling job script archival

2023-10-03 Thread Paul Edmon

You will probably need to.

The way we handle it is that we add users when they first submit a job 
via the job_submit.lua script. This way the database autopopulates with 
active users.
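
If you would rather seed the database in one pass instead, a rough sketch 
would be something like this (the account name is a placeholder):

sacctmgr -i add account dept Description="department" Organization="department"
for u in $(getent passwd | awk -F: '$3 >= 1000 {print $1}'); do
    sacctmgr -i add user "$u" DefaultAccount=dept
done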


-Paul Edmon-

On 10/3/23 9:01 AM, Davide DelVento wrote:
By increasing the slurmdbd verbosity level, I got additional 
information, namely the following:


slurmdbd: error: couldn't get information for this user (null)(xx)
slurmdbd: debug: accounting_storage/as_mysql: 
as_mysql_jobacct_process_get_jobs: User xx  has no associations, 
and is not admin, so not returning any jobs.


again where x is the posix ID of the user who's running the query 
in the slurmdbd logs.


I suspect this is due to the fact that our userbase is small enough 
(we are a department HPC) that we don't need to use allocation and the 
like, so I have not configured any association (and not even studied 
its configuration, since when I was at another place which did use 
associations, someone else took care of slurm administration).


Anyway, I read the fantastic document by our own member at 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/#associations 
and in fact I have not even configured slurm users:


# sacctmgr show user
      User   Def Acct     Admin
---------- ---------- ---------
      root       root Administ+
#

So is that the issue? Should I just add all users? Any suggestions on 
the minimal (but robust) way to do that?


Thanks!


On Mon, Oct 2, 2023 at 9:20 AM Davide DelVento 
 wrote:


Thanks Paul, this helps.

I don't have any PrivateData line in either config file. According
to the docs, "By default, all information is visible to all users"
so this should not be an issue. I tried to add a line with
"PrivateData=jobs" to the conf files, just in case, but that
didn't change the behavior.

On Mon, Oct 2, 2023 at 9:10 AM Paul Edmon 
wrote:

At least in our setup, users can see their own scripts by
doing sacct -B -j JOBID

I would make sure that the scripts are being stored and how
you have PrivateData set.

-Paul Edmon-

On 10/2/2023 10:57 AM, Davide DelVento wrote:

I deployed the job_script archival and it is working, however
it can be queried only by root.

A regular user can run sacct -lj towards any jobs (even those
by other users, and that's okay in our setup) with no
problem. However if they run sacct -j job_id --batch-script
even against a job they own themselves, nothing is returned
and I get a

slurmdbd: error: couldn't get information for this user
(null)(xx)

where x is the posix ID of the user who's running the
query in the slurmdbd logs.

Both configure files slurmdbd.conf and slurm.conf do not have
any "permission" setting. FWIW, we use LDAP.

Is that the expected behavior, in that by default only root
can see the job scripts? I was assuming the users themselves
should be able to debug their own jobs... Any hint on what
could be changed to achieve this?

Thanks!



On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento
 wrote:

Fantastic, this is really helpful, thanks!

On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon
 wrote:

Yes it was later than that. If you are 23.02 you are
good.  We've been running with storing job_scripts on
for years at this point and that part of the database
only uses up 8.4G.  Our entire database takes up 29G
on disk. So its about 1/3 of the database.  We also
have database compression which helps with the on
disk size. Raw uncompressed our database is about
90G.  We keep 6 months of data in our active database.

-Paul Edmon-

On 9/28/2023 1:57 PM, Ryan Novosielski wrote:

Sorry for the duplicate e-mail in a short time: do
you know (or anyone) when the hashing was added? Was
planning to enable this on 21.08, but we then had to
delay our upgrade to it. I’m assuming later than
that, as I believe that’s when the feature was added.


On Sep 28, 2023, at 13:55, Ryan Novosielski

<mailto:novos...@rutgers.edu> wrote:

Thank you; we’ll put in a feature request for
improvements in that area, and also thanks for the
warning? I thought of that in passing, but the real
world experience is really useful. I could easily
see wanting that stuff to be retained less often
than the main records, which is what I’d ask for.

I assume that archiving, in general, would also
 

Re: [slurm-users] enabling job script archival

2023-10-02 Thread Paul Edmon
At least in our setup, users can see their own scripts by doing sacct -B 
-j JOBID


I would make sure that the scripts are being stored and how you have 
PrivateData set.


-Paul Edmon-

On 10/2/2023 10:57 AM, Davide DelVento wrote:
I deployed the job_script archival and it is working, however it can 
be queried only by root.


A regular user can run sacct -lj towards any jobs (even those by other 
users, and that's okay in our setup) with no problem. However if they 
run sacct -j job_id --batch-script even against a job they own 
themselves, nothing is returned and I get a


slurmdbd: error: couldn't get information for this user (null)(xx)

where x is the posix ID of the user who's running the query in the 
slurmdbd logs.


Both configure files slurmdbd.conf and slurm.conf do not have any 
"permission" setting. FWIW, we use LDAP.


Is that the expected behavior, in that by default only root can see 
the job scripts? I was assuming the users themselves should be able to 
debug their own jobs... Any hint on what could be changed to achieve this?


Thanks!



On Fri, Sep 29, 2023 at 5:48 AM Davide DelVento 
 wrote:


Fantastic, this is really helpful, thanks!

On Thu, Sep 28, 2023 at 12:05 PM Paul Edmon
 wrote:

Yes it was later than that. If you are 23.02 you are good. 
We've been running with storing job_scripts on for years at
this point and that part of the database only uses up 8.4G. 
Our entire database takes up 29G on disk. So its about 1/3 of
the database.  We also have database compression which helps
with the on disk size. Raw uncompressed our database is about
90G.  We keep 6 months of data in our active database.

    -Paul Edmon-

On 9/28/2023 1:57 PM, Ryan Novosielski wrote:

Sorry for the duplicate e-mail in a short time: do you know
(or anyone) when the hashing was added? Was planning to
enable this on 21.08, but we then had to delay our upgrade to
it. I’m assuming later than that, as I believe that’s when
the feature was added.


On Sep 28, 2023, at 13:55, Ryan Novosielski
 <mailto:novos...@rutgers.edu> wrote:

Thank you; we’ll put in a feature request for improvements
in that area, and also thanks for the warning? I thought of
that in passing, but the real world experience is really
useful. I could easily see wanting that stuff to be retained
less often than the main records, which is what I’d ask for.

I assume that archiving, in general, would also remove this
stuff, since old jobs themselves will be removed?

--
#BlackLivesMatter

|| \\UTGERS,
|---*O*---
||_// the State |         Ryan Novosielski -
novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922)
~*~ RBHS Campus
||  \\    of NJ | Office of Advanced Research Computing -
MSB A555B, Newark
     `'


On Sep 28, 2023, at 13:48, Paul Edmon
 <mailto:ped...@cfa.harvard.edu> wrote:

Slurm should take care of it when you add it.

So far as horror stories, under previous versions our
database size ballooned to be so massive that it actually
prevented us from upgrading and we had to drop the columns
containing the job_script and job_env.  This was back
before slurm started hashing the scripts so that it would
only store one copy of duplicate scripts.  After this point
we found that the job_script database stayed at a fairly
reasonable size as most users use functionally the same
script each time. However the job_env continued to grow
like crazy as there are variables in our environment that
change fairly consistently depending on where the user is.
Thus job_envs ended up being too massive to keep around and
so we had to drop them. Frankly we never really used them
for debugging. The job_scripts though are super useful and
not that much overhead.

In summary my recommendation is to only store job_scripts.
job_envs add too much storage for little gain, unless your
job_envs are basically the same for each user in each location.

Also it should be noted that there is no way to prune out
job_scripts or job_envs right now. So the only way to get
rid of them if they get large is to 0 out the column in the
table. You can ask SchedMD for the mysql command to do this
as we had to do it here to our job_envs.

-Paul Edmon-

On 9/28/2023 1:40 PM, Davide DelVento wrote:

In my current slurm installation, (recently upgraded to
slurm v23.02.3), I only have

AccountingStoreFlags=job_comment

I 

Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-29 Thread Paul Edmon
This is one of the reasons we stick with using RPMs rather than the 
symlink process. It's just cleaner and avoids the issue of having the 
install on shared storage that may get overwhelmed with traffic or 
suffer outages. Also, the package manager automatically removes the 
previous versions and installs everything locally. I've never been a fan 
of the symlink method, as it runs counter to the entire point and design 
of Linux and its package managers, which are supposed to do this heavy 
lifting for you.



Rant aside :). Generally for minor upgrades the process is less touchy. 
For our setup we follow the following process that works well for us, 
but does create an outage for the period of the upgrade.



1. Set all partitions to down: This makes sure no new jobs are scheduled.

2. Suspend all jobs: This makes sure jobs aren't running while we upgrade.

3. Stop slurmctld and slurmdbd.

4. Upgrade the slurmdbd. Restart slurmdbd

5. Upgrade the slurmd and slurmctld across the cluster.

6. Restart slurmd and slurmctld simultaneously using choria.

7. Unsuspend all jobs

8. Reopen all partitions.
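
A rough command sketch of steps 1, 2, 7, and 8 (we actually drive the 
restarts with choria, and your partition handling may differ):

for p in $(sinfo -h -o %R | sort -u); do scontrol update PartitionName=$p State=DOWN; done   # 1
scontrol suspend $(squeue -h -t R -o %A | paste -sd,)                                        # 2
# 3-6: stop slurmctld/slurmdbd, upgrade the packages, restart slurmdbd then slurmd/slurmctld
scontrol resume $(squeue -h -t S -o %A | paste -sd,)                                         # 7
for p in $(sinfo -h -o %R | sort -u); do scontrol update PartitionName=$p State=UP; done     # 8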


For major upgrades we always take a mysqldump and backup the spool for 
the slurmctld before upgrading just in case something goes wrong. We've 
had this happen before when the slurmdbd upgrade cut out early (note, 
always run the slurmdbd and slurmctld upgrades in -D mode and not via 
systemctl as systemctl can timeout and kill the upgrade midway for large 
upgrades).



That said, I've also skipped steps 1, 2, 7, and 8 before for minor 
upgrades and it works fine. The slurmd, slurmctld, and slurmdbd can all 
run on different versions so long as slurmdbd > slurmctld > slurmd. 
So if you want to do a live upgrade you can do it. However, out of 
paranoia we generally stop everything. The entire process takes about an 
hour start to finish, with the longest part being the pausing of all the 
jobs.



-Paul Edmon-


On 9/29/2023 9:48 AM, Groner, Rob wrote:
I did already see the upgrade section of Jason's talk, but it wasn't 
much about the mechanics of the actual upgrade process, more of a big 
picture it seemed.  It dealt a lot with different parts of slurm at 
different versions, which is something we don't have.


One little wrinkle here is that while, yes, we're using a symlink to 
point to what version of slurm is the current one...it's all on a 
shared filesystem.  So, ALL nodes, slurmdb, slurmctld are using that 
same symlink.  There is no means to upgrade one component at a time.  
That means to upgrade, EVERYTHING has to come down before it could 
come back up.  Jason's slides seemed to indicate that, if there were 
separate symlinks, then I could focus on just the slurmdb first and 
upgrade it...then focus on slurmctld and upgrade it, and then finally 
the nodes (take down their slurmd, upgrade the link, bring up slurmd). 
So maybe that's what I'm missing.


Otherwise, I think what I'm saying is that I see references to a 
"rolling upgrade", but I don't see any guide to a rolling upgrade.  I 
just see the 14 steps  in 
https://slurm.schedmd.com/quickstart_admin.html#upgrade 
<https://slurm.schedmd.com/quickstart_admin.html#upgrade>, and I guess 
I'd always thought of that as the full octane, high fat upgrade.  I've 
only ever done upgrades during one of our many scheduled downtimes, 
because the upgrades were always to a new major version, and because 
I'm a scared little chicken, so I figured there were maybe some 
smaller subset of steps if only upgrading a patchlevel change.  
Smaller change, less risk, less precautionary steps...?  I'm seeing 
now that's not the case.


Thank you all for the suggestions!

Rob



*From:* slurm-users  on behalf 
of Ryan Novosielski 

*Sent:* Friday, September 29, 2023 2:48 AM
*To:* Slurm User Community List 
*Subject:* Re: [slurm-users] Steps to upgrade slurm for a patchlevel 
change?






I started off writing there’s really no particular process for 
these/just do your changes and start the new software (be mindful of 
any PATH that might contain data that’s under your software tree, if 
you have that setup), and that you might need to watch the timeouts, 
but I figured I’d have a look at the upgrade guide to be sure.


There’s really nothing onerous in there. I’d personally back up my 
database and state save directories just because I’d rather be safe 
than sorry, or for if have to go backwards and want to be sure. You 
can run SlurmCtld for a good while with no database (note that -M on 
the command line will be broken during that time), just being mindful 
of the RAM on the SlurmCtld machine/don’t restart it before the DB is 
back up, and backing up our fairly large database doesn’t take all 
that long. Whether or not 5 is require

Re: [slurm-users] enabling job script archival

2023-09-28 Thread Paul Edmon
Yes it was later than that. If you are 23.02 you are good.  We've been 
running with storing job_scripts on for years at this point and that 
part of the database only uses up 8.4G.  Our entire database takes up 
29G on disk, so it's about 1/3 of the database. We also have database 
compression, which helps with the on-disk size. Raw uncompressed, our 
database is about 90G. We keep 6 months of data in our active database.


-Paul Edmon-

On 9/28/2023 1:57 PM, Ryan Novosielski wrote:
Sorry for the duplicate e-mail in a short time: do you know (or 
anyone) when the hashing was added? Was planning to enable this on 
21.08, but we then had to delay our upgrade to it. I’m assuming later 
than that, as I believe that’s when the feature was added.



On Sep 28, 2023, at 13:55, Ryan Novosielski  wrote:

Thank you; we’ll put in a feature request for improvements in that 
area, and also thanks for the warning? I thought of that in passing, 
but the real world experience is really useful. I could easily see 
wanting that stuff to be retained less often than the main records, 
which is what I’d ask for.


I assume that archiving, in general, would also remove this stuff, 
since old jobs themselves will be removed?


--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ 
RBHS Campus
||  \\    of NJ | Office of Advanced Research Computing - MSB 
A555B, Newark

     `'


On Sep 28, 2023, at 13:48, Paul Edmon  wrote:

Slurm should take care of it when you add it.

So far as horror stories, under previous versions our database size 
ballooned to be so massive that it actually prevented us from 
upgrading and we had to drop the columns containing the job_script 
and job_env.  This was back before slurm started hashing the scripts 
so that it would only store one copy of duplicate scripts.  After 
this point we found that the job_script database stayed at a fairly 
reasonable size as most users use functionally the same script each 
time. However the job_env continued to grow like crazy as there are 
variables in our environment that change fairly consistently 
depending on where the user is. Thus job_envs ended up being too 
massive to keep around and so we had to drop them. Frankly we never 
really used them for debugging. The job_scripts though are super 
useful and not that much overhead.


In summary my recommendation is to only store job_scripts. job_envs 
add too much storage for little gain, unless your job_envs are 
basically the same for each user in each location.


Also it should be noted that there is no way to prune out 
job_scripts or job_envs right now. So the only way to get rid of 
them if they get large is to 0 out the column in the table. You can 
ask SchedMD for the mysql command to do this as we had to do it here 
to our job_envs.


-Paul Edmon-

On 9/28/2023 1:40 PM, Davide DelVento wrote:
In my current slurm installation, (recently upgraded to slurm 
v23.02.3), I only have


AccountingStoreFlags=job_comment

I now intend to add both

AccountingStoreFlags=job_script
AccountingStoreFlags=job_env

leaving the default 4MB value for max_script_size

Do I need to do anything on the DB myself, or will slurm take care 
of the additional tables if needed?


Any comments/suggestions/gotcha/pitfalls/horror_stories to share? I 
know about the additional diskspace and potentially load needed, 
and with our resources and typical workload I should be okay with that.


Thanks!






Re: [slurm-users] enabling job script archival

2023-09-28 Thread Paul Edmon
No, all the archiving does is remove the pointer.  What slurm does right 
now is that it creates a hash of the job_script/job_env and then checks 
and sees if that hash matches one on record. If not then it adds it to 
the record, if it does match then it adds a pointer to the appropriate 
record.  So you can think of the job_script/job_env as an internal 
database of all the various scripts and envs that slurm has ever seen 
and then what ends up in the Job record is a pointer to that database.  
This way slurm can deduplicate scripts/envs that are the same. This 
works great for job_scripts as they are functionally the same and thus 
you have many jobs pointed to the same script, but less so for job_envs.
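
For what it's worth, once a script has been stored this way it can be 
pulled back out of accounting with sacct, assuming a release new enough 
to have these options (the job id below is made up):

    # print the stored batch script for a job
    sacct -j 1234567 --batch-script

    # likewise for the stored environment, if job_env is being recorded
    sacct -j 1234567 --env-vars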


-Paul Edmon-

On 9/28/2023 1:55 PM, Ryan Novosielski wrote:
Thank you; we’ll put in a feature request for improvements in that 
area, and also thanks for the warning? I thought of that in passing, 
but the real world experience is really useful. I could easily see 
wanting that stuff to be retained less often than the main records, 
which is what I’d ask for.


I assume that archiving, in general, would also remove this stuff, 
since old jobs themselves will be removed?


--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ 
RBHS Campus
||  \\    of NJ | Office of Advanced Research Computing - MSB 
A555B, Newark

     `'


On Sep 28, 2023, at 13:48, Paul Edmon  wrote:

Slurm should take care of it when you add it.

So far as horror stories, under previous versions our database size 
ballooned to be so massive that it actually prevented us from 
upgrading and we had to drop the columns containing the job_script 
and job_env.  This was back before slurm started hashing the scripts 
so that it would only store one copy of duplicate scripts.  After 
this point we found that the job_script database stayed at a fairly 
reasonable size as most users use functionally the same script each 
time. However the job_env continued to grow like crazy as there are 
variables in our environment that change fairly consistently 
depending on where the user is. Thus job_envs ended up being too 
massive to keep around and so we had to drop them. Frankly we never 
really used them for debugging. The job_scripts though are super 
useful and not that much overhead.


In summary my recommendation is to only store job_scripts. job_envs 
add too much storage for little gain, unless your job_envs are 
basically the same for each user in each location.


Also it should be noted that there is no way to prune out job_scripts 
or job_envs right now. So the only way to get rid of them if they get 
large is to 0 out the column in the table. You can ask SchedMD for 
the mysql command to do this as we had to do it here to our job_envs.


-Paul Edmon-

On 9/28/2023 1:40 PM, Davide DelVento wrote:
In my current slurm installation, (recently upgraded to slurm 
v23.02.3), I only have


AccountingStoreFlags=job_comment

I now intend to add both

AccountingStoreFlags=job_script
AccountingStoreFlags=job_env

leaving the default 4MB value for max_script_size

Do I need to do anything on the DB myself, or will slurm take care 
of the additional tables if needed?


Any comments/suggestions/gotcha/pitfalls/horror_stories to share? I 
know about the additional diskspace and potentially load needed, and 
with our resources and typical workload I should be okay with that.


Thanks!




Re: [slurm-users] enabling job script archival

2023-09-28 Thread Paul Edmon

Slurm should take care of it when you add it.

So far as horror stories, under previous versions our database size 
ballooned to be so massive that it actually prevented us from upgrading 
and we had to drop the columns containing the job_script and job_env.  
This was back before slurm started hashing the scripts so that it would 
only store one copy of duplicate scripts.  After this point we found 
that the job_script database stayed at a fairly reasonable size as most 
users use functionally the same script each time. However the job_env 
continued to grow like crazy as there are variables in our environment 
that change fairly consistently depending on where the user is. Thus 
job_envs ended up being too massive to keep around and so we had to drop 
them. Frankly we never really used them for debugging. The job_scripts 
though are super useful and not that much overhead.


In summary my recommendation is to only store job_scripts. job_envs add 
too much storage for little gain, unless your job_envs are basically the 
same for each user in each location.


Also it should be noted that there is no way to prune out job_scripts or 
job_envs right now. So the only way to get rid of them if they get large 
is to 0 out the column in the table. You can ask SchedMD for the mysql 
command to do this as we had to do it here to our job_envs.


-Paul Edmon-

On 9/28/2023 1:40 PM, Davide DelVento wrote:
In my current slurm installation, (recently upgraded to slurm 
v23.02.3), I only have


AccountingStoreFlags=job_comment

I now intend to add both

AccountingStoreFlags=job_script
AccountingStoreFlags=job_env

leaving the default 4MB value for max_script_size

Do I need to do anything on the DB myself, or will slurm take care of 
the additional tables if needed?


Any comments/suggestions/gotcha/pitfalls/horror_stories to share? I 
know about the additional diskspace and potentially load needed, and 
with our resources and typical workload I should be okay with that.


Thanks!




Re: [slurm-users] Submitting hybrid OpenMPI and OpenMP Jobs

2023-09-22 Thread Paul Edmon
You might also try swapping to use srun instead of mpiexec as that way 
slurm can give more direction as to what cores have been allocated to 
what. I've found it in the past that mpiexec will ignore what Slurm 
tells it.
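
As a rough sketch of the srun variant (binary and input file names are 
taken from the original post, and the core count is assumed to match the 
nodes):

    #!/bin/bash
    #SBATCH --nodes=5
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=44

    # give each rank's OpenMP runtime the full allocation on its node
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

    # srun launches one task per node and passes the CPU allocation through
    srun --cpus-per-task=$SLURM_CPUS_PER_TASK PreonNode test.prscene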


-Paul Edmon-

On 9/22/23 8:24 AM, Lambers, Martin wrote:

Hello,

for this setup it typically helps to disable MPI process binding with 
"mpirun --bind-to none ..." (or similar) so that OpenMP can use all 
cores.
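
For example, keeping mpiexec but relaxing the binding (a sketch; exact 
flags depend on the Open MPI version in use):

    export OMP_NUM_THREADS=44
    # one rank per node, no core pinning, so OpenMP can spread over all cores
    mpirun --bind-to none --map-by ppr:1:node PreonNode test.prscene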


Best,
Martin

On 22/09/2023 13:57, Selch, Brigitte (FIDD) wrote:

Hello,

one of our applications need hybrid OpenMPI and OpenMP Job-Submit.

Only one task is allowed on one node, but this task should use all 
cores of the node.


So, for example I made:

#!/bin/bash

#SBATCH --nodes=5
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=44
#SBATCH --export=ALL

export OMP_NUM_THREADS=44

mpiexec PreonNode test.prscene

But the job does not take more than one thread:

…

Thread binding will be disabled because the full machine is not 
available for the process.

Detected 44 CPU threads, 2 l3 caches and 2 packages on the machine.

Number of CPU processors reported by OpenMP: 1

Maximum number of CPU threads reported by OpenMP: 44

Warning: OMP_NUM_THREADS was set to 44, which is higher than the 
number of available processors of 1. Will use 1 threads now.

…

What did I do wrong?

Does anyone have any idea why OpenMP thinks it can only use one 
thread per node?


Thanks !

Best regards,

Brigitte Selch

**

*MAN Truck & Bus SE*

IT Produktentwicklung Simulation (FIDD)

Vogelweiher Str. 33

90441 Nürnberg




MAN Truck & Bus SE
Sitz der Gesellschaft: München
Registergericht: Amtsgericht München, HRB 247520
Vorsitzender des Aufsichtsrats: Christian Levin, Vorstand: Alexander 
Vlaskamp (Vorsitzender), Murat Aksel, Friedrich-W. Baumann, Michael 
Kobriger, Inka Koljonen, Arne Puls, Dr. Frederik Zohm


You can find information about how we process your personal data and 
your rights in our data protection notice: 
www.man.eu/data-protection-notice


This e-mail (including any attachments) is confidential and may be 
privileged.
If you have received it by mistake, please notify the sender by 
e-mail and delete this message from your system.
Any unauthorised use or dissemination of this e-mail in whole or in 
part is strictly prohibited.

Please note that e-mails are susceptible to change.
MAN Truck & Bus SE (including its group companies) shall not be 
liable for the improper or incomplete transmission of the information 
contained in this communication nor for any delay in its receipt.
MAN Truck & Bus SE (or its group companies) does not guarantee that 
the integrity of this communication has been maintained nor that this 
communication is free of viruses, interceptions or interference.








Re: [slurm-users] Best way to accurately calculate the CPU usage of an account when using fairshare?

2023-05-08 Thread Paul Edmon
I would recommend standing up an instance of XDMod as it handles most of 
this for you in its summary reports.



https://open.xdmod.org/10.0/index.html


-Paul Edmon-


On 5/3/23 2:05 PM, Joseph Francisco Guzman wrote:

Good morning,

We have at least one billed account right now, where the associated 
researchers are able to submit jobs that run against our normal queue 
with fairshare, but not for an academic research purpose. So we'd like 
to accurately calculate their CPU hours. We are currently using a 
script to query the db with sacct and sum up the value of ElapsedRaw * 
AllocCPUS for all jobs. But this seems limited, because requeueing 
will create what the sacct man page calls duplicates. By default jobs 
normally get requeued only if there's something outside of the user's 
control like a NODE_FAIL or an scontrol command to requeue it 
manually, though I think users can requeue things themselves, it's not 
a feature we've seen our researchers use.


However with the new scrontab feature, whenever the cron is executed 
more than once, sacct reports that the previous jobs are "requeued" 
and are only visible by looking up duplicates. I haven't seen any 
billed account use requeueing or scrontab yet, but it's clear to me 
that it could be significant once researchers start using scrontab 
more. Scrontab has existed since one of the releases from 2020 I 
believe, but we enabled it this year and see it as much more powerful 
than the traditional linux crontab.


What would be the best way to more thoroughly calculate ElapsedRaw * 
AllocCPUS, to account for duplicates, but optionally ignore 
unintentional requeueing like from a NODE_FAIL?


Here's the main loop of the simple bash script I have now:

while IFS='|' read -r end elapsed cpus; do
    # if a job crosses the month barrier
    # the entire bill will be put under the 2nd month
    year_month="${end:0:7}"
    if [[ ! "$elapsed" =~ ^[0-9]+$ ]] || [[ ! "$cpus" =~ ^[0-9]+$ ]]; then
        continue
    fi
    core_seconds["$year_month"]=$(( core_seconds["$year_month"] + (elapsed * cpus) ))

done < <(sacct -a -A "$SLURM_ACCOUNT" \
               -S "$START_DATE" \
               -E "$END_DATE" \
               -o End,ElapsedRaw,AllocCPUS -X -P --noheader)

Our slurmdbd is configured to keep 6 months of data.

It makes sense to loop through the jobids instead, using sacct's 
-D/--duplicates option each time to reveal the hidden duplicates in 
the REQUEUED state, but I'm interested if there are alternatives or if 
I'm missing anything here.
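
One possible sketch along those lines, pulling the duplicates in with -D 
and simply skipping the copies that ended in NODE_FAIL (how requeued 
copies should be billed is a policy choice, so treat this as a starting 
point):

    declare -A core_seconds
    while IFS='|' read -r end elapsed cpus state; do
        year_month="${end:0:7}"
        # skip requeued copies that died from a node failure
        [[ "$state" == NODE_FAIL* ]] && continue
        [[ "$elapsed" =~ ^[0-9]+$ && "$cpus" =~ ^[0-9]+$ ]] || continue
        core_seconds["$year_month"]=$(( core_seconds["$year_month"] + (elapsed * cpus) ))
    done < <(sacct -a -A "$SLURM_ACCOUNT" -D \
                   -S "$START_DATE" -E "$END_DATE" \
                   -o End,ElapsedRaw,AllocCPUS,State -X -P --noheader)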


Thanks,

Joseph

--
Joseph F. Guzman - ITS (Advanced Research Computing)

Northern Arizona University

joseph.f.guz...@nau.edu


Re: [slurm-users] changing the operational network in slurm setup

2023-03-14 Thread Paul Edmon
We do this for our Infiniband set up.  What we do is that we populate 
/etc/hosts with the hostname mapped to the IP we want Slurm to use.  
This way you get IP traffic traversing the address you want between 
nodes while not having to mess with DNS.
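
For example, something like this in /etc/hosts on every node (the 
addresses below are made up), so that the node hostnames resolve to the 
IPoIB addresses:

    # /etc/hosts -- map hostnames to the Infiniband (IPoIB) addresses
    10.20.0.101   node101
    10.20.0.102   node102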


-Paul Edmon-

On 3/14/2023 12:19 AM, Purvesh Parmar wrote:

Thank you.

It would be helpful if you can elaborate on this. We had hostnames 
given according to interfaces. Now that also needs to be changed,


Thanks,
P. parmar

On Tue, 14 Mar 2023 at 07:58, Steven Hood  wrote:

Set dns server to use the ip address of the 10g





Sent from my T-Mobile 4G LTE Device



 Original message 
From: Purvesh Parmar 
Date: 3/13/23 7:05 PM (GMT-08:00)
To: Slurm User Community List 
Subject: Re: [slurm-users] changing the operational network in
slurm setup


Hi,

No, it's an additional network enabled on all the nodes, and we now
want to migrate the Slurm services from the 1 GbE network to the 10
GbE network. Yes, we have assigned different IP addresses on the 10
GbE network.

On Tue, 14 Mar 2023 at 07:22, Steven Hood  wrote:

Have you changed the IP assignment to use the 10GB interface?


-Original Message-
From: Purvesh Parmar 
Reply-To: Slurm User Community List

To: Slurm User Community List 
Subject: [slurm-users] changing the operational network in
slurm setup
Date: 03/13/2023 06:19:13 PM


hi,

We have Slurm 22.08 running on an Ethernet (1 GbE) network
(slurmdbd, slurmctld and slurmd on compute nodes) on
Ubuntu 20.04. We want to migrate the Slurm services to the 10
GbE network, which is present on all the nodes and on the
master server as well. How should we proceed?

Thanks,
P. Parmar


Re: [slurm-users] linting slurm.conf files

2023-01-27 Thread Paul Edmon
We have a gitlab runner that fires up a docker container that basically 
starts up a mini scheduler (slurmdbd and slurmctld) to confirm that both 
can start. It covers most bases but we would like to see an official 
syntax checker (https://bugs.schedmd.com/show_bug.cgi?id=3435).
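
The check itself can be as simple as something like this inside the 
container (a sketch; paths, options and timings are site-specific):

    #!/bin/bash
    # start both daemons in the foreground and fail the CI job if either dies
    slurmdbd -D -vv & dbd_pid=$!
    sleep 10
    slurmctld -D -vv & ctld_pid=$!
    sleep 20
    kill -0 "$dbd_pid" && kill -0 "$ctld_pid" || exit 1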


-Paul Edmon-

On 1/27/23 2:36 PM, Kevin Broch wrote:
I'm wondering what others use to lint their slurm.conf files to give 
more confidence that the changes are valid.


I came across https://github.com/appeltel/slurmlint which was somewhat 
functional
but since it hasn't been updated since 2019, when I ran it against a 
valid slurm.conf file based on a later slurm rev. it flagged a bunch 
of false positives that were simply new valid options.
On the plus side it was able to flag an example of a misconfigured 
node/partition.


Any ideas would be greatly appreciated.

Best, /

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-04 Thread Paul Edmon
The symlink method for slurm.conf is what we do as well. We have a NFS 
mount from the slurm master that we host the slurm.conf on that we then 
symlink slurm.conf to that NFS share.
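
For example (the NFS path here is made up):

    # on each node, point the local config at the shared copy
    ln -sfn /nfs/slurm/etc/slurm.conf /etc/slurm/slurm.conf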



-Paul Edmon-


On 1/4/2023 1:53 PM, Brian Andrus wrote:


One of the simple ways I have dealt with different configs is to 
symlink /etc/slurm/slurm.conf to the appropriate file (eg: 
slurm-dev.conf and slurm-prod.conf)



In fact, I use the symlink for my dev and nothing (configless) for 
prod. Then I can change a running node to/from dev/prod by merely 
creating/deleting the symlink and restarting slurmd.



Just an option that may work for you.

I also use separate repos for prod/dev when I am working on 
packages/testing. I rather prefer that separation so I don't have 
someone accidentally update to a package that is not production-ready.



Brian Andrus


On 1/4/2023 9:22 AM, Groner, Rob wrote:
We currently have a test cluster and a production cluster, all on the 
same network.  We try things on the test cluster, and then we gather 
those changes and make a change to the production cluster.  We're 
doing that through two different repos, but we'd like to have a 
single repo to make the transition from testing configs to publishing 
them more seamless.  The problem is, of course, that the test cluster 
and production clusters have different cluster names, as well as 
different nodes within them.


Using the include directive, I can pull all of the NodeName lines out 
of slurm.conf and put them into %c-nodes.conf files, one for 
production, one for test.  That still leaves me with two problems:


  * The clustername itself will still be a problem.  I WANT the same
slurm.conf file between test and production...but the clustername
line will be different for them both.  Can I use an env var in
that cluster name, because on production there could be a
different env var value than on test?
  * The gres.conf file.  I tried using the same "include" trick that
works on slurm.conf, but it failed because it did not know what
the "ClusterName" was.  I think that means that either it doesn't
work for anything other than slurm.conf, or that the clustername
will have to be defined in gres.conf as well?

Any other suggestions of how to keep our slurm files in a single 
source control repo, but still have the flexibility to have them run 
elegantly on either test or production systems?


Thanks.


Re: [slurm-users] How to read job accounting data long output? `sacct -l`

2022-12-14 Thread Paul Edmon

The seff utility (in slurm-contribs) also gives good summary info.

You can also use --parsable to make things more manageable.
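
For example (hypothetical job id):

    # per-job efficiency summary
    seff 1234567

    # machine-readable accounting output
    sacct -j 1234567 -o JobID,Elapsed,TotalCPU,MaxRSS --parsable2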

-Paul Edmon-

On 12/14/22 3:41 PM, Ross Dickson wrote:
I wrote a simple Python script to transpose the output of sacct from a 
row into a column.  See if it meets your needs.


https://github.com/ComputeCanada/slurm_utils/blob/master/sacct-all.py

- Ross Dickson
Dalhousie University  /  ACENET  /  Digital Research Alliance of Canada


On Wed, Dec 14, 2022 at 1:16 PM Davide DelVento 
 wrote:


It would be very useful if there were a way (perhaps a custom script
parsing the sacct output) to provide the information in the same
format as "scontrol show job"

Has anybody attempted to do that?


Re: [slurm-users] Slurm v22 for Alma 8

2022-12-02 Thread Paul Edmon
Yeah, our spec is based off of their spec with our own additional 
features plugged in.


-Paul Edmon-

On 12/2/22 2:12 PM, David Thompson wrote:


Hi Paul, thanks for passing that along. The error I saw was coming 
from the rpmbuild %check stage in the el9/fc38 builds, which your 
.spec file doesn’t run (likewise the spec file included in the schedmd 
tarball). Certainly one way to avoid failing a check is to not run it.


Regardless, I appreciate the help.

David Thompson

University of Wisconsin – Madison

Social Science Computing Cooperative

*From:* slurm-users  *On Behalf 
Of *Paul Edmon

*Sent:* Friday, December 2, 2022 11:26 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* Re: [slurm-users] Slurm v22 for Alma 8

Yup, here is the spec we use that works for CentOS 7, Rocky 8, and Alma 8.

-Paul Edmon-

On 12/2/22 12:21 PM, David Thompson wrote:

Hi folks, I’m working on getting Slurm v22 RPMs built for our Alma
8 Slurm cluster. We would like to be able to use the sbatch
–prefer option, which isn’t present in the current EPEL el8 rpms
(version 20.11.9). Rebuilding from either the el9 or fc38 SRPM or
fails on a protocol test in
testsuite/slurm_unit/common/slurm_protocol_defs:

FAIL: slurm_addto_id_char_list-test

Before I start digging in, I thought I would check here and see if
anyone has a successful RHEL/Alma/Rocky 8 slurm v22 SRPM they’d be
willing to share. Thanks much!

David Thompson

University of Wisconsin – Madison

Social Science Computing Cooperative


Re: [slurm-users] Slurm v22 for Alma 8

2022-12-02 Thread Paul Edmon

Yup, here is the spec we use that works for CentOS 7, Rocky 8, and Alma 8.

-Paul Edmon-

On 12/2/22 12:21 PM, David Thompson wrote:


Hi folks, I’m working on getting Slurm v22 RPMs built for our Alma 8 
Slurm cluster. We would like to be able to use the sbatch –prefer 
option, which isn’t present in the current EPEL el8 rpms (version 
20.11.9). Rebuilding from either the el9 or fc38 SRPM or fails on a 
protocol test in testsuite/slurm_unit/common/slurm_protocol_defs:


FAIL: slurm_addto_id_char_list-test

Before I start digging in, I thought I would check here and see if 
anyone has a successful RHEL/Alma/Rocky 8 slurm v22 SRPM they’d be 
willing to share. Thanks much!


David Thompson

University of Wisconsin – Madison

Social Science Computing Cooperative
Name:		slurm
Version:	22.05.6
%define rel	1
Release:	%{rel}fasrc01%{?dist}
Summary:	Slurm Workload Manager

Group:		System Environment/Base
License:	GPLv2+
URL:		https://slurm.schedmd.com/

# when the rel number is one, the directory name does not include it
%if "%{rel}" == "1"
%global slurm_source_dir %{name}-%{version}
%else
%global slurm_source_dir %{name}-%{version}-%{rel}
%endif

Source:		%{slurm_source_dir}.tar.bz2

# build options		.rpmmacros options	change to default action
#   	
# --prefix		%_prefix path		install path for commands, libraries, etc.
# --with cray		%_with_cray 1		build for a Cray Aries system
# --with cray_network	%_with_cray_network 1	build for a non-Cray system with a Cray network
# --with cray_shasta	%_with_cray_shasta 1	build for a Cray Shasta system
# --with slurmrestd	%_with_slurmrestd 1	build slurmrestd
# --with slurmsmwd  %_with_slurmsmwd 1  build slurmsmwd
# --without debug	%_without_debug 1	don't compile with debugging symbols
# --with hdf5		%_with_hdf5 path	require hdf5 support
# --with hwloc		%_with_hwloc 1		require hwloc support
# --with lua		%_with_lua path		build Slurm lua bindings
# --with mysql		%_with_mysql 1		require mysql/mariadb support
# --with numa		%_with_numa 1		require NUMA support
# --without pam		%_without_pam 1		don't require pam-devel RPM to be installed
# --without x11		%_without_x11 1		disable internal X11 support
# --with ucx		%_with_ucx path		require ucx support
# --with pmix		%_with_pmix path	require pmix support
# --with nvml		%_with_nvml path	require nvml support
#

%define _with_slurmrestd 1

#  Options that are off by default (enable with --with )
%bcond_with cray
%bcond_with cray_network
%bcond_with cray_shasta
%bcond_with slurmrestd
%bcond_with slurmsmwd
%bcond_with multiple_slurmd
%bcond_with ucx

# These options are only here to force there to be these on the build.
# If they are not set they will still be compiled if the packages exist.
%bcond_with hwloc
%bcond_with mysql
%bcond_with hdf5
%bcond_with lua
%bcond_with numa
%bcond_with pmix
%bcond_with nvml

# Use debug by default on all systems
%bcond_without debug

# Options enabled by default
%bcond_without pam
%bcond_without x11

# Disable hardened builds. -z,now or -z,relro breaks the plugin stack
%undefine _hardened_build
%global _hardened_cflags "-Wl,-z,lazy"
%global _hardened_ldflags "-Wl,-z,lazy"

# Disable Link Time Optimization (LTO)
%define _lto_cflags %{nil}

Requires: munge

%{?systemd_requires}
BuildRequires: systemd
BuildRequires: munge-devel munge-libs
BuildRequires: python3
BuildRequires: readline-devel
Obsoletes: slurm-lua <= %{version}
Obsoletes: slurm-munge <= %{version}
Obsoletes: slurm-plugins <= %{version}

# fake systemd support when building rpms on other platforms
%{!?_unitdir: %global _unitdir /lib/systemd/systemd}

%define use_mysql_devel %(perl -e '`rpm -q mariadb-devel`; print $?;')

%if %{with mysql}
%if %{use_mysql_devel}
BuildRequires: mysql-devel >= 5.0.0
%else
BuildRequires: mariadb-devel >= 5.0.0
%endif
%endif

%if %{with cray}
BuildRequires: cray-libalpscomm_cn-devel
BuildRequires: cray-libalpscomm_sn-devel
BuildRequires: libnuma-devel
BuildRequires: libhwloc-devel
BuildRequires: cray-libjob-devel
BuildRequires: gtk2-devel
BuildRequires: glib2-devel
BuildRequires: pkg-config
%endif

%if %{with cray_network}
%if %{use_mysql_devel}
BuildRequires: mysql-devel
%else
BuildRequires: mariadb-devel
%endif
BuildRequires: cray-libalpscomm_cn-devel
BuildRequires: cray-libalpscomm_sn-devel
BuildRequires: hwloc-devel
BuildRequires: gtk2-devel
BuildRequires: glib2-devel
BuildRequires: pkgconfig
%endif

BuildRequires: perl(ExtUtils::MakeMaker)
BuildRequires: libcurl-devel
BuildRequires: numactl-devel
BuildRequires: json-c-devel
BuildRequires: infiniband-diags-devel
BuildRequires: rdma-core-devel
BuildRequires: lz4-devel
BuildRequires: man2html
BuildRequires: http-parser-devel
BuildRequires: libyaml-devel
BuildRequires: hdf5-devel
BuildRequires: freeipmi-devel
BuildRequires: rrdtool-devel
BuildRequires: hwloc-devel
BuildRequires: lua-devel
BuildRequires: mysql-devel
BuildRequires: gtk2-dev

Re: [slurm-users] slurm 22.05 "hash_k12" related upgrade issue

2022-10-24 Thread Paul Edmon
It only happens for versions on the 22.05 series prior to the latest 
release (22.05.5).  So the 21 version isn't impacted and you should be 
fine to upgrade from 21 to 22.05.5 and not see the hash_k12 issue.  If 
you upgrade to any prior minor version though you will hit this issue.


-Paul Edmon-

On 10/24/2022 3:13 PM, Marko Markoc wrote:

Hi All,

Regarding 
https://lists.schedmd.com/pipermail/slurm-users/2022-September/009222.html 
.


Question for all of you that might have done this upgrade recently, 
does this happen during the major version ( 21->22 in my case ) 
upgrade also ? All of the discussion I found online about it only 
mentions minor version upgrades.


Thanks,
Marko




Re: [slurm-users] Ideal NFS exported StateSaveLocation size.

2022-10-24 Thread Paul Edmon
HA for slurmctld is not multidatacenter HA but rather a traditional HA 
setup where you have two server heads off of one storage brick 
(connected by SAS cables or other fast interconnect).  Multidatacenter 
HA has issues with keeping things in sync due to latency and IOPs (as 
noted below).


So the HA setup for slurmctld will protect you from the server hosting 
the slurmctld getting hosed, not the entire rack going down or the 
datacenter going down.


-Paul Edmon-

On 10/24/2022 4:14 AM, Ole Holm Nielsen wrote:

On 10/24/22 09:57, Diego Zuccato wrote:

Il 24/10/2022 09:32, Ole Holm Nielsen ha scritto:

 > It is definitely a BAD idea to store Slurm StateSaveLocation on a 
slow

 > NFS directory!  SchedMD recommends to use local NVME or SSD disks
 > because there will be many IOPS to this file system!

IIUC it does have to be shared between controllers, right?

Possibly use NVME-backed (or even better NVDIMM-backed) NFS share. Or 
replica-3 Gluster volume with NVDIMMs for the bricks, for the 
paranoid  :)


IOPS is the key parameter!  Local NVME or SSD should beat any 
networked storage.  The original question refers to having 
StateSaveLocation on a standard (slow) NFS drive, AFAICT.


I don't know how many people prefer using 2 slurmctld hosts (primary 
and backup)?  I certainly don't do that.  Slurm does have a 
configurable SlurmctldTimeout parameter so that you can reboot the 
server quickly when needed.


It would be nice if people with experience in HA storage for slurmctld 
could comment.


/Ole





Re: [slurm-users] Check consistency

2022-10-07 Thread Paul Edmon
The slurmctld log will print out if hosts are out of sync with the 
slurmctld slurm.conf.  That said it doesn't report on cgroup consistency 
changes like that.  It's possible that dialing up the verbosity on the 
slurmd logs may give that info, but I haven't seen it in normal operation.
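
One way to at least confirm the files themselves match across nodes, 
assuming pdsh/dshbak (or a similar parallel shell) is available and the 
node range is adjusted to your cluster:

    # identical checksums collapse to a single line of output
    pdsh -w 'node[001-100]' md5sum /etc/slurm/slurm.conf /etc/slurm/cgroup.conf | dshbak -c

    # and compare against what the running slurmctld actually loaded
    scontrol show config | less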


-Paul Edmon-

On 10/6/22 5:47 PM, Davide DelVento wrote:

Is there a simple way to check that what slurm is running is what the
config says it should be?

For example, my understanding is that changing cgroup.conf should be
followed by 'systemctl stop slurmd' on all compute nodes, then
'systemctl restart slurmctld' on the head node, then 'systemctl start
slurmd' on the compute nodes.

Assuming this is correct, is there a way to query the nodes and ask if
they are indeed running what the config is saying (or alternatively
have them dump their config files somewhere for me to manually run a
diff on)?

Thanks,
Davide





Re: [slurm-users] Recommended amount of memory for the database server

2022-09-26 Thread Paul Edmon
It should generally be as much as you need to hold the full database in 
memory.  That said if you are storing Job Envs and Scripts that will be 
a lot of data, even with the deduping they are doing.  We've generally 
done about 90 GB buffer size here without much of any issue even though 
our database is bigger than that.
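
That buffer is the usual InnoDB setting in my.cnf (or a drop-in under 
/etc/my.cnf.d/); the values below are just examples:

    [mysqld]
    innodb_buffer_pool_size=90G
    innodb_log_file_size=1G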


-Paul Edmon-

On 9/25/22 5:18 PM, byron wrote:

Hi

Does anyone know what is the recommended amount of memory to give 
slurms mariadb database server?


I seem to remember reading a simple estimate based on the size of 
certain tables (or something along those lines) but I can't find it now.


Thanks





Re: [slurm-users] Providing users with info on wait time vs. run time

2022-09-16 Thread Paul Edmon
We also call scontrol in our scripts (as little as we can manage) and we 
run at the scale of 1500 nodes.  It hasn't really caused many issues, 
but we try to limit it as much as we possibly can.


-Paul Edmon-

On 9/16/22 9:41 AM, Sebastian Potthoff wrote:

Hi Hermann,

So you both are happily(?) ignoring this warning in the "Prolog and 
Epilog Guide",

right? :-)

"Prolog and Epilog scripts [...] should not call Slurm commands 
(e.g. squeue,

scontrol, sacctmgr, etc)."


We have probably been doing this since before the warning was added to
the documentation.  So we are "ignorantly ignoring" the advice :-/


Same here :) But if $SLURM_JOB_STDOUT is not defined as documented … 
what can you do.


May I ask how big your clusters are (number of nodes) and how 
heavily they are

used (submitted jobs per hour)?


We have around 500 nodes (mostly 2x18 cores). Jobs ending (i.e. 
calling the epilog script) varies quite a lot between 1000 and 15k a 
day, so something in between 40 and 625 Jobs/hour. During those peaks 
Slurm can become noticeably slower, however usually it runs fine.


Sebastian

Am 16.09.2022 um 15:15 schrieb Loris Bennett 
:


Hi Hermann,

Hermann Schwärzler  writes:


Hi Loris,
hi Sebastian,

thanks for the information on how you are doing this.
So you both are happily(?) ignoring this warning in the "Prolog and 
Epilog Guide",

right? :-)

"Prolog and Epilog scripts [...] should not call Slurm commands 
(e.g. squeue,

scontrol, sacctmgr, etc)."


We have probably been doing this since before the warning was added to
the documentation.  So we are "ignorantly ignoring" the advice :-/

May I ask how big your clusters are (number of nodes) and how 
heavily they are

used (submitted jobs per hour)?


We have around 190 32-core nodes.  I don't know how I would easily find
out the average number of jobs per hour.  The only problems we have had
with submission have been when people have written their own mechanisms
for submitting thousands of jobs.  Once we get them to use job array,
such problems generally disappear.

Cheers,

Loris


Regards,
Hermann

On 9/16/22 9:09 AM, Loris Bennett wrote:

Hi Hermann,
Sebastian Potthoff  writes:


Hi Hermann,

I happened to read along this conversation and was just solving 
this issue today. I added this part to the epilog script to make 
it work:


# Add job report to stdout
StdOut=$(/usr/bin/scontrol show job=$SLURM_JOB_ID | /usr/bin/grep StdOut | /usr/bin/xargs | /usr/bin/awk 'BEGIN { FS = "=" } ; { print $2 }')


NODELIST=($(/usr/bin/scontrol show hostnames))

# Only add to StdOut file if it exists and if we are the first node
if [ "$(/usr/bin/hostname -s)" = "${NODELIST[0]}" -a ! -z "${StdOut}" ]

then
  echo "# JOB REPORT ##" >> $StdOut

  /usr/bin/seff $SLURM_JOB_ID >> $StdOut
  echo "###" >> $StdOut

fi

We do something similar.  At the end of our script pointed to by
EpilogSlurmctld we have
  OUT=`scontrol show jobid ${job_id} | awk -F= '/ StdOut/{print $2}'`
  if [ ! -f "$OUT" ]; then
exit
  fi
  printf "\n== Epilog Slurmctld ==\n\n" >> ${OUT}
  seff ${SLURM_JOB_ID} >> ${OUT}
  printf "\n==\n" >> ${OUT}

  chown ${user} ${OUT}
Cheers,
Loris

  Contrary to what it says in the slurm docs 
https://slurm.schedmd.com/prolog_epilog.html  I was not able to 
use the env var SLURM_JOB_STDOUT, so I had to fetch it via 
scontrol. In addition I had to
make sure it is only called by the „leading“ node as the epilog 
script will be called by ALL nodes of a multinode job and they 
would all call seff and clutter up the output. Last thing was to 
check if StdOut is
not of length zero (i.e. it exists). Interactive jobs would 
otherwise cause the node to drain.


Maybe this helps.

Kind regards
Sebastian

PS: goslmailer looks quite nice with its recommendations! Will 
definitely look into it.


--
Westfälische Wilhelms-Universität (WWU) Münster
WWU IT
Sebastian Potthoff (eScience / HPC)

 Am 15.09.2022 um 18:07 schrieb Hermann Schwärzler 
:


 Hi Ole,

 On 9/15/22 5:21 PM, Ole Holm Nielsen wrote:

 On 15-09-2022 16:08, Hermann Schwärzler wrote:

 Just out of curiosity: how do you insert the output of seff into 
the out-file of a job?


 Use the "smail" tool from the slurm-contribs RPM and set this in 
slurm.conf:

 MailProg=/usr/bin/smail

 Maybe I am missing something but from what I can tell smail sends 
an email and does *not* change or append to the .out file of a job...


 Regards,
 Hermann





--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin emailloris.benn...@fu-berlin.de


Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9

2022-09-08 Thread Paul Edmon
But not any 20.x.  There are two 20 releases, 20.02 and 20.11, and there 
was a 19.05 in between.  So two major versions ahead of 18.08 would be 
20.02, not 20.11.



-Paul Edmon-


On 9/8/22 12:14 PM, Wadud Miah wrote:
The previous version was 18 and now I am trying to upgrade to 20, so I 
am well within 2 major versions.


Regards,

*From:* slurm-users  on behalf 
of Paul Edmon 

*Sent:* Thursday, September 8, 2022 4:44:36 PM
*To:* slurm-users@lists.schedmd.com 
*Subject:* Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9

Typically Slurm only supports upgrading across at most two major 
releases.  If you are on 18.08 you can only go as far as 20.02. Then, 
after you upgrade to 20.02, you can go to 20.11 or 21.08.



-Paul Edmon-


On 9/8/22 11:38 AM, Wadud Miah wrote:

hi Mick,

I have checked that all the compute nodes and controllers all have 
the same version of SLURM (20.11.9). I am indeed trying to upgrade 
SlurmDB first, and am getting the errors in the slurmdbd.log:


[2022-09-08T15:45:11.115] slurmdbd version 20.11.9 started
[2022-09-08T15:45:23.001] error: unpack_header: protocol_version 8448 
not supported

[2022-09-08T15:33:57.001] unpacking header
[2022-09-08T15:33:57.001] error: destroy_forward: no init
[2022-09-08T15:33:57.001] error: slurm_unpack_received_msg: Message 
receive failure
[2022-09-08T15:33:57.011] error: CONN:11 Failed to unpack 
SLURM_PERSIST_INIT message


Regards,
Wadud.


*From:* slurm-users  
<mailto:slurm-users-boun...@lists.schedmd.com> on behalf of Timony, 
Mick  
<mailto:michael_tim...@hms.harvard.edu>

*Sent:* 08 September 2022 16:24
*To:* Slurm User Community List  
<mailto:slurm-users@lists.schedmd.com>

*Subject:* Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9
This thread on the forums may help:

https://groups.google.com/g/slurm-users/c/YB55Ru9rvD4



It looks like you have something on your network with an older 
version of slurm installed. I'd check the Slurm version installed on 
your compute nodes and controllers.


The recommended approach to upgrading is to upgrade the SlurmDB 
first, then the controllers, then the compute nodes. More info here:


https://slurm.schedmd.com/quickstart_admin.html#upgrade


Regards
--
Mick Timony
Senior DevOps Engineer
Harvard Medical School
--


*From:* slurm-users  
<mailto:slurm-users-boun...@lists.schedmd.com> on behalf of Wadud 
Miah  <mailto:w.m...@soton.ac.uk>

*Sent:* Thursday, September 8, 2022 10:47 AM
*To:* slurm-users@lists.schedmd.com 
<mailto:slurm-users@lists.schedmd.com> 
 <mailto:slurm-users@lists.schedmd.com>

*Subject:* [slurm-users] Upgrading SLURM from 18 to 20.11.9
Hi,

I am attempting to upgrade from SLURM 18 to 20.11.9 and when I 
attempt to start slurmdbd (version 20.11.9), I get the following 
error messages in /var/log/slurm/slurmdbd.log:


[2022-09-08T15:45:11.115] slurmdbd version 20.11.9 started
[2022-09-08T15:45:23.001] error: unpack_header: protocol_version 8448 
not supported

[2022-09-08T15:33:57.001] unpacking header
[2022-09-08T15:33:57.001] error: destroy_forward: no init
[2022-09-08T15:33:57.001] error: slurm_unpack_received_msg: Message 
receive failure
[2022-09-08T15:33:57.011] error: CONN:11 Failed to unpack 
SLURM_PERSIST_INIT message


Any help will be greatly appreciated.

Regards,

--
Wadud Miah
Research Computing Support
University of Southampton

Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9

2022-09-08 Thread Paul Edmon
Typically Slurm only supports upgrading across at most two major 
releases.  If you are on 18.08 you can only go as far as 20.02. Then, 
after you upgrade to 20.02, you can go to 20.11 or 21.08.



-Paul Edmon-


On 9/8/22 11:38 AM, Wadud Miah wrote:

hi Mick,

I have checked that all the compute nodes and controllers all have the 
same version of SLURM (20.11.9). I am indeed trying to upgrade SlurmDB 
first, and am getting the errors in the slurmdbd.log:


[2022-09-08T15:45:11.115] slurmdbd version 20.11.9 started
[2022-09-08T15:45:23.001] error: unpack_header: protocol_version 8448 
not supported

[2022-09-08T15:33:57.001] unpacking header
[2022-09-08T15:33:57.001] error: destroy_forward: no init
[2022-09-08T15:33:57.001] error: slurm_unpack_received_msg: Message 
receive failure
[2022-09-08T15:33:57.011] error: CONN:11 Failed to unpack 
SLURM_PERSIST_INIT message


Regards,
Wadud.


*From:* slurm-users  on behalf 
of Timony, Mick 

*Sent:* 08 September 2022 16:24
*To:* Slurm User Community List 
*Subject:* Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9
This thread on the forums may help:

https://groups.google.com/g/slurm-users/c/YB55Ru9rvD4



It looks like you have something on your network with an older version 
of slurm installed. I'd check the Slurm version installed on your 
compute nodes and controllers.


The recommended approach to upgrading is to upgrade the SlurmDB first, 
then the controllers, then the compute nodes. More info here:


https://slurm.schedmd.com/quickstart_admin.html#upgrade


Regards
--
Mick Timony
Senior DevOps Engineer
Harvard Medical School
--


*From:* slurm-users  on behalf 
of Wadud Miah 

*Sent:* Thursday, September 8, 2022 10:47 AM
*To:* slurm-users@lists.schedmd.com 
*Subject:* [slurm-users] Upgrading SLURM from 18 to 20.11.9
Hi,

I am attempting to upgrade from SLURM 18 to 20.11.9 and when I attempt 
to start slurmdbd (version 20.11.9), I get the following error 
messages in /var/log/slurm/slurmdbd.log:


[2022-09-08T15:45:11.115] slurmdbd version 20.11.9 started
[2022-09-08T15:45:23.001] error: unpack_header: protocol_version 8448 
not supported

[2022-09-08T15:33:57.001] unpacking header
[2022-09-08T15:33:57.001] error: destroy_forward: no init
[2022-09-08T15:33:57.001] error: slurm_unpack_received_msg: Message 
receive failure
[2022-09-08T15:33:57.011] error: CONN:11 Failed to unpack 
SLURM_PERSIST_INIT message


Any help will be greatly appreciated.

Regards,

--
Wadud Miah
Research Computing Support
University of Southampton

Re: [slurm-users] mariadb version compatibility with Slurm version

2022-08-24 Thread Paul Edmon
I've regularly upgraded the mariadb version without upgrading the slurm 
version with no issue. We are currently running 10.6.7 for MariaDB on 
CentOS 7.9 with Slurm 22.05.2.  So long as you do the mysql_upgrade 
after the upgrade and have a backup just in case, you should be fine.
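
The sequence is roughly the following, assuming the default database 
name of slurm_acct_db:

    # dump first, just in case
    mysqldump --single-transaction slurm_acct_db > slurm_acct_db_backup.sql

    # after the MariaDB package upgrade, refresh the system tables
    mysql_upgrade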


-Paul Edmon-

On 8/24/22 1:58 AM, navin srivastava wrote:

Hi,

I have a question related to the mariadb vs slurm version compatibility.
Is there any matrix available?

We are running with slurm version 20.02 in our environment on 
SLES15SP3 and with mariadb 10.5.x . We are upgrading the OS from 
SLES15SP3 to SP4 and with this we see the mariadb version is 10.6.x. 
and we are not upgrading the Slurm version.


What is the best way to deal with this as we patch the server 
quarterly and keep the slurm version unchanged as I locked this at os 
level  but the mariadb version update happens and as far as i see it 
has no impact.
is it good idea to keep the mariadb version also intact with the slurm 
version?


Regards
Navin.




Re: [slurm-users] Does the slurmctld node need access to Parallel File system and Runtime libraries of the SW in the Compute nodes.

2022-08-02 Thread Paul Edmon
True.  Though be aware that Slurm will by default map the environment 
from login nodes to compute.  That's the real thing that matters.  So as 
long as the environment is setup properly, any filesystems excluding the 
home directory do not need to be mounted on login.
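
For example:

    # default behaviour: the submitting shell's environment is propagated
    sbatch --export=ALL job.sh

    # start the job with a clean environment instead
    sbatch --export=NONE job.sh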


-Paul Edmon-

On 8/2/2022 9:56 AM, Brian Andrus wrote:

A quick nuance:
We only have home directories on the login node. Our software 
installations are not accessible from there to prevent users from 
running things there (you must have a running job to access the 
software packages).


So your login node does not necessarily need everything the compute 
nodes do.


Brian Andrus

On 8/2/2022 6:45 AM, Paul Edmon wrote:
No, the node running the slurmctld does not need access to any of the 
customer facing filesystems or home directories.  While all the login 
and client nodes do, the slurmctld does not.


-Paul Edmon-

On 8/2/2022 9:30 AM, Richard Chang wrote:

Hi,

I am new to SLURM, so please bear with me.

I need to understand whether the Server/Node running the slurmctld 
daemon will need access to the Parallel file system, and if it will 
need all the SW run time libraries installed, as in the compute nodes.


The users will login to the Login/submission nodes with their home 
mounted from say PFS1 and change directory to the PFS2 mount point 
and then submit/run their jobs.


Does it mean the Server/node running the slurmctld daemon will also 
need access to both the PFS1 and PFS2 mount points ? I am not sure.


The server running the slurmctld daemon will be exclusively for that 
and is not a login node.


Thanks & regards,

Richard.










Re: [slurm-users] SlurmDB Archive settings?

2022-07-18 Thread Paul Edmon

Sure.  Here are our settings:

ArchiveJobs=yes
ArchiveDir="/slurm/archive"
ArchiveSteps=yes
ArchiveResvs=yes
ArchiveEvents=yes
ArchiveSuspend=yes
ArchiveTXN=yes
ArchiveUsage=yes
PurgeEventAfter=6month
PurgeJobAfter=6month
PurgeResvAfter=6month
PurgeStepAfter=6month
PurgeSuspendAfter=6month
PurgeTXNAfter=6month
PurgeUsageAfter=6month

-Paul Edmon-

On 7/15/2022 2:08 AM, Ole Holm Nielsen wrote:

Hi Paul,

On 7/14/22 15:10, Paul Edmon wrote:
We just use the Archive function built into slurm.  That has worked 
fine for us for the past 6 years. We keep 6 months of data in the 
active archive.


Could you kindly share your Archive* settings in slurmdbd.conf? I've 
never tried to use this, but it sounds like a good idea.


Thanks,
Ole





Re: [slurm-users] SlurmDB Archive settings?

2022-07-14 Thread Paul Edmon
Yeah, a word of warning about going from 21.08 to 22.05, make sure you 
have enough storage on the database host you are doing the work on and 
budget a long enough time for the upgrade.  We just converted our 198 GB 
(compressed, 534 GB raw) database this week.  The initial attempt failed 
(after running for 8 hours) because we ran out of disk space (part of 
the reason we had to compress is that the server we use for our slurm 
master only has 800 GB of SSD on it).  That meant we had to reimport our 
DB, which took 8 hours, plus then we had to drop the job scripts and job 
envs, which took another 5 hours, to then attempt the upgrade which took 
2 hours.



Moral of the story, make sure you have enough space and budget 
sufficient time.  You may want to consider nulling out the job scripts 
and envs for the upgrade, as 22.05 completely redoes the way those are 
stored in the database so that it is more efficient, but getting from 
here to there is the trick.



For details see the bug report we filed: 
https://bugs.schedmd.com/show_bug.cgi?id=14514



-Paul Edmon-


On 7/14/2022 2:34 PM, Timony, Mick wrote:



What I can tell you is that we have never had a problem
reimporting the data back in that was dumped from older versions
into a current version database.  So the import using sacctmgr
must do the conversion from the older formats to the newer formats
and handle the schema changes.

​That's the bit of info I was missing, I didn't realise that it 
outputs the data in a format that sacctmgr can read.


I will note that if you are storing job_scripts and envs those can
eat up a ton of space in 21.08.  It looks like they've solved that
problem in 22.05 but the archive steps on 21.08 took forever due
to those scripts and envs.

​Yes, we are storing job_scripts with:

AccountingStoreFlags=job_script

I think when we made that decision, we decided that also saving 
the job_env would take up too much room as our DB is pretty big at the 
moment, at approx. 300GB with the o2_step_table and the o2_job_table 
taking up the most space for obvious reasons:


++---+
| Table                      | Size (GB) |
++---+
| o2_step_table              |    183.83 |
| o2_job_table               |    128.18 |


That's good advice Paul, much appreciated.

>took forever and actually caused issues with the archive process
I think that should be highlighted for other users!

For those interested, to find the table sizes I did this:

SELECT table_name AS "Table",
       ROUND(((data_length + index_length) / 1024 / 1024 / 1024), 2) AS "Size (GB)"
FROM information_schema.TABLES
WHERE table_schema = "slurmdbd"
ORDER BY (data_length + index_length) DESC;


Replace slurmdbd with the name of your database.

Cheers
--Mick



Re: [slurm-users] SlurmDB Archive settings?

2022-07-14 Thread Paul Edmon
We just use the Archive function built into slurm.  That has worked fine 
for us for the past 6 years.  We keep 6 months of data in the active 
archive.



If you have 6 years worth of data and you want to prune down to 2 years, 
I recommend going month by month rather than doing it in one go.  When 
we initially started archiving data several years back our first pass at 
archiving (which at that time had 2 years of data in it) took forever 
and actually caused issues with the archive process.  We worked with 
SchedMD, improved the archive script built into Slurm but also decided 
to only archive one month at a time which allowed it to get done in a 
reasonable amount of time.



The archived data can be pulled into a different slurm database, which 
is what we do for importing historic data into our XDMod instance.
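
Loading an archive file back into that separate slurmdbd is done with 
sacctmgr, something like the following (the file name here is made up):

    sacctmgr archive load file=/slurm/archive/cluster_job_archive_2022-01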



-Paul Edmon-


On 7/13/2022 4:55 PM, Timony, Mick wrote:

Hi Slurm Users,

Currently we don't archive our SlurmDB and have 6 years' worth of data 
in our SlurmDB. We are looking to start archiving our database as it 
starting to get rather large, and we have decided to keep 2 years' 
worth of data. I'm wondering what approaches or scripts other groups use.


The docs refer to the ArchiveScript setting at:
https://slurm.schedmd.com/slurmdbd.conf.html#OPT_ArchiveScript

I've seen suggestions to import into another database that will 
require keeping the schema up-to-date which seems like a possible 
maintenance issue or nightmare if one forgets to update the schema 
after updating Slurmdb. We also have most of the information in an 
Elasticsearch <https://slurm.schedmd.com/elasticsearch.html> instance, 
which will likely suite our needs for long term historical information.



What do you use to archive this information? CSV files, SQL dumps or 
something else?



Regards
--
Mick Timony
Senior DevOps Engineer
Harvard Medical School
--


Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Paul Edmon
Database upgrades can also take a while if your database is large.  
Definitely back up prior to the upgrade, and run slurmdbd -Dv in the 
foreground rather than through systemd: if the upgrade takes a long 
time, systemd will kill the daemon preemptively due to unresponsiveness, 
which will create all sorts of problems.
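
In practice that looks something like this (database name assumed to be 
the default slurm_acct_db):

    systemctl stop slurmdbd
    # take a dump of the accounting database before touching anything
    mysqldump --single-transaction slurm_acct_db > pre_upgrade_backup.sql

    # run the new slurmdbd in the foreground so systemd cannot kill the conversion
    slurmdbd -D -vv
    # once the conversion finishes, Ctrl-C and start it under systemd again
    systemctl start slurmdbd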


-Paul Edmon-

On 5/17/22 2:50 PM, Ole Holm Nielsen wrote:

Hi,

You can upgrade from 19.05 to 20.11 in one step (2 major releases), 
skipping 20.02.  When that is completed, it is recommended to upgrade 
again from 20.11 to 21.08.8 in order to get the current major version. 
The 22.05 will be out very soon, but you may want to wait a couple of 
minor releases before upgrading to 22.05.


I have collected much detailed information about Slurm upgrades in my 
Wiki page:

https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

It is strongly recommended to make the dry-run test of the database 
upgrade, just to be sure your database won't cause problems.


/Ole



On 17-05-2022 18:13, byron wrote:
Sorry, I should have been clearer.   I understand that with regards 
to slurmd / slurmctld you can skip a major release without impacting 
running jobs etc.  My question was about upgrading slurmdbd and 
whether it was necessary to upgrade through the intermediate major 
releases (which I now understand is necessary).


Thanks


On Tue, May 17, 2022 at 4:49 PM Paul Edmon <mailto:ped...@cfa.harvard.edu>> wrote:


    The slurm docs say you can do two major releases at a time
    (https://slurm.schedmd.com/quickstart_admin.html
    <https://slurm.schedmd.com/quickstart_admin.html>):

    "Almost every new major release of Slurm (e.g. 20.02.x to 20.11.x)
    involves changes to the state files with new data structures, new
    options, etc. Slurm permits upgrades to a new major release from the
    past two major releases, which happen every nine months (e.g.
    20.02.x or 20.11.x to 21.08.x) without loss of jobs or other state
    information."

    As for old versions of slurm I think at this point you would need to
    contact SchedMD.  I'm sure they have past releases they can hand out
    if you are bootstrapping to a newer release.

    -Paul Edmon-

    On 5/17/22 11:42 AM, byron wrote:

    Thanks Brian for the speedy response.

    Am I not correct in thinking that if I just go from 19.05 to 20.11
    then there is the advantage that I can upgrade slurmd and
    slurmctld in one go and it won't affect the running jobs since
    upgrading to a new major release from the past two major releases
    doesn't affect the state information.  Or are you saying that  in
    this case (19.05  direct to 21.08) there isn't any impact to
    running jobs either.  Or did you step through all the versions
    when upgrading slurmd and slurmctld also?

    Also where do I get a copy of 20.2 from if schedMD aren't
    providing it as a download.

    Thanks




    On Tue, May 17, 2022 at 4:05 PM Brian Andrus mailto:toomuc...@gmail.com>> wrote:

    You need to step upgrade through major versions (not minor).

    So 19.05=>20.x

    I would highly recommend going to 21.08 while you are at it.
    I just did the same migration (although they started at 18.x)
    with no
    issues. Running jobs were not impacted and users didn't even
    notice.

    Brian Andrus


    On 5/17/2022 7:35 AM, byron wrote:
    > Hi
    >
    > I'm looking at upgrading our install of slurm from 19.05 to
    20.11 in
    > response to the recently announced security vulnerabilities.
    >
    > I've been through the documentation / forums and have
    managed to find
    > the answers to most of my questions but am still unclear
    about the
    > following
    >
    >  - In upgrading the slurmdbd from 19.05 to 20.11 do I need
    to go
    > through all the versions (19.05 => 20.2 => 20.11)? From
    reading the
    > forums it looks as though it is necessary
    >
https://groups.google.com/g/slurm-users/c/fftVPaHvTzQ/m/YTWo1mRjAwAJ
<https://groups.google.com/g/slurm-users/c/fftVPaHvTzQ/m/YTWo1mRjAwAJ>
    >
https://groups.google.com/g/slurm-users/c/kXtepX8-L7I/m/udwySA3bBQAJ
<https://groups.google.com/g/slurm-users/c/kXtepX8-L7I/m/udwySA3bBQAJ>
    >    However if that is the case it would seem strange that
    SchedMD have
    > removed 20.2 from the downloads page (I understand the
    reason is that
    > it contains the exploit) if it is still required for the
    upgrade.
    >
    > - We are running version 5.5.68 of the MariaDB, the version
    that comes
    > with centos7.9.   I've seen a few references to upgrading
    v5.5 but
    > they were in the context of upgrading from slurm 17 to 
18.  I'm

    > wondering if its ok to stick with this version since we're
    already on
    > slurm 19.05.
    >
    > Any help much appreciated.






Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Paul Edmon

I think it should be, but you should be able to run a test and find out.

-Paul Edmon-

On 5/17/22 12:13 PM, byron wrote:
Sorry, I should have been clearer.   I understand that with regards to 
slurmd / slurmctld you can skip a major release without impacting 
running jobs etc.  My question was about upgrading slurmdbd and 
whether it was necessary to upgrade through the intermediate major 
releases (which I now understand is necessary).


Thanks


On Tue, May 17, 2022 at 4:49 PM Paul Edmon  wrote:

The slurm docs say you can do two major releases at a time
(https://slurm.schedmd.com/quickstart_admin.html):

"Almost every new major release of Slurm (e.g. 20.02.x to 20.11.x)
involves changes to the state files with new data structures, new
options, etc. Slurm permits upgrades to a new major release from
the past two major releases, which happen every nine months (e.g.
20.02.x or 20.11.x to 21.08.x) without loss of jobs or other state
information."

As for old versions of slurm I think at this point you would need
to contact SchedMD.  I'm sure they have past releases they can
hand out if you are bootstrapping to a newer release.

-Paul Edmon-

On 5/17/22 11:42 AM, byron wrote:

Thanks Brian for the speedy response.

Am I not correct in thinking that if I just go from 19.05 to
20.11 then there is the advantage that I can upgrade slurmd and
slurmctld in one go and it won't affect the running jobs since
upgrading to a new major release from the past two major releases
doesn't affect the state information.  Or are you saying that in
this case (19.05  direct to 21.08) there isn't any impact to
running jobs either.  Or did you step through all the versions
when upgrading slurmd and slurmctld also?

Also where do I get a copy of 20.2 from if schedMD aren't
providing it as a download.

Thanks




On Tue, May 17, 2022 at 4:05 PM Brian Andrus
 wrote:

You need to step upgrade through major versions (not minor).

So 19.05=>20.x

I would highly recommend going to 21.08 while you are at it.
I just did the same migration (although they started at 18.x)
with no
issues. Running jobs were not impacted and users didn't even
notice.

Brian Andrus


On 5/17/2022 7:35 AM, byron wrote:
> Hi
>
> I'm looking at upgrading our install of slurm from 19.05 to
20.11 in
> response to the recently announced security vulnerabilities.
>
> I've been through the documentation / forums and have
managed to find
> the answers to most of my questions but am still unclear
about the
> following
>
>  - In upgrading the slurmdbd from 19.05 to 20.11 do I need
to go
> through all the versions (19.05 => 20.2 => 20.11)?  From
reading the
> forums it looks as though it is necessary
>
https://groups.google.com/g/slurm-users/c/fftVPaHvTzQ/m/YTWo1mRjAwAJ
>
https://groups.google.com/g/slurm-users/c/kXtepX8-L7I/m/udwySA3bBQAJ
>    However if that is the case it would seem strange that
SchedMD have
> removed 20.2 from the downloads page (I understand the
reason is that
> it contains the exploit) if it is still required for the
upgrade.
>
> - We are running version 5.5.68 of the MariaDB, the version
that comes
> with centos7.9.   I've seen a few references to upgrading
v5.5 but
> they were in the context of upgrading from slurm 17 to 18. 
I'm
> wondering if its ok to stick with this version since we're
already on
> slurm 19.05.
>
> Any help much appreciated.
>
>
>
>


Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Paul Edmon
The slurm docs say you can do two major releases at a time 
(https://slurm.schedmd.com/quickstart_admin.html):


"Almost every new major release of Slurm (e.g. 20.02.x to 20.11.x) 
involves changes to the state files with new data structures, new 
options, etc. Slurm permits upgrades to a new major release from the 
past two major releases, which happen every nine months (e.g. 20.02.x or 
20.11.x to 21.08.x) without loss of jobs or other state information."


As for old versions of slurm I think at this point you would need to 
contact SchedMD.  I'm sure they have past releases they can hand out if 
you are bootstrapping to a newer release.


-Paul Edmon-

On 5/17/22 11:42 AM, byron wrote:

Thanks Brian for the speedy response.

Am I not correct in thinking that if I just go from 19.05 to 20.11 
then there is the advantage that I can upgrade slurmd and slurmctld in 
one go and it won't affect the running jobs since upgrading to a new 
major release from the past two major releases doesn't affect the 
state information.  Or are you saying that  in this case (19.05  
direct to 21.08) there isn't any impact to running jobs either.  Or 
did you step through all the versions when upgrading slurmd and 
slurmctld also?


Also where do I get a copy of 20.2 from if SchedMD aren't providing it 
as a download?


Thanks




On Tue, May 17, 2022 at 4:05 PM Brian Andrus  wrote:

You need to step upgrade through major versions (not minor).

So 19.05=>20.x

I would highly recommend going to 21.08 while you are at it.
I just did the same migration (although they started at 18.x) with no
issues. Running jobs were not impacted and users didn't even notice.

Brian Andrus


On 5/17/2022 7:35 AM, byron wrote:
> Hi
>
> I'm looking at upgrading our install of slurm from 19.05 to
20.11 in
> response to the recently announced security vulnerabilities.
>
> I've been through the documentation / forums and have managed to
find
> the answers to most of my questions but am still unclear about the
> following
>
>  - In upgrading the slurmdbd from 19.05 to 20.11 do I need to go
> through all the versions (19.05 => 20.2 => 20.11)? From reading the
> forums it looks as though it is necessary
> https://groups.google.com/g/slurm-users/c/fftVPaHvTzQ/m/YTWo1mRjAwAJ
> https://groups.google.com/g/slurm-users/c/kXtepX8-L7I/m/udwySA3bBQAJ
>    However if that is the case it would seem strange that
SchedMD have
> removed 20.2 from the downloads page (I understand the reason is
that
> it contains the exploit) if it is still required for the upgrade.
>
> - We are running version 5.5.68 of the MariaDB, the version that
comes
> with centos7.9.   I've seen a few references to upgrading v5.5 but
> they were in the context of upgrading from slurm 17 to 18.  I'm
> wondering if its ok to stick with this version since we're
already on
> slurm 19.05.
>
> Any help much appreciated.
>
>
>
>


Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Paul Edmon
They fixed this in newer versions of Slurm.  We had the same issue with 
older versions, so we had to run with the config_override option on to 
keep the logs quiet.  They changed the way logging was done in the more 
recent releases and it's not as chatty.
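
For reference, that workaround is just a slurm.conf setting along these lines (the exact spelling has changed across releases, so check your version's man page; older releases expressed the same idea as FastSchedule=2):

SlurmdParameters=config_overrides    # treat the node definitions in slurm.conf as authoritative rather than what the nodes report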


-Paul Edmon-

On 5/12/22 7:35 AM, Per Lönnborg wrote:

Greetings,

is there a way to lower the log rate on error messages in slurmctld 
for nodes with hardware errors?


We see for example this for a node that has DIMM errors:

[2022-05-12T07:07:34.757] error: Node node37 has low real_memory size 
(257642 < 257660)
[2022-05-12T07:07:35.760] error: Node node37 has low real_memory size 
(257642 < 257660)
[2022-05-12T07:07:36.763] error: Node node37 has low real_memory size 
(257642 < 257660)
[2022-05-12T07:07:37.766] error: Node node37 has low real_memory size 
(257642 < 257660)
[2022-05-12T07:07:38.769] error: Node node37 has low real_memory size 
(257642 < 257660)
[2022-05-12T07:07:39.773] error: Node node37 has low real_memory size 
(257642 < 257660)
[2022-05-12T07:07:40.776] error: Node node37 has low real_memory size 
(257642 < 257660)
[2022-05-12T07:07:41.779] error: Node node37 has low real_memory size 
(257642 < 257660)
[2022-05-12T07:07:42.781] error: Node node37 has low real_memory size 
(257642 < 257660)
[2022-05-12T07:07:45.143] error: Node node37 has low real_memory size 
(257642 < 257660)


The log warning is correct, the node has DIMM errors, but that's one 
log entry per second. That doesn't seem right at such a high log rate?


Thanks,
/ Per Lonnborg






Re: [slurm-users] Slurm 21.08.8-2 upgrade

2022-05-06 Thread Paul Edmon
We upgraded from 21.08.6 to 21.08.8-1 yesterday morning but overnight we 
saw the communications issues described by Tim W.  We upgraded to 
21.08.8-2 this morning and that did the trick to resolve all the 
communications problems we were having.


-Paul Edmon-

On 5/6/2022 4:38 AM, Ole Holm Nielsen wrote:

Hi Juergen,

My upgrade report: We upgraded from 21.08.7 to 21.08.8-1 yesterday for 
the entire cluster, and we didn't have any issues.  I built RPMs from 
the tar-ball and simply did "yum update" on the nodes (one partition 
at a time) while the cluster was running in full production mode.  All 
slurmd get restarted during the yum update, and this happens within 
1-2 minutes per partition.


Today I upgraded from 21.08.1-1 to 21.08.8-2 for the entire cluster, 
and again we have not seen any issues.


We also do *not* setting CommunicationParameters=block_null_hash until 
a later date when there are no more old versions of slurmstepd 
running.  We did however see RPC errors with "Protocol authentication 
error" while block_null_hash was enabled briefly, see 
https://bugs.schedmd.com/show_bug.cgi?id=14002, and so we turned it 
off again.  It hasn't happened since.


Best regards,
Ole

On 5/6/22 01:57, Juergen Salk wrote:

Hi John,

this is really bad news. We have stopped our rolling update from Slurm
21.08.6 to Slurm 21.08.8-1 today for exactly that reason: State of
compute nodes already running slurmd 21.08.8-1 suddenly started
flapping between responding and not responding but all other nodes
that were still running version 21.08.6 slurmd were not affected.

For the affected nodes we did not see any obvious reason in slurmd.log
even with SlurmdDebug set to debug3 but we noticed the following
in slurmctld.log with SlurmctldDebug=debug and DebugFlags=route
enabled.

[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1423 
RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1424 
RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1425 
RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1426 
RPC:REQUEST_PING : Protocol authentication error
[2022-05-05T20:37:40.449] agent/is_node_resp: node:n1811 
RPC:REQUEST_PING : Protocol authentication error

[2022-05-05T20:37:41.397] error: Nodes n[1423-1426,1811] not responding

So you seen this as well with 21.08.8-2?

We didn't have CommunicationParameters=block_null_hash set, btw.

Actually, after Tim's last announcement, I was hoping that we can start
over tomorrow morning with 21.08.8-2 to resolve this issue. Therefore,
I would also be highly interested in what others can say about rolling 
updates from

Slurm 21.08.6 to Slurm 21.08.8-2 which, at least temporarily, entails a
mix of patched and unpatched slurmd versions on the compute nodes.

If 21.08.8-2 slurmd still does not work together with 21.08.6 slurmd
we may have to drain the whole cluster for updating Slurm, which
is something that I'd actually wished to avoid.

Best regards
Jürgen



* Legato, John (NIH/NHLBI) [E]  [220505 22:30]:

Hello,

We are in the process of upgrading from Slurm 21.08.6 to Slurm 
21.08.8-2. We’ve upgraded the controller and a few partitions worth 
of nodes. We notice the nodes are
losing contact with the controller but slurmd is still up. We 
thought that this issue was fixed in -2 based on this bug report:


https://bugs.schedmd.com/show_bug.cgi?id=14011

However we are still seeing the same behavior. I note that nodes 
running 21.08.6 are having no issues with communication. I could
upgrade the remaining 21.08.6 nodes but hesitate to do that as it 
seems like it would completely kill the functioning nodes.


Is anyone else still seeing this in -2?






Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

2022-05-03 Thread Paul Edmon
We've invoked scontrol in our epilog script for years to close off nodes 
with out any issue.  What the docs are really referring to is gratuitous 
use of those commands.  If you have those commands well circumscribed 
(i.e. only invoked when you have to actually close a node) and only use 
them when you absolutely have no other work around then you should be fine.
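
As a rough sketch of what that looks like in an epilog (the health check is a hypothetical site-specific script, and SLURMD_NODENAME comes from the epilog environment):

#!/bin/bash
# epilog fragment: drain with an explicit reason instead of exiting non-zero
if ! /usr/local/sbin/post_job_check; then
    scontrol update NodeName="$SLURMD_NODENAME" State=DRAIN Reason="post-job check failed"
fi
exit 0   # exit 0 so Slurm does not additionally mark the node with "Epilog error"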


-Paul Edmon-

On 5/3/2022 3:46 AM, taleinterve...@sjtu.edu.cn wrote:


Hi, all:

We need to detect some problems at job-end time, so we have written a 
detection script in the Slurm epilog, which should drain the node if the 
check does not pass.


I know that exiting the epilog with a non-zero code will make Slurm 
automatically drain the node. But in that case the drain reason will always 
be marked as *“Epilog error”*. Then our auto-repair program will have 
trouble determining how to repair the node.


Another way is to call *scontrol* directly from the epilog to drain the 
node, but the official doc https://slurm.schedmd.com/prolog_epilog.html 
says:


/Prolog and Epilog scripts should be designed to be as short as 
possible and should not call Slurm commands (e.g. squeue, scontrol, 
sacctmgr, etc). … Slurm commands in these scripts can potentially lead 
to performance issues and should not be used./


So what is the best way to drain a node from the epilog with a 
self-defined reason, or to tell Slurm to add a more verbose message than 
the *“Epilog error”* reason?


Re: [slurm-users] non-historical scheduling

2022-04-12 Thread Paul Edmon
So you want a purely fractional usage of the cluster.  That's hard to do 
via fairshare or without fairshare, as the scheduler will usually fill 
up all the nodes with the top priority job.  If you don't have fairshare 
running or any historical data it will revert to FIFO.  So whichever 
user got in first will go first, no matter how many jobs there are.  
Fairshare can accomplish what you want, but it takes time for it to 
settle into a steady state due to the behavior above.  If you chart the 
usage over time with fairshare you will see it even out, but at any 
given moment you will have one user dominating over another.


You could probably achieve a pure fractional usage model by utilizing 
hard limits for each user in terms of number of cores. The problem is 
that you will leave parts of the cluster open and idle.  If that is fine 
then I recommend setting hard limits for each user.
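
A minimal sketch of such a hard cap, assuming limits are enforced (AccountingStorageEnforce includes "limits") and using placeholder user/account names:

sacctmgr modify user where name=alice account=physics set GrpTRES=cpu=6
sacctmgr modify user where name=bob   account=physics set GrpTRES=cpu=6
# each user can now have at most 6 CPUs allocated to running jobs at any one time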


-Paul Edmon-

On 4/12/2022 8:55 AM, Chagai Nota wrote:

Hi Loris

Thanks for your answer.
I tried to configure it and I didn't get the desired results.
This is my configuration:
PriorityType=priority/multifactor
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=DAILY
PriorityFavorSmall=yes
PriorityWeightFairshare=10
PriorityWeightAge=0
PriorityWeightPartition=0
PriorityWeightJobSize=10
PriorityMaxAge=1-0
PriorityCalcPeriod=1

The desired result is that when 2 users A and B send jobs, they will each have an 
equal number of jobs running.
Let's say the whole grid has 12 slots, so users A and B should each get 6, 
but what happens is that user A gets 12 and only after some time does user B get 12.



-Original Message-
From: slurm-users  On Behalf Of Loris 
Bennett
Sent: Tuesday, April 12, 2022 12:06 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] non-historical scheduling


Hi Chagai,

Chagai Nota  writes:


Hi



I would like to ask if there is any option that slurm scheduler will consider 
only running jobs and not historical data.

We don't care about how many jobs users was running in the past but only the 
current usage.

Look at

   
https://slurm.schedmd.com/priority_multifactor.html

You probably need to set

   PriorityDecayHalfLife=0

and then, say,

   PriorityUsageResetPeriod=DAILY

Cheers,

Loris


Thanks

Chagai Nota



--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de







Re: [slurm-users] Limit partition to 1 job at a time

2022-03-22 Thread Paul Edmon
I think you could do this by clever use of a partition level QoS but I 
don't have an obvious way of doing this.
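
One variant worth testing (names are placeholders) would be a QoS with GrpJobs=1 attached as the partition QoS, which caps the number of running jobs across everyone in that partition:

sacctmgr add qos onejob
sacctmgr modify qos onejob set GrpJobs=1
# slurm.conf:
PartitionName=serialized Nodes=node[01-04] QOS=onejob State=UP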


-Paul Edmon-

On 3/22/2022 11:40 AM, Russell Jones wrote:

Hi all,

For various reasons, we need to limit a partition to being able to run 
max 1 job at a time. Not 1 job per user, but 1 job total at a time, 
while queuing any other jobs to run after this one is complete.


I am struggling to figure out how to do this. Any tips?

Thanks!




Re: [slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

2022-02-10 Thread Paul Edmon
We also noticed the same thing with 21.08.5.  In the 21.08 series 
SchedMD changed the way they handle cgroups to set the stage for cgroups 
v2 (see: https://slurm.schedmd.com/SLUG21/Roadmap.pdf). The 21.08.5 
introduced a bug fix which then caused mpirun to not pin properly 
(particularly for older versions of MPI): 
https://github.com/SchedMD/slurm/blob/slurm-21-08-5-1/NEWS  What we've 
recommended to users who have hit this is to swap over to using srun 
instead of mpirun, and the situation clears up.
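
In practice that just means changing the launch line in the job script, e.g. (my_app is a placeholder, and the --mpi value depends on how your MPI and Slurm were built, so treat it as an assumption):

# instead of:  mpirun ./my_app
srun --mpi=pmix ./my_app    # let Slurm start the ranks so the cgroup limits apply to each task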


-Paul Edmon-

On 2/10/2022 8:59 AM, Ward Poelmans wrote:

Hi Paul,

On 10/02/2022 14:33, Paul Brunk wrote:


Now we see a problem in which the OOM killer is in some cases

predictably killing job steps who don't seem to deserve it.  In some

cases these are job scripts and input files which ran fine before our

Slurm upgrade.  More details follow, but that's it the issue in a

nutshell.

I'm not sure if this is the case but it might help to keep in mind the 
difference between mpirun and srun.


With srun you let slurm create tasks with the appropriate mem/cpu etc 
limits and the mpi ranks will run directly in a task.


With mpirun you usually let your MPI distribution start on task per 
node which will spawn the mpi manager which will start the actual mpi 
program.


You might very well end up with different memory limits per process 
which could be the cause of your OOM issue. Especially if not all MPI 
ranks use the same amount of memory.


Ward


Re: [slurm-users] How to limit # of execution slots for a given node

2022-01-07 Thread Paul Edmon

Also I recommend setting:

*CoreSpecCount*
   Number of cores reserved for system use. These cores will not be
   available for allocation to user jobs. Depending upon the
   *TaskPluginParam* option of *SlurmdOffSpec*, Slurm daemons (i.e.
   slurmd and slurmstepd) may either be confined to these resources
   (the default) or prevented from using these resources. Isolation of
   the Slurm daemons from user jobs may improve application
   performance. If this option and *CpuSpecList* are both designated
   for a node, an error is generated. For information on the algorithm
   used by Slurm to select the cores refer to the core specialization
   documentation ( https://slurm.schedmd.com/core_spec.html ). 


and

*MemSpecLimit*
   Amount of memory, in megabytes, reserved for system use and not
   available for user allocations. If the task/cgroup plugin is
   configured and that plugin constrains memory allocations (i.e.
   *TaskPlugin=task/cgroup* in slurm.conf, plus *ConstrainRAMSpace=yes*
   in cgroup.conf), then Slurm compute node daemons (slurmd plus
   slurmstepd) will be allocated the specified memory limit. Note that
   having the Memory set in *SelectTypeParameters* as any of the
   options that has it as a consumable resource is needed for this
   option to work. The daemons will not be killed if they exhaust the
   memory allocation (ie. the Out-Of-Memory Killer is disabled for the
   daemon's memory cgroup). If the task/cgroup plugin is not
   configured, the specified memory will only be unavailable for user
   allocations. 

These will restrict specific memory and cores for system use. This is 
probably the best way to go rather than spoofing your config.
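
As a concrete sketch (node names and sizes are invented), both options go on the node definition in slurm.conf:

NodeName=fat[01-04] CPUs=128 RealMemory=512000 CoreSpecCount=4 MemSpecLimit=16384
# 4 cores and 16 GB are held back for slurmd/slurmstepd and the OS; only the remainder is allocatable to jobs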


-Paul Edmon-


On 1/7/2022 2:36 AM, Rémi Palancher wrote:

On Thursday, January 6, 2022 at 22:39, David Henkemeyer  
wrote:


All,

When my team used PBS, we had several nodes that had a TON of CPUs, so many, in 
fact, that we ended up setting np to a smaller value, in order to not starve 
the system of memory.

What is the best way to do this with Slurm? I tried modifying # of CPUs in the slurm.conf file, but 
I noticed that Slurm enforces that "CPUs" is equal to Boards * SocketsPerBoard * 
CoresPerSocket * ThreadsPerCore. This left me with having to "fool" Slurm into thinking 
there were either fewer ThreadsPerCore, fewer CoresPerSocket, or fewer SocketsPerBoard. This is a 
less than ideal solution, it seems to me. At least, it left me feeling like there has to be a 
better way.

I'm not sure you can lie to Slurm about the real number of CPUs on the nodes.

If you want to prevent Slurm from allocating more than n CPUs below the total 
number of CPUs of these nodes, I guess one solution is to use MaxCPUsPerNode=n 
at the partition level.

You can also mask "system" CPUs with CpuSpecList at node level.

The latter is better if you need fine-grained control over the exact list of 
reserved CPUs with regard to NUMA topology or whatever.

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io




Re: [slurm-users] How to limit # of execution slots for a given node

2022-01-07 Thread Paul Edmon
You can actually spoof the number of cores and RAM on a node by using 
the config_override option.  I've used that before for testing 
purposes.  Mind you, core binding and other features like that will not 
work if you start spoofing the number of cores and RAM, so use with caution.
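
A minimal sketch of that spoofing, assuming a recent slurm.conf syntax (older releases used FastSchedule=2 for the same effect), with invented numbers:

SlurmdParameters=config_overrides            # believe slurm.conf, not what the node registers
NodeName=fat01 CPUs=64 RealMemory=256000     # node physically has more, but only this much is offered for scheduling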


-Paul Edmon-

On 1/7/2022 2:36 AM, Rémi Palancher wrote:

On Thursday, January 6, 2022 at 22:39, David Henkemeyer  
wrote:


All,

When my team used PBS, we had several nodes that had a TON of CPUs, so many, in 
fact, that we ended up setting np to a smaller value, in order to not starve 
the system of memory.

What is the best way to do this with Slurm? I tried modifying # of CPUs in the slurm.conf file, but 
I noticed that Slurm enforces that "CPUs" is equal to Boards * SocketsPerBoard * 
CoresPerSocket * ThreadsPerCore. This left me with having to "fool" Slurm into thinking 
there were either fewer ThreadsPerCore, fewer CoresPerSocket, or fewer SocketsPerBoard. This is a 
less than ideal solution, it seems to me. At least, it left me feeling like there has to be a 
better way.

I'm not sure you can lie to Slurm about the real number of CPUs on the nodes.

If you want to prevent Slurm from allocating more than n CPUs below the total 
number of CPUs of these nodes, I guess one solution is to use MaxCPUsPerNode=n 
at the partition level.

You can also mask "system" CPUs with CpuSpecList at node level.

The latter is better if you need fine-grained control over the exact list of 
reserved CPUs with regard to NUMA topology or whatever.

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io







Re: [slurm-users] export qos

2021-12-17 Thread Paul Edmon
Just out of curiosity, is there a reason you aren't just doing a 
mysqldump of the extant DB and then reimporting it?


I'm not aware of a way to dump just the qos settings for import other than:

sacctmgr show qos
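
For moving QoS settings by hand, a parsable dump that can be turned back into sacctmgr add/modify commands on the new box is probably the practical route, e.g. (adjust the field list to what you actually use; check sacctmgr's help for the exact format names):

sacctmgr -P show qos format=Name,Priority,GrpTRES,MaxWall,MaxTRESPU,Flags > qos_dump.txt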

-Paul Edmon-

On 12/17/2021 10:24 AM, Williams, Jenny Avis wrote:


Sacctmgr dump gets the user listings, but I do not see how to dump qos 
settings.


Does anyone know of a quick way to export qos settings for import to a 
new sched box?


Jenny


Re: [slurm-users] slurmdbd full backup so the primary can be purged

2021-12-13 Thread Paul Edmon
I haven't tested with super ancient versions of Slurm but I know we have 
uploaded past versions before so we could scrape the data for XDMod.  So 
as far as I'm aware there is no version limitation, but your mileage may 
vary with very old versions of Slurm.  To make sure I would probably 
ping SchedMD as to any limitations they are aware of.  Usually they are 
pretty good about being comprehensive in their docs so they would have 
probably mentioned it if there was one.


-Paul Edmon-

On 12/13/2021 5:07 AM, Loris Bennett wrote:

Hi Paul,

Am I right in assuming that there are going to be some limitations to
loading archived data w.r.t. version of slurmdbd used to create the
archive and that used to read it?

Cheers,

Loris

Paul Edmon  writes:


Files generated by the slurmdbd archive are read back into the live database by 
sacctmgr.  See:

archive load

Load in to the database previously archived data. The archive file will not be 
loaded if the records already exist in the database - therefore, trying to load 
an archive file more than once will result in an error. When this data is again 
archived and
purged from the database, if the old archive file is still in the directory 
ArchiveDir, a new archive file will be created (see ArchiveDir in the 
slurmdbd.conf man page), so the old file will not be overwritten and these 
files will have duplicate records.

File=
  File to load into database. The specified file must exist on the slurmdbd 
host, which is not necessarily the machine running the command.
Insert=
  SQL to insert directly into the database. This should be used very cautiously 
since this is writing your sql into the database.

So you could set up a full mirror and then read the old archives into that.  
You just want to make sure that mirror has archiving/purging turned off so it 
won't rearchive the data you restored.

-Paul Edmon-

On 12/10/2021 1:28 PM, Ransom, Geoffrey M. wrote:

   


  Hello

 Our slurmdbd database is getting rather large and affecting performance, 
but we want to keep usage data around for a few years for metric purposes in 
order to figure out how our users work. I read a suggestion to have a backup DB
  which has all the usage data synced to it for metric purposes and a main 
slurmdbd setup for the cluster to use that cleans out old data based on your 
user working needs.

   


  Is there any documentation suggesting how to set up a second slurmdbd server 
that will receive a copy of all the main slurmdbd entries without purging so we 
can start purging on the in use slurmdbd service to keep short term
  performance snappy? Presumably the upgrade process will be complicated by 
this as well since we have to keep the archive slurmdbd setup in sync with the 
cluster slurmdbd.

   


  Thanks.

   


  *EDIT before hitting send*   I was re-reading the slurmdbd.conf man page and 
just saw the Archive* options and this sounds like it would work to implement 
something like this.

  Are archive files readable by sacct and sreport, or easily manually parseable?

  I am going to turn these on in my test cluster, but hearing about other 
peoples experiences with this would probably be helpful.





Re: [slurm-users] slurmdbd full backup so the primary can be purged

2021-12-10 Thread Paul Edmon
Files generated by the slurmdbd archive are read back into the live 
database by sacctmgr.  See:



 archive load

Load in to the database previously archived data. The archive file will 
not be loaded if the records already exist in the database - therefore, 
trying to load an archive file more than once will result in an error. 
When this data is again archived and purged from the database, if the 
old archive file is still in the directory ArchiveDir, a new archive 
file will be created (see ArchiveDir in the slurmdbd.conf man page), so 
the old file will not be overwritten and these files will have duplicate 
records.


/File=/
   File to load into database. The specified file must exist on the
   slurmdbd host, which is not necessarily the machine running the
   command. 
/Insert=/

   SQL to insert directly into the database. This should be used very
   cautiously since this is writing your sql into the database. 

So you could set up a full mirror and then read the old archives into 
that.  You just want to make sure that mirror has archiving/purging 
turned off so it won't rearchive the data you restored.
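
On that mirror the actual load is then just (the file name is a placeholder for whatever slurmdbd wrote into ArchiveDir):

sacctmgr archive load File=/path/to/old_archive_file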


-Paul Edmon-

On 12/10/2021 1:28 PM, Ransom, Geoffrey M. wrote:


Hello

   Our slurmdbd database is getting rather large and affecting 
performance, but we want to keep usage data around for a few years for 
metric purposes in order to figure out how our users work. I read a 
suggestion to have a backup DB which has all the usage data synced to 
it for metric purposes and a main slurmdbd setup for the cluster to 
use that cleans out old data based on your user working needs.


Is there any documentation suggesting how to set up a second slurmdbd 
server that will receive a copy of all the main slurmdbd entries 
without purging so we can start purging on the in use slurmdbd service 
to keep short term performance snappy? Presumably the upgrade process 
will be complicated by this as well since we have to keep the archive 
slurmdbd setup in sync with the cluster slurmdbd.


Thanks.

*EDIT before hitting send*   I was re-reading the slurmdbd.conf man 
page and just saw the Archive* options and this sounds like it would 
work to implement something like this.


Are archive files readable by sacct and sreport, or easily manually 
parseable?


I am going to turn these on in my test cluster, but hearing about 
other peoples experiences with this would probably be helpful.


Re: [slurm-users] Database Compression

2021-12-09 Thread Paul Edmon
Just to put a resolution on this.  I did some testing and compression 
does work but to get extant tables to compress you have to reimport your 
database.  So the procedure would be to:


1. Turn on compression in my.cnf following the doc.

2. mysqldump the database you want to compress

3. recreate that database (drop and remake it)

4. reimport the database

This can take a bit if your database is large.  However when I tested 
this with our production database it went from 130G on disk to 29G, a 
factor of 4.5 improvement (this is using the default settings and zlib).
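
Roughly, those four steps look like the following (the my.cnf variables follow the MariaDB page linked above, slurm_acct_db is Slurm's default database name, and slurmdbd should be stopped and MariaDB restarted after the my.cnf change before doing the dump/reload):

# 1. my.cnf, [mysqld] section: make new InnoDB tables page-compressed
innodb_compression_algorithm = zlib
innodb_compression_default   = ON
# 2-4. dump, recreate and reload the accounting database
mysqldump slurm_acct_db > slurm_acct_db.sql
mysql -e "DROP DATABASE slurm_acct_db; CREATE DATABASE slurm_acct_db;"
mysql slurm_acct_db < slurm_acct_db.sql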


I haven't had time to actually do it for real on our live system and see 
if there is a performance hit in terms of scheduling, but we keep a 
sizable buffer in memory so I'm not anticipating anything.


My verdict then is that if you are going to do it, do it before your 
database grows too big as doing the dump and reimport will take a while 
(for me it was about 4 hours start to finish on my test system).


-Paul Edmon-

On 12/2/2021 1:06 PM, Baer, Troy wrote:

My site has just updated to Slurm 21.08 and we are looking at moving to the 
built-in job script capture capability, so I'm curious about this as well.

--Troy

-Original Message-
From: slurm-users  On Behalf Of Paul 
Edmon
Sent: Thursday, December 2, 2021 10:30 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Database Compression

With the advent of the ability to store jobscripts in the slurmdb, our
db is growing at a fairly impressive rate (which is expected).  That
said I've noticed that our database backups are highly compressible
(factor of 24).  Not being a mysql expert I hunted around to see if it
could do native compression and it can:
https://mariadb.com/kb/en/innodb-page-compression/

My question is if anyone has had any experience with using page
compression for mariadb and if there are any hitches I should be aware of?

-Paul Edmon-






Re: [slurm-users] A Slurm topological scheduling question

2021-12-07 Thread Paul Edmon
This should be fine assuming you don't mind the mismatch in CPU speeds.  
Unless the codes are super sensitive to topology things should be okay 
as modern IB is wicked fast.



In our environment here we have a variety of different hardware types 
all networked together on the same IB fabric.  That said we create 
partitions for different hardware types and we don't have a queue that 
schedules across both, though we do have a backfill serial queue that 
underlies everything.  All of that though is scheduled via a single 
scheduler with a single topology.conf governing it all.  We also have 
all our internode IP comms going over our IB fabric and it works fine.
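
If the two IB switches were linked, the combined fabric could then be described with a single topology.conf along these lines (switch and node names are invented):

SwitchName=rack1leaf Nodes=amd[001-032]
SwitchName=rack2leaf Nodes=amd[033-064]
SwitchName=spine Switches=rack1leaf,rack2leaf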



-Paul Edmon-


On 12/7/2021 11:05 AM, David Baker wrote:

Hello,

These days we have now enabled topology aware scheduling on our Slurm 
cluster. One part of the cluster consists of two racks of AMD compute 
nodes. These racks are, now, treated as separate entities by Slurm. 
Soon, we may add another set of AMD nodes with slightly different CPU 
specs to the existing nodes. We'll aim to balance the new nodes across 
the racks re cooling/heating requirements. The new nodes will be 
controlled by a new partition.


Does anyone know if it is possible to regard the two racks as a single 
entity (by connecting the InfiniBand switches together), and so 
schedule jobs across the resources in the racks with no loss of 
efficiency. I would be grateful for your comments and ideas, please. 
The alternative is to put all the new nodes in a completely new rack, 
but that does mean that we'll have purchase some new Ethernet and IB 
switches. We are not happy, by the way, to have node/switch 
connections across racks.


Best regards,
David

Re: [slurm-users] [EXT] Re: slurmdbd does not work

2021-12-03 Thread Paul Edmon
I would check that you have MariaDB-shared installed too on the host you 
build on prior to your build.  They changed the way the packaging is done 
in MariaDB, and Slurm needs to detect the files in MariaDB-shared to 
actually trigger configure to build the mysql libs.
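
A couple of quick checks along those lines (package names differ between distro and mariadb.org packaging, so treat them as examples):

rpm -qa | grep -i mariadb                        # look for MariaDB-shared / MariaDB-devel (or mariadb-libs / mariadb-devel)
./configure ... 2>&1 | grep -i mysql             # configure output should say MySQL/MariaDB support was found
ls /usr/lib/slurm/accounting_storage_mysql.so    # only present if the plugin was actually built and installed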


-Paul Edmon-

On 12/3/2021 7:40 PM, Giuseppe G. A. Celano wrote:

10.4.22


On Sat, Dec 4, 2021 at 1:35 AM Brian Andrus  wrote:

Which version of Mariadb are you using?

Brian Andrus

On 12/3/2021 4:20 PM, Giuseppe G. A. Celano wrote:

After installation of libmariadb-dev, I have reinstalled the
entire slurm with ./configure + options, make, and make install.
Still, accounting_storage_mysql.so is missing.



On Sat, Dec 4, 2021 at 12:24 AM Sean Crosby
 wrote:

Did you run

./configure (with any other options you normally use)
make
make install

on your DBD server after you installed the mariadb-devel package?


*From:* slurm-users 
on behalf of Giuseppe G. A. Celano 
*Sent:* Saturday, 4 December 2021 10:07
*To:* Slurm User Community List 
*Subject:* [EXT] Re: [slurm-users] slurmdbd does not work
*
*External email: *Please exercise caution

*

The problem is the lack of
/usr/lib/slurm/accounting_storage_mysql.so

I have installed many mariadb-related packages, but that file
is not created by slurm after installation: is there a point
in the documentation where the installation procedure for the
database is made explicit?



On Fri, Dec 3, 2021 at 5:15 PM Brian Andrus
 wrote:

You will need to also reinstall/restart slurmdbd with the
updated binary.

Look in the slurmdbd logs to see what is happening there.
I suspect it had errors updating/creating the database
and tables. If you have no data in it yet, you can just
DROP the database and restart slurmdbd.

Brian Andrus

On 12/3/2021 6:42 AM, Giuseppe G. A. Celano wrote:

Thanks for the answer, Brian. I now added
--with-mysql_config=/etc/mysql/my.cnf, but the problem
is still there and now also slurmctld does not work,
with the error:

[2021-12-03T15:36:41.018] accounting_storage/slurmdbd:
clusteracct_storage_p_register_ctld: Registering
slurmctld at port 6817 with slurmdbd
[2021-12-03T15:36:41.019] error: _conn_readable:
persistent connection for fd 9 experienced error[104]:
Connection reset by peer
[2021-12-03T15:36:41.019] error:
_slurm_persist_recv_msg: only read 150 of 2613 bytes
[2021-12-03T15:36:41.019] error: Sending PersistInit
msg: No error
[2021-12-03T15:36:41.020] error: _conn_readable:
persistent connection for fd 9 experienced error[104]:
Connection reset by peer
[2021-12-03T15:36:41.020] error:
_slurm_persist_recv_msg: only read 150 of 2613 bytes
[2021-12-03T15:36:41.020] error: Sending PersistInit
msg: No error
[2021-12-03T15:36:41.020] error: _conn_readable:
persistent connection for fd 9 experienced error[104]:
Connection reset by peer
[2021-12-03T15:36:41.020] error:
_slurm_persist_recv_msg: only read 150 of 2613 bytes
[2021-12-03T15:36:41.020] error: Sending PersistInit
msg: No error
[2021-12-03T15:36:41.020] error: DBD_GET_TRES failure:
No error
[2021-12-03T15:36:41.021] error: _conn_readable:
persistent connection for fd 9 experienced error[104]:
Connection reset by peer
[2021-12-03T15:36:41.021] error:
_slurm_persist_recv_msg: only read 0 of 2613 bytes
[2021-12-03T15:36:41.021] error: Sending PersistInit
msg: No error
[2021-12-03T15:36:41.021] error: DBD_GET_QOS failure: No
error
[2021-12-03T15:36:41.021] error: _conn_readable:
persistent connection for fd 9 experienced error[104]:
Connection reset by peer
[2021-12-03T15:36:41.021] error:
_slurm_persist_recv_msg: only read 150 of 2613 bytes
[2021-12-03T15:36:41.021] error: Sending PersistInit
msg: No error
[2021-12-03T15:36:41.021] error: DBD_GET_USERS failure:
No error
[2021-12-03T15:36:41.022] error: _conn_readable:
persistent connection for fd 9 experienced error[104]:
Connection reset by peer
[2021-12-03T15:36:41.022] error

Re: [slurm-users] Preferential scheduling on a subset of nodes

2021-12-01 Thread Paul Edmon
If you set up a higher priority partition with Preemption OFF on the 
lower priority partition you should be able to accomplish this.  If you 
have preemption turned off for the specific partitions in question Slurm 
will not preempt but will schedule jobs from the higher priority 
partition first regardless of current fairshare scores. See:


*PreemptMode*
   Mechanism used to preempt jobs or enable gang scheduling for this
   partition when *PreemptType=preempt/partition_prio* is configured.
   This partition-specific *PreemptMode* configuration parameter will
   override the cluster-wide *PreemptMode* for this partition. It can
   be set to OFF to disable preemption and gang scheduling for this
   partition. See also *PriorityTier* and the above description of the
   cluster-wide *PreemptMode* parameter for further details. 


This is at least how we manage that.
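
In slurm.conf terms that looks roughly like this (names, node lists and times are placeholders): the owners' partition is considered first because of the higher PriorityTier, but with PreemptMode=OFF its jobs simply wait for running work to finish instead of preempting it.

PartitionName=long    Nodes=node[01-08] MaxTime=14-00:00:00 AllowGroups=owners PriorityTier=10 PreemptMode=OFF State=UP
PartitionName=compute Nodes=node[01-08] MaxTime=2-00:00:00 PriorityTier=1 PreemptMode=OFF State=UP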

-Paul Edmon-

On 12/1/2021 11:32 AM, Sean McGrath wrote:

Hi,

Apologies for having to ask such a basic question.

We want to be able to give some users preferential access to some
nodes. They bought the nodes which are currently in a 'long' partition
as their jobs need a longer walltime.

When the purchasing users group is not using the nodes I would like
other users to be able to run jobs on those nodes but when the owners
group submit jobs I want those jobs to be queued as soon as currently
running jobs on those nodes are finished. My understanding is that
preemption won't work in these circumstances as it will either cancel or
suspend currently running jobs, I want the currently running jobs to
finish before the preferential ones start.

I'm wondering if QOS could do what we need here. Can the following be
sanity checked please.

Put the specific nodes in both the long and the compute (standard)
partition. Then restrict access to the long partition to specified users
so that all users can access them in the compute queue but only a subset
of users can use the longer wall time queue.

$ scontrol update PartitionName=long Users=user1,user2

We currently don't have QOS enabled so change that in slurm.conf and
restart the slurmctld.
-PriorityWeightQOS=0
+PriorityWeightQOS=1

Then create a qos and modify its priority
$ sacctmgr add qos boost
$ sacctmgr modify qos boost set priority=10
$ sacctmgr modify user user1 set qos=boost

Will that do what I expect please?

Many thanks and again apologies for the basic question.

Sean


Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-25 Thread Paul Edmon
I think it depends on the filesystem type.  Lustre generally fails over 
nicely and handles reconnections without much of a problem.  We've done 
this before without any hitches, even with the jobs being live.  
Generally the jobs just hang and then resolve once the filesystem comes 
back.  On a live system you will end up with a completion storm: jobs 
are always exiting, so while the filesystem is gone the jobs 
dependent on it will just hang, and if they are completing they will just 
stall on the completion step.  Once it returns, all that traffic 
flushes. This can create issues where a bunch of nodes get closed due to 
Kill task fail or other completion flags.  Generally these are harmless, 
though I have seen stuck processes on nodes and have had to reboot them 
to clear them, so you should check any node before putting it back in action.


That said, if you are pausing all the jobs and scheduling, this is somewhat 
mitigated, though jobs will still exit due to timeout.


-Paul Edmon-

On 10/25/2021 4:47 AM, Alan Orth wrote:

Dear Jurgen and Paul,

This is an interesting strategy, thanks for sharing. So if I read the 
scontrol man page correctly, `scontrol suspend` sends a SIGSTOP to all 
job processes. The processes remain in memory, but are paused. What 
happens to open file handles, since the underlying filesystem goes 
away and comes back?


Thank you,

On Sat, Oct 23, 2021 at 1:10 AM Juergen Salk  
wrote:


Thanks, Paul, for confirming our planned approach. We did it that way
and it worked very well. I have to admit that my fingers were a bit
wet when suspending thousands of running jobs, but it worked without
any problems. I just didn't dare to resume all suspended jobs at
once, but did that in a staggered manner.

Best regards
Jürgen

* Paul Edmon  [211019 15:15]:
> Yup, we follow the same process for when we do Slurm upgrades,
this looks
> analogous to our process.
>
    > -Paul Edmon-
>
> On 10/19/2021 3:06 PM, Juergen Salk wrote:
> > Dear all,
> >
> > we are planning to perform some maintenance work on our Lustre
file system
> > which may or may not harm running jobs. Although failover
functionality is
> > enabled on the Lustre servers we'd like to minimize risk for
running jobs
> > in case something goes wrong.
> >
> > Therefore, we thought about suspending all running jobs and resume
> > them as soon as file systems are back again.
> >
> > The idea would be to stop Slurm from scheduling new jobs as a
first step:
> >
> > # for p in foo bar baz; do scontrol update PartitionName=$p
State=DOWN; done
> >
> > with foo, bar and baz being the configured partitions.
> >
> > Then suspend all running jobs (taking job arrays into account):
> >
> > # squeue -ho %A -t R | xargs -n 1 scontrol suspend
> >
> > Then perform the failover of OSTs to another OSS server.
> > Once done, verify that file system is fully back and all
> > OSTs are in place again on the client nodes.
> >
> > Then resume all suspended jobs:
> >
> > # squeue -ho %A -t S | xargs -n 1 scontrol resume
> >
> > Finally bring back the partitions:
> >
> > # for p in foo bar baz; do scontrol update PartitionName=$p
State=UP; done
> >
> > Does that make sense? Is that common practice? Are there any
caveats that
> > we must think about?
> >
> > Thank you in advance for your thoughts.
> >
> > Best regards
> > Jürgen
> >



--
Alan Orth
alan.o...@gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch

Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-19 Thread Paul Edmon
Yup, we follow the same process for when we do Slurm upgrades, this 
looks analogous to our process.


-Paul Edmon-

On 10/19/2021 3:06 PM, Juergen Salk wrote:

Dear all,

we are planning to perform some maintenance work on our Lustre file system
which may or may not harm running jobs. Although failover functionality is
enabled on the Lustre servers we'd like to minimize risk for running jobs
in case something goes wrong.

Therefore, we thought about suspending all running jobs and resume
them as soon as file systems are back again.

The idea would be to stop Slurm from scheduling new jobs as a first step:

# for p in foo bar baz; do scontrol update PartitionName=$p State=DOWN; done

with foo, bar and baz being the configured partitions.

Then suspend all running jobs (taking job arrays into account):

# squeue -ho %A -t R | xargs -n 1 scontrol suspend

Then perform the failover of OSTs to another OSS server.
Once done, verify that file system is fully back and all
OSTs are in place again on the client nodes.

Then resume all suspended jobs:

# squeue -ho %A -t S | xargs -n 1 scontrol resume

Finally bring back the partitions:

# for p in foo bar baz; do scontrol update PartitionName=$p State=UP; done

Does that make sense? Is that common practice? Are there any caveats that
we must think about?

Thank you in advance for your thoughts.

Best regards
Jürgen







Re: [slurm-users] slurm.conf syntax checker?

2021-10-13 Thread Paul Edmon
Sadly no.  There is a feature request for one though: 
https://bugs.schedmd.com/show_bug.cgi?id=3435


What we've done in the meantime is put together a gitlab runner which 
basically starts up a mini instance of the scheduler and runs slurmctld 
on the slurm.conf we want to put in place.  We then have it reject any 
changes that cause failure.  It's not perfect but it works.  A real 
syntax checker would be better.
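
The core of that check is really just pointing a throwaway slurmctld at the candidate file and looking for fatal/error output, along these lines (a very rough sketch: paths are placeholders, the flags should be verified against your version, and the test instance needs its own StateSaveLocation and ports so it cannot touch production):

slurmctld -D -f /tmp/ci/slurm.conf > /tmp/ci/slurmctld.log 2>&1 &
ctld_pid=$!
sleep 15
kill "$ctld_pid"
! grep -Ei "fatal|error" /tmp/ci/slurmctld.log   # fail the pipeline if the config did not parse cleanly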


-Paul Edmon-

On 10/12/2021 4:08 PM, bbenede...@goodyear.com wrote:

Is there any sort of syntax checker that we could run our slurm.conf file
through before committing it?  (And sometimes crashing slurmctld in the
process...)

Thanks!





[slurm-users] Using Nice to Break Ties

2021-09-14 Thread Paul Edmon
We use the classic fairshare algorithm here with users having their 
shares set to parent and pulling from the group pool rather than 
having each user have their own fairshare (you can see our doc here: 
https://docs.rc.fas.harvard.edu/kb/fairshare/). This has worked very 
well for us for many years.  However, there is a use case where this 
doesn't work namely breaking ties internal to a group.  We have a lot of 
private partitions owned by a specific group and when you have a bunch 
of users in that group the queue turns into FIFO instead of letting 
lower usage users go first due to the parent flag on the fairshare.  Now 
this is obviously solved by giving every user their own fairshare but 
this has the downside of impacting the users priority back on the shared 
partitions with other groups where they will not be able to use their 
groups full fairshare but instead are stuck with their own.  Thus their 
total group fairshare may be something like 0.4 but their personal is 
stuck at 0 because they are one of the heaviest users in the lab.


Now I get the feeling that Fair Tree might solve this but I can't move 
to it as it's taken years for our users to even understand and accept 
the classic fairshare model.  As such I'm trying to come up with 
solutions that work within the model.  One option I have been 
considering is using the job_submit.lua script to set a Nice value for 
all the jobs based on that users usage.  Basically the nice value would 
break the internal ties of the group and allow non-FIFO scheduling 
internal to accounts without impacting their overall fairshare relative 
to other groups.
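
The effect being aimed for is the same one you can produce by hand with the nice field, e.g. (values and job IDs are arbitrary; the job_submit.lua version would simply set the job's nice value from the user's recent usage at submit time):

sbatch --nice=5000 job.sh                 # submit with the priority lowered by 5000
scontrol update JobId=12345 Nice=5000     # or adjust a pending job after the fact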


Before I start messing around with this, though, I wanted to ping the 
wisdom of the group and see how others handle tie breaking internal to 
an account/group/lab.  What solutions have people used for this?


-Paul Edmon-




Re: [slurm-users] User CPU limit across partitions?

2021-08-03 Thread Paul Edmon
I think you can accomplish this by setting a Partition QoS and defining it 
to hook into the same QoS for all three.  I believe that would force it 
to share the same pool.


That said, I don't know if that would work properly; it's worth a test.  
That is my first guess though.
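
A sketch of that wiring, with placeholder names, would be a single QoS carrying the per-user cap and all three partitions pointing at it as their partition QoS:

sacctmgr add qos cap1000
sacctmgr modify qos cap1000 set MaxTRESPerUser=cpu=1000
# slurm.conf:
PartitionName=large  Nodes=... QOS=cap1000
PartitionName=medium Nodes=... QOS=cap1000
PartitionName=small  Nodes=... QOS=cap1000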


-Paul Edmon-

On 8/3/2021 2:35 PM, bbenede...@goodyear.com wrote:

Good day.

Is it possible to have user limits ACROSS partitions?

Say I have three partitions, large, medium, and small.

I would like my users to have a 1000 cpu limit across all three partitions.
So that they could use up to 1000 cpus in any combination of large, medium,
and small.  But I don't want to limit them to 333 in each partition, but rather
total up across all of the partitions to be no more than 1000 cpus.

Is that possible?

Thanks!





Re: [slurm-users] declare availability of up to 8 cores//job

2021-08-02 Thread Paul Edmon

From that page:

*GrpTRES=* The total count of TRES able to be used at any given time 
from jobs running from an association and its children or QOS. If this 
limit is reached new jobs will be queued but only allowed to run after 
resources have been relinquished from this group.


So basically it's the sum total of all the TRES a group could run in a 
partition at one time.


-Paul Edmon-

On 8/2/2021 12:05 PM, Adrian Sevcenco wrote:

On 8/2/21 6:26 PM, Paul Edmon wrote:

Probably more like

MaxTRESPerJob=cpu=8


i see, thanks!!

i'm still searching for the definition of GrpTRES :)

Thanks a lot!
Adrian




You would need to specify how much TRES you need for each job in the 
normal tres format.


-Paul Edmon-

On 8/2/2021 11:24 AM, Adrian Sevcenco wrote:

On 8/2/21 5:44 PM, Paul Edmon wrote:
You can set up a Partition based QoS that can set this limit: 
https://slurm.schedmd.com/resource_limits.html  See the 
MaxTRESPerJob limit.

oh, thanks a lot!!

would something like this work/be in line with your indication? :

add qos 8cpu GrpTRES=cpu=1 MaxTRESPerJob=8

modify account blah DefaultQOS=8cpu


Thanks a lot!
Adrian



-Paul Edmon-


On 8/2/2021 10:40 AM, Adrian Sevcenco wrote:

Hi! Is there a way to declare that jobs can request up to 8 cores?
Or is it allowed by default (as I see no limit regarding this .. ) 
.. I just have MaxNodes=1

this is the CR_CPU allocator

Thank you!
Adrian














Re: [slurm-users] declare availability of up to 8 cores//job

2021-08-02 Thread Paul Edmon

Probably more like

MaxTRESPerJob=cpu=8

You would need to specify how much TRES you need for each job in the 
normal tres format.
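
Spelled out fully (QoS and partition names are invented), that would be something like:

sacctmgr add qos max8cpu
sacctmgr modify qos max8cpu set MaxTRESPerJob=cpu=8
# slurm.conf: make it the partition QoS
PartitionName=work Nodes=... MaxNodes=1 QOS=max8cpu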


-Paul Edmon-

On 8/2/2021 11:24 AM, Adrian Sevcenco wrote:

On 8/2/21 5:44 PM, Paul Edmon wrote:
You can set up a Partition based QoS that can set this limit: 
https://slurm.schedmd.com/resource_limits.html  See the MaxTRESPerJob 
limit.

oh, thanks a lot!!

would something like this work/be in line with your indication? :

add qos 8cpu GrpTRES=cpu=1 MaxTRESPerJob=8

modify account blah DefaultQOS=8cpu


Thanks a lot!
Adrian



-Paul Edmon-


On 8/2/2021 10:40 AM, Adrian Sevcenco wrote:

Hi! Is there a way to declare that jobs can request up to 8 cores?
Or is it allowed by default (as I see no limit regarding this .. ) 
.. I just have MaxNodes=1

this is the CR_CPU allocator

Thank you!
Adrian











Re: [slurm-users] declare availability of up to 8 cores//job

2021-08-02 Thread Paul Edmon
You can set up a Partition based QoS that can set this limit: 
https://slurm.schedmd.com/resource_limits.html  See the MaxTRESPerJob limit.


-Paul Edmon-


On 8/2/2021 10:40 AM, Adrian Sevcenco wrote:

Hi! Is there a way to declare that jobs can request up to 8 cores?
Or is it allowed by default (as I see no limit regarding this .. ) .. 
I just have MaxNodes=1

this is the CR_CPU allocator

Thank you!
Adrian






Re: [slurm-users] Can I get the original sbatch command, after the fact?

2021-07-16 Thread Paul Edmon
Not in the current version of Slurm.  In the next major version long 
term storage of job scripts will be available.


-Paul Edmon-

On 7/16/2021 2:16 PM, David Henkemeyer wrote:
If I execute a bunch of sbatch commands, can I use sacct (or something 
else) to show me the original sbatch command line for a given job ID?


Thanks
David




Re: [slurm-users] MinJobAge

2021-07-06 Thread Paul Edmon

The documentation indicates that's what should happen with MinJobAge:

*MinJobAge*
   The minimum age of a completed job before its record is purged from
   Slurm's active database. Set the values of *MaxJobCount* and *MinJobAge* to
   ensure the slurmctld daemon does not exhaust its memory or other
   resources. The default value is 300 seconds. A value of zero
   prevents any job record purging. Jobs are not purged during a
   backfill cycle, so it can take longer than MinJobAge seconds to
   purge a job if using the backfill scheduling plugin. In order to
   eliminate some possible race conditions, the minimum non-zero value
   for *MinJobAge* recommended is 2. 

From my experience this does work.  We've been running with 
MinJobAge=600 for years with out any problems to my knowledge


-Paul Edmon-

On 7/6/2021 8:59 AM, Emre Brookes wrote:


  Brian Andrus

Nov 23, 2020, 1:55:54 PM
to slurm...@lists.schedmd.com
All,

I always thought that MinJobAge affected how long a job will show up
when doing 'squeue'

That does not seem to be the case for me.

I have MinJobAge=900, but if I do 'squeue --me' as soon as I finish an
interactive job, there is nothing in the queue.

I swear I used to see jobs in a completed state for a period of time,
but they are not showing up at all on our cluster.


How does one have jobs show up that are completed?
I'm using slurm 20.02.7 & have the same issue (except I am running 
batch jobs).
Does MinJobAge work to keep completed jobs around for the specified 
duration in squeue output?


Thanks,
Emre




Re: [slurm-users] Long term archiving

2021-06-28 Thread Paul Edmon
We keep 6 months in our active database and then we archive and purge 
anything older than that.  The archive data itself is available for 
reimport and historical investigation.  We've done this when importing 
historical data into XDMod.
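
For reference, the slurmdbd.conf knobs that implement that kind of policy look roughly like this (the path is a placeholder and the exact time-unit spelling should be checked against your version's slurmdbd.conf man page):

ArchiveDir=/var/spool/slurm/archive
ArchiveJobs=yes
ArchiveSteps=yes
ArchiveEvents=yes
PurgeJobAfter=6month
PurgeStepAfter=6month
PurgeEventAfter=6month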


-Paul Edmon-

On 6/28/2021 10:43 AM, Yair Yarom wrote:

Hi list,

I was wondering if you could share your long term archiving practices.

We currently purge and archive the jobs after 31 days, and keep the 
usage data without purging. This gives us a reasonable history, and a 
downtime of "only" a few hours on database upgrade. We currently don't 
load the archives into a secondary db.


We now have a use-case which might require us to save job information 
for more than that, and we're considering how to do that.


Thanks in advance,


--
   /||
   \/|Yair Yarom | System Group (DevOps)
   []|The Rachel and Selim Benin School
   []  /\ |of Computer Science and Engineering
   []//\\/   |The Hebrew University of Jerusalem
   [//   \\   |T +972-2-5494522 | F +972-2-5494522
   // \   |ir...@cs.huji.ac.il <mailto:ir...@cs.huji.ac.il>
  // |


Re: [slurm-users] Upgrading slurm - can I do it while jobs running?

2021-05-26 Thread Paul Edmon
We generally pause scheduling during upgrades out of paranoia more than 
anything.  What that means is that we set all our partitions to DOWN and 
suspend all the jobs.  Then we do the upgrade.  That said I know of 
people who do it live with out much trouble.


The risk is more substantial for major version upgrades than minors. So 
if you are doing a minor version upgrade it's likely fine to do live.  
For major version I would recommend at least pausing all the jobs.


-Paul Edmon-

On 5/26/2021 2:48 PM, Ole Holm Nielsen wrote:

On 26-05-2021 20:23, Will Dennis wrote:
About to embark on my first Slurm upgrade (building from source now, 
into a versioned path /opt/slurm// which is then symlinked to 
/opt/slurm/current/ for the “in-use” one…) This is a new cluster, 
running 20.11.5 (which we now know has a CVE that was fixed in 
20.11.7) but I have researchers running jobs on it currently. As I’m 
still building out the cluster, I found today that all Slurm source 
tarballs before 20.11.7 were withdrawn by SchedMD. So, need to 
upgrade at least the -ctld and -dbd nodes before I can roll any new 
nodes out on 20.11.7…


As I have at least one researcher that is running some long multi-day 
jobs, can I down the -dbd and -ctld nodes and upgrade them, then put 
them back online running the new (latest) release, without munging 
the jobs on the running worker nodes?


I recommend strongly to read the SchedMD presentations in the 
https://slurm.schedmd.com/publications.html page, especially the 
"Field notes" documents.  The latest one is "Field Notes 4: From The 
Frontlines of Slurm Support", Jason Booth, SchedMD.


We upgrade Slurm continuously while the nodes are in production mode. 
There's a required order of upgrading: first slurmdbd, then slurmctld, 
then slurmd nodes, and finally login nodes, see

https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

The detailed upgrading commands for CentOS are in
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-on-centos-7 



We don't have any problems with running jobs across upgrades, but 
perhaps others can share their experiences?


/Ole





Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Paul Edmon
XDMod can give these sorts of stats.  I also have some diamond 
collectors we use in concert with grafana to pull data and plot it which 
is useful for seeing large scale usage trends:


https://github.com/fasrc/slurm-diamond-collector
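
Alongside those, plain sreport can already give a headline utilization figure straight from the accounting database, e.g. (dates are placeholders):

sreport cluster utilization start=2021-01-01 end=2021-05-01 -t percent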

-Paul Edmon-

On 5/13/2021 6:08 PM, Sid Young wrote:


Hi All,

Is there a way to define an effective "usage rate" of an HPC cluster 
using the data captured in the slurm database?


Primarily I want to see if it can be helpful in presenting to the 
business a case for buying more hardware for the HPC :)


Sid Young


Re: [slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Paul Edmon

Yup, we use XDMod for this sort of data as well.

-Paul Edmon-

On 5/11/2021 8:52 AM, Renfro, Michael wrote:
XDMoD [1] is useful for this, but it’s not a simple script. It does 
have some user-accessible APIs if you want some report automation. I’m 
using that to create a lightning-talk-style slide at [2].


[1] https://open.xdmod.org/ <https://open.xdmod.org/>
[2] https://github.com/mikerenfro/one-page-presentation-hpc 
<https://github.com/mikerenfro/one-page-presentation-hpc>


On May 11, 2021, at 5:18 AM, Diego Zuccato  
wrote:


On 11/05/21 11:21, Ole Holm Nielsen wrote:

Tks for the very fast answer.


I have written some accounting tools which are in
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct
Maybe you can use the "topreports" tool?

Testing it just now. I'll probably have to do some changes (re field
width: our usernames are quite long, being from AD), but first I have to
check if it extracts the info our users want to see :)

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Testing Lua job submit plugins

2021-05-06 Thread Paul Edmon
We go the route of having a test cluster and vetting our lua scripts 
there before putting them in the production environment.


-Paul Edmon-

On 5/6/2021 1:23 PM, Renfro, Michael wrote:
I’ve used the structure at 
https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5 
<https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5> to 
handle basic test/production branching. I can isolate the new behavior 
down to just a specific set of UIDs that way.


Factoring out code into separate functions helps, too.

I’ve seen others go so far as to put the functions into separate 
files, but I haven’t needed that yet.



On May 6, 2021, at 12:11 PM, Michael Robbert  wrote:







I’m wondering if others in the Slurm community have any tips or best 
practices for the development and testing of Lua job submit plugins. 
Is there anything that can be done prior to deployment on a 
production cluster that will help to ensure the code is going to do 
what you think it does or at the very least not prevent any jobs from 
being submitted? I realize that any configuration change in 
slurm.conf could break everything, but I feel like adding Lua code 
adds enough complexity that I’m a little more hesitant to just throw 
it in. Any way to run some kind of linting or sanity tests on the Lua 
script? Additionally, does the script get read in one time at startup 
or reconfig or can it be changed on the fly just by editing the file?


Maybe a separate issue, but does anybody have an recipes to build a 
local test cluster in Docker that could be used to test this? I was 
working on one, but broke my local Docker install and thought I’d 
send this note out while I was working on rebuilding it.


Thanks in advance,

Mike Robbert



[slurm-users] Replacement for diamond

2021-05-04 Thread Paul Edmon
Python diamond has historically been really useful for shipping data to 
graphite.  We have a bunch of diamond collectors we wrote for slurm as a 
result: https://github.com/fasrc/slurm-diamond-collector  However with 
python 2 being end of life and diamond being unavailable for python 3 we 
need a new option. So what do people use for shipping various slurm 
stats to graphite?


-Paul Edmon-




Re: [slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Paul Edmon
Since you can run an arbitrary script as a node health checker, I might 
add a script that counts failures and then closes the node if it hits a 
threshold. The script shouldn't need to talk to the slurmctld or 
slurmdbd, as it should be able to watch the log on the node and see the 
failures.
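
Something along these lines could work as a starting point; it counts recent 
failures in the log rather than strictly consecutive ones, and the log path, 
match pattern, and threshold are assumptions you would need to adapt to your 
site:

    #!/bin/bash
    # drain this node if the local slurmd log shows too many recent job failures
    LOG=/var/log/slurmd.log        # adjust to your slurmd log location
    THRESHOLD=10
    FAILS=$(tail -n 2000 "$LOG" | grep -c 'error:.*JobId=')
    if [ "$FAILS" -ge "$THRESHOLD" ]; then
        scontrol update NodeName="$(hostname -s)" State=DRAIN \
            Reason="health check: repeated job failures"
    fi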


-Paul Edmon-

On 5/4/2021 12:09 PM, Gerhard Strangar wrote:

Hello,

how do you implement something like "drain host after 10 consecutive
failed jobs"? Unlike a host check script that checks for known errors,
I'd like to stop killing jobs just because one node is faulty.

Gerhard





Re: [slurm-users] Fairshare config change affect on running/queued jobs?

2021-04-30 Thread Paul Edmon
It shouldn't impact running jobs, all it should really do is impact 
pending jobs as it will order them by their relative priority scores.


-Paul Edmon-

On 4/30/2021 12:39 PM, Walsh, Kevin wrote:

Hello everyone,

We wish to deploy "fair share" scheduling configuration and would like 
to inquire if we should be aware of effects this might have on jobs 
already running or already queued when the config is changed.


The proposed changes are from the example at 
https://slurm.schedmd.com/archive/slurm-18.08.9/priority_multifactor.html#config 
<https://slurm.schedmd.com/archive/slurm-18.08.9/priority_multifactor.html#config> 
:


# Activate the Multi-factor Job Priority Plugin with decay
PriorityType=priority/multifactor
# 2 week half-life
PriorityDecayHalfLife=14-0
# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO
# The job's age factor reaches 1.0 after waiting in the
# queue for 2 weeks.
PriorityMaxAge=14-0
# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=1
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor

We're running SLURM 18.08.8 on CentOS Linux 7.8.2003. The current 
slurm.conf is defaults as far as fair share is concerned:


EnforcePartLimits=ALL
GresTypes=gpu
MpiDefault=pmix
ProctrackType=proctrack/cgroup
PrologFlags=x11,contain
PropagateResourceLimitsExcept=MEMLOCK,STACK
RebootProgram=/sbin/reboot
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
SlurmdSyslogDebug=verbose
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/cgroup,task/affinity
TaskPluginParam=Sched
HealthCheckInterval=300
HealthCheckProgram=/usr/sbin/nhc
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=1024
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
AccountingStorageHost=sched-db.lan
AccountingStorageLoc=slurm_acct_db
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreJobComment=YES
AccountingStorageTRES=gres/gpu
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmdDebug=info
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
SlurmSchedLogLevel=1

Node and partition configs are omitted above.

Any and all advice will be greatly appreciated.

Best wishes,

~Kevin

Kevin Walsh
Senior Systems Administration Specialist
New Jersey Institute of Technology
Academic & Research Computing Systems




Re: [slurm-users] OpenMPI interactive change in behavior?

2021-04-28 Thread Paul Edmon
I haven't experienced this issue here.  Then again we've been using PMIx 
for launching MPI for a while now, thus we may have circumvented this 
particular issue.


-Paul Edmon-

On 4/28/2021 9:41 AM, John DeSantis wrote:

Hello all,

Just an update, the following URL almost mirrors the issue we're seeing: 
https://github.com/open-mpi/ompi/issues/8378

But, SLURM 20.11.3 was shipped with the fix.  I've verified that the changes 
are in the source code.

We don't want to have to downgrade SLURM to 20.02.x, but it seems that this 
behaviour still exists.  Are no other sites on fresh installs of >= SLURM 
20.11.3 experiencing this problem?

I was aware of the changes in 20.11.{0..2} which received a lot of scrunity, 
which is why 20.11.3 was selected.

Thanks,
John DeSantis

On 4/26/21 5:12 PM, John DeSantis wrote:

Hello all,

We've recently (don't laugh!) updated two of our SLURM installations from 
16.05.10-2 to 20.11.3 and 17.11.9, respectively.  Now, OpenMPI doesn't seem to 
function in interactive mode across multiple nodes as it did previously on the 
latest version 20.11.3;  using `srun` and `mpirun` on a single node gives 
desired results, while using multiple nodes causes a hang. Jobs submitted via 
`sbatch` do _work as expected_.

[desantis@sclogin0 ~]$ scontrol show config |grep VERSION; srun -n 2 -N 2-2 -t 
00:05:00 --pty /bin/bash
SLURM_VERSION   = 17.11.9
[desantis@sccompute0 ~]$ for OPENMPI in mpi/openmpi/1.8.5 mpi/openmpi/2.0.4 
mpi/openmpi/2.0.4-psm2 mpi/openmpi/2.1.6 mpi/openmpi/3.1.6 
compilers/intel/2020_cluster_xe; do module load $OPENMPI ; which mpirun; mpirun 
hostname; module purge; echo; done
/apps/openmpi/1.8.5/bin/mpirun
sccompute0
sccompute1

/apps/openmpi/2.0.4/bin/mpirun
sccompute1
sccompute0

/apps/openmpi/2.0.4-psm2/bin/mpirun
sccompute1
sccompute0

/apps/openmpi/2.1.6/bin/mpirun
sccompute0
sccompute1

/apps/openmpi/3.1.6/bin/mpirun
sccompute0
sccompute1

/apps/intel/2020_u2/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpirun
sccompute1
sccompute0


15:58:28 Mon Apr 26 <0>
desantis@itn0
[~] $ scontrol show config|grep VERSION; srun -n 2 -N 2-2 --qos=devel 
--partition=devel -t 00:05:00 --pty /bin/bash
SLURM_VERSION   = 20.11.3
srun: job 1019599 queued and waiting for resources
srun: job 1019599 has been allocated resources
15:58:46 Mon Apr 26 <0>
desantis@mdc-1057-30-1
[~] $ for OPENMPI in mpi/openmpi/1.8.5 mpi/openmpi/2.0.4 mpi/openmpi/2.0.4-psm2 
mpi/openmpi/2.1.6 mpi/openmpi/3.1.6 compilers/intel/2020_cluster_xe; do module 
load $OPENMPI ; which mpirun; mpirun hostname; module purge; echo; done
/apps/openmpi/1.8.5/bin/mpirun
^C
/apps/openmpi/2.0.4/bin/mpirun
^C
/apps/openmpi/2.0.4-psm2/bin/mpirun
^C
/apps/openmpi/2.1.6/bin/mpirun
^C
/apps/openmpi/3.1.6/bin/mpirun
^C
/apps/intel/2020_u2/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpirun
^C[mpiexec@mdc-1057-30-1] Sending Ctrl-C to processes as requested
[mpiexec@mdc-1057-30-1] Press Ctrl-C again to force abort
^C

Our SLURM installations are fairly straight forward.  We `rpmbuild` directly from the bzip2 files 
without any additional arguments.  We've done this since we first started using SLURM with version 
14.03.3-2 and through all upgrades.  Due to SLURM's awesomeness(!), we've simply used the same 
configuration files between version changes, with the only changes being made to parameters which 
have been deprecated/renamed.  Our "Mpi{Default,Params}" have always been sent to 
"none".  The only real difference we're able to ascertain is that the MPI plugin for 
openmpi has been removed.

svc-3024-5-2: SLURM_VERSION   = 16.05.10-2
svc-3024-5-2: srun: MPI types are...
svc-3024-5-2: srun: mpi/openmpi
svc-3024-5-2: srun: mpi/mpich1_shmem
svc-3024-5-2: srun: mpi/mpichgm
svc-3024-5-2: srun: mpi/mvapich
svc-3024-5-2: srun: mpi/mpich1_p4
svc-3024-5-2: srun: mpi/lam
svc-3024-5-2: srun: mpi/none
svc-3024-5-2: srun: mpi/mpichmx
svc-3024-5-2: srun: mpi/pmi2

viking: SLURM_VERSION   = 20.11.3
viking: srun: MPI types are...
viking: srun: cray_shasta
viking: srun: pmi2
viking: srun: none

sclogin0: SLURM_VERSION   = 17.11.9
sclogin0: srun: MPI types are...
sclogin0: srun: openmpi
sclogin0: srun: none
sclogin0: srun: pmi2
sclogin0:

As far as building OpenMPI, we've always withheld any SLURM specific flags, i.e. 
"--with-slurm", although during the build process SLURM is detected.

Because OpenMPI was always built using this method, we never had to recompile OpenMPI 
after subsequent SLURM upgrades, and no cluster ready applications had to be rebuilt.  
The only time OpenMPI had to be rebuilt was due to OPA hardware which was a simple 
addition of the "--with-psm2" flag.

It is my understanding that the openmpi plugin "never really did anything" (per perusing 
the mailing list), which is why it was removed.  Furthermore, searching the mailing list suggests 
that the appropriate method is t

Re: [slurm-users] Questions about adding new nodes to Slurm

2021-04-27 Thread Paul Edmon
1. Part of the communications for slurm is hierarchical.  Thus nodes 
need to know about other nodes so they can talk to each other and 
forward messages to the slurmctld.


2. Yes, this is what we do. We have our slurm.conf shared via NFS from 
our Slurm master and then we just update that single conf. After that 
update we then use Salt to issue a global restart to all the slurmd 
daemons and slurmctld to pick up the new config. scontrol reconfigure is 
not enough when adding new nodes; you have to issue a global restart.


3. It's pretty straightforward, all told. You just need to update the 
slurm.conf and do a restart. You need to be careful that the names you 
enter into the slurm.conf are resolvable by DNS, else slurmctld may barf 
on restart. Sadly no built-in sanity checker exists that I am aware of 
aside from actually running slurmctld. We got around this by putting 
together a gitlab runner which screens our slurm.conf's by running a 
synthetic slurmctld to sanity check them.
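
As a rough sketch of both pieces (host selection, paths, and the use of pdsh 
are assumptions for illustration):

    # sanity-check a candidate config on a test host before rolling it out
    slurmctld -D -f /path/to/candidate/slurm.conf   # watch for parse errors, then Ctrl-C

    # after the shared slurm.conf is updated, restart everything
    systemctl restart slurmctld
    pdsh -a 'systemctl restart slurmd'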


-Paul Edmon-

On 4/27/2021 2:35 PM, David Henkemeyer wrote:

Hello,

I'm new to Slurm (coming from PBS), and so I will likely have a few 
questions over the next several weeks, as I work to transition my 
infrastructure from PBS to Slurm.


My first question has to do with *_adding nodes to Slurm_*.  According 
to the FAQ (and other articles I've read), you need to basically shut 
down slurm, update the slurm.conf file /*on all nodes in the 
cluster*/, then restart slurm.


- Why do all nodes need to know about all other nodes? From what I 
have read, its Slurm does a checksum comparison of the slurm.conf file 
across all nodes.  Is this the only reason all nodes need to know 
about all other nodes?
- Can I create a symlink that points /slurm.conf to a 
slurm.conf file on an NFS mount point, which is mounted on all the 
nodes? This way, I would only need to update a single file, then 
restart Slurm across the entire cluster.
- Any additional help/resources for adding/removing nodes to Slurm 
would be much appreciated.  Perhaps there is a "toolkit" out there to 
automate some of these operations (which is what I already have for 
PBS, and will create for Slurm, if something doesn't already exist).


Thank you all,

David


Re: [slurm-users] Slurm version 20.11.5 is now available

2021-03-25 Thread Paul Edmon
So just a heads up here are the two tickets I filed.  The first: 
https://bugs.schedmd.com/show_bug.cgi?id=11183  Has more details as to 
how their plugin works.  The second is the clearing house for 
improvements: https://bugs.schedmd.com/show_bug.cgi?id=11135


-Paul Edmon-

On 3/19/2021 9:25 AM, Paul Edmon wrote:
I was about to ask this as well as we use /scratch as our tmp space 
not /tmp.  I haven't kicked the tires on this to know how it works but 
after I take a look at it I will probably file a feature request to 
make the name of the tmp dir flexible.


-Paul Edmon-

On 3/19/2021 7:19 AM, Tina Friedrich wrote:
That's excellent; I've been using the 'auto_tmpdir' plugin for this; 
having that functionality within SLURM will be good.


Have a question though - we have a need to also create a per-job 
/scratch/ (on a shared fast file system) in much the same way.


I don't see a way that the current tmpfs plugin can be used to do 
that, as it would seem that it's hard-coded to mount things into 
/tmp/ (i.e. where to mount a file system cannot be changed). Or am I 
misreading this?


Tina

On 16/03/2021 22:26, Tim Wickberg wrote:
One errant backspace snuck into that announcement: the 
job_container.conf man page (with an 'r') serves as the initial 
documentation for this new job_container/tmpfs plugin. The link to 
the HTML version of the man page has been corrected in the text below:


On 3/16/21 4:16 PM, Tim Wickberg wrote:

We are pleased to announce the availability of Slurm version 20.11.5.

This includes a number of moderate severity bug fixes, alongside a 
new job_container/tmpfs plugin developed by NERSC that can be used 
to create per-job filesystem namespaces.


Initial documentation for this plugin is available at:
https://slurm.schedmd.com/job_container.conf.html
Slurm can be downloaded from https://www.schedmd.com/downloads.php .

- Tim









Re: [slurm-users] Set Fairshare by Hand

2021-03-22 Thread Paul Edmon
No, there is no way to my knowledge to do this. You can zero out 
someone's fairshare (by removing and re-adding them) or a group's 
fairshare, but you can't set it to an arbitrary value.


You can always adjust their RawShares for a somewhat similar effect but 
that will have all the normal consequences of changing their RawShares.
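
A rough sketch of both approaches with sacctmgr (user and account names are 
placeholders); note that removing and re-adding the association is what 
zeroes out the accumulated usage:

    # zero out usage by removing and re-adding the association
    sacctmgr remove user where name=jdoe account=lab
    sacctmgr add user name=jdoe account=lab fairshare=1

    # or just change the raw shares of an existing association
    sacctmgr modify user where name=jdoe account=lab set fairshare=10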


-Paul Edmon-

On 3/22/2021 5:12 AM, Michael Müller wrote:

Dear Slurm users and admins,

can we set the faireshare values manually, i.e., they are not
(re)calculated be Slurm?

With kind regards
Michael





Re: [slurm-users] Slurm version 20.11.5 is now available

2021-03-19 Thread Paul Edmon
I was about to ask this as well as we use /scratch as our tmp space not 
/tmp.  I haven't kicked the tires on this to know how it works but after 
I take a look at it I will probably file a feature request to make the 
name of the tmp dir flexible.


-Paul Edmon-

On 3/19/2021 7:19 AM, Tina Friedrich wrote:
That's excellent; I've been using the 'auto_tmpdir' plugin for this; 
having that functionality within SLURM will be good.


Have a question though - we have a need to also create a per-job 
/scratch/ (on a shared fast file system) in much the same way.


I don't see a way that the current tmpfs plugin can be used to do 
that, as it would seem that it's hard-coded to mount things into /tmp/ 
(i.e. where to mount a file system cannot be changed). Or am I 
misreading this?


Tina

On 16/03/2021 22:26, Tim Wickberg wrote:
One errant backspace snuck into that announcement: the 
job_container.conf man page (with an 'r') serves as the initial 
documentation for this new job_container/tmpfs plugin. The link to 
the HTML version of the man page has been corrected in the text below:


On 3/16/21 4:16 PM, Tim Wickberg wrote:

We are pleased to announce the availability of Slurm version 20.11.5.

This includes a number of moderate severity bug fixes, alongside a 
new job_container/tmpfs plugin developed by NERSC that can be used 
to create per-job filesystem namespaces.


Initial documentation for this plugin is available at:
https://slurm.schedmd.com/job_container.conf.html
Slurm can be downloaded from https://www.schedmd.com/downloads.php .

- Tim









Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Paul Edmon
One should keep in mind that sacct results for memory usage are not 
accurate for Out Of Memory (OoM) jobs. This is because the job is 
typically terminated before the next accounting polling period, and also 
before it reaches its full memory allocation. Thus I wouldn't trust any 
of the results with regard to memory usage if the job was terminated by 
OoM. sacct just can't pick up a sudden memory spike like that, and even 
if it did, it would not correctly record the peak memory because the job 
was terminated before that point.



-Paul Edmon-


On 3/15/2021 1:52 PM, Chin,David wrote:

Hi, all:

I'm trying to understand why a job exited with an error condition. I 
think it was actually terminated by Slurm: job was a Matlab script, 
and its output was incomplete.


Here's sacct output:

       JobID    JobName  User Partition NodeList  Elapsed      State ExitCode ReqMem   MaxRSS MaxVMSize                AllocTRES AllocGRE
------------ ---------- ----- --------- -------- -------- ---------- -------- ------ -------- --------- ------------------------ --------
       83387 ProdEmisI+  foob       def  node001 03:34:26 OUT_OF_ME+    0:125  128Gn                    billing=16,cpu=16,node=1
 83387.batch      batch                  node001 03:34:26 OUT_OF_ME+    0:125  128Gn 1617705K  7880672K      cpu=16,mem=0,node=1
83387.extern     extern                  node001 03:34:26  COMPLETED      0:0  128Gn     460K   153196K billing=16,cpu=16,node=1


Thanks in advance,
    Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu  215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode


Drexel Internal Data



Re: [slurm-users] SLURM submit policy

2021-03-10 Thread Paul Edmon
You might try looking at a partition QoS using the GrpTRESMins or 
GrpTRESRunMins: https://slurm.schedmd.com/resource_limits.html


There are a bunch of options which may do what you want.
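
For example, something like the following is a rough sketch (the QOS name and 
limit are placeholders); MaxTRESRunMinsPerUser caps the summed remaining 
cpu-minutes of a user's running jobs, which approximates, but is not identical 
to, the aggregate wall-time limit described above:

    sacctmgr add qos timecap
    sacctmgr modify qos where name=timecap set MaxTRESRunMinsPerUser=cpu=40
    # then attach it to the partition in slurm.conf:
    #   PartitionName=main ... QOS=timecap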

-Paul Edmon-

On 3/10/2021 9:13 AM, Marcel Breyer wrote:


Greetings,

we know about the SLURM configuration option *MaxSubmitJobsPerUser* to 
limit the number of jobs a user can submit at a given time.


We would like to have a similar policy that says that the total time 
for all jobs of a user cannot exceed a certain time limit.


For example (normal *MaxSubmitJobsPerUser = 2*):

srun --time 10 ...
srun --time 20 ...
srun --time 10 ... <- fails since only 2 jobs are allowed per user


However, we want something like (for a maximum aggregate time of e.g. 
40mins):


srun --time 10 ...
srun --time 20 ...
srun --time 10 ...
srun --time 5 ... <- fails since the total job times exceed 40mins


However, another allocation pattern could be:

srun --time 5 ...
srun --time 5 ...
srun --time 5 ...
srun --time 5 ...
srun --time 5 ...
srun --time 5 ...
srun --time 5 ...
srun --time 5 ...
srun --time 5 ... <- fails since the total job times exceed 40mins 
(however, after the first job completed, the new job can be submitted 
normally)



In essence we would like to have a policy using the FIFO scheduler 
(such that we don't have to specify another complex scheduler) such 
that we can guarantee that another user has the chance to get access 
to a machine after at most X time units (40mins in the example above).


With the *MaxSubmitJobsPerUser *option we would have to allow only a 
really small number of jobs (penalizing users that divide their 
computation into small sub jobs) or X would be rather large (num_jobs 
* max_wall_time).


Is there an option in SLURM  that mimics such a behavior?

With best regards,
Marcel Breyer



Re: [slurm-users] qos on partition

2021-03-09 Thread Paul Edmon
For the first, does MaxJobs not do that? For the second, you can set 
MaxJobsPerUser. That's what we do here for our test partition; we set a 
limit of 5 jobs per user running at any given time.


You can then tie the QoS to a specific partition using the QoS option 
in the partition config in slurm.conf.
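
A minimal sketch, with the QOS and partition names as placeholders:

    sacctmgr add qos queue_limits
    sacctmgr modify qos where name=queue_limits set GrpJobs=100 MaxJobsPerUser=5
    # slurm.conf:
    #   PartitionName=test ... QOS=queue_limits

GrpJobs caps the running jobs across the whole QOS (effectively the 
partition), and MaxJobsPerUser caps each individual user.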


-Paul Edmon-

On 3/9/2021 5:10 AM, LEROY Christine 208562 wrote:

Hello,

I’d like to reproduce a configuration we had with Torque on queues/partitions:
•   How do we set a maximum number of running jobs on a queue?
•   And a maximum number of running jobs per user, for all users 
(whoever the user is)?
There is a QOS in Slurm, but it seems to always be attached to a user or an account, 
not to a partition?
What would be the best thing to do here?

Thanks in advance,
Christine Leroy




Re: [slurm-users] Rate Limiting of RPC calls

2021-02-09 Thread Paul Edmon
We've hit this before several times. The tricks we've used to deal with 
this are:


1. Be on the latest release: a lot of work has gone into improving 
RPC throughput, so if you aren't running the latest 20.11 release I highly 
recommend upgrading. 20.02 was also pretty good at this.


2. max_rpc_cnt/defer: I would recommend using either of these settings 
for SchedulerParameters as it will allow the scheduler more time to breathe.


3. I would make sure that your mysql settings are set such that your DB 
is fully cached in memory and not hitting disk.  I also recommend 
running your DB on the same server as you run your ctld.  We've found 
that this can improve throughput.


4. We put a caching version of squeue in place which gives users 
near-live data rather than live data. This additional buffer layer 
helps cut down traffic. This is something we rolled in-house with a 
database that updates every 30 seconds.


5. Recommend to users that they submit jobs that last for more than 10 
minutes and that they use job arrays instead of looping sbatch (see the 
sketch at the end of this message). This will reduce thrashing.


Those are my recommendations for how to deal with this.
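
As an illustration of point 5, a minimal job array sketch (the script and 
file names are made up):

    #!/bin/bash
    #SBATCH --array=1-1000
    #SBATCH --time=00:30:00
    ./process_input input_${SLURM_ARRAY_TASK_ID}.dat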

-Paul Edmon-

On 2/9/2021 7:59 PM, Kota Tsuyuzaki wrote:

Hello guys,

In our cluster, sometimes a new incoming member accidentally creates too many 
Slurm RPC calls (sbatch, sacct, etc.), and then slurmctld,
slurmdbd, and MySQL may be overloaded.
To prevent such a situation, I'm looking for something like an RPC rate limit for 
users. Does Slurm support such a rate-limit feature?
If not, is there a way to conserve Slurm server-side resources?

Best,
Kota


露崎 浩太 (Kota Tsuyuzaki)
kota.tsuyuzaki...@hco.ntt.co.jp
NTT Software Innovation Center
Distributed Processing Platform Technology Project
0422-59-2837
-








Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Paul Edmon
That is correct.  I think NVML has some additional features but in terms 
of actually scheduling them what you have should work. They will just be 
treated as normal gres resources.


-Paul Edmon-

On 1/26/2021 3:55 PM, Ole Holm Nielsen wrote:

On 26-01-2021 21:36, Paul Edmon wrote:
You can include GPUs as GRES in Slurm without compiling 
specifically against NVML. You only really need to do that if you 
want to use the autodetection features that have been built into 
Slurm. We don't really use any of those features at our site; we 
only started building against NVML to future-proof ourselves for 
when/if those features become relevant to us.


Thanks for this clarification about not actually *requiring* the 
NVIDIA NVML library in the Slurm build!


Now I'm seeing this description in https://slurm.schedmd.com/gres.html 
about automatic GPU configuration by Slurm:


If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management 
Library (NVML) is installed on the node and was found during Slurm 
configuration, configuration details will automatically be filled in 
for any system-detected NVIDIA GPU. This removes the need to 
explicitly configure GPUs in gres.conf, though the Gres= line in 
slurm.conf is still required in order to tell slurmctld how many GRES 
to expect. 


I have defined our GPUs manually in gres.conf with File=/dev/nvidia? 
lines, so it would seem that this obviates the need for NVML.  Is this 
the correct conclusion?


/Ole


To me at least it would be nicer if there were a less hacky way of 
getting it to do that. Arguably Slurm should dynamically link 
against the libs it needs, or not, depending on the node. We hit this 
issue with Lustre/IB as well, where you have to roll a separate Slurm 
build for each type of node you have if you want these features, which 
is hardly ideal.


-Paul Edmon-

On 1/26/2021 3:24 PM, Robert Kudyba wrote:
You all might be interested in a patch to the SPEC file, to not make 
the slurm RPMs depend on libnvidia-ml.so, even if it's been enabled 
at configure time. See 
https://bugs.schedmd.com/show_bug.cgi?id=7919#c3 
<https://bugs.schedmd.com/show_bug.cgi?id=7919#c3>


On Tue, Jan 26, 2021 at 3:17 PM Paul Raines 
mailto:rai...@nmr.mgh.harvard.edu>> wrote:



    You should check your jobs that allocated GPUs and make sure
    CUDA_VISIBLE_DEVICES is being set in the environment. This is a 
sign
    you GPU support is not really there but SLURM is just doing 
"generic"

    resource assignment.

    I have both GPU and non-GPU nodes.  I build SLURM rpms twice. Once
    on a
    non-GPU node and use those RPMs to install on the non-GPU nodes.
    Then build
    again on the GPU node where CUDA is installed via the NVIDIA CUDA
    YUM repo
    rpms so the NVML lib is at /lib64/libnvidia-ml.so.1 (from rpm
    nvidia-driver-NVML-455.45.01-1.el8.x86_64) and no special mods to
    the default
    RPM SPEC is needed.  I just run

       rpmbuild --tb slurm-20.11.3.tar.bz2

    You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml'
    and see
    that /usr/lib64/slurm/gpu_nvml.so only exists on the one built 
on the

    GPU node.

    -- Paul Raines
(http://help.nmr.mgh.harvard.edu)



    On Tue, 26 Jan 2021 2:29pm, Ole Holm Nielsen wrote:

    > In another thread, On 26-01-2021 17:44, Prentice Bisbal wrote:
    >>  Personally, I think it's good that Slurm RPMs are now
    available through
    >>  EPEL, although I won't be able to use them, and I'm sure many
    people on
    >>  the list won't be able to either, since licensing issues
    prevent them from
    >>  providing support for NVIDIA drivers, so those of us with GPUs
    on our
    >>  clusters will still have to compile Slurm from source to
    include NVIDIA
    >>  GPU support.
    >
    > We're running Slurm 20.02.6 and recently added some NVIDIA GPU
    nodes.
    > The Slurm GPU documentation seems to be
    >
https://slurm.schedmd.com/gres.html

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Paul Edmon
You can include GPUs as GRES in Slurm without compiling specifically 
against NVML. You only really need to do that if you want to use the 
autodetection features that have been built into Slurm. We don't 
really use any of those features at our site; we only started building 
against NVML to future-proof ourselves for when/if those features become 
relevant to us.


To me at least it would be nicer if there were a less hacky way of 
getting it to do that. Arguably Slurm should dynamically link against 
the libs it needs, or not, depending on the node. We hit this issue with 
Lustre/IB as well, where you have to roll a separate Slurm build for each 
type of node you have if you want these features, which is hardly ideal.


-Paul Edmon-

On 1/26/2021 3:24 PM, Robert Kudyba wrote:
You all might be interested in a patch to the SPEC file, to not make 
the slurm RPMs depend on libnvidia-ml.so, even if it's been enabled at 
configure time. See https://bugs.schedmd.com/show_bug.cgi?id=7919#c3 
<https://bugs.schedmd.com/show_bug.cgi?id=7919#c3>


On Tue, Jan 26, 2021 at 3:17 PM Paul Raines 
mailto:rai...@nmr.mgh.harvard.edu>> wrote:



You should check your jobs that allocated GPUs and make sure
CUDA_VISIBLE_DEVICES is being set in the environment.  This is a sign
you GPU support is not really there but SLURM is just doing "generic"
resource assignment.

I have both GPU and non-GPU nodes.  I build SLURM rpms twice. Once
on a
non-GPU node and use those RPMs to install on the non-GPU nodes.
Then build
again on the GPU node where CUDA is installed via the NVIDIA CUDA
YUM repo
rpms so the NVML lib is at /lib64/libnvidia-ml.so.1 (from rpm
nvidia-driver-NVML-455.45.01-1.el8.x86_64) and no special mods to
the default
RPM SPEC is needed.  I just run

   rpmbuild --tb slurm-20.11.3.tar.bz2

You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml'
and see
that /usr/lib64/slurm/gpu_nvml.so only exists on the one built on the
GPU node.

-- Paul Raines

(http://help.nmr.mgh.harvard.edu)



On Tue, 26 Jan 2021 2:29pm, Ole Holm Nielsen wrote:

> In another thread, On 26-01-2021 17:44, Prentice Bisbal wrote:
>>  Personally, I think it's good that Slurm RPMs are now
available through
>>  EPEL, although I won't be able to use them, and I'm sure many
people on
>>  the list won't be able to either, since licensing issues
prevent them from
>>  providing support for NVIDIA drivers, so those of us with GPUs
on our
>>  clusters will still have to compile Slurm from source to
include NVIDIA
>>  GPU support.
>
> We're running Slurm 20.02.6 and recently added some NVIDIA GPU
nodes.
> The Slurm GPU documentation seems to be
>

https://slurm.schedmd.com/gres.html

> We don't seem to have any problems scheduling jobs on GPUs, even
though our
> Slurm RPM build host doesn't have any NVIDIA software installed,
as shown by
> the command:
> $ ldconfig -p | grep libnvidia-ml
>
> I'm curious about Prentice's statement about needing NVIDIA
libraries to be
> installed when building Slurm RPMs, and I read the discussion in
bug 9525,
>

https://bugs.schedmd.com/show_bug.cgi?id=9525

> from which it seems that the problem was fix

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Paul Edmon
In the RPM spec we use to build Slurm, we do the following additional 
things for GPUs:


BuildRequires: cuda-nvml-devel-11-1

Then in the %build section we do:

export CFLAGS="$CFLAGS 
-L/usr/local/cuda-11.1/targets/x86_64-linux/lib/stubs/ 
-I/usr/local/cuda-11.1/targets/x86_64-linux/include/"


That ensures the cuda libs are installed and it directs slurm to where 
they are.  After that configure should detect the nvml libs and link 
against them.


I've attached our full spec that we use to build.

-Paul Edmon-

On 1/26/2021 2:29 PM, Ole Holm Nielsen wrote:

In another thread, On 26-01-2021 17:44, Prentice Bisbal wrote:
Personally, I think it's good that Slurm RPMs are now available 
through EPEL, although I won't be able to use them, and I'm sure many 
people on the list won't be able to either, since licensing issues 
prevent them from providing support for NVIDIA drivers, so those of 
us with GPUs on our clusters will still have to compile Slurm from 
source to include NVIDIA GPU support.


We're running Slurm 20.02.6 and recently added some NVIDIA GPU nodes.
The Slurm GPU documentation seems to be
https://slurm.schedmd.com/gres.html
We don't seem to have any problems scheduling jobs on GPUs, even 
though our Slurm RPM build host doesn't have any NVIDIA software 
installed, as shown by the command:

$ ldconfig -p | grep libnvidia-ml

I'm curious about Prentice's statement about needing NVIDIA libraries 
to be installed when building Slurm RPMs, and I read the discussion in 
bug 9525,

https://bugs.schedmd.com/show_bug.cgi?id=9525
from which it seems that the problem was fixed in 20.02.6 and 20.11.

Question: Is there anything special that needs to be done when 
building Slurm RPMs with NVIDIA GPU support?


Thanks,
Ole

Name:   slurm
Version:20.11.3
%define rel 1
Release:%{rel}fasrc01%{?dist}
Summary:Slurm Workload Manager

Group:  System Environment/Base
License:GPLv2+
URL:https://slurm.schedmd.com/

# when the rel number is one, the directory name does not include it
%if "%{rel}" == "1"
%global slurm_source_dir %{name}-%{version}
%else
%global slurm_source_dir %{name}-%{version}-%{rel}
%endif

Source: %{slurm_source_dir}.tar.bz2

# build options .rpmmacros options  change to default action
#   
# --prefix  %_prefix path   install path for commands, 
libraries, etc.
# --with cray   %_with_cray 1   build for a Cray Aries system
# --with cray_network   %_with_cray_network 1   build for a non-Cray system 
with a Cray network
# --with cray_shasta%_with_cray_shasta 1build for a Cray Shasta system
# --with slurmrestd %_with_slurmrestd 1 build slurmrestd
# --with slurmsmwd  %_with_slurmsmwd 1  build slurmsmwd
# --without debug   %_without_debug 1   don't compile with debugging 
symbols
# --with hdf5   %_with_hdf5 pathrequire hdf5 support
# --with hwloc  %_with_hwloc 1  require hwloc support
# --with lua%_with_lua path build Slurm lua bindings
# --with mysql  %_with_mysql 1  require mysql/mariadb support
# --with numa   %_with_numa 1   require NUMA support
# --without pam %_without_pam 1 don't require pam-devel RPM to 
be installed
# --without x11 %_without_x11 1 disable internal X11 support
# --with ucx%_with_ucx path require ucx support
# --with pmix   %_with_pmix pathrequire pmix support

#  Options that are off by default (enable with --with )
%bcond_with cray
%bcond_with cray_network
%bcond_with cray_shasta
%bcond_with slurmrestd
%bcond_with slurmsmwd
%bcond_with multiple_slurmd
%bcond_with ucx

# These options are only here to force there to be these on the build.
# If they are not set they will still be compiled if the packages exist.
%bcond_with hwloc
%bcond_with mysql
%bcond_with hdf5
%bcond_with lua
%bcond_with numa
%bcond_with pmix

# Use debug by default on all systems
%bcond_without debug

# Options enabled by default
%bcond_without pam
%bcond_without x11

# Disable hardened builds. -z,now or -z,relro breaks the plugin stack
%undefine _hardened_build
%global _hardened_cflags "-Wl,-z,lazy"
%global _hardened_ldflags "-Wl,-z,lazy"

Requires: munge

%{?systemd_requires}
BuildRequires: systemd
BuildRequires: munge-devel munge-libs
BuildRequires: python3
BuildRequires: readline-devel
Obsoletes: slurm-lua slurm-munge slurm-plugins

# fake systemd support when building rpms on other platforms
%{!?_unitdir: %global _unitdir /lib/systemd/systemd}

%define use_mysql_devel %(perl -e '`rpm -q mariadb-devel`; print $?;')

%if %{with mysql}
%if %{use_mysql_devel}
BuildRequires: mysql-devel >= 5.0.0
%else
BuildRequires: mariadb-devel >= 5.0.0
%endif
%endif

%if %{with cray}
Buil

Re: [slurm-users] Slurm Upgrade Philosophy?

2020-12-24 Thread Paul Edmon
We are the same way, though we tend to keep pace with minor releases. 
We typically wait until the .1 release of a new major version before 
considering an upgrade, so that many of the bugs are worked out. We then 
have a test cluster that we install the release on and run a few test jobs 
to make sure things are working, usually MPI jobs as they tend to hit 
most of the features of the scheduler.


We also like to stay current with releases as there are new features we 
want, or features we didn't know we wanted but our users find and start 
using.  So our general methodology is to upgrade to the latest minor 
release at our next monthly maintenance.  For major releases we will 
upgrade at our next monthly maintenance after the .1 release is out 
unless there is a show-stopping bug that we run into in our own 
testing, at which point we file a bug with SchedMD and get a patch.


-Paul Edmon-

On 12/24/2020 1:57 AM, Chris Samuel wrote:

On Friday, 18 December 2020 10:10:19 AM PST Jason Simms wrote:


Thanks to several helpful members on this list, I think I have a much better
handle on how to upgrade Slurm. Now my question is, do most of you upgrade
with each major release?

We do, though not immediately and not without a degree of testing on our test
systems.  One of the big reasons for us upgrading is that we've usually paid
for features in Slurm for our needs (for example in 20.11 that includes
scrontab so users won't be tied to favourite login nodes, as well as  the
experimental RPC queue code due to the large numbers of RPCs our systems need
to cope with).

I also keep an eye out for discussions of what other sites find with new
releases too, so I'm following the current concerns about 20.11 and the change
in behaviour for job steps that do (expanding NVIDIA's example slightly):

#SBATCH --exclusive
#SBATCH -N2
srun --ntasks-per-node=1 python multi_node_launch.py

which (if I'm reading the bugs correctly) fails in 20.11 as that srun no
longer gets all the allocated resources, instead just gets the default of
--cpus-per-task=1 instead, which also affects things like mpirun in OpenMPI
built with Slurm support (as it effectively calls "srun orted" and that "orted"
launches the MPI ranks, so in 20.11 it only has access to a single core for
them all to fight over).  Again - if I'm interpreting the bugs correctly!

I don't currently have a test system that's free to try 20.11 on, but
hopefully early in the new year I'll be able to test this out to see how much
of an impact this is going to have and how we will manage it.

https://bugs.schedmd.com/show_bug.cgi?id=10383
https://bugs.schedmd.com/show_bug.cgi?id=10489

All the best,
Chris




Re: [slurm-users] getting fairshare

2020-12-16 Thread Paul Edmon
You can use the -o option to select which fields you want it to print. 
The last column is the FairShare score. The equation is part of the 
Slurm documentation: https://slurm.schedmd.com/priority_multifactor.html



If you are using the Classic Fairshare you can look at our 
documentation: https://docs.rc.fas.harvard.edu/kb/fairshare/
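
For example:

    sshare -a -o Account,User,RawShares,NormShares,RawUsage,EffectvUsage,FairShare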



-Paul Edmon-


On 12/16/2020 12:30 PM, Erik Bryer wrote:

$ sshare -a
             Account       User  RawShares  NormShares  RawUsage 
 EffectvUsage  FairShare
 -- -- --- --- 
- --
root                                          0.00     158     
 1.00
 root                      root          1    0.25       0     
 0.00   1.00
 borrowed                                1    0.25     157     
 0.994905
  borrowed               ebryer          6    0.020979     157     
 1.00   0.08
  borrowed            napierski          7    0.024476       0     
 0.00   0.33
  borrowed           sagatest01        259    0.905594       0     
 0.00   0.33
  borrowed           sagatest02         14    0.048951       0     
 0.00   0.33
 gaia                                    1    0.25       0     
 0.005095
  gaia                   ebryer          3    0.272727       0     
 1.00   0.416667
  gaia                 napiersk          2    0.181818       0     
 0.00   0.67
  gaia               sagatest01          1    0.090909       0     
 0.00   0.67
  gaia               sagatest02          5    0.454545       0     
 0.00   0.67
 saral                                   1    0.25       0     
 0.00
  saral                  ebryer         20    0.869565       0     
 0.00   1.00
  saral               napierski          1    0.043478       0     
 0.00   1.00
  saral              sagatest01          2    0.086957       0     
 0.00   1.00


Is there a way to take output from sshare and get FairShare? I'm 
looking for a simple equation or some indication why that's not 
possible. I've read everything I can find on this topic.


Thanks,
Erik


Re: [slurm-users] Query for minimum memory required in partition

2020-12-16 Thread Paul Edmon

We do this here using the job_submit.lua script.   Here is an example:

    if part == "bigmem" then
        if (job_desc.pn_min_memory ~= 0) then
            if (job_desc.pn_min_memory < 19 or job_desc.pn_min_memory > 2147483646) then
                slurm.log_user("You must request more than 190GB for jobs in bigmem partition")
                return 2052
            end
        end
    end

-Paul Edmon-

On 12/16/2020 11:06 AM, Sistemas NLHPC wrote:

Hello

Good afternoon. I have a query: currently in our cluster we have 
different partitions:

1 partition called slims with 48 GB of RAM
1 partition called general with 192 GB of RAM
1 partition called largemem with 768 GB of RAM.

Is it possible to restrict access to the largemem partition so that 
jobs are accepted only if they reserve a minimum of 193 GB, either in 
slurm.conf or by another method? This is because we have users who use 
the largemem partition while reserving less than 192 GB.


Thanks for help.
--

Mirko Pizarro  Pizarro mailto:mpiza...@nlhpc.cl>
Ingeniero de Sistemas
National Laboratory for High Performance Computing (NLHPC)
www.nlhpc.cl <http://www.nlhpc.cl/>

CMM - Centro de Modelamiento Matemático
Facultad de Ciencias Físicas y Matemáticas (FCFM)
Universidad de Chile

Beauchef 851
Edificio Norte - Piso 6, of. 601
Santiago – Chile
tel +56 2 2978 4603


Re: [slurm-users] Novice Slurm Upgrade Questions

2020-12-04 Thread Paul Edmon
It won't figure it out automatically, no. You will need to ensure that 
the spec is installing to the same location where your vendor installed 
it, if they didn't put it in the default location (/opt isn't the default).
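
A rough sketch of pointing the build at that prefix (the version and the 
/opt/slurm path are examples; check them against the vendor's spec):

    rpmbuild -ta slurm-20.11.0.tar.bz2 --define '_prefix /opt/slurm'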


-Paul Edmon-

On 12/4/2020 3:39 PM, Jason Simms wrote:

Dear Ole,

Thanks. I've read through your docs many times. The relevant upgrade 
section begins with the assumption that you have properly configured 
RPMs, so all I'm trying to do is ensure I get to that point. As I 
noted, a vendor installed Slurm initially through a 
proprietary script, though they did base it off of created RPMs. I've 
reached out to them to see whether they used a modified slurm.spec 
file, which I suspect they did, given that Slurm is installed in 
/opt/slurm (which seems like a modified prefix, if nothing else).


The fundamental question is, if I am performing a yum update, and I 
don't adjust any settings in the default slurm.spec file, will it 
upgrade everything properly where they currently "live," or will it 
install new files in standard locations? It's a question of whether 
"yum update" is "smart enough" to figure out what was done before and 
go with that, or whether I must specify all relevant information in 
the slurm.spec file each time? Based on Paul's reply, it seems we do 
need an updated slurm.spec file that reflects our environment, each 
time we upgrade.


Jason

On Fri, Dec 4, 2020 at 3:13 PM Ole Holm Nielsen 
mailto:ole.h.niel...@fysik.dtu.dk>> wrote:


Hi Jason,

Slurm upgrading should be pretty simple, IMHO.  I've been through
this
multiple times, and my Slurm Wiki has detailed upgrade documentation:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
<https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm>

Building RPMs is described in this page as well:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms
<https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms>

I hope this helps.

/Ole


On 04-12-2020 20:36, Jason Simms wrote:
> Thank you for being such a helpful resource for All Things Slurm; I
> sincerely appreciate the helpful feedback. Right now, we are
running
> 20.02 and considering upgrading to 20.11 during our next
maintenance
> window in January. This will be the first time we have upgraded
Slurm,
> so understandably we are somewhat nervous and have some questions.
>
> I am able to download the source and build RPMs successfully.
What is
> unclear to me is whether I have to adjust anything in the
slurm.spec
> file or use a .rpmmacros file to control certain aspects of the
> installation. Since this would be an upgrade, rather than a new
install,
> do I have to adjust, e.g., the --prefix value, and all other
settings
> (X11 support, etc.)? Or, will a yum update "correctly" put the
files
> where they are on my system, using settings from the existing
20.02 version?
>
> We purchased the system from a vendor, and of course they use
custom
> scripts to build and install Slurm, and those are tailored for an
> initial installation, not an upgrade. Their advice to us was, don't
> upgrade if you don't need to, which seems reasonable, except
that many
> of you respond to initial requests for help by recommending an
upgrade.
> And in any case, Slurm doesn't upgrade nicely from more than two
major
> versions back, so I'm hesitant to go too long without patching.
>
> I'm terribly sorry for my ignorance of all this. But I really
lament how
> terrible most resources are about all this. They assume that you
have
> built the RPMs already, without offering any real guidance as to
how to
> adjust relevant options, or even whether that is a requirement
for an
> upgrade vs. a fresh installation.
>
> Any guidance would be most welcome.





--
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632


Re: [slurm-users] Novice Slurm Upgrade Questions

2020-12-04 Thread Paul Edmon
Usually the slurm.spec file provided doesn't change that much between 
versions.  What we do here is that we maintain a git repository of our 
slurm.spec that we use with our modifications. Then each time Slurm is 
released we compare ours against what is provided, and simply modify the 
provided one with our changes.


Unless you make specific tweaks to the slurm.spec, you should be able to 
just use it out of the box no problem.  As always read the changelog to 
see if there are any major changes between the versions in case a 
feature you were using was deprecated.  This can happen during major 
version upgrades.


At least in my experience, if you follow the directions in the Slurm 
documentation regarding upgrades, you should be fine. The only real 
hitch is that by default the RPMs do restart the slurmdbd and slurmctld 
services, which you don't want when upgrading. You should either neuter 
this or have both of those stopped during the upgrade. After the upgrade 
you should run slurmdbd and slurmctld in command-line mode for the 
initial run. Once each is done and running normally you can kill it and 
restart the relevant service.
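
A rough sketch of that sequence (the package handling line is just an 
example; adapt to however you deploy the rebuilt RPMs):

    systemctl stop slurmctld slurmdbd
    yum update ./slurm-*.rpm          # or rpm -Uvh, or your usual deployment tool
    slurmdbd -D -vvv                  # foreground; wait for any DB conversion, then Ctrl-C
    systemctl start slurmdbd
    slurmctld -D -vvv                 # foreground; confirm it settles, then Ctrl-C
    systemctl start slurmctld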


-Paul Edmon-

On 12/4/2020 2:36 PM, Jason Simms wrote:

Hello all,

Thank you for being such a helpful resource for All Things Slurm; I 
sincerely appreciate the helpful feedback. Right now, we are running 
20.02 and considering upgrading to 20.11 during our next maintenance 
window in January. This will be the first time we have upgraded Slurm, 
so understandably we are somewhat nervous and have some questions.


I am able to download the source and build RPMs successfully. What is 
unclear to me is whether I have to adjust anything in the slurm.spec 
file or use a .rpmmacros file to control certain aspects of the 
installation. Since this would be an upgrade, rather than a new 
install, do I have to adjust, e.g., the --prefix value, and all other 
settings (X11 support, etc.)? Or, will a yum update "correctly" put 
the files where they are on my system, using settings from the 
existing 20.02 version?


We purchased the system from a vendor, and of course they use custom 
scripts to build and install Slurm, and those are tailored for an 
initial installation, not an upgrade. Their advice to us was, don't 
upgrade if you don't need to, which seems reasonable, except that many 
of you respond to initial requests for help by recommending an 
upgrade. And in any case, Slurm doesn't upgrade nicely from more than 
two major versions back, so I'm hesitant to go too long without patching.


I'm terribly sorry for my ignorance of all this. But I really lament 
how terrible most resources are about all this. They assume that you 
have built the RPMs already, without offering any real guidance as to 
how to adjust relevant options, or even whether that is a requirement 
for an upgrade vs. a fresh installation.


Any guidance would be most welcome.

Warmest regards,
Jason

--
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632


Re: [slurm-users] FairShare

2020-12-02 Thread Paul Edmon

Yup, our doc is for the classic fairshare not for fairtree.

Thanks for the kudos on the doc by the way.  We are glad it is useful.

-Paul Edmon-

On 12/2/2020 12:45 PM, Ryan Cox wrote:

That is not for Fair Tree, which is what Micheal asked about.

Ryan

On 12/2/20 10:32 AM, Renfro, Michael wrote:


Yesterday, I posted https://docs.rc.fas.harvard.edu/kb/fairshare/ in 
response to a similar question. If you want the simplest general 
explanation for FairShare values, it's that they range from 0.0 to 
1.0, values above 0.5 indicate that account or user has used less 
than their share of the resource, and values below 0.5 indicate that 
that account or user has used more than their share of the resource.


Since all your users have the same RawShares value and are entitled 
to the same share of the resource, you can see that bdehaven has the 
most RawUsage and the lowest FairShare value, followed by ajoel and 
xtsao with almost identical RawUsage and FairShare, and finally 
ahantau with very little usage and the highest FairShare value.


We use FairShare here as the dominant factor in priorities for queued 
jobs: if you're a light user, we bump up your priority over heavier 
users, and your job starts quicker than those for heavier users, 
assuming all other job attributes are equal.


All these values are relative: in our setup, we'd bump ahantau's 
pending jobs ahead of the others, and put bdehaven's at the end. But 
if root needed to run a job outside the sray account, they'd get an 
enormous bump ahead since the sray account has used far more than its 
fair share of the resource.


From: slurm-users 
Date: Wednesday, December 2, 2020 at 11:23 AM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] FairShare





I've read the manual and I re-read the other link. What they boil 
down to is Fair Share is calculated based on a recondite "rooted 
plane tree", which I do not have the background in discrete math to 
understand.


I'm hoping someone can explain it so my little kernel can understand.



From: slurm-users  on behalf 
of Micheal Krombopulous 

Sent: Wednesday, December 2, 2020 9:32 AM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] FairShare

Can someone tell me how to calculate fairshare (under fairtree)? I 
can't figure it out. I would have thought it would be the same score 
for all users in an account. E.g., here is one of my accounts:


Account User  RawShares  NormShares    RawUsage NormUsage 
 EffectvUsage    LevelFS  FairShare
 -- -- --- --- 
--- - -- --

root  0.00      611349                  1.00
 root                      root             1  0.076923           0   
 0.00      0.00  inf   1.00
 sray                                  1  0.076923     
 30921 0.505582      0.505582   0.152147
  sray                 phedge            1    0.05           0   
 0.00      0.00        inf 0.181818
  sray                raab          1  0.05           0   
 0.00      0.00  inf   0.181818
  sray                benequist          1    0.05           0   
 0.00      0.00        inf 0.181818
  sray                 bosch           1  0.05           0   
 0.00      0.00  inf   0.181818
  sray                rjenkins         1  0.05           0   
 0.00      0.00  inf   0.181818
  sray                  esmith            1  0.05           0   
 0.00      0.00 1.7226e+07   0.054545
  sray                  gheinz            1  0.05           0   
 0.00      0.00 1.9074e+14   0.072727
  sray                  jfitz         1  0.05           0 
   0.00      0.00 8.0640e+20   0.081818
  sray                   ajoel          1  0.05       42449   
 0.069465      0.137396 0.363913   0.018182
  sray                  jmay           1  0.05           0   
 0.00      0.00  inf   0.181818
  sray                 aferrier            1  0.05           0   
 0.00      0.00  inf   0.181818
  sray                bdehaven         1    0.05    225002   
 0.367771      0.727420   0.068736

Re: [slurm-users] job restart :: how to find the reason

2020-12-02 Thread Paul Edmon
You can dig through the slurmctld log and search for the JobID. That 
should tell you what Slurm was doing at the time.
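
Something like the following (the job id and log path are placeholders):

    grep 'JobId=12345' /var/log/slurm/slurmctld.log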


-Paul Edmon-

On 12/2/2020 6:27 AM, Adrian Sevcenco wrote:

Hi! I encountered a situation when a bunch of jobs were restarted
and this is seen from Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 
ExitCode=0:0


So, i would like to know, how i can i find why there is a Requeue
(when there is only one partition defined) and why there is a restart ..

Thanks a lot!!!
Adrian





Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Paul Edmon
That can help. Usually this happens because laggy storage that the job 
is using takes time flushing the job's data. So making sure that your 
storage is up, responsive, and stable will also cut these down.
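
If you do want to raise the timeout or get notified before a node drains, a 
minimal slurm.conf sketch (the value and the script path are placeholders):

    UnkillableStepTimeout=180
    UnkillableStepProgram=/usr/local/sbin/unkillable_report.sh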


-Paul Edmon-

On 11/30/2020 12:52 PM, Robert Kudyba wrote:
I've seen where this was a bug that was fixed 
https://bugs.schedmd.com/show_bug.cgi?id=3941 
<https://bugs.schedmd.com/show_bug.cgi?id=3941> but this happens 
occasionally still. A user cancels his/her job and a node gets 
drained. UnkillableStepTimeout=120 is set in slurm.conf


Slurm 20.02.3 on Centos 7.9 running on Bright Cluster 8.2

Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED, 
ExitCode 0

Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
update_node: node node001 reason set to: Kill task failed
update_node: node node001 state set to DRAINING
error: slurmd error running JobId=6908 on node(s)=node001: Kill task 
failed


update_node: node node001 reason set to: hung
update_node: node node001 state set to DOWN
update_node: node node001 state set to IDLE
error: Nodes node001 not responding

scontrol show config | grep kill
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 120 sec

Do we just increase the timeout value?



