Re: [slurm-users] Adding an association to a different account

2023-12-19 Thread Michael Gutteridge
Ah, you are creating a *new* association: sacctmgr create user cseraphine account=test1 HTH - Michael On Tue, Dec 19, 2023, 13:47 Chip Seraphine wrote: > TL,DR: How do you associate an existing user with a second account? > > I have a user who has a default account, and I want to give them
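Sketched out, the suggested command looks like this; the user and account names (`cseraphine`, `test1`) are the examples from the thread, and the `show` step is an optional sanity check:

```shell
# Create a second association for an existing user; the user keeps
# their default account and gains access to "test1".
sacctmgr create user cseraphine account=test1

# Optional: confirm the user now has both associations.
sacctmgr show assoc user=cseraphine format=cluster,account,user
```

Jobs can then be charged to the non-default account with `sbatch --account=test1 ...`.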

Re: [slurm-users] SlurmcltdHost confusion

2023-12-13 Thread Michael Gutteridge
I'll apologize because I don't have a complete answer. I'm not sure why that doesn't work, but my understanding of how it should work for failover scenarios is a "SlurmctldHost" line for each of the controllers, e.g.: SlurmctldHost=host1 SlurmctldHost=host2 ... The list format seems to be used
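In slurm.conf the failover layout described would look roughly like this (hostnames are placeholders); the first SlurmctldHost entry is the primary, later entries are backups, and all controllers need access to the same state directory:

```
SlurmctldHost=host1
SlurmctldHost=host2
# Both controllers must share this directory (NFS, glusterfs, etc.)
StateSaveLocation=/shared/slurm/state
```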

Re: [slurm-users] --partition requests ignored in scripts

2023-11-09 Thread Michael Gutteridge
The position of the #SBATCH directives also matters (emphasis mine): > The batch script may contain options preceded with "#SBATCH" *before any executable commands* in the script. We've been bit by that a couple times- a stray command before any #SBATCH lines will cause any of those directives
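A minimal sketch of a correct script (partition name is illustrative); moving the `echo` above the `#SBATCH` lines would cause Slurm to silently ignore every directive below it:

```shell
#!/bin/bash
#SBATCH --partition=campus
#SBATCH --time=01:00:00
# Directives above, commands below: any executable line placed before
# the #SBATCH lines ends option parsing.
echo "running on $(hostname)"
```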

[slurm-users] Glusterfs hints for state database

2023-09-07 Thread Michael Gutteridge
We've settled on the idea of using a glusterfs file system for rolling out an HA Slurm controller. Over the last year we've averaged 88,000 job submissions per day, though it's usually lower than that (10-20K). Disk activity on the existing state database seems to be maxing out around 40-50 io/s

Re: [slurm-users] What is the minimal configuration for a compute node

2023-08-24 Thread Michael Gutteridge
Hi By "minimal config" I'm assuming you mean "just enough config to get the slurmd to run". As far as I'm aware, you really need to have a complete and matching config on each of your daemons- like slurmd literally won't start with differing configs. There is the "NO_CONF_HASH" debug flag to
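For reference, the debug flag mentioned goes in slurm.conf; it only suppresses the configuration-hash check between daemons, it does not make mismatched configs safe:

```
DebugFlags=NO_CONF_HASH
```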

Re: [slurm-users] slurmdbd database usage

2023-08-02 Thread Michael Gutteridge
Pretty sure that dog won't hunt. There's not _just_ the tables, but I believe there's a bunch of queries and other database magic in slurmdbd that is specific to MySQL/MariaDB. - Michael On Wed, Aug 2, 2023 at 2:33 PM Sandor wrote: > I am looking to track accounting and job data. Slurm

Re: [slurm-users] Job in "priority" status - resources available

2023-08-02 Thread Michael Gutteridge
I'm not sure there's enough information in your message- Slurm version and configs are often necessary to make a more confident diagnosis. However, the behaviour you are looking for (lower priority jobs skipping the line) is called "backfill". There's docs here:

Re: [slurm-users] AccountingStorageLoc option has been removed and fatal error

2023-06-13 Thread Michael Gutteridge
Hi I couldn't find an announcement anywhere, but filetxt looks to have been removed in version 20.11 (see here and here).

Re: [slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5

2023-04-21 Thread Michael Gutteridge
Does this link help? > Debian and derivatives (e.g. Ubuntu) usually exclude the memory and > memsw (swap) cgroups by default. To include them, add the following > parameters to the kernel
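The kernel parameters the quoted passage goes on to list are, per the Debian/Ubuntu note in the Slurm documentation, set via GRUB; a sketch (run `update-grub` and reboot afterwards):

```
# /etc/default/grub
GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"
```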

Re: [slurm-users] Job preempts entire host instead of single job

2023-01-17 Thread Michael Gutteridge
Hi I believe this is how the preemption algorithm works- it selects the entire node's resources: > For performance reasons, the backfill scheduler reserves whole nodes for jobs, not partial nodes. - https://slurm.schedmd.com/preempt.html#limitations However, that does specifically call out

Re: [slurm-users] Upgrade from 20.11.0 to Slurm version 22.05.6 ?

2022-11-10 Thread Michael Gutteridge
Theoretically I think you should be able to. Slurm should upgrade from the previous two releases (see this) and I think that

Re: [slurm-users] sreport question when specifying partitions=

2021-11-10 Thread Michael Gutteridge
My read of the sreport manpage on our currently installed version (21.08) is that the "partitions" condition is only available for job reports, not cluster reports. The description of that condition is in "OPTIONS SPECIFICALLY FOR JOB REPORTS". For example: sreport job SizesByAccount user=me
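A sketch of the job-report form, which does accept the `partitions=` condition (user, partition, and dates are placeholders):

```shell
# Job-size report restricted to one partition.
sreport job SizesByAccount user=me partitions=largenode \
    start=2021-10-01 end=2021-11-01
```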

Re: [slurm-users] Question about adding and removing features in Slurm

2021-06-21 Thread Michael Gutteridge
I believe at the end of the day you do need to edit slurm.conf. There is a similar capability in Slurm with scontrol(1) where you can set "availablefeatures" and "activefeatures": sudo scontrol update nodename=node16 availablefeatures=feature1,feature2,feature3 I'm not sure how that
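The scontrol form sketched out (node and feature names are illustrative); note that changes made with scontrol generally do not survive a slurmctld restart unless they are also reflected in slurm.conf:

```shell
# Declare which features the node *can* have...
sudo scontrol update nodename=node16 availablefeatures=feature1,feature2,feature3
# ...then mark the subset that is currently active.
sudo scontrol update nodename=node16 activefeatures=feature1
```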

Re: [slurm-users] SLURM slurmctld error on Ubuntu20.04 starting through systemctl

2021-03-18 Thread Michael Gutteridge
I would also encourage you to use defaults in the slurm.conf (matching what's shipped in the Ubuntu packages). However, here is what I've done to use non-Ubuntu-package paths for the PID files. Create an override in /etc/systemd/system/slurmd.service.d/override.conf with something like:
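A sketch of such an override, assuming the packaged unit uses `PIDFile` and that your slurm.conf points the PID file at a non-default path; run `systemctl daemon-reload` after creating it:

```
# /etc/systemd/system/slurmd.service.d/override.conf
[Service]
# Clear the packaged value, then set the site-specific path.
PIDFile=
PIDFile=/var/run/slurm/slurmd.pid
```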

Re: [slurm-users] Unsetting a QOS Flag?

2021-02-08 Thread Michael Gutteridge
I believe you want "-=" to do that: sacctmgr modify qos foo set flags-=denyonlimit It doesn't seem to be explicitly documented, but some of the other sacctmgr options use that format. - Michael On Mon, Feb 8, 2021 at 1:39 PM Chin,David wrote: > Hello all: > > I have a QOS defined which
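Sketched out (the QOS name `foo` and the flag come from the thread):

```shell
# Remove one flag from the QOS without disturbing any others.
sacctmgr modify qos foo set flags-=DenyOnLimit
# Verify the flag is gone.
sacctmgr show qos foo format=name,flags
```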

Re: [slurm-users] Cluster nodes on multiple cluster networks

2021-01-22 Thread Michael Gutteridge
I don't believe the IP address is required- if you can configure a DNS/hosts entry differently for cloud nodes you can set: SlurmctldHost=controllername Then have "controllername" resolve to the controller's private IP for the on-prem cluster and to its public IP for the nodes in the cloud.

Re: [slurm-users] Moving Slurmctld and slurmdbd to a new host

2021-01-16 Thread Michael Gutteridge
I'd confirm that as well. The state directory has all of that information. We just upgraded from 18.05 to 20.02 on a different host and while the cluster was quiet (we had a maintenance reservation in place) there were running jobs which survived the upgrade. I think the big thing to watch out

Re: [slurm-users] Parent account in AllowAccounts

2021-01-15 Thread Michael Gutteridge
I've only ever seen the parent-child account relationship discussed in the context of usage and fairshare. I think for the allow/deny controls you have to specify each account individually. I did find this enhancement request: https://bugs.schedmd.com/show_bug.cgi?id=1398 which would support

Re: [slurm-users] Limit usage outside reservation

2020-10-20 Thread Michael Gutteridge
I'm unaware of any mechanism to do this on a per-user basis. The partition configuration does include the parameter "ReqResv" ( https://slurm.schedmd.com/slurm.conf.html): ReqResv Specifies users of this partition are required to designate a reservation when submitting a job. This option can be
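As a partition-level setting, ReqResv would look like this in slurm.conf (partition and node names are placeholders):

```
PartitionName=restricted Nodes=node[01-10] ReqResv=YES State=UP
```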

Re: [slurm-users] How do I add a library for the linker in Makefile.in

2020-01-31 Thread Michael Gutteridge
With the caveat that I haven't built these plugins past Slurm 18, these are job submit plugins, and that the documentation is weak, you could look at these plugins I'd written for our cluster: https://github.com/FredHutch/gizmo-plugins Contains two plugins I build in the source tree. These set

Re: [slurm-users] Problem with sbatch

2019-07-08 Thread Michael Gutteridge
Hi I can't find the reference here, but if I recall correctly the preferred user for slurmd is actually root. It is the default. > I assume this can be fixed by modifying the configuration so "SlurmdUser=root", but does this imply that anything run with `srun` will be actually executed by root?

Re: [slurm-users] Effect of PriorityMaxAge on job throughput

2019-04-16 Thread Michael Gutteridge
[...]slurm.conf just in case you or anyone else wants to take a more complete overview. > Best regards, > David > *From:* slurm-users on behalf of Michael Gutteridge > *Sent:* 09 April 2019 18:59 > *To:* Slurm User Community List > *Subject:* Re: [slurm-users] Effect of PriorityMaxAge on job throughput

Re: [slurm-users] Effect of PriorityMaxAge on job throughput

2019-04-09 Thread Michael Gutteridge
It might be useful to include the various priority factors you've got configured. The fact that adjusting PriorityMaxAge had a dramatic effect suggests that the age factor is pretty high- might be worth looking at that value relative to the other factors. Have you looked at

Re: [slurm-users] Priority access for a group of users

2019-03-06 Thread Michael Gutteridge
It is likely that your job still does not have enough priority to preempt the scavenge job. Have a look at the output of `sprio` to see the priority of those jobs and what factors are in play. It may be necessary to increase the partition priority or adjust some of the job priority factors to

Re: [slurm-users] How to enable QOS correctly?

2019-03-05 Thread Michael Gutteridge
Hi It might be useful to see the configuration of the partition and how the QOS is set up... but at first blush I suspect you may need to set OverPartQOS (https://slurm.schedmd.com/resource_limits.html) to get the QOS limit to take precedence over the limit in the partition. However, the

Re: [slurm-users] Priority access for a group of users

2019-03-01 Thread Michael Gutteridge
Along those lines, there is the slurm.conf setting for _JobRequeue_ which controls the default behavior for jobs' ability to be re-queued. - Michael On Fri, Mar 1, 2019 at 7:07 AM Thomas M. Payerle wrote: > My understanding is that with PreemptMode=requeue, the running scavenger > job
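For reference, the slurm.conf setting is a simple toggle; 1 (the default) makes jobs requeueable by default, while users can still override per job with --requeue / --no-requeue:

```
# slurm.conf: jobs are not requeueable unless submitted with --requeue.
JobRequeue=0
```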

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-28 Thread Michael Gutteridge
[...]problem. Best Michael On Thu, Feb 28, 2019 at 7:54 AM Chris Samuel wrote: > On 28/2/19 7:29 am, Michael Gutteridge wrote: > > 2221670 largenode sleeper. me PD N/A 1 > > (null) (AssocGrpCpuLimit) > That says the job e

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-28 Thread Michael Gutteridge
to debug those reservations. Thanks M On Wed, Feb 27, 2019 at 10:22 PM Chris Samuel wrote: > On Wednesday, 27 February 2019 1:08:56 PM PST Michael Gutteridge wrote: > > > Yes, we do have time limits set on partitions- 7 days maximum, 3 days > > default. In this ca

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-28 Thread Michael Gutteridge
[...]"idle" nodes were powered down by > Slurm power saving stuff. Can you manually force one of the powered-down > nodes to power up, and see if the large job gets assigned to it? > Is it possible Slurm is not able to power up the nodes? > > On Wed, Feb 27, 2019 at 4:45 PM Michael

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Michael Gutteridge
> maybe it tells some nodes to spin up, but by time they spin up it already > assigned the previously up and idle nodes to the smaller job. > > On Wed, Feb 27, 2019 at 3:33 PM Michael Gutteridge < > michael.gutteri...@gmail.com> wrote: > >> I've run into a problem with

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Michael Gutteridge
[...]job that requires infinite time. > > Andy > > ------ > *From:* Michael Gutteridge > > *Sent:* Wednesday, February 27, 2019 3:29PM > *To:* Slurm User Community List > > *Cc:* > *Subject:* [slurm-users] Large job starvation on cloud cluster

[slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Michael Gutteridge
I've run into a problem with a cluster we've got in a cloud provider- hoping someone might have some advice. The problem is that I've got a circumstance where large jobs _never_ start... or more correctly, that large-er jobs don't start when there are many smaller jobs in the partition. In this

Re: [slurm-users] Different slurm.conf for master and nodes

2019-02-27 Thread Michael Gutteridge
Hi I don't know what version of Slurm you're using or how it may be different from the one I'm using (18.05), but here's my understanding of memory limits and what I'm seeing on our cluster. The parameter `JobAcctGatherParams=OverMemoryKill` controls whether a step is killed if it goes over the
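The settings discussed would look like this in slurm.conf (a sketch; check the options against your installed version's manpage):

```
# Poll-based accounting; OverMemoryKill kills steps that exceed their
# requested memory based on sampled usage, not cgroup enforcement.
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=OverMemoryKill
```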

Re: [slurm-users] Fwd: confirm e3c10e8d4f2f35ab689c7a4a88e5e2b57931da79

2019-01-29 Thread Michael Gutteridge
I got one of these as well- I chalked it up to the overnight gmail outage. Followed the link and confirmed on the subsequent page and everything seems right... - Michael On Tue, Jan 29, 2019 at 1:45 PM Lachlan Musicman wrote: > I got this email to my gmail account? I don't understand why my

[slurm-users] Configuring partition limit MaxCPUsPerNode

2018-11-26 Thread Michael Gutteridge
I'm either misunderstanding how to configure the limit "MaxCPUsPerNode" or how it behaves. My desired end-state is that if a user submits a job to a partition that requests more resources (CPUs) than available on any node in that partition, the job will be immediately rejected, rather than
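For reference, MaxCPUsPerNode is a partition parameter, e.g. (names and values illustrative):

```
PartitionName=small Nodes=node[01-04] MaxCPUsPerNode=8
```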

Re: [slurm-users] $TMPDIR does not honor "TmpFS"

2018-11-21 Thread Michael Gutteridge
I don't think that's a bug. As far as I've ever known, TmpFS is only used to tell slurmd where to look for available space (reported as TmpDisk for the node). The manpage only indicates that, not any additional functionality. We set TMPDIR in a task prolog: #!/bin/bash echo "export
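A sketch completing the truncated prolog above; the scratch root `/loc/scratch` is an assumption, and slurmd applies any `export VAR=value` lines a TaskProlog prints to the task's environment:

```shell
#!/bin/bash
# Hypothetical TaskProlog (configured via TaskProlog= in slurm.conf).
# Lines printed as "export VAR=value" are injected into the job's env.
scratch_root="/loc/scratch"
echo "export TMPDIR=${scratch_root}/${SLURM_JOB_ID:-unknown}"
```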

Re: [slurm-users] Documentation for creating a login node for a SLURM cluster

2018-10-12 Thread Michael Gutteridge
I'm unaware of specific docs, but I tend to think of these simply as daemon nodes that aren't listed in slurm.conf. We use Ubuntu and the packages we install are munge, slurm-wlm, and slurm-client (which drags in libslurmXX and slurm-wlm-basic-plugins). Then the setup is very similar to slurmd

Re: [slurm-users] how to easily to obtain jobid for array jobs?

2018-10-11 Thread Michael Gutteridge
There is also the SQUEUE_FORMAT environment variable. Set that in the appropriate place (/etc/profile and such) to '%i %A (and whatever other output you like)' and you should be good to go. - Michael On Thu, Oct 11, 2018 at 4:00 AM Loris Bennett wrote: > Hi Daan, > > Daan van Rossum writes:
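A sketch for a shell profile; the format string here is illustrative (`%i` is the job id including the array task, `%A` the array master job id):

```shell
# e.g. in /etc/profile.d/squeue_format.sh
export SQUEUE_FORMAT='%i %A %u %t %M %j'
```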

Re: [slurm-users] Position in queue?

2018-10-08 Thread Michael Gutteridge
`squeue` has some output options which may do the trick for you. `-o %Q` shows the priority and you can use `--sort` to sort by priority. I have $SQUEUE_FORMAT set: %.15i %.15A %.9u %.8a %.9P %8q %18j %.2t %.10M %.6D %4C %4c %R %Q puts the priority as the last column. I believe default
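A sketch of the one-off form (check the sort letters against your squeue manpage; `%Q` prints the integer priority):

```shell
# Pending jobs, highest priority first; a job's rough queue position
# is its row number in this listing.
squeue --state=PENDING --format='%.15i %.9u %.9P %Q' --sort=-Q
```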

Re: [slurm-users] Power save doesn't start nodes

2018-07-18 Thread Michael Gutteridge
is: >> >> [root@cloud01 ~]# scontrol update nodename=cloud01 state=down >> reason=stuck >> [root@cloud01 ~]# scontrol update nodename=cloud01 state=idle >> [root@cloud01 ~]# scontrol update nodename=cloud01 state=power_down >> [root@cloud01 ~]# scontrol update nodename=

[slurm-users] Power save doesn't start nodes

2018-07-17 Thread Michael Gutteridge
Hi I'm running a cluster in a cloud provider and have run up against an odd problem with power save. I've got several hundred nodes that Slurm won't power up even though they appear idle and in the powered-down state. I suspect that they are in a "not-so-idle" state: `scontrol` for all of the