Re: [slurm-users] Slurm and MPICH don't play well together (salloc)

2021-12-28 Thread Antony Cleave
Hi, I've not used MPICH for years but I think I see the problem. By asking for 24 CPUs per task and specifying 2 tasks, you are asking Slurm to allocate 48 CPUs per node. Your nodes have 24 CPUs in total, so you don't have any nodes that can service this request. Try asking for 24 tasks. I've only
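A minimal sketch of the suggested request (the binary name and the use of srun to launch the MPICH program are illustrative assumptions):
salloc --ntasks=24
srun ./my_mpich_app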

Re: [slurm-users] Upgrading slurm - can I do it while jobs running?

2021-05-26 Thread Antony Cleave
Short answer: yes. It's not risk-free, but as long as you increase all the timeouts to your worst-case estimate x4, make sure you understand the upgrades section of this link https://slurm.schedmd.com/quickstart_admin.html and keep it open for reference, you should be fine. Antony On Wed, 26 May
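For illustration only, a slurm.conf sketch of the kind of timeout increases meant (the values are placeholder worst-case estimates, not recommendations):
SlurmctldTimeout=3600
SlurmdTimeout=3600
MessageTimeout=60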

Re: [slurm-users] Slurm User Group Meeting (SLUG'20) Agenda Posted

2020-09-14 Thread Antony Cleave
I think that "Cloud and Stuff" is more "fluffy" than vague. On Mon, 14 Sep 2020 at 15:38, Simon Flood wrote: > Can you provide a short description for each session to give an idea what > will be covered as some of the titles are a bit vague (i.e. "Cloud and > stuff"). > > Thanks, > Simon >

Re: [slurm-users] Drain a single user's jobs

2020-04-01 Thread Antony Cleave
Why not just sacctmgr modify user foo set maxjobs=0? Existing running jobs will run to completion and pending jobs won't start. Antony On Wed, 1 Apr 2020 at 10:57, Mark Dixon wrote: > Hi all, > > I'm a slurm newbie who has inherited a working slurm 16.05.10 cluster. > > I'd like to stop user
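A quick sketch for reference (the user name is illustrative, and setting the limit to -1 is assumed to clear it again later):
sacctmgr modify user foo set maxjobs=0
# once the user's running work has drained:
sacctmgr modify user foo set maxjobs=-1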

Re: [slurm-users] Longer queuing times for larger jobs

2020-02-05 Thread Antony Cleave
Hi, from what you are describing it sounds like jobs are backfilling in front and stopping the large jobs from starting. You probably need to tweak your backfill window in SchedulerParameters in slurm.conf; see here: *bf_window=#* The number of minutes into the future to look when considering jobs
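A minimal slurm.conf sketch (the values are illustrative; bf_window should generally be at least as long as your longest job time limit):
SchedulerParameters=bf_window=2880,bf_resolution=300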

[slurm-users] Is this a bug in slurm array completion logic or expected behaviour

2020-01-30 Thread Antony Cleave
Hi, I want to run an epilogctld after all parts of an array job have completed, in order to clean up an on-demand filesystem created in the prologctld. First I thought I could just run the epilog after the completion of the final job step, until I realised that they might not
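One possible sketch of the clean-up check, assuming EpilogSlurmctld receives SLURM_ARRAY_JOB_ID and ignoring the race around the task that triggered the epilog (the clean-up script name is hypothetical):
#!/bin/bash
# only clean up when no other tasks of this array are still pending or running
if [ -n "$SLURM_ARRAY_JOB_ID" ]; then
    remaining=$(squeue -h -t pending,running -j "$SLURM_ARRAY_JOB_ID" | wc -l)
    if [ "$remaining" -eq 0 ]; then
        /usr/local/sbin/cleanup_ondemand_fs "$SLURM_ARRAY_JOB_ID"
    fi
fi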

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-16 Thread Antony Cleave
Just a quick thought. What is your slurm.conf setting for this? *JobAcctGatherType* is operating system dependent and controls what mechanism is used to collect accounting information. Supported values are *jobacct_gather/linux* (recommended), *jobacct_gather/cgroup* and *jobacct_gather/none*
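For reference, a minimal slurm.conf sketch (the polling interval is illustrative):
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30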

Re: [slurm-users] How to apply for multiple GPU cards from different worker nodes?

2019-04-15 Thread Antony Cleave
Ask for 8 GPUs on 2 nodes instead. In your script just change the 16 to 8 and it should do what you want. You are currently asking for 2 nodes with 16 GPUs each, as GRES resources are per node. Antony On Mon, 15 Apr 2019, 09:08 Ran Du, wrote: > Dear all, > > Does anyone know how to set
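A sketch of the amended directives (other batch options omitted; the per-node count assumes 8-GPU nodes):
#SBATCH --nodes=2
#SBATCH --gres=gpu:8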

Re: [slurm-users] Priority access for a group of users

2019-03-01 Thread Antony Cleave
I have always assumed that cancel just kills the job, whereas requeue will cancel and then start from the beginning. I know that requeue does this; I never tried cancel. I'm a fan of the suspend mode myself, but that is dependent on users not asking for all the RAM by default. If you can educate
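For context, a hedged slurm.conf sketch of suspend-based preemption (suspension is assumed to require gang scheduling alongside it):
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG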

Re: [slurm-users] Fairshare - root user

2019-02-27 Thread Antony Cleave
I think if you increase the share of mygroup to something like 999, then the share that the root user gets will drop by a factor of 1000. I'm pretty sure I've seen this before and that's how I fixed it. Antony On Wed, 27 Feb 2019 at 13:47, Will Dennis wrote: > Looking at output of 'sshare', I see: >
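A sketch of the command implied (the account name is illustrative):
sacctmgr modify account mygroup set fairshare=999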

Re: [slurm-users] Slurmd not starting

2019-02-13 Thread Antony Cleave
There is a very strong likelihood that you have configured SlurmdUser=slurm and one of the following: 1) there is no /var/spool/slurmd folder, or 2) the /var/spool/slurmd folder exists but is owned by root. Make sure it exists and is owned by whatever SlurmdUser is set to, or change your
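A quick sketch of the fix (assuming SlurmdUser=slurm and the default spool path):
mkdir -p /var/spool/slurmd
chown slurm:slurm /var/spool/slurmd
systemctl restart slurmd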

Re: [slurm-users] New Bright Cluster Slurm issue for AD users

2019-02-13 Thread Antony Cleave
lookups: > – They can simply be rebooted to pick up the updated configuration, along > with the new software image. – Alternatively, to avoid a reboot, the > imageupdate command (section 5.6.2) can be run to pick up the new software > image from a provisioner. > > On Wed, 13

Re: [slurm-users] New Bright Cluster Slurm issue for AD users

2019-02-13 Thread Antony Cleave
; how to integrate these. > > Thanks, > Yugi > > On Feb 13, 2019, at 7:27 AM, Antony Cleave > wrote: > > can you ssh to the compute node that job was trying to run on as as the AD > user in question? > > I've seen similar issues on AD integrated systems where some nodes b

Re: [slurm-users] New Bright Cluster Slurm issue for AD users

2019-02-13 Thread Antony Cleave
Can you ssh to the compute node that the job was trying to run on as the AD user in question? I've seen similar issues on AD-integrated systems where some nodes boot from a different image that has not yet been joined to the domain. Antony On Wed, 13 Feb 2019 at 04:58, Yugendra Guvvala <

Re: [slurm-users] Federated Clusters

2019-02-12 Thread Antony Cleave
You will need to be able to connect both clusters to the same SlurmDBD as well, but if that is not a problem you are good to go. Antony On Tue, 12 Feb 2019 at 11:37, Gestió Servidors wrote: > Hi, > > I would like to know if "federated clusters in SLURM" concept allows > connecting two SLURM
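A hedged sketch of what that looks like (host, cluster, and federation names are illustrative). In each cluster's slurm.conf:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd.example.com
Then, from either cluster:
sacctmgr add federation myfed clusters=clusterA,clusterB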

Re: [slurm-users] NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ???

2019-02-08 Thread Antony Cleave
If you want Slurm to just ignore the difference between physical and logical cores, you can change SelectTypeParameters=CR_Core to SelectTypeParameters=CR_CPU. It will then treat threads as CPUs and let you start the number of tasks you expect. Antony On Thu, 7 Feb 2019 at
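For reference, the change in slurm.conf (the SelectType line is shown only for context and may differ on your system):
SelectType=select/cons_res
SelectTypeParameters=CR_CPU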

[slurm-users] "We have more time than is possible" in slurmdbd.log with no runaway jobs

2019-02-06 Thread Antony Cleave
Hi all, I'm seeing this after some hours of MySQL downtime yesterday to correct something else, but I didn't notice these errors until after I had performed the Slurm update to 18.08, which went through fine in spite of these errors. Firstly, when restarting the slurmdbd before I started the update

Re: [slurm-users] Slurm missing non primary group memberships

2018-11-13 Thread Antony Cleave
Are you sure this isn't working as designed? I remember there is something annoying about groups in the manual. Here it is. This is why I prefer accounts. *NOTE:* For performance reasons, Slurm maintains a list of user IDs allowed to use each partition and this is checked at job submission
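A sketch of the account-based alternative being preferred (partition, node, and account names are illustrative):
PartitionName=batch Nodes=node[01-10] AllowAccounts=projecta,projectb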

Re: [slurm-users] Accounting: set default account with no access

2018-11-07 Thread Antony Cleave
Try adding a default account and then set a limit of 0 jobs on it. From memory I think it is GrpJobs, the maximum number of jobs this account can have running. This requires limits to be enforced in AccountingStorageEnforce. Or you could simply add the account to the DenyAccounts list for
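A hedged sketch of both options (account and partition names are illustrative; GrpJobs caps running jobs, while GrpSubmitJobs=0 is assumed to block submission outright):
sacctmgr modify account holding set grpjobs=0 grpsubmitjobs=0
and in slurm.conf:
AccountingStorageEnforce=limits
PartitionName=batch Nodes=node[01-10] DenyAccounts=holding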

[slurm-users] changing PriorityDecayHalfLife has no impact on stored accounting data

2018-10-16 Thread Antony Cleave
Hi all, yes, I realise this is almost certainly the intended outcome. I have wondered about this for a long time but only recently got round to testing it on a safe system. The process is simple: run a lot of jobs, let decay take effect, change the setting, restart the dbd and ctld, run another job with debug2 on

Re: [slurm-users] slurmdbd not showing job accounting

2018-10-14 Thread Antony Cleave
I have noticed on several clusters that sreport can be up to one hour out of date, i.e. it will update on the hour, every hour. sacct does not behave this way and is always up to date. I cannot see this stated in the docs or see any config settings to control this, but it happens on the last 17.02

Re: [slurm-users] Power save doesn't start nodes

2018-07-18 Thread Antony Cleave
I've not seen the IDLE* issue before, but when my nodes got stuck I've always been able to fix them with this: [root@cloud01 ~]# scontrol update nodename=cloud01 state=down reason=stuck [root@cloud01 ~]# scontrol update nodename=cloud01 state=idle [root@cloud01 ~]# scontrol update nodename=cloud01