Hi
I've not used mpich for years but I think I see the problem. By asking for
24 CPUs per task and specifying 2 tasks you are asking slurm to allocate 48
CPUs per node.
Your nodes have 24 CPUs in total, so you don't have any nodes that can
service this request.
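For illustration, I'd guess your script has something like this in it (the exact directives are an assumption on my part, adjust to whatever you actually use):
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=24
# 2 tasks x 24 CPUs per task = 48 CPUs requested on one node, but each node only has 24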
Try asking for 24 tasks. I've only
Short answer yes
It's not risk-free, but as long as you increase all the timeouts to your
worst-case estimate x4 and make sure you understand the upgrades section of
this link
https://slurm.schedmd.com/quickstart_admin.html
and keep it open for reference, you should be fine.
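The timeouts I mean are the usual ones in slurm.conf; a rough sketch, with placeholder values that you should scale to your own worst case:
SlurmctldTimeout=300
SlurmdTimeout=300
MessageTimeout=60
# bump these well above your worst-case outage estimate before you start the upgrade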
Antony
On Wed, 26 May
I think that "Cloud and Stuff" is more "fluffy" than vague
On Mon, 14 Sep 2020 at 15:38, Simon Flood wrote:
> Can you provide a short description for each session to give an idea what
> will be covered as some of the titles are a bit vague (i.e. "Cloud and
> stuff").
>
> Thanks,
> Simon
>
Why not just run: sacctmgr modify user foo set maxjobs=0
Existing running jobs will run to completion and pending jobs won't start.
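From memory you can lift the limit again afterwards with something like this (worth double-checking against the sacctmgr man page):
sacctmgr modify user foo set maxjobs=-1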
Antony
On Wed, 1 Apr 2020 at 10:57, Mark Dixon wrote:
> Hi all,
>
> I'm a slurm newbie who has inherited a working slurm 16.05.10 cluster.
>
> I'd like to stop user
Hi, from what you are describing it sounds like smaller jobs are backfilling in
front of your large jobs and stopping them from starting.
You probably need to tweak your backfill window in SchedulerParameters in
slurm.conf; see here:
bf_window=# The number of minutes into the future to look when considering
jobs
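As a sketch, in slurm.conf that would look something like this; the value is just an example and should be at least as long as your longest job's time limit, in minutes:
SchedulerParameters=bf_window=5760
# 5760 minutes = 4 days, i.e. enough to plan around a 4-day maximum walltime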
Hi
I want to run an EpilogSlurmctld after all parts of an array job have completed,
in order to clean up an on-demand filesystem created in the PrologSlurmctld.
First I thought I could just run the epilog after the
completion of the final job step until I realised that they might not
Just a quick thought.
What is your slurm.conf setting for this?
JobAcctGatherType is operating system dependent and controls what
mechanism is used to collect accounting information. Supported values are
jobacct_gather/linux (recommended), jobacct_gather/cgroup and
jobacct_gather/none.
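A quick way to check what it is currently set to, assuming scontrol can talk to your controller:
scontrol show config | grep JobAcctGatherType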
Ask for 8 GPUs on 2 nodes instead.
In your script just change the 16 to 8 and it should do what you want.
You are currently asking for 2 nodes with 16 GPUs each, as GRES resources are
requested per node.
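In other words, assuming your script looks roughly like this (I'm guessing at the exact directives):
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
# --gres is applied per node, so 2 nodes x 8 GPUs = 16 GPUs in total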
Antony
On Mon, 15 Apr 2019, 09:08 Ran Du, wrote:
> Dear all,
>
> Does anyone know how to set
I have always assumed that cancel just kills the job whereas requeue will
cancel and then start from the beginning. I know that requeue does this. I
never tried cancel.
I'm a fan of the suspend mode myself, but that is dependent on users not
asking for all the RAM by default. If you can educate
I think if you increase the share of mygroup to something like 999, then the
share that the root user gets will drop by a factor of 1000.
Pretty sure I've seen this before and that's how I fixed it.
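For reference, that would be something along these lines, using your account name:
sacctmgr modify account mygroup set fairshare=999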
Antony
On Wed, 27 Feb 2019 at 13:47, Will Dennis wrote:
> Looking at output of 'sshare', I see:
>
There is a very strong likelihood that you have configured
SlurmdUser=slurm and one of the following
1) there is no /var/spool/slurmd folder
2) the /var/spool/slurmd folder exists but is owned by root
make sure it exists and is owned by whatever SlurmdUser is set to
or change your
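Assuming SlurmdUser really is slurm, something like this on each affected node should sort the folder out:
mkdir -p /var/spool/slurmd
chown slurm:slurm /var/spool/slurmd
# then restart slurmd on that node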
lookups:
> – They can simply be rebooted to pick up the updated configuration, along
> with the new software image. – Alternatively, to avoid a reboot, the
> imageupdate command (section 5.6.2) can be run to pick up the new software
> image from a provisioner.
>
> On Wed, 13
; how to integrate these.
>
> Thanks,
> Yugi
>
> On Feb 13, 2019, at 7:27 AM, Antony Cleave
> wrote:
>
> can you ssh to the compute node that the job was trying to run on as the AD
> user in question?
>
> I've seen similar issues on AD integrated systems where some nodes b
Can you ssh to the compute node that the job was trying to run on as the AD
user in question?
I've seen similar issues on AD-integrated systems where some nodes boot
from a different image that has not yet been joined to the domain.
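A quick check on the suspect node, with made-up node and user names:
ssh node001
id someaduser
# if the AD user can't be resolved on that node, it probably isn't joined to the domain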
Antony
On Wed, 13 Feb 2019 at 04:58, Yugendra Guvvala <
You will need to be able to connect both clusters to the same SlurmDBD as
well, but if that is not a problem you are good to go.
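Roughly, and with made-up names, both clusters point their slurm.conf at the same slurmdbd and then you create the federation with sacctmgr:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbhost
sacctmgr add federation myfed clusters=clusterA,clusterB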
Antony
On Tue, 12 Feb 2019 at 11:37, Gestió Servidors
wrote:
> Hi,
>
> I would like to know if "federated clusters in SLURM" concept allows
> connecting two SLURM
If you want Slurm to just ignore the difference between physical and
logical cores then you can change
SelectTypeParameters=CR_Core
to
SelectTypeParameters=CR_CPU
and then it will treat threads as CPUs and let you start the
number of tasks you expect.
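For context, take a made-up node line in slurm.conf such as:
NodeName=node[01-04] Sockets=2 CoresPerSocket=12 ThreadsPerCore=2
With CR_Core you can start at most 24 tasks per node (one per core); with CR_CPU each thread counts as a CPU, so you can start 48.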
Antony
On Thu, 7 Feb 2019 at
Hi All
I'm seeing this after some hours of MySQL downtime yesterday to correct
something else, but I didn't notice these errors until after I had
performed the Slurm update to 18.08, which went through fine in spite of
these errors.
Firstly, when restarting the slurmdbd before I started the update
Are you sure this isn't working as designed?
I remember there is something annoying about groups in the manual. Here it
is. This is why I prefer accounts.
NOTE: For performance reasons, Slurm maintains a list of user IDs allowed
to use each partition and this is checked at job submission
Try adding a default account and then set a limit of 0 jobs on it.
From memory I think it is GrpJobs.
This is the maximum number of jobs this account can have running at any one time.
This requires limits to be enforced via AccountingStorageEnforce.
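As a rough sketch, with a made-up account name:
sacctmgr modify account newusers set GrpJobs=0
and in slurm.conf:
AccountingStorageEnforce=limits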
Or you could simply add the account to the DenyAccounts list for
Hi All
Yes, I realise this is almost certainly the intended outcome. I have
wondered this for a long time but only recently got round to testing it on
a safe system.
The process is simple: run a lot of jobs
let decay take effect
change the setting
restart dbd and ctld
run another job with debug2 on
I have noticed on several clusters that sreport can be up to one hour out of
date i.e. it will update on the hour every hour.
sacct does not behave this way and is always up to date.
I cannot see this stated in the docs or see any config settings to control
this, but it happens on the latest 17.02
I've not seen the IDLE* issue before but when my nodes got stuck I've
always been able to fix them with this:
[root@cloud01 ~]# scontrol update nodename=cloud01 state=down reason=stuck
[root@cloud01 ~]# scontrol update nodename=cloud01 state=idle
[root@cloud01 ~]# scontrol update nodename=cloud01