Dear SchedMD,
Just verifying what might be no more than a missed doc update: is
CR_Core_Memory definitely implemented in 2.6.0? The cons_res.html that
comes with it says it isn't, but it seems to work when activated.
Regards
Jeff
Dr. Jeff Tan
High Performance Computing Specialist
IBM
tried
it just now on an x86 cluster and it also went off straight away.
Perhaps some extra debugging might reveal why the 2.6.5 was holding the
jobs back?
Regards
Jeff
Jeff Tan
High Performance Computing Specialist
IBM Research Collaboratory for Life Sciences, Melbourne, Australia
Phone: +61 3
D-oh. Chris and I sent the same query within a couple of minutes of one
another. Please ignore this one and respond to Chris Samuel's instead (
Guidance on planning a slurmdbd outage).
Regards
Jeff Tan
High Performance Computing Specialist
IBM Research Collaboratory for Life Sciences
are
missing. Any suggestions would be appreciated.
Regards
Jeff
--
Jeff Tan
High Performance Computing Specialist
IBM Research Collaboratory for Life Sciences, Melbourne, Australia
My mistake! I was looking at the wrong figure/column:
I have logs where the reported
d_cpu matches the total number of CPU-seconds for an hour during the
hourly
rollup, but sometimes the number is higher and sometimes lower.
No, d_cpu did not quite match the total number of CPU-seconds on
in a script or (as I did above) just on the command line? Also,
there are two places where cluster names are defined: slurm.conf and via
sacctmgr. Perhaps one or the other has the spelling wrong?
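A rough sketch of how the two definitions could be compared, in case it helps. The default path /etc/slurm/slurm.conf and the helper names conf_cluster_name and name_registered are my own assumptions for illustration, not anything from your setup. The comparison ignores case because, as far as I know, Slurm maps upper-case letters in the cluster name to lower case in the accounting database.

```shell
#!/usr/bin/env bash
# Sketch only: compare ClusterName in slurm.conf with the clusters known to sacctmgr.
# The helper names and the default config path below are assumptions.

# Print the ClusterName value from a slurm.conf-style file ($1).
# Commented-out lines are skipped; a trailing "# comment" is stripped.
conf_cluster_name() {
    awk -F= 'tolower($1) ~ /^[ \t]*clustername[ \t]*$/ {
        sub(/#.*/, "", $2); gsub(/[ \t]/, "", $2); print $2; exit
    }' "$1"
}

# Succeed if name $1 appears (whole line, ignoring case) in the list on stdin.
name_registered() {
    grep -qxiF -- "$1"
}

conf="${SLURM_CONF:-/etc/slurm/slurm.conf}"   # assumed default location
if command -v sacctmgr >/dev/null 2>&1 && [ -r "$conf" ]; then
    name=$(conf_cluster_name "$conf")
    # -n drops the header, -P gives unpadded parsable output
    if sacctmgr -n -P show cluster format=Cluster | name_registered "$name"; then
        echo "ClusterName '$name' is registered in the accounting database"
    else
        echo "ClusterName '$name' was not found via sacctmgr"
    fi
fi
```

If the second message appears, the spelling in one of the two places is the first thing I would check.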
Regards
Jeff
Jeff Tan
High Performance Computing Specialist
IBM Research Collaboratory for Life Sciences
received the complaint in the rollup only when
the allocation was at 93.75% and higher, and never for hours when
allocation was lower.
Regards
Jeff
Jeff Tan
High Performance Computing Specialist
IBM Research Collaboratory for Life Sciences, Melbourne, Australia
Jeff Tan/Australia/IBM wrote on 09/09
Hi Brian
Glad to have helped with the errors, but I'm not sure what you mean
regarding sshare. What does the output look like when you run the command?
Regards
Jeff
Jeff Tan
High Performance Computing Specialist
IBM Research Collaboratory for Life Sciences, Melbourne, Australia
From
Hi Markus
Just to clarify:
"When switching to user '71187' and executing a 'scancel' to a job from
user '70032' (similar to the last entry of the output above), it is
impossible to get a job cancelled."
So is user 71187 able to cancel the job submitted by user 70032 with
scancel or not?
normally wait for all resources to be lined up
before it starts the job otherwise.
Also, as I understand it, -m (--distribution) does not change Slurm's
behavior to line up all the CPUs required in total before starting the job
(unless it's a job array).
Regards
Jeff
--
Jeff Tan
Infra
supports this if you'd like.
Regards
Jeff
--
Jeff Tan
Infrastructure Services & Technologies
IBM Research - Australia
From: "Koziol, Lucas" <lucas.koz...@exxonmobil.com>
To: "slurm-dev" <slurm-dev@schedmd.com>
Date: 05/01/2017 03:26
Subject:
Hi Mahmood
> [root@cluster ~]# ps aux | grep slurmdb
> root 3406 0.0 0.0 338636 2672 ? Sl 00:26 0:01 /usr/sbin/slurmdbd
> root 17146 0.0 0.0 105308 888 pts/2 S+ 13:26 0:00 grep slurmdb
That's good. What does its /var/log/slurm/slurmdbd.log say? Any errors?
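Something along these lines would pull out the most recent complaints, as a minimal sketch. The function name recent_log_errors and the choice of "error" and "fatal" as the patterns to look for are my assumptions; adjust the path if your slurmdbd.conf sets a different LogFile.

```shell
#!/usr/bin/env bash
# Sketch only: show the last few lines of a log that mention an error or fatal condition.
# recent_log_errors is a hypothetical helper name, not a Slurm tool.

# $1: log file to scan, $2: maximum number of lines to print (default 20)
recent_log_errors() {
    grep -iE 'error|fatal' -- "$1" | tail -n "${2:-20}"
}

log=/var/log/slurm/slurmdbd.log
if [ -r "$log" ]; then
    recent_log_errors "$log"
fi
```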
>
Hi Mahmood
> [root@cluster ~]# sacctmgr -i create cluster Rocks-Cluster
> sacctmgr: error: slurmdbd: Sending DbdInit msg: Unable to connect to database
> sacctmgr: error: Problem talking to the database: Unable to connect to database
You need to narrow that down. If you're using sacctmgr, you