Also found my answer for the weight values here:
https://slurm.schedmd.com/priority_multifactor.html#fairshare
IMPORTANT: The weight values should be high enough to get a good set of
significant digits since all the factors are floating point numbers from 0.0 to
1.0. For example, one job could have
Fixed the issue with TRESBillingWeights. It seems I need to set PartitionName
for it to work:
https://bugs.schedmd.com/show_bug.cgi?id=3753
PartitionName=DEFAULT TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"
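For what it's worth, the billing value Slurm derives from those weights can be sketched with a little arithmetic. This assumes the default behaviour of summing the weighted TRES (with PriorityFlags=MAX_TRES it would take the largest term instead); the job sizes are hypothetical:

```shell
# Hypothetical job: 4 CPUs, 16 GB memory, 1 GPU, billed under
# TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0".
cpus=4; mem_gb=16; gpus=1
billing=$(awk -v c="$cpus" -v m="$mem_gb" -v g="$gpus" \
  'BEGIN { printf "%.2f", c*1.0 + m*0.25 + g*2.0 }')
echo "billing=$billing"   # 4*1.0 + 16*0.25 + 1*2.0 = 10.00
```

This also illustrates the note above about significant digits: with small weights, very different jobs can collapse to nearly the same billing value.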
From: slurm-users On Behalf Of Yap, Mike
Sent: Wednesday, 7 April 2021 9:57
Thanks Luke. I will go through the two commands (and try to digest them).
Wondering if you're able to advise on
TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0". I tried to include it in
slurm.conf, but Slurm fails to start.
Also wondering if anyone can advise on the fairshare value. I recall read
Hi,
I'm very new to Slurm and trying to understand the basic concepts. One of them
is the "Multifactor Priority Plugin". For this I submitted some jobs and
looked at the sshare output. To my surprise I don't get any numbers for
"RawUsage"; regardless of what I do, RawUsage stays 0 (same in "scontrol
show as
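In case it helps: RawUsage only accumulates when job accounting is actually being recorded, so the first thing to check is that the priority and accounting plugins are wired up. A minimal sketch of the slurm.conf settings involved (the host name is hypothetical; check your own setup):

```
PriorityType=priority/multifactor
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbhost        # hypothetical slurmdbd host
PriorityDecayHalfLife=7-0           # usage decays with a 7-day half-life
```

If AccountingStorageType is unset, or slurmdbd is unreachable, sshare has nothing to draw on and RawUsage will stay at 0.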
I have updated the "pestat" tool for printing Slurm nodes status with 1
line per node including job info. The download page is
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
(also listed in https://slurm.schedmd.com/download.html).
The pestat tool can print a large variety of
Sajesh,
For other users who may have run into this: I found a reason why srun
cannot run interactive jobs, and it may not necessarily be related to
RHEL/CentOS 7.
If one straces slurmd, one may see (note arg 3, the gid):
chown("/dev/pts/1", 1326, 7) = -1 EPERM (Operation not permitted)
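To illustrate what to look at, the gid in question can be pulled straight out of that strace line (a throwaway sketch; the line is the one quoted above):

```shell
# Extract the gid from the failing chown() call captured by strace.
line='chown("/dev/pts/1", 1326, 7) = -1 EPERM (Operation not permitted)'
gid=$(echo "$line" | sed -E 's/.*chown\("[^"]*", [0-9]+, ([0-9]+)\).*/\1/')
echo "slurmd tried to chown the pty to gid $gid"
# If that gid is wrong for the job's user (e.g. a stale primary group
# coming from your directory service), the chown fails with EPERM and
# srun cannot set up the interactive pty.
```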
I just checked my cluster and my spool dir is
SlurmdSpoolDir=/var/spool/slurm
(i.e., without the "d" at the end).
It doesn't really matter, as long as the directory exists and has the
correct permissions on all nodes.
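A quick way to verify that on each node (a sketch; run it via whatever remote-shell tool you use, e.g. pdsh or clush, as the user slurmd runs as):

```shell
# check_spool reports whether a given SlurmdSpoolDir exists and is
# writable by the invoking user.
check_spool() {
  if [ -d "$1" ] && [ -w "$1" ]; then
    echo "OK: $1"
  else
    echo "BAD: $1"
  fi
}
check_spool /var/spool/slurm    # the path from my slurm.conf
```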
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Ser
I think I've worked out the problem.
I see in your slurm.conf you have this
SlurmdSpoolDir=/var/spool/slurm/d
It should be
SlurmdSpoolDir=/var/spool/slurmd
You'll need to restart slurmd on all the nodes after you make that change
I would also double check the permissions on that directory on all
It looks like your slurmctld isn't contacting the slurmdbd properly. The
control host, control port, etc. are all blank.
The first thing I would do is change the ClusterName in your slurm.conf
from upper case TUC to lower case tuc. You'll then need to restart your
ctld. Then recheck sacctmgr show cluster
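Something like this (a sketch; the config path is an assumption, adjust for your install):

```shell
# Lower-case the ClusterName in slurm.conf, then restart slurmctld
# and recheck the cluster registration.
fix_clustername() {
  sed -i 's/^ClusterName=TUC$/ClusterName=tuc/' "$1"
}
# fix_clustername /etc/slurm/slurm.conf   # assumed config path
# systemctl restart slurmctld
# sacctmgr show cluster
```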
sinfo -N -o "%N %T %C %m %P %a"
NODELIST STATE CPUS(A/I/O/T) MEMORY PARTITION AVAIL
wn001 drained 0/0/2/2 3934 TUC* up
wn002 drained 0/0/2/2 3934 TUC* up
wn003 drained 0/0/2/2 3934 TUC* up
wn004 drained 0/0/2/2 3934 TUC* up
wn005 drained 0/0/2/2 3934 TUC* up
wn006 drained 0/0/2/2 3934 TUC* u
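All of the nodes above are drained. Once the underlying problem is fixed, they can be returned to service; a sketch using standard Slurm commands (the node range matches the listing above):

```shell
# undrain_nodes shows each node's drain Reason, then clears the drain state.
undrain_nodes() {
  sinfo -R                                      # why was each node drained?
  scontrol update NodeName=wn[001-006] State=RESUME
}
# undrain_nodes   # run only after the cause of the drain is fixed
```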
sacctmgr list cluster
   Cluster ControlHost ControlPort   RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall   QOS Def QOS
---------- ----------- ----------- ----- ----- ------- ------- --------- ------- ------- --------- ------- ----- -------
It looks like your attachment of sinfo -R didn't come through.
It also looks like your dbd isn't set up correctly.
Can you also show the output of
sacctmgr list cluster
and
scontrol show config | grep ClusterName
Sean
Hi Sean,
I am trying to submit a simple job, but it freezes:
srun -n44 -l /bin/hostname
srun: Required node not available (down, drained or reserved)
srun: job 15 queued and waiting for resources
^Csrun: Job allocation 15 has been revoked
srun: Force Terminated job 15
daemons are activ