Re: [slurm-users] Fairshare +FairTree Algorithm + TRESBillingWeights

2021-04-06 Thread Yap, Mike
Also found my answer for the weight values here: https://slurm.schedmd.com/priority_multifactor.html#fairshare IMPORTANT: The weight values should be high enough to get a good set of significant digits since all the factors are floating point numbers from 0.0 to 1.0. For example, one job could have
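For context, a minimal sketch of the kind of weight settings that passage refers to in slurm.conf (the numbers below are illustrative assumptions, not recommendations):

    # slurm.conf (illustrative values only)
    PriorityType=priority/multifactor
    PriorityWeightFairshare=100000
    PriorityWeightAge=1000
    PriorityWeightPartition=1000
    PriorityWeightJobSize=1000
    PriorityWeightQOS=1000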

Re: [slurm-users] Fairshare +FairTree Algorithm + TRESBillingWeights

2021-04-06 Thread Yap, Mike
Fixed the issue with TRESBillingWeights. It seems I need to set it on a PartitionName line for it to work: https://bugs.schedmd.com/show_bug.cgi?id=3753 PartitionName=DEFAULT TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0" From: slurm-users On Behalf Of Yap, Mike Sent: Wednesday, 7 April 2021 9:57
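A sketch of how that DEFAULT entry combines with a real partition definition in slurm.conf (the partition name and node list below are assumptions for illustration; a DEFAULT line applies to the partitions defined after it):

    PartitionName=DEFAULT TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0"
    PartitionName=batch Nodes=wn[001-006] Default=YES State=UP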

Re: [slurm-users] Fairshare +FairTree Algorithm + TRESBillingWeights

2021-04-06 Thread Yap, Mike
Thanks Luke. I will go through the 2 commands (and will try to digest them). Wondering if you're able to advise on TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=2.0". I tried to include it in slurm.conf but slurm fails to start. Also wondering if anyone can advise on the fairshare value. I recall read
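One hedged way to confirm the weights were actually picked up once slurmctld starts again: if TRESBillingWeights is configured on a partition it should appear in the partition record.

    scontrol show partition | grep -i TRESBillingWeights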

[slurm-users] RawUsage 0??

2021-04-06 Thread Matthias Leopold
Hi, I'm very new to Slurm and am trying to understand basic concepts. One of them is the "Multifactor Priority Plugin". For this I submitted some jobs and looked at sshare output. To my surprise I don't get any numbers for "RawUsage"; regardless of what I do, RawUsage stays 0 (same in "scontrol show as
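RawUsage only accumulates when job accounting is being stored and the multifactor plugin is active; a minimal sketch of the relevant pieces (assumed settings, not taken from the poster's config):

    # slurm.conf (sketch)
    PriorityType=priority/multifactor
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=<slurmdbd host>

    # then, after some jobs have run:
    sshare -l        # long format, includes RawUsage and EffectvUsage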

[slurm-users] Updated "pestat" tool for printing Slurm nodes status with 1 line per node including job info

2021-04-06 Thread Ole Holm Nielsen
I have updated the "pestat" tool for printing Slurm nodes status with 1 line per node including job info. The download page is https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat (also listed in https://slurm.schedmd.com/download.html). The pestat tool can print a large variety of
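A hedged usage sketch (the clone path is taken from the URL above; the -h convention for listing options is an assumption about the script):

    git clone https://github.com/OleHolmNielsen/Slurm_tools
    cd Slurm_tools/pestat
    ./pestat        # one status line per node
    ./pestat -h     # list the available options (assumed help flag)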

Re: [slurm-users] Cannot run interactive jobs

2021-04-06 Thread Manalo, Kevin L
Sajesh, for those other users that may have run into this: I found a reason why srun cannot run interactive jobs, and it may not necessarily be related to RHEL/CentOS 7. If one straces the slurmd, one may see (see arg 3 for the gid): chown("/dev/pts/1", 1326, 7) = -1 EPERM (Operation not permitted)
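A diagnostic sketch for that symptom (not from the original post): check which group gid 7 maps to on the node and how /dev/pts is mounted.

    getent group 7           # which group owns gid 7 on this node
    ls -l /dev/pts/          # current ownership of the allocated ptys
    mount | grep devpts      # gid=/mode= options /dev/pts was mounted with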

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread Sean Crosby
I just checked my cluster and my spool dir is SlurmdSpoolDir=/var/spool/slurm (i.e. without the d at the end). It doesn't really matter, as long as the directory exists and has the correct permissions on all nodes. -- Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead Research Computing Ser
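A hedged way to confirm which directory is actually in use and that it is usable on a node:

    scontrol show config | grep SlurmdSpoolDir
    ls -ld /var/spool/slurm      # or /var/spool/slurmd, whichever the setting points at
    # the directory must exist on every node and be writable by SlurmdUser (root by default)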

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread Sean Crosby
I think I've worked out a problem. I see in your slurm.conf you have this: SlurmdSpoolDir=/var/spool/slurm/d It should be: SlurmdSpoolDir=/var/spool/slurmd You'll need to restart slurmd on all the nodes after you make that change. I would also double-check the permissions on that directory on all
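A sketch of that change and the restart step (assuming the usual systemd unit name slurmd):

    # slurm.conf on the controller and all nodes
    SlurmdSpoolDir=/var/spool/slurmd

    # then, on every compute node:
    systemctl restart slurmd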

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread Sean Crosby
It looks like your ctl isn't contacting the slurmdbd properly. The control host, control port, etc. are all blank. The first thing I would do is change the ClusterName in your slurm.conf from upper case TUC to lower case tuc. You'll then need to restart your ctld. Then recheck sacctmgr show cluster
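Spelled out as a sketch (the cluster name comes from the thread; host names omitted):

    # slurm.conf
    ClusterName=tuc

    # on the controller:
    systemctl restart slurmctld
    sacctmgr show cluster     # ControlHost/ControlPort should now be populated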

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread ibotsis
sinfo -N -o "%N %T %C %m %P %a"
NODELIST STATE CPUS(A/I/O/T) MEMORY PARTITION AVAIL
wn001 drained 0/0/2/2 3934 TUC* up
wn002 drained 0/0/2/2 3934 TUC* up
wn003 drained 0/0/2/2 3934 TUC* up
wn004 drained 0/0/2/2 3934 TUC* up
wn005 drained 0/0/2/2 3934 TUC* up
wn006 drained 0/0/2/2 3934 TUC* u
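Every node is drained, so nothing can be scheduled. Once the drain reason has been dealt with, a recovery sketch (node range taken from the output above):

    sinfo -R                                          # show the recorded drain reason per node
    scontrol update NodeName=wn[001-006] State=RESUME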

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread ibotsis
sacctmgr list cluster
Cluster ControlHost ControlPort RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall QOS Def QOS
-- --- - - --- - - --- -

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread Sean Crosby
It looks like your attachment of sinfo -R didn't come through. It also looks like your dbd isn't set up correctly. Can you also show the output of sacctmgr list cluster and scontrol show config | grep ClusterName Sean -- Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead Research Comput
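The two diagnostics being requested, spelled out:

    sacctmgr list cluster
    scontrol show config | grep ClusterName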

Re: [slurm-users] [EXT] slurmctld error

2021-04-06 Thread Ioannis Botsis
Hi Sean, I am trying to submit a simple job but it freezes:
srun -n44 -l /bin/hostname
srun: Required node not available (down, drained or reserved)
srun: job 15 queued and waiting for resources
^Csrun: Job allocation 15 has been revoked
srun: Force Terminated job 15
daemons are activ
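The srun messages show the job is queued because no node is available; a diagnostic sketch consistent with the rest of the thread:

    sinfo -R                   # which nodes are down/drained and why
    scontrol show node wn001   # full state of one node, including the Reason= field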