[slurm-users] Re: Performance Discrepancy between Slurm and Direct mpirun for VASP Jobs.

2024-05-26 Thread Ole Holm Nielsen via slurm-users

On 25-05-2024 03:49, Hongyi Zhao via slurm-users wrote:

Ultimately, I found that the cause of the problem was that
hyper-threading was enabled by default in the BIOS. If I disable
hyper-threading, I observed that the computational efficiency is
consistent between using slurm and using mpirun directly. Therefore,
it appears that hyper-threading should not be enabled in the BIOS when
using slurm.


Whether or not to enable Hyper-Threading (HT) on your compute nodes 
depends entirely on the properties of applications that you wish to run 
on the nodes.  Some applications are faster without HT, others are 
faster with HT.  When HT is enabled, the "virtual CPU cores" obviously 
will have only half the memory available per core.


The VASP code is highly CPU- and memory intensive, and HT should 
probably be disabled for optimal performance with VASP.


Slurm doesn't affect the performance of your codes with or without HT. 
Slurm just schedules tasks to run on the available cores.
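If you do keep HT enabled in the BIOS, one possible middle ground (a sketch 
only; adjust node names and counts to your own hardware) is to declare the 
threads in slurm.conf:

   NodeName=node[001-010] Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=...

and then let CPU-bound codes such as VASP opt out of the extra threads at 
submit time:

   sbatch --hint=nomultithread job.sh

so that only one task is placed per physical core.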


/Ole

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm DB upgrade failure behavior

2024-05-17 Thread Ole Holm Nielsen via slurm-users

On 5/16/24 20:27, Yuengling, Philip J. via slurm-users wrote:
I'm writing up some Ansible code to manage Slurm software updates, and I 
haven't found any documentation about slurmdbd behavior if the 
mysql/mariadb database doesn't upgrade successfully.



I would discourage performing the proposed Slurm updates automatically using 
Ansible or any other automation tool!  Unexpected bugs might surface 
during the upgrade!


The mysql/mariadb database service isn't affected by Slurm updates, 
although the database contents are changed of course :-)


You need to very carefully make a dry-run slurmdbd update on a test node 
before doing the actual slurmdbd upgrade, and you need to make a backup of 
the database before upgrading!
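Something along these lines, assuming the default database name 
slurm_acct_db (a sketch only; adjust paths and credentials to your site):

   # Back up the accounting database before touching slurmdbd:
   mysqldump --single-transaction --databases slurm_acct_db > /root/slurm_acct_db.sql

   # On the test copy: run the new slurmdbd in the foreground and watch
   # the schema conversion messages:
   slurmdbd -D -vvv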


Updates of slurmctld must also be made very carefully with a backup of the 
spool directory (just in case).


The slurmd in most cases can be upgraded with no or only minor issues.

My Slurm upgrading notes are in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm

What I do know is that if it is successful I can expect to see "Conversion 
done: success!" in the slurmdbd log.  This is good, but minor updates do 
not update the database as far as I know.



If the Slurm database cannot upgrade upon an update, does it always shut 
down with a fatal error?  What other behaviors should I look for if there 
is a failure?


IHTH,
Ole

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Removing safely a node

2024-05-17 Thread Ole Holm Nielsen via slurm-users

On 5/17/24 05:16, Ratnasamy, Fritz via slurm-users wrote:
  What is the "official" process to remove nodes safely? I have drained 
the nodes so jobs are completed and put them in down state after they are 
completely drained.
I edited the slurm.conf file to remove the nodes. After some time, I can 
see that the nodes were removed from the partition with the command sinfo


However, I was told I might need to restart the service slurmctld, do you 
know if it is necessary? Should I also run scontrol reconfig?


The SchedMD presentations in https://slurm.schedmd.com/publications.html 
describe node add/remove.


I've collected my notes on this in the Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_operations/#add-and-remove-nodes
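In short, the procedure is roughly as follows (a sketch only; the node names 
are examples, and clush/pdsh is just one way to reach all nodes):

   # 1. Drain the nodes and wait for running jobs to finish:
   scontrol update NodeName=node[117-118] State=DRAIN Reason="decommission"
   # 2. Remove the nodes from the NodeName= and PartitionName= lines in slurm.conf
   # 3. Restart the daemons (note: "scontrol reconfig" is not sufficient
   #    when adding or removing nodes):
   systemctl restart slurmctld
   clush -ba systemctl restart slurmd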

/Ole

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-07 Thread Ole Holm Nielsen via slurm-users

On 5/7/24 15:32, Henderson, Brent via slurm-users wrote:
Over the past few days I grabbed some time on the nodes and ran for a few 
hours.  Looks like I **can** still hit the issue with cgroups disabled.  
Incident rate was 8 out of >11k jobs so dropped an order of magnitude or 
so.  Guessing that exonerates cgroups as the cause, but possibly just a 
good way to tickle the real issue.  Over the next few days, I’ll try to 
roll everything back to RHEL 8.9 and see how that goes.


My 2 cents: RHEL/AlmaLinux/RockyLinux 9.4 is out now, maybe it's worth a 
try to update to 9.4?


/Ole

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-19 Thread Ole Holm Nielsen via slurm-users
It turns out that the Slurm job limits are *not* controlled by the normal 
/etc/security/limits.conf configuration.  Any service running under 
Systemd (such as slurmd) has limits defined by Systemd, see [1] and [2].


The limits of processes started by slurmd are defined by LimitXXX in 
/usr/lib/systemd/system/slurmd.service, and current Slurm versions have 
LimitNOFILE=131072.


I guess that LimitNOFILE is the limit applied to every Slurm job, and that 
jobs presumably ought to crash if opening more than LimitNOFILE files?
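If a site needs a different per-job limit, the systemd way would be a 
drop-in override rather than editing the packaged unit file (a sketch; pick 
your own value):

   # /etc/systemd/system/slurmd.service.d/limits.conf
   [Service]
   LimitNOFILE=262144

   systemctl daemon-reload
   systemctl restart slurmd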


If this is correct, I think the kernel's fs.file-max ought to be set to 
131072 times the maximum possible number of Slurm jobs per node, plus a 
safety margin for the OS.  Depending on Slurm configuration, fs.file-max 
should be set to 131072 times number of CPUs plus some extra margin.  For 
example, a 96-core node might have fs.file-max set to 100*131072 = 13107200.
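As a sketch of that heuristic (the numbers are only an example for a 96-core 
node), the value could be made persistent with a sysctl drop-in:

   # /etc/sysctl.d/90-file-max.conf
   # ~100 possible jobs x 131072 files per job, plus OS headroom
   fs.file-max = 13107200

   sysctl --system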


Does this make sense?

Best regards,
Ole

[1] "How to set limits for services in RHEL and systemd" 
https://access.redhat.com/solutions/1257953
[2] 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#slurmd-systemd-limits


On 4/18/24 11:23, Ole Holm Nielsen wrote:
I looked at some of our busy 96-core nodes where users are currently 
running the STAR-CCM+ CFD software.


One job runs on 4 96-core nodes.  I'm amazed that each STAR-CCM+ process 
has almost 1000 open files, for example:


$ lsof -p 440938 | wc -l
950

and that on this node the user has almost 95000 open files:

$ lsof -u  | wc -l
94606

So it's no wonder that 65536 open files would have been exhausted, and 
that my current limit is just barely sufficient:


$ sysctl fs.file-max
fs.file-max = 131072

As an experiment I lowered the max number of files on a node:

$ sysctl fs.file-max=32768

and immediately the syslog displayed error messages:

Apr 18 10:54:11 e033 kernel: VFS: file-max limit 32768 reached

Munged (version 0.5.16) logged a lot of errors:

2024-04-18 10:54:33 +0200 Info:  Failed to accept connection: Too many 
open files in system
2024-04-18 10:55:34 +0200 Info:  Failed to accept connection: Too many 
open files in system
2024-04-18 10:56:35 +0200 Info:  Failed to accept connection: Too many 
open files in system

2024-04-18 10:57:22 +0200 Info:  Encode retry #1 for client UID=0 GID=0
2024-04-18 10:57:22 +0200 Info:  Failed to send message: Broken pipe
(many lines deleted)

Slurmd also logged some errors:

[2024-04-18T10:57:22.070] error: slurm_send_node_msg: [(null)] 
slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_UPDATE) failed: Unexpected 
missing socket error
[2024-04-18T10:57:22.080] error: slurm_send_node_msg: [(null)] 
slurm_bufs_sendto(msg_type=RESPONSE_PING_SLURMD) failed: Unexpected 
missing socket error
[2024-04-18T10:57:22.080] error: slurm_send_node_msg: [(null)] 
slurm_bufs_sendto(msg_type=RESPONSE_PING_SLURMD) failed: Unexpected 
missing socket error



The node became completely non-responsive until I restored 
fs.file-max=131072.


Conclusions:

1. Munge should be upgraded to 0.5.15 or later to avoid the munged.log 
filling up the disk.  I summarize this in the Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#munge-authentication-service


2. We still need some heuristics for determining sufficient values for the 
kernel's fs.file-max limit.  I don't understand whether the kernel itself 
might set good default values; we have noticed apparently good defaults on 
some servers and login nodes.


As Jeffrey points out, there are both soft and hard user limits on the 
number of files, and this is what I see for a normal user:


$ ulimit -Sn   # Soft limit
1024
$ ulimit -Hn   # Hard limit
262144

Maybe the heuristics could be to multiply "ulimit -Hn" by the CPU core 
count (if we believe that users will only run 1 process per core).  An 
extra safety margin would need to be added on top.  Or maybe we need 
something a lot higher?


Question: Would there be any negative side effect of setting fs.file-max 
to a very large number (10s of millions)?


Interestingly, the (possibly outdated) Large Cluster Administration Guide 
at https://slurm.schedmd.com/big_sys.html recommends a ridiculously low 
number:


/proc/sys/fs/file-max: The maximum number of concurrently open files. We 
recommend a limit of at least 32,832.


Thanks for sharing your insights,
Ole


On 4/16/24 14:40, Jeffrey T Frey via slurm-users wrote:

AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
is per user.


The ulimit is a frontend to rusage limits, which are per-process 
restrictions (not per-user).


The fs.file-max is the kernel's limit on how many file descriptors can 
be open in aggregate.  You'd have to edit that with sysctl:



    *$ sysctl fs.file-max*
    fs.file-max = 26161449



Check in e.g. /etc/sysctl.conf or /etc/sysctl.d if you have an 
alternative limit versus the default.






But if you have ulimit -n == 1024, then no user should be able to h

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-18 Thread Ole Holm Nielsen via slurm-users
all of them, I was just hoping to also rule out any bug in 
munged.  Since you're upgrading munged, you'll now get the errno 
associated with the backlog and can confirm EMFILE vs. ENFILE vs. ENOMEM.





--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Ole Holm Nielsen via slurm-users

Hi Bjørn-Helge,

On 4/16/24 12:08, Bjørn-Helge Mevik via slurm-users wrote:

Ole Holm Nielsen via slurm-users  writes:


Therefore I believe that the root cause of the present issue is user
applications opening a lot of files on our 96-core nodes, and we need
to increase fs.file-max.


You could also set a limit per user, for instance in
/etc/security/limits.d/.  Then users would be blocked from opening
unreasonably many files.  One could use this to find which applications
are responsible, and try to get them fixed.


That sounds interesting, but which limit might affect the kernel's 
fs.file-max?  For example, a user already has a narrow limit:


ulimit -n
1024

whereas the permitted number of user processes is a lot higher:

ulimit -u
3092846

I'm not sure how the number 3092846 got set, since it's not defined in 
/etc/security/limits.conf.  The "ulimit -u" varies quite a bit among our 
compute nodes, so which dynamic service might affect the limits?


Perhaps there is a recommendation for defining nproc in 
/etc/security/limits.conf on compute nodes?
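(For reference, such a limits entry would look something like the sketch 
below; the values are placeholders, not a recommendation:)

   # /etc/security/limits.d/90-nproc.conf
   *    soft    nproc    4096
   *    hard    nproc    16384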


Thanks,
Ole

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Ole Holm Nielsen via slurm-users

Hi Jeffrey,

Thanks a lot for the information:

On 4/15/24 15:40, Jeffrey T Frey wrote:

https://github.com/dun/munge/issues/94


I hadn't seen issue #94 before, and it seems to be relevant to our 
problem.  It's probably a good idea to upgrade munge beyond what's 
supplied by EL8/EL9.  We can build the latest 0.5.16 RPMs by:


wget 
https://github.com/dun/munge/releases/download/munge-0.5.16/munge-0.5.16.tar.xz

rpmbuild -ta munge-0.5.16.tar.xz

I've updated my Slurm Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#munge-authentication-service 
accordingly now.



The NEWS file claims this was fixed in 0.5.15.  Since your log doesn't show the 
additional strerror() output you're definitely running an older version, 
correct?


Correct, we run munge 0.5.13 as supplied by EL8 (RockyLinux 8.9).


If you go on one of the affected nodes and do an `lsof -p ` I'm betting 
you'll find a long list of open file descriptors — that would explain the "Too many open 
files" situation _and_ indicate that this is something other than external memory pressure 
or open file limits on the process.


Actually, munged is normally working without too many open files as seen 
by "lsof -p `pidof munged`" over the entire partition, where the munged 
open file count is only 29.  I currently don't have any broken nodes with 
a full file system that I can examine.


Therefore I believe that the root cause of the present issue is user 
applications opening a lot of files on our 96-core nodes, and we need to 
increase fs.file-max.  And upgrade munge as well to avoid the log file 
growing without bounds.


I'd still like to know if anyone has good recommendations for setting the 
fs.file-max parameter on Slurm compute nodes?


Thanks,
Ole


On Apr 15, 2024, at 08:14, Ole Holm Nielsen via slurm-users 
 wrote:

We have some new AMD EPYC compute nodes with 96 cores/node running RockyLinux 
8.9.  We've had a number of incidents where the Munge log-file 
/var/log/munge/munged.log suddenly fills up the root file system, after a while 
to 100% (tens of GBs), and the node eventually comes to a grinding halt!  
Wiping munged.log and restarting the node works around the issue.

I've tried to track down the symptoms and this is what I found:

1. In munged.log there are infinitely many lines filling up the disk:

   2024-04-11 09:59:29 +0200 Info:  Suspended new connections while 
processing backlog

2. The slurmd is not getting any responses from munged, even though we run
   "munged --num-threads 10".  The slurmd.log displays errors like:

   [2024-04-12T02:05:45.001] error: If munged is up, restart with 
--num-threads=10
   [2024-04-12T02:05:45.001] error: Munge encode failed: Failed to connect to 
"/var/run/munge/munge.socket.2": Resource temporarily unavailable
   [2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: auth_g_create: 
RESPONSE_ACCT_GATHER_UPDATE has authentication error

3. The /var/log/messages displays the errors from slurmd as well as
   NetworkManager saying "Too many open files in system".
   The telltale syslog entry seems to be:

   Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached

   where the limit is confirmed in /proc/sys/fs/file-max.

We have never before seen any such errors from Munge.  The error may perhaps be 
triggered by certain user codes (possibly star-ccm+) that might be opening a 
lot more files on the 96-core nodes than on nodes with a lower core count.

My workaround has been to edit the line in /etc/sysctl.conf:

fs.file-max = 131072

and update settings by "sysctl -p".  We haven't seen any of the Munge errors 
since!

The version of Munge in RockyLinux 8.9 is 0.5.13, but there is a newer version 
in https://github.com/dun/munge/releases/tag/munge-0.5.16
I can't figure out if 0.5.16 has a fix for the issue seen here?

Questions: Have other sites seen the present Munge issue as well?  Are there 
any good recommendations for setting the fs.file-max parameter on Slurm compute 
nodes?

Thanks for sharing your insights,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com




--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Munge log-file fills up the file system to 100%

2024-04-15 Thread Ole Holm Nielsen via slurm-users
We have some new AMD EPYC compute nodes with 96 cores/node running 
RockyLinux 8.9.  We've had a number of incidents where the Munge log-file 
/var/log/munge/munged.log suddenly fills up the root file system, after a 
while to 100% (tens of GBs), and the node eventually comes to a grinding 
halt!  Wiping munged.log and restarting the node works around the issue.


I've tried to track down the symptoms and this is what I found:

1. In munged.log there are infinitely many lines filling up the disk:

   2024-04-11 09:59:29 +0200 Info:  Suspended new connections while 
processing backlog


2. The slurmd is not getting any responses from munged, even though we run
   "munged --num-threads 10".  The slurmd.log displays errors like:

   [2024-04-12T02:05:45.001] error: If munged is up, restart with 
--num-threads=10
   [2024-04-12T02:05:45.001] error: Munge encode failed: Failed to 
connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable
   [2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: 
auth_g_create: RESPONSE_ACCT_GATHER_UPDATE has authentication error


3. The /var/log/messages displays the errors from slurmd as well as
   NetworkManager saying "Too many open files in system".
   The telltale syslog entry seems to be:

   Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached

   where the limit is confirmed in /proc/sys/fs/file-max.

We have never before seen any such errors from Munge.  The error may 
perhaps be triggered by certain user codes (possibly star-ccm+) that might 
be opening a lot more files on the 96-core nodes than on nodes with a 
lower core count.


My workaround has been to edit the line in /etc/sysctl.conf:

fs.file-max = 131072

and update settings by "sysctl -p".  We haven't seen any of the Munge 
errors since!


The version of Munge in RockyLinux 8.9 is 0.5.13, but there is a newer 
version in https://github.com/dun/munge/releases/tag/munge-0.5.16

I can't figure out if 0.5.16 has a fix for the issue seen here?

Questions: Have other sites seen the present Munge issue as well?  Are 
there any good recommendations for setting the fs.file-max parameter on 
Slurm compute nodes?


Thanks for sharing your insights,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Lua script

2024-03-20 Thread Ole Holm Nielsen via slurm-users

What is the contents of your /etc/slurm/job_submit.lua file?
Did you reconfigure slurmctld?
Check the log file by: grep job_submit /var/log/slurm/slurmctld.log
What is your Slurm version?

You can read about job_submit plugins in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#job-submit-plugins
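For reference, a bare-bones job_submit.lua sketch that rejects jobs asking 
for more than a site-chosen time limit could look like the following.  The 
240-minute limit is just an example, and the field names and the 4294967294 
("no limit requested") sentinel are my recollection of the Lua plugin API, so 
please test against your own Slurm version:

   -- Illustrative sketch only, not a complete policy.
   local MAX_MINUTES = 240   -- example site limit

   function slurm_job_submit(job_desc, part_list, submit_uid)
      -- job_desc.time_limit is the requested limit in minutes
      if job_desc.time_limit == nil or job_desc.time_limit >= 4294967294 then
         return slurm.SUCCESS   -- no explicit limit; let partition limits apply
      end
      if job_desc.time_limit > MAX_MINUTES then
         slurm.log_user("Requested time limit exceeds %d minutes - job rejected",
                        MAX_MINUTES)
         return slurm.ERROR
      end
      return slurm.SUCCESS
   end

   function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
      return slurm.SUCCESS
   end

   return slurm.SUCCESS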

I hope this helps,
Ole


On 3/20/24 09:49, Gestió Servidors via slurm-users wrote:
after adding “EnforcePartLimits=ALL” in slurm.conf and restarting the 
slurmctld daemon, the job continues being accepted… so I don’t understand where 
I’m doing something wrong.


My slurm.conf is this:

ControlMachine=my_server

MailProg=/bin/mail

MpiDefault=none

ProctrackType=proctrack/linuxproc

ReturnToService=2

SlurmctldPidFile=/var/run/slurmctld.pid

SlurmctldPort=6817

SlurmdPidFile=/var/run/slurmd.pid

SlurmdPort=6818

SlurmdSpoolDir=/var/spool/slurmd

SlurmUser=slurm

SlurmdUser=root

AuthType=auth/munge

StateSaveLocation=/var/log/slurm

SwitchType=switch/none

TaskPlugin=task/none,task/affinity,task/cgroup

TaskPluginParam=none

DebugFlags=NO_CONF_HASH,Backfill,BackfillMap,SelectType,Steps,TraceJobs

*JobSubmitPlugins=lua*

SchedulerType=sched/backfill

SelectType=select/cons_tres

SelectTypeParameters=CR_Core

SchedulerParameters=max_script_size=20971520

*EnforcePartLimits=ALL*

CoreSpecPlugin=core_spec/none

AccountingStorageType=accounting_storage/slurmdbd

AccountingStoreFlags=job_comment

JobCompType=jobcomp/filetxt

JobCompLoc=/var/log/slurm/job_completions

ClusterName=my_cluster

JobAcctGatherType=jobacct_gather/linux

SlurmctldDebug=5

SlurmctldLogFile=/var/log/slurmctld.log

SlurmdDebug=5

SlurmdLogFile=/var/log/slurmd.log

AccountingStorageEnforce=limits

AccountingStorageHost=my_server

NodeName=clus[01-06] CPUs=12 SocketsPerBoard=2 CoresPerSocket=6 
ThreadsPerCore=1 RealMemory=128387 TmpDisk=81880 Feature=big-mem


NodeName=clus[07-12] CPUs=12 SocketsPerBoard=2 CoresPerSocket=6 
ThreadsPerCore=1 RealMemory=15491 TmpDisk=81880 Feature=small-mem


NodeName=clus-login CPUs=4 SocketsPerBoard=2 CoresperSocket=2 
ThreadsperCore=1 RealMemory=15886 TmpDisk=30705


*PartitionName=nodo.q Nodes=clus[01-12] Default=YES MaxTime=04:00:00 
State=UP AllocNodes=clus-login,clus05 MaxCPUsPerNode=12*


KillOnBadExit=1

OverTimeLimit=30 # if the job runs more than 30 minutes past the maximum 
time (2 hours), it is cancelled


TCPTimeout=5

PriorityType=priority/multifactor

PriorityDecayHalfLife=7-0

PriorityCalcPeriod=5

PriorityUsageResetPeriod=QUARTERLY

PriorityFavorSmall=NO

PriorityMaxAge=7-0

PriorityWeightAge=1

PriorityWeightFairshare=100

PriorityWeightJobSize=1000

PriorityWeightPartition=1000

PriorityWeightQOS=0

PropagateResourceLimitsExcept=MEMLOCK

And testing script is this:

#!/bin/bash

*#SBATCH --time=5-00:00:00*

srun /bin/hostname

date

sleep 50

date

Why is my job being submitted into the queue and not refused BEFORE being 
queued?


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource

2024-03-14 Thread Ole Holm Nielsen via slurm-users

Hi Simon,

Maybe you could print the user's limits using this tool:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits

Which version of Slurm do you run?

/Ole

On 3/14/24 17:47, Simon Andrews via slurm-users wrote:
Our cluster has developed a strange intermittent behaviour where jobs are 
being put into a pending state because they aren’t passing the 
AssocGrpCpuLimit, even though the user submitting has enough cpus for the 
job to run.


For example:

$ squeue -o "%.6i %.9P %.8j %.8u %.2t %.10M %.7m %.7c %.20R"

JOBID PARTITION NAME USER ST   TIME MIN_MEM MIN_CPU 
NODELIST(REASON)


    799    normal hostname andrewss PD   0:00  2G   5   
(AssocGrpCpuLimit)


..so the job isn’t running, and it’s the only job in the queue, but:

$ sacctmgr list associations part=normal user=andrewss 
format=Account,User,Partition,Share,GrpTRES


    Account   User  Partition Share   GrpTRES

-- -- -- - -

   andrewss   andrewss normal 1 cpu=5

That user has a limit of 5 CPUs so the job should run.

The weird thing is that this effect is intermittent.  A job can hang and 
the queue will stall for ages but will then suddenly start working and you 
can submit several jobs and they all work, until one fails again.


The cluster has active nodes and plenty of resource:

$ sinfo

PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST

normal*    up   infinite  2   idle compute-0-[6-7]

interactive    up 1-12:00:00  3   idle compute-1-[0-1,3]

The slurmctld log just says:

[2024-03-14T16:21:41.275] _slurm_rpc_submit_batch_job: JobId=799 
InitPrio=4294901720 usec=259


Whilst it’s in this state I can run other jobs with core requests of up to 
4 and they work, but not 5.  It’s like slurm is adding one CPU to the 
request and then denying it.


//

I’m sure I’m missing something fundamental but would appreciate it if 
someone could point out what it is!


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm management of dual-node server trays?

2024-03-07 Thread Ole Holm Nielsen via slurm-users

Hi Luke,

Thanks very much for your feedback about Lenovo SD650 V1 water-cooled 
servers.  The new SD665 V3 also consists of 2 AMD Genoa servers in a 
shared tray.  I have now installed RockyLinux 8.9 on the nodes and tested 
the Infiniband connectivity.


Fortunately, Lenovo/Mellanox/Nvidia seem to have fixed the Infiniband 
"SharedIO" adapters in the latest generation (V3 ?) of servers so that the 
Primary node can be rebooted or even powered off *without* causing any 
Infiniband glitches on the Auxiliary (left-hand) node.  This is a great 
relief to me :-)


What I did was to run a ping command on the IPoIB interface on the 
Auxiliary (left-hand) node to another node in the cluster, while rebooting 
or powering down the Primary node.  Not a single IP packet was lost.


Best regards,
Ole

On 3/1/24 18:25, Luke Sudbery wrote:

We have these cards in some sd650v1 servers.

You get 2 nodes in a 1u configuration, but they are attached, you can only 
pull both out of the rack at once.


Ours are slightly older, so we only have 1x 1Gb on-board per server, plus 
1x 200Gb HDR port on the B server, which provides a “virtual” 200G 4xHDR 
port on each node, although I think in practice they function as a 2xHDR 
100G port on each server. I think the sd655v3 will have 2x 25G SPF+ NICs 
on each node, plus the 1 or 2 200G NDR QSFP112 ports provided by the 
ConnectX7 card on the B server, shared with the A server.


You can totally reboot either server without affecting the other. You will 
just see something like:


[  356.799171] mlx5_core :58:00.0: mlx5_fw_tracer_start:821:(pid 819): 
FWTracer: Ownership granted and active


As the “owner” fails over from one node to the other.

However, doing a full power off of the B server, will crash the A node:

bear-pg0206u28a: 02/24/2024 15:36:57 OS Stop, Run-time critical Stop 
(panic/BSOD) (Sensor 0x46)


bear-pg0206u28a: 02/24/2024 15:36:58 Critical Interrupt, Bus Uncorrectable 
Error (PCIs)


bear-pg0206u28a: 02/24/2024 15:37:01 Slot / Connector, Fault Status 
asserted PCIe 1 (PCIe 1)


bear-pg0206u28a: 02/24/2024 15:37:03 Critical Interrupt, Software NMI (NMI 
State)


bear-pg0206u28a: 02/24/2024 15:37:07 Module / Board, State Asserted 
(SharedIO fail)


And this will put the fault light on. Once the B node is back, the A node 
will recover OK and you need to do virtual reseat or just restart the BMC 
to clear the fault.


So in day to day usage we don’t generally notice. It can be a bit of pain 
during outages or reinstall – obviously updating firmware on the B node 
will take out the A node too as the card needs to be reset. But they are 
fairly rare, and we don’t do anything special except 
rebooting/reinstalling the A nodes after the B nodes are all done to clear 
any errors.


Oh, and the A node doesn’t show up in the IB fabric:

[root@bear-pg0206u28a ~]# ibnetdiscover

ibwarn: [24150] _do_madrpc: send failed; Invalid argument

ibwarn: [24150] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0)

/var/tmp/OFED_topdir/BUILD/rdma-core-58mlnx43/libibnetdisc/ibnetdisc.c:811; 
Failed to resolve self

ibnetdiscover: iberror: failed: discover failed

[root@bear-pg0206u28a ~]#

So we can’t use automated topology generation scripts (without a little 
special casing).


Cheers,

Luke

--

Luke Sudbery

Principal Engineer (HPC and Storage).

Architecture, Infrastructure and Systems

Advanced Research Computing, IT Services

Room 132, Computer Centre G5, Elms Road

*Please note I don’t work on Monday.*

*From:*Sid Young via slurm-users 
*Sent:* Friday, February 23, 2024 9:49 PM
*To:* Ole Holm Nielsen 
*Cc:* Slurm User Community List 
*Subject:* [slurm-users] Re: Slurm management of dual-node server trays?

*CAUTION:*This email originated from outside the organisation. Do not 
click links or open attachments unless you recognise the sender and know 
the content is safe.


That's a very interesting design. Looking at the SD665 V3 documentation, 
am I correct that each node has dual 25 Gb/s SFP28 interfaces?


If so, then despite the dual nodes in a 1U configuration, you actually have 2 
separate servers?


Sid

On Fri, 23 Feb 2024, 22:40 Ole Holm Nielsen via slurm-users, 
mailto:slurm-users@lists.schedmd.com>> wrote:


We're in the process of installing some racks with Lenovo SD665 V3 [1]
water-cooled servers.  A Lenovo DW612S chassis contains 6 1U trays with 2
SD665 V3 servers mounted side-by-side in each tray.

Lenovo delivers SD665 V3 servers including water-cooled NVIDIA InfiniBand
"SharedIO" adapters [2] so that one node is the Primary including a PCIe
adapter, and the other is Auxiliary with just a cable to the Primary's
adapter.

Obviously, servicing 2 "Siamese twin" Slurm nodes requires a bit of care
and planning.  What is worse is that when the Primary node is rebooted or
powered down, the Auxiliary node will lose its Infiniband connection and
may have a PCIe fault or an NMI a

[slurm-users] Slurm management of dual-node server trays?

2024-02-23 Thread Ole Holm Nielsen via slurm-users
We're in the process of installing some racks with Lenovo SD665 V3 [1] 
water-cooled servers.  A Lenovo DW612S chassis contains 6 1U trays with 2 
SD665 V3 servers mounted side-by-side in each tray.


Lenovo delivers SD665 V3 servers including water-cooled NVIDIA InfiniBand 
"SharedIO" adapters [2] so that one node is the Primary including a PCIe 
adapter, and the other is Auxiliary with just a cable to the Primary's 
adapter.


Obviously, servicing 2 "Siamese twin" Slurm nodes requires a bit of care 
and planning.  What is worse is that when the Primary node is rebooted or 
powered down, the Auxiliary node will lose its Infiniband connection and 
may have a PCIe fault or an NMI as documented in [3].  And when nodes are 
powered up, the Primary must have completed POST before the Auxiliary gets 
started.  I wonder how to best deal with power failures?


It seems that when Slurm jobs are running on Auxiliary nodes, these jobs 
are going to crash when the possibly unrelated Primary node goes down.


This looks like a pretty bad system design on the part of Lenovo :-(  The 
goal was apparently to save some money on IB adapters and have fewer IB 
cables.


Question: Do any Slurm sites out there already have experiences with 
Lenovo "Siamese twin" nodes with SharedIO IB?  Have you developed some 
operational strategies, for example dealing with node pairs as a single 
entity for job scheduling?


Thanks for sharing any ideas and insights!

Ole

[1] https://lenovopress.lenovo.com/lp1612-lenovo-thinksystem-sd665-v3-server
[2] 
https://lenovopress.lenovo.com/lp1693-thinksystem-nvidia-connectx-7-ndr200-infiniband-qsfp112-adapters
[3] 
https://support.lenovo.com/us/en/solutions/ht510888-thinksystem-sd650-and-connectx-6-hdr-sharedio-lenovo-servers-and-storage


--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: URL for how to do for SLURM accounting setup

2024-02-15 Thread Ole Holm Nielsen via slurm-users

On 2/16/24 07:01, John Joseph via slurm-users wrote:
we were able to setup a test SLURM based system, with 4 nodes , Ubuntu 
22.04 LTS and we were able to run COMSOL using "comsol batch" command

Now we plan to have accounting

https://slurm.schedmd.com/accounting.html 




Like to reach out and get guidance on any tutorial or how to do 
documentation on setting up accounting


You might take a look at our Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/ and the other 
pages in the Wiki.
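As a rough sketch of the first steps once slurmdbd is up and running (the 
cluster, account and user names below are just examples):

   sacctmgr add cluster mycluster
   sacctmgr add account physics Description="Physics department"
   sacctmgr add user john DefaultAccount=physics
   # Verify the associations:
   sacctmgr show associations format=cluster,account,user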


IHTH,
Ole

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Why is Slurm 20 the latest RPM in RHEL 8/Fedora repo?

2024-01-31 Thread Ole Holm Nielsen via slurm-users

On 1/31/24 09:02, Bjørn-Helge Mevik via slurm-users wrote:

This isn't answering your question, but I strongly suggest you build
Slurm from source.  You can use the provided slurm.spec file to make
rpms (we do) or use "configure + make".  Apart from being able to
upgrade whenever a new version is out (especially important for
security!), you can tailor the rpms/build to your needs (IB? SlingShot?
Nvidia? etc.).


I agree that Slurm should be built from the source tar-balls instead of 
installed from some (outdated) repo.  Detailed installation instructions are in 
my Slurm Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/


Best regards,
Ole

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Ole Holm Nielsen

On 1/30/24 09:36, Fokke Dijkstra wrote:
We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in 
a completing state and slurmd daemons can't be killed because they are 
left in a CLOSE-WAIT state. See my previous mail to the mailing list for 
the details. And also https://bugs.schedmd.com/show_bug.cgi?id=18561 
 for another site having 
issues.


Bug 18561 was submitted by a user with no support contract, so it's 
unlikely that SchedMD will look into it.


I guess many sites are considering the upgrade to 23.11, and if there is 
an issue as reported, a site with a valid support contract needs to open a 
support case.  I'm very interested in hearing about any progress with 23.11!


Thanks,
Ole



Re: [slurm-users] error

2024-01-18 Thread Ole Holm Nielsen

On 1/18/24 17:42, Felix wrote:

I started a new AMD node, and the error is as follows:

"CPU frequency setting not configured for this node"

extended looks like this:

[2024-01-18T18:28:06.682] CPU frequency setting not configured for this node
[2024-01-18T18:28:06.691] slurmd started on Thu, 18 Jan 2024 18:28:06 +0200
[2024-01-18T18:28:06.691] CPUs=128 Boards=1 Sockets=1 Cores=64 Threads=2 
Memory=256786 TmpDisk=875797 Uptime=4569 CPUSpecList=(null) 
FeaturesAvail=(null) FeaturesActive=(null)


In the configuration file I have the following:

NodeName=awn-1[04] NodeAddr=192.168.4.[111] CPUs=128 RealMemory=256000 
Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 Feature=HyperThread


Could you please help me?


You should run "slurmd -C" on the node to print how Slurm sees your 
hardware.  That's more or less the configuration you should add to slurm.conf.


IHTH,
Ole



Re: [slurm-users] A fairshare policy that spans multiple clusters

2024-01-05 Thread Ole Holm Nielsen

On 05-01-2024 17:26, David Baker wrote:
We are soon to install a new Slurm cluster at our site. That means that we 
will have a total of three clusters running Slurm. Only two, that is the 
new clusters, will share a common file system. The original cluster has 
its own file system and is independent of the new arrivals. If possible, we 
would like to try to prevent users from making significant use of all 
the clusters and getting a 'triple whammy'. In other words, is there any way 
to share the fairshare information between the clusters so that a user's 
use of one of the clusters impacts their usage on the other clusters – 
if that makes sense. Does anyone have any thoughts on this question, 
please?


Am I correct in thinking that federating clusters is related to my 
question? Do I gather correctly, however, that federation only works if 
there is a common database on a shared file system?


Just my 2 cents: Clusters in a federation share a common Slurm database, 
otherwise they're independent clusters.  See the presentation of 
federated clusters in https://slurm.schedmd.com/SC17/FederationSC17.pdf 
and the documentation page https://slurm.schedmd.com/federation.html


I don't know if federated fairshare is possible.

/Ole



Re: [slurm-users] How to run one maintenance job on each node in the cluster

2023-12-23 Thread Ole Holm Nielsen

On 23-12-2023 05:09, Jeffrey Tunison wrote:
Is there a straightforward way to create a batch job that runs once on 
every node in the cluster?


A technique simpler than generating a list from sinfo output and 
dispatching the job in a for loop for the N nodes.


That’s not very hard, but I thought there might be an elegant solution 
which would make dispatching maintenance jobs easier.


One solution is the method in this script:
https://github.com/OleHolmNielsen/Slurm_tools/blob/master/nodes/update.sh

This works very reliably for us when we need to apply OS or firmware 
updates.



SLURM 22.05.09


Note: You should apply the recent Slurm security updates ASAP!

/Ole



Re: [slurm-users] Slurm compute node with Intel 12th gen CPU

2023-12-20 Thread Ole Holm Nielsen

On 20-12-2023 15:59, Michael Bernasconi wrote:
I'm trying to get slurm working on an Intel 12th gen CPU. slurmd 
instantly fails with the error message "Thread count (24) not multiple 
of core count (16)".
I have tried adding "SlurmdParameters=config_overrides" to slurm.conf, 
and I have experimented with various combinations of "Sockets", 
"CoresperSocket", "ThreadsPerCore", and "CPUs" but so far nothing has 
worked.
Does anyone know how to get this working? Are CPUs with heterogeneous 
cores just not supported by slurm?


On the node run "slurmd -C" which will print Slurm's view of the 
hardware.  Use this information in your slurm.conf file.


You should configure the RealMemory value slightly less than what is 
reported by slurmd -C, because kernel upgrades may give a slightly lower 
RealMemory value in the future and cause problems with the node’s health 
status.
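For example (purely illustrative numbers; always use the values that 
"slurmd -C" actually prints on the node):

   # slurmd -C reports e.g. RealMemory=31927; configure slightly less:
   NodeName=node042 CPUs=24 Boards=1 SocketsPerBoard=1 CoresPerSocket=12 ThreadsPerCore=2 RealMemory=31000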


/Ole




Re: [slurm-users] install new slurm, no slurmctld found

2023-12-15 Thread Ole Holm Nielsen

Hi Farcas,

On 12/15/23 11:00, Felix wrote:

we are installing a new server with slurm on ALMA Linux 9.2


Slurm support on EL9 might perhaps be a little less mature than on EL8.


we did the followimg:

dnf install slurm

The result is

rpm -qa | grep slurm
slurm-libs-22.05.9-1.el9.x86_64
slurm-22.05.9-1.el9.x86_64


Please note the warning about Slurm RPM packages from EPEL in the Slurm 
Quick Start guide https://slurm.schedmd.com/quickstart_admin.html



NOTE: In the beginning of 2021, a version of Slurm was added to the EPEL 
repository. This version is not supported or maintained by SchedMD, and is not 
currently recommend for customer use. Unfortunately, this inclusion could cause 
Slurm to be updated to a newer version outside of a planned maintenance period. 
In order to prevent Slurm from being updated unintentionally, we recommend you 
modify the EPEL Repository configuration to exclude all Slurm packages from 
automatic updates.

exclude=slurm*


You should follow SchedMD's Slurm Quick Start guide instead of installing 
Slurm packages from EPEL.
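The exclude goes into the EPEL repository definition, roughly like this (the 
file name may differ on your system):

   # In /etc/yum.repos.d/epel.repo, under the [epel] section:
   exclude=slurm*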


/Ole



Re: [slurm-users] How to check the bench mark capacity of the SLURM setup

2023-12-13 Thread Ole Holm Nielsen

On 12/13/23 10:44, John Joseph wrote:
Thanks for the mail, and sorry for not properly explaining what info I was 
requesting.  What I actually meant was: how could we check whether the HPC 
system I set up is working well?


E.g. a program which can be run individually on a node, comparing how 
the same code performs under the Slurm system.  I hope I was able to explain 
my request more clearly.

And sorry once again for not writing the email more clearly.


Code performance is not related to Slurm resource manager.  Maybe you just 
want to find a benchmark code to run on your system.  For example, the HPL 
Linpack code, see https://www.top500.org/project/linpack/


/Ole



Re: [slurm-users] How to check the bench mark capacity of the SLURM setup

2023-12-13 Thread Ole Holm Nielsen

On 12/13/23 07:13, John Joseph wrote:

We have set up Slurm for an HPC system of 4 nodes.
We want to do a stress test; guidance is requested for a code which 
can test the functionality and efficiency of SLURM.  If there is such a 
program, we would like to try it out.

Guidance requested


Then please define clearly your question.  The Slurm resource manager is 
very efficient at handling hundreds or thousands of compute nodes with 
good functionality!


/Ole



Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Ole Holm Nielsen

On 10-12-2023 17:29, Ryan Novosielski wrote:

This is basically always somebody filling up /tmp and /tmp residing on the same 
filesystem as the actual SlurmdSpoolDirectory.

/tmp, without modifications, is almost certainly the wrong place for 
temporary HPC files, which are too large.


Agreed!  That's why temporary job directories may be configured in 
Slurm, see the Wiki page for a summary:

https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#temporary-job-directories
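One way to do this (since Slurm 21.08) is the job_container/tmpfs plugin.  A 
minimal sketch, assuming a local scratch file system mounted at /scratch:

   # slurm.conf
   JobContainerType=job_container/tmpfs
   PrologFlags=Contain

   # job_container.conf
   AutoBasePath=true
   BasePath=/scratch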

/Ole


On Dec 8, 2023, at 10:02, Xaver Stiensmeier  wrote:

Dear slurm-user list,

during a larger cluster run (the same I mentioned earlier 242 nodes), I
got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
directory on the workers that is used for job state information
(https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
I was unable to find more precise information on that directory. We
compute all data on another volume so SlurmdSpoolDir has roughly 38 GB
of free space where nothing is intentionally put during the run. This
error only occurred on very few nodes.

I would like to understand what Slurmd is placing in this dir that fills
up the space. Do you have any ideas? Due to the workflow used, we have a
hard time reconstructing the exact scenario that caused this error. I
guess, the "fix" is to just pick a bit larger disk, but I am unsure
whether Slurm behaves normal here.

Best regards
Xaver Stiensmeier




Re: [slurm-users] SlurmdSpoolDir full

2023-12-08 Thread Ole Holm Nielsen

Hi Xaver,

On 12/8/23 16:00, Xaver Stiensmeier wrote:

during a larger cluster run (the same I mentioned earlier 242 nodes), I
got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
directory on the workers that is used for job state information
(https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
I was unable to find more precise information on that directory. We
compute all data on another volume so SlurmdSpoolDir has roughly 38 GB
of free space where nothing is intentionally put during the run. This
error only occurred on very few nodes.

I would like to understand what Slurmd is placing in this dir that fills
up the space. Do you have any ideas? Due to the workflow used, we have a
hard time reconstructing the exact scenario that caused this error. I
guess, the "fix" is to just pick a bit larger disk, but I am unsure
whether Slurm behaves normal here.


With Slurm RPM installation this directory is configured:

$ scontrol show config | grep SlurmdSpoolDir
SlurmdSpoolDir  = /var/spool/slurmd

In SlurmdSpoolDir we find job scripts and various cached data.  In our 
cluster it's usually a few Megabytes on each node.  We never had any 
issues with the size of SlurmdSpoolDir.


Do you store SlurmdSpoolDir on a shared network storage, or what?

Can you job scripts contain large amounts of data?

/Ole



Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen

On 12/6/23 11:51, Xaver Stiensmeier wrote:

Good idea. Here's our current version:

```
sinfo -V
slurm 22.05.7
```

Quick googling told me that the latest version is 23.11. Does the
upgrade change anything in that regard? I will keep reading.


There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving 
Power with Slurm" at https://slurm.schedmd.com/publications.html


For reasons of security and functionality it is recommended to follow 
Slurm's releases (maybe not the first few minor versions of new major 
releases like 23.11).  FYI, I've collected information about upgrading 
Slurm in the Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm


/Ole



Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen

Hi Xaver,

Your version of Slurm may matter for your power saving experience.  Do you 
run an updated version?


/Ole

On 12/6/23 10:54, Xaver Stiensmeier wrote:

Hi Ole,

I will double check, but I am very sure that giving a reason is possible
as it has been done at least 20 other times without error during that
exact run. It might be ignored though. You can also give a reason when
defining the states POWER_UP and POWER_DOWN. Slurm's documentation is
not always giving all information. We run our solution for about a year
now so I don't think there's a general problem (as in something that
necessarily occurs) with the command. But I will take a closer look. I
really feel like it has to be something more conditional though as
otherwise the error would've occurred more often (i.e. every time when
handling a fail and the command is executed).

IHTH,
Ole





--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620

Your repository would've been really helpful for me when we started
implementing the cloud scheduling, but I feel like we have implemented
most things you mention there already. But I will take a look at
`DebugFlags=Power`. `PrivateData=cloud` was an annoying thing to find
out; SLURM plans/planned to change that in the future (cloud key behaves
differently from any other key in PrivateData). Of course our setup 
differs a little in the details.

Best regards
Xaver

On 06.12.23 10:30, Ole Holm Nielsen wrote:

Hi Xavier,

On 12/6/23 09:28, Xaver Stiensmeier wrote:

using https://slurm.schedmd.com/power_save.html we had one case out
of many (>242) node starts that resulted in

|slurm_update error: Invalid node state specified|

when we called:

|scontrol update NodeName="$1" state=RESUME reason=FailedStartup|

in the Fail script. We run this to make 100% sure that the instances
- that are created on demand - are again `~idle` after being removed
by the fail program. They are set to RESUME before the actual
instance gets destroyed. I remember that I had this case manually
before, but I don't remember when it occurs.

Maybe someone has a great idea how to tackle this problem.


Probably you can't assign a "reason" when you update a node with
state=RESUME.  The scontrol manual page says:

Reason= Identify the reason the node is in a "DOWN",
"DRAINED", "DRAINING", "FAILING" or "FAIL" state.

Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving

and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save




Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen

Hi Xavier,

On 12/6/23 09:28, Xaver Stiensmeier wrote:
using https://slurm.schedmd.com/power_save.html we had one case out of 
many (>242) node starts that resulted in


|slurm_update error: Invalid node state specified|

when we called:

|scontrol update NodeName="$1" state=RESUME reason=FailedStartup|

in the Fail script. We run this to make 100% sure that the instances - 
that are created on demand - are again `~idle` after being removed by the 
fail program. They are set to RESUME before the actual instance gets 
destroyed. I remember that I had this case manually before, but I don't 
remember when it occurs.


Maybe someone has a great idea how to tackle this problem.


Probably you can't assign a "reason" when you update a node with 
state=RESUME.  The scontrol manual page says:


Reason= Identify the reason the node is in a "DOWN", "DRAINED", 
"DRAINING", "FAILING" or "FAIL" state.


Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save
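For context, the relevant slurm.conf pieces look roughly like the following 
(script paths, node names and timeouts are site-specific examples only):

   SuspendProgram=/usr/local/bin/node_suspend.sh
   ResumeProgram=/usr/local/bin/node_resume.sh
   ResumeFailProgram=/usr/local/bin/node_fail.sh
   SuspendTime=600
   ResumeTimeout=900
   SuspendExcNodes=login[1-2]
   DebugFlags=Power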

IHTH,
Ole




Re: [slurm-users] RPC rate limiting for different users

2023-11-28 Thread Ole Holm Nielsen

On 11/28/23 11:59, Cutts, Tim wrote:
Is the new rate limiting feature always global for all users, or is there 
an option, which I’ve missed, to have different settings for different 
users?  For example, to allow a higher rate from web services which submit 
jobs on behalf of a large number of users?


The rate limiting is global for all users.  You can only play with the 
various rl_* parameters described in the slurm.conf manual page to 
increase bucket size etc. for everyone.
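For reference, the knobs live in SlurmctldParameters, something like the 
sketch below (the values are arbitrary; check the slurm.conf manual page for 
the exact parameter names and defaults):

   SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_rate=10,rl_refill_period=1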


/Ole



Re: [slurm-users] Slurm version 23.11 is now available

2023-11-24 Thread Ole Holm Nielsen




On 11/24/23 12:15, Ole Holm Nielsen wrote:

On 11/24/23 09:31, Gestió Servidors wrote:
Some days ago, I started to configure a new server with SLURM 23.02.5. 
Yesterday, I read in this mailing list that version 23.11.0 was 
released, so today I have compiled this latest version. However, after 
starting slurmdbd (with a database upgrade), I have got problems with 
slurmctld, because of “select/cons_res” has dissapeard and, now, this 
parameter must be changed to “select_cons/tres”. If I change this, what 
is the impact in the system/cluster? What are the differences between 
“select_cons/res” and “select_cons/tres” if in newest versions it is 
mandatory to configure in this way?


In https://bugs.schedmd.com/show_bug.cgi?id=15470#c3 you can read:

When making a change from cons_res to cons_tres there isn't much you 
need to do.  These two select type plugins are very similar.  The 
difference being that the cons_tres plugin adds much more functionality 
related to GPUs.  If you're moving from cons_res to cons_tres there 
shouldn't be any effect on the running jobs.  If you were changing from 
cons_tres to cons_res and you had jobs on the system that used the new 
syntax that is available for GPUs, then you would run into problems.  
For your reference, this is described in the documentation here:

https://slurm.schedmd.com/slurm.conf.html#OPT_SelectType_1

This page shows the new options that are available when using the 
cons_tres plugin:

https://slurm.schedmd.com/cons_res.html#using_cons_tres


Note added:  A few more details are in this Wiki page: 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#upgrade-cons-res-to-cons-tres
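In practice the slurm.conf change is just something like this (keep your 
existing CR_* parameters), followed by restarting the Slurm daemons so they 
pick up the new plugin:

   #SelectType=select/cons_res     <- old
   SelectType=select/cons_tres
   SelectTypeParameters=CR_Core_Memory   # or whatever you use today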


/Ole



Re: [slurm-users] Slurm version 23.11 is now available

2023-11-24 Thread Ole Holm Nielsen

On 11/24/23 09:31, Gestió Servidors wrote:
Some days ago, I started to configure a new server with SLURM 23.02.5. 
Yesterday, I read in this mailing list that version 23.11.0 was released, 
so today I have compiled this latest version. However, after starting 
slurmdbd (with a database upgrade), I have got problems with slurmctld, 
because of “select/cons_res” has dissapeard and, now, this parameter must 
be changed to “select_cons/tres”. If I change this, what is the impact in 
the system/cluster? What are the differences between “select_cons/res” and 
“select_cons/tres” if in newest versions it is mandatory to configure in 
this way?


In https://bugs.schedmd.com/show_bug.cgi?id=15470#c3 you can read:


When making a change from cons_res to cons_tres there isn't much you need to 
do.  These two select type plugins are very similar.  The difference being that 
the cons_tres plugin adds much more functionality related to GPUs.  If you're 
moving from cons_res to cons_tres there shouldn't be any effect on the running 
jobs.  If you were changing from cons_tres to cons_res and you had jobs on the 
system that used the new syntax that is available for GPUs, then you would run 
into problems.  For your reference, this is described in the documentation here:
https://slurm.schedmd.com/slurm.conf.html#OPT_SelectType_1

This page shows the new options that are available when using the cons_tres 
plugin:
https://slurm.schedmd.com/cons_res.html#using_cons_tres


IHTH,
Ole



Re: [slurm-users] slurm comunication between versions

2023-11-23 Thread Ole Holm Nielsen

Hi Felix,

On 11/23/23 18:14, Felix wrote:
Will slurm-20.02 which is installed on a management node comunicate with 
slurm-22.05 installed on a work nodes?


They have the same configuration file slurm.conf

Or do the version have to be the same. Slurm 20.02 was installed manually 
and slurm 22.05 was installed through dnf.


It is only possible for Slurm versions in a cluster to differ by 2 major 
versions.  The 22.05 slurmctld can therefore only work with slurmd 22.05, 
21.08 and 20.11. The documentation is in 
https://slurm.schedmd.com/quickstart_admin.html#upgrade


My Slurm Wiki page explains the details of how to upgrade Slurm versions: 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm
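In outline (a sketch only; see the links above for the details and caveats):

   # Upgrade order, never the reverse:
   #   slurmdbd  ->  slurmctld  ->  slurmd on compute nodes  ->  login-node commands
   # and always take a database backup before upgrading slurmdbd.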


Please note that the current version of Slurm is now 23.11, and that only 
23.11 and 23.02 are supported.  There are important security fixes that 
makes it important to upgrade to a supported version of Slurm.


I hope this will help you to make the upgrades.

Best regards,
Ole



Re: [slurm-users] Releasing stale allocated TRES

2023-11-23 Thread Ole Holm Nielsen

On 11/23/23 11:50, Markus Kötter wrote:

On 23.11.23 10:56, Schneider, Gerald wrote:

I have a recurring problem with allocated TRES, which are not
released after all jobs on that node are finished. The TRES are still
marked as allocated and no new jobs can't be scheduled on that node
using those TRES.


Remove the node from slurm.conf and restart slurmctld, re-add, restart.
Remove from Partition definitions as well.


Just my 2 cents:  Do NOT remove a node from slurm.conf just as described!

When adding or removing nodes, both slurmctld as well as all slurmd's must 
be restarted!  See the SchedMD presentation 
https://slurm.schedmd.com/SLUG23/Field-Notes-7.pdf slides 51-56 for the 
recommended procedure.


/Ole



Re: [slurm-users] SLURM new user query, does SLURM has GUI /Web based management version also

2023-11-19 Thread Ole Holm Nielsen

On 19-11-2023 09:11, Joseph John wrote:

I am new user, trying out SLURM

Like to check if the SLURM has a GUI/web based management tool also


Did you read the Quick Start Administrator Guide at
https://slurm.schedmd.com/quickstart_admin.html ?

I don't believe there are any Slurm management tools as a web GUI, and 
that would probably be a security nightmare anyway because privileged 
system access is required.


There are a number of monitoring tools for viewing the status of Slurm jobs.

/Ole



Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-13 Thread Ole Holm Nielsen

Hi Max and Ward,

I've made a variation of your scripts which wait for at least 1 Infiniband 
port to come up before starting services such as slurmd or NFS mounts.


I prefer Max's Systemd service which comes before the Systemd 
network-online.target.  And I like Ward's script which checks the 
Infiniband status in /sys/class/infiniband/ in stead of relying on 
NetworkManager being installed.


At our site there are different types of compute nodes with different 
types of NICs:


1. Mellanox Infiniband.
2. Cornelis Omni-Path behaving just like Infiniband.
3. Intel X722 Ethernet NICs presenting a "fake" iRDMA Infiniband.
4. Plain Ethernet only.

I've written some modified scripts which are available in
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/InfiniBand
and which have been tested on the 4 types of NICs listed above.

The case 3. is particularly troublesome as reported earlier because it's 
an Ethernet port which presents an iRDMA InfiniBand interface.  My 
waitforib.sh script skips NICs whose link_layer type is not equal to 
InfiniBand.
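For illustration, the core of such a wait loop could look roughly like the 
sketch below (this is a generic sketch, not the actual waitforib.sh):

   #!/bin/bash
   # Wait up to ~5 minutes for a real InfiniBand port to become ACTIVE
   for i in $(seq 1 300); do
       for port in /sys/class/infiniband/*/ports/*; do
           # Skip iRDMA/RoCE devices whose link_layer is Ethernet
           grep -qs '^InfiniBand' "$port/link_layer" || continue
           grep -qs 'ACTIVE' "$port/state" && exit 0
       done
       sleep 1
   done
   exit 1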


Comments and suggestions would be most welcome.

Best regards,
Ole

On 11/10/23 19:45, Ward Poelmans wrote:

Hi Ole,

On 10/11/2023 15:04, Ole Holm Nielsen wrote:

On 11/5/23 21:32, Ward Poelmans wrote:
Yes, it's very similar. I've put our systemd unit file also online on 
https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11


This might disturb the logic in waitforib.sh, or at least cause some 
confusion?


I had never heard of these cards. But if they behave like infiniband 
cards, is there also an .../ports/1/state file present in /sys with the 
state? In that case it should work just as well.


We could also change the glob '/sys/class/infiniband/*/ports/*/state' to 
only look at devices starting with mlx. I have no clue how much diversity 
is out there, we only have Mellanox cards (or rebrands of those).



IMHO, this seems quite confusing.


Yes, I agree.


Regarding the slurmd service:


An alternative to this extra service would be like Max's service file 
https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service which has:

Before=network-online.target

What do you think of these considerations?


I think Max's approach is the better one. We only do it for slurmd while 
his is completely general for everything that waits on network. The 
downside is probably that if you have issue with your IB network, this 
will make it worse ;)


Ward




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-10 Thread Ole Holm Nielsen

Hi Ward,

On 11/5/23 21:32, Ward Poelmans wrote:
Yes, it's very similar. I've put our systemd unit file also online on 
https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11


This looks really good!  However, I was testing the waitforib.sh script on 
a SuperMicro server WITHOUT Infiniband and only a dual-port Ethernet NIC 
(Intel Corporation Ethernet Connection X722 for 10GBASE-T).


The EL8 drivers in kernel 4.18.0-477.27.2.el8_8.x86_64 seem to think that 
the Ethernet ports are also Infiniband ports:


# ls -l /sys/class/infiniband
total 0
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma0 -> 
../../devices/pci:5d/:5d:02.0/:5e:00.0/:5f:03.0/:60:00.0/infiniband/irdma0
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma1 -> 
../../devices/pci:5d/:5d:02.0/:5e:00.0/:5f:03.0/:60:00.1/infiniband/irdma1


This might disturb the logic in waitforib.sh, or at least cause some 
confusion?


One advantage of Max's script using NetworkManager is that nmcli isn't 
fooled by the fake irdma Infiniband device:


# nmcli connection show
NAME  UUID  TYPE  DEVICE
eno1  cb0937f8-1902-48f7-8139-37cf0c4077b2  ethernet  eno1
eno2  98130354-9215-412e-ab26-032c76c2dbe4  ethernet  --

I found a discussion of the mysterious irdma device in
https://github.com/prometheus/node_exporter/issues/2769
with this explanation:


The irdma module is Intel's replacement for the legacy i40iw module, which was the 
iWARP driver for the Intel X722. The irdma module is a complete rewrite, which 
landed in mainline kernel 5.14, and which also now supports the Intel E810 (iWARP 
& RoCE).


The Infiniband commands also work on the fake device, claiming that it 
runs 100 Gbit/s:


# ibstatus
Infiniband device 'irdma0' port 1 status:
default gid: 3cec:ef38:d960:::::
base lid:0x1
sm lid:  0x0
state:   4: ACTIVE
phys state:  5: LinkUp
rate:100 Gb/sec (4X EDR)
link_layer:  Ethernet

Infiniband device 'irdma1' port 1 status:
default gid: 3cec:ef38:d961:::::
base lid:0x1
sm lid:  0x0
state:   1: DOWN
phys state:  3: Disabled
rate:100 Gb/sec (4X EDR)
link_layer:  Ethernet

IMHO, this seems quite confusing.

Regarding the slurmd service:


And we add it as a dependency for slurmd:

$ cat /etc/systemd/system/slurmd.service.d/wait.conf

[Service]
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
LimitMEMLOCK=infinity

[Unit]
After=waitforib.service
Requires=munge.service
Wants=waitforib.service


An alternative to this extra service would be like Max's service file 
https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service 
which has:

Before=network-online.target

What do you think of these considerations?

Best regards,
Ole


On 2/11/2023 09:28, Ole Holm Nielsen wrote:

Hi Ward,

Thanks a lot for the feedback!  The method of probing 
/sys/class/infiniband/*/ports/*/state is also used in the NHC script 
lbnl_hw.nhc and has the advantage of not depending on the nmcli command 
from the NetworkManager package.


Can I ask you how you implement your script as a service in the Systemd 
booting process, perhaps similar to Max's solution in 
https://github.com/maxlxl/network.target_wait-for-interfaces ?


Thanks,
Ole

On 11/1/23 20:09, Ward Poelmans wrote:
We have a slightly different script to do the same. It only relies on 
/sys:


# Search for infiniband devices and check waits until
# at least one reports that it is ACTIVE

if [[ ! -d /sys/class/infiniband ]]
then
 logger "No infiniband found"
 exit 0
fi

ports=$(ls /sys/class/infiniband/*/ports/*/state)

for (( count = 0; count < 300; count++ ))
do
 for port in ${ports}; do
 if grep -qc ACTIVE $port; then
 logger "Infiniband online at $port"
 exit 0
 fi
 done
 sleep 1
done

logger "Failed to find an active infiniband interface"
exit 1







Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-02 Thread Ole Holm Nielsen

Hi Ward,

Thanks a lot for the feedback!  The method of probing 
/sys/class/infiniband/*/ports/*/state is also used in the NHC script 
lbnl_hw.nhc and has the advantage of not depending on the nmcli command 
from the NetworkManager package.


Can I ask you how you implement your script as a service in the Systemd 
booting process, perhaps similar to Max's solution in 
https://github.com/maxlxl/network.target_wait-for-interfaces ?


Thanks,
Ole

On 11/1/23 20:09, Ward Poelmans wrote:

We have a slightly different script to do the same. It only relies on /sys:

# Search for infiniband devices and check waits until
# at least one reports that it is ACTIVE

if [[ ! -d /sys/class/infiniband ]]
then
     logger "No infiniband found"
     exit 0
fi

ports=$(ls /sys/class/infiniband/*/ports/*/state)

for (( count = 0; count < 300; count++ ))
do
     for port in ${ports}; do
     if grep -qc ACTIVE $port; then
     logger "Infiniband online at $port"
     exit 0
     fi
     done
     sleep 1
done

logger "Failed to find an active infiniband interface"
exit 1
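
For reference, a minimal systemd unit wrapping a script like the one above 
could look as follows (a sketch only; the unit name and script path are 
assumptions, and the ordering follows the Before=network-online.target 
approach discussed elsewhere in this thread):

  [Unit]
  Description=Wait for an active InfiniBand port
  Before=network-online.target

  [Service]
  Type=oneshot
  ExecStart=/usr/local/sbin/waitforib.sh
  RemainAfterExit=yes

  [Install]
  WantedBy=network-online.target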




Re: [slurm-users] RES: RES: How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ole Holm Nielsen
I would like to report how the Infiniband/OPA network device starts up 
step by step as reported by Max's Systemd service from 
https://github.com/maxlxl/network.target_wait-for-interfaces


This is the sequence of events during boot:

$ grep wait-for-interfaces.sh /var/log/messages
Nov  1 16:13:39 d064 wait-for-interfaces.sh[1610]: Wait for network devices
Nov  1 16:13:39 d064 wait-for-interfaces.sh[1610]: Available connections are:
Nov  1 16:13:40 d064 wait-for-interfaces.sh[1613]: NAMEUUID 
  TYPEDEVICE
Nov  1 16:13:40 d064 wait-for-interfaces.sh[1613]: eno8403 
1108d0aa-8841-4f2e-b42e-bd9509a2aba0  ethernet--
Nov  1 16:13:40 d064 wait-for-interfaces.sh[1613]: System eno8303 
44931a14-005a-415d-a82b-8c1a2007a118  ethernet--
Nov  1 16:13:40 d064 wait-for-interfaces.sh[1613]: System ib0 
2ab4abde-b8a5-6cbc-19b1-2bfb193e4e89  infiniband  --
Nov  1 16:13:40 d064 wait-for-interfaces.sh[2011]: Error: Device 'ib0' not 
found.
Nov  1 16:13:41 d064 wait-for-interfaces.sh[2127]: Error: Device 'ib0' not 
found.
Nov  1 16:13:41 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
ib0 to come online:
Nov  1 16:13:42 d064 wait-for-interfaces.sh[2134]: Error: Device 'ib0' not 
found.
Nov  1 16:13:42 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
ib0 to come online:
Nov  1 16:13:43 d064 wait-for-interfaces.sh[2148]: Error: Device 'ib0' not 
found.
Nov  1 16:13:43 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
ib0 to come online:
Nov  1 16:13:44 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
ib0 to come online: 20 (unavailable)
Nov  1 16:13:45 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
ib0 to come online: 20 (unavailable)
Nov  1 16:13:46 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
ib0 to come online: 20 (unavailable)
Nov  1 16:13:47 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
ib0 to come online: 20 (unavailable)
Nov  1 16:13:48 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
ib0 to come online: 20 (unavailable)
Nov  1 16:13:49 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
ib0 to come online: 20 (unavailable)
Nov  1 16:13:50 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
ib0 to come online: 20 (unavailable)
Nov  1 16:13:51 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
ib0 to come online: 20 (unavailable)
Nov  1 16:13:52 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
100 (connected)ib0 to come online: 20 (unavailable)
Nov  1 16:13:53 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
ib0 to come online: 80 (connecting (checking IP connectivity))
Nov  1 16:13:54 d064 wait-for-interfaces.sh[1610]: Waiting for interface 
ib0 to come online: 100 (connected)


As you can see there are many intermediate steps before the "100 
(connected)" status reports that ib0 is up.


The slurmd service will only start after this, which is what we wanted.

Best regards,
Ole

On 11/1/23 14:03, Paulo Jose Braga Estrela wrote:

Ole,

Look at the NetworkManager-wait-online.service man page bellow (from RHEL 8.8). 
Maybe your IB interfaces aren't properly configured in NetworkManager. The *** 
were added by me.

" NetworkManager-wait-online.service blocks until NetworkManager logs "startup 
complete" and announces startup
complete on D-Bus. How long that takes depends on the network and the 
NetworkManager configuration. If it
takes longer than expected, then the reasons need to be investigated in 
NetworkManager.

There are various reasons what affects NetworkManager reaching "startup 
complete" and how long
NetworkManager-wait-online.service blocks.

·   In general, ***startup complete is not reached as long as 
NetworkManager is busy activating a device and as
long as there are profiles in activating state ***. During boot, 
NetworkManager starts autoactivating
suitable profiles that are ***configured to autoconnect***. If 
activation fails, NetworkManager might retry
right away (depending on connection.autoconnect-retries setting). 
While trying and retrying,
NetworkManager is busy until all profiles and devices either 
reached an activated or disconnected state
and no further events are expected.

***Basically, as long as there are devices and connections in 
activating state visible with nmcli device
and nmcli connection, startup is still pending. ***"



-----Original Message-----
From: slurm-users  On behalf of Ole Holm Nielsen
Sent: Wednesday, 1 November 2023 05:19
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] RES: How to delay the start of slurmd until 
Infiniband/OPA network is fully up?

Hi Paulo,

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ole Holm Nielsen

Hi Rémi,

Thanks for the feedback!  The patch revert[1] explains SchedMD's reason:


The reasoning is that sysadmins who see nodes with Reason "Not Responding"
but they can manually ping/access the node end up confused. That reason
should only be set if the node is trully not responding, but not if the
HealthCheckProgram execution failed or returned non-zero exit code. For
that case, the program itself would take the appropiate actions, such
as draining the node and setting an appropiate Reason.


We speculate that there may be an issue with slurmd starting up 
at boot time and starting new jobs, while NHC is running in a separate 
thread and possibly fails the node AFTER the job has started!  NHC might 
fail, for example, if an Infiniband/OPA network or NVIDIA GPUs have not 
yet started up completely.


I still need to verify whether this observation is correct and 
reproducible.  Does anyone have evidence that jobs start before NHC is 
complete when slurmd starts up?


IMHO, slurmd ought to start up without delay at boot time, then execute 
the NHC and wait for it to complete.  Only after NHC has succeeded without 
errors should slurmd begin accepting new jobs.


We should configure NHC to make site-specific hardware and network checks, 
for example for Infiniband/OPA network or NVIDIA GPUs.


Best regards,
Ole

On 11/1/23 09:44, Rémi Palancher wrote:

Hi Ole,

Le 30/10/2023 à 13:50, Ole Holm Nielsen a écrit :

I'm fighting this strange scenario where slurmd is started before the
Infiniband/OPA network is fully up.  The Node Health Check (NHC) executed
by slurmd then fails the node (as it should).  This happens only on EL8
Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with
Infiniband/OPA network work without problems.

Question: Does anyone know how to reliably delay the start of the slurmd
Systemd service until the Infiniband/OPA network is fully up?

…


FWIW, after a while struggling with systemd dependencies to wait for
availability of networks and shared filesystems, we ended up with a
customer writing a patch in Slurm to delay slurmd registration (and jobs
start) until NHC is OK:

https://github.com/scibian/slurm-wlm/blob/scibian/buster/debian/patches/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

For the record, this patch was once merged in Slurm and then reverted[1]
for reasons I did not fully explore.

This approach is far from your original idea, it is clearly not ideal
and should be taken with caution but it works for years for this customer.

[1]
https://github.com/SchedMD/slurm/commit/b31fa177c1ca26dcd2d5cd952e692ef87d95b528



--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620



Re: [slurm-users] RES: How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ole Holm Nielsen

Hi Paulo,

On 11/1/23 01:12, Paulo Jose Braga Estrela wrote:

I think that you should use NetworkManager-wait-online.service In RHEL 8. Take 
a look at its man page. It only allows the system reach network-online after 
all network interfaces are online. So, if your OP interfaces are managed by 
Network Manager, you can use it.


Unfortunately NetworkManager-wait-online.service returns as soon as 1 
network interface is up.  It doesn't wait for any other networks, 
including the Infiniband/OPA network, unfortunately :-(


You can see that the NetworkManager-wait-online.service file executes:

ExecStart=/usr/bin/nm-online -s -q

and this is causing our problems with Infiniband/OPA networks.  This is 
the reason why we need Max's workaround wait-for-interfaces.service.


/Ole



-----Original Message-----
From: slurm-users  On behalf of Ole Holm Nielsen
Sent: Tuesday, 31 October 2023 07:00
To: Slurm User Community List 
Subject: Re: [slurm-users] How to delay the start of slurmd until 
Infiniband/OPA network is fully up?

Hi Jeffrey,

On 10/30/23 20:15, Jeffrey R. Lang wrote:

The service is available in RHEL 8 via the EPEL package repository as 
system-networkd, i.e. systemd-networkd.x86_64   
253.4-1.el8epel


Thanks for the info.  We can install the systemd-networkd RPM from the EPEL 
repo as you suggest.

I tried to understand the properties of systemd-networkd before implementing it 
in our compute nodes.  While there are lots of networkd man-pages, it's harder 
to find an overview of the actual properties of networkd.  This is what I found:

* Networkd is a service included in recent versions of Systemd.  It seems to be 
an alternative to NetworkManager.

* Red Hat has stated that systemd-networkd is NOT going to be implemented in 
RHEL 8 or 9.

* Comparing systemd-networkd and NetworkManager:
https://fedoracloud.readthedocs.io/en/latest/networkd.html

* Networkd is described in the Wikipedia article
https://en.wikipedia.org/wiki/Systemd

While networkd seems to be really nifty, I hesitate to replace NetworkManager 
by networkd on our EL8 and EL9 systems because this is an unsupported and only 
lightly tested setup, and it may require additional work to keep our systems 
up-to-date in the future.

It seems to me that Max Rutkowski's solution in
https://github.com/maxlxl/network.target_wait-for-interfaces is less intrusive 
than converting to systemd-networkd.

Best regards,
Ole



-Original Message-
From: slurm-users  On Behalf Of
Ole Holm Nielsen
Sent: Monday, October 30, 2023 1:56 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] How to delay the start of slurmd until 
Infiniband/OPA network is fully up?

◆ This message was sent from a non-UWYO address. Please exercise caution when 
clicking links or opening attachments from external sources.


Hi Jens,

Thanks for your feedback:

On 30-10-2023 15:52, Jens Elkner wrote:

Actually there is no need for such a script since
/lib/systemd/systemd-networkd-wait-online should be able to handle it.


It seems that systemd-networkd exists in Fedora FC38 Linux, but not in
RHEL 8 and clones, AFAICT.




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-31 Thread Ole Holm Nielsen

Hi Jeffrey,

On 10/30/23 20:15, Jeffrey R. Lang wrote:

The service is available in RHEL 8 via the EPEL package repository as 
system-networkd, i.e. systemd-networkd.x86_64   
253.4-1.el8epel


Thanks for the info.  We can install the systemd-networkd RPM from the 
EPEL repo as you suggest.


I tried to understand the properties of systemd-networkd before 
implementing it in our compute nodes.  While there are lots of networkd 
man-pages, it's harder to find an overview of the actual properties of 
networkd.  This is what I found:


* Networkd is a service included in recent versions of Systemd.  It seems 
to be an alternative to NetworkManager.


* Red Hat has stated that systemd-networkd is NOT going to be implemented 
in RHEL 8 or 9.


* Comparing systemd-networkd and NetworkManager: 
https://fedoracloud.readthedocs.io/en/latest/networkd.html


* Networkd is described in the Wikipedia article 
https://en.wikipedia.org/wiki/Systemd


While networkd seems to be really nifty, I hesitate to replace 
NetworkManager by networkd on our EL8 and EL9 systems because this is an 
unsupported and only lightly tested setup, and it may require additional 
work to keep our systems up-to-date in the future.


It seems to me that Max Rutkowski's solution in 
https://github.com/maxlxl/network.target_wait-for-interfaces is less 
intrusive than converting to systemd-networkd.


Best regards,
Ole



-Original Message-
From: slurm-users  On Behalf Of Ole Holm 
Nielsen
Sent: Monday, October 30, 2023 1:56 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] How to delay the start of slurmd until 
Infiniband/OPA network is fully up?

◆ This message was sent from a non-UWYO address. Please exercise caution when 
clicking links or opening attachments from external sources.


Hi Jens,

Thanks for your feedback:

On 30-10-2023 15:52, Jens Elkner wrote:

Actually there is no need for such a script since
/lib/systemd/systemd-networkd-wait-online should be able to handle it.


It seems that systemd-networkd exists in Fedora FC38 Linux, but not in
RHEL 8 and clones, AFAICT.




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen

Hi Jens,

Thanks for your feedback:

On 30-10-2023 15:52, Jens Elkner wrote:

Actually there is no need for such a script since
/lib/systemd/systemd-networkd-wait-online should be able to handle it.


It seems that systemd-networkd exists in Fedora FC38 Linux, but not in 
RHEL 8 and clones, AFAICT.


/Ole




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen

Hi Max,

Thanks so much for your fast response with a solution!  I didn't know that 
NetworkManager (falsely) claims that the network is online as soon as the 
first interface comes up :-(


Your solution of a wait-for-interfaces Systemd service makes a lot of 
sense, and I'm going to try it out.


Best regards,
Ole

On 10/30/23 14:30, Max Rutkowski wrote:

Hi,

we're not using Omni-Path but also had issues with Infiniband taking too 
long and slurmd failing to start due to that.


Our solution was to implement a little wait-for-interface systemd service 
which delays the network.target until the ib interface has come up.


Our discovery was that the network-online.target is triggered by the 
NetworkManager as soon as the first interface is connected.


I've put the solution we use on my GitHub: 
https://github.com/maxlxl/network.target_wait-for-interfaces


You may need to do small adjustments, but it's pretty straightforward in 
general.


Kind regards
Max

On 30.10.23 13:50, Ole Holm Nielsen wrote:
I'm fighting this strange scenario where slurmd is started before the 
Infiniband/OPA network is fully up.  The Node Health Check (NHC) 
executed by slurmd then fails the node (as it should).  This happens 
only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes 
with Infiniband/OPA network work without problems.


Question: Does anyone know how to reliably delay the start of the slurmd 
Systemd service until the Infiniband/OPA network is fully up?


Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not 
Mellanox IB.  On AlmaLinux 8.8 we use the in-distro OPA drivers since 
the CornelisNetworks drivers are not available for RHEL 8.8.

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620


The details:

The slurmd service is started by the service file 
/usr/lib/systemd/system/slurmd.service after the "network-online.target" 
has been reached.


It seems that NetworkManager reports "network-online.target" BEFORE the 
Infiniband/OPA device ib0 is actually up, and this seems to be the cause 
of our problems!


Here are some important sequences of events from the syslog showing that 
the network goes online before the Infiniband/OPA network (hfi1_0 
adapter) is up:


Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
(lines deleted)
Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check failed: 
rc:1 output:ERROR:  nhc:  Health check failed: check_hw_ib:  No IB port 
is ACTIVE (LinkUp 100 Gb/sec).

(lines deleted)
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: 8051: Link up
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: set_link_state: 
current GOING_UP, new INIT (LINKUP)
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: physical state 
changed to PHYS_LINKUP (0x5), phy 0x50


I tried to delay the NetworkManager "network-online.target" by setting a 
wait on the ib0 device and reboot, but that seems to be ignored:


$ nmcli -p connection modify "System ib0" 
connection.connection.wait-device-timeout 20


I'm hoping that other sites using Omni-Path have seen this and maybe can 
share a fix or workaround?


Of course we could remove the Infiniband check in Node Health Check 
(NHC), but that would not really be acceptable during operations.


Thanks for sharing any insights,
Ole


--
Max Rutkowski
IT-Services und IT-Betrieb
Tel.: +49 (0)331/6264-2341
E-Mail: max.rutkow...@gfz-potsdam.de
___

Helmholtz-Zentrum Potsdam
*Deutsches GeoForschungsZentrum GFZ*
Stiftung des öff. Rechts Land Brandenburg
Telegrafenberg, 14473 Potsdam




[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen
I'm fighting this strange scenario where slurmd is started before the 
Infiniband/OPA network is fully up.  The Node Health Check (NHC) executed 
by slurmd then fails the node (as it should).  This happens only on EL8 
Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with 
Infiniband/OPA network work without problems.


Question: Does anyone know how to reliably delay the start of the slurmd 
Systemd service until the Infiniband/OPA network is fully up?


Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not 
Mellanox IB.  On AlmaLinux 8.8 we use the in-distro OPA drivers since the 
CornelisNetworks drivers are not available for RHEL 8.8.


The details:

The slurmd service is started by the service file 
/usr/lib/systemd/system/slurmd.service after the "network-online.target" 
has been reached.


It seems that NetworkManager reports "network-online.target" BEFORE the 
Infiniband/OPA device ib0 is actually up, and this seems to be the cause 
of our problems!


Here are some important sequences of events from the syslog showing that 
the network goes online before the Infiniband/OPA network (hfi1_0 adapter) 
is up:


Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
(lines deleted)
Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check failed: 
rc:1 output:ERROR:  nhc:  Health check failed:  check_hw_ib:  No IB port 
is ACTIVE (LinkUp 100 Gb/sec).

(lines deleted)
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: 8051: Link up
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: set_link_state: 
current GOING_UP, new INIT (LINKUP)
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: physical state 
changed to PHYS_LINKUP (0x5), phy 0x50


I tried to delay the NetworkManager "network-online.target" by setting a 
wait on the ib0 device and reboot, but that seems to be ignored:


$ nmcli -p connection modify "System ib0" 
connection.connection.wait-device-timeout 20


I'm hoping that other sites using Omni-Path have seen this and maybe can 
share a fix or workaround?


Of course we could remove the Infiniband check in Node Health Check (NHC), 
but that would not really be acceptable during operations.


Thanks for sharing any insights,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark



Re: [slurm-users] RES: Change something in user's script using job_submit.lua plugin

2023-10-28 Thread Ole Holm Nielsen

Hi Paulo,

Maybe what you see is due to a bug then?  You might try to update Slurm 
to see if it has been fixed.


You should not use the Slurm RPMs from EPEL - I think offering these 
RPMs was a mistake.


Anyway you ought to upgrade to the latest Slurm 23.02.6 since a serious 
security issue was fixed a couple of weeks ago.  Older Slurm versions 
are all affected!  Perhaps this Wiki guide can help you upgrade to the 
latest RPM: https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/


/Ole


On 27-10-2023 13:13, Paulo Jose Braga Estrela wrote:

Yes, the script is running and changing other fields like comment, partition, 
account is working fine. The only problem seems to be the script field of 
job_rec. I'm using Slurm 20.11.9 from EPEL repository for RHEL 8. Thank you for 
sharing your Wiki. I've accessed it before. It's really useful for HPC 
engineers.

Best regards,


-----Original Message-----
From: slurm-users  On behalf of Ole Holm Nielsen
Sent: Friday, 27 October 2023 03:31
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Change something in user's script using 
job_submit.lua plugin

Hi Paulo,

Which Slurm version do you have, and did you set this in slurm.conf:
JobSubmitPlugins=lua ?

Perhaps you may find some useful information in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#job-submit-plugins

/Ole


On 26-10-2023 19:07, Paulo Jose Braga Estrela wrote:

Is it possible to change something in user’s sbatch script by using a
job_submit plugin? To be more specific, using Lua job_submit plugin.

I’m trying to do the following in job_submit.lua when a user changes
job’s partition to “cloud” partition, but the script got executed
without modification.

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    if job_desc.partition == "cloud" then
        slurm.log_info("slurm_job_modify: Bursting job %u from uid %u to the cloud...", job_rec.job_id, modify_uid)
        script = job_rec.script
        slurm.log_info("Script BEFORE change: %s", script)
        -- changing user command to another command
        script = string.gsub(script, "local command", "cloud command")
        slurm.log_info("Script AFTER change %s", script)
        -- The script variable is really changed
        job_rec.script = script
        slurm.log_info("Job RECORD SCRIPT %s", job_rec.script)
        -- The job record also got changed, but the EXECUTED script isn't changed at all. It runs without modification.
    end
    return slurm.SUCCESS
end

*PAULO ESTRELA*





Re: [slurm-users] Change something in user's script using job_submit.lua plugin

2023-10-27 Thread Ole Holm Nielsen

Hi Paulo,

Which Slurm version do you have, and did you set this in slurm.conf: 
JobSubmitPlugins=lua ?


Perhaps you may find some useful information in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#job-submit-plugins

/Ole


On 26-10-2023 19:07, Paulo Jose Braga Estrela wrote:
Is it possible to change something in user’s sbatch script by using a 
job_submit plugin? To be more specific, using Lua job_submit plugin.


I’m trying to do the following in job_submit.lua when a user changes 
job’s partition to “cloud” partition, but the script got executed 
without modification.


function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    if job_desc.partition == "cloud" then
        slurm.log_info("slurm_job_modify: Bursting job %u from uid %u to the cloud...", job_rec.job_id, modify_uid)
        script = job_rec.script
        slurm.log_info("Script BEFORE change: %s", script)
        -- changing user command to another command
        script = string.gsub(script, "local command", "cloud command")
        slurm.log_info("Script AFTER change %s", script)
        -- The script variable is really changed
        job_rec.script = script
        slurm.log_info("Job RECORD SCRIPT %s", job_rec.script)
        -- The job record also got changed, but the EXECUTED script isn't changed at all. It runs without modification.
    end
    return slurm.SUCCESS
end

*PAULO ESTRELA*





Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-25 Thread Ole Holm Nielsen

Hi Tim,

I think the scontrol manual page explains the "scontrol reboot" function 
fairly well:



   reboot [ASAP] [nextstate={RESUME|DOWN}] [reason=<reason>] {ALL|<NodeList>}
  Reboot the nodes in the system when they become idle  using  the
  RebootProgram  as  configured  in Slurm's slurm.conf file.  Each
  node will have the "REBOOT" flag added to its node state.  After
  a  node  reboots  and  the  slurmd  daemon  starts up again, the
  HealthCheckProgram will run once. Then, the slurmd  daemon  will
  register  itself with the slurmctld daemon and the "REBOOT" flag
  will be cleared.  The node's "DRAIN" state flag will be  cleared
  if  the reboot was "ASAP", nextstate=resume or down.  The "ASAP"
  option adds the "DRAIN" flag to each  node's  state,  preventing
  additional  jobs  from running on the node so it can be rebooted
  and returned to service  "As  Soon  As  Possible"  (i.e.  ASAP).


It seems to be implicitly understood that if nextstate is specified, this 
implies setting the "DRAIN" state flag:


The node's "DRAIN" state flag will be  cleared if the reboot was "ASAP", nextstate=resume or down. 


You can verify the node's "DRAIN" flag with "scontrol show node <nodename>".

IMHO, if you want nodes to continue accepting new jobs, then nextstate is 
irrelevant.


We always use "reboot ASAP" because our cluster is usually so busy that 
nodes never become idle if left to themselves :-)
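
For example (node names are hypothetical):

  scontrol reboot ASAP nextstate=RESUME reason="image update" node[001-010]
  scontrol show node node001 | grep -iE 'state|reason'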


FYI: We regularly make package updates and firmware updates using the 
"scontrol reboot asap" method which is explained in this script:

https://github.com/OleHolmNielsen/Slurm_tools/blob/master/nodes/update.sh

Best regards,
Ole


On 10/25/23 13:39, Tim Schneider wrote:

Hi Chris,

thanks a lot for your response.

I just realized that I made a mistake in my post. In the section you cite, 
the command is supposed to be "scontrol reboot nextstate=RESUME" (without 
ASAP).


So to clarify: my problem is that if I type "scontrol reboot 
nextstate=RESUME" no new jobs get scheduled anymore until the reboot. On 
the other hand, if I type "scontrol reboot", jobs continue to get 
scheduled, which is what I want. I just don't understand, why setting 
nextstate results in the nodes not accepting jobs anymore.


My usecase is similar to the one you describe. We use the ASAP option when 
we install a new image to ensure that from the point of the reinstallation 
onwards, all jobs end up on nodes with the new configuration only. 
However, in some cases when we do only minor changes to the image 
configuration, we prefer to cause as little disruption as possible and 
just reinstall the nodes whenever they are idle. Here, being able to set 
nextstate=RESUME is useful, since we usually want the nodes to resume 
after reinstallation, no matter what their previous state was.


Hope that clears it up and sorry for the confusion!

Best,

tim

On 25.10.23 02:10, Christopher Samuel wrote:

On 10/24/23 12:39, Tim Schneider wrote:


Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME
", the node goes in "mix@" state (not drain), but no new jobs get
scheduled until the node reboots. Essentially I get draining behavior,
even though the node's state is not "drain". Note that this behavior is
caused by "nextstate=RESUME"; if I leave that away, jobs get scheduled
as expected. Does anyone have an idea why that could be?

The intent of the "ASAP` flag for "scontrol reboot" is to not let any
more jobs onto a node until it has rebooted.

IIRC that was from work we sponsored, the idea being that (for how our
nodes are managed) we would build new images with the latest software
stack, test them on a separate test system and then once happy bring
them over to the production system and do an "scontrol reboot ASAP
nextstate=resume reason=... $NODES" to ensure that from that point
onwards no new jobs would start in the old software configuration, only
the new one.

Also slurmctld would know that these nodes are due to come back in
"ResumeTimeout" seconds after the reboot is issued and so could plan for
them as part of scheduling large jobs, rather than thinking there was no
way it could do so and letting lots of smaller jobs get in the way.




Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-13 Thread Ole Holm Nielsen

On 10/13/23 12:22, Taras Shapovalov wrote:

Oh, does this mean that no one should use Slurm versions <= 21.08 any more?


SchedMD recommends to use the currently supported versions (currently 
22.05 or 23.02).  Next month 23.11 will be released and 22.05 will become 
unsupported.


The question for sites is whether they can accept running software that 
contains known security holes?  That goes for Slurm as well as all other 
software such as the Linux kernel etc. etc.  We don't yet know the CVE 
score for CVE-2023-41914, but SchedMD's description of the fixes sounds 
pretty serious.


IMHO, your organization's IT security policy should be consulted in order 
to answer your question.


/Ole



Re: [slurm-users] Slurm powersave

2023-10-06 Thread Ole Holm Nielsen

Hi Davide,

On 10/5/23 15:28, Davide DelVento wrote:

IMHO, "pretending" to power down nodes defies the logic of the Slurm
power_save plugin. 


And it is sure useless ;)
But I was using the suggestion from 
https://slurm.schedmd.com/power_save.html 
 which says


You can also configure Slurm with programs that perform no action as 
*SuspendProgram* and *ResumeProgram* to assess the potential impact of 
power saving mode before enabling it.


I had not noticed the above sentence in the power_save manual before!  So 
I decided to test a "no action" power saving script, similar to what you 
have done, applying it to a test partition.  I conclude that "no action" 
power saving DOES NOT WORK, at least in Slurm 23.02.5.  So I opened a bug 
report https://bugs.schedmd.com/show_bug.cgi?id=17848 to find out if the 
documentation is obsolete, or if there may be a bug.  Please follow that 
bug to find out the answer from SchedMD.


What I *believe* (but not with 100% certainty) really happens with power 
saving in the current Slurm versions is what I wrote yesterday:



Slurmctld expects suspended nodes to *really* power
down (slurmd is stopped).  When slurmctld resumes a suspended node, it
expects slurmd to start up when the node is powered on.  There is a
ResumeTimeout parameter which I've set to about 15-30 minutes in case of
delays due to BIOS updates and the like - the default of 60 seconds is
WAY too small!
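
In slurm.conf terms that corresponds to something like the following (the 
values are illustrative only, not a recommendation):

  SuspendTimeout=120
  ResumeTimeout=1800   # 15-30 minutes leaves room for BIOS/firmware updates at boot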


I hope this helps,
Ole



Re: [slurm-users] Slurm powersave

2023-10-05 Thread Ole Holm Nielsen

Hi Davide,

On 10/4/23 23:03, Davide DelVento wrote:
I'm experimenting with slurm powersave and I have several questions. I'm 
following the guidance from https://slurm.schedmd.com/power_save.html 
 and the great presentation 
from our own https://slurm.schedmd.com/SLUG23/DTU-SLUG23.pdf 



I presented that talk at SLUG'23 :-)


I am running slurm 23.02.3

1) I'm not sure I fully understand ReconfigFlags=KeepPowerSaveSettings
The documentations ways that if set, an "scontrol reconfig" command will 
preserve the current state of SuspendExcNodes, SuspendExcParts and 
SuspendExcStates. Why would one *NOT* want to preserve that? What would 
happen if one does not (or does) have this setting? For now I'm using it, 
assuming that it means "if I run scontrol reconfig" don't shut off nodes 
that are up because I said so that they should be up in slurm.conf with 
those three options" --- but I am not clear if that is really what it says.


As I understand it, without this flag any power-save settings you have 
updated using scontrol will be lost when slurmctld is reconfigured, and the 
settings from slurm.conf will be used instead; with KeepPowerSaveSettings 
those runtime changes are preserved.


2) the PDF above says that the problem with nodes in down and drained 
state is solved in 23.02 but that does not appear to be the case. Before 
running my experiment, I had


$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2023-09-13T13:14:50 node31
ECC memory errors    root      2023-08-26T07:21:04 node27

and after it became

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2023-09-13T13:14:50 node31
none                 Unknown   Unknown             node27


Please use "sinfo -lR" so that we can see the node STATE.


And that despite having excluded drain'ed nodes as below:

--- a/slurm/slurm.conf
+++ b/slurm/slurm.conf
@@ -140,12 +140,15 @@ SlurmdLogFile=/var/log/slurm/slurmd.log
  #
  #
  # POWER SAVE SUPPORT FOR IDLE NODES (optional)
+SuspendProgram=/opt/slurm/poweroff
+ResumeProgram=/opt/slurm/poweron
+SuspendTimeout=120
+ResumeTimeout=240
  #ResumeRate=
+SuspendExcNodes=node[13-32]:2
+SuspendExcStates=down,drain,fail,maint,not_responding,reserved
+BatchStartTimeout=60
+ReconfigFlags=KeepPowerSaveSettings # not sure if needed: preserve 
current status when running "scontrol reconfig"
-PartitionName=compute512 Default=False Nodes=node[13-32] State=UP 
DefMemPerCPU=9196
+PartitionName=compute512 Default=False Nodes=node[13-32] State=UP 
DefMemPerCPU=9196 SuspendTime=600


so probably that's not solved? Anyway, that's a nuisance, not a deal breaker


With my 23.02.5 the SuspendExcStates is working as documented :-)

3) The whole thing does not appear to be working as I intended. My 
understanding of the "exclude node" above should have meant that slurm 
should never attempt to shut off more than all idle nodes in that 
partition minus 2. Instead it shut them off all of them, and then tried to 
turn them back on:


$ sinfo | grep 512
compute512     up   infinite      1 alloc# node15
compute512     up   infinite      2  idle# node[14,32]
compute512     up   infinite      3  down~ node[16-17,31]
compute512     up   infinite      1 drain~ node27
compute512     up   infinite     12  idle~ node[18-26,28-30]
compute512     up   infinite      1  alloc node13


I agree that 2 nodes from node[13-32] shouldn't be suspended, according to 
SuspendExcNodes in the slurm.conf manual.  I haven't tested this feature.


But again this is a minor nuisance which I can live with (especially if it 
happens only when I "flip the switch"), and I'm mentioning only in case 
it's a symptom of something else I'm doing wrong. I did try to use both 
the SuspendExcNodes=node[13-32]:2 syntax as it seem more reasonable to me 
(compared to the rest of the file, e.g. partitions definition) and the 
SuspendExcNodes=node[13\-32]:2 as suggested in the slurm powersave 
documentation. Behavior, exactly identical


4) Most importantly from the output above you may have noticed two nodes 
(actually three by the time I ran the command below) that slurm deemed down


$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2023-09-13T13:14:50 node31
reboot timed out     slurm     2023-10-04T14:51:28 node14
reboot timed out     slurm     2023-10-04T14:52:28 node15
reboot timed out     slurm     2023-10-04T14:49:58 node32
none                 Unknown   Unknown             node27

This can't be the case, the nodes are fine, and cannot have timed out 
while "rebooting", because for now my poweroff and poweron script are 
identical and literally a simple one-liner bash script doing almost 
nothing and the log file is populated correctly as I would expect


echo "Pretending to $0 the following node(s): $1"  >> $log_file 2>&1

So I can confirm slurm invoked the script, but then waited 

Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-29 Thread Ole Holm Nielsen

On 29-09-2023 17:33, Ryan Novosielski wrote:
I’ll just say, we haven’t done an online/jobs running upgrade recently 
(in part because we know our database upgrade will take a long time, and 
we have some processes that rely on -M), but we have done it and it does 
work fine. So the paranoia isn’t necessary unless you know that, like 
us, the DB upgrade time is not tenable (Ole’s wiki has some great 
suggestions for how to test that, but they aren’t especially Slurm 
specific, it’s just a dry-run).


Slurm upgrades are clearly documented by SchedMD, and there's no reason 
to worry if you follow the official procedures.  At least, it has always 
worked for us :-)


Just my 2 cents: The detailed list of upgrade steps/commands (first dbd, 
then ctld, then slurmds, finally login nodes) are documented in my Wiki 
page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm


The Slurm dbd upgrade instructions in 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#make-a-dry-run-database-upgrade 
are totally Slurm specific, since that's the only database upgrade I've 
ever made :-)  I highly recommend doing the database dry-run upgrade on 
a test node before doing the real dbd upgrade!


/Ole



Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-29 Thread Ole Holm Nielsen

On 9/28/23 17:58, Groner, Rob wrote:
There's 14 steps to upgrading slurm listed on their website, including 
shutting down and backing up the database.  So far we've only updated 
slurm during a downtime, and it's been a major version change, so we've 
taken all the steps indicated.


We now want to upgrade from 23.02.4 to 23.02.5.

Our slurm builds end up in version named directories, and we tell 
production which one to use via symlink.  Changing the symlink will 
automatically change it on our slurm controller node and all slurmd nodes.


Is there an expedited, simple, slimmed down upgrade path to follow if 
we're looking at just a . level upgrade?


Upgrading minor releases usually works without any problems.  We use RPMs 
instead of the symlink method, so I can't speak for symlinks.  I recommend 
reading the latest SLUG'23 presentations and Jason Booth's 
slides https://slurm.schedmd.com/SLUG23/Field-Notes-7.pdf page 26+.


For those using the RPM method, I have collected some notes in my Wiki 
page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm


/Ole



Re: [slurm-users] question about configuration in slurm.conf

2023-09-26 Thread Ole Holm Nielsen

On 9/26/23 14:50, Groner, Rob wrote:
There's a builtin slurm command, I can't remember what it is and google is 
failing me, that will take a compacted list of nodenames and return their 
full names, and I'm PRETTY sure it will do the opposite as well (what 
you're asking for).


It's probably sinfo or scontrol... maybe an sutil if that exists.


The command would be:

scontrol show hostname awn-0[01-32,46-77,95-99]
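
The opposite direction works too, collapsing a plain comma-separated list into 
a hostlist expression (the node names here are just examples):

  scontrol show hostlist awn-001,awn-002,awn-100,awn-199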

/Ole


--
*From:* slurm-users  on behalf of 
Felix 

*Sent:* Tuesday, September 26, 2023 7:22 AM
*To:* Slurm User Community List 
*Subject:* [slurm-users] question about configuration in slurm.conf
hello

I have at my site the following work nodes

awn001 ... awn099

and then it continues awn100 ... awn199

How can I configure this line

PartitionName=debug Nodes=awn-0[01-32,46-77,95-99] Default=YES
MaxTime=INFINITE State=UP

so that it can contain the nodes from 001 to 199

can I write:

PartitionName=debug Nodes=awn-0[01-32,46-77,95-99] awn-1[00-99]
Default=YES MaxTime=INFINITE State=UP

is this correct?




Re: [slurm-users] help with canceling or deleteing a job

2023-09-20 Thread Ole Holm Nielsen

On 9/20/23 01:39, Feng Zhang wrote:

Restarting the slurmd dameon of the compute node should work, if the
node is still online and normal.


Probably not.  If the filesystem used by the job is hung, the node must 
probably be rebooted, and the filesystem must be checked.


/Ole


On Tue, Sep 19, 2023 at 8:03 AM Felix  wrote:


Hello

I have a job on my system which is running more than its time, more than
4 days.

1808851 debug  gridjob  atlas01 CG 4-00:00:19  1 awn-047

I'm trying to cancel it

[@arc7-node ~]# scancel 1808851

I get no message as if the job was canceled but when getting information
about the job, the job is still there

[@arc7-node ~]# squeue | grep awn-047
 1808851 debug  gridjob  atlas01 CG 4-00:00:19 1 awn-047

Can I do any other thinks to kill end the job?




Re: [slurm-users] help with canceling or deleteing a job

2023-09-19 Thread Ole Holm Nielsen




On 9/19/23 13:59, Felix wrote:

Hello

I have a job on my system which is running more than its time, more than 4 
days.


1808851 debug  gridjob  atlas01 CG 4-00:00:19  1 awn-047


The job has state "CG" which means "Completing".  The Completing status is 
explained in "man sinfo".


This means that Slurm is trying to cancel the job, but it hangs for some 
reason.



I'm trying to cancel it

[@arc7-node ~]# scancel 1808851

I get no message as if the job was canceled but when getting information 
about the job, the job is still there


[@arc7-node ~]# squeue | grep awn-047
    1808851 debug  gridjob  atlas01 CG 4-00:00:19 1 awn-047


What is your UnkillableStepTimeout parameter?  The default of 60 seconds 
can be changed in slurm.conf.  My cluster:


$ scontrol show config | grep UnkillableStepTimeout
UnkillableStepTimeout   = 126 sec


Can I do any other thinks to kill end the job?


It may be impossible to kill the job's processes, for example, if a 
filesystem is hanging.


You may log in to the node and give the job's processes a "kill -9".  Or 
just reboot the node.
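
For example, on the node itself (the user name is taken from the listing 
above; note this also kills any other processes owned by that user on the 
node, so inspect first):

  pgrep -u atlas01     # list the user's processes
  pkill -9 -u atlas01  # force-kill them; processes stuck in uninterruptible I/O
                       # (D state) will not die even with -9, then a reboot is needed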


/Ole



Re: [slurm-users] [ext] Re: bufferoverflow in slurmd with acct_gather_energy plugin

2023-08-30 Thread Ole Holm Nielsen

Hi Magnus,

On 8/30/23 11:17, Hagdorn, Magnus Karl Moritz wrote:

On Wed, 2023-08-30 at 10:38 +0200, Ole Holm Nielsen wrote:

This is a very useful example!  I guess that you have also defined
EnergyIPMIUsername and EnergyIPMIPassword in acct_gather.conf?  How
is the
EnergyIPMIPassword protected from normal users if the
/etc/slurm/acct_gather.conf file exists?


it talks to the BMC via to OS, so no password/user required.


Ah, of course, the slurmd on your nodes can do local IPMI commands :-)


An EnergyIPMIFrequency of 10 seconds sounds like it could put a high
load
on the BMC and the server?


that might be my problem - I haven't checked that.


Maybe this could be a problem.  It's anyway better not to have "OS jitter" 
in HPC compute nodes by having system tasks executing too frequently.



I have never tested IPMI DCMI_ENHANCED commands.  Do you have some
FreeIPMI commands which can be used to verify the basic IPMI
DCMI_ENHANCED
functionality?


I checked the spec sheet of our BMC which suggested that it should be
able to do DCMI_ENHANCED


That's good to know.  Our servers from Huawei don't seem to support 
DCMI_ENHANCED.


The following ipmitool command works locally on a node, but I can't figure 
out the corresponding command to use with FreeIPMI.


# ipmitool dcmi power reading

Instantaneous power reading:   689 Watts
Minimum during sampling period: 19 Watts
Maximum during sampling period:905 Watts
Average power reading over sample period:  682 Watts
IPMI timestamp:   Wed Aug 30 09:35:28 2023
Sampling period:  0001 Seconds.
Power reading state is:   activated


Best regards,
Ole



Re: [slurm-users] [ext] Re: bufferoverflow in slurmd with acct_gather_energy plugin

2023-08-30 Thread Ole Holm Nielsen

Hi Magnus,

On 8/30/23 10:12, Hagdorn, Magnus Karl Moritz wrote:

Yes, but can you share the details of which parameters you configure
in
this plugin so that you can extract node power?  This doesn't seem
obvious to me.


not much needs configuring. We have

EnergyIPMIFrequency=10
EnergyIPMICalcAdjustment=yes
EnergyIPMIPowerSensors=Node=DCMI_ENHANCED

in the acct_gather.conf.


This is a very useful example!  I guess that you have also defined 
EnergyIPMIUsername and EnergyIPMIPassword in acct_gather.conf?  How is the 
EnergyIPMIPassword protected from normal users if the 
/etc/slurm/acct_gather.conf file exists?


An EnergyIPMIFrequency of 10 seconds sounds like it could put a high load 
on the BMC and the server?


I have never tested IPMI DCMI_ENHANCED commands.  Do you have some 
FreeIPMI commands which can be used to verify the basic IPMI DCMI_ENHANCED 
functionality?


Thanks,
Ole



Re: [slurm-users] [ext] Re: bufferoverflow in slurmd with acct_gather_energy plugin

2023-08-29 Thread Ole Holm Nielsen

Hi Magnus,

On 29-08-2023 13:56, Hagdorn, Magnus Karl Moritz wrote:

I'm curious to learn about your energy gathering method:  How do you
extract node power via IPMI using FreeIPMI (or some other toolset),
and
how do you configure Slurm for this?



We are using the SLURM plugin which is enabled using
AcctGatherEnergyType=acct_gather_energy/ipmi

https://slurm.schedmd.com/acct_gather.conf.html#SECTION_acct_gather_energy/IPMI


Yes, but can you share the details of which parameters you configure in 
this plugin so that you can extract node power?  This doesn't seem 
obvious to me.


Thanks,
Ole



Re: [slurm-users] bufferoverflow in slurmd with acct_gather_energy plugin

2023-08-29 Thread Ole Holm Nielsen

Hi Magnus,

On 8/28/23 10:16, Hagdorn, Magnus Karl Moritz wrote:

we recently enabled the energy gathering plugin on using the IPMI
gatherer with libfreeipmi. We are running the latest slurm 23.02.4 on
rocky 8.5. We are getting sporadic buffer overflows in slurmd when it
is trying to query the IPMI interface. We have the feeling this occurs
when a lot of jobs are getting started on the node. Has anybody come
across this issue and even better found a solution?


I'm curious to learn about your energy gathering method:  How do you 
extract node power via IPMI using FreeIPMI (or some other toolset), and 
how do you configure Slurm for this?


In our cluster I select a Dell node where I obtain this IPMI power reading 
from the BMC using a FreeIPMI tool:



$ ipmi-dcmi -D LAN_2_0 --username=root --password= --hostname=c190b 
--get-system-power-statistics
Current Power: 151 Watts
Minimum Power over sampling duration : 6 watts
Maximum Power over sampling duration : 293 watts
Average Power over sampling duration : 153 watts
Time Stamp   : 08/29/2023 - 08:54:03
Statistics reporting time period : 1000 milliseconds
Power Measurement: Active


However, the node's iDRAC BMC web GUI presents a somewhat different 
reading, which I assume must be reliable:  168 W.


I'm also using the Slurm with 
AcctGatherEnergyType=acct_gather_energy/rapl, see [1].  With RAPL and 
"scontrol show node c190" Slurm reports CurrentWatts=177 which just 
measures CPU+DIMM power.


Thanks for sharing any insights.

Best regards,
Ole

[1] 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#power-monitoring-and-management




Re: [slurm-users] Is there any public scientific-workflow example that can be run through Slurm?

2023-08-18 Thread Ole Holm Nielsen

Hi Alper,

On 18-08-2023 18:39, Alper Alimoglu wrote:
In slurm we can build pipelines using [slurm dependencies][1], which 
allows us to run workflows.


In my work, I have stuck in a point regarding finding a workflow that I 
can run using Slurm.


As an example, I have to use a workflow benchmark like in here 
https://pegasus.isi.edu/workflow_gallery/ 
 , but all of them are 
implemented inside Pegasus.


I was wondering is there any public scientific-workflow examples that 
can be run through Slurm?


Any suggestion is highly appreciated.

   [1]: 
https://www.hpc.caltech.edu/documentation/faq/dependencies-and-pipelines 


My colleagues at Technical University of Denmark are heavy users of 
workflows through Slurm on our cluster, and this accounts for most of 
our usage.  They have developed an Open Source workflow system:



MyQueue is a frontend for SLURM/PBS/LSF that makes handling of tasks easy. It 
has a command-line interface called mq with a number of Sub-commands and a 
Python interface for managing Workflows. Simple to set up: no system 
administrator or database required.


See https://myqueue.readthedocs.io/en/latest/

I'm personally not involved in MyQueue, but you might take a look to see 
if it's useful in your environment.


Best regards,
Ole



Re: [slurm-users] slurm sinfo format memory

2023-07-21 Thread Ole Holm Nielsen
Hi Arsene,

On 7/20/23 18:24, Arsene Marian Alain wrote:
> I would like to see the following information of my nodes "hostname, total 
> mem, free mem and cpus". So, I used  ‘sinfo -o "%8n %8m %8e %C"’ but in 
> the output it shows me the memory in MB like "190560" and I need it in GB 
> (without decimals if possible) like "190GB". Any ideas or suggestions on 
> how I can do that?

Just my 2 cents:  The old "-o" options flag should be replaced by Slurm's 
more modern "-O" option which uses readable parameters and allows more 
output fields than -o, for example:

sinfo -O "NodeHost,CPUsState:20,Memory:20,FreeMem:20,StateComplete:30"
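
If the memory really needs to be shown in GB without decimals, a small awk 
post-processing step can do the conversion (a sketch only, not a built-in 
sinfo feature; it divides by 1000 to match the "190GB" the poster expects, and 
a non-numeric FreeMem such as "N/A" will simply print as 0):

  sinfo -h -N -O "NodeHost,Memory,FreeMem,CPUsState" | \
      awk '{printf "%-10s %4dGB %4dGB %s\n", $1, $2/1000, $3/1000, $4}'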

/Ole


Re: [slurm-users] Notify users about job submit plugin actions

2023-07-20 Thread Ole Holm Nielsen
Hi Lorenzo,

On 7/20/23 12:16, Lorenzo Bosio wrote:
> One more thing I'd like to point out, is that I need to monitor jobs going 
> from pending to running state (after waiting in the jobs queue). I 
> currently have a separate pthread to achieve this, but I think at this 
> point the job_submit()/job_modify() function has already exited.
> I do get the output of the slurm_kill_job() function when called, but 
> that's not useful for the user and I couldn't find a way to append custom 
> messages.

Maybe it's useful to have E-mail notifications sent to your users when the 
job transitions its state?

According to the sbatch man-page the user can specify himself what mail 
alerts he would like:

--mail-type=
   Notify user by email when certain event types occur.
   Valid type values are NONE, BEGIN, END, FAIL, REQUEUE,
   ALL (equivalent to BEGIN, END, FAIL, INVALID_DEPEND, REQUEUE,
   and STAGE_OUT), INVALID_DEPEND  (dependency  never  satisfied),
   STAGE_OUT  (burst  buffer  stage  out  and  teardown  completed),
   TIME_LIMIT,  TIME_LIMIT_90  (reached  90 percent of time limit),
   TIME_LIMIT_80 (reached 80 percent of time limit),
   TIME_LIMIT_50 (reached 50 percent of time limit) and
   ARRAY_TASKS (send emails for each array task).

/Ole


Re: [slurm-users] Notify users about job submit plugin actions

2023-07-19 Thread Ole Holm Nielsen
Hi Lorenzo,

On 7/19/23 14:22, Lorenzo Bosio wrote:
> I'm developing a job submit plugin to check if some conditions are met 
> before a job runs.
> I'd need a way to notify the user about the plugin actions (i.e. why its 
> jobs was killed and what to do), but after a lot of research I could only 
> write to logs and not the user shell.
> The user gets the output of slurm_kill_job but I can't find a way to add a 
> custom note.
> 
> Can anyone point me to the right api/function in the code?

I've written a fairly general job submit plugin which you can copy and 
customize for your needs:

https://github.com/OleHolmNielsen/Slurm_tools/tree/master/plugins

The slurm.log_user() function prints an error message to the user's terminal.

I hope this helps.

/Ole


Re: [slurm-users] Job step do not take the hole allocation

2023-06-30 Thread Ole Holm Nielsen

On 6/30/23 08:41, Tommi Tervo wrote:

This was an annoying change:

22.05.x RELEASE_NOTES:
  -- srun will no longer read in SLURM_CPUS_PER_TASK. This means you will
 implicitly have to specify --cpus-per-task on your srun calls, or set the
 new SRUN_CPUS_PER_TASK env var to accomplish the same thing.

Here one can find relevant discussion:

https://bugs.schedmd.com/show_bug.cgi?id=15632

I'll attach our cli-filter pre_submit function which works for us.


The discussion in bug 15632 concludes that this bug will only be fixed in 
23.11.  Your workaround looks nice, however, I have not been able to find 
any documentation of slurmctld calling any Lua functions named 
slurm_cli_pre_submit or slurm_cli_post_submit.
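
For users hit by this on 22.05, a simple per-job workaround sketch (assuming 
an ordinary batch script; "./my_app" is just a placeholder) is to re-export 
the value so that srun picks it up again:

#!/bin/bash
#SBATCH --cpus-per-task=4
# srun in 22.05 no longer reads SLURM_CPUS_PER_TASK, so pass it on explicitly:
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun ./my_app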


Some very similar functions are documented in 
https://slurm.schedmd.com/cli_filter_plugins.html for functions 
cli_filter_p_setup_defaults, cli_filter_p_pre_submit, and 
cli_filter_p_post_submit.


Can anyone shed light on the relationship between Tommi's 
slurm_cli_pre_submit function and the ones defined in the 
cli_filter_plugins page?


Thanks,
Ole



Re: [slurm-users] monitoring and accounting

2023-06-12 Thread Ole Holm Nielsen

Hi Andrew,

On 6/12/23 01:43, Andrew Elwell wrote:
Are your slurm to influx scripts publicly available anywhere? I do 
something similar for squeue via python subprocess to call


squeue -M all -a -o "%P,%a,%u,%D,%q,%T,%r"

And some sinfo calls for node/cpu usage:

sinfo -M {} -o "%P,%a,%F"
sinfo -M {} -o "%%R,%a,%C,%B,%z"

But I'd be interested to see what other places do. Perhaps some examples 
could be gathered for Ole's wiki?


I'd be happy to copy examples and links to documentation to the Wiki.  I 
guess this would be the best place?


https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/#other-accounting-report-tools

/Ole



Re: [slurm-users] Temporary Stop User Submission

2023-05-26 Thread Ole Holm Nielsen

On 5/26/23 01:56, Markuske, William wrote:
I would but unfortunately they were creating 100s of TBs of data and I 
need them to log in and delete it but I don't want them creating more in 
the meantime.


Does your filesystem have disk quotas?  Using disk quotas would seem to be 
a good choice to contain this problem.


/Ole



Re: [slurm-users] Temporary Stop User Submission

2023-05-26 Thread Ole Holm Nielsen

On 5/26/23 01:29, Doug Meyer wrote:

I always like

Sacctmgr update user where user= set grpcpus=0


This GrpCPUs group limit may perhaps affect the entire group?  Anyway, 
GrpCPUs is undocumented in the sacctmgr manual page, for which I opened a 
bug in https://bugs.schedmd.com/show_bug.cgi?id=16832


I think Sean's suggestion of maxjobs=0 is a better choice, or perhaps set 
GrpTRES=cpu=0 ?
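
For example (the user name is hypothetical), blocking and later unblocking a 
single user would look something like:

sacctmgr modify user where name=someuser set maxjobs=0
sacctmgr modify user where name=someuser set maxjobs=-1    # -1 clears the limit again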


/Ole



Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen

On 5/25/23 15:23, Roger Mason wrote:

NodeName=node012 CoresPerSocket=2
CPUAlloc=0 CPUTot=4 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node012 NodeHostName=node012
RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
State=UNKNOWN+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A 
MCS_label=N/A
Partitions=macpro
BootTime=None SlurmdStartTime=None
CfgTRES=cpu=4,mem=10193M,billing=4
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2023-05-25T09:26:59]

But the 'Low RealMemory' is incorrect.  The entry in slurm.conf for
node012 is:

NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=10193  State=UNKNOWN


Thanks for the info.  Some questions arise:

1. Is slurmd running on the node?

2. What's the output of "slurmd -C" on the node?

3. Define State=UP in slurm.conf instead of UNKNOWN

4. Why have you configured TmpDisk=0?  It should be the size of the /tmp 
filesystem.


Since you run Slurm 20.02, there are some suggestions in my Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration 
where this might be useful:



Note for Slurm 20.02: The Boards=1 SocketsPerBoard=2 configuration gives error 
messages, see bug_9241 and bug_9233. Use Sockets= instead:
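
Applied to the node shown above, a sketch of the adjusted node definition 
(keep RealMemory no larger than what "slurmd -C" reports, and add a TmpDisk 
value matching the size of /tmp) would be:

NodeName=node012 CPUs=4 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=10193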


I hope changing these slurm.conf parameters will help.

Best regards,
Ole






Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen

On 5/25/23 13:59, Roger Mason wrote:

slurm 20.02.7 on FreeBSD.


Uh, that's old!


I have a couple of nodes stuck in the drain state.  I have tried

scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume

without success.

I then tried

/usr/local/sbin/slurmctld -c
scontrol update nodename=node012 state=idle

also without success.

Is there some other method I can use to get these nodes back up?


What's the output of "scontrol show node node012"?

/Ole



Re: [slurm-users] Limit run time of interactive jobs

2023-05-08 Thread Ole Holm Nielsen

On 5/8/23 08:39, Bjørn-Helge Mevik wrote:

Angel de Vicente  writes:


But one possible way to something similar is to have a partition only
for interactive jobs and a different partition for batch jobs, and then
enforce that each job uses the right partition. In order to do this, I
think we can use the Lua contrib module (check the job_submit.lua
example).


Wouldn't it be simpler to just refuse too long interactive jobs in
job_submit.lua?


This sounds like a good idea, but how would one identify an interactive 
job in the job_submit.lua script?  A solution was suggested in 
https://serverfault.com/questions/1090689/how-can-i-set-up-interactive-job-only-or-batch-job-only-partition-on-a-slurm-clu 




Interactive jobs have no script and job_desc.script will be empty / not set.


So maybe something like this code snippet?

if job_desc.script == nil then
   -- This is an interactive job (no batch script)
   -- Check the time limit; job_desc.time_limit is expressed in minutes,
   -- and is slurm.NO_VAL if the user did not request a limit explicitly
   if job_desc.time_limit ~= slurm.NO_VAL and job_desc.time_limit > 60 then
      slurm.log_user("NOTICE: Interactive jobs are limited to 60 minutes")
      -- ESLURM_INVALID_TIME_LIMIT in slurm_errno.h
      return 2051
   end
end

/Ole



Re: [slurm-users] Several slurmdbds against one mysql server?

2023-05-01 Thread Ole Holm Nielsen

On 5/1/23 12:08, Angel de Vicente wrote:

Hello Ole,

Ole Holm Nielsen  writes:


As Brian wrote:


On a technical note: slurm keeps the detailed accounting data for each cluster
in separate TABLES within a single database.


In the Federation page
https://slurm.schedmd.com/federation.html
it is implicitly assumed that the sacctmgr command talks only to a single
slurmdbd instance.  It is not, however, explicitly stated as an answer to your
question.


And hence my question.. because as I was saying in a previous mail,
reading the documentation I understand that this is the standard way to
do it, but right now I got it working the other way: in each cluster I
have one slurmdbd daemon that connects with a single mysqld daemon in a
third machine (option 2 from my question).

I have a single database with detailed accounting data for each cluster
in separate tables, and from each cluster I can query the whole database
so as far as I can see all is working fine but it is implemented
different to the standard approach.

I did it this way not because I wanted something special or outside of
the standard, simply because it was not very clear to me from the
documentation which way to go and this came natural when implementing it
(maybe simply because in the database machine I don't have Slurm
installed). And I have no problem with changing the installation to a
single slurmdbd daemon if I need to.

But this being my first time I just hope to learn if this is really a
bad idea that is going to bite me in the near future when these machines
go to production and I should change to the standard way, or in general
whether someone has a clear idea of the pros/cons of both ways.


If implementing Slurm for the first time, the slurm-users mailing list is 
probably the most helpful way to ask questions.  The official Slurm 
documentation is of course the place to start learning.  Some people have 
found my Slurm Wiki page helpful:

https://wiki.fysik.dtu.dk/Niflheim_system/SLURM/
However, I do not describe federated clusters because we don't use this 
aspect.


I also recommend SchedMD's paid support contracts, since they are the 
experts and give a fantastic service: https://www.schedmd.com/support.php


/Ole



Re: [slurm-users] Several slurmdbds against one mysql server?

2023-05-01 Thread Ole Holm Nielsen

Hi Angel,

On 5/1/23 11:28, Angel de Vicente wrote:

Ole Holm Nielsen  writes:


If I read Brian's comments correctly, he's saying that Slurm already has a
well-tested and documented solution for multi-cluster sites: Federated clusters.


Thanks Ole. Don't get me wrong, I have nothing against using Federated
clusters, and I guess I will probably end up going for it, but my
question keeps just the same (as far as I understand nothing changes in
that respect with multi-cluster or federated setting?): whether I should
just run one slurmdbd daemon or several.


As Brian wrote:

On a technical note: slurm keeps the detailed accounting data for each cluster in separate TABLES within a single database. 


In the Federation page https://slurm.schedmd.com/federation.html it is 
implicitly assumed that the sacctmgr command talks only to a single 
slurmdbd instance.  It is not, however, explicitly stated as an answer to 
your question.


You can see in another presentation that there is only a *single* slurmdbd 
in a federated multi-cluster scenario: 
https://slurm.schedmd.com/SLUG18/slurm_overview.pdf

Look at slide 28 "Typical Enterprise Architecture".

/Ole



Re: [slurm-users] Several slurmdbds against one mysql server?

2023-05-01 Thread Ole Holm Nielsen

On 5/1/23 09:22, Angel de Vicente wrote:

This is the first time that I'm installing Slurm, so things are not very
clear to me yet (even more so for multi-cluster operation).

Brian Andrus  writes:


You can do it however you like. You asked if there was a good or existing way to
do it easily, that was provided. Up to you if you want to write your own scripts
that do the work and manage that, or just have to learn the ins and outs of
running sreport.


I'm not sure what scripts you have in mind above, since as far as I can
see I already have a working solution for what I need (i.e. keep all job
records from different clusters in a single database).


If I read Brian's comments correctly, he's saying that Slurm already has a 
well-tested and documented solution for multi-cluster sites: Federated 
clusters.  You don't HAVE to use the solution that Slurm/SchedMD provides, 
but it will be the easy and well tested solution for you.


If you don't want to use federated clusters, you are free to do so.  But 
then you have to write *your own scripts* to implement your own ideas. 
Probably no-one can help you with your ideas, and you will have to develop 
everything by yourself from scratch (not an easy task if this is your 
first experience with Slurm).


I hope Brian's comments will help you select the best way forward.  The 
slurm-users list is generally helpful, also to new Slurm users.


/Ole



But let's say I go for the federated cluster option. I think my question
still holds. Let's say, for clarity, that I have two clusters (CA and
CB) and another machine (DB) where I will store the mysql database. As
far as I can see, in terms of the daemons running in each machine, I can
implement the whole thing in two ways:

option 1)
   CA: slurmd, slurmctld (AccountingStorageHost: DB)
   CB: slurmd, slurmctld (AccountingStorageHost: DB)
   DB: slurmdbd, mysqld

option 2)
   CA: slurmd, slurmctld, slurmdbd (StorageHost: DB)
   CB: slurmd, slurmctld, slurmdbd (StorageHost: DB)
   DB: mysqld


By reading the documentation on multi-cluster and federated clusters I
think option 1) is the preferred way, but I was just trying to
understand why and what are the pros/cons of each option.




Re: [slurm-users] Several slurmdbds against one mysql server?

2023-04-29 Thread Ole Holm Nielsen

On 29-04-2023 11:44, Angel de Vicente wrote:

Hello,

I'm setting Slurm in a number of machines and (at least for the moment)
we don't plan to let users submit across machines, so the initial plan
was to install Slurm+slurmdbd+mysql in every machine.

But in order to get stats for all the machines and to simplify things a
bit, I'm planning now to have only on mysql server running in a separate
machine and to collect all job data there.

What would make more sense?
   + 1) to install only one slurmdbd+mysql and configure all slurmctlds to
   communicate with the single slurmdbd, or
   + 2) to have a slurmdbd in every machine and only one mysql to which all
   slurmdbds connect?

Right now I have 2) working, but I wonder if there are pros/cons that I
have not considered yet for option 1)


Maybe you want to use Slurm federated clusters with a single database, 
see https://slurm.schedmd.com/federation.html

There is a presentation at https://slurm.schedmd.com/SC17/FederationSC17.pdf

I hope this helps.

/Ole



Re: [slurm-users] Terminating Jobs based on GrpTRESMins

2023-04-29 Thread Ole Holm Nielsen

On 28-04-2023 18:28, Hoot Thompson wrote:

I’m somewhat confused by your statement "This would only occur if you lower the 
GrpTRESMins limit after a job has started.”. My test case had the limits established 
before job submittal and the job was terminated when the threshold was crossed.


According to the sacctmgr manual page describing the GrpTRESMins limit:


When  this  limit  is reached  all  associated  jobs  running will be killed 
and all future jobs submitted with associations in the group will be delayed 
until they are able to run inside the limit.


Therefore a job requesting more minutes than the GrpTRESMins limit 
should not be permitted to start.
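
For reference, such a limit is set on an association with a command along 
these lines (account name and value are just examples):

sacctmgr modify account where name=testing set GrpTRESMins=cpu=1000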


/Ole



On Apr 28, 2023, at 6:43 AM, Ole Holm Nielsen  
wrote:

Hi Hoot,

I'm glad that you have figured out that GrpTRESMins is working as documented 
and kills running jobs when the limit is exceeded.  This would only occur if 
you lower the GrpTRESMins limit after a job has started.

/Ole

On 4/27/23 22:39, Hoot Thompson wrote:

I have GrpTRESMins working and terminating jobs as expected. I was working 
under the belief that the limit “current value” was only updated upon job 
completion. That is not the case, it’s actually updated every 5 minutes it 
appears. If and when the limit/threshold is crossed, jobs are in fact canceled.
Thanks for your help.

On Apr 24, 2023, at 1:55 PM, Ole Holm Nielsen  
wrote:

On 24-04-2023 18:33, Hoot Thompson wrote:

In my reading of the Slurm documentation, it seems that exceeding the limits 
set in GrpTRESMins should result in terminating a running job. However, in 
testing this, The ‘current value’ of the GrpTRESMins only updates upon job 
completion and is not updated as the job progresses. Therefore jobs aren’t 
being stopped. On the positive side, no new jobs are started if the limit is 
exceeded. Here’s the documentation that is confusing me…..


I think the jobs resource usage will only be added to the Slurm database upon 
job completion.  I believe that Slurm doesn't update the resource usage 
continually as you seem to expect.


If any limit is reached, all running jobs with that TRES in this group will be 
killed, and no new jobs will be allowed to run.
Perhaps there is a setting or misconfiguration on my part.


The sacctmgr manual page states:


GrpTRESMins=TRES=[,TRES=,...]
The total number of TRES minutes that can possibly be used by past, present and 
future jobs running from this association and its children.  To clear a 
previously set value use the modify command with a new value of -1 for each 
TRES id.
NOTE: This limit is not enforced if set on the root association of a cluster.  
So even though it may appear in sacctmgr output, it will not be enforced.
ALSO NOTE: This limit only applies when using the Priority Multifactor plugin.  
The time is decayed using the value of PriorityDecayHalfLife or 
PriorityUsageResetPeriod as set in the slurm.conf.  When this limit is reached 
all associated jobs running will be killed and all future jobs submitted with 
associations in the group will be delayed until they are able to run inside the 
limit.


Can you please confirm that you have configured the "Priority Multifactor" 
plugin?

Your jobs should not be able to start if the user's GrpTRESMins has been 
exceeded.  Hence they won't be killed!

Can you explain step by step what you observe?  It may be that the above 
documentation of killing jobs is in error, in which case we should make a bug 
report to SchedMD.




Re: [slurm-users] Terminating Jobs based on GrpTRESMins

2023-04-28 Thread Ole Holm Nielsen

Hi Hoot,

I'm glad that you have figured out that GrpTRESMins is working as 
documented and kills running jobs when the limit is exceeded.  This would 
only occur if you lower the GrpTRESMins limit after a job has started.


/Ole

On 4/27/23 22:39, Hoot Thompson wrote:

I have GrpTRESMins working and terminating jobs as expected. I was working 
under the belief that the limit “current value” was only updated upon job 
completion. That is not the case, it’s actually updated every 5 minutes it 
appears. If and when the limit/threshold is crossed, jobs are in fact canceled.

Thanks for your help.



On Apr 24, 2023, at 1:55 PM, Ole Holm Nielsen  
wrote:

On 24-04-2023 18:33, Hoot Thompson wrote:

In my reading of the Slurm documentation, it seems that exceeding the limits 
set in GrpTRESMins should result in terminating a running job. However, in 
testing this, The ‘current value’ of the GrpTRESMins only updates upon job 
completion and is not updated as the job progresses. Therefore jobs aren’t 
being stopped. On the positive side, no new jobs are started if the limit is 
exceeded. Here’s the documentation that is confusing me…..


I think the jobs resource usage will only be added to the Slurm database upon 
job completion.  I believe that Slurm doesn't update the resource usage 
continually as you seem to expect.


If any limit is reached, all running jobs with that TRES in this group will be 
killed, and no new jobs will be allowed to run.
Perhaps there is a setting or misconfiguration on my part.


The sacctmgr manual page states:


GrpTRESMins=TRES=[,TRES=,...]
The total number of TRES minutes that can possibly be used by past, present and 
future jobs running from this association and its children.  To clear a 
previously set value use the modify command with a new value of -1 for each 
TRES id.
NOTE: This limit is not enforced if set on the root association of a cluster.  
So even though it may appear in sacctmgr output, it will not be enforced.
ALSO NOTE: This limit only applies when using the Priority Multifactor plugin.  
The time is decayed using the value of PriorityDecayHalfLife or 
PriorityUsageResetPeriod as set in the slurm.conf.  When this limit is reached 
all associated jobs running will be killed and all future jobs submitted with 
associations in the group will be delayed until they are able to run inside the 
limit.


Can you please confirm that you have configured the "Priority Multifactor" 
plugin?

Your jobs should not be able to start if the user's GrpTRESMins has been 
exceeded.  Hence they won't be killed!

Can you explain step by step what you observe?  It may be that the above 
documentation of killing jobs is in error, in which case we should make a bug 
report to SchedMD.




Re: [slurm-users] Terminating Jobs based on GrpTRESMins

2023-04-24 Thread Ole Holm Nielsen

On 24-04-2023 18:33, Hoot Thompson wrote:
In my reading of the Slurm documentation, it seems that exceeding the 
limits set in GrpTRESMins should result in terminating a running job. 
However, in testing this, The ‘current value’ of the GrpTRESMins only 
updates upon job completion and is not updated as the job progresses. 
Therefore jobs aren’t being stopped. On the positive side, no new jobs 
are started if the limit is exceeded. Here’s the documentation that is 
confusing me…..


I think the jobs resource usage will only be added to the Slurm database 
upon job completion.  I believe that Slurm doesn't update the resource 
usage continually as you seem to expect.


If any limit is reached, all running jobs with that TRES in this group 
will be killed, and no new jobs will be allowed to run.


Perhaps there is a setting or misconfiguration on my part.


The sacctmgr manual page states:


GrpTRESMins=TRES=[,TRES=,...]
The total number of TRES minutes that can possibly be used by past, present and 
future jobs running from this association and its children.  To clear a 
previously set value use the modify command with a new value of -1 for each 
TRES id.

NOTE: This limit is not enforced if set on the root association of a cluster.  
So even though it may appear in sacctmgr output, it will not be enforced.

ALSO NOTE: This limit only applies when using the Priority Multifactor plugin.  
The time is decayed using the value of PriorityDecayHalfLife or 
PriorityUsageResetPeriod as set in the slurm.conf.  When this limit is reached 
all associated jobs running will be killed and all future jobs submitted with 
associations in the group will be delayed until they are able to run inside the 
limit.


Can you please confirm that you have configured the "Priority 
Multifactor" plugin?


Your jobs should not be able to start if the user's GrpTRESMins has been 
exceeded.  Hence they won't be killed!


Can you explain step by step what you observe?  It may be that the above 
documentation of killing jobs is in error, in which case we should make 
a bug report to SchedMD.


/Ole






Re: [slurm-users] Migration of slurm communication network / Steps / how to

2023-04-24 Thread Ole Holm Nielsen

On 4/24/23 08:56, Purvesh Parmar wrote:
Thank you.. will try this and get back. Any other step being missed here 
for migration?


I don't know if any steps are missing, because I never tried moving a 
cluster like you want to do.


/Ole

On Mon, 24 Apr 2023 at 12:08, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:


On 4/24/23 08:09, Purvesh Parmar wrote:
 > thank you, however, because this is change in the data center, the
names
 > of the servers contain datacenter names as well in its hostname and in
 > fqdn as well, hence i have to change both, hostnames as well as ip
 > addresses, compulsorily, to given hostnames as per new DC names.

Could your data center be persuaded to introduce DNS CNAME aliases for
the
old names to point to the new DC names?

If you're forced to use new DNS names only, then it's simple to change
DNS
names of compute nodes and partitions in slurm.conf:

NodeName=...
PartitionName=xxx Nodes=...

as well as the slurmdb server name:

AccountingStorageHost=...

What I have never tried before is to change the DNS name of the slurmctld
host:

ControlMachine=...

The critical aspect here is that you need to stop all batch jobs, plus
slurmdbd and slurmctld.  Then you can backup (tar-ball) and transfer the
Slurm state directories:

StateSaveLocation=/var/spool/slurmctld

However, I don't know if the name of the ControlMachine is hard-coded in
the StateSaveLocation files?

I strongly suggest that you try to make a test migration of the
cluster to
the new DC to find out if it works or not.  Then you can always make
multiple attempts without breaking anything.

Best regards,
Ole


 > On Mon, 24 Apr 2023 at 11:25, Ole Holm Nielsen
 > <ole.h.niel...@fysik.dtu.dk> wrote:
 >
 >     On 4/24/23 06:58, Purvesh Parmar wrote:
 >      > thank you, but its change of hostnames as well, apart from ip
 >     addresses
 >      > as well of the slurm server, database serverver name and slurmd
 >     compute
 >      > nodes as well.
 >
 >     I suggest that you talk to your networking people and request
that the
 >     old
 >     DNS names be created in the new network's DNS for your Slurm
cluster.
 >     Then Ryan's solution will work.  Changing DNS names is a very
simple
 >     matter!
 >
 >     My 2 cents,
 >     Ole
 >
 >
 >      > On Mon, 24 Apr 2023 at 10:04, Ryan Novosielski
 >      > <novos...@rutgers.edu> wrote:
 >      >
 >      >     I think it’s easier than all of this. Are you actually
changing
 >     names
 >      >     of all of these things, or just IP addresses? It they all
 >     resolve to
 >      >     an IP now and you can bring everything down and change the
 >     hosts files
 >      >     or DNS, it seems to me that if the names aren’t changing,
 >     that’s that.
 >      >     I know that “scontrol show cluster” will show the wrong IP
 >     address but
 >      >     I think that updates itself.
 >      >
 >      >     The names of the servers are in slurm.conf, but again,
if the names
 >      >     don’t change, that won’t matter. If you have IPs there, you
 >     will need
 >      >     to change them.
 >      >
 >      >     Sent from my iPhone
 >      >
 >      >      > On Apr 23, 2023, at 14:01, Purvesh Parmar
 >      >      > <purveshp0...@gmail.com> wrote:
 >      >      > 
 >      >      > Hello,
 >      >      >
 >      >      > We have slurm 21.08 on ubuntu 20. We have a cluster
of 8 nodes.
 >      >     Entire slurm communication happens over 192.168.5.x
network (LAN).
 >      >     However as per requirement, now we are migrating the
cluster to
 >     other
 >      >     premises and there we have 172.16.1.x (LAN). I have to
migrate the
 >      >     entire network including SLURMDBD (maria

Re: [slurm-users] Migration of slurm communication network / Steps / how to

2023-04-24 Thread Ole Holm Nielsen

On 4/24/23 08:09, Purvesh Parmar wrote:
thank you, however, because this is change in the data center, the names 
of the servers contain datacenter names as well in its hostname and in 
fqdn as well, hence i have to change both, hostnames as well as ip 
addresses, compulsorily, to given hostnames as per new DC names.


Could your data center be persuaded to introduce DNS CNAME aliases for the 
old names to point to the new DC names?


If you're forced to use new DNS names only, then it's simple to change DNS 
names of compute nodes and partitions in slurm.conf:


NodeName=...
PartitionName=xxx Nodes=...

as well as the slurmdb server name:

AccountingStorageHost=...

What I have never tried before is to change the DNS name of the slurmctld 
host:


ControlMachine=...

The critical aspect here is that you need to stop all batch jobs, plus 
slurmdbd and slurmctld.  Then you can backup (tar-ball) and transfer the 
Slurm state directories:


StateSaveLocation=/var/spool/slurmctld

However, I don't know if the name of the ControlMachine is hard-coded in 
the StateSaveLocation files?


I strongly suggest that you try to make a test migration of the cluster to 
the new DC to find out if it works or not.  Then you can always make 
multiple attempts without breaking anything.


Best regards,
Ole


On Mon, 24 Apr 2023 at 11:25, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:


On 4/24/23 06:58, Purvesh Parmar wrote:
 > thank you, but its change of hostnames as well, apart from ip
addresses
 > as well of the slurm server, database serverver name and slurmd
compute
 > nodes as well.

I suggest that you talk to your networking people and request that the
old
DNS names be created in the new network's DNS for your Slurm cluster.
Then Ryan's solution will work.  Changing DNS names is a very simple
matter!

My 2 cents,
Ole


 > On Mon, 24 Apr 2023 at 10:04, Ryan Novosielski
 > <novos...@rutgers.edu> wrote:
 >
 >     I think it’s easier than all of this. Are you actually changing
names
 >     of all of these things, or just IP addresses? It they all
resolve to
 >     an IP now and you can bring everything down and change the
hosts files
 >     or DNS, it seems to me that if the names aren’t changing,
that’s that.
 >     I know that “scontrol show cluster” will show the wrong IP
address but
 >     I think that updates itself.
 >
 >     The names of the servers are in slurm.conf, but again, if the names
 >     don’t change, that won’t matter. If you have IPs there, you
will need
 >     to change them.
 >
 >     Sent from my iPhone
 >
 >      > On Apr 23, 2023, at 14:01, Purvesh Parmar
 >      > <purveshp0...@gmail.com> wrote:
 >      > 
 >      > Hello,
 >      >
 >      > We have slurm 21.08 on ubuntu 20. We have a cluster of 8 nodes.
 >     Entire slurm communication happens over 192.168.5.x network (LAN).
 >     However as per requirement, now we are migrating the cluster to
other
 >     premises and there we have 172.16.1.x (LAN). I have to migrate the
 >     entire network including SLURMDBD (mariadb), SLURMCTLD, SLURMD.
ALso
 >     the cluster network is also changing from 192.168.5.x to 172.16.1.x
 >     and each node will be assigned the ip address from the 172.16.1.x
 >     network.
 >      > The cluster has been running for the last 3 months and it is
 >     required to maintain the old usage stats as well.
 >      >
 >      >
 >      >  Is the procedure correct as below :
 >      >
 >      > 1) Stop slurm
 >      > 2) suspend all the queued jobs
 >      > 3) backup slurm database
 >      > 4) change the slurm & munge configuration i.e. munge conf,
mariadb
 >     conf, slurmdbd.conf, slurmctld.conf, slurmd.conf (on compute
nodes),
 >     gres.conf, service file
 >      > 5) Later, do the update in the slurm database by executing below
 >     command
 >      > sacctmgr modify node where node=old_name set name=new_name
 >      > for all the nodes.
 >      > ALso, I think, slurm server name and slurmdbd server names
are also
 >     required to be updated. How to do it, still checking
 >      > 6) Finally, start slurmdbd, slurmctld on server and slurmd on
 >     compute nodes
 >      >
 >      > Please help and guide for above.
 >      >
 >      > Regards,
 >      >
 >      > Purvesh Parmar
 >      > INHAIT






Re: [slurm-users] Migration of slurm communication network / Steps / how to

2023-04-23 Thread Ole Holm Nielsen

On 4/24/23 06:58, Purvesh Parmar wrote:
thank you, but its change of hostnames as well, apart from ip addresses  
as well of the slurm server, database serverver name and slurmd compute 
nodes as well.


I suggest that you talk to your networking people and request that the old 
DNS names be created in the new network's DNS for your Slurm cluster. 
Then Ryan's solution will work.  Changing DNS names is a very simple matter!


My 2 cents,
Ole


On Mon, 24 Apr 2023 at 10:04, Ryan Novosielski wrote:


I think it’s easier than all of this. Are you actually changing names
of all of these things, or just IP addresses? If they all resolve to
an IP now and you can bring everything down and change the hosts files
or DNS, it seems to me that if the names aren’t changing, that’s that.
I know that “scontrol show cluster” will show the wrong IP address but
I think that updates itself.

The names of the servers are in slurm.conf, but again, if the names
don’t change, that won’t matter. If you have IPs there, you will need
to change them.

Sent from my iPhone

 >      > On Apr 23, 2023, at 14:01, Purvesh Parmar <purveshp0...@gmail.com> wrote:
 > 
 > Hello,
 >
 > We have slurm 21.08 on ubuntu 20. We have a cluster of 8 nodes.
Entire slurm communication happens over 192.168.5.x network (LAN).
However as per requirement, now we are migrating the cluster to other
premises and there we have 172.16.1.x (LAN). I have to migrate the
entire network including SLURMDBD (mariadb), SLURMCTLD, SLURMD. ALso
the cluster network is also changing from 192.168.5.x to 172.16.1.x
and each node will be assigned the ip address from the 172.16.1.x
network.
 > The cluster has been running for the last 3 months and it is
required to maintain the old usage stats as well.
 >
 >
 >  Is the procedure correct as below :
 >
 > 1) Stop slurm
 > 2) suspend all the queued jobs
 > 3) backup slurm database
 > 4) change the slurm & munge configuration i.e. munge conf, mariadb
conf, slurmdbd.conf, slurmctld.conf, slurmd.conf (on compute nodes),
gres.conf, service file
 > 5) Later, do the update in the slurm database by executing below
command
 > sacctmgr modify node where node=old_name set name=new_name
 > for all the nodes.
 > ALso, I think, slurm server name and slurmdbd server names are also
required to be updated. How to do it, still checking
 > 6) Finally, start slurmdbd, slurmctld on server and slurmd on
compute nodes
 >
 > Please help and guide for above.
 >
 > Regards,
 >
 > Purvesh Parmar
 > INHAIT




Re: [slurm-users] sview not installed

2023-04-23 Thread Ole Holm Nielsen

On 23-04-2023 02:43, mohammed shambakey wrote:
I installed slurm 23.11.0-0rc1, and sview is not installed, despite it 
exists in /src/sview/sview. I can execute it from that path but 
not /bin (because it does not exist there).


I tried just copying it to /bin, but it 
complained about being just a wrapper.


I wonder if I'm missing something?


If your system is RPM based, you will build Slurm packages like this:

$ rpmbuild -ta slurm-23.02.1.tar.bz2  --with mysql --with slurmrestd

The /usr/bin/sview command is located in the 
slurm-23.02.1-1.el7.x86_64.rpm package.
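
You can verify this before installing, for example with standard rpm queries:

rpm -qlp slurm-23.02.1-1.el7.x86_64.rpm | grep sview
rpm -qf /usr/bin/sview    # on a node where the package is already installed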


/Ole



Re: [slurm-users] Resource LImits

2023-04-21 Thread Ole Holm Nielsen

Hi Jason,

On 4/20/23 20:11, Jason Simms wrote:

Hello Ole and Hoot,

First, Hoot, thank you for your question. I've managed Slurm for a few 
years now and still feel like I don't have a great understanding about 
managing or limiting resources.


Ole, thanks for your continued support of the user community with your 
documentation. I do wish not only that more of your information were 
contained within the official docs, but also that there were even clearer 
discussions around certain topics.


As an example, you write that "It is important to configure slurm.conf so 
that the locked memory limit isn’t propagated to the batch jobs" by 
setting PropagateResourceLimitsExcept=MEMLOCK. It's unclear to me whether 
you are suggesting that literally everyone should have that set, or 
whether it only applies to certain configurations. We don't have it set, 
for instance, but we've not run into trouble with jobs failing due to 
locked memory errors.


The link mentioned in the page hopefully explains it: 
https://slurm.schedmd.com/faq.html#memlock
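
In short, the relevant slurm.conf line and a quick sanity check from inside a 
job would look something like this (on a correctly configured compute node the 
check should typically report "unlimited"):

PropagateResourceLimitsExcept=MEMLOCK

srun bash -c 'ulimit -l'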


Then, in the official docs, to which you link, it says that "it may also 
be desirable to lock the slurmd daemon's memory to help ensure that it 
keeps responding if memory swapping begins" by creating 
/etc/sysconfig/slurm containing the line SLURMD_OPTIONS="-M". Would there 
ever be a reason *not* to include that? That is, I can't think it would 
ever be desirable for slurmd to stop responding. So is that another 
"universal" recommendation, I wonder?


I'm not an expert on locking slurmd pages!  The -M option is documented in 
the slurmd manual page, and I probably read a thread long ago abut this on 
the slurm-users mailing list discussing this.  You could try it out in 
your environment and see if all is well.


It may be me talking as a new-ish user, but I would find a concise 
document laying out common or useful configuration options to be presented 
when setting up or reconfiguring Slurm. I'm certain I have inefficient or 
missing options that I should have.


IMHO, most sites have their own requirements and preferences, so I don't 
think there is a one-size-fits-all Slurm installation solution.


Since requirements can be so different, and because Slurm is a fantastic 
software that can be configured for many different scenarios, IMHO a 
support contract with SchedMD is the best way to get consulting services, 
get general help, and report bugs.  We have excellent experiences with 
SchedMD support (https://www.schedmd.com/support.php).


Best regards,
Ole

On Thu, Apr 20, 2023 at 2:11 AM Ole Holm Nielsen 
<ole.h.niel...@fysik.dtu.dk> wrote:


Hi Hoot,

On 4/20/23 00:15, Hoot Thompson wrote:
 > Is there a ‘how to’ or recipe document for setting up and enforcing
resource limits? I can establish accounts, users, and set limits but
'current value' is not incrementing after running jobs.

I have written about resource limits in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#partition-limits 




Re: [slurm-users] Resource LImits

2023-04-20 Thread Ole Holm Nielsen

On 20-04-2023 18:23, Hoot Thompson wrote:

Ole,

Earlier I found your Slurm_tools posting and found it very useful. This 
remains my problem, ‘current value’ not incrementing even after making 
needed changes to slurm.conf.


The ‘current value’ refers to those jobs that are currently running. 
Does that answer your question?


/Ole


./showuserlimits -u ubuntu

scontrol -o show assoc_mgr users=ubuntu account=testing flags=Assoc

Association (Parent account):

ClusterName = dev-uid-testing

Account = testing

UserName =

Partition =

Priority = 0

ID = 6

SharesRaw/Norm/Level/Factor = 1/18446744073709551616.00/1/0.00

UsageRaw/Norm/Efctv = 0.00/1.00/0.00

ParentAccount = root, current value = 1

Lft = 2

DefAssoc = No

GrpJobs =

GrpJobsAccrue =

GrpSubmitJobs =

GrpWall =

GrpTRES =

cpu:Limit = 1500, current value = 0

GrpTRESMins =

GrpTRESRunMins =

MaxJobs =

MaxJobsAccrue =

MaxSubmitJobs =

MaxWallPJ =

MaxTRESPJ =

MaxTRESPN =

MaxTRESMinsPJ =

MinPrioThresh =

Association (User):

ClusterName = dev-uid-testing

Account = testing

UserName = ubuntu, UID=1000

Partition =

Priority = 0

ID = 9

SharesRaw/Norm/Level/Factor = 1/18446744073709551616.00/1/0.00

UsageRaw/Norm/Efctv = 0.00/1.00/0.00

ParentAccount =

Lft = 3

DefAssoc = Yes

GrpJobs =

GrpJobsAccrue =

GrpSubmitJobs =

GrpWall =

GrpTRES =

cpu:Limit = 1500, current value = 0

GrpTRESMins =

cpu:Limit = 1000, current value = 0

GrpTRESRunMins =

MaxJobs =

MaxJobsAccrue =

MaxSubmitJobs =

MaxWallPJ =

MaxTRESPJ =

MaxTRESPN =

MaxTRESMinsPJ =

MinPrioThresh =

Slurm share information:

Account    User    RawShares  NormShares  RawUsage  EffectvUsage  FairShare
---------- ------- ---------- ----------- --------- ------------- ----------
testing    ubuntu  1                      0         0.00          0.00



Clearly I’m still missing something or I don’t understand how it’s 
supposed to work.


Hoot



On Apr 20, 2023, at 2:10 AM, Ole Holm Nielsen 
 wrote:


Hi Hoot,

On 4/20/23 00:15, Hoot Thompson wrote:
Is there a ‘how to’ or recipe document for setting up and enforcing 
resource limits? I can establish accounts, users, and set limits but 
'current value' is not incrementing after running jobs.


I have written about resource limits in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#partition-limits





Re: [slurm-users] Resource LImits

2023-04-20 Thread Ole Holm Nielsen

Hi Hoot,

On 4/20/23 00:15, Hoot Thompson wrote:

Is there a ‘how to’ or recipe document for setting up and enforcing resource 
limits? I can establish accounts, users, and set limits but 'current value' is 
not incrementing after running jobs.


I have written about resource limits in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#partition-limits

IHTH,
Ole



Re: [slurm-users] Submit sbatch to multiple partitions

2023-04-17 Thread Ole Holm Nielsen

On 4/17/23 11:36, Xaver Stiensmeier wrote:

let's say I want to submit a large batch job that should run on 8 nodes.
I have two partitions, each holding 4 nodes. Slurm will now tell me that
"Requested node configuration is not available". However, my desired
output would be that slurm makes use of both partitions and allocates
all 8 nodes.


A compute node can be a member of multiple partitions, this is how you can 
handle your case.


Suppose you have 4 nodes in part1 and 4 nodes in part2.  Then you can 
create a new partition "partbig" which contains all 8 nodes.


You may want to configure restrictions on "partbig" if you don't want 
every user to submit to it, or configure a lower maximum time for jobs.
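
A minimal slurm.conf sketch of that layout (node and partition names, the 
group and the time limit are made up):

PartitionName=part1   Nodes=node[01-04]
PartitionName=part2   Nodes=node[05-08]
PartitionName=partbig Nodes=node[01-08] MaxTime=1-00:00:00 AllowGroups=biggroup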


I hope this helps,
Ole



Re: [slurm-users] Slurmdbd High Availability

2023-04-13 Thread Ole Holm Nielsen

On 4/13/23 11:49, Shaghuf Rahman wrote:

I am setting up Slurmdb in my system and I need some inputs

My current setup is like
server1 : 192.168.123.12(slurmctld)
server2: 192.168.123.13(Slurmctld)
server3: 192.168.123.14(Slurmdbd) which is pointing to both Server1 and 
Server2.

database: MySQL

I have 1 more server named as server 4: 192.168.123.15 which I need to 
make it as a secondary database server. I want to configure this server4 
which will sync the database and make it either Active-Active slurmdbd or 
Active-Passive.


Could anyone please help me with the *steps* how to configure and also how 
am i going to *sync* my *database* on both the servers simultaneously.


Slurm administrators have different opinions about the usefulness versus 
complexity of HA setups.  You could read SchedMD's presentation from page 
38 and onwards: https://slurm.schedmd.com/SLUG19/Field_Notes_3.pdf


Some noteworthy slides state:


Separating slurmctld and slurmdbd in normal production use
is recommended.
Master/backup slurmctld is common, and - as long as the
performance for StateSaveLocation is kept high - not that
difficult to implement.



For slurmdbd, the critical element in the failure domain is
MySQL, not slurmdbd. slurmdbd itself is stateless.



IMNSHO, the additional complexity of a redundant MySQL
deployment is more likely to cause an outage than it is to
prevent one.
So don’t bother setting up a redundant slurmdbd, keep
slurmdbd + MySQL local to a single server.


I hope this helps.

/Ole



Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-29 Thread Ole Holm Nielsen

Hi Thomas,

I think the Slurm power_save is not problematic for us with bare-metal 
on-premise nodes, in contrast to the problems you're having.


We use power_save with on-premise nodes where we control the power down/up 
by means of IPMI commands as provided in the scripts which you will find 
in https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save
There's no hokus-pocus once the IPMI commands are working correctly with 
your nodes.
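
For reference, the kind of command the scripts boil down to is (BMC hostname, 
user and password file are hypothetical):

ipmitool -I lanplus -H node001-bmc -U admin -f /etc/ipmi-password chassis power status
ipmitool -I lanplus -H node001-bmc -U admin -f /etc/ipmi-password chassis power on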


Of course, our slurmctld server can communicate with our IPMI management 
network to perform power management.  I don't see this network access as a 
security problem.


I think we had power_save with IPMI working also in Slurm 21.08 before we 
upgraded to 22.05.


As for job scheduling, slurmctld may allocate a job to some powered-off 
nodes and then calls the ResumeProgram defined in slurm.conf.  From this 
point it may indeed take 2-3 minutes before a node is up and running 
slurmd, during which time it will have a state of POWERING_UP (see "man 
sinfo").  If this doesn't complete after ResumeTimeout seconds, the node 
will go into a failed state.  All this logic seems to be working well.
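
A quick way to watch this from a login node is something like:

sinfo -N -o "%N %t"

where powered-off nodes get a "~" suffix (e.g. idle~) and nodes that are 
powering up get a "#" suffix, as described under NODE STATE CODES in the sinfo 
man page.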


If you would like to try out the above mentioned IPMI scripts, you could 
test them on a node on your IPMI network to see if you can reliably power 
some nodes up and down.  If this works, hopefully you could configure 
slurmctld so that it executes the scripts (note: it will be run by the 
"slurm" user).


Best regards,
Ole


On 3/29/23 14:16, Dr. Thomas Orgis wrote:

On Mon, 27 Mar 2023 13:17:01 +0200, Ole Holm Nielsen wrote:


FYI: Slurm power_save works very well for us without the issues that you
describe below.  We run Slurm 22.05.8, what's your version?


I'm sure that there are setups where it works nicely;-) For us, it
didn't, and I was faced with hunting the bug in slurm or working around
it with more control, fixing the underlying issue of the node resume
script being called _after_ the job has been allocated to the node.
That is too late in case of node bootup failure and causes annoying
delays for users only to see jobs fail.

We do run 21.08.8-2, which means any debugging of this on the slurm
side would mean upgrading first (we don't upgrade just for upgrade's
sake). And, as I said: The issue of the wrong timing remains, unless I
try deeper changes in slurm's logic. The other issue is that we had a
kludge in place, anyway, to enable slurmctld to power on nodes via
IPMI. The machine slurmctld runs on has no access to the IPMI network
itself, so we had to build a polling communication channel to the node
which has this access (and which is on another security layer, hence no
ssh into it). For all I know, this communication kludge is not to
blame, as, in the spurious failures, the nodes did boot up just fine
and were ready. Only slurmctld decided to let the timeout pass first,
then recognize that the slurmd on the node is there, right that instant.

Did your power up/down script workflow work with earlier slurm
versions, too? Did you use it on bare metal servers or mostly on cloud
instances?

Do you see a chance for

a) fixing up the internal powersaving logic to properly allocating
nodes to a job only when these nodes are actually present (ideally,
with a health check passing) or
b) designing an interface between slurm as manager of available
resources and another site-specific service responsible for off-/onlining
resources that are known to slurm, but down/drained?

My view is that Slurm's task is to distribute resources among users.
The cluster manager (person or $MIGHTY_SITE_SPECIFIC_SOFTWARE) decides
if a node is currently available to Slurm or down for maintenance, for
example. Power saving would be another reason for a node being taken
out of service.

Maybe I got an old-fashioned minority view …


Alrighty then,

Thomas

PS: I guess solution a) above goes against Slurm's focus on throughput
and avoiding delays caused by synchronization points, while our idea here
is that batch jobs where that matters should be written differently,
packing more than a few seconds worth of work into each step.






Re: [slurm-users] [External] Power saving method selection for different kinds of hardware

2023-03-27 Thread Ole Holm Nielsen

Hi Prentice,

Since the last message I figured out a way to implement power_save:

I've documented our setup in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
This page contains a link to power_save scripts on GitHub.

Best regards,
Ole

On 27-03-2023 19:35, Prentice Bisbal wrote:
I'm just catching up on old mailing list messages now. Why not make your 
SuspendProgram and ResumePrograms be shell scripts that look at some 
node information in Slurm (look at the features as in your example) or 
some other source ( use a case statement based on node names) and call 
the correct suspend/resume command based on that?


I agree that attaching this metadata in the node definition and have 
slurm act on it directly is the best solution, but in the meantime, 
having a shell script that can figure out the correct way to 
suspend/resume each host should be very doable, if not ideal.


Prentice

On 11/8/22 09:36, Ole Holm Nielsen wrote:
I'm thinking about the best way to configure power saving (see 
https://slurm.schedmd.com/power_save.html) when we have different 
types of node hardware whose power state have to be managed differently:


1. Nodes with a BMC NIC interface where "ipmitool chassis power ..." 
commands can be used.


2. Nodes where the BMC cannot be used for powering up due to the 
shared NICs going down when the node is off :-(


3. Cloud nodes where special cloud CLI commands must be used (such as 
Azure CLI).


The slurm.conf only permits one SuspendProgram and one ResumeProgram 
which then need to figure out the cases listed above and perform 
appropriate actions.


I was thinking to add a node feature to indicate the kind of power 
control mechanism available, for example along these lines for the 3 
above cases:


Nodename=node001 Feature=power_ipmi
Nodename=node002 Feature=power_none
Nodename=node003 Feature=power_azure

The node feature might be inquired in the SuspendProgram and 
ResumeProgram and jump to separate branches of the script for power 
control commands.
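
A rough sketch of such a ResumeProgram (feature names as above; the BMC 
naming, credentials and Azure resource group are hypothetical, and a real 
script would first expand the hostlist with "scontrol show hostnames"):

#!/bin/bash
node="$1"    # for simplicity, assume a single node name was passed
feature=$(sinfo -h -n "$node" -o "%f")    # the node's feature list
case "$feature" in
  *power_ipmi*)  ipmitool -I lanplus -H "${node}-bmc" -U admin -f /etc/ipmi-password chassis power on ;;
  *power_azure*) az vm start --name "$node" --resource-group my-rg ;;
  *power_none*)  echo "No remote power-on method for $node" >&2 ;;
esac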


Question: Has anyone thought of a similar or better way to handle 
power saving for different types of nodes?


Thanks,
Ole










Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-27 Thread Ole Holm Nielsen

Hi Thomas,

FYI: Slurm power_save works very well for us without the issues that you 
describe below.  We run Slurm 22.05.8, what's your version?


I've documented our setup in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
This page contains a link to power_save scripts on GitHub.

IHTH,
Ole

On 3/27/23 12:57, Dr. Thomas Orgis wrote:

On Mon, 06 Mar 2023 13:35:38 +0100, Stefan Staeglich wrote:


But this fixed not the main error but might have reduced the frequency of
occurring. Has someone observed similar issues? We will try a higher
SuspendTimeout.


We had issues with power saving. We powered the idle nodes off, causing
a full boot to resume. We observed repeatedly the strange behaviour
that the node is present for a while, but only detected by slurmctld as
being ready right when it is giving up with SuspendTimeout.

But instead of fixing this possibly subtle logic error, we figured that

a) The node suspend support in Slurm was not really designed for full
power off/on, which can take minutes regularily.

b) This functionality of taking nodes out of/into production is
something the cluster admin does. This is not in the scope of the
batch system.

Hence I wrote a script that runs as a service on a superior admin node.
It queries Slurm for idle nodes and pending jobs and then decides which
nodes to drain and then power down or bring back online.

This needs more knowledge on Slurm job and node states than I'd like,
but it works. Ideally, I'd like the powersaving feature of slurm
consisting of a simple interface that can communicate

1. which nodes are probably not needed in the coming x minutes/hours,
depending on the job queue, with settings like keeping a minimum number
of nodes idle, and
2. which nodes that are currently drained/offline it could use to satisfy
user demand.

I imagine that Slurm upstream is not very keen on hashing out a robust
interface for that. I can see arguments for keeping this wholly
internal to Slurm, but for me, taking nodes in/out of production is not
directly a batch system's task. Obviously the integration of power
saving that involves nodes really being powered down brings
complications like the strange ResumeTimeout behaviour. Also, in the
case of node that have trouble getting back online, the method inside
Slurm provides for a bad user experience:

The nodes are first allocated to the job, and _then_ they are powered
up. In the worst case of a defective node, Slurm will wait for the
whole SuspendTimeout just to realize that it doesn't really have the
resources it just promised to the job, making the job run attempt fail
needlessly.

With my external approach, the handling of bringing a node back up is
done outside slurmctld. Only after a node is back, it is undrained and
jobs will be allocated on it. I use a draining with a specific reason
to mark nodes that are offline due to power saving. What sucks is that
I have to implement part of the scheduler in the sense that I need to
match pending jobs' demands against properties of available nodes.

Maybe the internal powersaving could be made more robust, but I would
rather like to see more separation of concerns than putting everything
into one box. Things are too intertangled, even with my simple concept
of 'job' not beginning to describe what Slurm has in terms of various
steps as scheduling entities that by default also use delayed
allocation techniques (regarding prolog script behaviour, for example).


Alrighty then,

Thomas



--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark



Re: [slurm-users] Cleanup of job_container/tmpfs

2023-03-07 Thread Ole Holm Nielsen

Hi Brian,

Presumably the users' home directory is NFS automounted using autofs, and 
therefore it doesn't exist when the job starts.


The job_container/tmpfs plugin ought to work correctly with autofs, but 
maybe this is still broken in 23.02?


/Ole


On 3/6/23 21:06, Brian Andrus wrote:

That looks like the users' home directory doesn't exist on the node.

If you are not using a shared home for the nodes, your onboarding process 
should be looked at to ensure it can handle any issues that may arise.


If you are using a shared home, you should do the above and have the node 
ensure the shared filesystems are mounted before allowing jobs.


-Brian Andrus

On 3/6/2023 1:15 AM, Niels Carl W. Hansen wrote:

Hi all

Seems there still are some issues with the autofs - job_container/tmpfs 
functionality in Slurm 23.02.
If the required directories aren't mounted on the allocated node(s) 
before jobstart, we get:


slurmstepd: error: couldn't chdir to `/users/lutest': No such file or 
directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/users/lutest': No such file or 
directory: going to /tmp instead


An easy workaround however, is to include this line in the slurm prolog 
on the slurmd -nodes:


/usr/bin/su - $SLURM_JOB_USER -c /usr/bin/true

-but there might exist a better way to solve the problem?




Re: [slurm-users] Cleanup of job_container/tmpfs

2023-03-01 Thread Ole Holm Nielsen

Hi Jason,

IMHO, the job_container/tmpfs is not working well in Slurm 22.05, but 
there may be some significant improvements included in 23.02 (announced 
yesterday).  I've documented our experiences in the Wiki page

https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#temporary-job-directories
This page contains links to bug reports against the job_container/tmpfs 
plugin.


We're using the auto_tmpdir SPANK plugin with great success in Slurm 22.05.

Best regards,
Ole


On 01-03-2023 03:27, Jason Ellul wrote:
We have recently moved to slurm 22.05.8 and have configured 
job_container/tmpfs to allow private tmp folders.


job_container.conf contains:

AutoBasePath=true

BasePath=/slurm

And in slurm.conf we have set

JobContainerType=job_container/tmpfs

I can see the folders being created and they are being used but when a 
job completes the root folder is not being cleaned up.


Example of running job:

[root@papr-res-compute204 ~]# ls -al /slurm/14292874

total 32

drwx--   3 root  root    34 Mar  1 13:16 .

drwxr-xr-x 518 root  root 16384 Mar  1 13:16 ..

drwx--   2 mzethoven root 6 Mar  1 13:16 .14292874

-r--r--r--   1 root  root 0 Mar  1 13:16 .ns

Example once job completes /slurm/ remains:

[root@papr-res-compute204 ~]# ls -al /slurm/14292794

total 32

drwx--   2 root root 6 Mar  1 09:33 .

drwxr-xr-x 518 root root 16384 Mar  1 13:16 ..

Is this to be expected or should the folder /slurm/ also be removed?

Do I need to create an epilog script to remove the directory that is left?




Re: [slurm-users] snakemake and slurm in general

2023-02-23 Thread Ole Holm Nielsen

On 2/23/23 17:07, David Laehnemann wrote:

In addition, there are very clear limits to how many jobs slurm can
handle in its queue, see for example this discussion:
https://bugs.schedmd.com/show_bug.cgi?id=2366


My 2 cents: Slurm's job limits are configurable, see this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#maxjobcount-limit
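
For illustration, the relevant slurm.conf settings could look like this 
(the numbers are placeholders, not recommendations):

# slurm.conf excerpt (illustrative values only)
MaxJobCount=100000     # jobs slurmctld keeps in its active database at one time
MaxArraySize=10001     # maximum job array size (highest index is one less)
# Per-user/per-account submit limits live in the accounting database, e.g.:
#   sacctmgr modify account name=myaccount set MaxSubmitJobs=5000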

/Ole



Re: [slurm-users] Unit Testing job_submit.lua

2023-02-18 Thread Ole Holm Nielsen

On 17-02-2023 23:10, Groner, Rob wrote:
I'm trying to setup some testing of our job_submit.lua plugin so I can 
verify that changes I make to it don't break anything.


I looked into luaunit for testing, and that seems like it would do what 
I need: let me set the value of inputs, call the slurm_job_submit() 
function with them, and then evaluate the results.


But the setting of inputs is causing me issues.  I know plenty of C, and 
very little of Lua.  I see that the slurm_job_submit() function gets 
passed a job_desc_msg_t variable or pointer (or at least, I ASSUME it's 
that) and somehow knows what's inside it, even though "slurm.h" isn't 
included anywhere.  When I try to create a local variable of that type 
and set some values in it, it doesn't go well.


Is there an easier way to unit test that file and the slurm_job_submit() 
function in particular?


FWIW, I've written about my experiences with the job_submit.lua plugin 
in the Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#job-submit-plugins


My example plugin at 
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/plugins may 
perhaps serve as inspiration.


Maybe you'll find some of that information useful?

/Ole



Re: [slurm-users] Lua plugin job_desc fields

2023-02-08 Thread Ole Holm Nielsen
I agree with Valantis' answer below.  I've collected some information 
and experiences with the job_submit_plugin in my Wiki page:

https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#job-submit-plugins

FYI: My example plugin Lua script is in 
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/plugins


/Ole


On 08-02-2023 17:46, Chrysovalantis Paschoulas wrote:
AFAIK no, I am always checking 
"src/plugins/job_submit/lua/job_submit_lua.c" ;)


Cheers,
Valantis


On 2/8/23 17:22, Phill Harvey-Smith wrote:

Hi all,

Is there any documentation (or even just a list), of all the fields 
that are available in the job_desc parameter to slurm_job_submit in 
the job_submit.lua plugin?





Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-06 Thread Ole Holm Nielsen

I would agree with Florian about using the Slurm power_save method.

In the Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving 
there are additional details and scripts for performing node suspend and 
resume.


You would need the server to have a BMC so that you can power it down 
and up using IPMI commands from your Slurm management server.
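
As a rough sketch of the moving parts (node names, BMC addresses and IPMI 
credentials below are placeholders, and ipmitool is assumed to be available 
on the management server):

# slurm.conf excerpt (illustrative)
SuspendProgram=/usr/local/sbin/nodesuspend
ResumeProgram=/usr/local/sbin/noderesume
SuspendTime=1800        # power a node down after 30 minutes idle
ResumeTimeout=600       # allow up to 10 minutes for a node to boot back up

#!/bin/bash
# /usr/local/sbin/nodesuspend (sketch): Slurm passes the node list as $1.
# The matching resume script is identical except for "chassis power on".
for node in $(scontrol show hostnames "$1"); do
    ipmitool -I lanplus -H "${node}-bmc" -U admin -P secret chassis power soft
done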


/Ole


On 06-02-2023 21:07, Florian Zillner wrote:
follow this guide: https://slurm.schedmd.com/power_save.html 



Create poweroff / poweron scripts and configure slurm to do the poweroff 
after X minutes. Works well for us. Make sure to set an appropriate time 
(ResumeTimeout) to allow the node to come back to service.
Note that we did not achieve good power savings by suspending the 
nodes; powering them off and on saves far more power. The downside is that it 
takes ~5 minutes to resume (= power on) the nodes when needed.


Cheers,
Florian

*From:* slurm-users  on behalf of 
Analabha Roy 

*Sent:* Monday, 6 February 2023 18:21
*To:* slurm-users@lists.schedmd.com 
*Subject:* [External] [slurm-users] Hibernating a whole cluster
Hi,

I've just finished setting up a single-node "cluster" with Slurm on 
ubuntu 20.04. Infrastructural limitations prevent me from running it 
24/7, and it's only powered on during business hours.



Currently, I have a cron job running that hibernates that sole node 
before closing time.


The hibernation is done with standard systemd, and hibernates to the 
swap partition.


  I have not run any lengthy slurm jobs on it yet. Before I do, can I 
get some thoughts on a couple of things?


If it hibernated when slurm still had jobs running/queued, would they 
resume properly when the machine powers back on?


Note that my swap space is bigger than my  RAM.

Is it necessary to perhaps set up a pre-hibernate script for systemd that 
iterates scontrol to suspend all the jobs before hibernating and resumes 
them after waking up?


What about the wall times? I'm guessing that Slurm will count the 
downtime as elapsed time for each job. Is there a way to configure this, or is 
the only alternative a post-hibernate script that iteratively updates 
the wall times of the running jobs using scontrol again?
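
In case it helps the discussion, here is a sketch of what such a hook could 
do on a single-node cluster (this is untested; systemd runs scripts in 
/usr/lib/systemd/system-sleep/ with "pre" or "post" as the first argument, 
and whether suspended time counts against a job's time limit should be 
verified on your Slurm version):

#!/bin/bash
# Sketch of a systemd system-sleep hook: "$1" is "pre" before hibernation
# and "post" after resume.
case "$1" in
    pre)
        # Suspend all running jobs before the node hibernates.
        for j in $(squeue -h -t RUNNING -o %A); do
            scontrol suspend "$j"
        done
        ;;
    post)
        # Resume the suspended jobs once the node is awake again.
        for j in $(squeue -h -t SUSPENDED -o %A); do
            scontrol resume "$j"
        done
        ;;
esac
exit 0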





Re: [slurm-users] node health check

2023-01-30 Thread Ole Holm Nielsen

On 1/31/23 04:35, Ratnasamy, Fritz wrote:
  Currently, some of our nodes are overloaded. The NHC we installed used to 
check the load and drain a node when it is overloaded. However, for the 
past few days, it has not been showing the state of the node. When I run 
/usr/sbin/nhc manually, it says
20230130 21:25:14 [slurm] /usr/libexec/nhc/node-mark-online mcn26.chicagobooth.edu
/usr/libexec/nhc/node-mark-online:  Not sure how to handle node state "" on mcn26.chicagobooth.edu
/usr/libexec/nhc/node-mark-online:  Skipping node mcn26.chicagobooth.edu ( )


It seems that it is not able to read the state of the node. I ran scontrol 
show node mcn26

NodeName=mcn26 Arch=x86_64 CoresPerSocket=16
    NodeAddr=mcn26 NodeHostName=mcn26 Version=20.11.8

Any idea what happened and why nhc is not reading the state of the node 
anymore?


What's the complete output of "scontrol show node mcn26", especially the 
State=... information?


Which version of NHC are you running?

/Ole







Re: [slurm-users] Install & Configuration of slurmdbd

2023-01-30 Thread Ole Holm Nielsen

Hi Jim,

Maybe you'll find these Wiki pages relevant for setting up your Slurm 
database:


https://wiki.fysik.dtu.dk/Niflheim_system/
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_database/
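
As a rough outline of where you end up, a minimal slurmdbd.conf might 
contain something like this (host names, user and password are placeholders; 
the Wiki page above describes the full procedure, including creating the 
database and granting privileges):

# slurmdbd.conf sketch (illustrative values)
DbdHost=dbserver
SlurmUser=slurm
AuthType=auth/munge
StorageType=accounting_storage/mysql
StorageHost=localhost        # where MariaDB listens
StorageUser=slurm
StoragePass=CHANGEME
StorageLoc=slurm_acct_db     # database name
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid

# and in slurm.conf, point accounting at the slurmdbd host:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbserver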

/Ole

On 1/30/23 20:43, Jim Klo wrote:
I’ve been working on updating our small slurm cluster over the last few 
days.  I’ve successfully updated the cluster. However our cluster is 
missing the slurmdbd configuration, and while I know it’s not required, I 
would like to add that as it would be helpful to access job history 
details and potentially manage and prioritize resources with Fair Share as 
we scale up.


Anyway, I've gone through https://slurm.schedmd.com/accounting.html and, as 
opposed to other docs in the project, the details on configuring accounting 
are rather vague. I'm a bit confused as to the exact steps, and what 
prerequisites and dependencies are needed.  Are there better instructions on 
setting up and configuring slurmdbd on an existing cluster?


Initial questions I have:

  * When building Slurm, did I need to have libmariadbd-dev (or other
MariaDB client libs) present? Is this only needed by the slurmdbd
node, or do slurmd / slurmctld also need it?
  * Are there any limitations on how MariaDB is run? I typically run
containerized MariaDB so it can be easily backed up, moved, etc.  I
see the configure script has a `--with-mysql_conf` option; if MariaDB runs
containerized on a different system, does this path need to be
accessible after configure?

Any assistance is greatly appreciated.

Thanks,

Jim




Re: [slurm-users] job_container/tmpfs and autofs

2023-01-12 Thread Ole Holm Nielsen

Hi Magnus,

We had the same challenge some time ago.  A long description of solutions 
is in my Wiki page at 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#temporary-job-directories


The issue may have been solved in 
https://bugs.schedmd.com/show_bug.cgi?id=12567 which will be in Slurm 23.02.


At this time, the auto_tmpdir SPANK plugin seems to be the best solution.

IHTH,
Ole

On 1/12/23 08:49, Hagdorn, Magnus Karl Moritz wrote:

Hi there,
we excitedly found the job_container/tmpfs plugin which neatly allows
us to provide local scratch space and a way of ensuring that /dev/shm
gets cleaned up after a job finishes. Unfortunately we found that it
does not play nicely with autofs which we use to provide networked
project and scratch directories. We found that this is a known issue
[1]. I was wondering if that has been solved? I think it would be
really useful to have a warning about this issue in the documentation
for the job_container/tmpfs plugin.
Regards
magnus

[1]
https://cernvm-forum.cern.ch/t/intermittent-client-failures-too-many-levels-of-symbolic-links/156/4


--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark



Re: [slurm-users] [External] ERROR: slurmctld: auth/munge: _print_cred: DECODED

2022-12-01 Thread Ole Holm Nielsen

Hi Nousheen,

It seems that you have configured the nodes incorrectly in slurm.conf.  I 
notice this:


  RealMemory=1

This means 1 Megabyte of RAM memory; we only had that little with IBM PCs 
back in the 1980s :-)


See how to configure nodes in 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#compute-node-configuration


You must run "slurmd -C" on each node to determine its actual hardware.
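
For example (the values below are made up; each node reports its own, and 
the exact set of fields can vary a little between Slurm versions):

# On the compute node, run:
slurmd -C
# which prints something like:
NodeName=node101 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=31906
# Copy that line into slurm.conf, typically rounding RealMemory down a little
# to leave headroom for the OS:
NodeName=node101 Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=31000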

I hope this helps.

/Ole

On 12/1/22 21:08, Nousheen wrote:

Dear Robbert,

Thank you so much for your response. I was so focused on time synchronization 
that I missed that the date on one of the nodes was one day behind, as you 
said. I have corrected it and now I get the following output in status.


*(base) [nousheen@nousheen slurm]$ systemctl status slurmctld.service -l*
● slurmctld.service - Slurm controller daemon
    Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor 
preset: disabled)

    Active: active (running) since Thu 2022-12-01 21:37:34 PKT; 20min ago
  Main PID: 19475 (slurmctld)
     Tasks: 10
    Memory: 4.5M
    CGroup: /system.slice/slurmctld.service
            ├─19475 /usr/sbin/slurmctld -D -s
            └─19538 slurmctld: slurmscriptd

Dec 01 21:47:08 nousheen slurmctld[19475]: slurmctld: sched/backfill: 
_start_job: Started JobId=106 in debug on 101
Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: 
JobId=106 WEXITSTATUS 1
Dec 01 21:47:09 nousheen slurmctld[19475]: slurmctld: _job_complete: 
JobId=106 done
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate 
JobId=107 NodeList=101 #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate 
JobId=108 NodeList=105 #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: sched: Allocate 
JobId=109 NodeList=nousheen #CPUs=8 Partition=debug
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: 
JobId=107 WEXITSTATUS 1
Dec 01 21:47:11 nousheen slurmctld[19475]: slurmctld: _job_complete: 
JobId=107 done
Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: 
JobId=108 WEXITSTATUS 1
Dec 01 21:47:12 nousheen slurmctld[19475]: slurmctld: _job_complete: 
JobId=108 done


I have four nodes in total, one of which is the server node. After submitting 
a job, the job only runs on my server compute node while all the other 
nodes are IDLE, DOWN or not responding. The details are given below:


*(base) [nousheen@nousheen slurm]$ scontrol show nodes*
NodeName=101 Arch=x86_64 CoresPerSocket=6
    CPUAlloc=0 CPUTot=12 CPULoad=0.01
    AvailableFeatures=(null)
    ActiveFeatures=(null)
    Gres=(null)
    NodeAddr=192.168.60.101 NodeHostName=101 Version=21.08.4
    OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
    RealMemory=1 AllocMem=0 FreeMem=641 Sockets=1 Boards=1
    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    Partitions=debug
    BootTime=2022-11-24T11:18:28 SlurmdStartTime=2022-12-01T21:34:57
    LastBusyTime=2022-12-02T00:58:31
    CfgTRES=cpu=12,mem=1M,billing=12
    AllocTRES=
    CapWatts=n/a
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=104 CoresPerSocket=6
    CPUAlloc=0 CPUTot=12 CPULoad=N/A
    AvailableFeatures=(null)
    ActiveFeatures=(null)
    Gres=(null)
    NodeAddr=192.168.60.114 NodeHostName=104
    RealMemory=1 AllocMem=0 FreeMem=N/A Sockets=1 Boards=1
    State=DOWN+NOT_RESPONDING ThreadsPerCore=2 TmpDisk=0 Weight=1 
Owner=N/A MCS_label=N/A

    Partitions=debug
    BootTime=None SlurmdStartTime=None
    LastBusyTime=2022-12-01T21:37:35
    CfgTRES=cpu=12,mem=1M,billing=12
    AllocTRES=
    CapWatts=n/a
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
    Reason=Not responding [slurm@2022-12-01T16:22:28]

NodeName=105 Arch=x86_64 CoresPerSocket=6
    CPUAlloc=0 CPUTot=12 CPULoad=1.08
    AvailableFeatures=(null)
    ActiveFeatures=(null)
    Gres=(null)
    NodeAddr=192.168.60.115 NodeHostName=105 Version=21.08.4
    OS=Linux 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022
    RealMemory=1 AllocMem=0 FreeMem=20723 Sockets=1 Boards=1
    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    Partitions=debug
    BootTime=2022-11-24T11:15:37 SlurmdStartTime=2022-12-01T16:15:30
    LastBusyTime=2022-12-01T21:47:11
    CfgTRES=cpu=12,mem=1M,billing=12
    AllocTRES=
    CapWatts=n/a
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=nousheen Arch=x86_64 CoresPerSocket=6
    CPUAlloc=8 CPUTot=12 CPULoad=6.73
    AvailableFeatures=(null)
    ActiveFeatures=(null)
    Gres=(null)
    NodeAddr=192.168.60.149 NodeHostName=nousheen Version=21.08.5
    OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
    RealMemory=1 AllocMem=0 FreeMem=22736 Sockets=1 Boards=1
    State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    

Re: [slurm-users] [External] Re: Per-user TRES summary?

2022-11-29 Thread Ole Holm Nielsen

Hi Mike,

That sounds great!  It seems to me that "showuserlimits -q " would 
also print the QOS information, but maybe this is not what you are 
after?  Have you tried this -q option, or should the script perhaps be 
generalized to cover your needs?


/Ole


On 29-11-2022 14:39, Pacey, Mike wrote:

Hi Ole (and Jeffrey),

Thanks for the pointer - those are some very useful scripts. I couldn't get 
showslurmlimits or showslurmjobs to get quite what I was after (it wasn't 
showing me memory usage). However, it pointed me in the right direction - the 
scontrol command. I can run the following:

scontrol show assoc_mgr flags=qos

and part of the output reads:

 User Limits
   [myuid]
MaxJobsPU=N(2) MaxJobsAccruePU=N(0) MaxSubmitJobsPU=N(2)

MaxTRESPU=cpu=80(2),mem=327680(1000),energy=N(0),node=N(1),billing=N(2),fs/disk=N(0),vmem=N(0),pages=N(0)

Which is exactly what I'm looking for. The values outside the brackets are the 
qos limit, and the values within are the current usage.
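
For a quick single-user view one can simply filter that output, e.g. (a 
sketch based on the layout shown above; the exact formatting may differ 
between Slurm versions):

# Print the "User Limits" block for one user (replace myuid):
scontrol show assoc_mgr flags=qos | grep -A 2 -w "myuid"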

Regards,
Mike

-Original Message-
From: slurm-users  On Behalf Of Ole Holm 
Nielsen
Sent: 28 November 2022 18:58
To: slurm-users@lists.schedmd.com
Subject: [External] Re: [slurm-users] Per-user TRES summary?


Hi Mike,

Would the "showuserlimits" tool give you the desired information?  Check out 
https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FOleHolmNielsen%2FSlurm_tools%2Ftree%2Fmaster%2Fshowuserlimitsdata=05%7C01%7Cpacey%40live.lancs.ac.uk%7Cbea74c16c0b34468c68908dad174dacc%7C9c9bcd11977a4e9ca9a0bc734090164a%7C0%7C0%7C638052597059366026%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7Csdata=9pKyND95SW9Z1E%2BjGPsGKUKwTZIqj3juwWGQ4d5AWRw%3Dreserved=0

/Ole


On 28-11-2022 16:16, Pacey, Mike wrote:

Does anyone have suggestions as to how to produce a summary of a
user's TRES resources for running jobs? I'd like to be able to see how
each user is faring against their QOS resource limits. (I'm looking
for something functionally equivalent to Grid Engine's qquota
command.) The info must be in the scheduler somewhere in order for it
to enforce QOS TRES limits, but as a Slurm novice I've not found any way to do 
this.

To summarise TRES qos limits I can do this:

% sacctmgr list qos format=Name,MaxTRESPerUser%50

      Name                                          MaxTRESPU
---------- --------------------------------------------------
    normal                                    cpu=80,mem=320G

But to work out what a user is consuming in currently running
jobs, the nearest I can get is:
% sacct -X -s R --units=G -o User,ReqTRES%50

     User                                            ReqTRES
--------- --------------------------------------------------
    pacey                   billing=1,cpu=1,mem=0.49G,node=1
    pacey                   billing=1,cpu=1,mem=0.49G,node=1

With a little scripting I can sum those up, but there might be a
neater way to do this?






