Re: [slurm-users] slurm communication between versions
Hi Felix,

On 11/23/23 18:14, Felix wrote:
> Will slurm-20.02, which is installed on a management node, communicate with slurm-22.05 installed on the work nodes? They have the same configuration file slurm.conf. Or do the versions have to be the same? Slurm 20.02 was installed manually and slurm 22.05 was installed through dnf.

Slurm versions in a cluster may only differ by two major releases. The 22.05 slurmctld can therefore only work with slurmd 22.05, 21.08 and 20.11. The documentation is at https://slurm.schedmd.com/quickstart_admin.html#upgrade

My Slurm Wiki page explains the details of how to upgrade Slurm versions:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm

Please note that the current version of Slurm is now 23.11, and that only 23.11 and 23.02 are supported. There are important security fixes that make it important to upgrade to a supported version of Slurm.

I hope this helps you to make the upgrades.

Best regards,
Ole
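Ole's two-release rule can be sketched as a small shell check. This is only an illustration: the release table is hardcoded from the versions mentioned in this thread, and the helper names (`release_index`, `compatible`) are made up for the example.

```shell
#!/bin/sh
# Sketch of Slurm's compatibility rule: slurmd may be the same major
# release as slurmctld, or up to two releases OLDER, but never newer.

# Map a Slurm version to a sortable release index (chronological order).
release_index() {
    case "$1" in
        20.02) echo 0 ;;
        20.11) echo 1 ;;
        21.08) echo 2 ;;
        22.05) echo 3 ;;
        23.02) echo 4 ;;
        23.11) echo 5 ;;
        *)     echo -1 ;;   # unknown release
    esac
}

# usage: compatible <slurmctld-version> <slurmd-version>
compatible() {
    ctld=$(release_index "$1")
    d=$(release_index "$2")
    # slurmd must not be newer, and at most two releases behind
    [ "$ctld" -ge 0 ] && [ "$d" -ge 0 ] && \
        [ "$d" -le "$ctld" ] && [ $((ctld - d)) -le 2 ]
}

compatible 22.05 20.11 && echo "22.05 ctld + 20.11 slurmd: OK"
compatible 20.02 22.05 || echo "20.02 ctld + 22.05 slurmd: NOT supported"
```

This reproduces the situation in the question: a 20.02 slurmctld cannot talk to 22.05 slurmd nodes, because slurmd may never be newer than slurmctld.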
[slurm-users] slurm communication between versions
Hello,

I have a curiosity and a question at the same time: will slurm-20.02, which is installed on a management node, communicate with slurm-22.05 installed on the work nodes? They have the same configuration file slurm.conf. Or do the versions have to be the same? Slurm 20.02 was installed manually and slurm 22.05 was installed through dnf.

Thank you,
Felix

--
Dr. Eng. Farcas Felix
National Institute of Research and Development of Isotopic and Molecular Technology,
IT Department - Cluj-Napoca, Romania
Mobile: +40742195323
Re: [slurm-users] partition qos without managing users
ego...@posteo.me writes:

> ok, I understand synching of users to the slurm database is a task which
> is not built-in, but could be added outside of slurm :-)
>
> With regard to the QoS or Partition QoS setting, I've tried several
> settings and configurations, however it was not possible at all to
> configure a QoS on partition level only, without adding specific
> users to the slurm database.
> Either I don't understand the docs properly or there is no
> configuration option to limit jobs with e.g. cpu=4 globally on a
> partition.
>
> Could anybody share a configuration which sets a partition QoS
> (e.g. cpu=8) without managing users, or a configuration to silently
> change the job QoS using job_submit.lua, again without maintaining
> users within the slurm database?

We add users to the Slurm DB automatically via job_submit.lua if they do not already exist. This is probably not what you want to do if you have very high throughput, which we do not. For us it means that we minimize the stuff which needs to be deleted in the case that someone applies for HPC access but does not use it within a certain period and is therefore removed from the system.

Cheers,
Loris

> Thanks
>
>> Date: Mon, 20 Nov 2023 14:37:11 -0800
>> From: Brian Andrus
>> To: slurm-users@lists.schedmd.com
>> Subject: Re: [slurm-users] partition qos without managing users
>> Message-ID: <2f421687-40aa-4e35-bf9d-3f31984ad...@gmail.com>
>> Content-Type: text/plain; charset=UTF-8; format=flowed
>>
>> You would have to do such syncing with your own scripts. There is no
>> way slurm would be able to tell which users should have access and what
>> access without the slurmdb, and such info is not contained in AD.
>>
>> At our site, we iterate through the group(s) that are slurm user groups
>> and add the users if they do not exist. We also delete users when they
>> are removed from AD. This does have the effect of losing job info
>> produced by said users, but since we export that into a larger historic
>> repository, we don't worry about it.
>>
>> So the simple case is to iterate through an AD group which your slurm
>> users belong to and add them to slurmdbd. Once they are in there, you
>> can set defaults with exceptions for specific users.
>>
>> If you are only looking to have settings apply to all users, you don't
>> have to import the users. Set the QoS for the partition.
>>
>> Brian Andrus
>>
>> On 11/20/2023 1:45 PM, ego...@posteo.me wrote:
>>> Hello,
>>>
>>> I'd like to configure some sort of partition QoS so that the number of
>>> jobs or cpus is limited for a single user.
>>> So far my testing always depends on creating users within the
>>> accounting database, however I'd like to avoid managing each user and
>>> having to create or sync _all_ LDAP users also within Slurm.
>>> Or - are there solutions to sync LDAP or AzureAD users to the Slurm
>>> accounting database?
>>>
>>> Thanks for any input.
>>> Best - Eg.

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
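Brian's "set the QoS for the partition" suggestion can be sketched roughly as follows. The commands are only printed, not executed, and the QoS name (`part_limit`), partition name (`gpu`) and node list are made-up examples, not from any poster's cluster.

```shell
#!/bin/sh
# Dry-run sketch: a partition QoS whose per-user limit applies to
# every user of the partition, without per-user associations.
run() { echo "WOULD RUN: $*"; }

# 1. Create a QoS with a per-user TRES limit (here: 8 CPUs per user).
run sacctmgr -i add qos Name=part_limit MaxTRESPerUser=cpu=8

# 2. Attach it to the partition in slurm.conf, e.g.:
#      PartitionName=gpu Nodes=node[1-4] QOS=part_limit
#    then tell slurmctld to re-read the configuration:
run scontrol reconfigure
```

Whether job submission still requires a user association depends on the `AccountingStorageEnforce` setting, which is the part the thread is wrestling with.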
Re: [slurm-users] slurm power save question
Thanks for confirming, Brian. That was my understanding as well.

Do you have it working that way on a machine you have access to? If so, I'd be interested to see the config file, because that's not the behavior I am experiencing in my tests. In fact, in my tests Slurm will not bring down those "X nodes", but will not bring them up either, *unless* there is a job targeted at them. I may have something misconfigured, and I'd love to fix that.

Thanks!

On Wed, Nov 22, 2023 at 5:46 PM Brian Andrus wrote:

> As I understand it, that setting means "Always have at least X nodes up",
> which includes running jobs. So it stops any wait time for the first X jobs
> being submitted, but any jobs after that will need to wait for the power_up
> sequence.
>
> Brian Andrus
>
> On 11/22/2023 6:58 AM, Davide DelVento wrote:
>
> I've started playing with power save and have a question about
> SuspendExcNodes. The documentation at
> https://slurm.schedmd.com/power_save.html says
>
> For example nid[10-20]:4 will prevent 4 usable nodes (i.e. IDLE and not
> DOWN, DRAINING or already powered down) in the set nid[10-20] from being
> powered down.
>
> I initially interpreted that as "Slurm will try to keep 4 nodes idle and
> on as much as possible", which would have reduced the wait time for new
> jobs targeting those nodes. Instead, it appears to mean "Slurm will not
> shut off the last 4 nodes which are idle in that partition; however, it
> will not turn on nodes which it shut off earlier unless jobs are
> scheduled on them".
>
> Most notably, if the 4 idle nodes are allocated to other jobs (and so
> they are no longer idle), slurm does not turn on any nodes which have
> been shut off earlier, so it's possible (and depending on workloads
> perhaps even common) to have no idle nodes at all, regardless of the
> SuspendExcNodes setting.
>
> Is that how it works, or do I have something else in my settings which is
> causing this unexpected-to-me behavior? I think I can live with it, but
> IMHO it would have been better if slurm attempted to turn on nodes
> preemptively, trying to match the requested SuspendExcNodes, rather than
> waiting for job submissions.
>
> Thanks, and Happy Thanksgiving to people in the USA
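For reference, a minimal slurm.conf power-save fragment matching the setting discussed above. The node names and timings are illustrative, and the suspend/resume script paths are hypothetical placeholders, not from the poster's cluster:

```
# Power down nodes idle for 10 minutes; keep 4 usable nodes of
# nid[10-20] exempt from suspension. Note: per this thread, Slurm does
# not power nodes back UP to maintain that count of idle nodes; it
# only refuses to power the last 4 down.
SuspendTime=600
SuspendProgram=/usr/local/sbin/node_suspend.sh
ResumeProgram=/usr/local/sbin/node_resume.sh
SuspendExcNodes=nid[10-20]:4
```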
Re: [slurm-users] Releasing stale allocated TRES
"Schneider, Gerald" writes:

> Is there any way to release the allocation manually?

I've only seen this once on our clusters, and that time it helped to just restart slurmctld. If this is a recurring problem, perhaps it will help to upgrade Slurm. You are running quite an old version.

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] Releasing stale allocated TRES
On 11/23/23 11:50, Markus Kötter wrote:
> On 23.11.23 10:56, Schneider, Gerald wrote:
>> I have a recurring problem with allocated TRES, which are not released
>> after all jobs on that node are finished. The TRES are still marked as
>> allocated and no new jobs can be scheduled on that node using those TRES.
>
> Remove the node from slurm.conf and restart slurmctld, re-add, restart.
> Remove from Partition definitions as well.

Just my 2 cents: Do NOT remove a node from slurm.conf as just described! When adding or removing nodes, both slurmctld and all slurmd's must be restarted! See the SchedMD presentation https://slurm.schedmd.com/SLUG23/Field-Notes-7.pdf slides 51-56 for the recommended procedure.

/Ole
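The restart ordering Ole points to can be sketched as a dry run. This only prints the commands; the host list is a made-up example, `pdsh` is just one way to fan out the restart, and the SLUG23 slides remain the authoritative procedure.

```shell
#!/bin/sh
# Dry-run sketch: after editing slurm.conf identically on ALL hosts,
# restart slurmctld AND every slurmd, rather than only the controller.
run() { echo "WOULD RUN: $*"; }

run systemctl restart slurmctld                    # on the controller
run pdsh -w 'node[1-4]' systemctl restart slurmd   # on all compute nodes
run scontrol show nodes                            # verify the node list
```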
Re: [slurm-users] Releasing stale allocated TRES
Hi,

On 23.11.23 10:56, Schneider, Gerald wrote:
> I have a recurring problem with allocated TRES, which are not released
> after all jobs on that node are finished. The TRES are still marked as
> allocated and no new jobs can be scheduled on that node using those TRES.

Remove the node from slurm.conf and restart slurmctld, re-add, restart. Remove it from the Partition definitions as well.

MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security
[slurm-users] Releasing stale allocated TRES
Hi there,

I have a recurring problem with allocated TRES, which are not released after all jobs on that node are finished. The TRES are still marked as allocated and no new jobs can be scheduled on that node using those TRES.

$ scontrol show node node2
NodeName=node2 Arch=x86_64 CoresPerSocket=64
   CPUAlloc=0 CPUTot=256 CPULoad=0.11
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:tesla:8
   NodeAddr=node2 NodeHostName=node2 Version=21.08.5
   OS=Linux 5.15.0-89-generic #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023
   RealMemory=1025593 AllocMem=0 FreeMem=1025934 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=AMPERE
   BootTime=2023-11-23T09:01:28 SlurmdStartTime=2023-11-23T09:02:09
   LastBusyTime=2023-11-23T09:03:19
   CfgTRES=cpu=256,mem=1025593M,billing=256,gres/gpu=8,gres/gpu:tesla=8
   AllocTRES=gres/gpu=8
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Previously the allocation was gone after the server was turned off for a couple of hours (power conservation), but the issue has occurred again and this time it persists even after the server was off overnight.

Is there any way to release the allocation manually?

Regards,
Gerald Schneider

--
Gerald Schneider
Fraunhofer-Institut für Graphische Datenverarbeitung IGD
Joachim-Jungius-Str. 11 | 18059 Rostock | Germany
Tel. +49 6151 155-309 | +49 381 4024-193 | Fax +49 381 4024-199
gerald.schnei...@igd-r.fraunhofer.de | www.igd.fraunhofer.de