Re: [slurm-users] [EXT] Re: systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Sean Crosby
Did you build Slurm yourself from source? If so, the node you build on needs the munge development package installed (munge-devel on EL systems, libmunge-dev on Debian). You then need to set up munge with a shared munge key between the nodes, and have the munge daemon
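
A rough sketch of that munge setup, assuming an EL7-style layout (package names, paths and the peer host name are illustrative only):

    yum install munge munge-libs munge-devel    # libmunge-dev on Debian/Ubuntu
    create-munge-key                            # writes /etc/munge/munge.key on this node
    scp /etc/munge/munge.key node02:/etc/munge/munge.key   # copy the SAME key to every node
    chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key
    systemctl enable --now munge                # run the munge daemon on all nodes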

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Ole Holm Nielsen
One thing to be aware of when setting partition states to down: * Setting partition state=down will be reset if slurmctld is restarted. Read the slurmctld man page under the -R parameter. So it's better not to restart slurmctld during the downtime. /Ole On 2/1/22 08:11, Ole Holm Nielsen
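
Since the DOWN state does not survive a slurmctld restart, it is worth re-checking (and re-applying) it afterwards; a small sketch with a hypothetical partition name:

    scontrol show partition compute | grep -o 'State=[A-Z]*'   # did it revert to UP?
    scontrol update PartitionName=compute State=DOWN           # re-apply if needed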

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Ole Holm Nielsen
Login nodes being down doesn't affect Slurm jobs at all (except if you run slurmctld/slurmdbd on the login node ;-) To stop new jobs from being scheduled to run, mark all partitions down. This is useful when recovering the cluster from a power or cooling downtime, for example. I wrote
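
A minimal sketch of marking every partition down and bringing them back afterwards (untested; uses sinfo to enumerate the partition names):

    for p in $(sinfo -h -o '%R'); do scontrol update PartitionName="$p" State=DOWN; done   # before maintenance
    for p in $(sinfo -h -o '%R'); do scontrol update PartitionName="$p" State=UP;   done   # after maintenance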

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Sid Young
Brian / Christopher, that looks like a good process, thanks guys, I will do some testing and let you know. If I mark a partition down and it has running jobs, what happens to those jobs, do they keep running? Sid Young W: https://off-grid-engineering.com W: (personal) https://sidyoung.com/ W:

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Christopher Samuel
On 1/31/22 9:25 pm, Brian Andrus wrote: touch /etc/nologin That will prevent new logins. It's also useful that if you put a message in /etc/nologin then users who are trying to log in will get that message before being denied. All the best, Chris -- Chris Samuel :
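
For example (the message text is just a placeholder; pam_nologin prints the file content to anyone trying to log in):

    echo "Login node down for DIMM replacement; running jobs are unaffected." > /etc/nologin
    rm -f /etc/nologin    # when maintenance is finished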

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Brian Andrus
One possibility: Sounds like your concern is folks with interactive jobs from the login node that are running under screen/tmux. That being the case, you need running jobs to end and not allow new users to start tmux sessions. Definitely doing 'scontrol update state=down partition='

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Sid Young
Sid Young W: https://off-grid-engineering.com W: (personal) https://sidyoung.com/ W: (personal) https://z900collector.wordpress.com/ On Tue, Feb 1, 2022 at 3:02 PM Christopher Samuel wrote: > On 1/31/22 4:41 pm, Sid Young wrote: > > > I need to replace a faulty DIMM chip in our login node so I

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Christopher Samuel
On 1/31/22 9:00 pm, Christopher Samuel wrote: That would basically be the way. Thinking further on this, a better way would be to mark your partitions down, as it's likely you've got fewer partitions than compute nodes. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ :

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Christopher Samuel
On 1/31/22 4:41 pm, Sid Young wrote: I need to replace a faulty DIMM chip in our login node so I need to stop new jobs being kicked off while letting the old ones end. I thought I would just set all nodes to drain to stop new jobs from being kicked off... That would basically be the way,
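
A sketch of the drain approach, with a hypothetical node range; DRAIN lets running jobs finish while refusing new ones:

    scontrol update NodeName=node[001-010] State=DRAIN Reason="login node maintenance"
    scontrol update NodeName=node[001-010] State=RESUME    # afterwards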

Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Nousheen
Dear Ole and Hermann, I have reinstalled Slurm from scratch now following this link: The error remains the same. Kindly guide me on where I will find this cred/munge plugin. Please help me resolve this issue. [root@exxact slurm]# slurmd -C NodeName=exxact CPUs=12 Boards=1 SocketsPerBoard=1
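
One way to check whether the munge plugins were actually built and installed (the plugin directory depends on the configure prefix, so these paths are only a guess for a /usr/local source build):

    ls /usr/local/lib/slurm/auth_munge.so /usr/local/lib/slurm/cred_munge.so
    rpm -q munge munge-devel    # on EL systems; if munge-devel was missing, rebuild Slurm after installing it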

Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Nousheen
Dear Ole, Thank you for your response. I am doing it again using your suggested link. Best Regards, Nousheen Parvaiz On Mon, Jan 31, 2022 at 2:07 PM Ole Holm Nielsen wrote: > Hi Nousheen, > > I recommend you again to follow the steps for installing Slurm on a CentOS > 7 cluster: >

[slurm-users] Fwd: systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Nousheen
Best Regards, Nousheen Parvaiz Ph.D. Scholar National Center For Bioinformatics Quaid-i-Azam University, Islamabad Dear Hermann, Thank you for your reply. I have given below my slurm.conf and log file. # slurm.conf file generated by configurator easy.html. # Put this file on all nodes of
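
For reference, a minimal single-node sketch of the kind of file configurator easy.html produces (hostnames, paths and counts here are illustrative, not the actual values from this cluster):

    ClusterName=cluster
    SlurmctldHost=exxact
    AuthType=auth/munge
    ProctrackType=proctrack/cgroup
    SlurmctldLogFile=/var/log/slurm/slurmctld.log
    SlurmdLogFile=/var/log/slurm/slurmd.log
    NodeName=exxact CPUs=12 State=UNKNOWN
    PartitionName=debug Nodes=exxact Default=YES MaxTime=INFINITE State=UP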

[slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Sid Young
G'Day all, I need to replace a faulty DIMM chip in our login node so I need to stop new jobs being kicked off while letting the old ones end. I thought I would just set all nodes to drain to stop new jobs from being kicked off... does this sound like a good idea? Downtime window would be 20-30

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-01-31 Thread Stephan Roth
Not a solution, but some ideas & experiences concerning the same topic: A few of our older GPUs used to show the error message "has fallen off the bus", which could only be resolved by a full power cycle as well. Something changed; nowadays the error message is "GPU lost" and a normal reboot

Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2022-01-31 Thread Timo Rothenpieler
Make sure you properly configured nsswitch.conf. Most commonly this kind of issue indicates that you forgot to define initgroups correctly. It should look something like this: ... group: files [SUCCESS=merge] systemd [SUCCESS=merge] ldap ... initgroups: files [SUCCESS=continue] ldap ...
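
Spelled out, the relevant /etc/nsswitch.conf lines would look roughly like this (assuming LDAP is the remote source; adjust to your setup):

    group:      files [SUCCESS=merge] systemd [SUCCESS=merge] ldap
    initgroups: files [SUCCESS=continue] ldap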

Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2022-01-31 Thread Russell Jones
I solved this issue by adding a group to IPA with the same name and GID as the local groups, then using [SUCCESS=merge] in nsswitch.conf for groups, and on our CentOS 8 nodes adding "enable_files_domain = False" in the sssd.conf file. On Fri, Jan 28, 2022 at 5:02 PM Ratnasamy, Fritz <
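
The sssd.conf part of that, as a sketch (only the one option was added; section placement per the sssd.conf man page):

    # /etc/sssd/sssd.conf, on the CentOS 8 nodes
    [sssd]
    enable_files_domain = False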

Re: [slurm-users] addressing NVIDIA MIG + non MIG devices in Slurm

2022-01-31 Thread Bas van der Vlies
This is not an answer to the MIG issue but to the question that Esben has. We at SURF have developed sharing of all the GPUs in a node. We "misuse" the Slurm MPS feature. At SURF this is mostly used for GPU courses, e.g. jupyterhub. We have tested it with Slurm version 20.11.8. The code is public
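
For comparison, the stock Slurm MPS configuration (not SURF's modified code) looks roughly like this; node name and counts are made up:

    # slurm.conf
    GresTypes=gpu,mps
    NodeName=gpunode01 Gres=gpu:2,mps:200 ...
    # gres.conf on the node
    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1
    Name=mps Count=200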

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-01-31 Thread Timony, Mick
I have a large compute node with 10 RTX8000 cards at a remote colo. One of the cards on it is acting up, "falling off the bus" once a day and requiring a full power cycle to reset. I want jobs to avoid that card as well as the card it is NVLINK'ed to. So I modified gres.conf on that node as follows:
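
The general shape of such a gres.conf, with hypothetical device indices (the faulty card and its NVLink partner are simply left out; the node's Gres= count in slurm.conf has to be lowered to match):

    Name=gpu Type=rtx8000 File=/dev/nvidia[0-5]
    Name=gpu Type=rtx8000 File=/dev/nvidia8
    Name=gpu Type=rtx8000 File=/dev/nvidia9
    # /dev/nvidia6 and /dev/nvidia7 (bad card + NVLink peer) deliberately omitted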

Re: [slurm-users] addressing NVIDIA MIG + non MIG devices in Slurm - solved

2022-01-31 Thread Matthias Leopold
I looked at option 2.2.3 using partial "AutoDetect=nvml" again and saw that the reason for failure was indeed the sanity check, but it was my fault because I set an invalid "Links" value for the "hardcoded" GPUs. So this variant of gres.conf setup works and gives me everything I want, sorry

Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Ole Holm Nielsen
Hi Nousheen, I recommend again that you follow the steps for installing Slurm on a CentOS 7 cluster: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation Maybe you will need to start the installation from scratch, but the steps are guaranteed to work if followed correctly. IHTH, Ole On 1/31/22

Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Hermann Schwärzler
Dear Nousheen, I guess there is something missing in your installation - probably your slurm.conf? Do you have logging enabled for slurmctld? If yes, what do you see in that log? Or what do you get if you run slurmctld manually like this: /usr/local/sbin/slurmctld -D Regards, Hermann On
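
Running the daemons in the foreground with extra verbosity usually shows exactly which plugin or config file is missing; -D and -v are standard slurmctld/slurmd options:

    /usr/local/sbin/slurmctld -D -vvv
    /usr/local/sbin/slurmd -D -vvv    # same idea for the node daemon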