[slurm-users] any way to allow interactive jobs or ssh in Slurm 23.02 when node is draining?

2024-04-19 Thread Robert Kudyba via slurm-users
We use Bright Cluster Manager with Slurm 23.02 on RHEL9. I know about pam_slurm_adopt https://slurm.schedmd.com/pam_slurm_adopt.html, which does not appear to come by default with the Bright 'cm' package of Slurm. Currently, ssh to a node gets: Login not allowed: no running jobs and no WLM
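A minimal sketch of the basic pam_slurm_adopt wiring described on that page, assuming the module has been built against the Bright-packaged Slurm and installed on the compute node image:

    # /etc/pam.d/sshd -- appended after the existing account lines
    account    required     pam_slurm_adopt.so

Note that pam_slurm_adopt only admits users who have a job currently running on that node; that check is independent of whether the node is draining.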

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Robert Kudyba via slurm-users
On Bright it's set in a few places:

    grep -r -i SLURM_CONF /etc
    /etc/systemd/system/slurmctld.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
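A minimal sketch of the matching drop-in the compute nodes would carry for slurmd (the file name here is illustrative; Bright may generate its own):

    # /etc/systemd/system/slurmd.service.d/99-cmd.conf
    [Service]
    Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf

After editing a drop-in, run systemctl daemon-reload and restart slurmd so the new environment takes effect.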

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Brian Andrus via slurm-users
I would double-check where you are setting SLURM_CONF then. It is acting as if it is not set (typo, maybe?). It should be in /etc/default/slurmd (but could be /etc/sysconfig/slurmd). Also check the final, actual command being run to start it. If anyone has changed the .service file or
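A quick way to see the unit file, its drop-ins, and the effective environment and start command (a sketch; the service name may be slurmd or a Bright-specific unit):

    systemctl cat slurmd
    systemctl show slurmd --property=Environment,EnvironmentFiles,ExecStart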

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Jeffrey Layton via slurm-users
I like it; however, it was working before without a slurm.conf in /etc/slurm. Plus, the environment variable SLURM_CONF is pointing to the correct slurm.conf file (the one in /cm/...). Wouldn't Slurm pick that one up? Thanks! Jeff On Fri, Apr 19, 2024 at 1:11 PM Brian Andrus via slurm-users <
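One way to check whether the running slurmd (as opposed to your login shell) actually sees the variable, a sketch assuming root access and a hypothetical node name:

    ssh node001 'tr "\0" "\n" < /proc/$(pidof slurmd)/environ | grep SLURM_CONF'

If nothing comes back, the variable is set in your interactive environment but not in the service's environment.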

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Robert Kudyba via slurm-users
> Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s). For Bright, slurm.conf is in /cm/shared/apps/slurm/var/etc/slurm, including on all nodes. Make sure that on the compute nodes $SLURM_CONF resolves to the correct path. > On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users
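A quick check across the compute nodes, a sketch with made-up node names:

    for n in node001 node002; do
      ssh "$n" 'hostname; grep -rs SLURM_CONF /etc/default/slurmd /etc/sysconfig/slurmd /etc/systemd/system/slurmd.service.d/'
    done

Each node should report the /cm/shared/apps/slurm/var/etc/slurm/slurm.conf path, and that path must actually be mounted on the node.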

[slurm-users] Re: Integrating Slurm with WekaIO

2024-04-19 Thread Brian Andrus via slurm-users
This is because you have no slurm.conf in /etc/slurm, so it is trying 'configless', which queries DNS to find out where to get the config. It is failing because you do not have DNS configured to tell nodes where to ask about the config. Simple solution: put a copy of slurm.conf in
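For reference, configless mode looks for an SRV record along these lines (a sketch based on the Slurm configless documentation; hostname and TTL are placeholders):

    _slurmctld._tcp 3600 IN SRV 10 0 6817 slurmctl-primary.example.com

With no such record and no local slurm.conf, slurmd has nowhere to fetch a configuration from.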

[slurm-users] Integrating Slurm with WekaIO

2024-04-19 Thread Jeffrey Layton via slurm-users
Good afternoon, I'm working on a cluster of NVIDIA DGX A100s that is using BCM 10 (Base Command Manager, which is based on Bright Cluster Manager). I ran into an error and only just learned that Slurm and Weka don't get along (presumably because Weka pins its client threads to cores). I read
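One possible way to keep Slurm off cores that a storage client has pinned is to reserve them in the node definition; a sketch only, with a made-up node name and core IDs (the cores Weka actually uses must be checked on the node):

    # slurm.conf node definition (illustrative)
    NodeName=dgx001 CPUs=256 ... CpuSpecList=0,1,128,129

With CpuSpecList (or CoreSpecCount), slurmd keeps those CPUs out of the pool it hands to jobs.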

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-19 Thread Ole Holm Nielsen via slurm-users
It turns out that the Slurm job limits are *not* controlled by the normal /etc/security/limits.conf configuration. Any service running under Systemd (such as slurmd) has limits defined by Systemd, see [1] and [2]. The limits of processes started by slurmd are defined by LimitXXX in
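A minimal sketch of raising those limits with a systemd drop-in (the values are examples, not recommendations):

    # systemctl edit slurmd   -> creates /etc/systemd/system/slurmd.service.d/override.conf
    [Service]
    LimitMEMLOCK=infinity
    LimitNOFILE=131072

    # then:
    systemctl daemon-reload && systemctl restart slurmd

Processes started by that slurmd then run under these limits, unless slurm.conf's PropagateResourceLimits settings replace them with the submitting shell's limits.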