Re: [slurm-users] node health check

2023-01-30 Thread Ole Holm Nielsen

On 1/31/23 04:35, Ratnasamy, Fritz wrote:
  Currently, some of our nodes are overloaded. The installed NHC used to 
check the load and drain a node when it was overloaded. However, for the 
past few days, it has not been showing the state of the node. When I run 
/usr/sbin/nhc manually, it says:
20230130 21:25:14 [slurm] /usr/libexec/nhc/node-mark-online mcn26.chicagobooth.edu
/usr/libexec/nhc/node-mark-online:  Not sure how to handle node state "" on mcn26.chicagobooth.edu
/usr/libexec/nhc/node-mark-online:  Skipping  node mcn26.chicagobooth.edu ( )


It seems that it is not able to read the state of the node. I ran scontrol 
show node mcn26

NodeName=mcn26 Arch=x86_64 CoresPerSocket=16
    NodeAddr=mcn26 NodeHostName=mcn26 Version=20.11.8

Any idea what happened and why nhc is not reading the state of the node 
anymore?


What's the complete output of "scontrol show node mcn26", especially the 
State=... information?


Which version of NHC are you running?
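
If it helps, here is a minimal Python sketch for pulling just the State field out of that output. It assumes Python 3.7+ and that scontrol is on the PATH; the function names parse_node_state and get_node_state are made up for illustration and are not part of Slurm or NHC.

```python
import re
import subprocess


def parse_node_state(scontrol_output: str) -> str:
    """Return the value of the State= field, or "" if the field is absent."""
    # Require whitespace (or start of text) before "State=" so that
    # other keys ending in "State" are not matched by accident.
    match = re.search(r"(?<!\S)State=(\S+)", scontrol_output)
    return match.group(1) if match else ""


def get_node_state(node: str) -> str:
    """Run `scontrol show node <node>` and extract its State field."""
    result = subprocess.run(
        ["scontrol", "show", "node", node],
        capture_output=True,
        text=True,
        check=False,  # return "" rather than raising if scontrol fails
    )
    return parse_node_state(result.stdout)
```

Calling get_node_state("mcn26") on the affected node and getting an empty string back would be consistent with the empty state "" that node-mark-online reports.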

/Ole







Re: [slurm-users] Install & Configuration of slurmdbd

2023-01-30 Thread Ole Holm Nielsen

Hi Jim,

Maybe you'll find these Wiki pages relevant for setting up your Slurm 
database:


https://wiki.fysik.dtu.dk/Niflheim_system/
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_database/

/Ole

On 1/30/23 20:43, Jim Klo wrote:
I’ve been working on updating our small Slurm cluster over the last few 
days.  I’ve successfully updated the cluster. However, our cluster is 
missing the slurmdbd configuration, and while I know it’s not required, I 
would like to add it, as it would be helpful for accessing job history 
details and potentially managing and prioritizing resources with Fair 
Share as we scale up.


Anyway, I’ve gone through https://slurm.schedmd.com/accounting.html 
and, as opposed to other docs in the project, the details on configuring 
accounting are rather vague. I’m a bit confused as to the exact steps, and 
what prerequisites and dependencies are needed.  Are there better 
instructions on setting up and configuring slurmdbd on an existing cluster?


Initial questions I have:

  * When building Slurm, did I need to have libmariadbd-dev (or other
MariaDB client or libs) present? Is this only needed by the slurmdbd
node, or do slurmd / slurmctld also need it?
  * Are there any limitations on how MariaDB is run? I typically run
containerized MariaDB so it can be easily backed up, moved, etc.  I
see the configure script has a `--with-mysql_config` option; however, if
MariaDB is running containerized on a different system, does this path
need to be accessible after configure?

Any assistance is greatly appreciated.

Thanks,

Jim




[slurm-users] node health check

2023-01-30 Thread Ratnasamy, Fritz
Hi,

 Currently, some of our nodes are overloaded. The installed NHC used to
check the load and drain a node when it was overloaded. However, for the
past few days, it has not been showing the state of the node. When I run
/usr/sbin/nhc manually, it says:
20230130 21:25:14 [slurm] /usr/libexec/nhc/node-mark-online mcn26.chicagobooth.edu
/usr/libexec/nhc/node-mark-online:  Not sure how to handle node state "" on mcn26.chicagobooth.edu
/usr/libexec/nhc/node-mark-online:  Skipping  node mcn26.chicagobooth.edu ( )

It seems that it is not able to read the state of the node. I ran scontrol
show node mcn26
NodeName=mcn26 Arch=x86_64 CoresPerSocket=16
   NodeAddr=mcn26 NodeHostName=mcn26 Version=20.11.8

Any idea what happened and why nhc is not reading the state of the node
anymore?
Best,


*Fritz Ratnasamy*

Data Scientist

Information Technology


[slurm-users] Install & Configuration of slurmdbd

2023-01-30 Thread Jim Klo
Greetings,

I’ve been working on updating our small Slurm cluster over the last few days.  
I’ve successfully updated the cluster. However, our cluster is missing the 
slurmdbd configuration, and while I know it’s not required, I would like to add 
it, as it would be helpful for accessing job history details and potentially 
managing and prioritizing resources with Fair Share as we scale up.

Anyway, I’ve gone through https://slurm.schedmd.com/accounting.html and, as 
opposed to other docs in the project, the details on configuring accounting are 
rather vague. I’m a bit confused as to the exact steps, and what prerequisites 
and dependencies are needed.  Are there better instructions on setting up and 
configuring slurmdbd on an existing cluster?

Initial questions I have:

  *   When building Slurm, did I need to have libmariadbd-dev (or other MariaDB 
client or libs) present? Is this only needed by the slurmdbd node, or do 
slurmd / slurmctld also need it?
  *   Are there any limitations on how MariaDB is run? I typically run 
containerized MariaDB so it can be easily backed up, moved, etc.  I see the 
configure script has a `--with-mysql_config` option; however, if MariaDB is 
running containerized on a different system, does this path need to be 
accessible after configure?
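
For the first point, one thing that can be checked directly is whether a MariaDB/MySQL development helper is visible on the build host. The following is only a sketch in Python: it assumes the configure script finds the client library through a `mysql_config` (or `mariadb_config`) helper on the PATH, and the function name find_db_config_tools is hypothetical, not something shipped with Slurm.

```python
import shutil


def find_db_config_tools(candidates=("mysql_config", "mariadb_config")):
    """Map each candidate helper name to its full path, or None if missing."""
    # shutil.which searches the current PATH, like the shell's `which`.
    return {name: shutil.which(name) for name in candidates}


if __name__ == "__main__":
    for name, path in find_db_config_tools().items():
        print(f"{name}: {path or 'not found'}")
```

If neither helper is found on the host where Slurm was built, that would suggest the database development files were not present at build time.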

Any assistance is greatly appreciated.

Thanks,

Jim