Hi there,

I've been helping with a SLURM installation at the NICD. This is SLURM 15.08 on Fedora 23, installed on a single node. The node has 2 x Intel Xeon E3-1220v3 CPUs with 4 cores each. The Linux kernel version is 4.4.9-300.fc23.x86_64.
In this config the single node, 'localhost', is both a compute node running slurmd and the controller running slurmctld. The slurm.conf is as per:

https://gist.github.com/pvanheus/c22a7afdf68b008aff9e662a8df8a9de

From time to time (every few hours) this error occurs (I've included the ACPI messages that seem to happen before the SLURM error):

Nov 5 02:11:44 bio-linux kernel: ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20150930/exfield-418)
Nov 5 02:11:44 bio-linux kernel: ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff8808478be5f0), AE_AML_BUFFER_LIMIT (20150930/psparse-542)
Nov 5 02:11:44 bio-linux kernel: ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20150930/power_meter-338)
Nov 5 02:12:09 bio-linux slurmctld[27239]: error: Node localhost has low socket*core*thread count (4 < 8)
Nov 5 02:12:09 bio-linux slurmctld[27239]: error: Node localhost has low cpu count (4 < 8)
Nov 5 02:12:09 bio-linux slurmctld[27239]: error: _slurm_rpc_node_registration node=localhost: Invalid argument

(also at https://gist.github.com/pvanheus/f72f1dd0718f65b5536ff4690902318c )

The node then goes into drain state. I can reset that with:

scontrol update nodename=localhost state=resume

But does anyone have any idea why this error occurs and how to fix it?

Thanks,
Peter
