Hi there

I've been helping with a SLURM installation at the NICD. This is SLURM
15.08 on Fedora 23 installed on a single node. The node has 2 x Intel Xeon
E3-1220v3 CPUs with 4 cores each. The Linux kernel version is
4.4.9-300.fc23.x86_64.

In this config the single node, 'localhost', is both a compute node running
slurmd and the controller running slurmctld. The slurm.conf is as per:

https://gist.github.com/pvanheus/c22a7afdf68b008aff9e662a8df8a9de

From time to time (every few hours) this error occurs (I've included the
ACPI messages that seem to happen before the SLURM error):

Nov  5 02:11:44 bio-linux kernel: ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20150930/exfield-418)
Nov  5 02:11:44 bio-linux kernel: ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff8808478be5f0), AE_AML_BUFFER_LIMIT (20150930/psparse-542)
Nov  5 02:11:44 bio-linux kernel: ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20150930/power_meter-338)
Nov  5 02:12:09 bio-linux slurmctld[27239]: error: Node localhost has low socket*core*thread count (4 < 8)
Nov  5 02:12:09 bio-linux slurmctld[27239]: error: Node localhost has low cpu count (4 < 8)
Nov  5 02:12:09 bio-linux slurmctld[27239]: error: _slurm_rpc_node_registration node=localhost: Invalid argument

(also at https://gist.github.com/pvanheus/f72f1dd0718f65b5536ff4690902318c )
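
In case it helps with diagnosis, here is a minimal sketch of how the counts in
the slurmctld messages could be pulled out of a log. It assumes the message
format is exactly as in the lines quoted above, and that the first number is
what slurmd registered with and the second is what slurm.conf declares (that
is my reading of the message, not something I have confirmed). The name
find_low_counts is only for illustration.

```python
import re

# Matches the slurmctld "low ... count (A < B)" messages quoted above.
# The pattern is an assumption based only on those log lines.
LOW_COUNT = re.compile(
    r"error: Node (?P<node>\S+) has low (?P<what>.+?) count "
    r"\((?P<reported>\d+) < (?P<configured>\d+)\)"
)


def find_low_counts(log_text):
    """Return (node, what, reported, configured) tuples found in log_text."""
    # Log lines pasted into mail are often wrapped, so collapse all
    # whitespace (including newlines) before matching.
    flat = " ".join(log_text.split())
    return [
        (m["node"], m["what"], int(m["reported"]), int(m["configured"]))
        for m in LOW_COUNT.finditer(flat)
    ]


sample = """Nov  5 02:12:09 bio-linux slurmctld[27239]: error: Node localhost has low
socket*core*thread count (4 < 8)
Nov  5 02:12:09 bio-linux slurmctld[27239]: error: Node localhost has low
cpu count (4 < 8)"""

for node, what, reported, configured in find_low_counts(sample):
    print(f"{node}: {what} reported={reported} configured={configured}")
```

If that reading is right, the node is intermittently registering with 4 CPUs
while 8 are configured. Running "slurmd -C" on the node prints the hardware
configuration that slurmd detects, which can be compared with the NodeName
line in slurm.conf.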

The node then goes into drain state. I can reset that with:

scontrol update nodename=localhost state=resume

But does anyone have any idea why this error occurs and how to fix it?

Thanks,
Peter
