They fixed this in newer versions of Slurm. We had the same issue with
older versions, so we had to run with the config_overrides option on to
keep the logs quiet. They changed the way logging is done in the more
recent releases, and it's not as chatty.
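For reference, a minimal sketch of what that setting might look like in
slurm.conf, assuming the option meant here is the config_overrides value
of SlurmdParameters. Check the slurm.conf man page for your version
before using it, since it makes slurmctld trust the configured node
definition rather than what the node actually reports.

```
# slurm.conf (fragment)
# Assumption: the option mentioned above is this SlurmdParameters value.
# It tells Slurm to treat the node definition in slurm.conf as authoritative,
# so a node reporting less memory than configured is not flagged for it.
SlurmdParameters=config_overrides
```

A change like this normally needs the daemons to re-read the
configuration (for example with "scontrol reconfigure") before it
takes effect.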
-Paul Edmon-
On 5/12/22 7:35 AM, Per Lönnborg wrote:
Greetings,
is there a way to lower the log rate on error messages in slurmctld
for nodes with hardware errors?
We see for example this for a node that has DIMM errors:
[2022-05-12T07:07:34.757] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:35.760] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:36.763] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:37.766] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:38.769] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:39.773] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:40.776] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:41.779] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:42.781] error: Node node37 has low real_memory size (257642 < 257660)
[2022-05-12T07:07:45.143] error: Node node37 has low real_memory size (257642 < 257660)
The log warning is correct, since the node has DIMM errors, but that's
one log entry per second. Is such a high log rate really intended?
Thanks,
/ Per Lonnborg