Hello,

This question will most likely need to be addressed by the developers, but one never knows.

I did not open a ticket on bugs.schedmd.com because we do not have a support contract; I am not sure whether that is standard procedure.

This week slurmctld has crashed with a segfault twice, seemingly at random(?). We have never seen this particular failure before:

Apr 13 17:20:03 controlhost kernel: slurmctld_srvcn[23331]: segfault at 0 ip 000000367aa8967e sp 00002b4599d0bb18 error 4 in libc-2.12.so[367aa00000+18a000]
Apr 16 06:18:02 controlhost kernel: slurmctld_srvcn[11009]: segfault at 0 ip 000000367aa8967e sp 00002b5f248f7b18 error 4 in libc-2.12.so[367aa00000+18a000]
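
For reference, here is a rough sketch (plain Python, nothing to do with Slurm itself) of how I read those kernel lines. The names decode and PATTERN are just my own helpers, and the error-bit meanings assume the usual x86 page-fault error code layout (bit 0 = page present, bit 1 = write, bit 2 = user mode). It subtracts the libc mapping base from the instruction pointer to get the offset inside libc:

```python
import re

# The two kernel messages from above, copied verbatim.
LINES = [
    "Apr 13 17:20:03 controlhost kernel: slurmctld_srvcn[23331]: segfault at 0 "
    "ip 000000367aa8967e sp 00002b4599d0bb18 error 4 in libc-2.12.so[367aa00000+18a000]",
    "Apr 16 06:18:02 controlhost kernel: slurmctld_srvcn[11009]: segfault at 0 "
    "ip 000000367aa8967e sp 00002b5f248f7b18 error 4 in libc-2.12.so[367aa00000+18a000]",
]

# Hypothetical helper pattern; all numeric fields are treated as hexadecimal.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def decode(line):
    """Return the faulting address, object, offset and error bits of one line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    err = int(m.group("err"), 16)
    return {
        "fault_addr": int(m.group("addr"), 16),   # 0 means a NULL dereference
        "object": m.group("obj"),
        # Offset of the faulting instruction inside the mapped object.
        "offset": int(m.group("ip"), 16) - int(m.group("base"), 16),
        "page_present": bool(err & 1),  # assumed x86 bit layout
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }

if __name__ == "__main__":
    for line in LINES:
        info = decode(line)
        print(info["object"], hex(info["offset"]), info)
```

If I have decoded this correctly, both crashes are a user-mode read of address 0 at the same offset inside libc-2.12.so, which would be consistent with the same libc routine being handed a NULL pointer in both cases. The offset could presumably be resolved to a function name with gdb or addr2line against a matching libc with debug symbols, but I have not verified that here.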

Looking at the core dumps[1], both appear to follow similar code paths, yet when the threads are dumped, the LWPs in the core dumps don't match, e.g. core.7425 and core.23652.

The last updates to the cluster configuration were the addition and removal of some nodes in the cluster, which is pretty banal, so I don't think that would be the cause. Besides, any real configuration issue would result in the controller not starting.

[1] https://enemy.org/~mrfusion/slurmctld_crashes.txt

I have the core dumps saved, so if anyone needs more information let me know. FWIW, I've dug a little into them and I'm not seeing anything out of the ordinary (e.g. NULL values in variables, etc.), but I'm willing to bet that I've overlooked something.

Thanks!
John DeSantis



