Hello,
This question will most likely need to be addressed by the developers,
but one never knows.
I skipped opening a ticket on bugs.schedmd.com because we do not have a
support contract; I'm not sure whether that is standard procedure.
This week slurmctld has failed with a segfault twice, seemingly at
random. We have never seen this particular failure before:
Apr 13 17:20:03 controlhost kernel: slurmctld_srvcn[23331]: segfault at 0 ip 000000367aa8967e sp 00002b4599d0bb18 error 4 in libc-2.12.so[367aa00000+18a000]
Apr 16 06:18:02 controlhost kernel: slurmctld_srvcn[11009]: segfault at 0 ip 000000367aa8967e sp 00002b5f248f7b18 error 4 in libc-2.12.so[367aa00000+18a000]
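
For reference, the fields in those kernel lines can be decoded mechanically.
The snippet below is only a minimal sketch: the function name
decode_segfault and the returned dictionary keys are made up for
illustration, and it assumes the usual x86 page-fault error bits (bit 0 =
page present, bit 1 = write access, bit 2 = user mode). It subtracts the
mapping base from the instruction pointer to get the offset inside libc.

```python
import re

# Matches the x86 kernel "segfault at ..." message format seen in the log
# lines above. All numeric fields, including the error code, are hex.
SEGFAULT_RE = re.compile(
    r"segfault at\s+(?P<addr>[0-9a-f]+)\s+ip\s+(?P<ip>[0-9a-f]+)\s+"
    r"sp\s+(?P<sp>[0-9a-f]+)\s+error\s+(?P<err>[0-9a-f]+)\s+in\s+"
    r"(?P<obj>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: pull the interesting fields out of one log line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    err = int(m.group("err"), 16)
    return {
        "object": m.group("obj"),
        "fault_address": int(m.group("addr"), 16),
        # Offset of the faulting instruction inside the mapped object.
        "offset_in_object": ip - base,
        # Assumed x86 page-fault error code bits.
        "page_present": bool(err & 1),
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }


line = ("Apr 13 17:20:03 controlhost kernel: slurmctld_srvcn[23331]: "
        "segfault at 0 ip 000000367aa8967e sp 00002b4599d0bb18 error 4 "
        "in libc-2.12.so[367aa00000+18a000]")
info = decode_segfault(line)
print(hex(info["offset_in_object"]))  # 0x8967e
```

Under those assumptions, both crashes are user-mode reads of address 0 at
the same libc offset (0x8967e), i.e. a NULL pointer handed to the same libc
routine. Which routine that is would need to be checked against the matching
libc build, for example with gdb.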
Looking at the core dumps[1], both crashes seem to follow similar
paths, yet when the threads are dumped, the LWPs in the core
dumps don't match, e.g. core.7425 and core.23652.
The last updates to the cluster configuration were the addition and
removal of some of the nodes in the cluster, which is pretty routine, so I
don't think that would be the cause - besides, any real configuration
issues would result in the controller not starting.
[1] https://enemy.org/~mrfusion/slurmctld_crashes.txt
I have the core dumps saved, so if anyone needs more information let me
know. FWIW, I've dug a little into them and I'm not seeing anything out
of the ordinary (e.g. NULL values in variables, etc.), but I'm willing
to bet that I've overlooked something.
Thanks!
John DeSantis