I have the core dumps saved, so if anyone needs more information let me know. FWIW, I've dug a little into them and I'm not seeing anything out of the ordinary (e.g. NULL values in variables, etc.), but I'm willing to bet that I've overlooked something.
I was troubleshooting on a single shot of espresso early in the morning, and it looks like I missed these(!):
#1 0x0000000000509ce2 in packmem (valp=0x0, size_val=27, buffer=0x2b6074140400) at pack.c:623
#1 0x0000000000509ce2 in packmem (valp=0x0, size_val=31, buffer=0x2b45b037f350) at pack.c:623
There are some references to "packmem (valp=0x0" on the bugs.schedmd.com site, and bug 2453 seems oddly familiar, although the tres format strings are properly populated in both instances.
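For anyone following along, here is a minimal sketch of why a frame with valp=0x0 and a nonzero size_val would line up with the "segfault at 0 ... in libc" kernel messages: copying size_val bytes from a NULL source makes the copy routine read address 0. This is not Slurm's pack.c. The struct sketch_buf, the function sketch_packmem, and the length-prefixed layout are all hypothetical names and assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a pack buffer; NOT Slurm's actual buffer type. */
struct sketch_buf {
    unsigned char data[256];
    size_t used;
};

/*
 * Hypothetical length-prefixed pack routine (assumed layout: 32-bit length,
 * then the payload). Returns 0 on success, -1 if the request is refused.
 */
int sketch_packmem(const void *valp, uint32_t size_val, struct sketch_buf *buf)
{
    size_t room;

    if (buf == NULL || buf->used > sizeof(buf->data))
        return -1;

    /*
     * The combination seen in frame #1: NULL source with a nonzero length.
     * Without this check, the memcpy below would read from address 0.
     */
    if (valp == NULL && size_val != 0)
        return -1;

    /* Refuse anything that would not fit in the remaining space. */
    room = sizeof(buf->data) - buf->used;
    if (room < sizeof(size_val) || size_val > room - sizeof(size_val))
        return -1;

    /* Write the length prefix, then the payload (if any). */
    memcpy(buf->data + buf->used, &size_val, sizeof(size_val));
    buf->used += sizeof(size_val);
    if (size_val != 0) {
        memcpy(buf->data + buf->used, valp, size_val);
        buf->used += size_val;
    }
    return 0;
}
```

With this sketch, sketch_packmem(NULL, 27, &buf) is refused and leaves the buffer untouched instead of faulting, so the interesting question in the real core dumps is which caller handed packmem a NULL pointer together with a length of 27 or 31.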
Thanks in advance for any information!

John DeSantis

On 04/16/2016 07:50 AM, John DeSantis wrote:
Hello,

This question will most likely need to be addressed by the developers, but one never knows. I skipped opening a ticket on bugs.schedmd.com because we do not have a support contract; I'm not sure whether this is standard procedure.

We have experienced a random(?) slurmctld failure resulting in a segfault twice this week. We have never seen this particular failure before:

Apr 13 17:20:03 controlhost kernel: slurmctld_srvcn[23331]: segfault at 0 ip 000000367aa8967e sp 00002b4599d0bb18 error 4 in libc-2.12.so[367aa00000+18a000]
Apr 16 06:18:02 controlhost kernel: slurmctld_srvcn[11009]: segfault at 0 ip 000000367aa8967e sp 00002b5f248f7b18 error 4 in libc-2.12.so[367aa00000+18a000]

Looking at the core dumps[1], it seems that both are following similar paths, yet when the threads are dumped, the LWPs which are in the core dumps don't match, e.g. core.7425 and core.23652.

The last update to the cluster configuration was the addition and removal of some of the nodes in the cluster, which is pretty banal, so I don't think that would be the cause. Besides, any real configuration issues would result in the controller not starting.

[1] https://enemy.org/~mrfusion/slurmctld_crashes.txt

I have the core dumps saved, so if anyone needs more information let me know. FWIW, I've dug a little into them and I'm not seeing anything out of the ordinary (e.g. NULL values in variables, etc.), but I'm willing to bet that I've overlooked something.

Thanks!

John DeSantis
