[slurm-dev] slurmctld_srvcn segfault 15.08.4

John DeSantis Sat, 16 Apr 2016 04:50:43 -0700


Hello,

This question will most likely need to be addressed by the developers, but one never knows.

Anyways, I skipped opening a ticket on bugs.schedmd.com because we do not have a support contract - not sure if this is standard procedure or not.

Anyways, we have experienced a random(?) slurmctld failure resulting in a segfault twice this week. We have never seen this particular failure before:

Apr 13 17:20:03 controlhost kernel: slurmctld_srvcn[23331]: segfault at 0 ip 000000367aa8967e sp 00002b4599d0bb18 error 4 in libc-2.12.so[367aa00000+18a000]
Apr 16 06:18:02 controlhost kernel: slurmctld_srvcn[11009]: segfault at 0 ip 000000367aa8967e sp 00002b5f248f7b18 error 4 in libc-2.12.so[367aa00000+18a000]
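
For reference, both lines report the same instruction pointer, so the offset of the faulting instruction inside libc can be worked out from the ip and the mapping base. Below is a minimal sketch, assuming the usual layout of the kernel's x86 segfault message; the helper name parse_segfault and the regular expression are illustrative only and not part of Slurm or any tool:

```python
import re

# Assumed layout of the kernel's segfault message, matching the log lines above.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Hypothetical helper: decode one kernel segfault line into a dict."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    err = int(m.group("err"))
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "object": m.group("obj"),
        # Offset of the faulting instruction relative to the mapping base.
        "offset": ip - base,
        "in_mapping": base <= ip < base + size,
        # x86 page-fault error code bits:
        # bit 0 clear = page not present, bit 1 = write access, bit 2 = user mode.
        "not_present": not (err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


line = (
    "Apr 13 17:20:03 controlhost kernel: slurmctld_srvcn[23331]: segfault at 0 "
    "ip 000000367aa8967e sp 00002b4599d0bb18 error 4 "
    "in libc-2.12.so[367aa00000+18a000]"
)
info = parse_segfault(line)
print(hex(info["offset"]))  # 0x8967e
```

Read this way, "segfault at 0" with "error 4" is a user-mode read of address 0, i.e. a NULL pointer being read somewhere inside libc. The offset could then be looked up against the matching libc build with a tool such as addr2line, assuming debug symbols are installed.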

Looking at the core dumps[1], it seems that both are following similar paths, yet when the threads are dumped, the LWPs which are in the core dumps don't match, e.g. core.7425 and core.23652.

The last update to the cluster configuration was the addition and removal of some of the nodes in the cluster, which is pretty banal, so I don't think that would be the cause - besides, any real configuration issues would result in the controller not starting.


[1] https://enemy.org/~mrfusion/slurmctld_crashes.txt

I have the core dumps saved, so if anyone needs more information let me know. FWIW, I've dug a little into them and I'm not seeing anything out of the ordinary (e.g. NULL values in variables, etc.), but I'm willing to bet that I've overlooked something.


Thanks!
John DeSantis
