>>> Edwin Török <[email protected]> wrote on 30.07.2018 at 11:20 in message
<[email protected]>:
> On 30/07/18 08:24, Ulrich Windl wrote:
>> Hi!
>>
>> We have a strange problem on one cluster node running Xen PV VMs (SLES11
>> SP4): After updating the kernel and adding new SBD devices (to replace an
>> old storage system), the system just seems to freeze.
>
> Hi,
>
> Which version of Xen are you using, and what Linux distribution runs in
> Dom0?

As the subject says: SLES11 SP4.

>> Closer inspection showed that SBD seems to send an NMI (for reasons still
>> to be examined), and the current Xen/kernel seems to be unable to handle
>> the NMI in a way that forces a restart of the server (see attached screen
>> shot).
>
> Can you show us your kernel boot cmdline, and loaded modules?
> Which watchdog module did you load? Have you tried xen_wdt?
> See https://www.suse.com/support/kb/doc/?id=7016880

The server is an HP DL380 G7, so the watchdog is hpwdt. The basic kernel
options are simply "earlyprintk=xen nomodeset", and Xen has
"dom0_mem=4096M,max:8192M".
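(For reference, this is how I check which watchdog driver is active; and if
I get around to trying xen_wdt as suggested, I'd expect the switch to look
roughly like the lines below. Untested on this box, and the blacklist file
name is only an example:)

    # see which watchdog modules are loaded (hpwdt in our case)
    lsmod | grep -iE 'wdt|dog'

    # switch to xen_wdt; sbd must be stopped first, since it holds /dev/watchdog
    echo 'blacklist hpwdt' > /etc/modprobe.d/99-blacklist-hpwdt.conf
    rmmod hpwdt && modprobe xen_wdt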
In the meantime I found out that if I disable the sbd watchdog (-W -W; who
wrote such terrible code?) the NMI is not sent. I suspect a problem with sbd
using three devices (before the change we only had two), because on startup
it says three times that it is starting the first servant:
sbd: [5904]: info: First servant start - zeroing inbox
sbd: [5903]: info: First servant start - zeroing inbox
sbd: [5901]: info: First servant start - zeroing inbox

The other thing is that a latency of 7s is reported (which I doubt very
much):
Jul 30 11:37:27 h01 sbd: [5901]: info: Latency: 7 on disk /dev/disk/by-id/dm-name-SBD_1-E3
Jul 30 11:37:27 h01 sbd: [5904]: info: Latency: 7 on disk /dev/disk/by-id/dm-name-SBD_1-3P2
Jul 30 11:37:27 h01 sbd: [5903]: info: Latency: 7 on disk /dev/disk/by-id/dm-name-SBD_1-3P1
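(For completeness, a three-device setup like ours lives in /etc/sysconfig/sbd
and looks roughly like the sketch below; the SBD_OPTS line is only
illustrative, not my exact options. The dump command reads back each device's
on-disk timeouts, which may help judge whether the reported latency is
plausible:)

    # /etc/sysconfig/sbd (sketch; semicolon-separated device list)
    SBD_DEVICE="/dev/disk/by-id/dm-name-SBD_1-E3;/dev/disk/by-id/dm-name-SBD_1-3P1;/dev/disk/by-id/dm-name-SBD_1-3P2"
    SBD_OPTS="-W"

    # print each device's header, including watchdog and msgwait timeouts
    for d in $(echo "$SBD_DEVICE" | tr ';' ' '); do sbd -d "$d" dump; done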
Regards,
Ulrich

> Best regards,
> --Edwin
>
>> The last message I see in the node's cluster log is this:
>> Jul 27 11:33:32 [15731] h01 cib: info: cib_file_write_with_digest: Reading cluster configuration file /var/lib/pacemaker/cib/cib.YESngs (digest: /var/lib/pacemaker/cib/cib.Yutv8O)
>>
>> Other nodes have these messages:
>> Jul 27 11:33:32 h05 dlm_controld.pcmk[15810]: dlm_process_node: Skipped active node 739512330: born-on=3864, last-seen=3936, this-event=3936, last-event=3932
>>
>> Jul 27 11:33:32 h10 dlm_controld.pcmk[20397]: dlm_process_node: Skipped active node 739512325: born-on=3856, last-seen=3936, this-event=3936, last-event=3932
>>
>> Can anybody shed some light on this issue?
>> 1) Under what circumstances is an NMI sent by SBD?
>> 2) What is the reaction expected after receiving an NMI?
>> 3) If it did work before, what could have gone wrong?
>>
>> I wanted to get some feedback from here before asking SLES support...
>>
>> Regards,
>> Ulrich

_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org