>>> Klaus Wenninger <kwenn...@redhat.com> wrote on 17.02.2022 at 10:49 in message
<calrdao0ungyyybnv9xwve9v4suxvjon-y8c8vd51zr5lt1o...@mail.gmail.com>:
...
>> For completeness: Yes, sbd did recover:
>> Feb 14 13:01:42 h18 sbd[6615]: warning: cleanup_servant_by_pid: Servant
>> for /dev/disk/by-id/dm-name-SBD_1-3P1 (pid: 6619) has terminated
>> Feb 14 13:01:42 h18 sbd[6615]: warning: cleanup_servant_by_pid: Servant
>> for /dev/disk/by-id/dm-name-SBD_1-3P2 (pid: 6621) has terminated
>> Feb 14 13:01:42 h18 sbd[31668]: /dev/disk/by-id/dm-name-SBD_1-3P1:
>> notice: servant_md: Monitoring slot 4 on disk
>> /dev/disk/by-id/dm-name-SBD_1-3P1
>> Feb 14 13:01:42 h18 sbd[31669]: /dev/disk/by-id/dm-name-SBD_1-3P2:
>> notice: servant_md: Monitoring slot 4 on disk
>> /dev/disk/by-id/dm-name-SBD_1-3P2
>> Feb 14 13:01:49 h18 sbd[6615]: notice: inquisitor_child: Servant
>> /dev/disk/by-id/dm-name-SBD_1-3P1 is healthy (age: 0)
>> Feb 14 13:01:49 h18 sbd[6615]: notice: inquisitor_child: Servant
>> /dev/disk/by-id/dm-name-SBD_1-3P2 is healthy (age: 0)
>>
>
> Good to see that!
> Did you try several times?
Well, we only have two fabrics, and the server is in production, so each fabric
was interrupted once (to change the cabling). sbd survived. Second fabric:

Feb 14 13:03:51 h18 kernel: qla2xxx [0000:01:00.0]-500b:2: LOOP DOWN detected (2 7 0 0).
Feb 14 13:03:57 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 3
Feb 14 13:03:57 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 2
Feb 14 13:05:18 h18 kernel: qla2xxx [0000:01:00.0]-500a:2: LOOP UP detected (8 Gbps).
Feb 14 13:05:22 h18 multipathd[5180]: SBD_1-3P2: sdr - tur checker reports path is up
Feb 14 13:05:22 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 3
Feb 14 13:05:23 h18 multipathd[5180]: SBD_1-3P2: sdae - tur checker reports path is up
Feb 14 13:05:23 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 4
Feb 14 13:05:25 h18 multipathd[5180]: SBD_1-3P1: sdl - tur checker reports path is up
Feb 14 13:05:25 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 3
Feb 14 13:05:26 h18 multipathd[5180]: SBD_1-3P1: sdo - tur checker reports path is up
Feb 14 13:05:26 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 4

So this time multipath reacted before sbd noticed anything (which is how it
should work anyway).

> I have some memory that when testing with the kernel mentioned before
> behavior changed after a couple of timeouts and it wasn't able to create
> the read-request anymore (without the fix mentioned) - assume some kind
> of resource depletion due to previously hanging attempts not destroyed
> properly.

That could be a nasty race condition, too, however. (I have had my share of
signal handlers, threads, and race conditions.) Of course, cruder programming
errors are also possible. Debugging can be very hard, but dmsetup can create
bad disks for testing for you ;-)

DEV=bad_disk
dmsetup create "$DEV" <<EOF
0 8 zero
8 1 error
9 7 zero
16 1 error
17 255 zero
EOF

Regards,
Ulrich
...
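For reference, each line of that dmsetup table is "<start> <length> <target>"
counted in 512-byte sectors, so the mapped device returns I/O errors exactly
at sectors 8 and 16. A small sketch (needs no root; it only does the sector
arithmetic, not the actual device creation) shows which byte ranges a reader
such as sbd would see fail:

```shell
# List the byte ranges [start-end) that return EIO for the table above.
# "error" targets sit at sectors 8 and 16; sectors are 512 bytes.
awk '$3 == "error" { printf "%d-%d\n", $1 * 512, ($1 + $2) * 512 }' <<EOF
0 8 zero
8 1 error
9 7 zero
16 1 error
17 255 zero
EOF
```

This prints 4096-4608 and 8192-8704, i.e. one bad sector at byte offset 4096
and one at 8192; a dd read with bs=512 skip=8 count=1 against the mapped
device would hit the first of them.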
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/