>>> Klaus Wenninger <kwenn...@redhat.com> wrote on 17.02.2022 at 10:49 in message
<calrdao0ungyyybnv9xwve9v4suxvjon-y8c8vd51zr5lt1o...@mail.gmail.com>:
...
>> For completeness: Yes, sbd did recover:
>> Feb 14 13:01:42 h18 sbd[6615]: warning: cleanup_servant_by_pid: Servant
>> for /dev/disk/by-id/dm-name-SBD_1-3P1 (pid: 6619) has terminated
>> Feb 14 13:01:42 h18 sbd[6615]: warning: cleanup_servant_by_pid: Servant
>> for /dev/disk/by-id/dm-name-SBD_1-3P2 (pid: 6621) has terminated
>> Feb 14 13:01:42 h18 sbd[31668]: /dev/disk/by-id/dm-name-SBD_1-3P1:
>> notice: servant_md: Monitoring slot 4 on disk
>> /dev/disk/by-id/dm-name-SBD_1-3P1
>> Feb 14 13:01:42 h18 sbd[31669]: /dev/disk/by-id/dm-name-SBD_1-3P2:
>> notice: servant_md: Monitoring slot 4 on disk
>> /dev/disk/by-id/dm-name-SBD_1-3P2
>> Feb 14 13:01:49 h18 sbd[6615]: notice: inquisitor_child: Servant
>> /dev/disk/by-id/dm-name-SBD_1-3P1 is healthy (age: 0)
>> Feb 14 13:01:49 h18 sbd[6615]: notice: inquisitor_child: Servant
>> /dev/disk/by-id/dm-name-SBD_1-3P2 is healthy (age: 0)
>>
>
> Good to see that!
> Did you try several times?
Well, we only have two fabrics, and the server is in production, so each fabric
was interrupted once (to change the cabling). sbd survived. Second fabric:

Feb 14 13:03:51 h18 kernel: qla2xxx [0000:01:00.0]-500b:2: LOOP DOWN detected (2 7 0 0).
Feb 14 13:03:57 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 3
Feb 14 13:03:57 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 2
Feb 14 13:05:18 h18 kernel: qla2xxx [0000:01:00.0]-500a:2: LOOP UP detected (8 Gbps).
Feb 14 13:05:22 h18 multipathd[5180]: SBD_1-3P2: sdr - tur checker reports path is up
Feb 14 13:05:22 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 3
Feb 14 13:05:23 h18 multipathd[5180]: SBD_1-3P2: sdae - tur checker reports path is up
Feb 14 13:05:23 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 4
Feb 14 13:05:25 h18 multipathd[5180]: SBD_1-3P1: sdl - tur checker reports path is up
Feb 14 13:05:25 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 3
Feb 14 13:05:26 h18 multipathd[5180]: SBD_1-3P1: sdo - tur checker reports path is up
Feb 14 13:05:26 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 4

So this time multipath reacted before sbd noticed anything (which is how it
should work anyway).

> I have some memory that when testing with the kernel mentioned before
> behavior changed after a couple of timeouts and it wasn't able to create
> the read-request anymore (without the fix mentioned) - assume some kind
> of resource depletion due to previously hanging attempts not destroyed
> properly.

That could be a nasty race condition, too, however. (I have had my share of
signal handlers, threads, and race conditions.) Of course, cruder programming
errors are also possible. Debugging can be very hard, but dmsetup can create
bad disks for testing for you ;-)

DEV=bad_disk
dmsetup create "$DEV" <<EOF
0 8 zero
8 1 error
9 7 zero
16 1 error
17 255 zero
EOF

Regards,
Ulrich
...
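For reference, each line of that dmsetup table is "<start> <length> <target>"
counted in 512-byte sectors, so the mapped device returns I/O errors exactly
at sectors 8 and 16. A small sketch (needs no root; it only does the sector
arithmetic, not the actual device creation) shows which byte ranges a reader
such as sbd would see fail:

```shell
# List the byte ranges [start-end) that return EIO for the table above.
# "error" targets sit at sectors 8 and 16; sectors are 512 bytes.
awk '$3 == "error" { printf "%d-%d\n", $1 * 512, ($1 + $2) * 512 }' <<EOF
0 8 zero
8 1 error
9 7 zero
16 1 error
17 255 zero
EOF
```

This prints 4096-4608 and 8192-8704, i.e. one bad sector at byte offset 4096
and one at 8192; a dd read with bs=512 skip=8 count=1 against the mapped
device would hit the first of them.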
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/