Re: [ClusterLabs] Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

2022-02-17 Thread Klaus Wenninger
On Thu, Feb 17, 2022 at 10:14 AM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> Klaus Wenninger  schrieb am 16.02.2022 um 16:59
> in
> Nachricht
> :
> > On Wed, Feb 16, 2022 at 4:26 PM Klaus Wenninger 
> wrote:
> >
> >>
> >>
> >> On Wed, Feb 16, 2022 at 3:09 PM Ulrich Windl <
> >> ulrich.wi...@rz.uni-regensburg.de> wrote:
> >>
> >>> Hi!
> >>>
> >>> When changing some FC cables I noticed that sbd complained 2 seconds
> >>> after the connection went down (event though the device is multi-pathed
> >>> with other paths being still up).
> >>> I don't know any sbd parameter being set so low that after 2 seconds
> sbd
> >>> would panic. Which parameter (if any) is responsible for that?
> >>>
> >>> In fact multipath takes up to 5 seconds to adjust paths.
> >>>
> >>> Here are some sample events (sbd-1.5.0+20210720.f4ca41f-3.6.1.x86_64
> from
> >>> SLES15 SP3):
> >>> Feb 14 13:01:36 h18 kernel: qla2xxx [:41:00.0]-500b:3: LOOP DOWN
> >>> detected (2 7 0 0).
> >>> Feb 14 13:01:38 h18 sbd[6621]: /dev/disk/by-id/dm-name-SBD_1-3P2:
> >>> error: servant_md: slot read failed in servant.
> >>> Feb 14 13:01:38 h18 sbd[6619]: /dev/disk/by-id/dm-name-SBD_1-3P1:
> >>> error: servant_md: mbox read failed in servant.
> >>> Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Servant
> >>> /dev/disk/by-id/dm-name-SBD_1-3P1 is outdated (age: 11)
> >>> Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Servant
> >>> /dev/disk/by-id/dm-name-SBD_1-3P2 is outdated (age: 11)
> >>> Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Majority of
> >>> devices lost - surviving on pacemaker
> >>> Feb 14 13:01:42 h18 kernel: sd 3:0:3:2: rejecting I/O to offline device
> >>> Feb 14 13:01:42 h18 kernel: blk_update_request: I/O error, dev sdbt,
> >>> sector 2048 op 0x0:(READ) flags 0x4200 phys_seg 1 prio class 1
> >>> Feb 14 13:01:42 h18 kernel: device-mapper: multipath: 254:17: Failing
> >>> path 68:112.
> >>> Feb 14 13:01:42 h18 kernel: sd 3:0:1:2: rejecting I/O to offline device
> >>>
> >> Sry forgotten to address the following.
> >
> > Guess your sbd-package predates
> >
> https://github.com/ClusterLabs/sbd/commit/9e6cbbad9e259de374cbf41b713419c342
> > 528db1
> > and thus doesn't properly destroy the io-context using the aio-api.
> > This flaw has been in kind of since ever and I actually found it due to a
> > kernel-issue that made
> > all block-io done the way sbd is doing it (aio + O_SYNC + O_DIRECT
> Actually
> > never successfully
> > tracked it down to the real kernel issue playing with kprobes. But it was
> > gone on the next kernel
> > update
> > ) timeout.
> > Without survival on pacemaker it would have suicided after
> > msgwait-timeout (10s in your case probably).
> > Would be interesting what happens if you raise msgwait-timeout to a value
> > that would allow
> > another read attempt.
> > Does your setup actually recover? Could be possible that it doesn't
> missing
> > the fix referenced above.
>
> For completeness: Yes, sbd did recover:
> Feb 14 13:01:42 h18 sbd[6615]:  warning: cleanup_servant_by_pid: Servant
> for /dev/disk/by-id/dm-name-SBD_1-3P1 (pid: 6619) has terminated
> Feb 14 13:01:42 h18 sbd[6615]:  warning: cleanup_servant_by_pid: Servant
> for /dev/disk/by-id/dm-name-SBD_1-3P2 (pid: 6621) has terminated
> Feb 14 13:01:42 h18 sbd[31668]: /dev/disk/by-id/dm-name-SBD_1-3P1:
>  notice: servant_md: Monitoring slot 4 on disk
> /dev/disk/by-id/dm-name-SBD_1-3P1
> Feb 14 13:01:42 h18 sbd[31669]: /dev/disk/by-id/dm-name-SBD_1-3P2:
>  notice: servant_md: Monitoring slot 4 on disk
> /dev/disk/by-id/dm-name-SBD_1-3P2
> Feb 14 13:01:49 h18 sbd[6615]:   notice: inquisitor_child: Servant
> /dev/disk/by-id/dm-name-SBD_1-3P1 is healthy (age: 0)
> Feb 14 13:01:49 h18 sbd[6615]:   notice: inquisitor_child: Servant
> /dev/disk/by-id/dm-name-SBD_1-3P2 is healthy (age: 0)
>

Good to see that!
Did you try several times?
I have some memory that when testing with the kernel mentioned before
behavior
changed after a couple of timeouts and it wasn't able to create the
read-request
anymore (without the fix mentioned) - assume some kind of resource depletion
due to previously hanging attempts not destroyed properly.
But that behavior might heavily depend on the kernel-version and as your
attempts
do terminate with failure in the kernel some time later that might resolve
issues
as well (in my case they would like hang forever from kernel pov).
As you seem to have sbd rebuilt anyway for your timer-checks ... would be
interesting if with the fix the delayed timeout-messages from the kernel
would
disappear.

Regards,
Klaus

>
> Feb 14 13:02:10 h18 kernel: qla2xxx [:41:00.0]-500a:3: LOOP UP
> detected (8 Gbps).
> Feb 14 13:02:15 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 3
> Feb 14 13:02:15 h18 multipathd[5180]: SBD_1-3P1: sdbl - tur checker
> reports path is up
> Feb 14 13:02:15 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 4
> Feb 14 13:02:16 h18 multipathd[5180]: SBD_1-3P2: s

Re: [ClusterLabs] Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

2022-02-17 Thread Klaus Wenninger
On Thu, Feb 17, 2022 at 9:27 AM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> Klaus Wenninger  schrieb am 16.02.2022 um 16:26
> in
> Nachricht
> :
> > On Wed, Feb 16, 2022 at 3:09 PM Ulrich Windl <
> > ulrich.wi...@rz.uni-regensburg.de> wrote:
> >
> >> Hi!
> >>
> >> When changing some FC cables I noticed that sbd complained 2 seconds
> after
> >> the connection went down (event though the device is multi-pathed with
> >> other paths being still up).
> >> I don't know any sbd parameter being set so low that after 2 seconds sbd
> >> would panic. Which parameter (if any) is responsible for that?
> >>
> >> In fact multipath takes up to 5 seconds to adjust paths.
> >>
> >> Here are some sample events (sbd-1.5.0+20210720.f4ca41f-3.6.1.x86_64
> from
> >> SLES15 SP3):
> >> Feb 14 13:01:36 h18 kernel: qla2xxx [:41:00.0]-500b:3: LOOP DOWN
> >> detected (2 7 0 0).
> >> Feb 14 13:01:38 h18 sbd[6621]: /dev/disk/by-id/dm-name-SBD_1-3P2:
> >> error: servant_md: slot read failed in servant.
> >> Feb 14 13:01:38 h18 sbd[6619]: /dev/disk/by-id/dm-name-SBD_1-3P1:
> >> error: servant_md: mbox read failed in servant.
> >> Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Servant
> >> /dev/disk/by-id/dm-name-SBD_1-3P1 is outdated (age: 11)
> >> Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Servant
> >> /dev/disk/by-id/dm-name-SBD_1-3P2 is outdated (age: 11)
> >> Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Majority of
> >> devices lost - surviving on pacemaker
> >> Feb 14 13:01:42 h18 kernel: sd 3:0:3:2: rejecting I/O to offline device
> >> Feb 14 13:01:42 h18 kernel: blk_update_request: I/O error, dev sdbt,
> >> sector 2048 op 0x0:(READ) flags 0x4200 phys_seg 1 prio class 1
> >> Feb 14 13:01:42 h18 kernel: device-mapper: multipath: 254:17: Failing
> path
> >> 68:112.
> >> Feb 14 13:01:42 h18 kernel: sd 3:0:1:2: rejecting I/O to offline device
> >>
> >> Most puzzling is the fact that sbd reports a problem 4 seconds before
> the
> >> kernel reports an I/O error. I guess sbd "times out" the pending read.
> >>
> > Yep - that is timeout_io defaulting to 3s.
> > You can set it with -I daemon start parameter.
> > Together with the rest of the default-timeout-scheme the 3s do make
> sense.
> > Not sure but if you increase that significantly you might have to adapt
> > other timeouts.
>
> We extended the timeouts so that sbd would survive an online firmware
> update of the storage system which may cause it not to respond for up to 30
> seconds when the controllers restart.
>
> > There are a certain number of checks regarding relationship of timeouts
> but
> > they might not be exhaustive.
> >
> >>
> >> The thing is: Both SBD disks are on different storage systems, each
> being
> >> connected by two separate FC fabrics, but still when disconnecting one
> >> cable from the host sbd panics.
> >> My guess is if "surviving on pacemaker" would not have happened, the
> node
> >> would be fenced; is that right?
> >>
> >> The other thing I wonder is the "outdated age":
> >> How can the age be 11 (seconds) when the disk was disconnected 4 seconds
> >> ago?
> >> It seems here the age is "current time - time_of_last read" instead of
> >> "current_time - time_when read_attempt_started".
> >>
> > Exactly! And that is the correct way to do it as we need to record the
> time
> > passed since last successful read.
> > There is no value in starting the clock when we start the read attempt as
> > these attempts are not synced throughout
> > the cluster.
>
> I don't understand: There is no heartbeat written to SBD that has to be
> read; instead the device is polled for messages.
> So the important point is how much the polling is delayed by some problem
> (not by some deliberate sleep).
> And I don't see why the measurement has to be synced throughout the
> cluster: Local is enough.
>
> I'm afraid I'm not really getting your point below.

Anyway the important part is that the last time having read the mailbox
successfully as empty mustn't
be older than basically msgwait. (partly a result of the polls not synced
throughout the cluster)
As we on top need to take into account signalling between the processes +
poll-cycle + x the actual read
io-timeout should be substantially shorter.
Purpose of the code is not to print any critical logs but send heartbeats
to the inquisitor so I guess there
is no real point in making it more complicated (and thus risky to fail -
especially by making the code
multi-threaded if I'm getting that right) than it has to be.
We on top shouldn't mix responsibilities of the sbd-processes like the
disk-watcher e.g. checking for msgwait-timeout.
If it takes too long we let it timeout and try again. And if the timeout is
too short for the disk-setup then the io-timeout
should be raised to an appropriate value.
Purpose of the poll cycle is to define a maximum poll-rate. That makes it
somehow related to
the io-timeout but I'm not sure if it is a good measure for some kind of a
warn

[ClusterLabs] Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

2022-02-17 Thread Ulrich Windl
>>> Klaus Wenninger  schrieb am 16.02.2022 um 16:59 in
Nachricht
:
> On Wed, Feb 16, 2022 at 4:26 PM Klaus Wenninger  wrote:
> 
>>
>>
>> On Wed, Feb 16, 2022 at 3:09 PM Ulrich Windl <
>> ulrich.wi...@rz.uni-regensburg.de> wrote:
>>
>>> Hi!
>>>
>>> When changing some FC cables I noticed that sbd complained 2 seconds
>>> after the connection went down (event though the device is multi-pathed
>>> with other paths being still up).
>>> I don't know any sbd parameter being set so low that after 2 seconds sbd
>>> would panic. Which parameter (if any) is responsible for that?
>>>
>>> In fact multipath takes up to 5 seconds to adjust paths.
>>>
>>> Here are some sample events (sbd-1.5.0+20210720.f4ca41f-3.6.1.x86_64 from
>>> SLES15 SP3):
>>> Feb 14 13:01:36 h18 kernel: qla2xxx [:41:00.0]-500b:3: LOOP DOWN
>>> detected (2 7 0 0).
>>> Feb 14 13:01:38 h18 sbd[6621]: /dev/disk/by-id/dm-name-SBD_1-3P2:
>>> error: servant_md: slot read failed in servant.
>>> Feb 14 13:01:38 h18 sbd[6619]: /dev/disk/by-id/dm-name-SBD_1-3P1:
>>> error: servant_md: mbox read failed in servant.
>>> Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Servant
>>> /dev/disk/by-id/dm-name-SBD_1-3P1 is outdated (age: 11)
>>> Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Servant
>>> /dev/disk/by-id/dm-name-SBD_1-3P2 is outdated (age: 11)
>>> Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Majority of
>>> devices lost - surviving on pacemaker
>>> Feb 14 13:01:42 h18 kernel: sd 3:0:3:2: rejecting I/O to offline device
>>> Feb 14 13:01:42 h18 kernel: blk_update_request: I/O error, dev sdbt,
>>> sector 2048 op 0x0:(READ) flags 0x4200 phys_seg 1 prio class 1
>>> Feb 14 13:01:42 h18 kernel: device-mapper: multipath: 254:17: Failing
>>> path 68:112.
>>> Feb 14 13:01:42 h18 kernel: sd 3:0:1:2: rejecting I/O to offline device
>>>
>> Sry forgotten to address the following.
> 
> Guess your sbd-package predates
> https://github.com/ClusterLabs/sbd/commit/9e6cbbad9e259de374cbf41b713419c342 
> 528db1
> and thus doesn't properly destroy the io-context using the aio-api.
> This flaw has been in kind of since ever and I actually found it due to a
> kernel-issue that made
> all block-io done the way sbd is doing it (aio + O_SYNC + O_DIRECT Actually
> never successfully
> tracked it down to the real kernel issue playing with kprobes. But it was
> gone on the next kernel
> update
> ) timeout.
> Without survival on pacemaker it would have suicided after
> msgwait-timeout (10s in your case probably).
> Would be interesting what happens if you raise msgwait-timeout to a value
> that would allow
> another read attempt.
> Does your setup actually recover? Could be possible that it doesn't missing
> the fix referenced above.

For completeness: Yes, sbd did recover:
Feb 14 13:01:42 h18 sbd[6615]:  warning: cleanup_servant_by_pid: Servant for 
/dev/disk/by-id/dm-name-SBD_1-3P1 (pid: 6619) has terminated
Feb 14 13:01:42 h18 sbd[6615]:  warning: cleanup_servant_by_pid: Servant for 
/dev/disk/by-id/dm-name-SBD_1-3P2 (pid: 6621) has terminated
Feb 14 13:01:42 h18 sbd[31668]: /dev/disk/by-id/dm-name-SBD_1-3P1:   notice: 
servant_md: Monitoring slot 4 on disk /dev/disk/by-id/dm-name-SBD_1-3P1
Feb 14 13:01:42 h18 sbd[31669]: /dev/disk/by-id/dm-name-SBD_1-3P2:   notice: 
servant_md: Monitoring slot 4 on disk /dev/disk/by-id/dm-name-SBD_1-3P2
Feb 14 13:01:49 h18 sbd[6615]:   notice: inquisitor_child: Servant 
/dev/disk/by-id/dm-name-SBD_1-3P1 is healthy (age: 0)
Feb 14 13:01:49 h18 sbd[6615]:   notice: inquisitor_child: Servant 
/dev/disk/by-id/dm-name-SBD_1-3P2 is healthy (age: 0)

Feb 14 13:02:10 h18 kernel: qla2xxx [:41:00.0]-500a:3: LOOP UP detected (8 
Gbps).
Feb 14 13:02:15 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 3
Feb 14 13:02:15 h18 multipathd[5180]: SBD_1-3P1: sdbl - tur checker reports 
path is up
Feb 14 13:02:15 h18 multipathd[5180]: SBD_1-3P1: remaining active paths: 4
Feb 14 13:02:16 h18 multipathd[5180]: SBD_1-3P2: sdbu - tur checker reports 
path is up
Feb 14 13:02:16 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 3
Feb 14 13:02:16 h18 multipathd[5180]: SBD_1-3P2: sdbo - tur checker reports 
path is up
Feb 14 13:02:16 h18 multipathd[5180]: SBD_1-3P2: remaining active paths: 4

> 
> Regards,
> Klaus
> 
>>
>>> Most puzzling is the fact that sbd reports a problem 4 seconds before the
>>> kernel reports an I/O error. I guess sbd "times out" the pending read.
>>>
>> Yep - that is timeout_io defaulting to 3s.
>> You can set it with -I daemon start parameter.
>> Together with the rest of the default-timeout-scheme the 3s do make sense.
>> Not sure but if you increase that significantly you might have to adapt
>> other timeouts.
>> There are a certain number of checks regarding relationship of timeouts
>> but they might not be exhaustive.
>>
>>>
>>> The thing is: Both SBD disks are on different storage systems, each being
>>> connected by two separate FC fabrics, but still when disconnecting one
>>> cable from t

[ClusterLabs] Antw: [EXT] Re: Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

2022-02-17 Thread Ulrich Windl
>>> Klaus Wenninger  schrieb am 16.02.2022 um 16:26 in
Nachricht
:
> On Wed, Feb 16, 2022 at 3:09 PM Ulrich Windl <
> ulrich.wi...@rz.uni-regensburg.de> wrote:
> 
>> Hi!
>>
>> When changing some FC cables I noticed that sbd complained 2 seconds after
>> the connection went down (event though the device is multi-pathed with
>> other paths being still up).
>> I don't know any sbd parameter being set so low that after 2 seconds sbd
>> would panic. Which parameter (if any) is responsible for that?
>>
>> In fact multipath takes up to 5 seconds to adjust paths.
>>
>> Here are some sample events (sbd-1.5.0+20210720.f4ca41f-3.6.1.x86_64 from
>> SLES15 SP3):
>> Feb 14 13:01:36 h18 kernel: qla2xxx [:41:00.0]-500b:3: LOOP DOWN
>> detected (2 7 0 0).
>> Feb 14 13:01:38 h18 sbd[6621]: /dev/disk/by-id/dm-name-SBD_1-3P2:
>> error: servant_md: slot read failed in servant.
>> Feb 14 13:01:38 h18 sbd[6619]: /dev/disk/by-id/dm-name-SBD_1-3P1:
>> error: servant_md: mbox read failed in servant.
>> Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Servant
>> /dev/disk/by-id/dm-name-SBD_1-3P1 is outdated (age: 11)
>> Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Servant
>> /dev/disk/by-id/dm-name-SBD_1-3P2 is outdated (age: 11)
>> Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Majority of
>> devices lost - surviving on pacemaker
>> Feb 14 13:01:42 h18 kernel: sd 3:0:3:2: rejecting I/O to offline device
>> Feb 14 13:01:42 h18 kernel: blk_update_request: I/O error, dev sdbt,
>> sector 2048 op 0x0:(READ) flags 0x4200 phys_seg 1 prio class 1
>> Feb 14 13:01:42 h18 kernel: device-mapper: multipath: 254:17: Failing path
>> 68:112.
>> Feb 14 13:01:42 h18 kernel: sd 3:0:1:2: rejecting I/O to offline device
>>
>> Most puzzling is the fact that sbd reports a problem 4 seconds before the
>> kernel reports an I/O error. I guess sbd "times out" the pending read.
>>
> Yep - that is timeout_io defaulting to 3s.
> You can set it with -I daemon start parameter.
> Together with the rest of the default-timeout-scheme the 3s do make sense.
> Not sure but if you increase that significantly you might have to adapt
> other timeouts.

We extended the timeouts so that sbd would survive an online firmware update of 
the storage system which may cause it not to respond for up to 30 seconds when 
the controllers restart.

> There are a certain number of checks regarding relationship of timeouts but
> they might not be exhaustive.
> 
>>
>> The thing is: Both SBD disks are on different storage systems, each being
>> connected by two separate FC fabrics, but still when disconnecting one
>> cable from the host sbd panics.
>> My guess is if "surviving on pacemaker" would not have happened, the node
>> would be fenced; is that right?
>>
>> The other thing I wonder is the "outdated age":
>> How can the age be 11 (seconds) when the disk was disconnected 4 seconds
>> ago?
>> It seems here the age is "current time - time_of_last read" instead of
>> "current_time - time_when read_attempt_started".
>>
> Exactly! And that is the correct way to do it as we need to record the time
> passed since last successful read.
> There is no value in starting the clock when we start the read attempt as
> these attempts are not synced throughout
> the cluster.

I don't understand: There is no heartbeat written to SBD that has to be read; 
instead the device is polled for messages.
So the important point is how much the polling is delayed by some problem (not 
by some deliberate sleep).
And I don't see why the measurement has to be synced throughout the cluster: 
Local is enough.

Well, maybe I'm a bit to picky, but I wrote some code that handles read stalls, 
and it took me a few iterations to get it right.
The interesting part is when the delay exceeds the polling interval, so 
critical alerts can be created while waiting for a response, but still counting 
as just "one" event:
20220215 201347 (W)monitor_thread: no I/O for seq 14 at 503949.8447422 since 
8.1015658, exceeding deadline at 503956.0947422 by 1.8515658
20220215 201347 (N)check_thresholds: CRITICAL (max:1): 8.1015658 > 1.000
20220215 201349 (W)monitor_thread: no I/O for seq 14 at 503949.8447422 since 
10.6018244, exceeding deadline at 503956.0947422 by 4.3518244
20220215 201349 (N)check_thresholds: CRITICAL (max:1): 10.6018244 > 1.000
20220215 201352 (W)monitor_thread: no I/O for seq 14 at 503949.8447422 since 
13.1020604, exceeding deadline at 503956.0947422 by 6.8520604
20220215 201352 (N)check_thresholds: CRITICAL (max:1): 13.1020604 > 1.000
20220215 201354 (W)monitor_thread: no I/O for seq 14 at 503949.8447422 since 
15.6022895, exceeding deadline at 503956.0947422 by 9.3522895
20220215 201354 (N)check_thresholds: CRITICAL (max:1): 15.6022895 > 1.000
20220215 201357 (W)monitor_thread: no I/O for seq 14 at 503949.8447422 since 
18.1025207, exceeding deadline at 503956.0947422 by 11.8525207
20220215 201357 (N)check_thresholds: CRITICAL (max:1): 18.1025207 > 1.00