On Wed, Feb 16, 2022 at 4:26 PM Klaus Wenninger <kwenn...@redhat.com> wrote:
> On Wed, Feb 16, 2022 at 3:09 PM Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
>
>> Hi!
>>
>> When changing some FC cables I noticed that sbd complained 2 seconds
>> after the connection went down (even though the device is multi-pathed,
>> with other paths still up).
>> I don't know of any sbd parameter set so low that sbd would panic after
>> 2 seconds. Which parameter (if any) is responsible for that?
>>
>> In fact multipath takes up to 5 seconds to adjust paths.
>>
>> Here are some sample events (sbd-1.5.0+20210720.f4ca41f-3.6.1.x86_64 from SLES15 SP3):
>> Feb 14 13:01:36 h18 kernel: qla2xxx [0000:41:00.0]-500b:3: LOOP DOWN detected (2 7 0 0).
>> Feb 14 13:01:38 h18 sbd[6621]: /dev/disk/by-id/dm-name-SBD_1-3P2: error: servant_md: slot read failed in servant.
>> Feb 14 13:01:38 h18 sbd[6619]: /dev/disk/by-id/dm-name-SBD_1-3P1: error: servant_md: mbox read failed in servant.
>> Feb 14 13:01:40 h18 sbd[6615]: warning: inquisitor_child: Servant /dev/disk/by-id/dm-name-SBD_1-3P1 is outdated (age: 11)
>> Feb 14 13:01:40 h18 sbd[6615]: warning: inquisitor_child: Servant /dev/disk/by-id/dm-name-SBD_1-3P2 is outdated (age: 11)
>> Feb 14 13:01:40 h18 sbd[6615]: warning: inquisitor_child: Majority of devices lost - surviving on pacemaker
>> Feb 14 13:01:42 h18 kernel: sd 3:0:3:2: rejecting I/O to offline device
>> Feb 14 13:01:42 h18 kernel: blk_update_request: I/O error, dev sdbt, sector 2048 op 0x0:(READ) flags 0x4200 phys_seg 1 prio class 1
>> Feb 14 13:01:42 h18 kernel: device-mapper: multipath: 254:17: Failing path 68:112.
>> Feb 14 13:01:42 h18 kernel: sd 3:0:1:2: rejecting I/O to offline device
>
> Sorry, I forgot to address the following: I guess your sbd package predates
> https://github.com/ClusterLabs/sbd/commit/9e6cbbad9e259de374cbf41b713419c342528db1
> and thus doesn't properly destroy the io-context using the aio API.
> This flaw has been there more or less since the beginning. I actually found it
> because of a kernel issue that made all block I/O done the way sbd does it
> (aio + O_SYNC + O_DIRECT) time out. (I never managed to track it down to the
> real kernel issue while playing with kprobes, and it was gone with the next
> kernel update.)
> Without survival on pacemaker, the node would have committed suicide after
> msgwait-timeout (probably 10s in your case).
> It would be interesting to see what happens if you raise msgwait-timeout to a
> value that allows another read attempt. Does your setup actually recover?
> It may well not, if it is missing the fix referenced above.
>
> Regards,
> Klaus
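[Editor's note: for context, the I/O pattern Klaus describes -- a single aio read
against an O_SYNC|O_DIRECT file descriptor with a timed wait, followed by the
io_destroy() teardown that the commit above adds -- looks roughly like the
simplified sketch below. It is illustrative only: timed_sector_read() is a
made-up helper name, not sbd's actual servant code, and the paths/sizes are
arbitrary. Build with -laio.]

    /* Minimal sketch: timed aio read of one sector, with io-context teardown. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define SECTOR 512

    /* Read one sector at 'offset', waiting at most 'timeout_s' (think timeout_io). */
    static int timed_sector_read(const char *dev, long long offset, int timeout_s)
    {
        int rc = -1;
        int fd = open(dev, O_RDONLY | O_DIRECT | O_SYNC);
        if (fd < 0) { perror("open"); return -1; }

        void *buf = NULL;                       /* O_DIRECT needs an aligned buffer */
        if (posix_memalign(&buf, SECTOR, SECTOR) != 0) { close(fd); return -1; }

        io_context_t ctx = 0;
        if (io_setup(1, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); free(buf); close(fd); return -1; }

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, SECTOR, offset);
        if (io_submit(ctx, 1, cbs) == 1) {
            struct io_event ev;
            struct timespec ts = { .tv_sec = timeout_s, .tv_nsec = 0 };
            if (io_getevents(ctx, 1, 1, &ev, &ts) == 1 && ev.res == SECTOR) {
                rc = 0;                         /* read completed within the timeout */
            } else {
                fprintf(stderr, "read timed out or failed\n");
                io_cancel(ctx, &cb, &ev);       /* best effort; the I/O may still be in flight */
            }
        }

        io_destroy(ctx);                        /* the teardown the referenced fix adds */
        free(buf);
        close(fd);
        return rc;
    }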
>> Most puzzling is the fact that sbd reports a problem 4 seconds before the
>> kernel reports an I/O error. I guess sbd "times out" the pending read.
>
> Yep - that is timeout_io, defaulting to 3s.
> You can set it with the -I daemon start parameter.
> Together with the rest of the default timeout scheme, the 3s do make sense.
> Not sure, but if you increase it significantly you might have to adapt other
> timeouts. There are a number of checks regarding the relationship of the
> timeouts, but they might not be exhaustive.
>
>> The thing is: both SBD disks are on different storage systems, each connected
>> via two separate FC fabrics, yet sbd still panics when one cable is
>> disconnected from the host.
>> My guess is that if "surviving on pacemaker" had not happened, the node would
>> have been fenced; is that right?
>>
>> The other thing I wonder about is the "outdated" age: how can the age be
>> 11 (seconds) when the disk was disconnected 4 seconds ago?
>> It seems the age here is "current_time - time_of_last_read" instead of
>> "current_time - time_when_read_attempt_started".
>
> Exactly! And that is the correct way to do it, as we need to record the time
> passed since the last successful read.
> There is no value in starting the clock when we start the read attempt, as
> these attempts are not synced throughout the cluster.
>
> Regards,
> Klaus
>
>> Regards,
>> Ulrich
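[Editor's note: to make the "age" bookkeeping discussed above concrete, here is a
minimal sketch of the check as Klaus describes it. The struct and function names
are hypothetical, not sbd's actual code. Because the clock starts at the last
successful read, and reads are only attempted periodically, the reported age can
exceed the time since the cable was actually pulled.]

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical names, for illustration only. */
    struct servant_state {
        time_t t_last_successful_read;   /* updated only when a read completes OK */
    };

    /* "Age" is measured from the last successful read, not from when the current
     * read attempt was started, so a device that stops answering keeps aging
     * even while a read is still pending. */
    static int servant_outdated(const struct servant_state *s, int threshold_s)
    {
        long age = (long)(time(NULL) - s->t_last_successful_read);
        if (age > threshold_s)
            fprintf(stderr, "warning: servant is outdated (age: %ld)\n", age);
        return age > threshold_s;
    }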
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/