Your message dated Sat, 11 Aug 2018 22:16:19 +0200
with message-id <20180811201619.2b7vatnxtovz4...@percival.namespace.at>
and subject line Re: #799781: device lock race condition between udev and 
multipathd may cause systemd to abort system boot
has caused the Debian Bug report #799781,
regarding device lock race condition between udev and multipathd may cause 
systemd to abort system boot
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact ow...@bugs.debian.org
immediately.)


-- 
799781: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=799781
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems
--- Begin Message ---
Package: multipath-tools
Version: 0.5.0-6+deb8u1
Severity: critical
Tags: patch


Configuration:
I have the following setup: 
Dell PowerEdge M620 + QLogic ISP2532-based 8Gb Fibre Channel to PCI Express HBA 
attached to our SAN with multipath.
OS is Debian Jessie 8.1
The server's root file system resides on an LVM logical volume.
The packages multipath-tools and multipath-tools-boot were installed.

Symptom:
Approximately 50% of the time the server fails to boot correctly, depending on 
the outcome of the race condition between udev and multipathd (see below).
Instead, the password prompt for entering single user mode (or rescue.target) 
appears.

Problem:
The problem seems to be the same one that Will Aoki already reported against 
upgrade-reports in bug report 788295.
He was using open-iscsi, while I'm using a FC-HBA with the qla2xxx module. I'm 
guessing other combinations are affected too.

Bug 788295 has a very detailed analysis of the problem. The provided logs 
correlate with mine.
Since 788295 was filed against upgrade-reports, it'll probably not get fixed, 
hence this report.

Further Information:
Existing Debian bug report: 
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=788295
Ubuntu fixed the issue. See 
https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1431650
Ubuntu Package with fix: 
http://packages.ubuntu.com/trusty-updates/multipath-tools
See also the commit message of the patch taken from Ubuntu for more technical 
details.

Solution:
The following patch, taken from the Ubuntu package, solved the problem for me 
and Will Aoki.
Could you please add this patch to the official Debian package and, if 
possible, get the fixed package into jessie-updates and the next jessie release?

------------------- START OF PATCH -----------------
From 841977fc9c3432702c296d6239e4a54291a6007a Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <h...@suse.de>
Date: Tue, 24 Jun 2014 08:49:15 +0200
Subject: [PATCH] libmultipath: use a shared lock to co-operate with udev

udev since v214 is placing a shared lock on the device node
whenever it's processing the event. This introduces a race
condition with multipathd, as multipathd is processing the
event for the block device at the same time as udev is
processing the events for the partitions.
And a lock on the partitions will also be visible on the
block device itself, hence multipathd won't be able to
lock the device.
When multipath manages to take a lock on the device,
udev will fail, and consequently ignore this entire event.
Which in turn might cause the system to malfunction as it
might have been a crucial event like 'remove' or 'link down'.

So we should better use LOCK_SH here; with that the flock
call in multipathd _and_ udev will succeed and the events
can be processed.

References: bnc#883878

Signed-off-by: Hannes Reinecke <h...@suse.de>
---
 libmultipath/configure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libmultipath/configure.c b/libmultipath/configure.c
index 0ddd3d5..dc2ebf0 100644
--- a/libmultipath/configure.c
+++ b/libmultipath/configure.c
@@ -529,7 +529,7 @@ lock_multipath (struct multipath * mpp, int lock)
                if (!pgp->paths)
                        continue;
                vector_foreach_slot(pgp->paths, pp, j) {
-                       if (lock && flock(pp->fd, LOCK_EX | LOCK_NB) &&
+                       if (lock && flock(pp->fd, LOCK_SH | LOCK_NB) &&
                            errno == EWOULDBLOCK)
                                goto fail;
                        else if (!lock)

------------------- END OF PATCH -----------------
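
The flock() compatibility rules that the commit message relies on can be 
checked in isolation. The following is only a minimal sketch, not code from 
multipath-tools or udev: the function name second_lock_succeeds is 
hypothetical, and it locks a temporary regular file through two independent 
file descriptors instead of a real block device node, so it does not reproduce 
the partition/whole-device interaction. The first descriptor stands in for 
udev, the second for multipathd.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of multipath-tools): take a lock of type
 * first_op on one descriptor, then try a non-blocking lock of type
 * second_op on a second, independently opened descriptor for the same file.
 *
 * Returns 1 if the second lock was granted,
 *         0 if it was refused with EWOULDBLOCK,
 *        -1 on any other error.
 */
int second_lock_succeeds(int first_op, int second_op)
{
    char path[] = "/tmp/flock-demo-XXXXXX";
    int fd_first, fd_second;
    int result = -1;

    assert(first_op == LOCK_SH || first_op == LOCK_EX);
    assert(second_op == LOCK_SH || second_op == LOCK_EX);

    /* Stand-in for the first lock holder (udev in the report). */
    fd_first = mkstemp(path);
    if (fd_first < 0)
        return -1;

    /* A separate open() gives a separate open file description, so the
     * two locks conflict exactly as they would between two processes. */
    fd_second = open(path, O_RDONLY);
    if (fd_second < 0) {
        unlink(path);
        close(fd_first);
        return -1;
    }

    if (flock(fd_first, first_op | LOCK_NB) == 0) {
        /* Same pattern as lock_multipath(): non-blocking attempt,
         * with EWOULDBLOCK meaning "someone else holds the lock". */
        if (flock(fd_second, second_op | LOCK_NB) == 0)
            result = 1;
        else if (errno == EWOULDBLOCK)
            result = 0;
    }

    unlink(path);
    close(fd_second);   /* closing the descriptors releases the locks */
    close(fd_first);
    return result;
}
```

Called as second_lock_succeeds(LOCK_SH, LOCK_EX), this models the unpatched 
code (udev holds a shared lock, multipathd asks for an exclusive one) and the 
second lock is refused. Called as second_lock_succeeds(LOCK_SH, LOCK_SH), it 
models the patched code and both locks are granted.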

Additional comments:
Why I rated this critical: (1) The Ubuntu bug is rated critical. (2) I think 
the "makes unrelated software on the system (or the whole system) break" clause 
applies when a system does not reliably boot anymore.
I can provide journal entries of a failed boot attempt if necessary. Since such 
logs already exist in bug 788295 and a tested patch exists, I did not think 
they were needed.

Kind Regards
Niels Baumgartner

--- End Message ---
--- Begin Message ---
This is marked as fixed in stretch, and only open in jessie. As we
can no longer send fixes to jessie, I'm closing this bug.

Thanks,
Chris

--- End Message ---
