Re: [patch v0] qapi/qmp: Add timestamps to qmp command responses.

2022-09-28 Thread Roman Kagan
On Tue, Sep 27, 2022 at 08:04:11AM +0200, Markus Armbruster wrote:
> Daniel P. Berrangé  writes:
> > On Mon, Sep 26, 2022 at 12:59:40PM +0300, Denis Plotnikov wrote:
> >> Example of result:
> >> 
> >> ./qemu/scripts/qmp/qmp-shell /tmp/qmp.socket
> >> 
> >> (QEMU) query-status
> >> {"end": {"seconds": 1650367305, "microseconds": 831032},
> >>  "start": {"seconds": 1650367305, "microseconds": 831012},
> >>  "return": {"status": "running", "singlestep": false, "running": true}}
> >> 
> >> The response of the qmp command contains the start & end time of
> >> the qmp command processing.
> 
> Seconds and microseconds since when?  The update to qmp-spec.txt should
> tell.
> 
> Why split the time into seconds and microseconds?

This is exactly how timestamps in QMP events are done, so we thought
we'd just follow suit.

> > The mgmt app already knows when it send the QMP command and knows
> > when it gets the QMP reply.  This covers the time the QMP was
> > queued before processing (might be large if QMP is blocked on
> > another slow command) , the processing time, and the time any
> > reply was queued before sending (ought to be small).
> >
> > So IIUC, the value these fields add is that they let the mgmt
> > app extract only the command processing time, eliminating
> > any variance due to queueing before/after.

Right, this is precisely the motivation.
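
For concreteness, the two fields combine the same way as for QMP event
timestamps; a minimal sketch of the arithmetic a management application
would do (hypothetical helper, not part of the patch):

#include <stdint.h>

/* net command processing time, in microseconds, from the proposed
 * "start"/"end" {seconds, microseconds} pairs */
static int64_t qmp_processing_time_us(int64_t start_s, int64_t start_us,
                                      int64_t end_s, int64_t end_us)
{
    return (end_s - start_s) * 1000000 + (end_us - start_us);
}

For the query-status example above that yields 831032 - 831012 = 20
microseconds of actual processing, independent of how long the command
sat queued before or after.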

Thanks,
Roman.



Re: [PATCH] virtio: add VIRTQUEUE_ERROR QAPI event

2022-09-20 Thread Roman Kagan
On Tue, Sep 20, 2022 at 06:10:08PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 9/20/22 17:47, Markus Armbruster wrote:
> > Vladimir Sementsov-Ogievskiy  writes:
> > 
> > > For now we only log the vhost device error, when virtqueue is actually
> > > stopped. Let's add a QAPI event, which makes possible:
> > > 
> > >   - collect statistics of such errors
> > >   - make immediate actions: take coredums or do some other debugging

+ inform the user through a management API or UI, so that (s)he can
  react somehow, e.g. reset the device driver in the guest or even build
  up some automation to do so

Note that basically every inconsistency discovered during virtqueue
processing results in a silent virtqueue stop.  The guest then just sees
the requests getting stuck somewhere in the device for no visible
reason.  This event provides a means to inform the management layer of
this situation in a timely fashion.

> > 
> > Core dumps, I presume.
> > 
> > Is QMP the right tool for the job?  Or could a trace point do?
> 
> Management tool already can collect QMP events. So, if we want to
> forward some QMP events to other subsystems (to immediately inform
> support team, or to update some statistics) it's simple to realize for
> QMP events. But I'm not sure how to do it for trace-events.. Scanning
> trace logs is not convenient.

Right.  Trace points are a debugging tool: when you expect the problem
to reproduce, you activate them and watch the logs.  On the contrary,
QMP events can trigger some logic in the management layer and provide
for some recovery action.

> > > +##
> > > +# @VIRTQUEUE_ERROR:
> > > +#
> > > +# Emitted when a device virtqueue fails in runtime.
> > > +#
> > > +# @device: the device's ID if it has one
> > > +# @path: the device's QOM path
> > > +# @virtqueue: virtqueue index
> > > +# @error: error identifier
> > > +# @description: human readable description
> > > +#
> > > +# Since: 7.2
> > > +##
> > > +{ 'event': 'VIRTQUEUE_ERROR',
> > > + 'data': { '*device': 'str', 'path': 'str', 'virtqueue': 'int',
> > > +'error': 'VirtqueueError', 'description': 'str'} }
> > 
> > Can the guest trigger the event?
> 
> Yes, but as I understand, only once per virtqueue.

Right, in the sense that every relevant dataplane implementation would
stop the virtqueue on such an error, so in order to trigger a new one
the driver would need to reset the device first.  I guess rate-limiting
is unnecessary here.
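
To illustrate where such an event would fire, here is an illustrative
sketch with an assumed emitter name; it is not the actual patch, only
virtio_error()/virtio_get_queue_index() are existing QEMU helpers:

#include "qemu/osdep.h"
#include "hw/virtio/virtio.h"

/* hypothetical wrapper around the QAPI-generated emitter for the
 * proposed VIRTQUEUE_ERROR event */
static void report_virtqueue_error(VirtIODevice *vdev, int vq_idx,
                                   const char *desc)
{
    /* would call the generated qapi_event_send_*() function here */
}

static bool check_desc_index(VirtIODevice *vdev, VirtQueue *vq,
                             unsigned int idx, unsigned int max)
{
    if (idx >= max) {
        /* existing behaviour: log and mark the device broken, so the
         * queue silently stops being serviced */
        virtio_error(vdev, "virtqueue %d: descriptor index %u out of range",
                     virtio_get_queue_index(vq), idx);
        /* proposed addition: tell the management layer as well */
        report_virtqueue_error(vdev, virtio_get_queue_index(vq),
                               "descriptor index out of range");
        return false;
    }
    return true;
}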

Thanks,
Roman.



Re: [PATCH v3] hw/pci/pci_bridge: ensure PCIe slots have only one slot

2022-07-21 Thread Roman Kagan
On Thu, Jul 21, 2022 at 05:05:38PM +0100, Mark Cave-Ayland wrote:
> On 21/07/2022 16:56, Daniel P. Berrangé wrote:
> 
> > On Thu, Jul 21, 2022 at 04:51:51PM +0100, Mark Cave-Ayland wrote:
> > > On 21/07/2022 15:28, Roman Kagan wrote:
> > > 
> > > (lots cut)
> > > 
> > > > In the guest (Fedora 34):
> > > > 
> > > > [root@test ~]# lspci -tv
> > > > -[:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM 
> > > > Controller
> > > >  +-01.0  Device 1234:
> > > >  +-02.0  Red Hat, Inc. QEMU XHCI Host Controller
> > > >  +-05.0-[01]00.0  Red Hat, Inc. Virtio block device
> > > >  +-05.1-[02]00.0  Red Hat, Inc. Virtio network device
> > > >  +-05.2-[03]--
> > > >  +-05.3-[04]--
> > > >  +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface 
> > > > Controller
> > > >  \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus 
> > > > Controller
> > > > 
> > > > Changing addr of the second disk from 4 to 0 makes it appear in the
> > > > guest.
> > > > 
> > > > What exactly do you find odd?
> > > 
> > > Thanks for this, the part I wasn't sure about was whether the device ids 
> > > in
> > > the command line matched the primary PCI bus or the secondary PCI bus.
> > > 
> > > In that case I suspect that the enumeration of non-zero PCIe devices fails
> > > in Linux because of the logic here:
> > > https://github.com/torvalds/linux/blob/master/drivers/pci/probe.c#L2622.
> > 
> > Just above that though is logic that handles 'pci=pcie_scan_all'
> > kernel parameter, to make it look for non-zero devices.
> > 
> > > I don't have a copy of the PCIe specification, but assuming the comment is
> > > true then your patch looks correct to me. I think it would be worth 
> > > adding a
> > > similar comment and reference to your patch to explain why the logic is
> > > required, which should also help the PCI maintainers during review.
> > 
> > The docs above with the pci=pcie_scan_all suggest it is unusual but not
> > forbidden.
> 
> That's interesting as I read it completely the other way around, i.e. PCIe
> downstream ports should only have device 0 and the PCI_SCAN_ALL_PCIE_DEVS
> flag is there for broken/exotic hardware :)

Me too :)

The commit that introduced it suggested the same:

commit 284f5f9dbac170b054c1e386ef92cbf654e91bba
Author: Bjorn Helgaas 
Date:   Mon Apr 30 15:21:02 2012 -0600

PCI: work around Stratus ftServer broken PCIe hierarchy

A PCIe downstream port is a P2P bridge.  Its secondary interface is
a link that should lead only to device 0 (unless ARI is enabled)[1], so
we don't probe for non-zero device numbers.

Some Stratus ftServer systems have a PCIe downstream port (02:00.0) that
leads to both an upstream port (03:00.0) and a downstream port (03:01.0),
and 03:01.0 has important devices below it:

  [:02]-+-00.0-[03-3c]--+-00.0-[04-09]--...
\-01.0-[0a-0d]--+-[USB]
+-[NIC]
+-...

Previously, we didn't enumerate device 03:01.0, so USB and the network
didn't work.  This patch adds a DMI quirk to scan all device numbers,
not just 0, below a downstream port.

Based on a patch by Prarit Bhargava.

[1] PCIe spec r3.0, sec 7.3.1

CC: Myron Stowe 
CC: Don Dutile 
CC: James Paradis 
CC: Matthew Wilcox 
CC: Jesse Barnes 
CC: Prarit Bhargava 
Signed-off-by: Bjorn Helgaas 

> Perhaps if someone has a copy of the PCIe specification they can check the
> wording in section 7.3.1 to see exactly what the correct behaviour should
> be?

I don't, sorry

Thanks,
Roman.



Re: [PATCH v3] hw/pci/pci_bridge: ensure PCIe slots have only one slot

2022-07-21 Thread Roman Kagan
On Wed, Jul 20, 2022 at 02:21:38PM +0100, Mark Cave-Ayland wrote:
> On 20/07/2022 12:00, Roman Kagan wrote:
> 
> > On Wed, Jul 20, 2022 at 11:44:26AM +0100, Daniel P. Berrangé wrote:
> > > On Wed, Jul 20, 2022 at 01:25:55PM +0300, Roman Kagan wrote:
> > > > It's possible to create non-working configurations by attaching a device
> > > > to a derivative of PCIe slot (pcie-root-port, ioh3420, etc) and
> > > > specifying a slot number other than zero, e.g.:
> > > > 
> > > >  -device pcie-root-port,id=s0,... \
> > > >  -device virtio-blk-pci,bus=s0,addr=4,...
> > > > 
> > > > Make QEMU reject such configurations and only allow addr=0 on the
> > > > secondary bus of a PCIe slot.
> > > 
> > > What do you mean by 'non-working' in this case.  The guest OS boots
> > > OK, but I indeed don't see the device in the guest, but IIUC it was
> > > said that was just because Linux doesn't scan for a non-zero slot.
> > 
> > Right.  I don't remember if it was Linux or firmware or both but indeed
> > at least Linux guests don't see devices if attached to a PCIe slot at
> > addr != 0.  (Which is kinda natural for a thing called "slot", isn't it?)
> > 
> > > That wouldn't be a broken config from QEMU's POV though, merely a
> > > guest OS limitation ?
> > 
> > Strictly speaking it wouldn't, indeed.  But we had created such a
> > configuration (due to a bug in our management layer) and spent
> > non-negligible time trying to figure out why the attached device didn't
> > appear in the guest.  So I thought it made sense to reject a
> > configuration which is known to confuse guests.  Doesn't it?
> 
> This does seem a bit odd. What does the output of "info qtree" look like for
> your non-working configuration?

Sure:

command line:

# qemu-system-x86_64 \
-name test,debug-threads=on \
-msg timestamp=on \
-machine q35,sata=false,usb=off \
-accel kvm \
-cpu Haswell-noTSX \
-smp 4,sockets=1,cores=4,threads=1 \
-vga std \
-m 4096M \
-object memory-backend-memfd,id=mem0,size=4096M,share=on \
-numa node,memdev=mem0,cpus=0-3 \
-nodefaults \
-no-user-config \
-chardev stdio,id=mon0 \
-mon chardev=mon0,mode=readline \
-boot strict=on \
-device vmcoreinfo \
-device pvpanic \
-device qemu-xhci,id=usb0 \
-device usb-tablet,bus=usb0.0 \
-chardev socket,path=serial0.sock,logfile=serial0.log,id=charserial0,reconnect=1 \
-device isa-serial,chardev=charserial0,id=serial0 \
-chardev file,path=bios0.log,id=debugcon \
-device isa-debugcon,iobase=0x402,chardev=debugcon \
-device pcie-root-port,id=s0,slot=0,bus=pcie.0,addr=05.0,multifunction=on,io-reserve=0 \
-device pcie-root-port,id=s1,slot=1,bus=pcie.0,addr=05.1,multifunction=on,io-reserve=0 \
-device pcie-root-port,id=s2,slot=2,bus=pcie.0,addr=05.2,multifunction=on,io-reserve=0 \
-device pcie-root-port,id=s3,slot=3,bus=pcie.0,addr=05.3,multifunction=on,io-reserve=0 \
-drive format=qcow2,file=f34.qcow2,id=hdd0,if=none,aio=native,cache=directsync,discard=unmap \
-netdev user,id=netdev0,hostfwd=tcp::0-:22 \
-device virtio-net-pci,disable-legacy=off,netdev=netdev0,id=net0,bus=s1 \
-object iothread,id=iot0 \
-device virtio-blk-pci,disable-legacy=off,scsi=off,rerror=report,werror=report,id=vblk0,drive=hdd0,bus=s0,iothread=iot0,bootindex=1 \
-object iothread,id=iot2 \
-drive driver=null-co,size=64M,id=hdd2,if=none \
-device virtio-blk-pci,disable-legacy=off,scsi=off,rerror=report,werror=report,id=vblk2,drive=hdd2,bus=s2,iothread=iot2,num-queues=4,addr=4 \
-nographic


qemu HMP:

(qemu) info qtree
bus: main-system-bus
  type System
  dev: ps2-mouse, id ""
gpio-out "" 1
  dev: ps2-kbd, id ""
gpio-out "" 1
  dev: hpet, id ""
gpio-in "" 2
gpio-out "" 1
gpio-out "sysbus-irq" 32
timers = 3 (0x3)
msi = false
hpet-intcap = 16711940 (0xff0104)
hpet-offset-saved = true
mmio fed0/0400
  dev: kvm-ioapic, id ""
gpio-in "" 24
gsi_base = 0 (0x0)
mmio fec0/1000
  dev: q35-pcihost, id ""
MCFG = 2952790016 (0xb0000000)
pci-hole64-size = 34359738368 (32 GiB)
short_root_bus = 0 (0x0)
below-4g-mem-size = 2147483648 (2 GiB)
above-4g-mem-size = 2147483648 (2 GiB)
x-pci-hole64-fix = true
x-config-reg-migration-enabled = true
bypass-iommu = false
bus: pcie.0
  type PCIE
  dev: pcie-root-port, id "s3"
x-migrate-msix = true
bus-reserve = 4294967295 (0xffffffff)
io-reserve = 0 (0 B)
mem-reserve = 18446744073

Re: [PATCH v3] hw/pci/pci_bridge: ensure PCIe slots have only one slot

2022-07-20 Thread Roman Kagan
On Wed, Jul 20, 2022 at 12:04:58PM +0100, Daniel P. Berrangé wrote:
> On Wed, Jul 20, 2022 at 02:00:16PM +0300, Roman Kagan wrote:
> > On Wed, Jul 20, 2022 at 11:44:26AM +0100, Daniel P. Berrangé wrote:
> > > On Wed, Jul 20, 2022 at 01:25:55PM +0300, Roman Kagan wrote:
> > > > It's possible to create non-working configurations by attaching a device
> > > > to a derivative of PCIe slot (pcie-root-port, ioh3420, etc) and
> > > > specifying a slot number other than zero, e.g.:
> > > 
> > > What do you mean by 'non-working' in this case.  The guest OS boots
> > > OK, but I indeed don't see the device in the guest, but IIUC it was
> > > said that was just because Linux doesn't scan for a non-zero slot.
> > 
> > Right.  I don't remember if it was Linux or firmware or both but indeed
> > at least Linux guests don't see devices if attached to a PCIe slot at
> > addr != 0.  (Which is kinda natural for a thing called "slot", isn't it?)
> 
> If a configuration is a permissible per the hardware design / spec, then
> QEMU should generally allow it.  We don't want to constrain host side
> configs based on the current limitations of guest OS whose behaviour can
> change over time, or where a different guest OS may have a different POV.

OK I'll try to find out if PCIe slot devices may allow addresses
different from zero.  If anybody can advise on this that would be much
appreciated.

Thanks,
Roman.



Re: [PATCH v3] hw/pci/pci_bridge: ensure PCIe slots have only one slot

2022-07-20 Thread Roman Kagan
On Wed, Jul 20, 2022 at 11:44:26AM +0100, Daniel P. Berrangé wrote:
> On Wed, Jul 20, 2022 at 01:25:55PM +0300, Roman Kagan wrote:
> > It's possible to create non-working configurations by attaching a device
> > to a derivative of PCIe slot (pcie-root-port, ioh3420, etc) and
> > specifying a slot number other than zero, e.g.:
> > 
> > -device pcie-root-port,id=s0,... \
> > -device virtio-blk-pci,bus=s0,addr=4,...
> > 
> > Make QEMU reject such configurations and only allow addr=0 on the
> > secondary bus of a PCIe slot.
> 
> What do you mean by 'non-working' in this case.  The guest OS boots
> OK, but I indeed don't see the device in the guest, but IIUC it was
> said that was just because Linux doesn't scan for a non-zero slot.

Right.  I don't remember if it was Linux or firmware or both but indeed
at least Linux guests don't see devices if attached to a PCIe slot at
addr != 0.  (Which is kinda natural for a thing called "slot", isn't it?)

> That wouldn't be a broken config from QEMU's POV though, merely a
> guest OS limitation ?

Strictly speaking it wouldn't, indeed.  But we had created such a
configuration (due to a bug in our management layer) and spent
non-negligible time trying to figure out why the attached device didn't
appear in the guest.  So I thought it made sense to reject a
configuration which is known to confuse guests.  Doesn't it?

Thanks,
Roman.



[PATCH v3] hw/pci/pci_bridge: ensure PCIe slots have only one slot

2022-07-20 Thread Roman Kagan
It's possible to create non-working configurations by attaching a device
to a derivative of PCIe slot (pcie-root-port, ioh3420, etc) and
specifying a slot number other than zero, e.g.:

-device pcie-root-port,id=s0,... \
-device virtio-blk-pci,bus=s0,addr=4,...

Make QEMU reject such configurations and only allow addr=0 on the
secondary bus of a PCIe slot.

To verify this new behavior, add two basic qtests for the PCIe bridges
that may be affected by the change: pcie-root-port and x3130.  For the
former, two testcases are included, one positive for slot #0 and one
negative for (arbitrary) slot #4; for the latter, only a positive
testcase for slot #4 is included.

Signed-off-by: Roman Kagan 
---
v2 -> v3:
- do not use qtest-single stuff [Thomas]

v1 -> v2:
- use object_dynamic_cast (without assert) [Vladimir]
- add explaining comment [Michael]
- add tests

 hw/pci/pci_bridge.c   |  6 +++
 tests/qtest/pcie-root-port-test.c | 75 +++
 tests/qtest/xio3130-test.c| 52 +
 tests/qtest/meson.build   |  2 +
 4 files changed, 135 insertions(+)
 create mode 100644 tests/qtest/pcie-root-port-test.c
 create mode 100644 tests/qtest/xio3130-test.c

diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
index da34c8ebcd..23e1701d06 100644
--- a/hw/pci/pci_bridge.c
+++ b/hw/pci/pci_bridge.c
@@ -33,6 +33,7 @@
 #include "qemu/units.h"
 #include "hw/pci/pci_bridge.h"
 #include "hw/pci/pci_bus.h"
+#include "hw/pci/pcie_port.h"
 #include "qemu/module.h"
 #include "qemu/range.h"
 #include "qapi/error.h"
@@ -386,6 +387,11 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
 br->windows = pci_bridge_region_init(br);
 QLIST_INIT(&sec_bus->child);
 QLIST_INSERT_HEAD(&br->child, sec_bus, sibling);
+
+/* PCIe slot derivatives are bridges with a single slot; enforce that */
+if (object_dynamic_cast(OBJECT(dev), TYPE_PCIE_SLOT)) {
+sec_bus->slot_reserved_mask = ~1u;
+}
 }
 
 /* default qdev clean up function for PCI-to-PCI bridge */
diff --git a/tests/qtest/pcie-root-port-test.c b/tests/qtest/pcie-root-port-test.c
new file mode 100644
index 00..c462f03fda
--- /dev/null
+++ b/tests/qtest/pcie-root-port-test.c
@@ -0,0 +1,75 @@
+/*
+ * QTest testcase for generic PCIe root port
+ *
+ * Copyright (c) 2022 Yandex N.V.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest.h"
+
+/*
+ * Let QEMU choose the bus and slot for the device under test.  It may even be
+ * a non-PCIe bus but it's ok for the purpose of the test.
+ */
+static const char *common_args = "-device pcie-root-port,id=s0"
+ ",port=1,chassis=1,multifunction=on";
+
+static void test_slot0(void)
+{
+QTestState *qts;
+QDict *resp;
+
+/* attach a PCIe device into slot0 of the root port */
+qts = qtest_init(common_args);
+/* PCIe root port is known to be supported, use it as a leaf device too */
+resp = qtest_qmp(qts, "{'execute': 'device_add', 'arguments': {"
+ "'driver': 'pcie-root-port', "
+ "'id': 'port1', "
+ "'bus': 's0', "
+ "'chassis': 5, "
+ "'addr': '0'"
+ "} }");
+g_assert_nonnull(resp);
+g_assert(!qdict_haskey(resp, "event"));
+g_assert(!qdict_haskey(resp, "error"));
+qobject_unref(resp);
+
+qtest_quit(qts);
+}
+
+static void test_slot4(void)
+{
+QTestState *qts;
+QDict *resp;
+
+/* attaching a PCIe device into slot 4 of the root port should be rejected */
+qts = qtest_init(common_args);
+/* PCIe root port is known to be supported, use it as a leaf device too */
+resp = qtest_qmp(qts, "{'execute': 'device_add', 'arguments': {"
+ "'driver': 'pcie-root-port', "
+ "'id': 'port1', "
+ "'bus': 's0', "
+ "'chassis': 5, "
+ "'addr': '4'"
+ "} }");
+qmp_expect_error_and_unref(resp, "GenericError");
+
+qtest_quit(qts);
+}
+
+int main(int argc, char **argv)
+{
+int ret;
+
+g_test_init(&argc, &argv, NULL);
+
+qtest_add_func("/pcie-root-port/slot0", test_slot0);
+qtest_add_func("/pcie-root-port/slot4", test_slot4);
+
+ret = g_test_run();
+
+return ret;
+}
diff --git a/tests/qtest/xio3130-test.c b/tests/qtest/xio3130-test.c
new file mode 100644
index 00..8306da4aea
--- /dev/null
+++ b/tests/qtest/xio3130-test.c
@@ -0,0 +1,52 @@
+/*
+ * QTest testcase for 

Re: [PATCH v2] hw/pci/pci_bridge: ensure PCIe slots have only one slot

2022-07-20 Thread Roman Kagan
On Tue, Jul 19, 2022 at 12:42:47PM +0200, Thomas Huth wrote:
> On 19/07/2022 10.01, Roman Kagan wrote:
> > +#include "qemu/osdep.h"
> > +#include "libqtest-single.h"
> 
> Do you really need libqtest-single.h here? libqtest.h should be enough,
> shouldn't it?

Indeed, will replace with libqtest.h

> > +
> > +qtest_end();
> 
> Please drop the qtest_end() - you're already using qtest_quit in the
> individual tests.

ok

Thanks,
Roman.



[PATCH v2] hw/pci/pci_bridge: ensure PCIe slots have only one slot

2022-07-19 Thread Roman Kagan
It's possible to create non-working configurations by attaching a device
to a derivative of PCIe slot (pcie-root-port, ioh3420, etc) and
specifying a slot number other than zero, e.g.:

-device pcie-root-port,id=s0,... \
-device virtio-blk-pci,bus=s0,addr=4,...

Make QEMU reject such configurations and only allow addr=0 on the
secondary bus of a PCIe slot.

To verify this new behavior, add two basic qtests for the PCIe bridges
that may be affected by the change: pcie-root-port and x3130.  For the
former, two testcases are included, one positive for slot #0 and one
negative for (arbitrary) slot #4; for the latter, only a positive
testcase for slot #4 is included.

Signed-off-by: Roman Kagan 
---
v1 -> v2:
- use object_dynamic_cast (without assert) [Vladimir]
- add explaining comment [Michael]
- add tests
(I've only had a chance to run tests against x86; hope I didn't mess
them up for other arches)

 hw/pci/pci_bridge.c   |  6 +++
 tests/qtest/pcie-root-port-test.c | 77 +++
 tests/qtest/xio3130-test.c| 54 ++
 tests/qtest/meson.build   |  2 +
 4 files changed, 139 insertions(+)
 create mode 100644 tests/qtest/pcie-root-port-test.c
 create mode 100644 tests/qtest/xio3130-test.c

diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
index da34c8ebcd..23e1701d06 100644
--- a/hw/pci/pci_bridge.c
+++ b/hw/pci/pci_bridge.c
@@ -33,6 +33,7 @@
 #include "qemu/units.h"
 #include "hw/pci/pci_bridge.h"
 #include "hw/pci/pci_bus.h"
+#include "hw/pci/pcie_port.h"
 #include "qemu/module.h"
 #include "qemu/range.h"
 #include "qapi/error.h"
@@ -386,6 +387,11 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
 br->windows = pci_bridge_region_init(br);
 QLIST_INIT(&sec_bus->child);
 QLIST_INSERT_HEAD(&br->child, sec_bus, sibling);
+
+/* PCIe slot derivatives are bridges with a single slot; enforce that */
+if (object_dynamic_cast(OBJECT(dev), TYPE_PCIE_SLOT)) {
+sec_bus->slot_reserved_mask = ~1u;
+}
 }
 
 /* default qdev clean up function for PCI-to-PCI bridge */
diff --git a/tests/qtest/pcie-root-port-test.c b/tests/qtest/pcie-root-port-test.c
new file mode 100644
index 00..a1f3d84d75
--- /dev/null
+++ b/tests/qtest/pcie-root-port-test.c
@@ -0,0 +1,77 @@
+/*
+ * QTest testcase for generic PCIe root port
+ *
+ * Copyright (c) 2022 Yandex N.V.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest-single.h"
+
+/*
+ * Let QEMU choose the bus and slot for the device under test.  It may even be
+ * a non-PCIe bus but it's ok for the purpose of the test.
+ */
+static const char *common_args = "-device pcie-root-port,id=s0"
+ ",port=1,chassis=1,multifunction=on";
+
+static void test_slot0(void)
+{
+QTestState *qts;
+QDict *resp;
+
+/* attach a PCIe device into slot0 of the root port */
+qts = qtest_init(common_args);
+/* PCIe root port is known to be supported, use it as a leaf device too */
+resp = qtest_qmp(qts, "{'execute': 'device_add', 'arguments': {"
+ "'driver': 'pcie-root-port', "
+ "'id': 'port1', "
+ "'bus': 's0', "
+ "'chassis': 5, "
+ "'addr': '0'"
+ "} }");
+g_assert_nonnull(resp);
+g_assert(!qdict_haskey(resp, "event"));
+g_assert(!qdict_haskey(resp, "error"));
+qobject_unref(resp);
+
+qtest_quit(qts);
+}
+
+static void test_slot4(void)
+{
+QTestState *qts;
+QDict *resp;
+
+/* attaching a PCIe device into slot 4 of the root port should be rejected */
+qts = qtest_init(common_args);
+/* PCIe root port is known to be supported, use it as a leaf device too */
+resp = qtest_qmp(qts, "{'execute': 'device_add', 'arguments': {"
+ "'driver': 'pcie-root-port', "
+ "'id': 'port1', "
+ "'bus': 's0', "
+ "'chassis': 5, "
+ "'addr': '4'"
+ "} }");
+qmp_expect_error_and_unref(resp, "GenericError");
+
+qtest_quit(qts);
+}
+
+int main(int argc, char **argv)
+{
+int ret;
+
+g_test_init(&argc, &argv, NULL);
+
+qtest_add_func("/pcie-root-port/slot0", test_slot0);
+qtest_add_func("/pcie-root-port/slot4", test_slot4);
+
+ret = g_test_run();
+
+qtest_end();
+
+return ret;
+}
diff --git a/tests/qtest/xio3130-test.c b/tests/qtest/xio3130-test.c
new file mode 100644
index 00..3a937889b9
--- /dev/null
+++ b/te

Re: [PATCH] hw/pci/pci_bridge: ensure PCIe slots have only one slot

2022-07-07 Thread Roman Kagan
On Thu, Jul 07, 2022 at 01:19:18AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 06, 2022 at 10:43:12PM +0300, Roman Kagan wrote:
> > On Wed, Jul 06, 2022 at 09:38:39PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > On 7/4/22 13:25, Roman Kagan wrote:
> > > > It's possible to create non-working configurations by attaching a device
> > > > to a derivative of PCIe slot (pcie-root-port, ioh3420, etc) and
> > > > specifying a slot number other than zero, e.g.:
> > > > 
> > > >  -device pcie-root-port,id=s0,... \
> > > >  -device virtio-blk-pci,bus=s0,addr=4,...
> > > > 
> > > > Make QEMU reject such configurations and only allow addr=0 on the
> > > > secondary bus of a PCIe slot.
> > > > 
> > > > Signed-off-by: Roman Kagan 
> > > > ---
> > > >   hw/pci/pci_bridge.c | 5 +
> > > >   1 file changed, 5 insertions(+)
> > > > 
> > > > diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
> > > > index da34c8ebcd..8b38d5ad3d 100644
> > > > --- a/hw/pci/pci_bridge.c
> > > > +++ b/hw/pci/pci_bridge.c
> > > > @@ -33,6 +33,7 @@
> > > >   #include "qemu/units.h"
> > > >   #include "hw/pci/pci_bridge.h"
> > > >   #include "hw/pci/pci_bus.h"
> > > > +#include "hw/pci/pcie_port.h"
> > > >   #include "qemu/module.h"
> > > >   #include "qemu/range.h"
> > > >   #include "qapi/error.h"
> > > > @@ -386,6 +387,10 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
> > > >   br->windows = pci_bridge_region_init(br);
> > > >   QLIST_INIT(&sec_bus->child);
> > > >   QLIST_INSERT_HEAD(&br->child, sec_bus, sibling);
> > > > +
> > > > +if (PCIE_SLOT(dev)) {
> > > 
> > > Hmm, wouldn't PCIE_SLOT just crash if dev is not pcie slot? As I 
> > > understand, PCIE_SLOT is finally an OBJECT_CHECK(), which says:
> > > 
> > >  * If an invalid object is passed to this function, a run time assert 
> > > will be
> > >  * generated.
> > 
> > Well, the assertion is there only if configured with
> > --enable-qom-cast-debug which is off by default, that's why it even
> > passed make check.  As it stands, it's just a typecast which is a no-op
> > here, and basically it makes every bridge have only a single slot, which
> > is wrong of course.
> > 
> > Will rework, thanks!
> > Roman.
> 
> Which probably means it was not actually tested that the patch
> rejects the invalid configuration, was it?

Yes it was.  What wasn't tested was that other PCI bridges remained
unaffected.  In the default configuration (--enable-qom-cast-debug=no)
the patch turns every bridge using pci_bridge_initfn into single-slot
bridges.  This renders e.g. switches like x3130 useless, but the
testsuite doesn't trigger that path.

I'll try and add a test for this in the next iteration.

Thanks,
Roman.



Re: [PATCH] hw/pci/pci_bridge: ensure PCIe slots have only one slot

2022-07-06 Thread Roman Kagan
On Wed, Jul 06, 2022 at 09:38:39PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 7/4/22 13:25, Roman Kagan wrote:
> > It's possible to create non-working configurations by attaching a device
> > to a derivative of PCIe slot (pcie-root-port, ioh3420, etc) and
> > specifying a slot number other than zero, e.g.:
> > 
> >  -device pcie-root-port,id=s0,... \
> >  -device virtio-blk-pci,bus=s0,addr=4,...
> > 
> > Make QEMU reject such configurations and only allow addr=0 on the
> > secondary bus of a PCIe slot.
> > 
> > Signed-off-by: Roman Kagan 
> > ---
> >   hw/pci/pci_bridge.c | 5 +
> >   1 file changed, 5 insertions(+)
> > 
> > diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
> > index da34c8ebcd..8b38d5ad3d 100644
> > --- a/hw/pci/pci_bridge.c
> > +++ b/hw/pci/pci_bridge.c
> > @@ -33,6 +33,7 @@
> >   #include "qemu/units.h"
> >   #include "hw/pci/pci_bridge.h"
> >   #include "hw/pci/pci_bus.h"
> > +#include "hw/pci/pcie_port.h"
> >   #include "qemu/module.h"
> >   #include "qemu/range.h"
> >   #include "qapi/error.h"
> > @@ -386,6 +387,10 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
> >   br->windows = pci_bridge_region_init(br);
> >   QLIST_INIT(&sec_bus->child);
> >   QLIST_INSERT_HEAD(&br->child, sec_bus, sibling);
> > +
> > +if (PCIE_SLOT(dev)) {
> 
> Hmm, wouldn't PCIE_SLOT just crash if dev is not pcie slot? As I understand, 
> PCIE_SLOT is finally an OBJECT_CHECK(), which says:
> 
>  * If an invalid object is passed to this function, a run time assert will be
>  * generated.

Well, the assertion is there only if configured with
--enable-qom-cast-debug which is off by default, that's why it even
passed make check.  As it stands, it's just a typecast which is a no-op
here, and basically it makes every bridge have only a single slot, which
is wrong of course.
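
To spell out the difference (condensed from the two patch versions in
this thread, comments added for illustration only):

/* v1: PCIE_SLOT() is an OBJECT_CHECK() cast.  Without
 * --enable-qom-cast-debug it degenerates into a plain pointer cast, so
 * this is just a non-NULL check that is true for every bridge -- they
 * all get slot_reserved_mask = ~1u, i.e. only device 0 usable on the
 * secondary bus. */
if (PCIE_SLOT(dev)) {
    sec_bus->slot_reserved_mask = ~1u;
}

/* v2+: object_dynamic_cast() really checks the QOM type at run time and
 * returns NULL for bridges that are not PCIe slots, so only true slots
 * get the single-slot restriction. */
if (object_dynamic_cast(OBJECT(dev), TYPE_PCIE_SLOT)) {
    sec_bus->slot_reserved_mask = ~1u;
}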

Will rework, thanks!
Roman.



[PATCH] hw/pci/pci_bridge: ensure PCIe slots have only one slot

2022-07-04 Thread Roman Kagan
It's possible to create non-working configurations by attaching a device
to a derivative of PCIe slot (pcie-root-port, ioh3420, etc) and
specifying a slot number other than zero, e.g.:

-device pcie-root-port,id=s0,... \
-device virtio-blk-pci,bus=s0,addr=4,...

Make QEMU reject such configurations and only allow addr=0 on the
secondary bus of a PCIe slot.

Signed-off-by: Roman Kagan 
---
 hw/pci/pci_bridge.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
index da34c8ebcd..8b38d5ad3d 100644
--- a/hw/pci/pci_bridge.c
+++ b/hw/pci/pci_bridge.c
@@ -33,6 +33,7 @@
 #include "qemu/units.h"
 #include "hw/pci/pci_bridge.h"
 #include "hw/pci/pci_bus.h"
+#include "hw/pci/pcie_port.h"
 #include "qemu/module.h"
 #include "qemu/range.h"
 #include "qapi/error.h"
@@ -386,6 +387,10 @@ void pci_bridge_initfn(PCIDevice *dev, const char *typename)
 br->windows = pci_bridge_region_init(br);
 QLIST_INIT(&sec_bus->child);
 QLIST_INSERT_HEAD(&br->child, sec_bus, sibling);
+
+if (PCIE_SLOT(dev)) {
+sec_bus->slot_reserved_mask = ~1u;
+}
 }
 
 /* default qdev clean up function for PCI-to-PCI bridge */
-- 
2.36.1




Re: [PATCH v2 2/2] vhost: setup error eventfd and dump errors

2022-06-24 Thread Roman Kagan
On Thu, Jun 23, 2022 at 07:13:25PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> From: Konstantin Khlebnikov 
> 
> Vhost has error notifications, let's log them like other errors.
> For each virt-queue setup eventfd for vring error notifications.
> 
> Signed-off-by: Konstantin Khlebnikov 
> [vsementsov: rename patch, change commit message  and dump error like
>  other errors in the file]
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  hw/virtio/vhost.c | 37 +
>  include/hw/virtio/vhost.h |  1 +
>  2 files changed, 38 insertions(+)

Reviewed-by: Roman Kagan 



Re: [PATCH v2 1/2] vhost: add method vhost_set_vring_err

2022-06-24 Thread Roman Kagan
On Thu, Jun 23, 2022 at 07:13:24PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> From: Konstantin Khlebnikov 
> 
> Kernel and user vhost may report virtqueue errors via eventfd.
> This is only reliable way to get notification about protocol error.
^^^
the

> 
> Signed-off-by: Konstantin Khlebnikov 

As long as you pick this series over from Konstantin you need to append
your s-o-b.

Other than that,

Reviewed-by: Roman Kagan 



Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event

2022-06-21 Thread Roman Kagan
On Tue, Jun 21, 2022 at 01:55:25PM +0200, Markus Armbruster wrote:
> Roman Kagan  writes:
> 
> > On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote:
> >> On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
> >> > Roman Kagan  writes:
> >> > 
> >> > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> >> > >> Konstantin Khlebnikov  writes:
> >> > >> 
> >> > >> > This event represents device runtime errors to give time and
> >> > >> > reason why device is broken.
> >> > >> 
> >> >> Can you give one or more examples of the "device runtime errors" you 
> >> > >> have
> >> > >> in mind?
> >> > >
> >> > > Initially we wanted to address a situation when a vhost device
> >> > > discovered an inconsistency during virtqueue processing and silently
> >> > > stopped the virtqueue.  This resulted in device stall (partial for
> >> > > multiqueue devices) and we were the last to notice that.
> >> > >
> >> > > The solution appeared to be to employ errfd and, upon receiving a
> >> > > notification through it, to emit a QMP event which is actionable in the
> >> > > management layer or further up the stack.
> >> > >
> >> > > Then we observed that virtio (non-vhost) devices suffer from the same
> >> > > issue: they only log the error but don't signal it to the management
> >> > > layer.  The case was very similar so we thought it would make sense to
> >> > > share the infrastructure and the QMP event between virtio and vhost.
> >> > >
> >> > > Then Konstantin went a bit further and generalized the concept into
> >> > > generic "device runtime error".  I'm personally not completely 
> >> > > convinced
> >> > > this generalization is appropriate here; we'd appreciate the opinions
> >> > > from the community on the matter.
> >> > 
> >> > "Device emulation sending an even on entering certain error states, so
> >> > that a management application can do something about it" feels
> >> > reasonable enough to me as a general concept.
> >> > 
> >> > The key point is of course "can do something": the event needs to be
> >> > actionable.  Can you describe possible actions for the cases you
> >> > implement?
> >> 
> >> The first one that we had in mind was informational, like triggering an
> >> alert in the monitoring system and/or painting the VM as malfunctioning
> >> in the owner's UI.
> >> 
> >> There can be more advanced scenarios like autorecovery by resetting the
> >> faulty VM, or fencing it if it's a cluster member.
> >
> > The discussion kind of stalled here.
> 
> My apologies...
> 
> >   Do you think the approach makes
> > sense or not?  Should we try and resubmit the series with a proper cover
> > letter and possibly other improvements or is it a dead end?
> 
> As QAPI schema maintainer, my concern is interface design.  To sell this
> interface to me (so to speak), you have to show it's useful and
> reasonably general.  Reasonably general, because we don't want to
> accumulate one-offs, even if they have their uses.
> 
> I think this is mostly a matter of commit message(s) and documentation
> here.  Explain your intended use cases.  Maybe hand-wave at other use
> cases you can think of.  Document that you're implementing the event
> only for the specific errors you need, but that it could be implemented
> more widely as needed.  "Complete" feels impractical, though.
> 
> Makes sense?

Absolutely.  We'll rework and resubmit the series addressing the issues
you've noted, and we'll see how it goes.

Thanks,
Roman.



Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event

2022-06-20 Thread Roman Kagan
On Mon, May 30, 2022 at 06:04:32PM +0300, Roman Kagan wrote:
> On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
> > Roman Kagan  writes:
> > 
> > > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> > >> Konstantin Khlebnikov  writes:
> > >> 
> > >> > This event represents device runtime errors to give time and
> > >> > reason why device is broken.
> > >> 
> > >> Can you give one or more examples of the "device runtime errors" you have
> > >> in mind?
> > >
> > > Initially we wanted to address a situation when a vhost device
> > > discovered an inconsistency during virtqueue processing and silently
> > > stopped the virtqueue.  This resulted in device stall (partial for
> > > multiqueue devices) and we were the last to notice that.
> > >
> > > The solution appeared to be to employ errfd and, upon receiving a
> > > notification through it, to emit a QMP event which is actionable in the
> > > management layer or further up the stack.
> > >
> > > Then we observed that virtio (non-vhost) devices suffer from the same
> > > issue: they only log the error but don't signal it to the management
> > > layer.  The case was very similar so we thought it would make sense to
> > > share the infrastructure and the QMP event between virtio and vhost.
> > >
> > > Then Konstantin went a bit further and generalized the concept into
> > > generic "device runtime error".  I'm personally not completely convinced
> > > this generalization is appropriate here; we'd appreciate the opinions
> > > from the community on the matter.
> > 
> > "Device emulation sending an even on entering certain error states, so
> > that a management application can do something about it" feels
> > reasonable enough to me as a general concept.
> > 
> > The key point is of course "can do something": the event needs to be
> > actionable.  Can you describe possible actions for the cases you
> > implement?
> 
> The first one that we had in mind was informational, like triggering an
> alert in the monitoring system and/or painting the VM as malfunctioning
> in the owner's UI.
> 
> There can be more advanced scenarios like autorecovery by resetting the
> faulty VM, or fencing it if it's a cluster member.

The discussion kind of stalled here.  Do you think the approach makes
sense or not?  Should we try and resubmit the series with a proper cover
letter and possibly other improvements or is it a dead end?

Thanks,
Roman.



Re: [PATCH] aio_wait_kick: add missing memory barrier

2022-06-04 Thread Roman Kagan
On Tue, May 24, 2022 at 01:30:54PM -0400, Emanuele Giuseppe Esposito wrote:
> It seems that aio_wait_kick always required a memory barrier
> or atomic operation in the caller, but nobody actually
> took care of doing it.
> 
> Let's put the barrier in the function instead, and pair it
> with another one in AIO_WAIT_WHILE. Read aio_wait_kick()
> comment for further explanation.
> 
> Suggested-by: Paolo Bonzini 
> Signed-off-by: Emanuele Giuseppe Esposito 
> ---
>  include/block/aio-wait.h |  2 ++
>  util/aio-wait.c  | 16 +++-
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
> index b39eefb38d..54840f8622 100644
> --- a/include/block/aio-wait.h
> +++ b/include/block/aio-wait.h
> @@ -81,6 +81,8 @@ extern AioWait global_aio_wait;
>  AioContext *ctx_ = (ctx);  \
>  /* Increment wait_->num_waiters before evaluating cond. */ \
>  qatomic_inc(&wait_->num_waiters);  \
> +/* Paired with smp_mb in aio_wait_kick(). */   \
> +smp_mb();  \

IIRC qatomic_inc() ensures sequential consistency, isn't it enough here?

>  if (ctx_ && in_aio_context_home_thread(ctx_)) {\
>  while ((cond)) {   \
>  aio_poll(ctx_, true);  \

Roman.



Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event

2022-05-30 Thread Roman Kagan
On Mon, May 30, 2022 at 01:28:17PM +0200, Markus Armbruster wrote:
> Roman Kagan  writes:
> 
> > On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> >> Konstantin Khlebnikov  writes:
> >> 
> >> > This event represents device runtime errors to give time and
> >> > reason why device is broken.
> >> 
> >> Can you give one or more examples of the "device runtime errors" you have
> >> in mind?
> >
> > Initially we wanted to address a situation when a vhost device
> > discovered an inconsistency during virtqueue processing and silently
> > stopped the virtqueue.  This resulted in device stall (partial for
> > multiqueue devices) and we were the last to notice that.
> >
> > The solution appeared to be to employ errfd and, upon receiving a
> > notification through it, to emit a QMP event which is actionable in the
> > management layer or further up the stack.
> >
> > Then we observed that virtio (non-vhost) devices suffer from the same
> > issue: they only log the error but don't signal it to the management
> > layer.  The case was very similar so we thought it would make sense to
> > share the infrastructure and the QMP event between virtio and vhost.
> >
> > Then Konstantin went a bit further and generalized the concept into
> > generic "device runtime error".  I'm personally not completely convinced
> > this generalization is appropriate here; we'd appreciate the opinions
> > from the community on the matter.
> 
> "Device emulation sending an even on entering certain error states, so
> that a management application can do something about it" feels
> reasonable enough to me as a general concept.
> 
> The key point is of course "can do something": the event needs to be
> actionable.  Can you describe possible actions for the cases you
> implement?

The first one that we had in mind was informational, like triggering an
alert in the monitoring system and/or painting the VM as malfunctioning
in the owner's UI.

There can be more advanced scenarios like autorecovery by resetting the
faulty VM, or fencing it if it's a cluster member.

Thanks,
Roman.



Re: [PATCH 1/4] qdev: add DEVICE_RUNTIME_ERROR event

2022-05-27 Thread Roman Kagan
On Wed, May 25, 2022 at 12:54:47PM +0200, Markus Armbruster wrote:
> Konstantin Khlebnikov  writes:
> 
> > This event represents device runtime errors to give time and
> > reason why device is broken.
> 
> Can you give one or more examples of the "device runtime errors" you have
> in mind?

Initially we wanted to address a situation when a vhost device
discovered an inconsistency during virtqueue processing and silently
stopped the virtqueue.  This resulted in device stall (partial for
multiqueue devices) and we were the last to notice that.

The solution appeared to be to employ errfd and, upon receiving a
notification through it, to emit a QMP event which is actionable in the
management layer or further up the stack.

Then we observed that virtio (non-vhost) devices suffer from the same
issue: they only log the error but don't signal it to the management
layer.  The case was very similar so we thought it would make sense to
share the infrastructure and the QMP event between virtio and vhost.

Then Konstantin went a bit further and generalized the concept into
generic "device runtime error".  I'm personally not completely convinced
this generalization is appropriate here; we'd appreciate the opinions
from the community on the matter.

HTH,
Roman.



Re: [PATCH 00/10] vhost: stick to -errno error return convention

2021-11-29 Thread Roman Kagan
On Sun, Nov 28, 2021 at 04:47:20PM -0500, Michael S. Tsirkin wrote:
> On Thu, Nov 11, 2021 at 06:33:44PM +0300, Roman Kagan wrote:
> > Error propagation between the generic vhost code and the specific backends 
> > is
> > not quite consistent: some places follow "return -1 and set errno" 
> > convention,
> > while others assume "return negated errno".  Furthermore, not enough care is
> > taken not to clobber errno.
> > 
> > As a result, on certain code paths the errno resulting from a failure may 
> > get
> > overridden by another function call, and then that zero errno indicating
> > success is propagated up the stack, leading to failures being lost.  In
> > particular, we've seen errors in the communication with a vhost-user-blk 
> > slave
> > not trigger an immediate connection drop and reconnection, leaving it in a
> > broken state.
> > 
> > Rework error propagation to always return negated errno on errors and
> > correctly pass it up the stack.
> 
> Hi Roman,
> if there are bugfixes here I'll be happy to take them right now.
> The wholesale rework seems inappropriate for 6.2, I'll be
> happy to tag it for after 6.2. Pls ping me aftre release to help
> make sure it's not lost.

All these patches are bugfixes in one way or another.  That said, none
of the problems being addressed are recent regressions.  OTOH the
patches introduce non-zero churn and change behavior on some error
paths, so I'd suggest to postpone the whole series till after 6.2 is
out.

Thanks,
Roman.
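
To make the quoted convention concrete, a minimal illustrative sketch
(placeholder names, not actual QEMU code) of the two styles and why the
first one loses errors:

#include <errno.h>
#include <sys/ioctl.h>

#define SOME_REQUEST 0                  /* placeholder ioctl request */

static void trace_backend_op_failed(void)
{
    /* placeholder; real tracing/logging may itself change errno */
}

/* "return -1 and set errno": fragile -- anything that runs between the
 * failing call and the caller reading errno can clobber it, turning the
 * failure into an apparent success further up the stack. */
static int backend_op_old(int fd)
{
    if (ioctl(fd, SOME_REQUEST) < 0) {
        trace_backend_op_failed();      /* may reset errno */
        return -1;                      /* caller may then see errno == 0 */
    }
    return 0;
}

/* "return negated errno": the error value is captured immediately and
 * travels up the stack in the return value, immune to clobbering. */
static int backend_op_new(int fd)
{
    if (ioctl(fd, SOME_REQUEST) < 0) {
        int r = -errno;                 /* capture before anything else runs */
        trace_backend_op_failed();
        return r;
    }
    return 0;
}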



Re: [PATCH v1] hw/i386/amd_iommu: clean up broken event logging

2021-11-17 Thread Roman Kagan
On Wed, Nov 17, 2021 at 11:13:27PM +0500, Valentin Sinitsyn wrote:
> On 17.11.2021 19:46, Daniil Tatianin wrote:
> > -/*
> > - * AMDVi event structure
> > - *0:15   -> DeviceID
> > - *55:63  -> event type + miscellaneous info
> > - *63:127 -> related address
> Did you mean 64:127? Fields shouldn't overlap, and 65-bit address looks
> suspicious.

These lines are being removed by this patch.  And yes, they were wrong
WRT the spec.

Roman.



Re: [PATCH v4 0/1] hw/hyperv/vmbus: Is it maintained?

2021-11-12 Thread Roman Kagan
On Fri, Nov 12, 2021 at 09:32:31PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Add Den and Roman (his new address)

Thanks, I missed it on the list indeed.

> 06.11.2021 16:41, Philippe Mathieu-Daudé wrote:
> > This is the 4th time I send this patch. Is the VMBus infrastructure
> > used / maintained? Should we deprecate & remove?

I think it's fair to say it's not maintained.  The whole
hw/hyperv/vmbus.c was submitted as a part of the work by Jon to enable
some obscure windows debugging feature which only worked in presence of
VMBus.  It was mostly taken from the respective branch of the (now
effectively abandoned) downstream tree with an implementation of the
core VMBus infrastructure and the devices using it; however, none of the
actual VMBus devices ever made it into the mainline tree.

> > 
> >$ ./scripts/get_maintainer.pl -f hw/hyperv/vmbus.c -f 
> > include/hw/hyperv/vmbus.h
> >get_maintainer.pl: No maintainers found
> > 
> > Philippe Mathieu-Daudé (1):
> >hw/hyperv/vmbus: Remove unused vmbus_load/save_req()
> > 
> >   include/hw/hyperv/vmbus.h |  3 --
> >   hw/hyperv/vmbus.c | 59 ---
> >   2 files changed, 62 deletions(-)

This seems to basically be the revert of 4dd8a7064b "vmbus: add
infrastructure to save/load vmbus requests"; it was originally meant to
be submitted with the code that would use it, vmbus scsi controller, but
that never happened.  I believe it's safe to remove without affecting
Jon's work, but I'd rather check with him.

Thanks,
Roman.



Re: [PATCH 01/10] vhost-user-blk: reconnect on any error during realize

2021-11-12 Thread Roman Kagan
On Fri, Nov 12, 2021 at 12:37:59PM +0100, Kevin Wolf wrote:
> Am 12.11.2021 um 08:39 hat Roman Kagan geschrieben:
> > On Thu, Nov 11, 2021 at 06:52:30PM +0100, Kevin Wolf wrote:
> > > Am 11.11.2021 um 16:33 hat Roman Kagan geschrieben:
> > > > vhost-user-blk realize only attempts to reconnect if the previous
> > > > connection attempt failed on "a problem with the connection and not an
> > > > error related to the content (which would fail again the same way in the
> > > > next attempt)".
> > > > 
> > > > However this distinction is very subtle, and may be inadvertently broken
> > > > if the code changes somewhere deep down the stack and a new error gets
> > > > propagated up to here.
> > > > 
> > > > OTOH now that the number of reconnection attempts is limited it seems
> > > > harmless to try reconnecting on any error.
> > > > 
> > > > So relax the condition of whether to retry connecting to check for any
> > > > error.
> > > > 
> > > > This patch amends a527e312b5 "vhost-user-blk: Implement reconnection
> > > > during realize".
> > > > 
> > > > Signed-off-by: Roman Kagan 
> > > 
> > > It results in less than perfect error messages. With a modified export
> > > that just crashes qemu-storage-daemon during get_features, I get:
> > > 
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to read 
> > > msg header. Read 0 instead of 12. Original request 1.
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting 
> > > after error: vhost_backend_init failed: Protocol error
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting 
> > > after error: Failed to connect to '/tmp/vsock': Connection refused
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting 
> > > after error: Failed to connect to '/tmp/vsock': Connection refused
> > > qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to 
> > > connect to '/tmp/vsock': Connection refused
> > 
> > This patch doesn't change any error messages.  Which ones specifically
> > became less than perfect as a result of this patch?
> 
> But it adds error messages (for each retry), which are different from
> the first error message. As I said this is not the end of the world, but
> maybe a bit more confusing.

Ah, now I see what you mean: it adds reconnection attempts where there
used to be immediate failure return, so now every failed attempt logs
its own message.

> > > I guess this might be tolerable. On the other hand, the patch doesn't
> > > really fix anything either, but just gets rid of possible subtleties.
> > 
> > The remaining patches in the series make other errors beside -EPROTO
> > propagate up to this point, and some (most) of them are retryable.  This
> > was the reason to include this patch at the beginning of the series (I
> > guess I should've mentioned that in the patch log).
> 
> I see. I hadn't looked at the rest of the series yet because I ran out
> of time, but now that I'm skimming them, I see quite a few places that
> use non-EPROTO, but I wonder which of them actually should be
> reconnected. So far all I saw were presumably persistent errors where a
> retry won't help. Can you give me some examples?

E.g. the particular case you mention earlier, -ECONNREFUSED, is not
unlikely to happen due to the vhost-user server restart for maintenance;
in this case retrying looks like a reasonable thing to do, doesn't it?
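
For illustration, with the retry count bounded, the policy under
discussion boils down to roughly the following (assumed names and
constant, not the actual vhost-user-blk code):

/* retry realize-time connection on any failure, not only "connection"
 * errors such as -EPROTO; the bounded count keeps this from looping
 * forever on genuinely persistent errors */
int retries = REALIZE_CONNECTION_RETRIES;   /* assumed constant */
Error *err = NULL;
int ret;

do {
    if (err) {
        error_report_err(err);          /* log the failed attempt */
        err = NULL;
    }
    ret = realize_connect(s, &err);     /* assumed helper */
} while (ret < 0 && retries--);

if (ret < 0) {
    error_propagate(errp, err);         /* give up after the last attempt */
}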

Thanks,
Roman.



Re: [PATCH 04/10] chardev/char-fe: don't allow EAGAIN from blocking read

2021-11-12 Thread Roman Kagan
On Fri, Nov 12, 2021 at 12:24:06PM +0400, Marc-André Lureau wrote:
> Hi
> 
> On Thu, Nov 11, 2021 at 7:44 PM Roman Kagan  wrote:
> 
> > As its name suggests, ChardevClass.chr_sync_read is supposed to do a
> > blocking read.  The only implementation of it, tcp_chr_sync_read, does
> > set the underlying io channel to the blocking mode indeed.
> >
> > Therefore a failure return with EAGAIN is not expected from this call.
> >
> > So do not retry it in qemu_chr_fe_read_all; instead place an assertion
> > that it doesn't fail with EAGAIN.
> >
> 
> The code was introduced in :
> commit 7b0bfdf52d694c9a3a96505aa42ce3f8d63acd35
> Author: Nikolay Nikolaev 
> Date:   Tue May 27 15:03:48 2014 +0300
> 
> Add chardev API qemu_chr_fe_read_all

Right, but at that point chr_sync_read wasn't made to block.  It
happened later in

commit bcdeb9be566ded2eb35233aaccf38742a21e5daa
Author: Marc-André Lureau 
Date:   Thu Jul 6 19:03:53 2017 +0200

chardev: block during sync read

A sync read should block until all requested data is
available (instead of retrying in qemu_chr_fe_read_all). Change the
channel to blocking during sync_read.

> > @@ -68,13 +68,10 @@ int qemu_chr_fe_read_all(CharBackend *be, uint8_t
> > *buf, int len)
> >  }
> >
> >  while (offset < len) {
> > -retry:
> >  res = CHARDEV_GET_CLASS(s)->chr_sync_read(s, buf + offset,
> >len - offset);
> > -if (res == -1 && errno == EAGAIN) {
> > -g_usleep(100);
> > -goto retry;
> > -}
> > +/* ->chr_sync_read should block */
> > +assert(!(res < 0 && errno == EAGAIN));
> >
> >
> While I agree with the rationale to clean this code a bit, I am not so sure
> about replacing it with an assert(). In the past, when we did such things
> we had unexpected regressions :)

Valid point, qemu may be run against some OS where a blocking call may
sporadically return -EAGAIN, and it would be hard to reliably catch this
with testing.

> A slightly better approach perhaps is g_warn_if_fail(), although it's not
> very popular in qemu.

I think the first thing to decide is whether -EAGAIN from a blocking
call isn't broken enough, and justifies (unlimited) retries.  I'm
tempted to just remove any special handling of -EAGAIN and treat it as
any other error, leaving up to the caller to handle (most probably to
fail the call and initiate a recovery, if possible).

Does this make sense?

Thanks,
Roman.



Re: [PATCH 05/10] vhost-backend: avoid overflow on memslots_limit

2021-11-12 Thread Roman Kagan
On Fri, Nov 12, 2021 at 09:56:17AM +, Daniel P. Berrangé wrote:
> On Fri, Nov 12, 2021 at 10:46:46AM +0300, Roman Kagan wrote:
> > On Thu, Nov 11, 2021 at 06:59:43PM +0100, Philippe Mathieu-Daudé wrote:
> > > On 11/11/21 16:33, Roman Kagan wrote:
> > > > Fix the (hypothetical) potential problem when the value parsed out of
> > > > the vhost module parameter in sysfs overflows the return value from
> > > > vhost_kernel_memslots_limit.
> > > > 
> > > > Signed-off-by: Roman Kagan 
> > > > ---
> > > >  hw/virtio/vhost-backend.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
> > > > index b65f8f7e97..44f7dbb243 100644
> > > > --- a/hw/virtio/vhost-backend.c
> > > > +++ b/hw/virtio/vhost-backend.c
> > > > @@ -58,7 +58,7 @@ static int vhost_kernel_memslots_limit(struct vhost_dev *dev)
> > > >  if (g_file_get_contents("/sys/module/vhost/parameters/max_mem_regions",
> > > >  &s, NULL, NULL)) {
> > > >  uint64_t val = g_ascii_strtoull(s, NULL, 10);
> > > 
> > > Would using qemu_strtou64() simplify this?
> > 
> > I'm afraid not.  None of the existing strtoXX converting functions has
> > the desired output range (0 < retval < INT_MAX), so the following
> > condition will remain necessary anyway; then it doesn't seem to matter
> > which particular parser is used to extract the value which is in the
> > range, so I left the one that was already there to reduce churn.
> 
> If  qemu_strtou64() can't handle all values in (0 < retval < INT_MAX)
> isn't that a bug in qemu_strtou64 ?

I must have been unclear.  It sure can handle all values in this range;
the point is that the range check after it would still be needed, so
switching from g_ascii_strtoull to qemu_strtoXX saves nothing, therefore
I left it as it was.
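
For concreteness, a sketch of the parse with qemu_strtou64() (assuming
the helper from util/cutils.c); the explicit range check is the part
that has to stay either way:

uint64_t val;

if (qemu_strtou64(s, NULL, 10, &val) == 0 && val > 0 && val < INT_MAX) {
    g_free(s);
    return val;
}

The parser choice only changes how the string is converted; the
(0, INT_MAX) window still has to be enforced by hand, which is why the
patch keeps g_ascii_strtoull() and only fixes the check.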

Thanks,
Roman.



Re: [PATCH 00/10] vhost: stick to -errno error return convention

2021-11-12 Thread Roman Kagan
On Thu, Nov 11, 2021 at 03:14:56PM -0500, Michael S. Tsirkin wrote:
> On Thu, Nov 11, 2021 at 06:33:44PM +0300, Roman Kagan wrote:
> > Error propagation between the generic vhost code and the specific backends 
> > is
> > not quite consistent: some places follow "return -1 and set errno" 
> > convention,
> > while others assume "return negated errno".  Furthermore, not enough care is
> > taken not to clobber errno.
> > 
> > As a result, on certain code paths the errno resulting from a failure may 
> > get
> > overridden by another function call, and then that zero errno indicating
> > success is propagated up the stack, leading to failures being lost.  In
> > particular, we've seen errors in the communication with a vhost-user-blk 
> > slave
> > not trigger an immediate connection drop and reconnection, leaving it in a
> > broken state.
> > 
> > Rework error propagation to always return negated errno on errors and
> > correctly pass it up the stack.
> 
> Looks like something we want post release. I'll tag it
> but pls ping me after the release to help make sure
> it's not lost.

It doesn't introduce new features so I guess it might qualify for rc0,
but the churn is somewhat too big indeed.

OK I'll reiterate once 6.2 is out; meanwhile if anyone has spare cycles
to review it, it'll be much appreciated.

Thanks,
Roman.

> 
> 
> > Roman Kagan (10):
> >   vhost-user-blk: reconnect on any error during realize
> >   chardev/char-socket: tcp_chr_recv: don't clobber errno
> >   chardev/char-socket: tcp_chr_sync_read: don't clobber errno
> >   chardev/char-fe: don't allow EAGAIN from blocking read
> >   vhost-backend: avoid overflow on memslots_limit
> >   vhost-backend: stick to -errno error return convention
> >   vhost-vdpa: stick to -errno error return convention
> >   vhost-user: stick to -errno error return convention
> >   vhost: stick to -errno error return convention
> >   vhost-user-blk: propagate error return from generic vhost
> > 
> >  chardev/char-fe.c |   7 +-
> >  chardev/char-socket.c |  17 +-
> >  hw/block/vhost-user-blk.c |   4 +-
> >  hw/virtio/vhost-backend.c |   4 +-
> >  hw/virtio/vhost-user.c| 401 +-
> >  hw/virtio/vhost-vdpa.c|  37 ++--
> >  hw/virtio/vhost.c |  98 +-
> >  7 files changed, 307 insertions(+), 261 deletions(-)
> > 
> > -- 
> > 2.33.1
> > 
> 



Re: [PATCH 05/10] vhost-backend: avoid overflow on memslots_limit

2021-11-11 Thread Roman Kagan
On Thu, Nov 11, 2021 at 06:59:43PM +0100, Philippe Mathieu-Daudé wrote:
> On 11/11/21 16:33, Roman Kagan wrote:
> > Fix the (hypothetical) potential problem when the value parsed out of
> > the vhost module parameter in sysfs overflows the return value from
> > vhost_kernel_memslots_limit.
> > 
> > Signed-off-by: Roman Kagan 
> > ---
> >  hw/virtio/vhost-backend.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
> > index b65f8f7e97..44f7dbb243 100644
> > --- a/hw/virtio/vhost-backend.c
> > +++ b/hw/virtio/vhost-backend.c
> > @@ -58,7 +58,7 @@ static int vhost_kernel_memslots_limit(struct vhost_dev 
> > *dev)
> >  if (g_file_get_contents("/sys/module/vhost/parameters/max_mem_regions",
> >  &s, NULL, NULL)) {
> >  uint64_t val = g_ascii_strtoull(s, NULL, 10);
> 
> Would using qemu_strtou64() simplify this?

I'm afraid not.  None of the existing strtoXX converting functions has
the desired output range (0 < retval < INT_MAX), so the range check below
would remain necessary anyway.  Given that, it doesn't seem to matter
which particular parser is used to extract the value, so I left the one
that was already there to reduce churn.

> 
> > -if (!((val == G_MAXUINT64 || !val) && errno)) {
> > +if (val < INT_MAX && val > 0) {
> >  g_free(s);
> >  return val;
> >  }

Thanks,
Roman.



Re: [PATCH 01/10] vhost-user-blk: reconnect on any error during realize

2021-11-11 Thread Roman Kagan
On Thu, Nov 11, 2021 at 06:52:30PM +0100, Kevin Wolf wrote:
> Am 11.11.2021 um 16:33 hat Roman Kagan geschrieben:
> > vhost-user-blk realize only attempts to reconnect if the previous
> > connection attempt failed on "a problem with the connection and not an
> > error related to the content (which would fail again the same way in the
> > next attempt)".
> > 
> > However this distinction is very subtle, and may be inadvertently broken
> > if the code changes somewhere deep down the stack and a new error gets
> > propagated up to here.
> > 
> > OTOH now that the number of reconnection attempts is limited it seems
> > harmless to try reconnecting on any error.
> > 
> > So relax the condition of whether to retry connecting to check for any
> > error.
> > 
> > This patch amends a527e312b5 "vhost-user-blk: Implement reconnection
> > during realize".
> > 
> > Signed-off-by: Roman Kagan 
> 
> It results in less than perfect error messages. With a modified export
> that just crashes qemu-storage-daemon during get_features, I get:
> 
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to read msg 
> header. Read 0 instead of 12. Original request 1.
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after 
> error: vhost_backend_init failed: Protocol error
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after 
> error: Failed to connect to '/tmp/vsock': Connection refused
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Reconnecting after 
> error: Failed to connect to '/tmp/vsock': Connection refused
> qemu-system-x86_64: -device vhost-user-blk-pci,chardev=c: Failed to connect 
> to '/tmp/vsock': Connection refused

This patch doesn't change any error messages.  Which ones specifically
became less than perfect as a result of this patch?

> I guess this might be tolerable. On the other hand, the patch doesn't
> really fix anything either, but just gets rid of possible subtleties.

The remaining patches in the series make other errors besides -EPROTO
propagate up to this point, and some (most) of them are retryable.  This
was the reason to include this patch at the beginning of the series (I
guess I should've mentioned that in the patch log).
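
In other words, with the rest of the series applied the realize loop is
meant to behave as (just a sketch of the intent, not the literal code):

    do {
        ret = vhost_user_blk_realize_connect(s, errp);
        /*
         * ret may now be any negative errno coming from the connection
         * path, and most of those are worth another attempt, not only
         * -EPROTO.
         */
    } while (ret < 0 && retries--);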

Thanks,
Roman.



[PATCH 09/10] vhost: stick to -errno error return convention

2021-11-11 Thread Roman Kagan
The generic vhost code expects that many of the VhostOps methods in the
respective backends set errno on errors.  However, none of the existing
backends actually bothers to do so.  In a number of those methods errno
from the failed call is clobbered by successful later calls to some
library functions; on a few code paths the generic vhost code then
negates and returns that errno, thus making failures look like successes
to the caller.

As a result, in certain scenarios (e.g. live migration) the device
doesn't notice the first failure and goes on through its state
transitions as if everything is ok, instead of taking recovery actions
(break and reestablish the vhost-user connection, cancel migration, etc)
before it's too late.

To fix this, consolidate on the convention to return negated errno on
failures throughout generic vhost, and use it for error propagation.

Signed-off-by: Roman Kagan 
---
 hw/virtio/vhost.c | 98 ++-
 1 file changed, 45 insertions(+), 53 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 437347ad01..4f20d4a714 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -33,11 +33,13 @@
 #define _VHOST_DEBUG 1
 
 #ifdef _VHOST_DEBUG
-#define VHOST_OPS_DEBUG(fmt, ...) \
-do { error_report(fmt ": %s (%d)", ## __VA_ARGS__, \
-  strerror(errno), errno); } while (0)
+#define VHOST_OPS_DEBUG(retval, fmt, ...) \
+do { \
+error_report(fmt ": %s (%d)", ## __VA_ARGS__, \
+ strerror(-retval), -retval); \
+} while (0)
 #else
-#define VHOST_OPS_DEBUG(fmt, ...) \
+#define VHOST_OPS_DEBUG(retval, fmt, ...) \
 do { } while (0)
 #endif
 
@@ -297,7 +299,7 @@ static inline void vhost_dev_log_resize(struct vhost_dev 
*dev, uint64_t size)
releasing the current log, to ensure no logging is lost */
 r = dev->vhost_ops->vhost_set_log_base(dev, log_base, log);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_set_log_base failed");
+VHOST_OPS_DEBUG(r, "vhost_set_log_base failed");
 }
 
 vhost_log_put(dev, true);
@@ -550,7 +552,7 @@ static void vhost_commit(MemoryListener *listener)
 if (!dev->log_enabled) {
 r = dev->vhost_ops->vhost_set_mem_table(dev, dev->mem);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_set_mem_table failed");
+VHOST_OPS_DEBUG(r, "vhost_set_mem_table failed");
 }
 goto out;
 }
@@ -564,7 +566,7 @@ static void vhost_commit(MemoryListener *listener)
 }
 r = dev->vhost_ops->vhost_set_mem_table(dev, dev->mem);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_set_mem_table failed");
+VHOST_OPS_DEBUG(r, "vhost_set_mem_table failed");
 }
 /* To log less, can only decrease log size after table update. */
 if (dev->log_size > log_size + VHOST_LOG_BUFFER) {
@@ -803,8 +805,8 @@ static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
 if (dev->vhost_ops->vhost_vq_get_addr) {
 r = dev->vhost_ops->vhost_vq_get_addr(dev, &addr, vq);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_vq_get_addr failed");
-return -errno;
+VHOST_OPS_DEBUG(r, "vhost_vq_get_addr failed");
+return r;
 }
 } else {
 addr.desc_user_addr = (uint64_t)(unsigned long)vq->desc;
@@ -816,10 +818,9 @@ static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
 addr.flags = enable_log ? (1 << VHOST_VRING_F_LOG) : 0;
 r = dev->vhost_ops->vhost_set_vring_addr(dev, &addr);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_set_vring_addr failed");
-return -errno;
+VHOST_OPS_DEBUG(r, "vhost_set_vring_addr failed");
 }
-return 0;
+return r;
 }
 
 static int vhost_dev_set_features(struct vhost_dev *dev,
@@ -840,19 +841,19 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
 }
 r = dev->vhost_ops->vhost_set_features(dev, features);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_set_features failed");
+VHOST_OPS_DEBUG(r, "vhost_set_features failed");
 goto out;
 }
 if (dev->vhost_ops->vhost_set_backend_cap) {
 r = dev->vhost_ops->vhost_set_backend_cap(dev);
 if (r < 0) {
-VHOST_OPS_DEBUG("vhost_set_backend_cap failed");
+VHOST_OPS_DEBUG(r, "vhost_set_backend_cap failed");
 goto out;
 }
 }
 
 out:
-return r < 0 ? -errno : 0;
+return r;
 }
 
 static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
@@ -999,22 +1000,17 @@ static int 
vhost_virtqueue_set_vring_endian_legacy(struct vhost_dev *dev,
bool is_big_endian,
int vhost_vq

[PATCH 04/10] chardev/char-fe: don't allow EAGAIN from blocking read

2021-11-11 Thread Roman Kagan
As its name suggests, ChardevClass.chr_sync_read is supposed to do a
blocking read.  The only implementation of it, tcp_chr_sync_read, does
set the underlying io channel to the blocking mode indeed.

Therefore a failure return with EAGAIN is not expected from this call.

So do not retry it in qemu_chr_fe_read_all; instead place an assertion
that it doesn't fail with EAGAIN.

Signed-off-by: Roman Kagan 
---
 chardev/char-fe.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/chardev/char-fe.c b/chardev/char-fe.c
index 7789f7be9c..f94efe928e 100644
--- a/chardev/char-fe.c
+++ b/chardev/char-fe.c
@@ -68,13 +68,10 @@ int qemu_chr_fe_read_all(CharBackend *be, uint8_t *buf, int 
len)
 }
 
 while (offset < len) {
-retry:
 res = CHARDEV_GET_CLASS(s)->chr_sync_read(s, buf + offset,
   len - offset);
-if (res == -1 && errno == EAGAIN) {
-g_usleep(100);
-goto retry;
-}
+/* ->chr_sync_read should block */
+assert(!(res < 0 && errno == EAGAIN));
 
 if (res == 0) {
 break;
-- 
2.33.1




[PATCH 07/10] vhost-vdpa: stick to -errno error return convention

2021-11-11 Thread Roman Kagan
Almost all VhostOps methods in vdpa_ops follow the convention of
returning negated errno on error.

Adjust the few that don't.  To that end, rework vhost_vdpa_add_status to
check if setting of the requested status bits has succeeded and return
the respective error code if it hasn't, and propagate the error codes
wherever it's appropriate.

Signed-off-by: Roman Kagan 
---
 hw/virtio/vhost-vdpa.c | 37 +++--
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 0d8051426c..a3b885902a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -292,18 +292,34 @@ static int vhost_vdpa_call(struct vhost_dev *dev, 
unsigned long int request,
 return ret < 0 ? -errno : ret;
 }
 
-static void vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
+static int vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
 {
 uint8_t s;
+int ret;
 
 trace_vhost_vdpa_add_status(dev, status);
-if (vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &s)) {
-return;
+ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &s);
+if (ret < 0) {
+return ret;
 }
 
 s |= status;
 
-vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
+ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
+if (ret < 0) {
+return ret;
+}
+
+ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &s);
+if (ret < 0) {
+return ret;
+}
+
+if (!(s & status)) {
+return -EIO;
+}
+
+return 0;
 }
 
 static void vhost_vdpa_get_iova_range(struct vhost_vdpa *v)
@@ -484,7 +500,7 @@ static int vhost_vdpa_set_mem_table(struct vhost_dev *dev,
 }
 }
 if (mem->padding) {
-return -1;
+return -EINVAL;
 }
 
 return 0;
@@ -501,14 +517,11 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
 
 trace_vhost_vdpa_set_features(dev, features);
 ret = vhost_vdpa_call(dev, VHOST_SET_FEATURES, &features);
-uint8_t status = 0;
 if (ret) {
 return ret;
 }
-vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
-vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
 
-return !(status & VIRTIO_CONFIG_S_FEATURES_OK);
+return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
 }
 
 static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
@@ -650,12 +663,8 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, 
bool started)
 }
 
 if (started) {
-uint8_t status = 0;
 memory_listener_register(&v->listener, &address_space_memory);
-vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
-vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
-
-return !(status & VIRTIO_CONFIG_S_DRIVER_OK);
+return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
 } else {
 vhost_vdpa_reset_device(dev);
 vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
-- 
2.33.1




[PATCH 10/10] vhost-user-blk: propagate error return from generic vhost

2021-11-11 Thread Roman Kagan
Fix the only callsite that doesn't propagate the error code from the
generic vhost code.

Signed-off-by: Roman Kagan 
---
 hw/block/vhost-user-blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index f9b17f6813..ab11ce8252 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -100,7 +100,7 @@ static int vhost_user_blk_handle_config_change(struct 
vhost_dev *dev)
 &local_err);
 if (ret < 0) {
 error_report_err(local_err);
-return -1;
+return ret;
 }
 
 /* valid for resize only */
-- 
2.33.1




[PATCH 08/10] vhost-user: stick to -errno error return convention

2021-11-11 Thread Roman Kagan
VhostOps methods in user_ops are not very consistent in their error
returns: some return negated errno while others just -1.

Make sure all of them consistently return negated errno.  This also
helps error propagation from the functions being called inside.
Besides, this synchronizes the error return convention with the other
two vhost backends, kernel and vdpa, and will therefore allow for
consistent error propagation in the generic vhost code (in a followup
patch).

Signed-off-by: Roman Kagan 
---
 hw/virtio/vhost-user.c | 401 +++--
 1 file changed, 223 insertions(+), 178 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index bf6e50223c..662853513e 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -280,9 +280,10 @@ static int vhost_user_read_header(struct vhost_dev *dev, 
VhostUserMsg *msg)
 
 r = qemu_chr_fe_read_all(chr, p, size);
 if (r != size) {
+int saved_errno = errno;
 error_report("Failed to read msg header. Read %d instead of %d."
  " Original request %d.", r, size, msg->hdr.request);
-return -1;
+return r < 0 ? -saved_errno : -EIO;
 }
 
 /* validate received flags */
@@ -290,7 +291,7 @@ static int vhost_user_read_header(struct vhost_dev *dev, 
VhostUserMsg *msg)
 error_report("Failed to read msg header."
 " Flags 0x%x instead of 0x%x.", msg->hdr.flags,
 VHOST_USER_REPLY_MASK | VHOST_USER_VERSION);
-return -1;
+return -EPROTO;
 }
 
 return 0;
@@ -314,8 +315,9 @@ static gboolean vhost_user_read_cb(void *do_not_use, 
GIOCondition condition,
 uint8_t *p = (uint8_t *) msg;
 int r, size;
 
-if (vhost_user_read_header(dev, msg) < 0) {
-data->ret = -1;
+r = vhost_user_read_header(dev, msg);
+if (r < 0) {
+data->ret = r;
 goto end;
 }
 
@@ -324,7 +326,7 @@ static gboolean vhost_user_read_cb(void *do_not_use, 
GIOCondition condition,
 error_report("Failed to read msg header."
 " Size %d exceeds the maximum %zu.", msg->hdr.size,
 VHOST_USER_PAYLOAD_SIZE);
-data->ret = -1;
+data->ret = -EPROTO;
 goto end;
 }
 
@@ -333,9 +335,10 @@ static gboolean vhost_user_read_cb(void *do_not_use, 
GIOCondition condition,
 size = msg->hdr.size;
 r = qemu_chr_fe_read_all(chr, p, size);
 if (r != size) {
+int saved_errno = errno;
 error_report("Failed to read msg payload."
  " Read %d instead of %d.", r, msg->hdr.size);
-data->ret = -1;
+data->ret = r < 0 ? -saved_errno : -EIO;
 goto end;
 }
 }
@@ -418,24 +421,26 @@ static int vhost_user_read(struct vhost_dev *dev, 
VhostUserMsg *msg)
 static int process_message_reply(struct vhost_dev *dev,
  const VhostUserMsg *msg)
 {
+int ret;
 VhostUserMsg msg_reply;
 
 if ((msg->hdr.flags & VHOST_USER_NEED_REPLY_MASK) == 0) {
 return 0;
 }
 
-if (vhost_user_read(dev, &msg_reply) < 0) {
-return -1;
+ret = vhost_user_read(dev, &msg_reply);
+if (ret < 0) {
+return ret;
 }
 
 if (msg_reply.hdr.request != msg->hdr.request) {
 error_report("Received unexpected msg type. "
  "Expected %d received %d",
  msg->hdr.request, msg_reply.hdr.request);
-return -1;
+return -EPROTO;
 }
 
-return msg_reply.payload.u64 ? -1 : 0;
+return msg_reply.payload.u64 ? -EIO : 0;
 }
 
 static bool vhost_user_one_time_request(VhostUserRequest request)
@@ -472,14 +477,15 @@ static int vhost_user_write(struct vhost_dev *dev, 
VhostUserMsg *msg,
 
 if (qemu_chr_fe_set_msgfds(chr, fds, fd_num) < 0) {
 error_report("Failed to set msg fds.");
-return -1;
+return -EINVAL;
 }
 
 ret = qemu_chr_fe_write_all(chr, (const uint8_t *) msg, size);
 if (ret != size) {
+int saved_errno = errno;
 error_report("Failed to write msg."
  " Wrote %d instead of %d.", ret, size);
-return -1;
+return ret < 0 ? -saved_errno : -EIO;
 }
 
 return 0;
@@ -502,6 +508,7 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, 
uint64_t base,
 size_t fd_num = 0;
 bool shmfd = virtio_has_feature(dev->protocol_features,
 VHOST_USER_PROTOCOL_F_LOG_SHMFD);
+int ret;
 VhostUserMsg msg = {
 .hdr.request = VHOST_USER_SET_LOG_BASE,
 .hdr.flags = VHOST_USER_VERSION,
@@ -514,21 +521,23 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, 
uint64_t base,
 fds[fd_num++] = l

[PATCH 02/10] chardev/char-socket: tcp_chr_recv: don't clobber errno

2021-11-11 Thread Roman Kagan
tcp_chr_recv communicates the specific error condition to the caller via
errno.  However, after setting it, it may call into some system calls or
library functions which can clobber the errno.

Avoid this by moving the errno assignment to the end of the function.

Signed-off-by: Roman Kagan 
---
 chardev/char-socket.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index 836cfa0bc2..90054ce58c 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -346,13 +346,6 @@ static ssize_t tcp_chr_recv(Chardev *chr, char *buf, 
size_t len)
  NULL);
 }
 
-if (ret == QIO_CHANNEL_ERR_BLOCK) {
-errno = EAGAIN;
-ret = -1;
-} else if (ret == -1) {
-errno = EIO;
-}
-
 if (msgfds_num) {
 /* close and clean read_msgfds */
 for (i = 0; i < s->read_msgfds_num; i++) {
@@ -381,6 +374,13 @@ static ssize_t tcp_chr_recv(Chardev *chr, char *buf, 
size_t len)
 #endif
 }
 
+if (ret == QIO_CHANNEL_ERR_BLOCK) {
+errno = EAGAIN;
+ret = -1;
+} else if (ret == -1) {
+errno = EIO;
+}
+
 return ret;
 }
 
-- 
2.33.1




[PATCH 05/10] vhost-backend: avoid overflow on memslots_limit

2021-11-11 Thread Roman Kagan
Fix the (hypothetical) potential problem when the value parsed out of
the vhost module parameter in sysfs overflows the return value from
vhost_kernel_memslots_limit.

Signed-off-by: Roman Kagan 
---
 hw/virtio/vhost-backend.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index b65f8f7e97..44f7dbb243 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -58,7 +58,7 @@ static int vhost_kernel_memslots_limit(struct vhost_dev *dev)
 if (g_file_get_contents("/sys/module/vhost/parameters/max_mem_regions",
 &s, NULL, NULL)) {
 uint64_t val = g_ascii_strtoull(s, NULL, 10);
-if (!((val == G_MAXUINT64 || !val) && errno)) {
+if (val < INT_MAX && val > 0) {
 g_free(s);
 return val;
 }
-- 
2.33.1




[PATCH 00/10] vhost: stick to -errno error return convention

2021-11-11 Thread Roman Kagan
Error propagation between the generic vhost code and the specific backends is
not quite consistent: some places follow "return -1 and set errno" convention,
while others assume "return negated errno".  Furthermore, not enough care is
taken not to clobber errno.

As a result, on certain code paths the errno resulting from a failure may get
overridden by another function call, and then that zero errno indicating
success is propagated up the stack, leading to failures being lost.  In
particular, we've seen errors in the communication with a vhost-user-blk slave
not trigger an immediate connection drop and reconnection, leaving it in a
broken state.

Rework error propagation to always return negated errno on errors and
correctly pass it up the stack.
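
For the record, a contrived stand-alone illustration of the failure mode
(not QEMU code; close(-1) just stands in for any cleanup call that happens
to overwrite errno):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* "return -1 and set errno": fragile, the cleanup call clobbers errno */
    static int op_old(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            close(-1);              /* overwrites errno with EBADF */
            return -1;
        }
        close(fd);
        return 0;
    }

    /* "return negated errno": the error code travels with the return value */
    static int op_new(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            int ret = -errno;       /* capture before anything else runs */
            close(-1);
            return ret;
        }
        close(fd);
        return 0;
    }

    int main(void)
    {
        op_old("/nonexistent");
        printf("old: errno=%d\n", errno);                /* EBADF, ENOENT is lost */
        printf("new: ret=%d\n", op_new("/nonexistent")); /* -ENOENT */
        return 0;
    }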

Roman Kagan (10):
  vhost-user-blk: reconnect on any error during realize
  chardev/char-socket: tcp_chr_recv: don't clobber errno
  chardev/char-socket: tcp_chr_sync_read: don't clobber errno
  chardev/char-fe: don't allow EAGAIN from blocking read
  vhost-backend: avoid overflow on memslots_limit
  vhost-backend: stick to -errno error return convention
  vhost-vdpa: stick to -errno error return convention
  vhost-user: stick to -errno error return convention
  vhost: stick to -errno error return convention
  vhost-user-blk: propagate error return from generic vhost

 chardev/char-fe.c |   7 +-
 chardev/char-socket.c |  17 +-
 hw/block/vhost-user-blk.c |   4 +-
 hw/virtio/vhost-backend.c |   4 +-
 hw/virtio/vhost-user.c| 401 +-
 hw/virtio/vhost-vdpa.c|  37 ++--
 hw/virtio/vhost.c |  98 +-
 7 files changed, 307 insertions(+), 261 deletions(-)

-- 
2.33.1




[PATCH 01/10] vhost-user-blk: reconnect on any error during realize

2021-11-11 Thread Roman Kagan
vhost-user-blk realize only attempts to reconnect if the previous
connection attempt failed on "a problem with the connection and not an
error related to the content (which would fail again the same way in the
next attempt)".

However this distinction is very subtle, and may be inadvertently broken
if the code changes somewhere deep down the stack and a new error gets
propagated up to here.

OTOH now that the number of reconnection attempts is limited it seems
harmless to try reconnecting on any error.

So relax the condition of whether to retry connecting to check for any
error.

This patch amends a527e312b5 "vhost-user-blk: Implement reconnection
during realize".

Signed-off-by: Roman Kagan 
---
 hw/block/vhost-user-blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index ba13cb87e5..f9b17f6813 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -511,7 +511,7 @@ static void vhost_user_blk_device_realize(DeviceState *dev, 
Error **errp)
 *errp = NULL;
 }
 ret = vhost_user_blk_realize_connect(s, errp);
-} while (ret == -EPROTO && retries--);
+} while (ret < 0 && retries--);
 
 if (ret < 0) {
 goto virtio_err;
-- 
2.33.1




[PATCH 03/10] chardev/char-socket: tcp_chr_sync_read: don't clobber errno

2021-11-11 Thread Roman Kagan
After the return from tcp_chr_recv, tcp_chr_sync_read calls into a
function which eventually makes a system call and may clobber errno.

Make a copy of errno right after tcp_chr_recv and restore the errno on
return from tcp_chr_sync_read.

Signed-off-by: Roman Kagan 
---
 chardev/char-socket.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index 90054ce58c..cf7f2ba65a 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -581,6 +581,7 @@ static int tcp_chr_sync_read(Chardev *chr, const uint8_t 
*buf, int len)
 {
 SocketChardev *s = SOCKET_CHARDEV(chr);
 int size;
+int saved_errno;
 
 if (s->state != TCP_CHARDEV_STATE_CONNECTED) {
 return 0;
@@ -588,6 +589,7 @@ static int tcp_chr_sync_read(Chardev *chr, const uint8_t 
*buf, int len)
 
 qio_channel_set_blocking(s->ioc, true, NULL);
 size = tcp_chr_recv(chr, (void *) buf, len);
+saved_errno = errno;
 if (s->state != TCP_CHARDEV_STATE_DISCONNECTED) {
 qio_channel_set_blocking(s->ioc, false, NULL);
 }
@@ -596,6 +598,7 @@ static int tcp_chr_sync_read(Chardev *chr, const uint8_t 
*buf, int len)
 tcp_chr_disconnect(chr);
 }
 
+errno = saved_errno;
 return size;
 }
 
-- 
2.33.1




[PATCH 06/10] vhost-backend: stick to -errno error return convention

2021-11-11 Thread Roman Kagan
Almost all VhostOps methods in kernel_ops follow the convention of
returning negated errno on error.

Adjust the only one that doesn't.

Signed-off-by: Roman Kagan 
---
 hw/virtio/vhost-backend.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index 44f7dbb243..e409a865ae 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -47,7 +47,7 @@ static int vhost_kernel_cleanup(struct vhost_dev *dev)
 
 assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_KERNEL);
 
-return close(fd);
+return close(fd) < 0 ? -errno : 0;
 }
 
 static int vhost_kernel_memslots_limit(struct vhost_dev *dev)
-- 
2.33.1




Re: [PATCH v0 0/2] virtio-blk and vhost-user-blk cross-device migration

2021-10-04 Thread Roman Kagan
On Mon, Oct 04, 2021 at 11:11:00AM -0400, Michael S. Tsirkin wrote:
> On Mon, Oct 04, 2021 at 06:07:29PM +0300, Denis Plotnikov wrote:
> > It might be useful for the cases when a slow block layer should be replaced
> > with a more performant one on a running VM without stopping it, i.e. with
> > very low downtime, comparable to that of migration.
> > 
> > It's possible to achieve that for two reasons:
> > 
> > 1.The VMStates of "virtio-blk" and "vhost-user-blk" are almost the same.
> >   They consist of the identical VMSTATE_VIRTIO_DEVICE and differs from
> >   each other in the values of migration service fields only.
> > 2.The device driver used in the guest is the same: virtio-blk
> > 
> > In the series cross-migration is achieved by adding a new type.
> > The new type uses virtio-blk VMState instead of vhost-user-blk specific
> > VMstate, also it implements migration save/load callbacks to be compatible
> > with migration stream produced by "virtio-blk" device.
> > 
> > Adding the new type instead of modifying the existing one is convenient.
> > It is easy to distinguish the new virtio-blk-compatible vhost-user-blk
> > device from the existing non-compatible one using qemu machinery without any
> > other modifications. That gives all the variety of qemu device related
> > constraints out of the box.
> 
> Hmm I'm not sure I understand. What is the advantage for the user?
> What if vhost-user-blk became an alias for vhost-user-virtio-blk?
> We could add some hacks to make it compatible for old machine types.

The point is that virtio-blk and vhost-user-blk are not
migration-compatible ATM.  OTOH they are the same device from the guest
POV so there's nothing fundamentally preventing the migration between
the two.  In particular, we see it as a means to switch between the
storage backend transports via live migration without disrupting the
guest.

Migration-wise virtio-blk and vhost-user-blk have in common

- the content of the VMState -- VMSTATE_VIRTIO_DEVICE

The two differ in

- the name and the version of the VMStateDescription

- virtio-blk has an extra migration section (via .save/.load callbacks
  on VirtioDeviceClass) containing requests in flight

It looks like to become migration-compatible with virtio-blk,
vhost-user-blk has to start using VMStateDescription of virtio-blk and
provide compatible .save/.load callbacks.  It isn't entirely obvious how
to make this machine-type-dependent, so we came up with a simpler idea
of defining a new device that shares most of the implementation with the
original vhost-user-blk except for the migration stuff.  We're certainly
open to suggestions on how to reconcile this under a single
vhost-user-blk device, as this would be more user-friendly indeed.

We considered using a class property for this and defining the
respective compat clause, but IIUC the class constructors (where .vmsd
and .save/.load are defined) are not supposed to depend on class
properties.
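
To make the "separate type" idea above a bit more concrete, roughly what it
looks like (a sketch only: the type name is made up, the version numbers
have to match whatever virtio-blk uses, and the save/load helpers that
reproduce virtio-blk's extra in-flight-request section are not shown):

    static const VMStateDescription vmstate_vhost_user_blk_virtio_compat = {
        .name = "virtio-blk",           /* match virtio-blk's section name */
        .version_id = 2,                /* ... and its version */
        .minimum_version_id = 2,
        .fields = (VMStateField[]) {
            VMSTATE_VIRTIO_DEVICE,
            VMSTATE_END_OF_LIST()
        },
    };

    static void vub_virtio_compat_class_init(ObjectClass *klass, void *data)
    {
        DeviceClass *dc = DEVICE_CLASS(klass);
        VirtioDeviceClass *vdc = VIRTIO_DEVICE_CLASS(klass);

        dc->vmsd = &vmstate_vhost_user_blk_virtio_compat;
        /* emit/accept the virtio-blk-compatible extra migration section */
        vdc->save = vub_virtio_compat_save;
        vdc->load = vub_virtio_compat_load;
    }

    static const TypeInfo vub_virtio_compat_info = {
        .name       = "vhost-user-virtio-blk",      /* made-up name */
        .parent     = TYPE_VHOST_USER_BLK,
        .class_init = vub_virtio_compat_class_init,
    };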

Thanks,
Roman.



Re: [PATCH v3 06/33] util/async: aio_co_schedule(): support reschedule in same ctx

2021-05-14 Thread Roman Kagan
On Thu, May 13, 2021 at 11:04:37PM +0200, Paolo Bonzini wrote:
> On 12/05/21 09:15, Vladimir Sementsov-Ogievskiy wrote:
> > > > 
> > > 
> > > I don't understand.  Why doesn't aio_co_enter go through the ctx !=
> > > qemu_get_current_aio_context() branch and just do aio_co_schedule?
> > > That was at least the idea behind aio_co_wake and aio_co_enter.
> > > 
> > 
> > Because ctx is exactly qemu_get_current_aio_context(), as we are not in
> > iothread but in nbd connection thread. So,
> > qemu_get_current_aio_context() returns qemu_aio_context.
> 
> So the problem is that threads other than the main thread and
> the I/O thread should not return qemu_aio_context.  The vCPU thread
> may need to return it when running with BQL taken, though.

I'm not sure this is the only case.

AFAICS your patch has basically the same effect as Vladimir's
patch "util/async: aio_co_enter(): do aio_co_schedule in general case"
(https://lore.kernel.org/qemu-devel/20210408140827.332915-4-vsement...@virtuozzo.com/).
That one was found to break e.g. aio=threads cases.  I guessed it
implicitly relied upon aio_co_enter() acquiring the aio_context but we
didn't dig further to pinpoint the exact scenario.

Roman.

> Something like this (untested):
> 
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 5f342267d5..10fcae1515 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -691,10 +691,13 @@ void aio_co_enter(AioContext *ctx, struct Coroutine 
> *co);
>   * Return the AioContext whose event loop runs in the current thread.
>   *
>   * If called from an IOThread this will be the IOThread's AioContext.  If
> - * called from another thread it will be the main loop AioContext.
> + * called from the main thread or with the "big QEMU lock" taken it
> + * will be the main loop AioContext.
>   */
>  AioContext *qemu_get_current_aio_context(void);
> +void qemu_set_current_aio_context(AioContext *ctx);
> +
>  /**
>   * aio_context_setup:
>   * @ctx: the aio context
> diff --git a/iothread.c b/iothread.c
> index 7f086387be..22b967e77c 100644
> --- a/iothread.c
> +++ b/iothread.c
> @@ -39,11 +39,23 @@ DECLARE_CLASS_CHECKERS(IOThreadClass, IOTHREAD,
>  #define IOTHREAD_POLL_MAX_NS_DEFAULT 0ULL
>  #endif
> -static __thread IOThread *my_iothread;
> +static __thread AioContext *my_aiocontext;
> +
> +void qemu_set_current_aio_context(AioContext *ctx)
> +{
> +assert(!my_aiocontext);
> +my_aiocontext = ctx;
> +}
>  AioContext *qemu_get_current_aio_context(void)
>  {
> -return my_iothread ? my_iothread->ctx : qemu_get_aio_context();
> +if (my_aiocontext) {
> +return my_aiocontext;
> +}
> +if (qemu_mutex_iothread_locked()) {
> +return qemu_get_aio_context();
> +}
> +return NULL;
>  }
>  static void *iothread_run(void *opaque)
> @@ -56,7 +68,7 @@ static void *iothread_run(void *opaque)
>   * in this new thread uses glib.
>   */
>  g_main_context_push_thread_default(iothread->worker_context);
> -my_iothread = iothread;
> +qemu_set_current_aio_context(iothread->ctx);
>  iothread->thread_id = qemu_get_thread_id();
>  qemu_sem_post(&iothread->init_done_sem);
> diff --git a/stubs/iothread.c b/stubs/iothread.c
> index 8cc9e28c55..25ff398894 100644
> --- a/stubs/iothread.c
> +++ b/stubs/iothread.c
> @@ -6,3 +6,7 @@ AioContext *qemu_get_current_aio_context(void)
>  {
>  return qemu_get_aio_context();
>  }
> +
> +void qemu_set_current_aio_context(AioContext *ctx)
> +{
> +}
> diff --git a/tests/unit/iothread.c b/tests/unit/iothread.c
> index afde12b4ef..cab38b3da8 100644
> --- a/tests/unit/iothread.c
> +++ b/tests/unit/iothread.c
> @@ -30,13 +30,26 @@ struct IOThread {
>  bool stopping;
>  };
> -static __thread IOThread *my_iothread;
> +static __thread AioContext *my_aiocontext;
> +
> +void qemu_set_current_aio_context(AioContext *ctx)
> +{
> +assert(!my_aiocontext);
> +my_aiocontext = ctx;
> +}
>  AioContext *qemu_get_current_aio_context(void)
>  {
> -return my_iothread ? my_iothread->ctx : qemu_get_aio_context();
> +if (my_aiocontext) {
> +return my_aiocontext;
> +}
> +if (qemu_mutex_iothread_locked()) {
> +return qemu_get_aio_context();
> +}
> +return NULL;
>  }
> +
>  static void iothread_init_gcontext(IOThread *iothread)
>  {
>  GSource *source;
> @@ -54,7 +67,7 @@ static void *iothread_run(void *opaque)
>  rcu_register_thread();
> -my_iothread = iothread;
> +qemu_set_current_aio_context(iothread->ctx);
>  qemu_mutex_lock(&iothread->init_done_lock);
>  iothread->ctx = aio_context_new(&error_abort);
> diff --git a/util/main-loop.c b/util/main-loop.c
> index d9c55df6f5..4ae5b23e99 100644
> --- a/util/main-loop.c
> +++ b/util/main-loop.c
> @@ -170,6 +170,7 @@ int qemu_init_main_loop(Error **errp)
>  if (!qemu_aio_context) {
>  return -EMFILE;
>  }
> +qemu_set_current_aio_context(qemu_aio_context);
>  qemu_notify_bh = qemu_bh_new(notify_event_cb, NULL);
>  gpollfds = 

Re: [PATCH v3 21/33] qemu-socket: pass monitor link to socket_get_fd directly

2021-05-12 Thread Roman Kagan
On Mon, Apr 19, 2021 at 10:34:49AM +0100, Daniel P. Berrangé wrote:
> On Fri, Apr 16, 2021 at 11:08:59AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > Detecting monitor by current coroutine works bad when we are not in
> > coroutine context. And that's exactly so in nbd reconnect code, where
> > qio_channel_socket_connect_sync() is called from thread.
> > 
> > Add a possibility to pass monitor by hand, to be used in the following
> > commit.
> > 
> > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > ---
> >  include/io/channel-socket.h| 20 
> >  include/qemu/sockets.h |  2 +-
> >  io/channel-socket.c| 17 +
> >  tests/unit/test-util-sockets.c | 16 
> >  util/qemu-sockets.c| 10 +-
> >  5 files changed, 47 insertions(+), 18 deletions(-)
> > 
> > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> > index e747e63514..6d0915420d 100644
> > --- a/include/io/channel-socket.h
> > +++ b/include/io/channel-socket.h
> > @@ -78,6 +78,23 @@ qio_channel_socket_new_fd(int fd,
> >Error **errp);
> >  
> >  
> > +/**
> > + * qio_channel_socket_connect_sync_mon:
> > + * @ioc: the socket channel object
> > + * @addr: the address to connect to
> > + * @mon: current monitor. If NULL, it will be detected by
> > + *   current coroutine.
> > + * @errp: pointer to a NULL-initialized error object
> > + *
> > + * Attempt to connect to the address @addr. This method
> > + * will run in the foreground so the caller will not regain
> > + * execution control until the connection is established or
> > + * an error occurs.
> > + */
> > +int qio_channel_socket_connect_sync_mon(QIOChannelSocket *ioc,
> > +SocketAddress *addr,
> > +Monitor *mon,
> > +Error **errp);
> 
> I don't really like exposing the concept of the QEMU monitor in
> the IO layer APIs. IMHO these ought to remain completely separate
> subsystems from the API pov,

Agreed. 

> and we ought to fix this problem by
> making monitor_cur() work better in the scenario required.

Would it make sense instead to resolve the fdstr into an actual file
descriptor number in the context where monitor_cur() works and makes
sense, prior to passing it to the connection thread?
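
I.e. something along these lines on the caller's side, before the request
is handed over to the connection thread (untested sketch):

    /* still in the context where monitor_cur() is meaningful */
    if (saddr->type == SOCKET_ADDRESS_TYPE_FD) {
        int fd = monitor_fd_param(monitor_cur(), saddr->u.fd.str, errp);
        if (fd < 0) {
            return -EINVAL;
        }
        /*
         * Hand the numeric fd (not the fd name) to the connection thread,
         * so the thread never needs to know about the monitor at all.
         */
    }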

Roman.



Re: [PATCH v3 19/33] block/nbd: split nbd_handle_updated_info out of nbd_client_handshake()

2021-05-12 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:57AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> To be reused in the following patch.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 99 ++---
>  1 file changed, 57 insertions(+), 42 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v3 18/33] nbd/client-connection: shutdown connection on release

2021-05-11 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:56AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Now that the thread can do negotiation and retry, it may run for a
> relatively long time. We need a mechanism to stop it when the user is
> no longer interested in the result. So, on nbd_client_connection_release()
> let's shut down the socket, and do not retry the connection if the
> thread is detached.

This kinda answers my question to the previous patch about reconnect
cancellation.

> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  nbd/client-connection.c | 36 ++--
>  1 file changed, 26 insertions(+), 10 deletions(-)
> 
> diff --git a/nbd/client-connection.c b/nbd/client-connection.c
> index 002bd91f42..54f73c6c24 100644
> --- a/nbd/client-connection.c
> +++ b/nbd/client-connection.c
> @@ -158,9 +158,13 @@ static void *connect_thread_func(void *opaque)
>  uint64_t timeout = 1;
>  uint64_t max_timeout = 16;
>  
> -while (true) {
> +qemu_mutex_lock(&conn->mutex);
> +while (!conn->detached) {
> +assert(!conn->sioc);
>  conn->sioc = qio_channel_socket_new();
>  
> +qemu_mutex_unlock(&conn->mutex);
> +
>  error_free(conn->err);
>  conn->err = NULL;
>  conn->updated_info = conn->initial_info;
> @@ -171,14 +175,20 @@ static void *connect_thread_func(void *opaque)
>  conn->updated_info.x_dirty_bitmap = NULL;
>  conn->updated_info.name = NULL;
>  
> +qemu_mutex_lock(&conn->mutex);
> +
>  if (ret < 0) {
>  object_unref(OBJECT(conn->sioc));
>  conn->sioc = NULL;
> -if (conn->do_retry) {
> +if (conn->do_retry && !conn->detached) {
> +qemu_mutex_unlock(&conn->mutex);
> +
>  sleep(timeout);
>  if (timeout < max_timeout) {
>  timeout *= 2;
>  }
> +
> +qemu_mutex_lock(&conn->mutex);
>  continue;
>  }
>  }
> @@ -186,15 +196,17 @@ static void *connect_thread_func(void *opaque)
>  break;
>  }
>  
> -WITH_QEMU_LOCK_GUARD(&conn->mutex) {
> -assert(conn->running);
> -conn->running = false;
> -if (conn->wait_co) {
> -aio_co_schedule(NULL, conn->wait_co);
> -conn->wait_co = NULL;
> -}
> -do_free = conn->detached;
> +/* mutex is locked */
> +
> +assert(conn->running);
> +conn->running = false;
> +if (conn->wait_co) {
> +aio_co_schedule(NULL, conn->wait_co);
> +conn->wait_co = NULL;
>  }
> +do_free = conn->detached;
> +
> +qemu_mutex_unlock(&conn->mutex);

This basically reverts some hunks from patch 15 "nbd/client-connection:
use QEMU_LOCK_GUARD".  How about dropping them there (they weren't an
obvious improvement even then).

Roman.

>  
>  if (do_free) {
>  nbd_client_connection_do_free(conn);
> @@ -215,6 +227,10 @@ void nbd_client_connection_release(NBDClientConnection 
> *conn)
>  if (conn->running) {
>  conn->detached = true;
>  }
> +if (conn->sioc) {
> +qio_channel_shutdown(QIO_CHANNEL(conn->sioc),
> + QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
> +}
>  do_free = !conn->running && !conn->detached;
>  qemu_mutex_unlock(&conn->mutex);
>  
> -- 
> 2.29.2
> 



Re: [PATCH v3 17/33] nbd/client-connection: implement connection retry

2021-05-11 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:55AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Add an option for thread to retry connection until success. We'll use
> nbd/client-connection both for reconnect and for initial connection in
> nbd_open(), so we need a possibility to use same NBDClientConnection
> instance to connect once in nbd_open() and then use retry semantics for
> reconnect.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/block/nbd.h |  2 ++
>  nbd/client-connection.c | 55 +
>  2 files changed, 41 insertions(+), 16 deletions(-)
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 5d86e6a393..5bb54d831c 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -409,6 +409,8 @@ const char *nbd_err_lookup(int err);
>  /* nbd/client-connection.c */
>  typedef struct NBDClientConnection NBDClientConnection;
>  
> +void nbd_client_connection_enable_retry(NBDClientConnection *conn);
> +
>  NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,
> bool do_negotiation,
> const char *export_name,
> diff --git a/nbd/client-connection.c b/nbd/client-connection.c
> index ae4a77f826..002bd91f42 100644
> --- a/nbd/client-connection.c
> +++ b/nbd/client-connection.c
> @@ -36,6 +36,8 @@ struct NBDClientConnection {
>  NBDExportInfo initial_info;
>  bool do_negotiation;
>  
> +bool do_retry;
> +
>  /*
>   * Result of last attempt. Valid in FAIL and SUCCESS states.
>   * If you want to steal error, don't forget to set pointer to NULL.
> @@ -52,6 +54,15 @@ struct NBDClientConnection {
>  Coroutine *wait_co; /* nbd_co_establish_connection() wait in yield() */
>  };
>  
> +/*
> + * The function isn't protected by any mutex, so call it when thread is not
> + * running.
> + */
> +void nbd_client_connection_enable_retry(NBDClientConnection *conn)
> +{
> +conn->do_retry = true;
> +}
> +
>  NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,
> bool do_negotiation,
> const char *export_name,
> @@ -144,24 +155,37 @@ static void *connect_thread_func(void *opaque)
>  NBDClientConnection *conn = opaque;
>  bool do_free;
>  int ret;
> +uint64_t timeout = 1;
> +uint64_t max_timeout = 16;
> +
> +while (true) {
> +conn->sioc = qio_channel_socket_new();
> +
> +error_free(conn->err);
> +conn->err = NULL;
> +conn->updated_info = conn->initial_info;
> +
> +ret = nbd_connect(conn->sioc, conn->saddr,
> +  conn->do_negotiation ? &conn->updated_info : NULL,
> +  conn->tlscreds, &conn->ioc, &conn->err);
> +conn->updated_info.x_dirty_bitmap = NULL;
> +conn->updated_info.name = NULL;
> +
> +if (ret < 0) {
> +object_unref(OBJECT(conn->sioc));
> +conn->sioc = NULL;
> +if (conn->do_retry) {
> +sleep(timeout);
> +if (timeout < max_timeout) {
> +timeout *= 2;
> +}
> +continue;
> +}
> +}

How is it supposed to get canceled?

Roman.

> -conn->sioc = qio_channel_socket_new();
> -
> -error_free(conn->err);
> -conn->err = NULL;
> -conn->updated_info = conn->initial_info;
> -
> -ret = nbd_connect(conn->sioc, conn->saddr,
> -  conn->do_negotiation ? &conn->updated_info : NULL,
> -  conn->tlscreds, &conn->ioc, &conn->err);
> -if (ret < 0) {
> -object_unref(OBJECT(conn->sioc));
> -conn->sioc = NULL;
> +break;
>  }
>  
> -conn->updated_info.x_dirty_bitmap = NULL;
> -conn->updated_info.name = NULL;
> -
>  WITH_QEMU_LOCK_GUARD(&conn->mutex) {
>  assert(conn->running);
>  conn->running = false;
> @@ -172,7 +196,6 @@ static void *connect_thread_func(void *opaque)
>  do_free = conn->detached;
>  }
>  
> -
>  if (do_free) {
>  nbd_client_connection_do_free(conn);
>  }
> -- 
> 2.29.2
> 



Re: [PATCH v3 16/33] nbd/client-connection: add possibility of negotiation

2021-05-11 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:54AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Add arguments and logic to support nbd negotiation in the same thread
> after successful connection.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/block/nbd.h |   9 +++-
>  block/nbd.c |   4 +-
>  nbd/client-connection.c | 105 ++--
>  3 files changed, 109 insertions(+), 9 deletions(-)
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 57381be76f..5d86e6a393 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -409,11 +409,16 @@ const char *nbd_err_lookup(int err);
>  /* nbd/client-connection.c */
>  typedef struct NBDClientConnection NBDClientConnection;
>  
> -NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr);
> +NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,
> +   bool do_negotiation,
> +   const char *export_name,
> +   const char *x_dirty_bitmap,
> +   QCryptoTLSCreds *tlscreds);
>  void nbd_client_connection_release(NBDClientConnection *conn);
>  
>  QIOChannelSocket *coroutine_fn
> -nbd_co_establish_connection(NBDClientConnection *conn, Error **errp);
> +nbd_co_establish_connection(NBDClientConnection *conn, NBDExportInfo *info,
> +QIOChannel **ioc, Error **errp);
>  
>  void coroutine_fn nbd_co_establish_connection_cancel(NBDClientConnection 
> *conn);
>  
> diff --git a/block/nbd.c b/block/nbd.c
> index 9bd68dcf10..5e63caaf4b 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -361,7 +361,7 @@ static coroutine_fn void 
> nbd_reconnect_attempt(BDRVNBDState *s)
>  s->ioc = NULL;
>  }
>  
> -s->sioc = nbd_co_establish_connection(s->conn, NULL);
> +s->sioc = nbd_co_establish_connection(s->conn, NULL, NULL, NULL);
>  if (!s->sioc) {
>  ret = -ECONNREFUSED;
>  goto out;
> @@ -2033,7 +2033,7 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  goto fail;
>  }
>  
> -s->conn = nbd_client_connection_new(s->saddr);
> +s->conn = nbd_client_connection_new(s->saddr, false, NULL, NULL, NULL);
>  
>  /*
>   * establish TCP connection, return error if it fails
> diff --git a/nbd/client-connection.c b/nbd/client-connection.c
> index b45a0bd5f6..ae4a77f826 100644
> --- a/nbd/client-connection.c
> +++ b/nbd/client-connection.c
> @@ -30,14 +30,19 @@
>  #include "qapi/clone-visitor.h"
>  
>  struct NBDClientConnection {
> -/* Initialization constants */
> +/* Initialization constants, never change */
>  SocketAddress *saddr; /* address to connect to */
> +QCryptoTLSCreds *tlscreds;
> +NBDExportInfo initial_info;
> +bool do_negotiation;
>  
>  /*
>   * Result of last attempt. Valid in FAIL and SUCCESS states.
>   * If you want to steal error, don't forget to set pointer to NULL.
>   */
> +NBDExportInfo updated_info;
>  QIOChannelSocket *sioc;
> +QIOChannel *ioc;
>  Error *err;
>  
>  QemuMutex mutex;
> @@ -47,12 +52,25 @@ struct NBDClientConnection {
>  Coroutine *wait_co; /* nbd_co_establish_connection() wait in yield() */
>  };
>  
> -NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr)
> +NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,
> +   bool do_negotiation,
> +   const char *export_name,
> +   const char *x_dirty_bitmap,
> +   QCryptoTLSCreds *tlscreds)
>  {
>  NBDClientConnection *conn = g_new(NBDClientConnection, 1);
>  
> +object_ref(OBJECT(tlscreds));
>  *conn = (NBDClientConnection) {
>  .saddr = QAPI_CLONE(SocketAddress, saddr),
> +.tlscreds = tlscreds,
> +.do_negotiation = do_negotiation,
> +
> +.initial_info.request_sizes = true,
> +.initial_info.structured_reply = true,
> +.initial_info.base_allocation = true,
> +.initial_info.x_dirty_bitmap = g_strdup(x_dirty_bitmap),
> +.initial_info.name = g_strdup(export_name ?: "")
>  };
>  
>  qemu_mutex_init(&conn->mutex);
> @@ -68,9 +86,59 @@ static void 
> nbd_client_connection_do_free(NBDClientConnection *conn)
>  }
>  error_free(conn->err);
>  qapi_free_SocketAddress(conn->saddr);
> +object_unref(OBJECT(conn->tlscreds));
> +g_free(conn->initial_info.x_dirty_bitmap);
> +g_free(conn->initial_info.name);
>  g_free(conn);
>  }
>  
> +/*
> + * Connect to @addr and do NBD negotiation if @info is not null. If @tlscreds
> + * given @outioc is provided. @outioc is provided only on success.  The call 
> may

s/given/are given/
s/provided/returned/g

> + * 

Re: [PATCH v3 15/33] nbd/client-connection: use QEMU_LOCK_GUARD

2021-04-28 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:53AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  nbd/client-connection.c | 94 ++---
>  1 file changed, 42 insertions(+), 52 deletions(-)
> 
> diff --git a/nbd/client-connection.c b/nbd/client-connection.c
> index 4e39a5b1af..b45a0bd5f6 100644
> --- a/nbd/client-connection.c
> +++ b/nbd/client-connection.c
> @@ -87,17 +87,16 @@ static void *connect_thread_func(void *opaque)
>  conn->sioc = NULL;
>  }
>  
> -qemu_mutex_lock(&conn->mutex);
> -
> -assert(conn->running);
> -conn->running = false;
> -if (conn->wait_co) {
> -aio_co_schedule(NULL, conn->wait_co);
> -conn->wait_co = NULL;
> +WITH_QEMU_LOCK_GUARD(&conn->mutex) {
> +assert(conn->running);
> +conn->running = false;
> +if (conn->wait_co) {
> +aio_co_schedule(NULL, conn->wait_co);
> +conn->wait_co = NULL;
> +}
>  }
>  do_free = conn->detached;

->detached is now accessed outside the mutex
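
I.e. I'd expect something like (sketch):

    WITH_QEMU_LOCK_GUARD(&conn->mutex) {
        assert(conn->running);
        conn->running = false;
        if (conn->wait_co) {
            aio_co_schedule(NULL, conn->wait_co);
            conn->wait_co = NULL;
        }
        do_free = conn->detached;   /* read under the mutex as well */
    }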

>  
> -qemu_mutex_unlock(&conn->mutex);
>  
>  if (do_free) {
>  nbd_client_connection_do_free(conn);
> @@ -136,61 +135,54 @@ void nbd_client_connection_release(NBDClientConnection 
> *conn)
>  QIOChannelSocket *coroutine_fn
>  nbd_co_establish_connection(NBDClientConnection *conn, Error **errp)
>  {
> -QIOChannelSocket *sioc = NULL;
>  QemuThread thread;
>  
> -qemu_mutex_lock(&conn->mutex);
> -
> -/*
> - * Don't call nbd_co_establish_connection() in several coroutines in
> - * parallel. Only one call at once is supported.
> - */
> -assert(!conn->wait_co);
> -
> -if (!conn->running) {
> -if (conn->sioc) {
> -/* Previous attempt finally succeeded in background */
> -sioc = g_steal_pointer(&conn->sioc);
> -qemu_mutex_unlock(&conn->mutex);
> -
> -return sioc;
> +WITH_QEMU_LOCK_GUARD(&conn->mutex) {
> +/*
> + * Don't call nbd_co_establish_connection() in several coroutines in
> + * parallel. Only one call at once is supported.
> + */
> +assert(!conn->wait_co);
> +
> +if (!conn->running) {
> +if (conn->sioc) {
> +/* Previous attempt finally succeeded in background */
> +return g_steal_pointer(&conn->sioc);
> +}
> +
> +conn->running = true;
> +error_free(conn->err);
> +conn->err = NULL;
> +qemu_thread_create(&thread, "nbd-connect",
> +   connect_thread_func, conn, 
> QEMU_THREAD_DETACHED);
>  }
>  
> -conn->running = true;
> -error_free(conn->err);
> -conn->err = NULL;
> -qemu_thread_create(&thread, "nbd-connect",
> -   connect_thread_func, conn, QEMU_THREAD_DETACHED);
> +conn->wait_co = qemu_coroutine_self();
>  }
>  
> -conn->wait_co = qemu_coroutine_self();
> -
> -qemu_mutex_unlock(&conn->mutex);
> -
>  /*
>   * We are going to wait for connect-thread finish, but
>   * nbd_co_establish_connection_cancel() can interrupt.
>   */
>  qemu_coroutine_yield();
>  
> -qemu_mutex_lock(&conn->mutex);
> -
> -if (conn->running) {
> -/*
> - * Obviously, drained section wants to start. Report the attempt as
> - * failed. Still connect thread is executing in background, and its
> - * result may be used for next connection attempt.
> - */
> -error_setg(errp, "Connection attempt cancelled by other operation");
> -} else {
> -error_propagate(errp, conn->err);
> -conn->err = NULL;
> -sioc = g_steal_pointer(&conn->sioc);
> +WITH_QEMU_LOCK_GUARD(&conn->mutex) {
> +if (conn->running) {
> +/*
> + * Obviously, drained section wants to start. Report the attempt 
> as
> + * failed. Still connect thread is executing in background, and 
> its
> + * result may be used for next connection attempt.
> + */
> +error_setg(errp, "Connection attempt cancelled by other 
> operation");
> +return NULL;
> +} else {
> +error_propagate(errp, conn->err);
> +conn->err = NULL;
> +return g_steal_pointer(&conn->sioc);
> +}
>  }
>  
> -qemu_mutex_unlock(&conn->mutex);
> -
> -return sioc;
> +abort(); /* unreachable */
>  }
>  
>  /*
> @@ -201,12 +193,10 @@ nbd_co_establish_connection(NBDClientConnection *conn, 
> Error **errp)
>   */
>  void coroutine_fn nbd_co_establish_connection_cancel(NBDClientConnection 
> *conn)
>  {
> -qemu_mutex_lock(&conn->mutex);
> +QEMU_LOCK_GUARD(&conn->mutex);
>  
>  if (conn->wait_co) {
>  aio_co_schedule(NULL, conn->wait_co);
>  conn->wait_co = NULL;
>  }
> -
> -qemu_mutex_unlock(&conn->mutex);
>  }
> -- 
> 2.29.2
> 



Re: [PATCH v3 14/33] nbd: move connection code from block/nbd to nbd/client-connection

2021-04-27 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:52AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We now have bs-independent connection API, which consists of four
> functions:
> 
>   nbd_client_connection_new()
>   nbd_client_connection_unref()
>   nbd_co_establish_connection()
>   nbd_co_establish_connection_cancel()
> 
> Move them to a separate file together with NBDClientConnection
> structure which becomes private to the new API.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/block/nbd.h |  11 +++
>  block/nbd.c | 187 ---
>  nbd/client-connection.c | 212 
>  nbd/meson.build |   1 +
>  4 files changed, 224 insertions(+), 187 deletions(-)
>  create mode 100644 nbd/client-connection.c
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 5f34d23bb0..57381be76f 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -406,4 +406,15 @@ const char *nbd_info_lookup(uint16_t info);
>  const char *nbd_cmd_lookup(uint16_t info);
>  const char *nbd_err_lookup(int err);
>  
> +/* nbd/client-connection.c */
> +typedef struct NBDClientConnection NBDClientConnection;
> +
> +NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr);
> +void nbd_client_connection_release(NBDClientConnection *conn);
> +
> +QIOChannelSocket *coroutine_fn
> +nbd_co_establish_connection(NBDClientConnection *conn, Error **errp);
> +
> +void coroutine_fn nbd_co_establish_connection_cancel(NBDClientConnection 
> *conn);
> +
>  #endif
> diff --git a/block/nbd.c b/block/nbd.c
> index 8531d019b2..9bd68dcf10 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -66,24 +66,6 @@ typedef enum NBDClientState {
>  NBD_CLIENT_QUIT
>  } NBDClientState;
>  
> -typedef struct NBDClientConnection {
> -/* Initialization constants */
> -SocketAddress *saddr; /* address to connect to */
> -
> -/*
> - * Result of last attempt. Valid in FAIL and SUCCESS states.
> - * If you want to steal error, don't forget to set pointer to NULL.
> - */
> -QIOChannelSocket *sioc;
> -Error *err;
> -
> -QemuMutex mutex;
> -/* All further fields are protected by mutex */
> -bool running; /* thread is running now */
> -bool detached; /* thread is detached and should cleanup the state */
> -Coroutine *wait_co; /* nbd_co_establish_connection() wait in yield() */
> -} NBDClientConnection;
> -
>  typedef struct BDRVNBDState {
>  QIOChannelSocket *sioc; /* The master data channel */
>  QIOChannel *ioc; /* The current I/O channel which may differ (eg TLS) */
> @@ -118,12 +100,8 @@ typedef struct BDRVNBDState {
>  NBDClientConnection *conn;
>  } BDRVNBDState;
>  
> -static void nbd_client_connection_release(NBDClientConnection *conn);
>  static int nbd_establish_connection(BlockDriverState *bs, SocketAddress 
> *saddr,
>  Error **errp);
> -static coroutine_fn QIOChannelSocket *
> -nbd_co_establish_connection(NBDClientConnection *conn, Error **errp);
> -static void nbd_co_establish_connection_cancel(NBDClientConnection *conn);
>  static int nbd_client_handshake(BlockDriverState *bs, Error **errp);
>  static void nbd_yank(void *opaque);
>  
> @@ -340,171 +318,6 @@ static bool nbd_client_connecting_wait(BDRVNBDState *s)
>  return qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTING_WAIT;
>  }
>  
> -static NBDClientConnection *
> -nbd_client_connection_new(const SocketAddress *saddr)
> -{
> -NBDClientConnection *conn = g_new(NBDClientConnection, 1);
> -
> -*conn = (NBDClientConnection) {
> -.saddr = QAPI_CLONE(SocketAddress, saddr),
> -};
> -
> -qemu_mutex_init(&conn->mutex);
> -
> -return conn;
> -}
> -
> -static void nbd_client_connection_do_free(NBDClientConnection *conn)
> -{
> -if (conn->sioc) {
> -qio_channel_close(QIO_CHANNEL(conn->sioc), NULL);
> -object_unref(OBJECT(conn->sioc));
> -}
> -error_free(conn->err);
> -qapi_free_SocketAddress(conn->saddr);
> -g_free(conn);
> -}
> -
> -static void *connect_thread_func(void *opaque)
> -{
> -NBDClientConnection *conn = opaque;
> -bool do_free;
> -int ret;
> -
> -conn->sioc = qio_channel_socket_new();
> -
> -error_free(conn->err);
> -conn->err = NULL;
> -ret = qio_channel_socket_connect_sync(conn->sioc, conn->saddr, 
> &conn->err);
> -if (ret < 0) {
> -object_unref(OBJECT(conn->sioc));
> -conn->sioc = NULL;
> -}
> -
> -qemu_mutex_lock(&conn->mutex);
> -
> -assert(conn->running);
> -conn->running = false;
> -if (conn->wait_co) {
> -aio_co_schedule(NULL, conn->wait_co);
> -conn->wait_co = NULL;
> -}
> -do_free = conn->detached;
> -
> -qemu_mutex_unlock(&conn->mutex);
> -
> -if (do_free) {
> -nbd_client_connection_do_free(conn);
> -}
> -
> -return NULL;
> -}
> -
> -static void nbd_client_connection_release(NBDClientConnection *conn)
> -{
> -

Re: [PATCH v3 13/33] block/nbd: introduce nbd_client_connection_release()

2021-04-27 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:51AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 43 ++-
>  1 file changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 21a4039359..8531d019b2 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -118,7 +118,7 @@ typedef struct BDRVNBDState {
>  NBDClientConnection *conn;
>  } BDRVNBDState;
>  
> -static void nbd_free_connect_thread(NBDClientConnection *conn);
> +static void nbd_client_connection_release(NBDClientConnection *conn);
>  static int nbd_establish_connection(BlockDriverState *bs, SocketAddress 
> *saddr,
>  Error **errp);
>  static coroutine_fn QIOChannelSocket *
> @@ -130,20 +130,9 @@ static void nbd_yank(void *opaque);
>  static void nbd_clear_bdrvstate(BlockDriverState *bs)
>  {
>  BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> -NBDClientConnection *conn = s->conn;
> -bool do_free;
> -
> -qemu_mutex_lock(&conn->mutex);
> -if (conn->running) {
> -conn->detached = true;
> -}
> -do_free = !conn->running && !conn->detached;
> -qemu_mutex_unlock(&conn->mutex);
>  
> -/* the runaway thread will clean it up itself */
> -if (do_free) {
> -nbd_free_connect_thread(conn);
> -}
> +nbd_client_connection_release(s->conn);
> +s->conn = NULL;
>  
>  yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
>  
> @@ -365,7 +354,7 @@ nbd_client_connection_new(const SocketAddress *saddr)
>  return conn;
>  }
>  
> -static void nbd_free_connect_thread(NBDClientConnection *conn)
> +static void nbd_client_connection_do_free(NBDClientConnection *conn)
>  {
>  if (conn->sioc) {
>  qio_channel_close(QIO_CHANNEL(conn->sioc), NULL);
> @@ -379,8 +368,8 @@ static void nbd_free_connect_thread(NBDClientConnection 
> *conn)
>  static void *connect_thread_func(void *opaque)
>  {
>  NBDClientConnection *conn = opaque;
> +bool do_free;
>  int ret;
> -bool do_free = false;
>  

This hunk belongs in patch 8.

Roman.

>  conn->sioc = qio_channel_socket_new();
>  
> @@ -405,12 +394,32 @@ static void *connect_thread_func(void *opaque)
>  qemu_mutex_unlock(>mutex);
>  
>  if (do_free) {
> -nbd_free_connect_thread(conn);
> +nbd_client_connection_do_free(conn);
>  }
>  
>  return NULL;
>  }
>  
> +static void nbd_client_connection_release(NBDClientConnection *conn)
> +{
> +bool do_free;
> +
> +if (!conn) {
> +return;
> +}
> +
> +qemu_mutex_lock(&conn->mutex);
> +if (conn->running) {
> +conn->detached = true;
> +}
> +do_free = !conn->running && !conn->detached;
> +qemu_mutex_unlock(&conn->mutex);
> +
> +if (do_free) {
> +nbd_client_connection_do_free(conn);
> +}
> +}
> +
>  /*
>   * Get a new connection in context of @conn:
>   *   if thread is running, wait for completion
> -- 
> 2.29.2
> 



Re: [PATCH v3 11/33] block/nbd: rename NBDConnectThread to NBDClientConnection

2021-04-27 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:49AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We are going to move connection code to own file and want clear names
> and APIs.
> 
> The structure is shared between user and (possibly) several runs of
> connect-thread. So it's wrong to call it "thread". Let's rename to
> something more generic.
> 
> Appropriately rename connect_thread and thr variables to conn.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 137 ++--
>  1 file changed, 68 insertions(+), 69 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v3 08/33] block/nbd: drop thr->state

2021-04-27 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:46AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We don't need all these states. The code refactored to use two boolean
> variables looks simpler.

Indeed.

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 125 ++--
>  1 file changed, 34 insertions(+), 91 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index e1f39eda6c..2b26a033a4 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -66,24 +66,6 @@ typedef enum NBDClientState {
>  NBD_CLIENT_QUIT
>  } NBDClientState;
>  
> -typedef enum NBDConnectThreadState {
> -/* No thread, no pending results */
> -CONNECT_THREAD_NONE,
> -
> -/* Thread is running, no results for now */
> -CONNECT_THREAD_RUNNING,
> -
> -/*
> - * Thread is running, but requestor exited. Thread should close
> - * the new socket and free the connect state on exit.
> - */
> -CONNECT_THREAD_RUNNING_DETACHED,
> -
> -/* Thread finished, results are stored in a state */
> -CONNECT_THREAD_FAIL,
> -CONNECT_THREAD_SUCCESS
> -} NBDConnectThreadState;
> -
>  typedef struct NBDConnectThread {
>  /* Initialization constants */
>  SocketAddress *saddr; /* address to connect to */
> @@ -97,7 +79,8 @@ typedef struct NBDConnectThread {
>  
>  QemuMutex mutex;
>  /* All further fields are protected by mutex */
> -NBDConnectThreadState state; /* current state of the thread */
> +bool running; /* thread is running now */
> +bool detached; /* thread is detached and should cleanup the state */
>  Coroutine *wait_co; /* nbd_co_establish_connection() wait in yield() */
>  } NBDConnectThread;
>  
> @@ -147,17 +130,17 @@ static void nbd_clear_bdrvstate(BlockDriverState *bs)
>  {
>  BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
>  NBDConnectThread *thr = s->connect_thread;
> -bool thr_running;
> +bool do_free;
>  
>  qemu_mutex_lock(&thr->mutex);
> -thr_running = thr->state == CONNECT_THREAD_RUNNING;
> -if (thr_running) {
> -thr->state = CONNECT_THREAD_RUNNING_DETACHED;
> +if (thr->running) {
> +thr->detached = true;
>  }
> +do_free = !thr->running && !thr->detached;

This is redundant.  You can unconditionally set ->detached and only
depend on ->running for the rest of this function.
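I.e. something like this (untested):

    qemu_mutex_lock(&thr->mutex);
    thr->detached = true;
    do_free = !thr->running;
    qemu_mutex_unlock(&thr->mutex);

    /* the runaway thread will clean it up itself */
    if (do_free) {
        nbd_free_connect_thread(thr);
    }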

>  qemu_mutex_unlock(&thr->mutex);
>  
>  /* the runaway thread will clean it up itself */
> -if (!thr_running) {
> +if (do_free) {
>  nbd_free_connect_thread(thr);
>  }
>  
> @@ -373,7 +356,6 @@ static void nbd_init_connect_thread(BDRVNBDState *s)
>  
>  *s->connect_thread = (NBDConnectThread) {
>  .saddr = QAPI_CLONE(SocketAddress, s->saddr),
> -.state = CONNECT_THREAD_NONE,
>  };
>  
>  qemu_mutex_init(&s->connect_thread->mutex);
> @@ -408,20 +390,13 @@ static void *connect_thread_func(void *opaque)
>  
>  qemu_mutex_lock(&thr->mutex);
>  
> -switch (thr->state) {
> -case CONNECT_THREAD_RUNNING:
> -thr->state = ret < 0 ? CONNECT_THREAD_FAIL : CONNECT_THREAD_SUCCESS;
> -if (thr->wait_co) {
> -aio_co_schedule(NULL, thr->wait_co);
> -thr->wait_co = NULL;
> -}
> -break;
> -case CONNECT_THREAD_RUNNING_DETACHED:
> -do_free = true;
> -break;
> -default:
> -abort();
> +assert(thr->running);
> +thr->running = false;
> +if (thr->wait_co) {
> +aio_co_schedule(NULL, thr->wait_co);
> +thr->wait_co = NULL;
>  }
> +do_free = thr->detached;
>  
>  qemu_mutex_unlock(&thr->mutex);
>  
> @@ -435,36 +410,24 @@ static void *connect_thread_func(void *opaque)
>  static int coroutine_fn
>  nbd_co_establish_connection(BlockDriverState *bs, Error **errp)
>  {
> -int ret;
>  QemuThread thread;
>  BDRVNBDState *s = bs->opaque;
>  NBDConnectThread *thr = s->connect_thread;
>  
> +assert(!s->sioc);
> +
>  qemu_mutex_lock(&thr->mutex);
>  
> -switch (thr->state) {
> -case CONNECT_THREAD_FAIL:
> -case CONNECT_THREAD_NONE:
> +if (!thr->running) {
> +if (thr->sioc) {
> +/* Previous attempt finally succeeded in background */
> +goto out;
> +}
> +thr->running = true;
>  error_free(thr->err);
>  thr->err = NULL;
> -thr->state = CONNECT_THREAD_RUNNING;
>  qemu_thread_create(&thread, "nbd-connect",
> connect_thread_func, thr, QEMU_THREAD_DETACHED);
> -break;
> -case CONNECT_THREAD_SUCCESS:
> -/* Previous attempt finally succeeded in background */
> -thr->state = CONNECT_THREAD_NONE;
> -s->sioc = thr->sioc;
> -thr->sioc = NULL;
> -yank_register_function(BLOCKDEV_YANK_INSTANCE(bs->node_name),
> -   nbd_yank, bs);
> -qemu_mutex_unlock(&thr->mutex);
> -return 0;
> -case CONNECT_THREAD_RUNNING:
> -/* Already running, will 

Re: [PATCH v3 07/33] block/nbd: simplify waking of nbd_co_establish_connection()

2021-04-27 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:45AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Instead of connect_bh, bh_ctx and wait_connect fields we can live with
> only one link to waiting coroutine, protected by mutex.
> 
> So new logic is:
> 
> nbd_co_establish_connection() sets wait_co under mutex, release the
> mutex and do yield(). Note, that wait_co may be scheduled by thread
> immediately after unlocking the mutex. Still, in main thread (or
> iothread) we'll not reach the code for entering the coroutine until the
> yield() so we are safe.
> 
> Both connect_thread_func() and nbd_co_establish_connection_cancel() do
> the following to handle wait_co:
> 
> Under mutex, if thr->wait_co is not NULL, call aio_co_wake() (which
> never tries to acquire aio context since previous commit, so we are
> safe to do it under thr->mutex) and set thr->wait_co to NULL.
> This way we protect ourselves of scheduling it twice.
> 
> Also this commit make nbd_co_establish_connection() less connected to
> bs (we have generic pointer to the coroutine, not use s->connection_co
> directly). So, we are on the way of splitting connection API out of
> nbd.c (which is overcomplicated now).
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 49 +
>  1 file changed, 9 insertions(+), 40 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index d67556c7ee..e1f39eda6c 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -87,12 +87,6 @@ typedef enum NBDConnectThreadState {
>  typedef struct NBDConnectThread {
>  /* Initialization constants */
>  SocketAddress *saddr; /* address to connect to */
> -/*
> - * Bottom half to schedule on completion. Scheduled only if bh_ctx is not
> - * NULL
> - */
> -QEMUBHFunc *bh_func;
> -void *bh_opaque;
>  
>  /*
>   * Result of last attempt. Valid in FAIL and SUCCESS states.
> @@ -101,10 +95,10 @@ typedef struct NBDConnectThread {
>  QIOChannelSocket *sioc;
>  Error *err;
>  
> -/* state and bh_ctx are protected by mutex */
>  QemuMutex mutex;
> +/* All further fields are protected by mutex */
>  NBDConnectThreadState state; /* current state of the thread */
> -AioContext *bh_ctx; /* where to schedule bh (NULL means don't schedule) 
> */
> +Coroutine *wait_co; /* nbd_co_establish_connection() wait in yield() */
>  } NBDConnectThread;
>  
>  typedef struct BDRVNBDState {
> @@ -138,7 +132,6 @@ typedef struct BDRVNBDState {
>  char *x_dirty_bitmap;
>  bool alloc_depth;
>  
> -bool wait_connect;
>  NBDConnectThread *connect_thread;
>  } BDRVNBDState;
>  
> @@ -374,15 +367,6 @@ static bool nbd_client_connecting_wait(BDRVNBDState *s)
>  return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT;
>  }
>  
> -static void connect_bh(void *opaque)
> -{
> -BDRVNBDState *state = opaque;
> -
> -assert(state->wait_connect);
> -state->wait_connect = false;
> -aio_co_wake(state->connection_co);
> -}
> -
>  static void nbd_init_connect_thread(BDRVNBDState *s)
>  {
>  s->connect_thread = g_new(NBDConnectThread, 1);
> @@ -390,8 +374,6 @@ static void nbd_init_connect_thread(BDRVNBDState *s)
>  *s->connect_thread = (NBDConnectThread) {
>  .saddr = QAPI_CLONE(SocketAddress, s->saddr),
>  .state = CONNECT_THREAD_NONE,
> -.bh_func = connect_bh,
> -.bh_opaque = s,
>  };
>  
>  qemu_mutex_init(&s->connect_thread->mutex);
> @@ -429,11 +411,9 @@ static void *connect_thread_func(void *opaque)
>  switch (thr->state) {
>  case CONNECT_THREAD_RUNNING:
>  thr->state = ret < 0 ? CONNECT_THREAD_FAIL : CONNECT_THREAD_SUCCESS;
> -if (thr->bh_ctx) {
> -aio_bh_schedule_oneshot(thr->bh_ctx, thr->bh_func, 
> thr->bh_opaque);
> -
> -/* play safe, don't reuse bh_ctx on further connection attempts 
> */
> -thr->bh_ctx = NULL;
> +if (thr->wait_co) {
> +aio_co_schedule(NULL, thr->wait_co);
> +thr->wait_co = NULL;
>  }
>  break;
>  case CONNECT_THREAD_RUNNING_DETACHED:
> @@ -487,20 +467,14 @@ nbd_co_establish_connection(BlockDriverState *bs, Error 
> **errp)
>  abort();
>  }
>  
> -thr->bh_ctx = qemu_get_current_aio_context();
> +thr->wait_co = qemu_coroutine_self();
>  
>  qemu_mutex_unlock(&thr->mutex);
>  
> -
>  /*
>   * We are going to wait for connect-thread finish, but
>   * nbd_client_co_drain_begin() can interrupt.
> - *
> - * Note that wait_connect variable is not visible for connect-thread. It
> - * doesn't need mutex protection, it used only inside home aio context of
> - * bs.
>   */
> -s->wait_connect = true;
>  qemu_coroutine_yield();
>  
>  qemu_mutex_lock(&thr->mutex);
> @@ -555,24 +529,19 @@ static void 
> nbd_co_establish_connection_cancel(BlockDriverState *bs)
>  {
>  BDRVNBDState *s = bs->opaque;
>  NBDConnectThread *thr = 

Re: [PATCH v3 06/33] util/async: aio_co_schedule(): support reschedule in same ctx

2021-04-23 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:44AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the following patch we want to call wake coroutine from thread.
> And it doesn't work with aio_co_wake:
> Assume we have no iothreads.
> Assume we have a coroutine A, which waits in the yield point for
> external aio_co_wake(), and no progress can be done until it happen.
> Main thread is in blocking aio_poll() (for example, in blk_read()).
> 
> Now, in a separate thread we do aio_co_wake(). It calls  aio_co_enter(),
> which goes through last "else" branch and do aio_context_acquire(ctx).
> 
> Now we have a deadlock, as aio_poll() will not release the context lock
> until some progress is done, and progress can't be done until
> aio_co_wake() wake the coroutine A. And it can't because it wait for
> aio_context_acquire().
> 
> Still, aio_co_schedule() works well in parallel with blocking
> aio_poll(). So we want use it. Let's add a possibility of rescheduling
> coroutine in same ctx where it was yield'ed.
> 
> Fetch co->ctx in same way as it is done in aio_co_wake().
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  include/block/aio.h | 2 +-
>  util/async.c| 8 
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 5f342267d5..744b695525 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -643,7 +643,7 @@ static inline bool aio_node_check(AioContext *ctx, bool 
> is_external)
>  
>  /**
>   * aio_co_schedule:
> - * @ctx: the aio context
> + * @ctx: the aio context, if NULL, the current ctx of @co will be used.
>   * @co: the coroutine
>   *
>   * Start a coroutine on a remote AioContext.
> diff --git a/util/async.c b/util/async.c
> index 674dbefb7c..750be555c6 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -545,6 +545,14 @@ fail:
>  
>  void aio_co_schedule(AioContext *ctx, Coroutine *co)
>  {
> +if (!ctx) {
> +/*
> + * Read coroutine before co->ctx.  Matches smp_wmb in
> + * qemu_coroutine_enter.
> + */
> +smp_read_barrier_depends();
> +ctx = qatomic_read(&co->ctx);
> +}

I'd rather not extend this interface, but add a new one on top.  And
document how it's different from aio_co_wake().
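E.g. a small helper on top, leaving aio_co_schedule() itself alone (the name
and placement are just for illustration):

    /*
     * Wake @co by scheduling it on the context it was last running on.
     * Unlike aio_co_wake(), this never tries to acquire that context, so
     * it is safe to call from a thread while that context may be stuck
     * in a blocking aio_poll().
     */
    void aio_co_schedule_self_ctx(Coroutine *co)
    {
        /* Read coroutine before co->ctx.  Matches smp_wmb in
         * qemu_coroutine_enter. */
        smp_read_barrier_depends();
        aio_co_schedule(qatomic_read(&co->ctx), co);
    }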

Roman.

>  trace_aio_co_schedule(ctx, co);
>  const char *scheduled = qatomic_cmpxchg(&co->scheduled, NULL,
> __func__);
> -- 
> 2.29.2
> 



Re: [PATCH v3 04/33] block/nbd: nbd_client_handshake(): fix leak of s->ioc

2021-04-22 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:42AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Roman Kagan 



Re: [PATCH v3 02/33] block/nbd: fix how state is cleared on nbd_open() failure paths

2021-04-22 Thread Roman Kagan
On Thu, Apr 22, 2021 at 01:27:22AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 21.04.2021 17:00, Roman Kagan wrote:
> > On Fri, Apr 16, 2021 at 11:08:40AM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > @@ -2305,20 +2301,23 @@ static int nbd_open(BlockDriverState *bs, QDict 
> > > *options, int flags,
> > >   return -EEXIST;
> > >   }
> > > +ret = nbd_process_options(bs, options, errp);
> > > +if (ret < 0) {
> > > +goto fail;
> > > +}
> > > +
> > >   /*
> > >* establish TCP connection, return error if it fails
> > >* TODO: Configurable retry-until-timeout behaviour.
> > >*/
> > >   if (nbd_establish_connection(bs, s->saddr, errp) < 0) {
> > > -yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> > > -return -ECONNREFUSED;
> > > +ret = -ECONNREFUSED;
> > > +goto fail;
> > >   }
> > >   ret = nbd_client_handshake(bs, errp);
> > Not that this was introduced by this patch, but once you're at it:
> > AFAICT nbd_client_handshake() calls yank_unregister_instance() on some
> > error path(s); I assume this needs to go too, otherwise it's called
> > twice (and asserts).
> > 
> > Roman.
> > 
> 
> No, nbd_client_handshake() only calls yank_unregister_function(), not 
> instance.

Indeed.  Sorry for confusion.

Reviewed-by: Roman Kagan 



Re: [PATCH v3 02/33] block/nbd: fix how state is cleared on nbd_open() failure paths

2021-04-21 Thread Roman Kagan
On Fri, Apr 16, 2021 at 11:08:40AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We have two "return error" paths in nbd_open() after
> nbd_process_options(). Actually we should call nbd_clear_bdrvstate()
> on these paths. Interesting that nbd_process_options() calls
> nbd_clear_bdrvstate() by itself.
> 
> Let's fix leaks and refactor things to be more obvious:
> 
> - intialize yank at top of nbd_open()
> - move yank cleanup to nbd_clear_bdrvstate()
> - refactor nbd_open() so that all failure paths except for
>   yank-register goes through nbd_clear_bdrvstate()
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 36 ++--
>  1 file changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 739ae2941f..a407a3814b 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -152,8 +152,12 @@ static void 
> nbd_co_establish_connection_cancel(BlockDriverState *bs,
>  static int nbd_client_handshake(BlockDriverState *bs, Error **errp);
>  static void nbd_yank(void *opaque);
>  
> -static void nbd_clear_bdrvstate(BDRVNBDState *s)
> +static void nbd_clear_bdrvstate(BlockDriverState *bs)
>  {
> +BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> +
> +yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> +
>  object_unref(OBJECT(s->tlscreds));
>  qapi_free_SocketAddress(s->saddr);
>  s->saddr = NULL;
> @@ -2279,9 +2283,6 @@ static int nbd_process_options(BlockDriverState *bs, 
> QDict *options,
>  ret = 0;
>  
>   error:
> -if (ret < 0) {
> -nbd_clear_bdrvstate(s);
> -}
>  qemu_opts_del(opts);
>  return ret;
>  }
> @@ -2292,11 +2293,6 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  int ret;
>  BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
>  
> -ret = nbd_process_options(bs, options, errp);
> -if (ret < 0) {
> -return ret;
> -}
> -
>  s->bs = bs;
>  qemu_co_mutex_init(&s->send_mutex);
>  qemu_co_queue_init(&s->free_sema);
> @@ -2305,20 +2301,23 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  return -EEXIST;
>  }
>  
> +ret = nbd_process_options(bs, options, errp);
> +if (ret < 0) {
> +goto fail;
> +}
> +
>  /*
>   * establish TCP connection, return error if it fails
>   * TODO: Configurable retry-until-timeout behaviour.
>   */
>  if (nbd_establish_connection(bs, s->saddr, errp) < 0) {
> -yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> -return -ECONNREFUSED;
> +ret = -ECONNREFUSED;
> +goto fail;
>  }
>  
>  ret = nbd_client_handshake(bs, errp);

Not that this was introduced by this patch, but once you're at it:
AFAICT nbd_client_handshake() calls yank_unregister_instance() on some
error path(s); I assume this needs to go too, otherwise it's called
twice (and asserts).

Roman.

>  if (ret < 0) {
> -yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> -nbd_clear_bdrvstate(s);
> -return ret;
> +goto fail;
>  }
>  /* successfully connected */
>  s->state = NBD_CLIENT_CONNECTED;
> @@ -2330,6 +2329,10 @@ static int nbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  aio_co_schedule(bdrv_get_aio_context(bs), s->connection_co);
>  
>  return 0;
> +
> +fail:
> +nbd_clear_bdrvstate(bs);
> +return ret;
>  }
>  
>  static int nbd_co_flush(BlockDriverState *bs)
> @@ -2373,11 +2376,8 @@ static void nbd_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  
>  static void nbd_close(BlockDriverState *bs)
>  {
> -BDRVNBDState *s = bs->opaque;
> -
>  nbd_client_close(bs);
> -yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
> -nbd_clear_bdrvstate(s);
> +nbd_clear_bdrvstate(bs);
>  }
>  
>  /*



Re: [PATCH for-6.0] block/nbd: fix possible use after free of s->connect_thread

2021-04-12 Thread Roman Kagan
On Tue, Apr 06, 2021 at 06:51:14PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> If on nbd_close() we detach the thread (in
> nbd_co_establish_connection_cancel() thr->state becomes
> CONNECT_THREAD_RUNNING_DETACHED), after that point we should not use
> s->connect_thread (which is set to NULL), as running thread may free it
> at any time.
> 
> Still nbd_co_establish_connection() does exactly this: it saves
> s->connect_thread to local variable (just for better code style) and
> use it even after yield point, when thread may be already detached.
> 
> Fix that. Also check thr to be non-NULL on
> nbd_co_establish_connection() start for safety.
> 
> After this patch "case CONNECT_THREAD_RUNNING_DETACHED" becomes
> impossible in the second switch in nbd_co_establish_connection().
> Still, don't add extra abort() just before the release. If it somehow
> possible to reach this "case:" it won't hurt. Anyway, good refactoring
> of all this reconnect mess will come soon.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> Hi all! I faced a crash, just running 277 iotest in a loop. I can't
> reproduce it on master, it reproduces only on my branch with nbd
> reconnect refactorings.
> 
> Still, it seems very possible that it may crash under some conditions.
> So I propose this patch for 6.0. It's written so that it's obvious that
> it will not hurt:
> 
>  pre-patch, on first hunk we'll just crash if thr is NULL,
>  on second hunk it's safe to return -1, and using thr when
>  s->connect_thread is already zeroed is obviously wrong.
> 
>  block/nbd.c | 11 +++
>  1 file changed, 11 insertions(+)

Can we please get it merged in 6.0 as it's a genuine crasher fix?

Reviewed-by: Roman Kagan 



Re: [PATCH for-6.0 2/2] block/nbd: ensure ->connection_thread is always valid

2021-04-10 Thread Roman Kagan
On Sat, Apr 10, 2021 at 12:56:34PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 10.04.2021 11:38, Vladimir Sementsov-Ogievskiy wrote:
> > 10.04.2021 11:06, Vladimir Sementsov-Ogievskiy wrote:
> > > 09.04.2021 19:04, Roman Kagan wrote:
> > > > Simplify lifetime management of BDRVNBDState->connection_thread by
> > > > delaying the possible cleanup of it until the BDRVNBDState itself goes
> > > > away.
> > > > 
> > > > This also fixes possible use-after-free in nbd_co_establish_connection
> > > > when it races with nbd_co_establish_connection_cancel.
> > > > 
> > > > Signed-off-by: Roman Kagan
> > > 
> > > Reviewed-by: Vladimir Sementsov-Ogievskiy 
> > > 
> > 
> > Ha stop, it crashes iotest 51, as nbd_clear_bdrvstate is called also from 
> > nbd_process_options.
> > 
> > And this shows that we also do wrong thing when simply return from two ifs 
> > pre-patch (and one after-patch). Yes, after successful nbd_process options 
> > we should call nbd_clear_bdrvstate() on failure path.

The test didn't crash for me somehow, but you're right, there's a bug here.

> And also it shows that patch is more difficult than it seems. I still think 
> that for 6.0 we should take a simple use-after-free-only fix, and don't care 
> about anything else.

I agree it turned out more complex than appropriate for 6.0.  Let's
proceed with the one you've posted.

Thanks,
Roman.



[PATCH for-6.0 2/2] block/nbd: ensure ->connection_thread is always valid

2021-04-09 Thread Roman Kagan
Simplify lifetime management of BDRVNBDState->connection_thread by
delaying the possible cleanup of it until the BDRVNBDState itself goes
away.

This also fixes possible use-after-free in nbd_co_establish_connection
when it races with nbd_co_establish_connection_cancel.

Signed-off-by: Roman Kagan 
---
This is a more future-proof version of the fix for use-after-free but
less intrusive than Vladimir's series so that it can go in 6.0.

 block/nbd.c | 58 ++---
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index d86df3afcb..6e3879c0c5 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -144,16 +144,31 @@ typedef struct BDRVNBDState {
 NBDConnectThread *connect_thread;
 } BDRVNBDState;
 
+static void nbd_free_connect_thread(NBDConnectThread *thr);
 static int nbd_establish_connection(BlockDriverState *bs, SocketAddress *saddr,
 Error **errp);
 static int nbd_co_establish_connection(BlockDriverState *bs, Error **errp);
-static void nbd_co_establish_connection_cancel(BlockDriverState *bs,
-   bool detach);
+static void nbd_co_establish_connection_cancel(BlockDriverState *bs);
 static int nbd_client_handshake(BlockDriverState *bs, Error **errp);
 static void nbd_yank(void *opaque);
 
 static void nbd_clear_bdrvstate(BDRVNBDState *s)
 {
+NBDConnectThread *thr = s->connect_thread;
+bool thr_running;
+
+qemu_mutex_lock(&thr->mutex);
+thr_running = thr->state == CONNECT_THREAD_RUNNING;
+if (thr_running) {
+thr->state = CONNECT_THREAD_RUNNING_DETACHED;
+}
+qemu_mutex_unlock(&thr->mutex);
+
+/* the runaway thread will clean it up itself */
+if (!thr_running) {
+nbd_free_connect_thread(thr);
+}
+
 object_unref(OBJECT(s->tlscreds));
 qapi_free_SocketAddress(s->saddr);
 s->saddr = NULL;
@@ -293,7 +308,7 @@ static void coroutine_fn 
nbd_client_co_drain_begin(BlockDriverState *bs)
 qemu_co_sleep_wake(s->connection_co_sleep_ns_state);
 }
 
-nbd_co_establish_connection_cancel(bs, false);
+nbd_co_establish_connection_cancel(bs);
 
 reconnect_delay_timer_del(s);
 
@@ -333,7 +348,7 @@ static void nbd_teardown_connection(BlockDriverState *bs)
 if (s->connection_co_sleep_ns_state) {
 qemu_co_sleep_wake(s->connection_co_sleep_ns_state);
 }
-nbd_co_establish_connection_cancel(bs, true);
+nbd_co_establish_connection_cancel(bs);
 }
 if (qemu_in_coroutine()) {
 s->teardown_co = qemu_coroutine_self();
@@ -534,18 +549,12 @@ nbd_co_establish_connection(BlockDriverState *bs, Error 
**errp)
  * nbd_co_establish_connection_cancel
  * Cancel nbd_co_establish_connection asynchronously: it will finish soon, to
  * allow drained section to begin.
- *
- * If detach is true, also cleanup the state (or if thread is running, move it
- * to CONNECT_THREAD_RUNNING_DETACHED state). s->connect_thread becomes NULL if
- * detach is true.
  */
-static void nbd_co_establish_connection_cancel(BlockDriverState *bs,
-   bool detach)
+static void nbd_co_establish_connection_cancel(BlockDriverState *bs)
 {
 BDRVNBDState *s = bs->opaque;
 NBDConnectThread *thr = s->connect_thread;
 bool wake = false;
-bool do_free = false;
 
 qemu_mutex_lock(&thr->mutex);
 
@@ -556,21 +565,10 @@ static void 
nbd_co_establish_connection_cancel(BlockDriverState *bs,
 s->wait_connect = false;
 wake = true;
 }
-if (detach) {
-thr->state = CONNECT_THREAD_RUNNING_DETACHED;
-s->connect_thread = NULL;
-}
-} else if (detach) {
-do_free = true;
 }
 
 qemu_mutex_unlock(&thr->mutex);
 
-if (do_free) {
-nbd_free_connect_thread(thr);
-s->connect_thread = NULL;
-}
-
 if (wake) {
 aio_co_wake(s->connection_co);
 }
@@ -2294,31 +2292,33 @@ static int nbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 return -EEXIST;
 }
 
+nbd_init_connect_thread(s);
+
 /*
  * establish TCP connection, return error if it fails
  * TODO: Configurable retry-until-timeout behaviour.
  */
 if (nbd_establish_connection(bs, s->saddr, errp) < 0) {
-yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
-return -ECONNREFUSED;
+ret = -ECONNREFUSED;
+goto fail;
 }
 
 ret = nbd_client_handshake(bs, errp);
 if (ret < 0) {
-yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
-nbd_clear_bdrvstate(s);
-return ret;
+goto fail;
 }
 /* successfully connected */
 s->state = NBD_CLIENT_CONNECTED;
 
-nbd_init_connect_thread(s);
-
 s->connection_co = qemu_coroutine_create(nbd_connectio

[PATCH for-6.0 1/2] block/nbd: fix channel object leak

2021-04-09 Thread Roman Kagan
nbd_free_connect_thread leaks the channel object if it hasn't been
stolen.

Unref it and fix the leak.

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/nbd.c b/block/nbd.c
index c26dc5a54f..d86df3afcb 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -385,6 +385,7 @@ static void nbd_free_connect_thread(NBDConnectThread *thr)
 {
 if (thr->sioc) {
 qio_channel_close(QIO_CHANNEL(thr->sioc), NULL);
+object_unref(OBJECT(thr->sioc));
 }
 error_free(thr->err);
 qapi_free_SocketAddress(thr->saddr);
-- 
2.30.2




[PATCH for-6.0 0/2] block/nbd: assorted bugfixes

2021-04-09 Thread Roman Kagan
A couple of bugfixes to block/nbd that look appropriate for 6.0.

Roman Kagan (2):
  block/nbd: fix channel object leak
  block/nbd: ensure ->connection_thread is always valid

 block/nbd.c | 59 +++--
 1 file changed, 30 insertions(+), 29 deletions(-)

-- 
2.30.2




Re: [PATCH v2 03/10] util/async: aio_co_enter(): do aio_co_schedule in general case

2021-04-09 Thread Roman Kagan
On Thu, Apr 08, 2021 at 06:54:30PM +0300, Roman Kagan wrote:
> On Thu, Apr 08, 2021 at 05:08:20PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > With the following patch we want to call aio_co_wake() from thread.
> > And it works bad.
> > Assume we have no iothreads.
> > Assume we have a coroutine A, which waits in the yield point for external
> > aio_co_wake(), and no progress can be done until it happen.
> > Main thread is in blocking aio_poll() (for example, in blk_read()).
> > 
> > Now, in a separate thread we do aio_co_wake(). It calls  aio_co_enter(),
> > which goes through last "else" branch and do aio_context_acquire(ctx).
> > 
> > Now we have a deadlock, as aio_poll() will not release the context lock
> > until some progress is done, and progress can't be done until
> > aio_co_wake() wake the coroutine A. And it can't because it wait for
> > aio_context_acquire().
> > 
> > Still, aio_co_schedule() works well in parallel with blocking
> > aio_poll(). So let's use it in generic case and drop
> > aio_context_acquire/aio_context_release branch from aio_co_enter().
> > 
> > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > ---
> >  util/async.c | 11 ++-
> >  1 file changed, 2 insertions(+), 9 deletions(-)
> > 
> > diff --git a/util/async.c b/util/async.c
> > index 674dbefb7c..f05b883a39 100644
> > --- a/util/async.c
> > +++ b/util/async.c
> > @@ -614,19 +614,12 @@ void aio_co_wake(struct Coroutine *co)
> >  
> >  void aio_co_enter(AioContext *ctx, struct Coroutine *co)
> >  {
> > -if (ctx != qemu_get_current_aio_context()) {
> > -aio_co_schedule(ctx, co);
> > -return;
> > -}
> > -
> > -if (qemu_in_coroutine()) {
> > +if (ctx == qemu_get_current_aio_context() && qemu_in_coroutine()) {
> >  Coroutine *self = qemu_coroutine_self();
> >  assert(self != co);
>  QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
> >  } else {
> > -aio_context_acquire(ctx);
> > -qemu_aio_coroutine_enter(ctx, co);
> > -aio_context_release(ctx);
> > +aio_co_schedule(ctx, co);
> >  }
> >  }
> 
> I'm fine with the change, but I find the log message to be a bit
> confusing (although correct).  AFAICS the problem is that calling
> aio_co_enter from a thread which has no associated aio_context works
> differently compared to calling it from a proper iothread: if the target
> context was qemu_aio_context, an iothread would just schedule the
> coroutine there, while a "dumb" thread would try lock the context
> potentially resulting in a deadlock.  This patch makes "dumb" threads
> and iothreads behave identically when entering a coroutine on a foreign
> context.
> 
> You may want to rephrase the log message to that end.
> 
> Anyway
> 
> Reviewed-by: Roman Kagan 

I was too quick to reply.  Turns out this patch breaks a lot of stuff;
apparently the original behavior is relied upon somewhere.
In particular, in iotest 008 basic checks with '--aio threads' fail due
to the device being closed before the operation is performed, so the
latter returns with -ENOMEDIUM:

# .../qemu-io --cache writeback --aio threads -f qcow2 \
-c 'aio_read -P 0xa 0 128M' .../scratch/t.qcow2 -T blk_\*
blk_root_attach child 0x560e9fb65a70 blk 0x560e9fb647b0 bs 0x560e9fb5f4b0
blk_root_detach child 0x560e9fb65a70 blk 0x560e9fb647b0 bs 0x560e9fb5f4b0
blk_root_attach child 0x560e9fb65a70 blk 0x560e9fb4c420 bs 0x560e9fb581a0
blk_root_detach child 0x560e9fb65a70 blk 0x560e9fb4c420 bs 0x560e9fb581a0
blk_co_preadv blk 0x560e9fb4c420 bs (nil) offset 0 bytes 134217728 flags 0x0
readv failed: No medium found

Let's drop this and get back to the original scheme with wakeup via BH.
It'll look just as nice, but won't touch the generic infrastructure with
unpredictable consequences.

Thanks,
Roman.



Re: [PATCH v2 00/10] block/nbd: move connection code to separate file

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:17PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> This substitutes "[PATCH 00/14] nbd: move reconnect-thread to separate file"
> Supersedes: <20210407104637.36033-1-vsement...@virtuozzo.com>
> 
> I want to simplify block/nbd.c which is overcomplicated now. First step
> is splitting out what could be split.
> 
> These series creates new file nbd/client-connection.c and part of
> block/nbd.c is refactored and moved.
> 
> v2 is mostly rewritten. I decided move larger part, otherwise it doesn't
> make real sense.
> 
> Note also that v2 is based on master. Patch 01 actually solves same
> problem as
> "[PATCH for-6.0] block/nbd: fix possible use after free of s->connect_thread" 
> [*]
> in a smarter way. So, if [*] goes first, this will be rebased to undo
> [*].
> 
> Vladimir Sementsov-Ogievskiy (10):
>   block/nbd: introduce NBDConnectThread reference counter
>   block/nbd: BDRVNBDState: drop unused connect_err and connect_status
>   util/async: aio_co_enter(): do aio_co_schedule in general case
>   block/nbd: simplify waking of nbd_co_establish_connection()
>   block/nbd: drop thr->state
>   block/nbd: bs-independent interface for nbd_co_establish_connection()
>   block/nbd: make nbd_co_establish_connection_cancel() bs-independent
>   block/nbd: rename NBDConnectThread to NBDClientConnection
>   block/nbd: introduce nbd_client_connection_new()
>   nbd: move connection code from block/nbd to nbd/client-connection
> 
>  include/block/nbd.h |  11 ++
>  block/nbd.c | 288 ++--
>  nbd/client-connection.c | 192 +++
>  util/async.c|  11 +-
>  nbd/meson.build |   1 +
>  5 files changed, 218 insertions(+), 285 deletions(-)
>  create mode 100644 nbd/client-connection.c

I think this is a nice cleanup overall, and makes the logic in
block/nbd.c much easier to reason about.

I guess it's 6.1 material though, as it looks somewhat too big for 6.0,
and the only serious bug it actually fixes can be addressed with the
band-aid mentioned above.

The problem I originally came across, that of the requests being
canceled on drain despite reconnect, still remains, but I think the fix
for it should build up on this series (and thus probably wait till after
6.0).

Thanks,
Roman.



Re: [PATCH v2 10/10] nbd: move connection code from block/nbd to nbd/client-connection

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:27PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We now have bs-independent connection API, which consists of four
> functions:
> 
>   nbd_client_connection_new()
>   nbd_client_connection_unref()
>   nbd_co_establish_connection()
>   nbd_co_establish_connection_cancel()
> 
> Move them to a separate file together with NBDClientConnection
> structure which becomes private to the new API.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> Hmm. I keep only Virtuozzo's copyright in a new file, as actually I've
> developed nbd-reconnection code. Still probably safer to save all
> copyrights. Let me now if you think so and I'll add them.

Not my call.

>  include/block/nbd.h |  11 +++
>  block/nbd.c | 167 --
>  nbd/client-connection.c | 192 
>  nbd/meson.build |   1 +
>  4 files changed, 204 insertions(+), 167 deletions(-)
>  create mode 100644 nbd/client-connection.c

Reviewed-by: Roman Kagan 



Re: [PATCH v2 09/10] block/nbd: introduce nbd_client_connection_new()

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:26PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> This is the last step of creating bs-independing nbd connection

s/independing/independent/

> interface. With next commit we can finally move it to separate file.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v2 08/10] block/nbd: rename NBDConnectThread to NBDClientConnection

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:25PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We are going to move connection code to own file and want clear names
> and APIs.
> 
> The structure is shared between user and (possibly) several runs of
> connect-thread. So it's wrong to call it "thread". Let's rename to
> something more generic.
> 
> Appropriately rename connect_thread and thr variables to conn.
> 
> connect_thread_state_unref() function gets new appropriate name too
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 127 ++--
>  1 file changed, 63 insertions(+), 64 deletions(-)

[To other reviewers: in addition to renaming there's one blank line
removed, hence the difference between (+) and (-)]

Reviewed-by: Roman Kagan 



Re: [PATCH v2 07/10] block/nbd: make nbd_co_establish_connection_cancel() bs-independent

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:24PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> nbd_co_establish_connection_cancel() actually needs only pointer to
> NBDConnectThread. So, make it clean.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 16 +++-
>  1 file changed, 7 insertions(+), 9 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v2 06/10] block/nbd: bs-independent interface for nbd_co_establish_connection()

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:23PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We are going to split connection code to separate file. Now we are
> ready to give nbd_co_establish_connection() clean and bs-independent
> interface.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 49 +++--
>  1 file changed, 31 insertions(+), 18 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v2 05/10] block/nbd: drop thr->state

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:22PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Actually, the only bit of information we need is "is thread running or
> not". We don't need all these states. So, instead of thr->state add
> boolean variable thr->running and refactor the code.

There's a certain redundancy between thr->refcnt and thr->running.  I
wonder if it makes sense to exploit it.

Can be done later if deemed useful.

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 103 ++--
>  1 file changed, 27 insertions(+), 76 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v2 04/10] block/nbd: simplify waking of nbd_co_establish_connection()

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:21PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Instead of connect_bh, bh_ctx and wait_connect fields we can live with
> only one link to waiting coroutine, protected by mutex.
> 
> So new logic is:
> 
> nbd_co_establish_connection() sets wait_co under mutex, release the
> mutex and do yield(). Note, that wait_co may be scheduled by thread
> immediately after unlocking the mutex. Still, in main thread (or
> iothread) we'll not reach the code for entering the coroutine until the
> yield() so we are safe.
> 
> Both connect_thread_func() and nbd_co_establish_connection_cancel() do
> the following to handle wait_co:
> 
> Under mutex, if thr->wait_co is not NULL, call aio_co_wake() (which
> never tries to acquire aio context since previous commit, so we are
> safe to do it under thr->mutex) and set thr->wait_co to NULL.
> This way we protect ourselves of scheduling it twice.
> 
> Also this commit make nbd_co_establish_connection() less connected to
> bs (we have generic pointer to the coroutine, not use s->connection_co
> directly). So, we are on the way of splitting connection API out of
> nbd.c (which is overcomplicated now).
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 49 +--------
>  1 file changed, 9 insertions(+), 40 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v2 03/10] util/async: aio_co_enter(): do aio_co_schedule in general case

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:20PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the following patch we want to call aio_co_wake() from thread.
> And it works bad.
> Assume we have no iothreads.
> Assume we have a coroutine A, which waits in the yield point for external
> aio_co_wake(), and no progress can be done until it happen.
> Main thread is in blocking aio_poll() (for example, in blk_read()).
> 
> Now, in a separate thread we do aio_co_wake(). It calls  aio_co_enter(),
> which goes through last "else" branch and do aio_context_acquire(ctx).
> 
> Now we have a deadlock, as aio_poll() will not release the context lock
> until some progress is done, and progress can't be done until
> aio_co_wake() wake the coroutine A. And it can't because it wait for
> aio_context_acquire().
> 
> Still, aio_co_schedule() works well in parallel with blocking
> aio_poll(). So let's use it in generic case and drop
> aio_context_acquire/aio_context_release branch from aio_co_enter().
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  util/async.c | 11 ++-
>  1 file changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/util/async.c b/util/async.c
> index 674dbefb7c..f05b883a39 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -614,19 +614,12 @@ void aio_co_wake(struct Coroutine *co)
>  
>  void aio_co_enter(AioContext *ctx, struct Coroutine *co)
>  {
> -if (ctx != qemu_get_current_aio_context()) {
> -aio_co_schedule(ctx, co);
> -return;
> -}
> -
> -if (qemu_in_coroutine()) {
> +if (ctx == qemu_get_current_aio_context() && qemu_in_coroutine()) {
>  Coroutine *self = qemu_coroutine_self();
>  assert(self != co);
>  QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
>  } else {
> -aio_context_acquire(ctx);
> -qemu_aio_coroutine_enter(ctx, co);
> -aio_context_release(ctx);
> +aio_co_schedule(ctx, co);
>  }
>  }

I'm fine with the change, but I find the log message to be a bit
confusing (although correct).  AFAICS the problem is that calling
aio_co_enter from a thread which has no associated aio_context works
differently compared to calling it from a proper iothread: if the target
context was qemu_aio_context, an iothread would just schedule the
coroutine there, while a "dumb" thread would try lock the context
potentially resulting in a deadlock.  This patch makes "dumb" threads
and iothreads behave identically when entering a coroutine on a foreign
context.

You may want to rephrase the log message to that end.
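
To illustrate the asymmetry (assumptions: no iothreads, the target coroutine
is parked on qemu_aio_context, the main thread sits in a blocking aio_poll()):

    static void *plain_thread_fn(void *opaque)
    {
        Coroutine *co = opaque;

        /*
         * Pre-patch this ends up in the aio_context_acquire() branch of
         * aio_co_enter() and can deadlock against the blocking aio_poll();
         * an iothread in the same situation would just have scheduled @co.
         */
        aio_co_wake(co);
        return NULL;
    }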

Anyway

Reviewed-by: Roman Kagan 



Re: [PATCH v2 02/10] block/nbd: BDRVNBDState: drop unused connect_err and connect_status

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:19PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> These fields are write-only. Drop them.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 12 ++--
>  1 file changed, 2 insertions(+), 10 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH v2 01/10] block/nbd: introduce NBDConnectThread reference counter

2021-04-08 Thread Roman Kagan
On Thu, Apr 08, 2021 at 05:08:18PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> The structure is shared between NBD BDS and connection thread. And it
> is possible the connect thread will finish after closing and releasing
> for the bs. To handle this we have a concept of
> CONNECT_THREAD_RUNNING_DETACHED state and when thread is running and
> BDS is going to be closed we don't free the structure, but instead move
> it to CONNECT_THREAD_RUNNING_DETACHED state, so that thread will free
> it.
> 
> Still more native way to solve the problem is using reference counter
> for shared structure. Let's use it. It makes code smaller and more
> readable.
> 
> New approach also makes checks in nbd_co_establish_connection()
> redundant: now we are sure that s->connect_thread is valid during the
> whole life of NBD BDS.
> 
> This also fixes possible use-after-free of s->connect_thread if
> nbd_co_establish_connection_cancel() clears it during
> nbd_co_establish_connection(), and nbd_co_establish_connection() uses
> local copy of s->connect_thread after yield point.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 62 +--------
>  1 file changed, 20 insertions(+), 42 deletions(-)

Reviewed-by: Roman Kagan 



Re: [PATCH 06/14] block/nbd: further segregation of connect-thread

2021-04-08 Thread Roman Kagan
On Wed, Apr 07, 2021 at 01:46:29PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Add personal state NBDConnectThread for connect-thread and
> nbd_connect_thread_start() function. Next step would be moving
> connect-thread to a separate file.
> 
> Note that we stop publishing thr->sioc during
> qio_channel_socket_connect_sync(). Which probably means that we can't
> early-cancel qio_channel_socket_connect_sync() call in
> nbd_free_connect_thread(). Still, when thread is detached it doesn't
> make big sense, and we have no guarantee anyway.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 57 -
>  1 file changed, 39 insertions(+), 18 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index e16c6d636a..23eb8adab1 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -85,8 +85,6 @@ typedef enum NBDConnectThreadState {
>  } NBDConnectThreadState;
>  
>  typedef struct NBDConnectCB {
> -/* Initialization constants */
> -SocketAddress *saddr; /* address to connect to */
>  /*
>   * Bottom half to schedule on completion. Scheduled only if bh_ctx is not
>   * NULL
> @@ -103,6 +101,15 @@ typedef struct NBDConnectCB {
>  AioContext *bh_ctx; /* where to schedule bh (NULL means don't schedule) 
> */
>  } NBDConnectCB;
>  
> +typedef void (*NBDConnectThreadCallback)(QIOChannelSocket *sioc, int ret,
> + void *opaque);
> +
> +typedef struct NBDConnectThread {
> +SocketAddress *saddr; /* address to connect to */
> +NBDConnectThreadCallback cb;
> +void *cb_opaque;
> +} NBDConnectThread;
> +
>  typedef struct BDRVNBDState {
>  QIOChannelSocket *sioc; /* The master data channel */
>  QIOChannel *ioc; /* The current I/O channel which may differ (eg TLS) */
> @@ -367,7 +374,6 @@ static void nbd_init_connect_thread(BDRVNBDState *s)
>  s->connect_thread = g_new(NBDConnectCB, 1);
>  
>  *s->connect_thread = (NBDConnectCB) {
> -.saddr = QAPI_CLONE(SocketAddress, s->saddr),
>  .state = CONNECT_THREAD_NONE,
>  .bh_func = connect_bh,
>  .bh_opaque = s,
> @@ -378,20 +384,18 @@ static void nbd_init_connect_thread(BDRVNBDState *s)
>  
>  static void nbd_free_connect_thread(NBDConnectCB *thr)
>  {
> -if (thr->sioc) {
> -qio_channel_close(QIO_CHANNEL(thr->sioc), NULL);
> -}

Doesn't this result in an open channel leak?  (The original code here
seems to be missing an unref, too.)

> -qapi_free_SocketAddress(thr->saddr);
>  g_free(thr);
>  }
>  
> -static void connect_thread_cb(int ret, void *opaque)
> +static void connect_thread_cb(QIOChannelSocket *sioc, int ret, void *opaque)
>  {
>  NBDConnectCB *thr = opaque;
>  bool do_free = false;
>  
>  qemu_mutex_lock(&thr->mutex);
>  
> +thr->sioc = sioc;
> +
>  switch (thr->state) {
>  case CONNECT_THREAD_RUNNING:
>  thr->state = ret < 0 ? CONNECT_THREAD_FAIL : CONNECT_THREAD_SUCCESS;
> @@ -418,27 +422,45 @@ static void connect_thread_cb(int ret, void *opaque)
>  
>  static void *connect_thread_func(void *opaque)
>  {
> -NBDConnectCB *thr = opaque;
> +NBDConnectThread *thr = opaque;
>  int ret;
> +QIOChannelSocket *sioc = qio_channel_socket_new();
>  
> -thr->sioc = qio_channel_socket_new();
> -
> -ret = qio_channel_socket_connect_sync(thr->sioc, thr->saddr, NULL);
> +ret = qio_channel_socket_connect_sync(sioc, thr->saddr, NULL);
>  if (ret < 0) {
> -object_unref(OBJECT(thr->sioc));
> -thr->sioc = NULL;
> +object_unref(OBJECT(sioc));
> +sioc = NULL;
>  }
>  
> -connect_thread_cb(ret, thr);
> +thr->cb(sioc, ret, thr->cb_opaque);
> +
> +qapi_free_SocketAddress(thr->saddr);
> +g_free(thr);
>  
>  return NULL;
>  }
>  
> +static void nbd_connect_thread_start(const SocketAddress *saddr,
> + NBDConnectThreadCallback cb,
> + void *cb_opaque)
> +{
> +QemuThread thread;
> +NBDConnectThread *thr = g_new(NBDConnectThread, 1);
> +
> +*thr = (NBDConnectThread) {
> +.saddr = QAPI_CLONE(SocketAddress, saddr),
> +.cb = cb,
> +.cb_opaque = cb_opaque,
> +};
> +
> +qemu_thread_create(&thread, "nbd-connect",
> +   connect_thread_func, thr, QEMU_THREAD_DETACHED);
> +}
> +
>  static int coroutine_fn
>  nbd_co_establish_connection(BlockDriverState *bs)
>  {
>  int ret;
> -QemuThread thread;
>  BDRVNBDState *s = bs->opaque;
>  NBDConnectCB *thr = s->connect_thread;
>  
> @@ -448,8 +470,7 @@ nbd_co_establish_connection(BlockDriverState *bs)
>  case CONNECT_THREAD_FAIL:
>  case CONNECT_THREAD_NONE:
>  thr->state = CONNECT_THREAD_RUNNING;
> -qemu_thread_create(&thread, "nbd-connect",
> -   connect_thread_func, thr, QEMU_THREAD_DETACHED);
> +nbd_connect_thread_start(s->saddr, connect_thread_cb, 

Re: [PATCH 03/14] block/nbd: drop unused NBDConnectThread::err field

2021-04-07 Thread Roman Kagan
On Wed, Apr 07, 2021 at 01:46:26PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> The field is used only to free it. Let's drop it for now for
> simplicity.

Well, it's *now* (after your patch 2) only used to free it.  This conceals
the reconnect process even further from the user: the client may be
struggling to re-establish the connection, but you'll never know when
looking at the logs.

As I wrote in my comment to patch 2 I believe these errors need to be
logged.  Whether to do it directly in the reconnection thread or punt it
to the spawner of it is another matter; I'd personally prefer to do the
logging in the original bdrv context, as that makes it easier to tag the
log messages with an indication of which connection is suffering.
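
Something as simple as this at the point where the failure surfaces in the
bdrv context would do (placement and wording are hypothetical):

    if (local_err) {
        /* tag the message with the node the connection belongs to */
        error_reportf_err(local_err, "nbd (%s): connection attempt failed: ",
                          bs->node_name);
        local_err = NULL;
    }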

Thanks,
Roman.



Re: [PATCH 02/14] block/nbd: nbd_co_establish_connection(): drop unused errp

2021-04-07 Thread Roman Kagan
On Wed, Apr 07, 2021 at 01:46:25PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We are going to refactor connection logic to make it more
> understandable. Every bit that we can simplify in advance will help.
> Drop errp for now, it's unused anyway. We'll probably reimplement it in
> future.

Although I agree that this passing errors around is a bit of an
overkill, my problem with NBD client is that it's notoriously silent
about problems it experiences, and those errors never pop up in logs.

Given that these errors are not guest-triggerable, and probably indicate
serious problems at the infrastructure level, instead of endlessly
passing them around (as in the code ATM) or dropping them on the floor
(as you propose in the patch), I'd much rather log them immediately where
they are encountered.

I have a patch to that end, I'll try to port it on top of your series.

Thanks,
Roman.



Re: [PATCH 01/14] block/nbd: BDRVNBDState: drop unused connect_err

2021-04-07 Thread Roman Kagan
On Wed, Apr 07, 2021 at 01:46:24PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> The field is actually unused. Let's make things a bit simpler dropping
> it and corresponding logic.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index c26dc5a54f..a47d6cfea3 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -122,7 +122,6 @@ typedef struct BDRVNBDState {
>  int in_flight;
>  NBDClientState state;
>  int connect_status;

This field is write-only too.  Do you have any plans on using it later?
Otherwise you may want to nuke it in this patch too.

Roman.



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-04-07 Thread Roman Kagan
On Wed, Mar 17, 2021 at 11:35:31AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > This series aims to just stop messing with the drained section in the
> > reconnection code.
> > 
> > While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> > connection_co reentrance"); as I've missed the point of that commit I'd
> > appreciate more scrutiny in this area.
> > 
> > Roman Kagan (7):
> >block/nbd: avoid touching freed connect_thread
> >block/nbd: use uniformly nbd_client_connecting_wait
> >block/nbd: assert attach/detach runs in the proper context
> >block/nbd: transfer reconnection stuff across aio_context switch
> >block/nbd: better document a case in nbd_co_establish_connection
> >block/nbd: decouple reconnect from drain
> >block/nbd: stop manipulating in_flight counter
> > 
> >   block/nbd.c  | 191 +++
> >   nbd/client.c |   2 -
> >   2 files changed, 86 insertions(+), 107 deletions(-)
> > 
> 
> 
> Hmm. The huge source of problems for this series is weird logic around
> drain and aio context switch in NBD driver.
> 
> Why do we have all these too complicated logic with abuse of in_flight
> counter in NBD? The answer is connection_co. NBD differs from other
> drivers, it has a coroutine independent of request coroutines. And we
> have to move this coroutine carefully to new aio context. We can't
> just enter it from the new context, we want to be sure that
> connection_co is in one of yield points that supports reentering.
> 
> I have an idea of how to avoid this thing: drop connection_co at all.
> 
> 1. nbd negotiation goes to connection thread and becomes independent
> of any aio context.
> 
> 2. waiting for server reply goes to request code. So, instead of
> reading the replay from socket always in connection_co, we read in the
> request coroutine, after sending the request. We'll need a CoMutex for
> it (as only one request coroutine should read from socket), and be
> prepared to coming reply is not for _this_ request (in this case we
> should wake another request and continue read from socket).

The problem with this approach is that it would change the reconnect
behavior.

Currently connection_co purpose is three-fold:

1) receive the header of the server response, identify the request it
   pertains to, and wake the respective request coroutine

2) take on the responsibility to reestablish the connection when it's
   lost

3) monitor the idle connection and initiate the reconnect as soon as the
   connection is lost

Points 1 and 2 can be moved to the request coroutines indeed.  However I
don't see how to do 3 without an extra ever-running coroutine.
Sacrificing it would mean that a connection loss wouldn't be noticed and
the recovery wouldn't be attempted until a request arrived.

This change looks to me like a degradation compared to the current
state.

Roman.



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-26 Thread Roman Kagan
On Wed, Mar 17, 2021 at 11:35:31AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > This series aims to just stop messing with the drained section in the
> > reconnection code.
> > 
> > While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> > connection_co reentrance"); as I've missed the point of that commit I'd
> > appreciate more scrutiny in this area.
> > 
> > Roman Kagan (7):
> >block/nbd: avoid touching freed connect_thread
> >block/nbd: use uniformly nbd_client_connecting_wait
> >block/nbd: assert attach/detach runs in the proper context
> >block/nbd: transfer reconnection stuff across aio_context switch
> >block/nbd: better document a case in nbd_co_establish_connection
> >block/nbd: decouple reconnect from drain
> >block/nbd: stop manipulating in_flight counter
> > 
> >   block/nbd.c  | 191 +++
> >   nbd/client.c |   2 -
> >   2 files changed, 86 insertions(+), 107 deletions(-)
> > 
> 
> 
> Hmm. The huge source of problems for this series is weird logic around
> drain and aio context switch in NBD driver.
> 
> Why do we have all these too complicated logic with abuse of in_flight
> counter in NBD? The answer is connection_co. NBD differs from other
> drivers, it has a coroutine independent of request coroutines. And we
> have to move this coroutine carefully to new aio context. We can't
> just enter it from the new context, we want to be sure that
> connection_co is in one of yield points that supports reentering.
> 
> I have an idea of how to avoid this thing: drop connection_co at all.
> 
> 1. nbd negotiation goes to connection thread and becomes independent
> of any aio context.
> 
> 2. waiting for server reply goes to request code. So, instead of
> reading the replay from socket always in connection_co, we read in the
> request coroutine, after sending the request. We'll need a CoMutex for
> it (as only one request coroutine should read from socket), and be
> prepared to coming reply is not for _this_ request (in this case we
> should wake another request and continue read from socket).
> 
> but this may be too much for soft freeze.

This approach does look appealing to me, and I gave it a quick shot but
the amount of changes this involves exceeds the rc tolerance indeed.

> Another idea:
> 
> You want all the requests be completed on drain_begin(), not
> cancelled. Actually, you don't need reconnect runnning during drained
> section for it. It should be enough just wait for all currenct
> requests before disabling the reconnect in drain_begin handler.

So effectively you suggest doing nbd's own drain within
bdrv_co_drain_begin callback.  I'm not totally sure there are no
assumptions this may break, but I'll try to look into this possibility.
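
Roughly (untested sketch; whether free_sema is the right thing to wait on
here is exactly one of the assumptions to check):

    static void coroutine_fn nbd_client_co_drain_begin(BlockDriverState *bs)
    {
        BDRVNBDState *s = bs->opaque;

        /*
         * Let our own in-flight requests finish first; with reconnect
         * still active they can complete once the connection comes back
         * instead of being aborted.
         */
        while (s->in_flight > 0) {
            qemu_co_queue_wait(&s->free_sema, NULL);
        }

        /* only then pause the reconnect machinery as the handler does today */
    }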

Thanks,
Roman.



Re: [PATCH 7/7] block/nbd: stop manipulating in_flight counter

2021-03-26 Thread Roman Kagan
On Tue, Mar 16, 2021 at 09:37:13PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 16.03.2021 19:08, Roman Kagan wrote:
> > On Mon, Mar 15, 2021 at 11:15:44PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > 15.03.2021 09:06, Roman Kagan wrote:
> > > > As the reconnect logic no longer interferes with drained sections, it
> > > > appears unnecessary to explicitly manipulate the in_flight counter.
> > > > 
> > > > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
> > > 
> > > And here you actually allow qemu_aio_coroutine_enter() call in
> > > nbd_client_attach_aio_context_bh() to enter connection_co in any yield
> > > point which is possible during drained section. The analysis should be
> > > done to be sure that all these yield points are safe for reentering by
> > > external qemu_aio_coroutine_enter(). (By external I mean not by the
> > > actual enter() we are waiting for at the yield() point. For example
> > > qemu_channel_yield() supports reentering.. And therefore (as I
> > > understand after fast looking through) nbd_read() should support
> > > reentering too..
> > 
> > I'll do a more thorough analysis of how safe it is.
> > 
> > FWIW this hasn't triggered any test failures yet, but that assert in
> > patch 3 didn't ever go off either so I'm not sure I can trust the tests
> > on this.
> > 
> 
> Hmm. First, we should consider qemu_coroutine_yield() in
> nbd_co_establish_connection().
> 
> Most of nbd_co_establish_connection_cancel() purpose is to avoid
> reentering this yield()..

Unless I'm overlooking something, nbd_co_establish_connection() is fine
with spurious entering at this yield point.  What does look problematic,
though, is your next point:

> And I don't know, how to make it safely reenterable: keep in mind bh
> that may be already scheduled by connect_thread_func(). And if bh is
> already scheduled, can we cancel it? I'm not sure.
> 
> We have qemu_bh_delete(). But is it possible, that BH is near to be
> executed and already cannot be removed by qemu_bh_delete()? I don't
> know.
> 
> And if we can't safely drop the bh at any moment, we should wait in
> nbd_client_detach_aio_context until the scheduled bh enters the
> connection_co.. Or something like this

So I think it's not the reentry at this yield point per se that is
problematic; it's that the bh may have been scheduled before the
aio_context switch, so once it runs it would wake up connection_co on the
old aio_context.  I think it may be possible to address this by adding a
check into connect_bh().
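
Something along these lines (sketch only; it assumes the attach path in the
new context takes care of waking connection_co, which is the part that needs
verifying):

static void connect_bh(void *opaque)
{
    BDRVNBDState *state = opaque;

    /*
     * If the node has moved to another aio context since this BH was
     * scheduled, don't wake connection_co from here: that would enter it on
     * the old context.  Leave it to the attach path in the new context.
     */
    if (qemu_get_current_aio_context() != bdrv_get_aio_context(state->bs)) {
        return;
    }

    assert(state->wait_connect);
    state->wait_connect = false;
    aio_co_wake(state->connection_co);
}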

Thanks,
Roman.



Re: [PATCH 6/7] block/nbd: decouple reconnect from drain

2021-03-26 Thread Roman Kagan
On Tue, Mar 16, 2021 at 09:09:12PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 16.03.2021 19:03, Roman Kagan wrote:
> > On Mon, Mar 15, 2021 at 11:10:14PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > 15.03.2021 09:06, Roman Kagan wrote:
> > > > The reconnection logic doesn't need to stop while in a drained section.
> > > > Moreover it has to be active during the drained section, as the requests
> > > > that were caught in-flight with the connection to the server broken can
> > > > only usefully get drained if the connection is restored.  Otherwise such
> > > > requests can only either stall resulting in a deadlock (before
> > > > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > > > machinery (after 8c517de24a).
> > > > 
> > > > Since the pieces of the reconnection logic are now properly migrated
> > > > from one aio_context to another, it appears safe to just stop messing
> > > > with the drained section in the reconnection code.
> > > > 
> > > > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
> > > 
> > > I'd not think that it "fixes" it. Behavior changes.. But 5ad81b4946
> > > didn't introduce any bugs.
> > 
> > I guess you're right.
> > 
> > In fact I did reproduce the situation when the drain made no progress
> > exactly because the only in-flight reference was taken by the
> > connection_co, but it may be due to some intermediate stage of the patch
> > development so I need to do a more thorough analysis to tell if it was
> > triggerable with the original code.
> > 
> > > > Fixes: 8c517de24a ("block/nbd: fix drain dead-lock because of nbd 
> > > > reconnect-delay")
> > > 
> > > And here..
> > > 
> > > 1. There is an existing problem (unrelated to nbd) in Qemu that long
> > > io request which we have to wait for at drained_begin may trigger a
> > > dead lock
> > > (https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html)
> > > 
> > > 2. So, when we have nbd reconnect (and therefore long io requests) we
> > > simply trigger this deadlock.. That's why I decided to cancel the
> > > requests (assuming they will most probably fail anyway).
> > > 
> > > I agree that nbd driver is wrong place for fixing the problem
> > > described in
> > > (https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html),
> > > but if you just revert 8c517de24a, you'll see the deadlock again..
> > 
> > I may have misunderstood that thread, but isn't that deadlock exactly
> > due to the requests being unable to ever drain because the
> > reconnection process is suspended while the drain is in progress?
> > 
> 
> 
> Hmm, I didn't think about it this way.  What you mean is that
> reconnection is cancelled on drain_begin, so drain_begin will never
> finish, because it waits for requests, which will never be
> reconnected. So, you are right it's a deadlock too.

Right.

> But as I remember what is described in 8c517de24a is another deadlock,
> triggered by "just a long request during drain_begin". And it may be
> triggered again, if we'll try to reconnect for several seconds
> during drained_begin() instead of cancelling requests.

IMO it wasn't a different deadlock, it was exactly this one: the drain
got stuck the way described above with the qemu global mutex taken, so
the rest of qemu got stuck too.

> Didn't you try the scenario described in 8c517de24a on top of your
> series?

I've tried it with the current master with just 8c517de24a reverted --
the deadlock reproduced in all several attempts.  Then with my series --
the deadlock didn't reproduce in any of my several attempts.  More
precisely, qemu appeared frozen once the guest timed out the requests
and initiated ATA link soft reset, which presumably caused that drain,
but, as soon as the reconnect timer expired (resulting in requests
completed into the guest with errors) or the connection was restored
(resulting in requests completed successfully), it resumed normal
operation.

So yes, I think this series does address that deadlock.

Thanks,
Roman.



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-16 Thread Roman Kagan
On Tue, Mar 16, 2021 at 09:41:36AM -0500, Eric Blake wrote:
> On 3/15/21 1:06 AM, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > This series aims to just stop messing with the drained section in the
> > reconnection code.
> > 
> > While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> > connection_co reentrance"); as I've missed the point of that commit I'd
> > appreciate more scrutiny in this area.
> 
> Soft freeze is today.  I'm leaning towards declaring this series as a
> bug fix (and so give it some more soak time to get right, but still okay
> for -rc1) rather than a feature addition (and therefore would need to be
> in a pull request today).  Speak up now if this characterization is off
> base.

Yes I'd consider it a bug fix, too.  I'll do my best beating it into
shape before -rc2.

Thanks,
Roman.



Re: [PATCH 6/7] block/nbd: decouple reconnect from drain

2021-03-16 Thread Roman Kagan
On Mon, Mar 15, 2021 at 11:10:14PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > Since the pieces of the reconnection logic are now properly migrated
> > from one aio_context to another, it appears safe to just stop messing
> > with the drained section in the reconnection code.
> > 
> > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
> 
> I'd not think that it "fixes" it. Behavior changes.. But 5ad81b4946
> didn't introduce any bugs.

I guess you're right.

In fact I did reproduce the situation when the drain made no progress
exactly because the only in-flight reference was taken by the
connection_co, but it may be due to some intermediate stage of the patch
development so I need to do a more thorough analysis to tell if it was
triggerable with the original code.

> > Fixes: 8c517de24a ("block/nbd: fix drain dead-lock because of nbd 
> > reconnect-delay")
> 
> And here..
> 
> 1. There is an existing problem (unrelated to nbd) in Qemu that long
> io request which we have to wait for at drained_begin may trigger a
> dead lock
> (https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html)
> 
> 2. So, when we have nbd reconnect (and therefore long io requests) we
> simply trigger this deadlock.. That's why I decided to cancel the
> requests (assuming they will most probably fail anyway).
> 
> I agree that nbd driver is wrong place for fixing the problem
> described in
> (https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html),
> but if you just revert 8c517de24a, you'll see the deadlock again..

I may have misunderstood that thread, but isn't that deadlock exactly
due to the requests being unable to ever drain because the
reconnection process is suspended while the drain is in progress?

Thanks,
Roman.



Re: [PATCH 7/7] block/nbd: stop manipulating in_flight counter

2021-03-16 Thread Roman Kagan
On Mon, Mar 15, 2021 at 11:15:44PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > As the reconnect logic no longer interferes with drained sections, it
> > appears unnecessary to explicitly manipulate the in_flight counter.
> > 
> > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
> 
> And here you actually allow qemu_aio_coroutine_enter() call in
> nbd_client_attach_aio_context_bh() to enter connection_co in any yield
> point which is possible during drained section. The analysis should be
> done to be sure that all these yield points are safe for reentering by
> external qemu_aio_coroutine_enter(). (By external I mean not by the
> actual enter() we are waiting for at the yield() point. For example
> qemu_channel_yield() supports reentering.. And therefore (as I
> understand after fast looking through) nbd_read() should support
> reentering too..

I'll do a more thorough analysis of how safe it is.

FWIW this hasn't triggered any test failures yet, but that assert in
patch 3 didn't ever go off either so I'm not sure I can trust the tests
on this.

Thanks,
Roman.



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-16 Thread Roman Kagan
On Mon, Mar 15, 2021 at 10:45:39PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > This series aims to just stop messing with the drained section in the
> > reconnection code.
> > 
> > While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> > connection_co reentrance"); as I've missed the point of that commit I'd
> > appreciate more scrutiny in this area.
> 
> 
> The actual point is:
> 
> connection_co (together with all functions called from it) has a lot of yield 
> points. And we can't just enter the coroutine in any of them when we want, as 
> it may break some BH which is actually waited for in this yield point..
> 
> Still, we should care only about yield points possible during drained 
> section, so we don't need to care about direct qemu_coroutine_yield() inside 
> nbd_connection_entry().
> 
> Many things changed since 5ad81b4946.. So probably, now all the (possible 
> during drained section) yield points in nbd_connection_entry support 
> reentering. But some analysis of possible yield points should be done.

Thanks for the explanation.  Will do this analysis.

Roman.



Re: [PATCH 1/7] block/nbd: avoid touching freed connect_thread

2021-03-16 Thread Roman Kagan
On Mon, Mar 15, 2021 at 06:40:12PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > When the NBD connection is being torn down, the connection thread gets
> > canceled and "detached", meaning it is about to get freed.
> > 
> > If this happens while the connection coroutine yielded waiting for the
> > connection thread to complete, when it resumes it may access the
> > invalidated connection thread data.
> > 
> > To prevent this, revalidate the ->connect_thread pointer in
> > nbd_co_establish_connection_cancel before using it after the yield.
> > 
> > Signed-off-by: Roman Kagan 
> 
> Seems possible. Do you have a reproducer? Would be great to make an iotest.

It triggered on me in iotest 277, but seems to be timing-dependent.
I'll see if I can do it reliably.

Thanks,
Roman.



Re: [PATCH 3/7] block/nbd: assert attach/detach runs in the proper context

2021-03-15 Thread Roman Kagan
On Mon, Mar 15, 2021 at 07:41:32PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > Document (via a comment and an assert) that
> > nbd_client_detach_aio_context and nbd_client_attach_aio_context_bh run
> > in the desired aio_context
> > 
> > Signed-off-by: Roman Kagan 
> > ---
> >   block/nbd.c | 12 
> >   1 file changed, 12 insertions(+)
> > 
> > diff --git a/block/nbd.c b/block/nbd.c
> > index 1d8edb5b21..658b827d24 100644
> > --- a/block/nbd.c
> > +++ b/block/nbd.c
> > @@ -241,6 +241,12 @@ static void 
> > nbd_client_detach_aio_context(BlockDriverState *bs)
> >   {
> >   BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> > +/*
> > + * This runs in the (old, about to be detached) aio context of the @bs 
> > so
> > + * accessing the stuff on @s is concurrency-free.
> > + */
> > +assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
> 
> Hmm. I don't think so. The handler is called from 
> bdrv_set_aio_context_ignore(), which have the assertion 
> g_assert(qemu_get_current_aio_context() == qemu_get_aio_context());. There is 
> also a comment above bdrv_set_aio_context_ignore() "The caller must own the 
> AioContext lock for the old AioContext of bs".
> 
> So, we are not in the home context of bs here. We are in the main aio context 
> and we hold AioContext lock of old bs context.

You're absolutely right.  I'm wondering where I got the idea of this
assertion from...
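
If an assertion makes sense there at all, it would rather be the one
bdrv_set_aio_context_ignore() itself already makes:

    assert(qemu_get_current_aio_context() == qemu_get_aio_context());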

> 
> > +
> >   /* Timer is deleted in nbd_client_co_drain_begin() */
> >   assert(!s->reconnect_delay_timer);
> >   /*
> > @@ -258,6 +264,12 @@ static void nbd_client_attach_aio_context_bh(void 
> > *opaque)
> >   BlockDriverState *bs = opaque;
> >   BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> > +/*
> > + * This runs in the (new, just attached) aio context of the @bs so
> > + * accessing the stuff on @s is concurrency-free.
> > + */
> > +assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
> 
> This is correct just because we are in a BH, scheduled for this
> context (I hope we can't reattach some third context prior to entering
> the BH in the second:). So, I don't think this assertion really adds
> something.

Indeed.

> > +
> >   if (s->connection_co) {
> >   /*
> >* The node is still drained, so we know the coroutine has 
> > yielded in
> > 
> 
> I'm not sure that the asserted fact gives us "concurrency-free". For
> this we also need to ensure that all other things in the driver are
> always called in same aio context.. Still, it's a general assumption
> we have when writing block drivers "everything in one aio context, so
> don't care".. Sometime it leads to bugs, as some things are still
> called even from non-coroutine context. Also, Paolo said
> (https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg03814.html)
> that many iothreads will send requests to bs, and the code in driver
> should be thread safe. I don't have good understanding of all these
> things, and what I have is:
> 
> For now (at least we don't have problems in Rhel based downstream) it
> maybe OK to think that in block-driver everything is protected by
> AioContext lock and all concurrency we have inside block driver is
> switching between coroutines but never real parallelism. But in
> general it's not so, and with multiqueue it's not so.. So, I'd not put
> such a comment :)

So the patch is bogus in every respect; let's just drop it.

I hope it doesn't invalidate completely the rest of the series though.

Meanwhile I certainly need to update my idea of concurrency assumptions
in the block layer.

Thanks,
Roman.



[PATCH 1/7] block/nbd: avoid touching freed connect_thread

2021-03-15 Thread Roman Kagan
When the NBD connection is being torn down, the connection thread gets
canceled and "detached", meaning it is about to get freed.

If this happens while the connection coroutine yielded waiting for the
connection thread to complete, when it resumes it may access the
invalidated connection thread data.

To prevent this, revalidate the ->connect_thread pointer in
nbd_co_establish_connection_cancel before using it after the yield.

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index c26dc5a54f..447d176b76 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -486,6 +486,15 @@ nbd_co_establish_connection(BlockDriverState *bs, Error 
**errp)
 s->wait_connect = true;
 qemu_coroutine_yield();
 
+/*
+ * If nbd_co_establish_connection_cancel had a chance to run it may have
+ * invalidated ->connect_thread.
+ */
+thr = s->connect_thread;
+if (!thr) {
+return -ECONNABORTED;
+}
+
qemu_mutex_lock(&thr->mutex);
 
 switch (thr->state) {
-- 
2.30.2




[PATCH 4/7] block/nbd: transfer reconnection stuff across aio_context switch

2021-03-15 Thread Roman Kagan
Make various pieces of the reconnection logic correctly survive the
transition of the BDRVNBDState from one aio_context to another.  In
particular,

- cancel the reconnect_delay_timer and rearm it in the new context;
- cancel the sleep of the connection_co between reconnect attempts so
  that it continues in the new context;
- prevent the connection thread from delivering its status to the old
  context, and retarget it to the new context on attach.

None of these is needed at the moment because the aio_context switch
happens within a drained section and that effectively terminates the
reconnection logic on entry and starts it over on exit.  However, this
patch paves the way to keeping the reconnection process active across
the drained section (in a followup patch).

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 44 ++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 658b827d24..a6d713ba58 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -126,6 +126,7 @@ typedef struct BDRVNBDState {
 bool wait_in_flight;
 
 QEMUTimer *reconnect_delay_timer;
+int64_t reconnect_expire_time_ns;
 
 NBDClientRequest requests[MAX_NBD_REQUESTS];
 NBDReply reply;
@@ -240,6 +241,7 @@ static void reconnect_delay_timer_init(BDRVNBDState *s, 
uint64_t expire_time_ns)
 static void nbd_client_detach_aio_context(BlockDriverState *bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
+NBDConnectThread *thr = s->connect_thread;
 
 /*
  * This runs in the (old, about to be detached) aio context of the @bs so
@@ -247,8 +249,31 @@ static void nbd_client_detach_aio_context(BlockDriverState 
*bs)
  */
 assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
 
-/* Timer is deleted in nbd_client_co_drain_begin() */
-assert(!s->reconnect_delay_timer);
+/*
+ * Make sure the connection thread doesn't try to deliver its status to the
+ * old context.
+ */
+qemu_mutex_lock(&thr->mutex);
+thr->bh_ctx = NULL;
+qemu_mutex_unlock(&thr->mutex);
+
+/*
+ * Preserve the expiration time of the reconnect_delay_timer in order to
+ * resume it on the new aio context.
+ */
+s->reconnect_expire_time_ns = s->reconnect_delay_timer ?
+timer_expire_time_ns(s->reconnect_delay_timer) : -1;
+reconnect_delay_timer_del(s);
+
+/*
+ * If the connection coroutine was sleeping between reconnect attempts,
+ * wake it up now and let it continue the process in the new aio context.
+ * This will distort the exponential back-off but that's probably ok.
+ */
+if (s->connection_co_sleep_ns_state) {
+qemu_co_sleep_wake(s->connection_co_sleep_ns_state);
+}
+
 /*
  * If reconnect is in progress we may have no ->ioc.  It will be
  * re-instantiated in the proper aio context once the connection is
@@ -263,6 +288,7 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
 {
 BlockDriverState *bs = opaque;
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
+NBDConnectThread *thr = s->connect_thread;
 
 /*
  * This runs in the (new, just attached) aio context of the @bs so
@@ -270,6 +296,20 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
  */
 assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
 
+if (nbd_client_connecting_wait(s) && s->reconnect_expire_time_ns >= 0) {
+reconnect_delay_timer_init(s, s->reconnect_expire_time_ns);
+}
+
+/*
+ * If the connection thread hasn't completed connecting yet, make sure it
+ * can deliver its status in the new context.
+ */
+qemu_mutex_lock(&thr->mutex);
+if (thr->state == CONNECT_THREAD_RUNNING) {
+thr->bh_ctx = qemu_get_current_aio_context();
+}
+qemu_mutex_unlock(&thr->mutex);
+
 if (s->connection_co) {
 /*
  * The node is still drained, so we know the coroutine has yielded in
-- 
2.30.2




[PATCH 2/7] block/nbd: use uniformly nbd_client_connecting_wait

2021-03-15 Thread Roman Kagan
Use nbd_client_connecting_wait uniformly all over the block/nbd.c.

While at it, drop the redundant check for nbd_client_connecting_wait
in reconnect_delay_timer_init, as all its callsites do this check too.

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 34 +++---
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 447d176b76..1d8edb5b21 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -165,6 +165,18 @@ static void nbd_clear_bdrvstate(BDRVNBDState *s)
 s->x_dirty_bitmap = NULL;
 }
 
+static bool nbd_client_connecting(BDRVNBDState *s)
+{
+NBDClientState state = qatomic_load_acquire(&s->state);
+return state == NBD_CLIENT_CONNECTING_WAIT ||
+state == NBD_CLIENT_CONNECTING_NOWAIT;
+}
+
+static bool nbd_client_connecting_wait(BDRVNBDState *s)
+{
+return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT;
+}
+
 static void nbd_channel_error(BDRVNBDState *s, int ret)
 {
 if (ret == -EIO) {
@@ -205,7 +217,7 @@ static void reconnect_delay_timer_cb(void *opaque)
 {
 BDRVNBDState *s = opaque;
 
-if (qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT) {
+if (nbd_client_connecting_wait(s)) {
 s->state = NBD_CLIENT_CONNECTING_NOWAIT;
while (qemu_co_enter_next(&s->free_sema, NULL)) {
 /* Resume all queued requests */
@@ -217,10 +229,6 @@ static void reconnect_delay_timer_cb(void *opaque)
 
 static void reconnect_delay_timer_init(BDRVNBDState *s, uint64_t 
expire_time_ns)
 {
-if (qatomic_load_acquire(&s->state) != NBD_CLIENT_CONNECTING_WAIT) {
-return;
-}
-
 assert(!s->reconnect_delay_timer);
 s->reconnect_delay_timer = aio_timer_new(bdrv_get_aio_context(s->bs),
  QEMU_CLOCK_REALTIME,
@@ -346,18 +354,6 @@ static void nbd_teardown_connection(BlockDriverState *bs)
 assert(!s->connection_co);
 }
 
-static bool nbd_client_connecting(BDRVNBDState *s)
-{
-NBDClientState state = qatomic_load_acquire(&s->state);
-return state == NBD_CLIENT_CONNECTING_WAIT ||
-state == NBD_CLIENT_CONNECTING_NOWAIT;
-}
-
-static bool nbd_client_connecting_wait(BDRVNBDState *s)
-{
-return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT;
-}
-
 static void connect_bh(void *opaque)
 {
 BDRVNBDState *state = opaque;
@@ -667,7 +663,7 @@ static coroutine_fn void nbd_co_reconnect_loop(BDRVNBDState 
*s)
 uint64_t timeout = 1 * NANOSECONDS_PER_SECOND;
 uint64_t max_timeout = 16 * NANOSECONDS_PER_SECOND;
 
-if (qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT) {
+if (nbd_client_connecting_wait(s)) {
 reconnect_delay_timer_init(s, qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
s->reconnect_delay * 
NANOSECONDS_PER_SECOND);
 }
@@ -2473,7 +2469,7 @@ static void nbd_cancel_in_flight(BlockDriverState *bs)
 
 reconnect_delay_timer_del(s);
 
-if (s->state == NBD_CLIENT_CONNECTING_WAIT) {
+if (nbd_client_connecting_wait(s)) {
 s->state = NBD_CLIENT_CONNECTING_NOWAIT;
qemu_co_queue_restart_all(&s->free_sema);
 }
-- 
2.30.2




[PATCH 5/7] block/nbd: better document a case in nbd_co_establish_connection

2021-03-15 Thread Roman Kagan
Cosmetic: adjust the comment and the return value in
nbd_co_establish_connection where it's entered while the connection
thread is still running.

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index a6d713ba58..a3eb9b9079 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -562,11 +562,12 @@ nbd_co_establish_connection(BlockDriverState *bs, Error 
**errp)
 case CONNECT_THREAD_RUNNING:
 case CONNECT_THREAD_RUNNING_DETACHED:
 /*
- * Obviously, drained section wants to start. Report the attempt as
- * failed. Still connect thread is executing in background, and its
+ * Spurious coroutine resume before the connection attempt has finished,
+ * presumably upon attaching a new aio_context. Report the attempt as
+ * failed.  Still connect thread is executing in background, and its
  * result may be used for next connection attempt.
  */
-ret = -1;
+ret = -ECONNABORTED;
 error_setg(errp, "Connection attempt cancelled by other operation");
 break;
 
-- 
2.30.2




[PATCH 7/7] block/nbd: stop manipulating in_flight counter

2021-03-15 Thread Roman Kagan
As the reconnect logic no longer interferes with drained sections, it
appears unnecessary to explicitly manipulate the in_flight counter.

Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
Signed-off-by: Roman Kagan 
---
 block/nbd.c  | 6 --
 nbd/client.c | 2 --
 2 files changed, 8 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index a5a9e4aca5..3a22f5d897 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -311,7 +311,6 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
 if (s->connection_co) {
 qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
 }
-bdrv_dec_in_flight(bs);
 }
 
 static void nbd_client_attach_aio_context(BlockDriverState *bs,
@@ -327,7 +326,6 @@ static void nbd_client_attach_aio_context(BlockDriverState 
*bs,
 qio_channel_attach_aio_context(QIO_CHANNEL(s->ioc), new_context);
 }
 
-bdrv_inc_in_flight(bs);
 
 /*
  * Need to wait here for the BH to run because the BH must run while the
@@ -643,11 +641,9 @@ static coroutine_fn void 
nbd_reconnect_attempt(BDRVNBDState *s)
 goto out;
 }
 
-bdrv_dec_in_flight(s->bs);
 
ret = nbd_client_handshake(s->bs, &local_err);
 
-bdrv_inc_in_flight(s->bs);
 
 out:
 s->connect_status = ret;
@@ -759,7 +755,6 @@ static coroutine_fn void nbd_connection_entry(void *opaque)
 
qemu_co_queue_restart_all(&s->free_sema);
 nbd_recv_coroutines_wake_all(s);
-bdrv_dec_in_flight(s->bs);
 
 s->connection_co = NULL;
 if (s->ioc) {
@@ -2307,7 +2302,6 @@ static int nbd_open(BlockDriverState *bs, QDict *options, 
int flags,
 nbd_init_connect_thread(s);
 
 s->connection_co = qemu_coroutine_create(nbd_connection_entry, s);
-bdrv_inc_in_flight(bs);
 aio_co_schedule(bdrv_get_aio_context(bs), s->connection_co);
 
 return 0;
diff --git a/nbd/client.c b/nbd/client.c
index 0c2db4bcba..30d5383cb1 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -1434,9 +1434,7 @@ nbd_read_eof(BlockDriverState *bs, QIOChannel *ioc, void 
*buffer, size_t size,
 
len = qio_channel_readv(ioc, &iov, 1, errp);
 if (len == QIO_CHANNEL_ERR_BLOCK) {
-bdrv_dec_in_flight(bs);
 qio_channel_yield(ioc, G_IO_IN);
-bdrv_inc_in_flight(bs);
 continue;
 } else if (len < 0) {
 return -EIO;
-- 
2.30.2




[PATCH 3/7] block/nbd: assert attach/detach runs in the proper context

2021-03-15 Thread Roman Kagan
Document (via a comment and an assert) that
nbd_client_detach_aio_context and nbd_client_attach_aio_context_bh run
in the desired aio_context.

Signed-off-by: Roman Kagan 
---
 block/nbd.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index 1d8edb5b21..658b827d24 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -241,6 +241,12 @@ static void nbd_client_detach_aio_context(BlockDriverState 
*bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
+/*
+ * This runs in the (old, about to be detached) aio context of the @bs so
+ * accessing the stuff on @s is concurrency-free.
+ */
+assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
+
 /* Timer is deleted in nbd_client_co_drain_begin() */
 assert(!s->reconnect_delay_timer);
 /*
@@ -258,6 +264,12 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
 BlockDriverState *bs = opaque;
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
+/*
+ * This runs in the (new, just attached) aio context of the @bs so
+ * accessing the stuff on @s is concurrency-free.
+ */
+assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
+
 if (s->connection_co) {
 /*
  * The node is still drained, so we know the coroutine has yielded in
-- 
2.30.2




[PATCH 6/7] block/nbd: decouple reconnect from drain

2021-03-15 Thread Roman Kagan
The reconnection logic doesn't need to stop while in a drained section.
Moreover it has to be active during the drained section, as the requests
that were caught in-flight with the connection to the server broken can
only usefully get drained if the connection is restored.  Otherwise such
requests can only either stall resulting in a deadlock (before
8c517de24a), or be aborted defeating the purpose of the reconnection
machinery (after 8c517de24a).

Since the pieces of the reconnection logic are now properly migrated
from one aio_context to another, it appears safe to just stop messing
with the drained section in the reconnection code.

Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
Fixes: 8c517de24a ("block/nbd: fix drain dead-lock because of nbd 
reconnect-delay")
Signed-off-by: Roman Kagan 
---
 block/nbd.c | 79 +++--
 1 file changed, 4 insertions(+), 75 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index a3eb9b9079..a5a9e4aca5 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -117,8 +117,6 @@ typedef struct BDRVNBDState {
 Coroutine *connection_co;
 Coroutine *teardown_co;
 QemuCoSleepState *connection_co_sleep_ns_state;
-bool drained;
-bool wait_drained_end;
 int in_flight;
 NBDClientState state;
 int connect_status;
@@ -311,12 +309,6 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
 qemu_mutex_unlock(>mutex);
 
 if (s->connection_co) {
-/*
- * The node is still drained, so we know the coroutine has yielded in
- * nbd_read_eof(), the only place where bs->in_flight can reach 0, or
- * it is entered for the first time. Both places are safe for entering
- * the coroutine.
- */
 qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
 }
 bdrv_dec_in_flight(bs);
@@ -344,37 +336,6 @@ static void nbd_client_attach_aio_context(BlockDriverState 
*bs,
 aio_wait_bh_oneshot(new_context, nbd_client_attach_aio_context_bh, bs);
 }
 
-static void coroutine_fn nbd_client_co_drain_begin(BlockDriverState *bs)
-{
-BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-
-s->drained = true;
-if (s->connection_co_sleep_ns_state) {
-qemu_co_sleep_wake(s->connection_co_sleep_ns_state);
-}
-
-nbd_co_establish_connection_cancel(bs, false);
-
-reconnect_delay_timer_del(s);
-
-if (qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT) {
-s->state = NBD_CLIENT_CONNECTING_NOWAIT;
-qemu_co_queue_restart_all(&s->free_sema);
-}
-}
-
-static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)
-{
-BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-
-s->drained = false;
-if (s->wait_drained_end) {
-s->wait_drained_end = false;
-aio_co_wake(s->connection_co);
-}
-}
-
-
 static void nbd_teardown_connection(BlockDriverState *bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
@@ -686,16 +647,6 @@ static coroutine_fn void 
nbd_reconnect_attempt(BDRVNBDState *s)
 
 ret = nbd_client_handshake(s->bs, _err);
 
-if (s->drained) {
-s->wait_drained_end = true;
-while (s->drained) {
-/*
- * We may be entered once from nbd_client_attach_aio_context_bh
- * and then from nbd_client_co_drain_end. So here is a loop.
- */
-qemu_coroutine_yield();
-}
-}
 bdrv_inc_in_flight(s->bs);
 
 out:
@@ -724,26 +675,10 @@ static coroutine_fn void 
nbd_co_reconnect_loop(BDRVNBDState *s)
 nbd_reconnect_attempt(s);
 
 while (nbd_client_connecting(s)) {
-if (s->drained) {
-bdrv_dec_in_flight(s->bs);
-s->wait_drained_end = true;
-while (s->drained) {
-/*
- * We may be entered once from nbd_client_attach_aio_context_bh
- * and then from nbd_client_co_drain_end. So here is a loop.
- */
-qemu_coroutine_yield();
-}
-bdrv_inc_in_flight(s->bs);
-} else {
-qemu_co_sleep_ns_wakeable(QEMU_CLOCK_REALTIME, timeout,
-  &s->connection_co_sleep_ns_state);
-if (s->drained) {
-continue;
-}
-if (timeout < max_timeout) {
-timeout *= 2;
-}
+qemu_co_sleep_ns_wakeable(QEMU_CLOCK_REALTIME, timeout,
+  &s->connection_co_sleep_ns_state);
+if (timeout < max_timeout) {
+timeout *= 2;
 }
 
 nbd_reconnect_attempt(s);
@@ -2548,8 +2483,6 @@ static BlockDriver bdrv_nbd = {
 .bdrv_getlength = nbd_getlength,
 .bdrv_detach_aio_context= nbd_client_detach_aio_context,
 .bdrv_attach_aio_context   

[PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-15 Thread Roman Kagan
The reconnection logic doesn't need to stop while in a drained section.
Moreover it has to be active during the drained section, as the requests
that were caught in-flight with the connection to the server broken can
only usefully get drained if the connection is restored.  Otherwise such
requests can only either stall resulting in a deadlock (before
8c517de24a), or be aborted defeating the purpose of the reconnection
machinery (after 8c517de24a).

This series aims to just stop messing with the drained section in the
reconnection code.

While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
connection_co reentrance"); as I've missed the point of that commit I'd
appreciate more scrutiny in this area.

Roman Kagan (7):
  block/nbd: avoid touching freed connect_thread
  block/nbd: use uniformly nbd_client_connecting_wait
  block/nbd: assert attach/detach runs in the proper context
  block/nbd: transfer reconnection stuff across aio_context switch
  block/nbd: better document a case in nbd_co_establish_connection
  block/nbd: decouple reconnect from drain
  block/nbd: stop manipulating in_flight counter

 block/nbd.c  | 191 +++
 nbd/client.c |   2 -
 2 files changed, 86 insertions(+), 107 deletions(-)

-- 
2.30.2




Re: [RFC] nbd: decouple reconnect from drain

2021-03-14 Thread Roman Kagan
On Fri, Mar 12, 2021 at 03:35:25PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 10.03.2021 12:32, Roman Kagan wrote:
> > NBD connect coroutine takes an extra in_flight reference as if it's a
> > request handler.  This prevents drain from completion until the
> > connection coroutine is releases the reference.
> > 
> > When NBD is configured to reconnect, however, this appears to be fatal
> > to the reconnection idea: the drain procedure wants the reconnection to
> > be suspended, but this is only possible if the in-flight requests are
> > canceled.
> 
> As I remember from our conversation, the problem is not that we don't
> reconnect during drained section, but exactly that we cancel requests
> on drained begins starting from 8c517de24a8a1dcbeb.

Well, I'd put it the other way around: the problem is that we don't
reconnect during the drained section, so the requests can't be
successfully completed if the connection breaks: they can either stall
forever (before 8c517de24a8a1dcbeb) or be aborted (after
8c517de24a8a1dcbeb).

> This is not a problem in scenarios when reconnect is rare case and
> failed request is acceptable. But if we have bad connection and
> requests should often wait for reconnect (so, it may be considered as
> a kind of "latency") then really, cancelling the waiting requests on
> any drain() kills the reconnect feature.

In our experience the primary purpose of reconnect is not to withstand
poor network links, but about being able to restart the NBD server
without disrupting the guest operation.  So failing the requests during
the server maintenance window is never acceptable, no matter how rare it
is (and in our practice it isn't).

> > Fix this by making the connection coroutine stop messing with the
> > in-flight counter.  Instead, certain care is taken to properly move the
> > reconnection stuff from one aio_context to another in
> > .bdrv_{attach,detach}_aio_context callbacks.
> > 
> > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
> > Signed-off-by: Roman Kagan 
> > ---
> > This patch passes the regular make check but fails some extra iotests,
> > in particular 277.  It obviously lacks more robust interaction with the
> > connection thread (which in general is fairly complex and hard to reason
> > about), and perhaps has some other drawbacks, so I'll work on this
> > further, but I'd appreciate some feedback on whether the idea is sound.
> > 
> 
> In general I like the idea. The logic around drain in nbd is
> overcomplicated. And I never liked the fact that nbd_read_eof() does
> dec-inc inflight section. Some notes:
> 
> 1. I hope, the patch can be divided into several ones, as there are
> several things done:
> 
> - removing use of in_flight counter introduced by 5ad81b4946
> - do reconnect during drained section
> - stop cancelling requests on .drained_begin

I've split the (somewhat extended) patch into a series of 7, but not
exactly as you suggested.  In particular, I can't just stop aborting the
requests on .drained_begin as this would reintroduce the deadlock, so I
just remove the whole .drained_begin/end in a single patch.

> 2. 5ad81b4946 was needed to make nbd_client_attach_aio_context()
> reenter connection_co only in one (or two) possible places, not on any
> yield.. And I don't see how it is achieved now.. This should be
> described in commit msg..

My problem is that I failed to figure out why it was necessary in the
first place, so I think I don't achieve this with the series.

> 3. About cancelling requests on drained_begin. The behavior was
> introduced by 8c517de24a8a1dcbeb, to fix a deadlock. So, if now the
> deadlock is fixed another way, let's change the logic (don't cancel
> requests) in a separate patch, and note 8c517de24a8a1dcbeb commit and
> the commit that fixes deadlock the other way in the commit message.

As I mentioned earlier I did a patch that removed the root cause; a
separate patch removing just the request cancelation didn't look
justified.  I did mention the commits in the log.

Thanks for the review!  I'm now on to submitting a non-RFC version.
Roman.



[RFC] nbd: decouple reconnect from drain

2021-03-10 Thread Roman Kagan
NBD connect coroutine takes an extra in_flight reference as if it's a
request handler.  This prevents the drain from completing until the
connection coroutine releases the reference.

When NBD is configured to reconnect, however, this appears to be fatal
to the reconnection idea: the drain procedure wants the reconnection to
be suspended, but this is only possible if the in-flight requests are
canceled.

Fix this by making the connection coroutine stop messing with the
in-flight counter.  Instead, certain care is taken to properly move the
reconnection stuff from one aio_context to another in
.bdrv_{attach,detach}_aio_context callbacks.

Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
Signed-off-by: Roman Kagan 
---
This patch passes the regular make check but fails some extra iotests,
in particular 277.  It obviously lacks more robust interaction with the
connection thread (which in general is fairly complex and hard to reason
about), and perhaps has some other drawbacks, so I'll work on this
further, but I'd appreciate some feedback on whether the idea is sound.

 block/nbd.c  | 134 ---
 nbd/client.c |   2 -
 2 files changed, 41 insertions(+), 95 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index c26dc5a54f..5319e543ab 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -117,8 +117,6 @@ typedef struct BDRVNBDState {
 Coroutine *connection_co;
 Coroutine *teardown_co;
 QemuCoSleepState *connection_co_sleep_ns_state;
-bool drained;
-bool wait_drained_end;
 int in_flight;
 NBDClientState state;
 int connect_status;
@@ -126,6 +124,7 @@ typedef struct BDRVNBDState {
 bool wait_in_flight;
 
 QEMUTimer *reconnect_delay_timer;
+int64_t reconnect_expire_time_ns;
 
 NBDClientRequest requests[MAX_NBD_REQUESTS];
 NBDReply reply;
@@ -165,6 +164,18 @@ static void nbd_clear_bdrvstate(BDRVNBDState *s)
 s->x_dirty_bitmap = NULL;
 }
 
+static bool nbd_client_connecting(BDRVNBDState *s)
+{
+NBDClientState state = qatomic_load_acquire(&s->state);
+return state == NBD_CLIENT_CONNECTING_WAIT ||
+state == NBD_CLIENT_CONNECTING_NOWAIT;
+}
+
+static bool nbd_client_connecting_wait(BDRVNBDState *s)
+{
+return qatomic_load_acquire(&s->state) == NBD_CLIENT_CONNECTING_WAIT;
+}
+
 static void nbd_channel_error(BDRVNBDState *s, int ret)
 {
 if (ret == -EIO) {
@@ -217,10 +228,6 @@ static void reconnect_delay_timer_cb(void *opaque)
 
 static void reconnect_delay_timer_init(BDRVNBDState *s, uint64_t 
expire_time_ns)
 {
-if (qatomic_load_acquire(&s->state) != NBD_CLIENT_CONNECTING_WAIT) {
-return;
-}
-
 assert(!s->reconnect_delay_timer);
 s->reconnect_delay_timer = aio_timer_new(bdrv_get_aio_context(s->bs),
  QEMU_CLOCK_REALTIME,
@@ -233,8 +240,20 @@ static void nbd_client_detach_aio_context(BlockDriverState 
*bs)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
-/* Timer is deleted in nbd_client_co_drain_begin() */
-assert(!s->reconnect_delay_timer);
+/*
+ * This runs in the (old, about to be detached) aio context of the @bs so
+ * accessing the stuff on @s is concurrency-free.
+ */
+assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
+
+/*
+ * Preserve the expiration time of the reconnect_delay_timer in order to
+ * resume it on the new aio context.
+ */
+s->reconnect_expire_time_ns = s->reconnect_delay_timer ?
+timer_expire_time_ns(s->reconnect_delay_timer) : -1;
+reconnect_delay_timer_del(s);
+
 /*
  * If reconnect is in progress we may have no ->ioc.  It will be
  * re-instantiated in the proper aio context once the connection is
@@ -250,6 +269,16 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
 BlockDriverState *bs = opaque;
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
+/*
+ * This runs in the (new, just attached) aio context of the @bs so
+ * accessing the stuff on @s is concurrency-free.
+ */
+assert(qemu_get_current_aio_context() == bdrv_get_aio_context(bs));
+
+if (nbd_client_connecting_wait(s) && s->reconnect_expire_time_ns >= 0) {
+reconnect_delay_timer_init(s, s->reconnect_expire_time_ns);
+}
+
 if (s->connection_co) {
 /*
  * The node is still drained, so we know the coroutine has yielded in
@@ -259,7 +288,6 @@ static void nbd_client_attach_aio_context_bh(void *opaque)
  */
 qemu_aio_coroutine_enter(bs->aio_context, s->connection_co);
 }
-bdrv_dec_in_flight(bs);
 }
 
 static void nbd_client_attach_aio_context(BlockDriverState *bs,
@@ -275,7 +303,6 @@ static void nbd_client_attach_aio_context(BlockDriverState 
*bs,
 qio_channel_attach_aio_context(QIO_CHANNEL(s->ioc), new_context);
 }
 
-bdrv_inc_in_flight(bs);
 
 /*
  * Need t

Re: [PATCH 1/4] block/nbd: fix drain dead-lock because of nbd reconnect-delay

2021-02-03 Thread Roman Kagan
On Wed, Feb 03, 2021 at 05:44:34PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 03.02.2021 17:21, Roman Kagan wrote:
> > On Wed, Feb 03, 2021 at 04:10:41PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > 03.02.2021 13:53, Roman Kagan wrote:
> > > > On Thu, Sep 03, 2020 at 10:02:58PM +0300, Vladimir Sementsov-Ogievskiy 
> > > > wrote:
> > > > > We pause reconnect process during drained section. So, if we have some
> > > > > requests, waiting for reconnect we should cancel them, otherwise they
> > > > > deadlock the drained section.
> > > > > 
> > > > > How to reproduce:
> > > > > 
> > > > > 1. Create an image:
> > > > >  qemu-img create -f qcow2 xx 100M
> > > > > 
> > > > > 2. Start NBD server:
> > > > >  qemu-nbd xx
> > > > > 
> > > > > 3. Start vm with second nbd disk on node2, like this:
> > > > > 
> > > > > ./build/x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
> > > > >file=/work/images/cent7.qcow2 -drive \
> > > > >
> > > > > driver=nbd,server.type=inet,server.host=192.168.100.5,server.port=10809,reconnect-delay=60
> > > > >  \
> > > > >-vnc :0 -m 2G -enable-kvm -vga std
> > > > > 
> > > > > 4. Access the vm through vnc (or some other way?), and check that NBD
> > > > >  drive works:
> > > > > 
> > > > >  dd if=/dev/sdb of=/dev/null bs=1M count=10
> > > > > 
> > > > >  - the command should succeed.
> > > > > 
> > > > > 5. Now, kill the nbd server, and run dd in the guest again:
> > > > > 
> > > > >  dd if=/dev/sdb of=/dev/null bs=1M count=10
> > > > > 
> > > > > Now Qemu is trying to reconnect, and dd-generated requests are waiting
> > > > > for the connection (they will wait up to 60 seconds (see
> > > > > reconnect-delay option above) and than fail). But suddenly, vm may
> > > > > totally hang in the deadlock. You may need to increase reconnect-delay
> > > > > period to catch the dead-lock.
> > > > > 
> > > > > VM doesn't respond because drain dead-lock happens in cpu thread with
> > > > > global mutex taken. That's not good thing by itself and is not fixed
> > > > > by this commit (true way is using iothreads). Still this commit fixes
> > > > > drain dead-lock itself.
> > > > > 
> > > > > Note: probably, we can instead continue to reconnect during drained
> > > > > section. To achieve this, we may move negotiation to the connect 
> > > > > thread
> > > > > to make it independent of bs aio context. But expanding drained 
> > > > > section
> > > > > doesn't seem good anyway. So, let's now fix the bug the simplest way.
> > > > > 
> > > > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > > > ---
> > > > >block/nbd.c | 5 +
> > > > >1 file changed, 5 insertions(+)
> > > > > 
> > > > > diff --git a/block/nbd.c b/block/nbd.c
> > > > > index 9daf003bea..912ea27be7 100644
> > > > > --- a/block/nbd.c
> > > > > +++ b/block/nbd.c
> > > > > @@ -242,6 +242,11 @@ static void coroutine_fn 
> > > > > nbd_client_co_drain_begin(BlockDriverState *bs)
> > > > >}
> > > > >nbd_co_establish_connection_cancel(bs, false);
> > > > > +
> > > > > +if (s->state == NBD_CLIENT_CONNECTING_WAIT) {
> > > > > +s->state = NBD_CLIENT_CONNECTING_NOWAIT;
> > > > > +qemu_co_queue_restart_all(>free_sema);
> > > > > +}
> > > > >}
> > > > >static void coroutine_fn nbd_client_co_drain_end(BlockDriverState 
> > > > > *bs)
> > > > 
> > > > This basically defeats the whole purpose of reconnect: if the nbd client
> > > > is trying to reconnect, drain effectively cancels that and makes all
> > > > in-flight requests complete with an error.
> > > > 
> > > > I'm not suggesting to revert this patch (it's now in the tree as commit
> > > > 8c517de24a), because the deadlock is no better, but I'm afraid the only
> > > > real

Re: [PATCH 1/4] block/nbd: fix drain dead-lock because of nbd reconnect-delay

2021-02-03 Thread Roman Kagan
On Wed, Feb 03, 2021 at 04:10:41PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 03.02.2021 13:53, Roman Kagan wrote:
> > On Thu, Sep 03, 2020 at 10:02:58PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > We pause reconnect process during drained section. So, if we have some
> > > requests, waiting for reconnect we should cancel them, otherwise they
> > > deadlock the drained section.
> > > 
> > > How to reproduce:
> > > 
> > > 1. Create an image:
> > > qemu-img create -f qcow2 xx 100M
> > > 
> > > 2. Start NBD server:
> > > qemu-nbd xx
> > > 
> > > 3. Start vm with second nbd disk on node2, like this:
> > > 
> > >./build/x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
> > >   file=/work/images/cent7.qcow2 -drive \
> > >   
> > > driver=nbd,server.type=inet,server.host=192.168.100.5,server.port=10809,reconnect-delay=60
> > >  \
> > >   -vnc :0 -m 2G -enable-kvm -vga std
> > > 
> > > 4. Access the vm through vnc (or some other way?), and check that NBD
> > > drive works:
> > > 
> > > dd if=/dev/sdb of=/dev/null bs=1M count=10
> > > 
> > > - the command should succeed.
> > > 
> > > 5. Now, kill the nbd server, and run dd in the guest again:
> > > 
> > > dd if=/dev/sdb of=/dev/null bs=1M count=10
> > > 
> > > Now Qemu is trying to reconnect, and dd-generated requests are waiting
> > > for the connection (they will wait up to 60 seconds (see
> > > reconnect-delay option above) and than fail). But suddenly, vm may
> > > totally hang in the deadlock. You may need to increase reconnect-delay
> > > period to catch the dead-lock.
> > > 
> > > VM doesn't respond because drain dead-lock happens in cpu thread with
> > > global mutex taken. That's not good thing by itself and is not fixed
> > > by this commit (true way is using iothreads). Still this commit fixes
> > > drain dead-lock itself.
> > > 
> > > Note: probably, we can instead continue to reconnect during drained
> > > section. To achieve this, we may move negotiation to the connect thread
> > > to make it independent of bs aio context. But expanding drained section
> > > doesn't seem good anyway. So, let's now fix the bug the simplest way.
> > > 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > ---
> > >   block/nbd.c | 5 +
> > >   1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/block/nbd.c b/block/nbd.c
> > > index 9daf003bea..912ea27be7 100644
> > > --- a/block/nbd.c
> > > +++ b/block/nbd.c
> > > @@ -242,6 +242,11 @@ static void coroutine_fn 
> > > nbd_client_co_drain_begin(BlockDriverState *bs)
> > >   }
> > >   nbd_co_establish_connection_cancel(bs, false);
> > > +
> > > +if (s->state == NBD_CLIENT_CONNECTING_WAIT) {
> > > +s->state = NBD_CLIENT_CONNECTING_NOWAIT;
> > > +qemu_co_queue_restart_all(&s->free_sema);
> > > +}
> > >   }
> > >   static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)
> > 
> > This basically defeats the whole purpose of reconnect: if the nbd client
> > is trying to reconnect, drain effectively cancels that and makes all
> > in-flight requests complete with an error.
> > 
> > I'm not suggesting to revert this patch (it's now in the tree as commit
> > 8c517de24a), because the deadlock is no better, but I'm afraid the only
> > real fix is to implement reconnect during the drain section.  I'm still
> > trying to get my head around it so no patch yet, but I just wanted to
> > bring this up in case anybody beats me to it.
> > 
> 
> 
> What do you mean by "reconnect during drained section"? Trying to
> establish the connection? Or keeping in-flight requests instead of
> cancelling them? We can't keep in-flight requests during drained
> section, as it's the whole sense of drained-section: no in-flight
> requests. So we'll have to wait for them at drain_begin (waiting up to
> reconnect-delay, which doesn't seem good and triggers dead-lock for
> non-iothread environment) or just cancel them..
> 
> Do you argue that waiting on drained-begin is somehow better than
> cancelling?

Sorry I should have used more accurate wording to be clear.

Yes, my point is that canceling the requests on entry to a drained
section is incorrect.  And no, it doesn't matter if it can be long:
that's the price you pay for doing the drain.  Think of reconnect as a
special case of a slow connection: if an nbd reply from the server is
delayed for whatever reason without dropping the connection, you have to
wait here, too.  (In fact, reconnect is even slightly better here since
it has a configurable finite timeout while the message delay does not
AFAIK.)

Does it make sense?

> Or, the problem is that after drained section (assume it was short) we
> are in NBD_CLIENT_CONNECTING_NOWAIT ?  Fixing it should be simple
> enough, we just need to check the time at drained_end and if the delay
> is not expired enable NBD_CLIENT_CONNECTING_WAIT again..

Hmm, I didn't get thus far but this is probably an issue too.
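
If we go that route, the drain_end handler would need to remember the
original deadline and do something like this (sketch only, with a
hypothetical reconnect_deadline_ns field recorded in drain_begin when the
state is downgraded):

static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)
{
    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;

    s->drained = false;
    /* re-enable waiting for reconnect if the delay hasn't expired yet */
    if (s->state == NBD_CLIENT_CONNECTING_NOWAIT &&
        qemu_clock_get_ns(QEMU_CLOCK_REALTIME) < s->reconnect_deadline_ns) {
        s->state = NBD_CLIENT_CONNECTING_WAIT;
    }
    if (s->wait_drained_end) {
        s->wait_drained_end = false;
        aio_co_wake(s->connection_co);
    }
}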

Thanks,
Roman.



Re: [PATCH 1/4] block/nbd: fix drain dead-lock because of nbd reconnect-delay

2021-02-03 Thread Roman Kagan
On Thu, Sep 03, 2020 at 10:02:58PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> We pause reconnect process during drained section. So, if we have some
> requests, waiting for reconnect we should cancel them, otherwise they
> deadlock the drained section.
> 
> How to reproduce:
> 
> 1. Create an image:
>qemu-img create -f qcow2 xx 100M
> 
> 2. Start NBD server:
>qemu-nbd xx
> 
> 3. Start vm with second nbd disk on node2, like this:
> 
>   ./build/x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
>  file=/work/images/cent7.qcow2 -drive \
>  
> driver=nbd,server.type=inet,server.host=192.168.100.5,server.port=10809,reconnect-delay=60
>  \
>  -vnc :0 -m 2G -enable-kvm -vga std
> 
> 4. Access the vm through vnc (or some other way?), and check that NBD
>drive works:
> 
>dd if=/dev/sdb of=/dev/null bs=1M count=10
> 
>- the command should succeed.
> 
> 5. Now, kill the nbd server, and run dd in the guest again:
> 
>dd if=/dev/sdb of=/dev/null bs=1M count=10
> 
> Now Qemu is trying to reconnect, and dd-generated requests are waiting
> for the connection (they will wait up to 60 seconds (see
> reconnect-delay option above) and than fail). But suddenly, vm may
> totally hang in the deadlock. You may need to increase reconnect-delay
> period to catch the dead-lock.
> 
> VM doesn't respond because drain dead-lock happens in cpu thread with
> global mutex taken. That's not good thing by itself and is not fixed
> by this commit (true way is using iothreads). Still this commit fixes
> drain dead-lock itself.
> 
> Note: probably, we can instead continue to reconnect during drained
> section. To achieve this, we may move negotiation to the connect thread
> to make it independent of bs aio context. But expanding drained section
> doesn't seem good anyway. So, let's now fix the bug the simplest way.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index 9daf003bea..912ea27be7 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -242,6 +242,11 @@ static void coroutine_fn 
> nbd_client_co_drain_begin(BlockDriverState *bs)
>  }
>  
>  nbd_co_establish_connection_cancel(bs, false);
> +
> +if (s->state == NBD_CLIENT_CONNECTING_WAIT) {
> +s->state = NBD_CLIENT_CONNECTING_NOWAIT;
> +qemu_co_queue_restart_all(&s->free_sema);
> +}
>  }
>  
>  static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)

This basically defeats the whole purpose of reconnect: if the nbd client
is trying to reconnect, drain effectively cancels that and makes all
in-flight requests complete with an error.

I'm not suggesting to revert this patch (it's now in the tree as commit
8c517de24a), because the deadlock is no better, but I'm afraid the only
real fix is to implement reconnect during the drain section.  I'm still
trying to get my head around it so no patch yet, but I just wanted to
bring this up in case anybody beats me to it.

Thanks,
Roman.


