Re: [PATCH v2 1/1] pcie: Add hotplug detect state register to cmask

2023-07-10 Thread Leonardo Bras Soares Passos
On Mon, Jul 10, 2023 at 3:16 PM Michael S. Tsirkin  wrote:
>
> On Mon, Jul 10, 2023 at 02:49:05PM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Jul 6, 2023 at 5:00 PM Michael S. Tsirkin  wrote:
> > >
> > > On Thu, Jul 06, 2023 at 03:02:07PM -0400, Peter Xu wrote:
> > > > On Thu, Jul 06, 2023 at 02:50:20PM -0400, Michael S. Tsirkin wrote:
> > > > > On Thu, Jul 06, 2023 at 02:14:37PM -0400, Peter Xu wrote:
> > > > > > On Thu, Jul 06, 2023 at 03:07:40PM -0300, Leonardo Bras Soares 
> > > > > > Passos wrote:
> > > > > > > > I asked the same question, and I still keep confused: whether 
> > > > > > > > there's a
> > > > > > > > first bad commit?  Starting from when it fails?
> > > > > > > >
> > > > > > > > For example, is this broken on 6.0 binaries too with pc-q35-6.0?
> > > > > > >
> > > > > > > I tested for qemu 6.0, and it still reproduces, but have not 
> > > > > > > pursued
> > > > > > > this any further.
> > > > > >
> > > > > > I see, thanks!
> > > > > >
> > > > > > But then do you know why it's never hit before?  I assume it means 
> > > > > > this bug
> > > > > > has been there for a long time.
> > > > >
> > > > > It's a race - you have to migrate after the bit has been set but
> > > > > before it gets cleared.
> > > > > cmask is exactly for bits that qemu modifies itself.
> > > >
> > > > Michael, do you mean that Leo's patch is wrong?
> > >
> > >
> > > I mean his patch is exactly right. cmask was designed with this
> > > kind of use case in mind.
> > > Will queue.
> >
> > Thanks Michael!
> >
> > Any chance this will get in on time for v8.1 ?
>
> Yes, working on pull request now.

Thanks!

>
>
> > >
> > > > I just understood why it got cleared - I think Leo didn't mention 
> > > > that
> > > > the device was actually offlined before migration, IIUC that's why the 
> > > > PDS
> > > > bit got cleared, if PDS was trying to describe that of the slot.
> > > >
> > > > According to:
> > > >
> > > > /* Used to enable checks on load. Note that writable bits are
> > > >  * never checked even if set in cmask. */
> > > > uint8_t *cmask;
> > > >
> > > > It does sound reasonable to me to have PDS cleared when the device is
> > > > offlined. Since the hypervisor doesn't really know what the state of
> > > > the slot presence bit would be when migrating, it seems we should just
> > > > clear the bit in cmask.
> > > >
> > > > So with the last reply from Leo, the patch looks all right to me.  It's
> > > > just that as Leo mentioned, we should mention the offline process if 
> > > > that's
> > > > the case, because that's definitely an important step to reproduce the 
> > > > issue.
> > > >
> > > > Thanks,
> > >
> > > If you want to suggest more text to the commit log, for the benefit
> > > of backporters, that is fine by me.
> > >
> > > > --
> > > > Peter Xu
> > >
>




Re: [PATCH v2 1/1] pcie: Add hotplug detect state register to cmask

2023-07-10 Thread Leonardo Bras Soares Passos
On Thu, Jul 6, 2023 at 5:00 PM Michael S. Tsirkin  wrote:
>
> On Thu, Jul 06, 2023 at 03:02:07PM -0400, Peter Xu wrote:
> > On Thu, Jul 06, 2023 at 02:50:20PM -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jul 06, 2023 at 02:14:37PM -0400, Peter Xu wrote:
> > > > On Thu, Jul 06, 2023 at 03:07:40PM -0300, Leonardo Bras Soares Passos 
> > > > wrote:
> > > > > > I asked the same question, and I still keep confused: whether 
> > > > > > there's a
> > > > > > first bad commit?  Starting from when it fails?
> > > > > >
> > > > > > For example, is this broken on 6.0 binaries too with pc-q35-6.0?
> > > > >
> > > > > I tested for qemu 6.0, and it still reproduces, but have not pursued
> > > > > this any further.
> > > >
> > > > I see, thanks!
> > > >
> > > > But then do you know why it's never hit before?  I assume it means this 
> > > > bug
> > > > has been there for a long time.
> > >
> > > It's a race - you have to migrate after the bit has been set but
> > > before it gets cleared.
> > > cmask is exactly for bits that qemu modifies itself.
> >
> > Michael, do you mean that Leo's patch is wrong?
>
>
> I mean his patch is exactly right. cmask was designed with this
> kind of use case in mind.
> Will queue.

Thanks Michael!

Any chance this will get in on time for v8.1 ?

>
> > I just understood why it got cleared - I think Leo didn't mention that
> > the device was actually offlined before migration, IIUC that's why the PDS
> > bit got cleared, if PDS was trying to describe that of the slot.
> >
> > According to:
> >
> > /* Used to enable checks on load. Note that writable bits are
> >  * never checked even if set in cmask. */
> > uint8_t *cmask;
> >
> > It does sound reasonable to me to have PDS cleared when the device is
> > offlined. Since the hypervisor doesn't really know what the state of the
> > slot presence bit would be when migrating, it seems we should just clear
> > the bit in cmask.
> >
> > So with the last reply from Leo, the patch looks all right to me.  It's
> > just that as Leo mentioned, we should mention the offline process if that's
> > the case, because that's definitely an important step to reproduce the 
> > issue.
> >
> > Thanks,
>
> If you want to suggest more text to the commit log, for the benefit
> of backporters, that is fine by me.
>
> > --
> > Peter Xu
>




Re: [PATCH v2 1/1] pcie: Add hotplug detect state register to cmask

2023-07-06 Thread Leonardo Bras Soares Passos
On Thu, Jul 6, 2023 at 3:24 PM Peter Xu  wrote:
>
> On Thu, Jul 06, 2023 at 03:07:40PM -0300, Leonardo Bras Soares Passos wrote:
> > > I asked the same question, and I still keep confused: whether there's a
> > > first bad commit?  Starting from when it fails?
> > >
> > > For example, is this broken on 6.0 binaries too with pc-q35-6.0?
> >
> > I tested for qemu 6.0, and it still reproduces, but have not pursued
> > this any further.
>
> I see, thanks!
>
> But then do you know why it's never hit before?  I assume it means this bug
> has been there for a long time.

Even longer than expected:

I did some testing looking for the bug:

qemu v5.0.1 reproduces
qemu v4.0.1 reproduces
qemu v3.0.1 reproduces
qemu v2.12.1 reproduces

I decided to stop testing at this point: building qemu that old requires
python2, and 5 years back is far enough.

Seems to be a very old bug that just hasn't bothered anyone until now.

>
> --
> Peter Xu
>

Thanks!
Leo




Re: [PATCH v2 1/1] pcie: Add hotplug detect state register to cmask

2023-07-06 Thread Leonardo Bras Soares Passos
On Thu, Jul 6, 2023 at 3:24 PM Peter Xu  wrote:
>
> On Thu, Jul 06, 2023 at 03:07:40PM -0300, Leonardo Bras Soares Passos wrote:
> > > I asked the same question, and I still keep confused: whether there's a
> > > first bad commit?  Starting from when it fails?
> > >
> > > For example, is this broken on 6.0 binaries too with pc-q35-6.0?
> >
> > I tested for qemu 6.0, and it still reproduces, but have not pursued
> > this any further.
>
> I see, thanks!
>
> But then do you know why it's never hit before?  I assume it means this bug
> has been there for a long time.


Oh, I totally missed updating the commit msg on this:

---
In this scenario, hotplug_handler_plug() calls pcie_cap_slot_plug_cb(),
which sets dev->config byte 0x6e with bit PCI_EXP_SLTSTA_PDS to signal PCI
- hotplug for the guest. After a while the guest will deal with this hotplug
- and qemu will clear the above bit.
+ hotplug for the guest. After a while, if the guest powers down the device
+ qemu will clear the above bit.
---

The whole idea is that the guest powers down the device, which causes
qemu to hot-remove it, and clear PCI_EXP_SLTSTA_PDS.

---
/*
 * If the slot is populated, power indicator is off and power
 * controller is off, it is safe to detach the devices.
 *
 * Note: don't detach if condition was already true:
 * this is a work around for guests that overwrite
 * control of powered off slots before powering them on.
 */
if ((sltsta & PCI_EXP_SLTSTA_PDS) && pcie_sltctl_powered_off(val) &&
    !pcie_sltctl_powered_off(old_slt_ctl)) {
    pcie_cap_slot_do_unplug(dev); // clear PCI_EXP_SLTSTA_PDS
}
---

Since the bit is different on source & target qemu, the migration is aborted.
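For reference, the load-time check that trips here looks roughly like this
(paraphrased from get_pci_config_device() in hw/pci/pci.c; exact code may
differ between versions). Clearing PCI_EXP_SLTSTA_PDS in cmask masks that
bit out of the comparison:

---
/* config[] is the migrated config space, s->config[] is the freshly
 * created device's. A byte fails the check only where cmask says
 * "compare this" and neither wmask nor w1cmask marks it guest-writable. */
for (i = 0; i < size; ++i) {
    if ((config[i] ^ s->config[i]) &
        s->cmask[i] & ~s->wmask[i] & ~s->w1cmask[i]) {
        error_report("%s: Bad config data: i=0x%x ...", __func__, i);
        return -EINVAL;
    }
}
---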


>
> --
> Peter Xu
>

Thanks for reviewing Peter!
I will send a v3 with the updated commit msg and add the comments
suggested by Juan in the source code.

Thanks!
Leo




Re: [PATCH v2 1/1] pcie: Add hotplug detect state register to cmask

2023-07-06 Thread Leonardo Bras Soares Passos
On Thu, Jul 6, 2023 at 11:35 AM Peter Xu  wrote:
>
> On Thu, Jul 06, 2023 at 01:55:47AM -0300, Leonardo Bras wrote:
> > When trying to migrate a machine type pc-q35-6.0 or lower, with this
> > cmdline options,
> >
> > -device 
> > driver=pcie-root-port,port=18,chassis=19,id=pcie-root-port18,bus=pcie.0,addr=0x12
> >  \
> > -device 
> > driver=nec-usb-xhci,p2=4,p3=4,id=nex-usb-xhci0,bus=pcie-root-port18,addr=0x12.0x1
> >
> > the following bug happens after all ram pages were sent:
> >
> > qemu-kvm: get_pci_config_device: Bad config data: i=0x6e read: 0 device: 40 
> > cmask: ff wmask: 0 w1cmask:19
> > qemu-kvm: Failed to load PCIDevice:config
> > qemu-kvm: Failed to load pcie-root-port:parent_obj.parent_obj.parent_obj
> > qemu-kvm: error while loading state for instance 0x0 of device 
> > '0000:00:12.0/pcie-root-port'
> > qemu-kvm: load of migration failed: Invalid argument
> >
> > This happens on pc-q35-6.0 or lower because of:
> > { "ICH9-LPC", ACPI_PM_PROP_ACPI_PCIHP_BRIDGE, "off" }
> >
> > In this scenario, hotplug_handler_plug() calls pcie_cap_slot_plug_cb(),
> > which sets dev->config byte 0x6e with bit PCI_EXP_SLTSTA_PDS to signal PCI
> > hotplug for the guest. After a while the guest will deal with this hotplug
> > and qemu will clear the above bit.
>
> Do you mean that the bit will be cleared after this point for the whole
> lifecycle of the VM, as long as the pcie topology doesn't change again?
>
> "This bit indicates the presence of an adapter in the slot"
>
> IIUC the adapter in the slot is there, why it's cleared rather than set?

For some reason the guest is powering down the device, and we have in qemu:

/*
 * If the slot is populated, power indicator is off and power
 * controller is off, it is safe to detach the devices.
 *
 * Note: don't detach if condition was already true:
 * this is a work around for guests that overwrite
 * control of powered off slots before powering them on.
 */
if ((sltsta & PCI_EXP_SLTSTA_PDS) && pcie_sltctl_powered_off(val) &&
    !pcie_sltctl_powered_off(old_slt_ctl)) {
    pcie_cap_slot_do_unplug(dev);  // clears PCI_EXP_SLTSTA_PDS
}
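(For reference, a Linux guest can trigger exactly this transition through
the pciehp sysfs interface; the slot number below is hypothetical:

echo 0 > /sys/bus/pci/slots/0/power

Writing 0 turns the slot's power controller off, which is the Slot Control
write that the check above reacts to.)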


>
> >
> > Then, during migration, get_pci_config_device() will compare the
> > configs of both the freshly created device and the one that is being
> > received via migration, which will differ due to the PCI_EXP_SLTSTA_PDS bit
> > and cause the bug to reproduce.
> >
> > To avoid this fake incompatibility, there are three fields in PCIDevice
> > that can help:
> >
> > - wmask: Used to implement R/W bytes,
> > - w1cmask: Used to implement RW1C (Write 1 to Clear) bytes, and
> > - cmask: Used to enable config checks on load.
> >
> > According to PCI Express® Base Specification Revision 5.0 Version 1.0,
> > table 7-27 (Slot Status Register) bit 6, the "Presence Detect State" is
> > listed as RO (read-only), so it only makes sense to make use of the cmask
> > field.
> >
> > So, clear PCI_EXP_SLTSTA_PDS bit on cmask, so the fake incompatibility on
> > get_pci_config_device() does not abort the migration.
>
> Yes, using cmask makes more sense to me, but we'd need some pci developer
> to ack it at last I guess, anyway.

Agree! I am waiting for Michael's opinion on this.

>
> >
> > Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2215819
> > Signed-off-by: Leonardo Bras 
>
> I asked the same question, and I still keep confused: whether there's a
> first bad commit?  Starting from when it fails?
>
> For example, is this broken on 6.0 binaries too with pc-q35-6.0?

I tested for qemu 6.0, and it still reproduces, but have not pursued
this any further.

>
> Thanks,


Thank you!
Leo

>
> > ---
> >  hw/pci/pcie.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > index b8c24cf45f..cae56bf1c8 100644
> > --- a/hw/pci/pcie.c
> > +++ b/hw/pci/pcie.c
> > @@ -659,6 +659,10 @@ void pcie_cap_slot_init(PCIDevice *dev, PCIESlot *s)
> >  pci_word_test_and_set_mask(dev->w1cmask + pos + PCI_EXP_SLTSTA,
> > PCI_EXP_HP_EV_SUPPORTED);
> >
> > +/* Avoid migration abortion when this device hot-removed by guest */
> > +pci_word_test_and_clear_mask(dev->cmask + pos + PCI_EXP_SLTSTA,
> > + PCI_EXP_SLTSTA_PDS);
> > +
> >  dev->exp.hpev_notified = false;
> >
> >  qbus_set_hotplug_handler(BUS(pci_bridge_get_sec_bus(PCI_BRIDGE(dev))),
> > --
> > 2.41.0
> >
>
> --
> Peter Xu
>
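For readers new to these helpers, pci_word_test_and_clear_mask() used in the
hunk above behaves roughly like this (a sketch of the helper from
include/hw/pci/pci.h; details may differ slightly between versions):

---
static inline uint16_t
pci_word_test_and_clear_mask(uint8_t *config, uint16_t mask)
{
    uint16_t val = pci_get_word(config);
    pci_set_word(config, val & ~mask);
    return val & mask;
}
---

Applied to dev->cmask + pos + PCI_EXP_SLTSTA, it drops PDS from the
"compare on load" mask while leaving the rest of the Slot Status checks
intact.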




Re: [PATCH v2 1/1] pcie: Add hotplug detect state register to cmask

2023-07-06 Thread Leonardo Bras Soares Passos
On Thu, Jul 6, 2023 at 4:37 AM Juan Quintela  wrote:
>
> Leonardo Bras  wrote:
> > When trying to migrate a machine type pc-q35-6.0 or lower, with this
> > cmdline options,
> >
> > -device 
> > driver=pcie-root-port,port=18,chassis=19,id=pcie-root-port18,bus=pcie.0,addr=0x12
> >  \
> > -device 
> > driver=nec-usb-xhci,p2=4,p3=4,id=nex-usb-xhci0,bus=pcie-root-port18,addr=0x12.0x1
> >
> > the following bug happens after all ram pages were sent:
> >
> > qemu-kvm: get_pci_config_device: Bad config data: i=0x6e read: 0 device: 40 
> > cmask: ff wmask: 0 w1cmask:19
> > qemu-kvm: Failed to load PCIDevice:config
> > qemu-kvm: Failed to load pcie-root-port:parent_obj.parent_obj.parent_obj
> > qemu-kvm: error while loading state for instance 0x0 of device 
> > '0000:00:12.0/pcie-root-port'
> > qemu-kvm: load of migration failed: Invalid argument
> >
> > This happens on pc-q35-6.0 or lower because of:
> > { "ICH9-LPC", ACPI_PM_PROP_ACPI_PCIHP_BRIDGE, "off" }
> >
> > In this scenario, hotplug_handler_plug() calls pcie_cap_slot_plug_cb(),
> > which sets dev->config byte 0x6e with bit PCI_EXP_SLTSTA_PDS to signal PCI
> > hotplug for the guest. After a while the guest will deal with this hotplug
> > and qemu will clear the above bit.
> >
> > Then, during migration, get_pci_config_device() will compare the
> > configs of both the freshly created device and the one that is being
> > received via migration, which will differ due to the PCI_EXP_SLTSTA_PDS bit
> > and cause the bug to reproduce.
> >
> > To avoid this fake incompatibility, there are three fields in PCIDevice
> > that can help:
> >
> > - wmask: Used to implement R/W bytes,
> > - w1cmask: Used to implement RW1C (Write 1 to Clear) bytes, and
> > - cmask: Used to enable config checks on load.
> >
> > According to PCI Express® Base Specification Revision 5.0 Version 1.0,
> > table 7-27 (Slot Status Register) bit 6, the "Presence Detect State" is
> > listed as RO (read-only), so it only makes sense to make use of the cmask
> > field.
> >
> > So, clear PCI_EXP_SLTSTA_PDS bit on cmask, so the fake incompatibility on
> > get_pci_config_device() does not abort the migration.
> >
> > Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2215819
> > Signed-off-by: Leonardo Bras 
>
>
>
>
> > ---
> >  hw/pci/pcie.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > index b8c24cf45f..cae56bf1c8 100644
> > --- a/hw/pci/pcie.c
> > +++ b/hw/pci/pcie.c
> > @@ -659,6 +659,10 @@ void pcie_cap_slot_init(PCIDevice *dev, PCIESlot *s)
> >  pci_word_test_and_set_mask(dev->w1cmask + pos + PCI_EXP_SLTSTA,
> > PCI_EXP_HP_EV_SUPPORTED);
> >
> > +/* Avoid migration abortion when this device hot-removed by guest */
>
> I would have included here the text in the commit:
>
>  According to PCI Express® Base Specification Revision 5.0 Version 1.0,
>  table 7-27 (Slot Status Register) bit 6, the "Presence Detect State" is
>  listed as RO (read-only), so it only makes sense to make use of the cmask
>  field.
>
> and
>
> This happens on pc-q35-6.0 or lower because of:
> { "ICH9-LPC", ACPI_PM_PROP_ACPI_PCIHP_BRIDGE, "off" }
>
> so if we ever remove the machine type pc-q35-6.0, we can drop it.

It makes sense adding this to the code.

In the remove machine-type case, IIUC we would have to drop every
machine-type older than pc-q35-6.0.
Also, we would not support migration to any machine without
ACPI_PM_PROP_ACPI_PCIHP_BRIDGE, right?

>
> Yes, I know that we don't drop machine types, but we should at some point.
>
>
> > +pci_word_test_and_clear_mask(dev->cmask + pos + PCI_EXP_SLTSTA,
> > + PCI_EXP_SLTSTA_PDS);
> > +
> >  dev->exp.hpev_notified = false;
> >
> >  qbus_set_hotplug_handler(BUS(pci_bridge_get_sec_bus(PCI_BRIDGE(dev))),
>
> I agree that this is (at least) a step on the right direction.
>
> I would have expected to need some check related to the value
> of:
>
> { "ICH9-LPC", ACPI_PM_PROP_ACPI_PCIHP_BRIDGE, "off" }
>

This bug affects versions older than qemu 6.0. If we add this, we
would have some extra work backporting this to older versions (if
necessary) because the 'property' did not exist back then.

> But I will not claim _any_ understanding of the PCI specification.
>
> So:
>
> Reviewed-by: Juan Quintela 
>
> about that it fixes the migration bug.
>

Thanks Juan!
Leo




Re: [PATCH 1/1] pcie: Add hotplug detect state register to w1cmask

2023-07-05 Thread Leonardo Bras Soares Passos
On Tue, Jul 4, 2023 at 3:43 AM Michael S. Tsirkin  wrote:
>
> On Tue, Jul 04, 2023 at 03:20:36AM -0300, Leonardo Brás wrote:
> > Hello Peter and Michael, I have a few updates on this:
> >
> > On Mon, 2023-07-03 at 02:20 -0300, Leonardo Brás wrote:
> > > Hello Peter and Michael, thanks for reviewing!
> > >
> > >
> > > On Thu, 2023-06-29 at 16:56 -0400, Peter Xu wrote:
> > > > On Thu, Jun 29, 2023 at 04:06:53PM -0400, Michael S. Tsirkin wrote:
> > > > > On Thu, Jun 29, 2023 at 04:01:41PM -0400, Peter Xu wrote:
> > > > > > On Thu, Jun 29, 2023 at 03:33:06PM -0400, Michael S. Tsirkin wrote:
> > > > > > > On Thu, Jun 29, 2023 at 01:01:53PM -0400, Peter Xu wrote:
> > > > > > > > Hi, Leo,
> > > > > > > >
> > > > > > > > Thanks for figuring this out.  Let me copy a few more potential 
> > > > > > > > reviewers
> > > > > > > > from commit 17858a1695 ("hw/acpi/ich9: Set ACPI PCI hot-plug as 
> > > > > > > > default on
> > > > > > > > Q35").
> > > > > > > >
> > > > > > > > On Thu, Jun 29, 2023 at 06:05:00AM -0300, Leonardo Bras wrote:
> > > > > > > > > When trying to migrate a machine type pc-q35-6.0 or lower, 
> > > > > > > > > with this
> > > > > > > > > cmdline options:
> > > > > > > > >
> > > > > > > > > -device 
> > > > > > > > > driver=pcie-root-port,port=18,chassis=19,id=pcie-root-port18,bus=pcie.0,addr=0x12
> > > > > > > > >  \
> > > > > > > > > -device 
> > > > > > > > > driver=nec-usb-xhci,p2=4,p3=4,id=nex-usb-xhci0,bus=pcie-root-port18,addr=0x12.0x1
> > > > > > > > >
> > > > > > > > > the following bug happens after all ram pages were sent:
> > > > > > > > >
> > > > > > > > > qemu-kvm: get_pci_config_device: Bad config data: i=0x6e 
> > > > > > > > > read: 0 device: 40 cmask: ff wmask: 0 w1cmask:19
> > > > > > > > > qemu-kvm: Failed to load PCIDevice:config
> > > > > > > > > qemu-kvm: Failed to load 
> > > > > > > > > pcie-root-port:parent_obj.parent_obj.parent_obj
> > > > > > > > > qemu-kvm: error while loading state for instance 0x0 of 
> > > > > > > > > > device '0000:00:12.0/pcie-root-port'
> > > > > > > > > qemu-kvm: load of migration failed: Invalid argument
> > > > > > > > >
> > > > > > > > > This happens on pc-q35-6.0 or lower because of:
> > > > > > > > > { "ICH9-LPC", ACPI_PM_PROP_ACPI_PCIHP_BRIDGE, "off" }
> > > > > > > > >
> > > > > > > > > In this scenario, hotplug_handler_plug() calls 
> > > > > > > > > pcie_cap_slot_plug_cb(),
> > > > > > > > > which sets the bus dev->config byte 0x6e with bit 
> > > > > > > > > PCI_EXP_SLTSTA_PDS to
> > > > > > > > > signal PCI hotplug for the guest. After a while the guest 
> > > > > > > > > will deal with
> > > > > > > > > this hotplug and qemu will clear the above bit.
> > > > > > >
> > > > > > > Presence Detect State – This bit indicates the presence of an
> > > > > > > adapter in the slot, reflected by the logical “OR” of the Physical
> > > > > > > Layer in-band presence detect mechanism and, if present, any
> > > > > > > out-of-band presence detect mechanism defined for the slot’s
> > > > > > > corresponding form factor. Note that the in-band presence
> > > > > > > detect mechanism requires that power be applied to an adapter
> > > > > > > for its presence to be detected. Consequently, form factors that
> > > > > > > require a power controller for hot-plug must implement a
> > > > > > > physical pin presence detect mechanism.
> > > > > > > RO
> > > > > > > Defined encodings are:
> > > > > > > 0b Slot Empty
> > > > > > > 1b Card Present in slot
> > > > > > > This bit must be implemented on all Downstream Ports that
> > > > > > > implement slots. For Downstream Ports not connected to slots
> > > > > > > (where the Slot Implemented bit of the PCI Express Capabilities
> > > > > > > register is 0b), this bit must be hardwired to 1b.
> > >
> > > Thank you for providing this doc!
> > > I am new to PCI stuff, could you please point this doc?
> >
> > (I mean, the link to the documentation)
>
> The pci specs are all here: https://pcisig.com/
> Red Hat is a member so just register, it's free.
>
> I'd get the 5.0 version of pci express base:
> https://members.pcisig.com/wg/PCI-SIG/document/13005
>
> 6.0 is out but they did something to make it take years to open,
> and it shouldn't matter for this.

This is great! Thanks for sharing!

>
> > >
> > > > > > >
> > > > > > >
> > > > > > > And this seems to match what QEMU is doing: it clears on unplug
> > > > > > > not after guest deals with hotplug.
> > >
> > > Oh, that's weird.
> > > It should not unplug the device, so IIUC it should not clear the bit.
> > > Maybe something weird is happening in the guest, I will take a look.
> >
> > Updates on this:
> > You are right! For some reason the guest is hot-unplugging the device under 
> > some
> > conditions, so there is another bug on this for me to look after.
> >
> > >
> > > > > > >
> > > > > > >
> > > > > > > > > Then, during migration, get_pci_config_device() will compare 
> > > > > > > > > the
> > > > > > > > > configs of both the freshly created device and the one 

Re: [PATCH v2 16/16] migration/multifd: Compute transferred bytes correctly

2023-05-26 Thread Leonardo Bras Soares Passos
On Fri, May 26, 2023 at 5:24 AM Juan Quintela  wrote:
>
> Leonardo Brás  wrote:
> > On Mon, 2023-05-15 at 21:57 +0200, Juan Quintela wrote:
> >> In the past, we had to put all the operations related to sizes in the
> >> main thread due to qemu_file not being thread safe.  As now all
> >> counters are atomic, we can update the counters just after we do the
> >> write.  As an additional bonus, we are able to use the right value for
> >> the compression methods.  Right now we were assuming that there was no
> >> compression at all.
> >>
> >> Signed-off-by: Juan Quintela 
> >> ---
> >>  migration/multifd.c | 13 ++++++++-----
> >>  1 file changed, 8 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/migration/multifd.c b/migration/multifd.c
> >> index aabf9b6d98..0bf5958a9c 100644
> >> --- a/migration/multifd.c
> >> +++ b/migration/multifd.c
> >> @@ -175,6 +175,7 @@ void multifd_register_ops(int method, MultiFDMethods 
> >> *ops)
> >>  static int multifd_send_initial_packet(MultiFDSendParams *p, Error **errp)
> >>  {
> >>  MultiFDInit_t msg = {};
> >> +size_t size = sizeof(msg);
> >>  int ret;
> >>
> >>  msg.magic = cpu_to_be32(MULTIFD_MAGIC);
> >> @@ -182,10 +183,12 @@ static int 
> >> multifd_send_initial_packet(MultiFDSendParams *p, Error **errp)
> >>  msg.id = p->id;
> >>  memcpy(msg.uuid, &qemu_uuid.data, sizeof(msg.uuid));
> >>
> >> -ret = qio_channel_write_all(p->c, (char *)&msg, sizeof(msg), errp);
> >> +ret = qio_channel_write_all(p->c, (char *)&msg, size, errp);
> >>  if (ret != 0) {
> >>  return -1;
> >>  }
> >> +stat64_add(&mig_stats.multifd_bytes, size);
> >> +stat64_add(&mig_stats.transferred, size);
> >>  return 0;
> >>  }
> >
> > Humm, those are atomic ops, right?
> >
> > You think we could have 'multifd_bytes' and 'transferred' in the same 
> > cacheline,
> > to avoid 2 cacheline bounces?
>
> Don't matter on next series.
>
> mig_stats.transferred is dropped.
>
> And transferred becomes:
>
> qemu_file_transferred + multifd_bytes + rdma_bytes.
>
> So everytime that we do a write, we only update one counter.

That's even better :)

Thanks!
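In sketch form, the aggregation Juan describes would look something like
this (an assumption based on the names used above, not the final code):

---
uint64_t migration_transferred_bytes(QEMUFile *f)
{
    return qemu_file_transferred(f) +
           stat64_get(&mig_stats.multifd_bytes) +
           stat64_get(&mig_stats.rdma_bytes);
}
---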

>
> > Well, it's unrelated to this patchset, so:
> >
> > Reviewed-by: Leonardo Bras 
>




Re: [PATCH v2 15/16] migration/rdma: Simplify the function that saves a page

2023-05-26 Thread Leonardo Bras Soares Passos
On Fri, May 26, 2023 at 5:21 AM Juan Quintela  wrote:
>
> Leonardo Brás  wrote:
> > On Mon, 2023-05-15 at 21:57 +0200, Juan Quintela wrote:
> >> When we send a page through QEMUFile hooks (RDMA) there are three
> >> possibilities:
> >> - We are not using RDMA. return RAM_SAVE_CONTROL_DELAYED and
> >>   control_save_page() returns false to let anything else proceed.
> >> - There is one error but we are using RDMA.  Then we return a negative
> >>   value, control_save_page() needs to return true.
> >> - Everything goes well and RDMA starts sending the page
> >>   asynchronously.  It returns RAM_SAVE_CONTROL_DELAYED and we need to
> >>   return 1 for ram_save_page_legacy.
> >>
> >> Clear?
> >>
> >> I know, I know, the interfaz is as bad as it gets.  I think that now
> >> it is a bit clearer, but this needs to be done some other way.
> >
> > interface?
>
> Yeap.  I used the Spanish spelling, that, you know, in English is wrong O:-)

Happens to me all the time :)
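To make the contract concrete, here is the caller's view of the three cases
in sketch form (rdma_enabled() is a made-up placeholder; the other names are
the ones quoted above):

---
int ret = qemu_rdma_save_page(f, block_offset, offset, size);

if (!rdma_enabled()) {
    /* case 1: control_save_page() returns false and the regular
     * page-saving path proceeds */
} else if (ret < 0) {
    /* case 2: RDMA error; control_save_page() returns true and the
     * negative value is propagated as the error */
} else {
    /* case 3: ret == RAM_SAVE_CONTROL_DELAYED, the page is in flight;
     * ram_save_page_legacy reports 1 page sent */
}
---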

>
> Thanks.
> >> diff --git a/migration/rdma.c b/migration/rdma.c
> >> index 416dec00a2..12d3c23fdc 100644
> >> --- a/migration/rdma.c
> >> +++ b/migration/rdma.c
> >> @@ -3239,13 +3239,12 @@ qio_channel_rdma_shutdown(QIOChannel *ioc,
> >>   *
> >>   *@size : Number of bytes to transfer
> >>   *
> >> - *@bytes_sent : User-specificed pointer to indicate how many bytes 
> >> were
> >> + *@pages_sent : User-specificed pointer to indicate how many pages 
> >> were
> >>   *  sent. Usually, this will not be more than a few bytes 
> >> of
> >>   *  the protocol because most transfers are sent 
> >> asynchronously.
> >>   */
> >
> > There is new doc to pages_sent but the parameter is not added to the 
> > signature
> > below. Am I missing something?
>
> Good catch.

:)

>
> I redid this patch several times.  And it appears that I forgot some 
> leftovers.
>
> >
> >> -static size_t qemu_rdma_save_page(QEMUFile *f,
> >> -  ram_addr_t block_offset, ram_addr_t 
> >> offset,
> >> -  size_t size, uint64_t *bytes_sent)
> >> +static int qemu_rdma_save_page(QEMUFile *f, ram_addr_t block_offset,
> >> +   ram_addr_t offset, size_t size)
> >>  {
> >>  QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
> >>  RDMAContext *rdma;
> >> @@ -3277,18 +3276,6 @@ static size_t qemu_rdma_save_page(QEMUFile *f,
> >>  goto err;
> >>  }
> >>
> >> -/*
> >> - * We always return 1 bytes because the RDMA
> >> - * protocol is completely asynchronous. We do not yet know
> >> - * whether an  identified chunk is zero or not because we're
> >> - * waiting for other pages to potentially be merged with
> >> - * the current chunk. So, we have to call qemu_update_position()
> >> - * later on when the actual write occurs.
> >> - */
> >> -if (bytes_sent) {
> >> -*bytes_sent = 1;
> >> -}
> >> -
> >>  /*
> >>   * Drain the Completion Queue if possible, but do not block,
> >>   * just poll.
> >
> > Oh, so this one complements 13/16.
> > Since it doesn't do imaginary transfers anymore, there is no need to
> > use the bytes_sent pointer to keep track of them.
> >
> > Other than the pages_sent above that I couldn't understand:
> > Reviewed-by: Leonardo Bras 
>
> Dropping that bit.
>
> Thanks.
>




Re: [PATCH v2 09/16] migration: We don't need the field rate_limit_used anymore

2023-05-26 Thread Leonardo Bras Soares Passos
On Fri, May 26, 2023 at 5:18 AM Juan Quintela  wrote:
>
> Leonardo Brás  wrote:
> > On Mon, 2023-05-15 at 21:57 +0200, Juan Quintela wrote:
> >> Since the previous commit, we calculate how much data we have sent
> >> with migration_transferred_bytes(), so there is no need to maintain
> >> this counter and remember to always update it.
> >>
> >> Signed-off-by: Juan Quintela 
> >> Reviewed-by: Cédric Le Goater 
>
> > I reviewed this one together with 8/16.
> > It makes sense for me to squash them together, but anyway:
>
> Already in tree.
>
> See explanation for the split on previous patch.
>
> The other reason for the split is that this part of the patch is trivial
> to review O:-)

:-)

>
> Later, Juan.
>
> >
> > Reviewed-by: Leonardo Bras 
>




Re: [PATCH v2 08/16] migration: Use migration_transferred_bytes() to calculate rate_limit

2023-05-26 Thread Leonardo Bras Soares Passos
On Fri, May 26, 2023 at 5:17 AM Juan Quintela  wrote:
>
> Leonardo Brás  wrote:
> > On Mon, 2023-05-15 at 21:57 +0200, Juan Quintela wrote:
> >> Signed-off-by: Juan Quintela 
> >> Reviewed-by: Cédric Le Goater 
> >> ---
> >>  migration/migration-stats.h | 8 +++-
> >>  migration/migration-stats.c | 7 +--
> >>  migration/migration.c   | 2 +-
> >>  3 files changed, 13 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> >> index 91fda378d3..f1465c2ebe 100644
> >> --- a/migration/migration-stats.h
> >> +++ b/migration/migration-stats.h
> >> @@ -81,6 +81,10 @@ typedef struct {
> >>   * Number of bytes sent during precopy stage.
> >>   */
> >>  Stat64 precopy_bytes;
> >> +/*
> >> + * Amount of transferred data at the start of current cycle.
> >> + */
> >> +Stat64 rate_limit_start;
> >>  /*
> >>   * Maximum amount of data we can send in a cycle.
> >>   */
> >> @@ -136,8 +140,10 @@ uint64_t migration_rate_get(void);
> >>   * migration_rate_reset: Reset the rate limit counter.
> >>   *
> >>   * This is called when we know we start a new transfer cycle.
> >> + *
> >> + * @f: QEMUFile used for main migration channel
> >>   */
> >> -void migration_rate_reset(void);
> >> +void migration_rate_reset(QEMUFile *f);
> >>
> >>  /**
> >>   * migration_rate_set: Set the maximum amount that can be transferred.
> >> diff --git a/migration/migration-stats.c b/migration/migration-stats.c
> >> index 301392d208..da2bb69a15 100644
> >> --- a/migration/migration-stats.c
> >> +++ b/migration/migration-stats.c
> >> @@ -31,7 +31,9 @@ bool migration_rate_exceeded(QEMUFile *f)
> >>  return true;
> >>  }
> >>
> >> -uint64_t rate_limit_used = stat64_get(&mig_stats.rate_limit_used);
> >> +uint64_t rate_limit_start = stat64_get(&mig_stats.rate_limit_start);
> >> +uint64_t rate_limit_current = migration_transferred_bytes(f);
> >> +uint64_t rate_limit_used = rate_limit_current - rate_limit_start;
> >>  uint64_t rate_limit_max = stat64_get(&mig_stats.rate_limit_max);
> >
> > So, IIUC, instead of updating mig_stats.rate_limit_used every time data is 
> > sent,
> > the idea is to 'reset' it to migration_transferred_bytes() at the beginning 
> > of a
> > cycle, and read migration_transferred_bytes() again for checking if the 
> > limit
> > was not crossed.
> >
> > Its a nice change since there is no need to update 2 counters, when 1 is 
> > enough.
> >
> > I think it would look nicer if squashed with 9/16, though. It would make it 
> > more
> > clear this is being added to replace migration_rate_account() strategy.
> >
> > What do you think?
>
> Already in tree.

My bad.
After I ended up reviewing the patchset I noticed a lot of it was
already in the PULL request.

>
> Done this way because on my tree there was an intermediate patch that
> did something like:
>
>
> uint64_t rate_limit_used = stat64_get(&mig_stats.rate_limit_used);
> uint64_t rate_limit_start = stat64_get(&mig_stats.rate_limit_start);
> uint64_t rate_limit_current = migration_transferred_bytes(f);
> uint64_t rate_limit_used_new = rate_limit_current - rate_limit_start;
>
> if (rate_limit_used_new != rate_limit_used) {
> printf("rate_limit old %lu new %lu\n", ...);
> }
>
> So I was sure that the counter that I was replacing had the same value
> that the new one.

Oh, I see.
You kept both to verify the implementation.
Makes sense
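Putting the quoted hunks together, the new scheme reads roughly like this
(a sketch, not the exact code):

---
/* at the start of each transfer cycle */
void migration_rate_reset(QEMUFile *f)
{
    stat64_set(&mig_stats.rate_limit_start, migration_transferred_bytes(f));
}

/* during the cycle, no per-write accounting is needed */
bool migration_rate_exceeded(QEMUFile *f)
{
    uint64_t used = migration_transferred_bytes(f) -
                    stat64_get(&mig_stats.rate_limit_start);
    return used > stat64_get(&mig_stats.rate_limit_max);
}
---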

>
> This is the reason why I fixed the transferred atomic in the previous
> patch, not because it mattered in the big scheme of things
> (migration_test was missing something like 100KB for the normal stage
> when I started, which doesn't matter for the calculations).  But to
> check that I was doing things right it mattered.  With that patch my
> replacement counter was exact, and none of the if's triggered.
>
> Except for the device transfer stages, where I missed something like
> 900KB, but it made no sense to go all over the tree to fix a counter
> that I was going to remove later.

Yeah, it makes no sense to invest time on stuff that will be removed later.

Thanks for helping me understand this :)

>
> Regards, Juan.
>




Re: [PATCH v2 04/16] qemu-file: Account for rate_limit usage on qemu_fflush()

2023-05-26 Thread Leonardo Bras Soares Passos
On Fri, May 26, 2023 at 5:09 AM Juan Quintela  wrote:
>
> Leonardo Brás  wrote:
> > On Mon, 2023-05-15 at 21:56 +0200, Juan Quintela wrote:
> >> That is the moment we know we have transferred something.
> >>
> >> Signed-off-by: Juan Quintela 
> >> Reviewed-by: Cédric Le Goater 
> >> ---
> >>  migration/qemu-file.c | 7 +++----
> >>  1 file changed, 3 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> >> index 4bc875b452..956bd2a580 100644
> >> --- a/migration/qemu-file.c
> >> +++ b/migration/qemu-file.c
> >> @@ -302,7 +302,9 @@ void qemu_fflush(QEMUFile *f)
> >> &local_error) < 0) {
> >>  qemu_file_set_error_obj(f, -EIO, local_error);
> >>  } else {
> >> -f->total_transferred += iov_size(f->iov, f->iovcnt);
> >> +uint64_t size = iov_size(f->iov, f->iovcnt);
> >> +qemu_file_acct_rate_limit(f, size);
> >> +f->total_transferred += size;
> >>  }
> >>
> >>  qemu_iovec_release_ram(f);
> >> @@ -519,7 +521,6 @@ void qemu_put_buffer_async(QEMUFile *f, const uint8_t 
> >> *buf, size_t size,
> >>  return;
> >>  }
> >>
> >> -f->rate_limit_used += size;
> >>  add_to_iovec(f, buf, size, may_free);
> >>  }
> >>
> >> @@ -537,7 +538,6 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, 
> >> size_t size)
> >>  l = size;
> >>  }
> >>  memcpy(f->buf + f->buf_index, buf, l);
> >> -f->rate_limit_used += l;
> >>  add_buf_to_iovec(f, l);
> >>  if (qemu_file_get_error(f)) {
> >>  break;
> >> @@ -554,7 +554,6 @@ void qemu_put_byte(QEMUFile *f, int v)
> >>  }
> >>
> >>  f->buf[f->buf_index] = v;
> >> -f->rate_limit_used++;
> >>  add_buf_to_iovec(f, 1);
> >>  }
> >>
> >
> > If we are counting transferred data at fflush, it makes sense to
> > increase rate-limit accounting at the same place. It may be less
> > granular, but is more efficient.
>
> Yeap, the whole point is that in my next series, rate_limit_used
> disappears, we just use transferred for both things(*).
>
> Later, Juan.
>
> *: It is a bit more complicated than that, but we go from three counters
>  to a single counter.
>

Seems great to simplify stuff.
Thanks!
Leo




Re: [PATCH v2 03/16] migration: Move setup_time to mig_stats

2023-05-26 Thread Leonardo Bras Soares Passos
On Fri, May 26, 2023 at 5:07 AM Juan Quintela  wrote:
>
> Leonardo Brás  wrote:
> > On Mon, 2023-05-15 at 21:56 +0200, Juan Quintela wrote:
> >> It is a time that needs to be reset each time we cancel a migration.
> >> While there, create migration_time_since() to calculate how much time
> >> has passed since a point in the past.
> >>
> >> Signed-off-by: Juan Quintela 
> >>
> >> ---
> >>
> >> Rename to migration_time_since (cédric)
> >> ---
> >>  migration/migration-stats.h | 13 +++++++++++++
> >>  migration/migration.h       |  1 -
> >>  migration/migration-stats.c |  7 +++++++
> >>  migration/migration.c       |  9 ++++-----
> >>  4 files changed, 24 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> >> index e782f1b0df..21402af9e4 100644
> >> --- a/migration/migration-stats.h
> >> +++ b/migration/migration-stats.h
> >> @@ -75,6 +75,10 @@ typedef struct {
> >>   * Number of bytes sent during precopy stage.
> >>   */
> >>  Stat64 precopy_bytes;
> >> +/*
> >> + * How long the setup stage took.
> >> + */
> >> +Stat64 setup_time;
> >>  /*
> >>   * Total number of bytes transferred.
> >>   */
> >> @@ -87,4 +91,13 @@ typedef struct {
> >>
> >>  extern MigrationAtomicStats mig_stats;
> >>
> >> +/**
> >> + * migration_time_since: Calculate how much time has passed
> >> + *
> >> + * @stats: migration stats
> >> + * @since: reference time since we want to calculate
> >> + *
> >> + * Returns: Nothing.  The time is stored in stats->setup_time.
> >> + */
> >> +void migration_time_since(MigrationAtomicStats *stats, int64_t since);
> >>  #endif
> >> diff --git a/migration/migration.h b/migration/migration.h
> >> index 48a46123a0..27aa3b1035 100644
> >> --- a/migration/migration.h
> >> +++ b/migration/migration.h
> >> @@ -316,7 +316,6 @@ struct MigrationState {
> >>  int64_t downtime;
> >>  int64_t expected_downtime;
> >>  bool capabilities[MIGRATION_CAPABILITY__MAX];
> >> -int64_t setup_time;
> >>  /*
> >>   * Whether guest was running when we enter the completion stage.
> >>   * If migration is interrupted by any reason, we need to continue
> >> diff --git a/migration/migration-stats.c b/migration/migration-stats.c
> >> index 2f2cea965c..3431453c90 100644
> >> --- a/migration/migration-stats.c
> >> +++ b/migration/migration-stats.c
> >> @@ -12,6 +12,13 @@
> >>
> >>  #include "qemu/osdep.h"
> >>  #include "qemu/stats64.h"
> >> +#include "qemu/timer.h"
> >>  #include "migration-stats.h"
> >>
> >>  MigrationAtomicStats mig_stats;
> >> +
> >> +void migration_time_since(MigrationAtomicStats *stats, int64_t since)
> >> +{
> >> +int64_t now = qemu_clock_get_ms(QEMU_CLOCK_HOST);
> >> +stat64_set(&stats->setup_time, now - since);
> >> +}
> >
> > IIUC this calculates a time delta and saves on stats->setup_time, is that 
> > right?
> >
> > It took me some time to understand that, since the function name is
> > migration_time_since(), which seems more generic.
> >
> > Would not be more intuitive to name it migration_setup_time_set() or so?
>
> Dropped this.
> Other reviewer commented that this was not a counter, what is right.  So
> I left the times for future work (it don't interfere with current
> cleanups).

Oh, it makes sense.

>
>
> > I could not see MigrationState->setup_time being initialized as 0 in this 
> > patch.
> > In a quick look in the code I noticed there is no initialization of this 
> > struct,
> > but on qemu_savevm_state() and migrate_prepare() we have:
> >
> > memset(&mig_stats, 0, sizeof(mig_stats));
> >
> > I suppose this is enough, right?
>
> Yeap.  All migration stats are initialized to zero at the start of
> qemu, or when we start a migration.
>
> After a migration, it doesn't matter if it finished with/without error,
> they are there with the right value until we start another migration (in
> the case of error, of course).

That's great to simplify the code.
Thanks!

>
> Later, Juan.
>




Re: [PATCH v2 02/16] migration: Correct transferred bytes value

2023-05-26 Thread Leonardo Bras Soares Passos
On Fri, May 26, 2023 at 5:04 AM Juan Quintela  wrote:
>
> Leonardo Brás  wrote:
> > On Mon, 2023-05-15 at 21:56 +0200, Juan Quintela wrote:
> >> We forgot in several places to add to the transferred amount of data.
> >> With these fixes I get:
> >>
> >>qemu_file_transferred() + multifd_bytes == transferred
> >>
> >> The only place where this is not true is during device sending.  But
> >> going all through the full tree searching for devices that use
> >> QEMUFile directly is a bit too much.
> >>
> >> Multifd, precopy and xbzrle work as expected. Postcopy still misses 35
> >> bytes, but searching for them is getting complicated, so I stop here.
> >>
> >> Signed-off-by: Juan Quintela 
> >> ---
> >>  migration/ram.c   | 14 ++
> >>  migration/savevm.c| 19 +--
> >>  migration/vmstate.c   |  3 +++
> >>  migration/meson.build |  2 +-
> >>  4 files changed, 35 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/migration/ram.c b/migration/ram.c
> >> index f69d8d42b0..fd5a8db0f8 100644
> >> --- a/migration/ram.c
> >> +++ b/migration/ram.c
> >> @@ -337,6 +337,7 @@ int64_t ramblock_recv_bitmap_send(QEMUFile *file,
> >>
> >>  g_free(le_bitmap);
> >>
> >> +stat64_add(&mig_stats.transferred, 8 + size + 8);
> >>  if (qemu_file_get_error(file)) {
> >>  return qemu_file_get_error(file);
> >>  }
> >> @@ -1392,6 +1393,7 @@ static int find_dirty_block(RAMState *rs, 
> >> PageSearchStatus *pss)
> >>  return ret;
> >>  }
> >>  qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
> >> +stat64_add(&mig_stats.transferred, 8);
> >>  qemu_fflush(f);
> >>  }
> >>  /*
> >> @@ -3020,6 +3022,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> >>  RAMState **rsp = opaque;
> >>  RAMBlock *block;
> >>  int ret;
> >> +size_t size = 0;
> >>
> >>  if (compress_threads_save_setup()) {
> >>  return -1;
> >> @@ -3038,16 +3041,20 @@ static int ram_save_setup(QEMUFile *f, void 
> >> *opaque)
> >>  qemu_put_be64(f, ram_bytes_total_with_ignored()
> >>   | RAM_SAVE_FLAG_MEM_SIZE);
> >>
> >> +size += 8;
> >>  RAMBLOCK_FOREACH_MIGRATABLE(block) {
> >>  qemu_put_byte(f, strlen(block->idstr));
> >>  qemu_put_buffer(f, (uint8_t *)block->idstr, 
> >> strlen(block->idstr));
> >>  qemu_put_be64(f, block->used_length);
> >> +size += 1 + strlen(block->idstr) + 8;
> >
> > I was thinking some of them would look better with sizeof()s instead of 
> > given
> > literal number, such as:
> >
> > size += sizeof(Byte) + strlen(block->idstr) + sizeof(block->used_length);
> >
> > Maybe too much?
>
> I dropped this patch for two reasons:
>
> - reviewers gave me a bad time with it O:-)
> - it was there only so anyone measuring could see that the new counters
>   are the same as the old counters.
>
> But as I have already checked that, we don't need it.
>
> I drop it on the next round that I send.
> > Maybe, it would be nice to have qemu_put_* to return the value, and in this
> > case:
> >
> > size += qemu_put_be64(...)
> >
> > What do you think?
>
> Even more important than that is to return an error value, but that
> is a very long project.
>
> See on my next series that qemu_fflush() return errors, so code gets
> simplifed:
>
> qemu_fflush(file);
> if (qemu_file_get_error(file)) {
> handle error;
> }
>
> to:
>
> qemu_fflush(file);
> if (qemu_file_get_error(file)) {
> handle error;
> }
>

They look the same to me, what changed?

> We need to do basically all qemu_put_*() and qemu_get_*() functions, but
> it is a step in the right direction.
>
> Later, Juan.
>




Re: [PATCH v1 1/1] hw/pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0

2023-05-03 Thread Leonardo Bras Soares Passos
On Wed, May 3, 2023 at 6:49 AM Jonathan Cameron
 wrote:
>
> On Tue,  2 May 2023 21:27:02 -0300
> Leonardo Bras  wrote:
>
> > Since its implementation in v8.0.0-rc0, having the PCI_ERR_UNCOR_MASK
> > set for machine types < 8.0 will cause migration to fail if the target
> > QEMU version is < 8.0.0 :
> >
> > qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x10a read: 
> > 40 device: 0 cmask: ff wmask: 0 w1cmask:0
> > qemu-system-x86_64: Failed to load PCIDevice:config
> > qemu-system-x86_64: Failed to load e1000e:parent_obj
> > qemu-system-x86_64: error while loading state for instance 0x0 of device 
> > '0000:00:02.0/e1000e'
> > qemu-system-x86_64: load of migration failed: Invalid argument
> >
> > The above test migrated a 7.2 machine type from QEMU master to QEMU 7.2.0,
> > with this cmdline:
> >
> > ./qemu-system-x86_64 -M pc-q35-7.2 [-incoming XXX]
> >
> > In order to fix this, property x-pcie-err-unc-mask was introduced to
> > control when PCI_ERR_UNCOR_MASK is enabled. This property is enabled by
> > default, but is disabled if machine type <= 7.2.
> >
> > Fixes: 010746ae1d ("hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register")
> > Suggested-by: Michael S. Tsirkin 
> > Signed-off-by: Leonardo Bras 
>
> Thanks Leo, you are a star.
>
> LGTM
>
> Reviewed-by: Jonathan Cameron 
>

Thanks!

> > ---
> >  include/hw/pci/pci.h |  2 ++
> >  hw/core/machine.c    |  1 +
> >  hw/pci/pci.c         |  2 ++
> >  hw/pci/pcie_aer.c    | 11 +++++++----
> >  4 files changed, 12 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > index 935b4b91b4..e6d0574a29 100644
> > --- a/include/hw/pci/pci.h
> > +++ b/include/hw/pci/pci.h
> > @@ -207,6 +207,8 @@ enum {
> >  QEMU_PCIE_EXTCAP_INIT = (1 << QEMU_PCIE_EXTCAP_INIT_BITNR),
> >  #define QEMU_PCIE_CXL_BITNR 10
> >  QEMU_PCIE_CAP_CXL = (1 << QEMU_PCIE_CXL_BITNR),
> > +#define QEMU_PCIE_ERR_UNC_MASK_BITNR 11
> > +QEMU_PCIE_ERR_UNC_MASK = (1 << QEMU_PCIE_ERR_UNC_MASK_BITNR),
> >  };
> >
> >  typedef struct PCIINTxRoute {
> > diff --git a/hw/core/machine.c b/hw/core/machine.c
> > index 47a34841a5..07f763eb2e 100644
> > --- a/hw/core/machine.c
> > +++ b/hw/core/machine.c
> > @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
> >  { "e1000e", "migrate-timadj", "off" },
> >  { "virtio-mem", "x-early-migration", "false" },
> >  { "migration", "x-preempt-pre-7-2", "true" },
> > +{ TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
> >  };
> >  const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);
> >
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 8a87ccc8b0..5153ad63d6 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -79,6 +79,8 @@ static Property pci_props[] = {
> >  DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
> > failover_pair_id),
> >  DEFINE_PROP_UINT32("acpi-index",  PCIDevice, acpi_index, 0),
> > +DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
> > +QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
> >  DEFINE_PROP_END_OF_LIST()
> >  };
> >
> > diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
> > index 103667c368..374d593ead 100644
> > --- a/hw/pci/pcie_aer.c
> > +++ b/hw/pci/pcie_aer.c
> > @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver, 
> > uint16_t offset,
> >
> >  pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
> >   PCI_ERR_UNC_SUPPORTED);
> > -pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> > - PCI_ERR_UNC_MASK_DEFAULT);
> > -pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> > - PCI_ERR_UNC_SUPPORTED);
> > +
> > +if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
> > +pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> > + PCI_ERR_UNC_MASK_DEFAULT);
> > +pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> > + PCI_ERR_UNC_SUPPORTED);
> > +}
> >
> >  pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
> >   PCI_ERR_UNC_SEVERITY_DEFAULT);
>




Re: [PATCH] MAINTAINERS: Add Leonardo and Peter as reviewers

2023-04-24 Thread Leonardo Bras Soares Passos
On Fri, Apr 21, 2023 at 12:23 PM Peter Xu  wrote:
>
> On Wed, Apr 19, 2023 at 06:29:57PM +0200, Juan Quintela wrote:
> > Now that David has stepped down from Migration maintainership,
> > Leonardo and Peter have volunteered to review the migration patches.
> > This way they get CC'd on every migration patch.
> >
> > Signed-off-by: Juan Quintela 
>
> Acked-by: Peter Xu 
>
> --
> Peter Xu
>

Acked-by: Leonardo Bras 

Thanks,
Leonardo Bras




Re: [PATCH] migration: Show downtime during postcopy phase

2022-12-20 Thread Leonardo Bras Soares Passos
On Thu, Dec 15, 2022 at 7:24 PM Peter Xu  wrote:
>
> The downtime should be displayed during the postcopy phase because the
> switchover phase is done.  OTOH it's weird to show "expected downtime",
> which is confusing: what would that mean if the switchover has already
> happened anyway?
>
> This is a slight ABI change on QMP, but I assume it shouldn't affect
> anyone.
>
> Signed-off-by: Peter Xu 
> ---
>  migration/migration.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 64f74534e2..993782598f 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1077,20 +1077,30 @@ bool migration_is_running(int state)
>  }
>  }
>
> +static bool migrate_show_downtime(MigrationState *s)
> +{
> +return (s->state == MIGRATION_STATUS_COMPLETED) || migration_in_postcopy();
> +}
> +
>  static void populate_time_info(MigrationInfo *info, MigrationState *s)
>  {
>  info->has_status = true;
>  info->has_setup_time = true;
>  info->setup_time = s->setup_time;
> +
>  if (s->state == MIGRATION_STATUS_COMPLETED) {
>  info->has_total_time = true;
>  info->total_time = s->total_time;
> -info->has_downtime = true;
> -info->downtime = s->downtime;
>  } else {
>  info->has_total_time = true;
>  info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) -
> s->start_time;
> +}
> +
> +if (migrate_show_downtime(s)) {
> +info->has_downtime = true;
> +info->downtime = s->downtime;
> +} else {
>  info->has_expected_downtime = true;
>  info->expected_downtime = s->expected_downtime;
>  }
> --
> 2.37.3
>

FWIW:
Reviewed-by: Leonardo Bras 
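(Illustratively, with this change a query-migrate issued while postcopy is
active reports the measured value - numbers made up:

{"execute":"query-migrate"}
{"return":{"status":"postcopy-active","downtime":312,...}}

where previously it would have kept showing "expected-downtime" even though
the switchover had already happened.)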




Re: [PATCH v1 1/1] migration: Fix yank on postcopy multifd crashing guest after migration

2022-11-29 Thread Leonardo Bras Soares Passos
Hello Peter,

On Thu, Nov 24, 2022 at 1:04 PM Peter Xu  wrote:
>
> On Wed, Nov 09, 2022 at 02:56:29AM -0300, Leonardo Bras wrote:
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index a0cdb714f7..250caff7f4 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -1889,6 +1889,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
> >  exit(EXIT_FAILURE);
> >  }
> >
> > +migration_load_cleanup();
>
> It's a bit weird to call multifd-load-clean in a listen phase..

I agree.

>
> How about moving it right above
> trace_process_incoming_migration_co_postcopy_end_main()?  Then the new
> helper can also be static.

Seems a nice Idea to have this function to be static.

We have to guarantee this is run after the migration finished, but
before migration_incoming_state_destroy().

You suggested calling it right above of
trace_process_incoming_migration_co_postcopy_end_main(), which git
grep pointed me to an if clause in process_incoming_migration_co().
If I got the location correctly, it would not help: this coroutine is
run just after the VM went to the target host, and not when the
migration finished.

If we are using multifd channels, this will break the migration with
segmentation fault (SIGSEGV), since the channels have not finished
sending yet.
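For context, the failing assertion guards the lifecycle below (a sketch
based on the names in the backtrace and the commit message; simplified):

---
yank_register_instance(MIGRATION_YANK_INSTANCE, errp);
/* each multifd channel then does: */
yank_register_function(MIGRATION_YANK_INSTANCE, fn, opaque);
...
/* teardown must be symmetric: */
yank_unregister_function(MIGRATION_YANK_INSTANCE, fn, opaque); /* per channel */
yank_unregister_instance(MIGRATION_YANK_INSTANCE); /* asserts none left */
---

multifd_load_cleanup() is what performs the per-channel unregistration,
hence the ordering constraint being discussed.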

Best regards,
Leo




>
> > +
> >  migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > MIGRATION_STATUS_COMPLETED);
> >  /*
> > --
> > 2.38.1
> >
>
> --
> Peter Xu
>




Re: [PATCH v1 1/1] migration: Fix yank on postcopy multifd crashing guest after migration

2022-11-14 Thread Leonardo Bras Soares Passos
On Thu, Nov 10, 2022 at 10:48 AM Juan Quintela  wrote:
>
> Leonardo Bras  wrote:
> > When multifd and postcopy-ram capabilities are enabled, if a
> > migrate-start-postcopy is attempted, the migration will finish sending the
> > memory pages and then crash with the following error:
> >
> > qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance: Assertion
> > `QLIST_EMPTY(&entry->yankfns)' failed.
> >
> > This happens because even though all multifd channels could
> > yank_register_function(), none of them could unregister it before
> > unregistering the MIGRATION_YANK_INSTANCE, causing the assert to fail.
> >
> > Fix that by calling multifd_load_cleanup() on postcopy_ram_listen_thread()
> > before MIGRATION_YANK_INSTANCE is unregistered.
>
> Hi
>
> One question,
> What guarantees that migration_load_cleanup() is not called twice?
>
> I can't see anything that provides that here.  Or has postcopy never
> done the cleanup of multifd channels before?

IIUC, postcopy has not been doing multifd cleanup for a while, at least
since 6.0.0-rc2.
That is as far as I went back testing, and by fixing other (build)
bugs, I could get the yank to abort the target qemu after the
migration finished in the multifd + postcopy scenario.


>
> Later, Juan.
>
>
> > Fixes: b5eea99ec2 ("migration: Add yank feature")
> > Reported-by: Li Xiaohui 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  migration/migration.h |  1 +
> >  migration/migration.c | 18 +++++++++++++-----
> >  migration/savevm.c    |  2 ++
> >  3 files changed, 16 insertions(+), 5 deletions(-)
> >
> > diff --git a/migration/migration.h b/migration/migration.h
> > index cdad8aceaa..240f64efb0 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -473,6 +473,7 @@ void migration_make_urgent_request(void);
> >  void migration_consume_urgent_request(void);
> >  bool migration_rate_limit(void);
> >  void migration_cancel(const Error *error);
> > +bool migration_load_cleanup(void);
> >
> >  void populate_vfio_info(MigrationInfo *info);
> >  void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 739bb683f3..4f363b2a95 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -486,6 +486,17 @@ void migrate_add_address(SocketAddress *address)
> >QAPI_CLONE(SocketAddress, address));
> >  }
> >
> > +bool migration_load_cleanup(void)
> > +{
> > +Error *local_err = NULL;
> > +
> > +if (multifd_load_cleanup(&local_err)) {
> > +error_report_err(local_err);
> > +return true;
> > +}
> > +return false;
> > +}
> > +
> >  static void qemu_start_incoming_migration(const char *uri, Error **errp)
> >  {
> >  const char *p = NULL;
> > @@ -540,8 +551,7 @@ static void process_incoming_migration_bh(void *opaque)
> >   */
> >  qemu_announce_self(&mis->announce_timer, migrate_announce_params());
> >
> > -if (multifd_load_cleanup(&local_err) != 0) {
> > -error_report_err(local_err);
> > +if (migration_load_cleanup()) {
> >  autostart = false;
> >  }
> >  /* If global state section was not received or we are in running
> > @@ -646,9 +656,7 @@ fail:
> >  migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
> >MIGRATION_STATUS_FAILED);
> >  qemu_fclose(mis->from_src_file);
> > -if (multifd_load_cleanup(&local_err) != 0) {
> > -error_report_err(local_err);
> > -}
> > +migration_load_cleanup();
> >  exit(EXIT_FAILURE);
> >  }
> >
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index a0cdb714f7..250caff7f4 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -1889,6 +1889,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
> >  exit(EXIT_FAILURE);
> >  }
> >
> > +migration_load_cleanup();
> > +
>
> This addition is the one that I don't understand why it was not
> needed/done before.

Please see the above comment, but tl;dr, it was not done before.


Thank you for reviewing,
Leo

>
> >  migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > MIGRATION_STATUS_COMPLETED);
> >  /*
>
> Later, Juan.
>




Re: [PATCH v1 1/1] migration: Fix yank on postcopy multifd crashing guest after migration

2022-11-09 Thread Leonardo Bras Soares Passos
On Wed, Nov 9, 2022 at 10:31 AM Dr. David Alan Gilbert
 wrote:
>
> * Leonardo Bras (leob...@redhat.com) wrote:
> > When multifd and postcopy-ram capabilities are enabled, if a
> > migrate-start-postcopy is attempted, the migration will finish sending the
> > memory pages and then crash with the following error:
>
> How does that happen? Isn't multifd+postcopy still disabled, I see in
> migrate_caps_check
>
> if (cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
> 
> if (cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
> error_setg(errp, "Postcopy is not yet compatible with multifd");
> return false;
> }
> }
>

I can't see this happening in upstream code (v7.2.0-rc0). Could you
please point me to the lines where this happens?

I mean, I see cap_list[MIGRATION_CAPABILITY_MULTIFD] and
cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM] in migrate_caps_check(),
but I can't see them nested like this, so I am probably missing
something.

This reproduction procedure was shared by Xiaohui Li (I added a few tweaks):

1. Boot a guest with any qemu command on the source host;
2. Boot a guest with the same qemu command, but append '-incoming defer', on
the destination host;
3. Enable the multifd and postcopy-ram capabilities on both src and dst hosts:
{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"multifd","state":true}]}}
{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"postcopy-ram","state":true}]}}
4. While the migration is active, switch to postcopy mode:
{"execute":"migrate-start-postcopy"}

Best regards,
Leo


>
> Dave
>
> > qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance: Assertion
> > `QLIST_EMPTY(&entry->yankfns)' failed.
> >
> > This happens because even though all multifd channels could
> > yank_register_function(), none of them could unregister it before
> > unregistering the MIGRATION_YANK_INSTANCE, causing the assert to fail.
> >
> > Fix that by calling multifd_load_cleanup() on postcopy_ram_listen_thread()
> > before MIGRATION_YANK_INSTANCE is unregistered.
> >
> > Fixes: b5eea99ec2 ("migration: Add yank feature")
> > Reported-by: Li Xiaohui 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  migration/migration.h |  1 +
> >  migration/migration.c | 18 +-
> >  migration/savevm.c|  2 ++
> >  3 files changed, 16 insertions(+), 5 deletions(-)
> >
> > diff --git a/migration/migration.h b/migration/migration.h
> > index cdad8aceaa..240f64efb0 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -473,6 +473,7 @@ void migration_make_urgent_request(void);
> >  void migration_consume_urgent_request(void);
> >  bool migration_rate_limit(void);
> >  void migration_cancel(const Error *error);
> > +bool migration_load_cleanup(void);
> >
> >  void populate_vfio_info(MigrationInfo *info);
> >  void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 739bb683f3..4f363b2a95 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -486,6 +486,17 @@ void migrate_add_address(SocketAddress *address)
> >QAPI_CLONE(SocketAddress, address));
> >  }
> >
> > +bool migration_load_cleanup(void)
> > +{
> > +Error *local_err = NULL;
> > +
> > +if (multifd_load_cleanup(&local_err)) {
> > +error_report_err(local_err);
> > +return true;
> > +}
> > +return false;
> > +}
> > +
> >  static void qemu_start_incoming_migration(const char *uri, Error **errp)
> >  {
> >  const char *p = NULL;
> > @@ -540,8 +551,7 @@ static void process_incoming_migration_bh(void *opaque)
> >   */
> >  qemu_announce_self(&mis->announce_timer, migrate_announce_params());
> >
> > -if (multifd_load_cleanup(&local_err) != 0) {
> > -error_report_err(local_err);
> > +if (migration_load_cleanup()) {
> >  autostart = false;
> >  }
> >  /* If global state section was not received or we are in running
> > @@ -646,9 +656,7 @@ fail:
> >  migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
> >MIGRATION_STATUS_FAILED);
> >  qemu_fclose(mis->from_src_file);
> > -if (multifd_load_cleanup(&local_err) != 0) {
> > -error_report_err(local_err);
> > -}
> > +migration_load_cleanup();
> >  exit(EXIT_FAILURE);
> >  }
> >
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index a0cdb714f7..250caff7f4 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -1889,6 +1889,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
> >  exit(EXIT_FAILURE);
> >  }
> >
> > +migration_load_cleanup();
> > +
> >  migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > MIGRATION_STATUS_COMPLETED);
> >  /*
> > --
> > 2.38.1
> >
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>




Re: [PATCH v7 07/12] multifd: Prepare to send a packet without the mutex held

2022-08-20 Thread Leonardo Bras Soares Passos
On Fri, Aug 19, 2022 at 8:32 AM Juan Quintela  wrote:
>
> Leonardo Brás  wrote:
> > On Tue, 2022-08-02 at 08:39 +0200, Juan Quintela wrote:
> >> We do the send_prepare() and the fill of the head packet without the
> >> mutex held.  It will help a lot for compression and later in the
> >> series for zero pages.
> >>
> >> Notice that we can use p->pages without holding p->mutex because
> >> p->pending_job == 1.
> >>
> >> Signed-off-by: Juan Quintela 
> >> ---
> >>  migration/multifd.h |  2 ++
> >>  migration/multifd.c | 11 ++-
> >>  2 files changed, 8 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/migration/multifd.h b/migration/multifd.h
> >> index a67cefc0a2..cd389d18d2 100644
> >> --- a/migration/multifd.h
> >> +++ b/migration/multifd.h
> >> @@ -109,7 +109,9 @@ typedef struct {
> >>  /* array of pages to sent.
> >>   * The owner of 'pages' depends of 'pending_job' value:
> >>   * pending_job == 0 -> migration_thread can use it.
> >> + * No need for mutex lock.
> >>   * pending_job != 0 -> multifd_channel can use it.
> >> + * No need for mutex lock.
> >>   */
> >>  MultiFDPages_t *pages;
> >>
> >> diff --git a/migration/multifd.c b/migration/multifd.c
> >> index 09a40a9135..68fc9f8e88 100644
> >> --- a/migration/multifd.c
> >> +++ b/migration/multifd.c
> >> @@ -663,6 +663,8 @@ static void *multifd_send_thread(void *opaque)
> >>  p->flags |= MULTIFD_FLAG_SYNC;
> >>  p->sync_needed = false;
> >>  }
> >> +qemu_mutex_unlock(>mutex);
> >> +
> >
> > If it unlocks here, we will have unprotected:
> > for (int i = 0; i < p->pages->num; i++) {
> > p->normal[p->normal_num] = p->pages->offset[i];
> > p->normal_num++;
> > }
> >
> > And p->pages seems to be in the mutex-protected area.
> > Should it be ok?
>
> From the documentation:
>
> /* array of pages to sent.
>  * The owner of 'pages' depends of 'pending_job' value:
>  * pending_job == 0 -> migration_thread can use it.
>  * No need for mutex lock.
>  * pending_job != 0 -> multifd_channel can use it.
>  * No need for mutex lock.
>  */
> MultiFDPages_t *pages;
>
> So, it is right.

Oh, right. I missed that part earlier.
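
To restate the hand-off for myself, a simplified sketch of the protocol
that comment describes (not the exact upstream code):

    /* migration thread: owns p->pages while p->pending_job == 0 */
    qemu_mutex_lock(&p->mutex);
    p->pages = pages;           /* hand the pages over */
    p->pending_job++;           /* ownership flips to the channel */
    qemu_mutex_unlock(&p->mutex);
    qemu_sem_post(&p->sem);

    /* multifd channel thread: owns p->pages while p->pending_job != 0,
     * so it can read p->pages->num and p->pages->offset[] without the
     * mutex, until it drops pending_job back to 0 under the lock */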

>
> > Also, under that we have:
> > if (p->normal_num) {
> > ret = multifd_send_state->ops->send_prepare(p, &local_err);
> > if (ret != 0) {
> > qemu_mutex_unlock(&p->mutex);
> > break;
> > }
> > }
> >
> > Calling mutex_unlock() here, even though the unlock already happened before,
> > could that cause any issue?
>
> Good catch.  Never got an error there.
>
> Removing that bit.

Thanks!

Best regards,
Leo

>
> > Best regards,
>
>
> Thanks, Juan.
>




Re: [PATCH v7 06/12] multifd: Make flags field thread local

2022-08-20 Thread Leonardo Bras Soares Passos
On Fri, Aug 19, 2022 at 7:03 AM Juan Quintela  wrote:
>
> Leonardo Brás  wrote:
> > On Tue, 2022-08-02 at 08:39 +0200, Juan Quintela wrote:
> >> Use of flags with respect to locking was inconsistent.  For the
> >> sending side:
> >> - it was set to 0 with mutex held on the multifd channel.
> >> - MULTIFD_FLAG_SYNC was set with mutex held on the migration thread.
> >> - Everything else was done without the mutex held on the multifd channel.
> >>
> >> On the reception side, it is not used on the migration thread, only on
> >> the multifd channels threads.
> >>
> >> So we move it to the multifd channels thread only variables, and we
> >> introduce a new bool sync_needed on the send side to pass that information.
> >>
> >> Signed-off-by: Juan Quintela 
> >> ---
> >>  migration/multifd.h | 10 ++
> >>  migration/multifd.c | 23 +--
> >>  2 files changed, 19 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/migration/multifd.h b/migration/multifd.h
> >> index 36f899c56f..a67cefc0a2 100644
> >> --- a/migration/multifd.h
> >> +++ b/migration/multifd.h
> >> @@ -98,12 +98,12 @@ typedef struct {
> >
> > Just noticed having no name in 'typedef struct' line makes it harder to
> > understand what is going on.
>
> It is a common idiom in QEMU.  The principal reason is that if you don't
> want anyone to use "struct MultiFDSendParams" but MultiFDSendParams, the
> best way to achieve that is to do it this way.

I agree, but a comment after the typedef could help reviewing. Something like

typedef struct { /* MultiFDSendParams */
...
} MultiFDSendParams

Becomes this in diff:

diff --git a/migration/multifd.h b/migration/multifd.h
index 134e6a7f19..93bb3a7f4a 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -90,6 +90,7 @@ typedef struct { /* MultiFDSendParams */
[...]


>
> >> @@ -172,6 +172,8 @@ typedef struct {
> >>
> >>  /* pointer to the packet */
> >>  MultiFDPacket_t *packet;
> >> +/* multifd flags for each packet */
> >> +uint32_t flags;
> >>  /* size of the next packet that contains pages */
> >>  uint32_t next_packet_size;
> >>  /* packets sent through this channel */
> >
> > So, IIUC, the struct member flags got moved down (same struct) to an area
> > described as thread-local, meaning it does not need locking.
> >
> > Interesting, I hadn't noticed these different areas in the same struct.
>
> It has changed in the last two weeks or so upstream (it has been in
> this patchset for several months).

Nice :)

>
>
> >
> >> diff --git a/migration/multifd.c b/migration/multifd.c
> >> index e25b529235..09a40a9135 100644
> >> --- a/migration/multifd.c
> >> +++ b/migration/multifd.c
> >> @@ -602,7 +602,7 @@ int multifd_send_sync_main(QEMUFile *f)
> >>  }
> >>
> >>  p->packet_num = multifd_send_state->packet_num++;
> >> -p->flags |= MULTIFD_FLAG_SYNC;
> >> +p->sync_needed = true;
> >>  p->pending_job++;
> >>  qemu_mutex_unlock(&p->mutex);
> >>  qemu_sem_post(&p->sem);
> >> @@ -658,7 +658,11 @@ static void *multifd_send_thread(void *opaque)
> >>
> >>  if (p->pending_job) {
> >>  uint64_t packet_num = p->packet_num;
> >> -uint32_t flags = p->flags;
> >> +p->flags = 0;
> >> +if (p->sync_needed) {
> >> +p->flags |= MULTIFD_FLAG_SYNC;
> >> +p->sync_needed = false;
> >> +}
> >
> > Any particular reason why doing p->flags = 0, then p->flags |= 
> > MULTIFD_FLAG_SYNC
> > ?
>
> It is a bitmap field, and if there is anything in the future, we need to
> set it.  I agree that when there is only one flag, it seems "weird".
>
> > [1] Couldn't it be done without the |= , since it's already being set to 
> > zero
> > before? (becoming "p->flags = MULTIFD_FLAG_SYNC" )
>
> As said, easier to modify later, and also easier if we want to set up a
> flag by default.

Yeah, I agree. It makes sense now.

Thanks

>
> I agree that it is a matter of style/taste.
>
> >>  p->normal_num = 0;
> >>
> >>  if (use_zero_copy_send) {
> >> @@ -680,14 +684,13 @@ static void *multifd_send_thread(void *opaque)
> >>  }
> >>  }
> >>  multifd_send_fill_packet(p);
> >> -p->flags = 0;
> >>  p->num_packets++;
> >>  p->total_normal_pages += p->normal_num;
> >>  p->pages->num = 0;
> >>  p->pages->block = NULL;
> >>  qemu_mutex_unlock(&p->mutex);
> >>
> >> -trace_multifd_send(p->id, packet_num, p->normal_num, flags,
> >> +trace_multifd_send(p->id, packet_num, p->normal_num, p->flags,
> >> p->next_packet_size);
> >>
> >>  if (use_zero_copy_send) {
> >> @@ -715,7 +718,7 @@ static void *multifd_send_thread(void *opaque)
> >>  p->pending_job--;
> >>  qemu_mutex_unlock(&p->mutex);
> >>
> >> -if (flags & MULTIFD_FLAG_SYNC) {
> >> + 

Re: [PATCH v7 05/12] migration: Make ram_save_target_page() a pointer

2022-08-20 Thread Leonardo Bras Soares Passos
On Fri, Aug 19, 2022 at 6:52 AM Juan Quintela  wrote:
>
> Leonardo Brás  wrote:
> > On Tue, 2022-08-02 at 08:39 +0200, Juan Quintela wrote:
> >> We are going to create a new function for multifd latest in the series.
> >>
> >> Signed-off-by: Juan Quintela 
> >> Reviewed-by: Dr. David Alan Gilbert 
> >> Signed-off-by: Juan Quintela 
> >
> > Double Signed-off-by again.
> >
> >> ---
> >>  migration/ram.c | 13 +
> >>  1 file changed, 9 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/migration/ram.c b/migration/ram.c
> >> index 85d89d61ac..499d9b2a90 100644
> >> --- a/migration/ram.c
> >> +++ b/migration/ram.c
> >> @@ -310,6 +310,9 @@ typedef struct {
> >>  bool preempted;
> >>  } PostcopyPreemptState;
> >>
> >> +typedef struct RAMState RAMState;
> >> +typedef struct PageSearchStatus PageSearchStatus;
> >> +
> >>  /* State of RAM for migration */
> >>  struct RAMState {
> >>  /* QEMUFile used for this migration */
> >> @@ -372,8 +375,9 @@ struct RAMState {
> >>   * is enabled.
> >>   */
> >>  unsigned int postcopy_channel;
> >> +
> >> +int (*ram_save_target_page)(RAMState *rs, PageSearchStatus *pss);
> >>  };
> >> -typedef struct RAMState RAMState;
> >>
> >>  static RAMState *ram_state;
> >>
> >> @@ -2255,14 +2259,14 @@ static bool save_compress_page(RAMState *rs, 
> >> RAMBlock *block, ram_addr_t offset)
> >>  }
> >>
> >>  /**
> >> - * ram_save_target_page: save one target page
> >> + * ram_save_target_page_legacy: save one target page
> >>   *
> >>   * Returns the number of pages written
> >>   *
> >>   * @rs: current RAM state
> >>   * @pss: data about the page we want to send
> >>   */
> >> -static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
> >> +static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus 
> >> *pss)
> >>  {
> >>  RAMBlock *block = pss->block;
> >>  ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
> >> @@ -2469,7 +2473,7 @@ static int ram_save_host_page(RAMState *rs, 
> >> PageSearchStatus *pss)
> >>
> >>  /* Check the pages is dirty and if it is send it */
> >>  if (migration_bitmap_clear_dirty(rs, pss->block, pss->page)) {
> >> -tmppages = ram_save_target_page(rs, pss);
> >> +tmppages = rs->ram_save_target_page(rs, pss);
> >>  if (tmppages < 0) {
> >>  return tmppages;
> >>  }
> >> @@ -3223,6 +3227,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> >>  ram_control_before_iterate(f, RAM_CONTROL_SETUP);
> >>  ram_control_after_iterate(f, RAM_CONTROL_SETUP);
> >>
> >> +(*rsp)->ram_save_target_page = ram_save_target_page_legacy;
> >>  ret =  multifd_send_sync_main(f);
> >>  if (ret < 0) {
> >>  return ret;
> >
> >
> > So, IIUC:
> > - Rename ram_save_target_page -> ram_save_target_page_legacy
> > - Add a function pointer to RAMState (or a callback)
> > - Assign function pointer = ram_save_target_page_legacy at setup
> > - Replace ram_save_target_page() by indirect function call using above 
> > pointer.
> >
> > I could see no issue in this, so I believe it works fine.
> >
> > The only thing that concerns me is the name RAMState.
>
> Every device state is set up in RAMState.
>
> > IMHO, a struct named RAMState is supposed to just reflect the state of RAM
> > (or, according to this struct's comments, the state of RAM for migration).
> > Having a function pointer here that saves a page seems counterintuitive,
> > since it does not reflect the state of RAM.
>
> The big problem for adding another struct is that we would have to
> change all the callers, or add yet another global variable.  Both are
> bad ideas in my humble opinion.
>
> > Maybe we could rename the struct, or even better, create another struct that
> > could look something like this:
> >
> > struct RAMMigration {
> > RAMState state;
> > int (*ram_save_target_page)(RAMState *rs, PageSearchStatus *pss);
> > /* Other callbacks or further info.*/
> > }
> >
> > What do you think about it?
>
> Really, this depends on configuration: what is set up for qemu
> migration.  I think this is the easiest way to do it; we can add a new
> struct, but it makes everything much more complicated:
>
> - the value that we receive in ram_save_setup() is a RAMState
> - We would have to change all the callers form
>   * ram_save_iterate()
>   * ram_find_and_save_block()
>   * ram_save_host_page()

Maybe RAMState could be embedded in a bigger struct, and we could use
something like container_of() to reach the enclosing struct whenever
needed, so it would always be available.

What about that?
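
Something like this, with hypothetical struct and helper names, just to
illustrate the container_of() idea:

    typedef struct RAMMigration {
        RAMState state;
        int (*ram_save_target_page)(RAMState *rs, PageSearchStatus *pss);
    } RAMMigration;

    static inline RAMMigration *ram_migration(RAMState *rs)
    {
        /* recover the enclosing struct from the embedded RAMState */
        return container_of(rs, RAMMigration, state);
    }

    /* callers keep passing RAMState around unchanged, e.g.:
     * tmppages = ram_migration(rs)->ram_save_target_page(rs, pss); */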

>
> So I think it is quite a bit of churn for not a lot of gain.
>
> Later, Juan.
>




Re: [RFC PATCH 2/2] kvm/kvm-all.c: listener should delay kvm_vm_ioctl to the commit phase

2022-08-18 Thread Leonardo Bras Soares Passos
On Thu, Aug 18, 2022 at 5:05 PM Peter Xu  wrote:
>
> On Tue, Aug 16, 2022 at 06:12:50AM -0400, Emanuele Giuseppe Esposito wrote:
> > +static void kvm_memory_region_node_add(KVMMemoryListener *kml,
> > +   struct kvm_userspace_memory_region 
> > *mem)
> > +{
> > +MemoryRegionNode *node;
> > +
> > +node = g_malloc(sizeof(MemoryRegionNode));
> > +*node = (MemoryRegionNode) {
> > +.mem = mem,
> > +};
>
> Nit: direct assignment of struct looks okay, but maybe pointer assignment
> is clearer (with g_malloc0?  Or iirc we're suggested to always use g_new0):
>
>   node = g_new0(MemoryRegionNode, 1);
>   node->mem = mem;
>
> [...]
>
> > +/* for KVM_SET_USER_MEMORY_REGION_LIST */
> > +struct kvm_userspace_memory_region_list {
> > + __u32 nent;
> > + __u32 flags;
> > + struct kvm_userspace_memory_region entries[0];
> > +};
> > +
> >  /*
> >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for 
> > userspace,
> >   * other bits are reserved for kvm internal use which are defined in
> > @@ -1426,6 +1433,8 @@ struct kvm_vfio_spapr_tce {
> >   struct kvm_userspace_memory_region)
> >  #define KVM_SET_TSS_ADDR  _IO(KVMIO,   0x47)
> >  #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
> > +#define KVM_SET_USER_MEMORY_REGION_LIST _IOW(KVMIO, 0x49, \
> > + struct 
> > kvm_userspace_memory_region_list)
>
> I think this is probably good enough, but just to provide the other small
> (but may not be important) piece of puzzle here.  I wanted to think through
> to understand better but I never did..
>
> For a quick look, please read the comment in kvm_set_phys_mem().
>
> /*
>  * NOTE: We should be aware of the fact that here we're only
>  * doing a best effort to sync dirty bits.  No matter whether
>  * we're using dirty log or dirty ring, we ignored two facts:
>  *
>  * (1) dirty bits can reside in hardware buffers (PML)
>  *
>  * (2) after we collected dirty bits here, pages can be 
> dirtied
>  * again before we do the final KVM_SET_USER_MEMORY_REGION to
>  * remove the slot.
>  *
>  * Not easy.  Let's cross the fingers until it's fixed.
>  */
>
> One example is if we have 16G mem, we enable dirty tracking and we punch a
> hole of 1G at offset 1G, it'll change from this:
>
>  (a)
>   |- 16G ---|
>
> To this:
>
>  (b)(c)  (d)
>   |--1G--|XX|14G|
>
> Here (c) will be a 1G hole.
>
> With current code, the hole punching will del region (a) and add back
> region (b) and (d).  After the new _LIST ioctl it'll be atomic and nicer.
>
> Here the question is if we're with dirty tracking it means for each region
> we have a dirty bitmap.  Currently we do the best effort of doing below
> sequence:
>
>   (1) fetching dirty bmap of (a)
>   (2) delete region (a)
>   (3) add region (b) (d)
>
> Here (a)'s dirty bmap is mostly kept as best effort, but still we'll lose
> dirty pages written between step (1) and (2) (and actually if the write
> comes within (2) and (3) I think it'll crash qemu, and iiuc that's what
> we're going to fix..).
>
> So ideally the atomic op can be:
>
>   "atomically fetch dirty bmap for removed regions, remove regions, and add
>new regions"
>
> Rather than only:
>
>   "atomically remove regions, and add new regions"
>
> as what the new _LIST ioctl do.
>
> But... maybe that's not a real problem, at least I didn't know any report
> showing issue with current code yet caused by losing of dirty bits during
> step (1) and (2).  Neither do I know how to trigger an issue with it.
>
> I'm just trying to still provide this information so that you should be
> aware of this problem too, at the meantime when proposing the new ioctl
> change for qemu we should also keep in mind that we won't easily lose the
> dirty bmap of (a) here, which I think this patch does the right thing.
>

Thanks for bringing up these details, Peter!

What do you think of adding a step (4): copy the corresponding parts of
(a)'s dirty bitmap into (b)'s and (d)'s dirty bitmaps?
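
Roughly, as a sketch (bitmap_copy_with_src_offset() already exists in
include/qemu/bitmap.h, but the slot fields below are made up):

    /* for each re-added slot (b), (d): copy the overlapping slice of
     * the deleted region (a)'s dirty bitmap, so the bits fetched in
     * step (1) are not lost */
    bitmap_copy_with_src_offset(new_slot->dirty_bmap, old_bmap,
                                new_slot->start_page - old_start_page,
                                new_slot->npages);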


Best regards,
Leo

> Thanks!
>
> --
> Peter Xu
>




Re: [PATCH v1 1/1] migration: add remaining params->has_* = true in migration_instance_init()

2022-07-26 Thread Leonardo Bras Soares Passos
Please include:

Fixes: 69ef1f36b0 ("migration: define 'tls-creds' and 'tls-hostname'
migration parameters")
Fixes: 1d58872a91 ("migration: do not wait for free thread")
Fixes: d2f1d29b95 ("migration: add support for a "tls-authz" migration
parameter")

On Mon, Jul 25, 2022 at 10:02 PM Leonardo Bras  wrote:
>
> Some of params->has_* = true are missing in migration_instance_init, this
> causes migrate_params_check() to skip some tests, allowing some
> unsupported scenarios.
>
> Fix this by adding all missing params->has_* = true in
> migration_instance_init().
>
> Signed-off-by: Leonardo Bras 
> ---
>  migration/migration.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index e03f698a3c..82fbe0cf55 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -4451,6 +4451,7 @@ static void migration_instance_init(Object *obj)
>  /* Set has_* up only for parameter checks */
>  params->has_compress_level = true;
>  params->has_compress_threads = true;
> +params->has_compress_wait_thread = true;
>  params->has_decompress_threads = true;
>  params->has_throttle_trigger_threshold = true;
>  params->has_cpu_throttle_initial = true;
> @@ -4471,6 +4472,9 @@ static void migration_instance_init(Object *obj)
>  params->has_announce_max = true;
>  params->has_announce_rounds = true;
>  params->has_announce_step = true;
> +params->has_tls_creds = true;
> +params->has_tls_hostname = true;
> +params->has_tls_authz = true;
>
>  qemu_sem_init(&ms->postcopy_pause_sem, 0);
>  qemu_sem_init(&ms->postcopy_pause_rp_sem, 0);
> --
> 2.37.1
>




Re: [PATCH v4 3/3] migration/multifd: Report to user when zerocopy not working

2022-07-12 Thread Leonardo Bras Soares Passos
On Tue, Jul 12, 2022 at 7:42 PM Peter Xu  wrote:
>
> On Mon, Jul 11, 2022 at 06:11:13PM -0300, Leonardo Bras wrote:
> > Some errors, like the lack of Scatter-Gather support by the network
> > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> > zero-copy, which causes it to fall back to the default copying mechanism.
> >
> > After each full dirty-bitmap scan there should be a zero-copy flush
> > happening, which checks for errors each of the previous calls to
> > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> > increment dirty_sync_missed_zero_copy migration stat to let the user know
> > about it.
> >
> > Signed-off-by: Leonardo Bras 
> > Reviewed-by: Daniel P. Berrangé 
>
> Acked-by: Peter Xu 

Thanks Peter!

> --
> Peter Xu
>




Re: [PATCH v3 1/3] QIOChannelSocket: Fix zero-copy flush returning code 1 when nothing sent

2022-07-11 Thread Leonardo Bras Soares Passos
On Thu, Jul 7, 2022 at 7:18 PM Peter Xu  wrote:
>
> On Thu, Jul 07, 2022 at 06:14:17PM -0300, Leonardo Brás wrote:
> > Having 'if (queued == sent)' will cause us to falsely return '1' in two buggy
> > cases, while 'if (queued == 0)' will either skip early or go into an
> > 'infinite' loop.
>
> I'm not sure I strictly follow here..
>

Sorry, I was thinking of a different scenario.

> Imagine the case we do flush() twice without sending anything, then in the
> 1st flush we'll see queued>sent, we'll finish flush() until queued==sent.
> Then in the 2nd (continuous) flush() we'll see queued==sent immediately.
>
> IIUC with the current patch we'll return 1 which I think is wrong because
> fallback didn't happen, and if with the change to "if (queued==sent) return
> 0" it'll fix it?

Yes, you are correct.
It's a possible scenario to have a flush happen just after another
without any sending in between.

I will fix it as suggested.
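
Something along these lines (a sketch of the fix on top of the current
patch):

    /* nothing was queued since the last flush: skip the ERRQUEUE loop,
     * and don't report "all sends fell back to copying" (1) */
    if (sioc->zero_copy_queued == sioc->zero_copy_sent) {
        return 0;
    }

    ret = 1;

    while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
        ...
    }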

Best regards,
Leo

>
> --
> Peter Xu
>




Re: [PATCH v3 3/3] migration/multifd: Warn user when zerocopy not working

2022-07-07 Thread Leonardo Bras Soares Passos
Hello Peter,

On Thu, Jul 7, 2022 at 2:56 PM Peter Xu  wrote:
>
> On Mon, Jul 04, 2022 at 05:23:15PM -0300, Leonardo Bras wrote:
> > Some errors, like the lack of Scatter-Gather support by the network
> > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> > zero-copy, which causes it to fall back to the default copying mechanism.
> >
> > After each full dirty-bitmap scan there should be a zero-copy flush
> > happening, which checks for errors each of the previous calls to
> > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> > increment dirty_sync_missed_zero_copy migration stat to let the user know
> > about it.
> >
> > Signed-off-by: Leonardo Bras 
> > ---
> >  migration/ram.h | 2 ++
> >  migration/multifd.c | 2 ++
> >  migration/ram.c | 5 +
> >  3 files changed, 9 insertions(+)
> >
> > diff --git a/migration/ram.h b/migration/ram.h
> > index ded0a3a086..d3c7eb96f5 100644
> > --- a/migration/ram.h
> > +++ b/migration/ram.h
> > @@ -87,4 +87,6 @@ void ram_write_tracking_prepare(void);
> >  int ram_write_tracking_start(void);
> >  void ram_write_tracking_stop(void);
> >
> > +void dirty_sync_missed_zero_copy(void);
> > +
> >  #endif
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 684c014c86..3909b34967 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -624,6 +624,8 @@ int multifd_send_sync_main(QEMUFile *f)
> >  if (ret < 0) {
> >  error_report_err(err);
> >  return -1;
> > +} else if (ret == 1) {
> > +dirty_sync_missed_zero_copy();
> >  }
> >  }
> >  }
>
> I know that Juan is working on some patch to only do
> multifd_send_sync_main() for each dirty sync, but that's not landed, right?

That's correct, but I am hoping it lands before the release, so
the numbers will match.


>
> Can we name it without "dirty-sync" at all (so it'll work before/after
> Juan's patch will be applied)?  Something like "zero-copy-send-fallbacks"?

It was initially something like that, but on the v2 thread there was some
discussion on the topic, and it was suggested that the number would not
mean much to the user unless it was connected to something else.

Markus suggested the connection to @dirty-sync-count right in the name,
and Daniel suggested the above name, which sounds fine to me.

>
> The other thing is the subject may need to be touched up as right now with
> the field we don't warn the user anymore on zero-copy-send fallbacks.

Ok, 'warning' does sound misleading here.
What do you think about 'report' instead?

Best regards,
Leo

>
> Thanks,
>
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 01f9cc1d72..db948c4787 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -407,6 +407,11 @@ static void ram_transferred_add(uint64_t bytes)
> >  ram_counters.transferred += bytes;
> >  }
> >
> > +void dirty_sync_missed_zero_copy(void)
> > +{
> > +ram_counters.dirty_sync_missed_zero_copy++;
> > +}
> > +
> >  /* used by the search for pages to send */
> >  struct PageSearchStatus {
> >  /* Current block being searched */
> > --
> > 2.36.1
> >
>
> --
> Peter Xu
>




Re: [PATCH v3 2/3] Add dirty-sync-missed-zero-copy migration stat

2022-07-07 Thread Leonardo Bras Soares Passos
Hello Peter,

On Thu, Jul 7, 2022 at 2:54 PM Peter Xu  wrote:
>
> On Mon, Jul 04, 2022 at 05:23:14PM -0300, Leonardo Bras wrote:
> > Signed-off-by: Leonardo Bras 
> > ---
> >  qapi/migration.json   | 7 ++-
> >  migration/migration.c | 2 ++
> >  monitor/hmp-cmds.c| 4 
> >  3 files changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 7102e474a6..fed08b9b88 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -55,6 +55,10 @@
> >  # @postcopy-bytes: The number of bytes sent during the post-copy phase
> >  #  (since 7.0).
> >  #
> > +# @dirty-sync-missed-zero-copy: Number of times dirty RAM synchronization 
> > could
> > +#   not avoid copying zero pages.  This is 
> > between 0
>
> Avoid copying zero pages?  Isn't this for counting MSG_ZEROCOPY fallbacks?

Yes, sorry, I think I got confused at some point between some cuts & pastes.
It should be "not avoid copying dirty pages." I will fix that in a v4.


>
> > +#   and @dirty-sync-count * @multifd-channels.
>
> I'd not name it as "dirty-sync-*" because fundamentally the accounting is
> not doing like that (more in latter patch).

Ok, I will take a look & answer there.

> I also think we should squash
> patch 2/3 as patch 3 only started to provide meaningful values.

IIRC, in the zero-copy-send implementation I was previously asked to keep
the property/capability in a separate patch in order to make it easier to
review, so I thought it would be helpful here too.

>
> > +#   (since 7.1)
> >  # Since: 0.14
> >  ##
> >  { 'struct': 'MigrationStats',
> > @@ -65,7 +69,8 @@
> > 'postcopy-requests' : 'int', 'page-size' : 'int',
> > 'multifd-bytes' : 'uint64', 'pages-per-second' : 'uint64',
> > 'precopy-bytes' : 'uint64', 'downtime-bytes' : 'uint64',
> > -   'postcopy-bytes' : 'uint64' } }
> > +   'postcopy-bytes' : 'uint64',
> > +   'dirty-sync-missed-zero-copy' : 'uint64' } }
> >
> >  ##
> >  # @XBZRLECacheStats:
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 78f5057373..048f7f8bdb 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1027,6 +1027,8 @@ static void populate_ram_info(MigrationInfo *info, 
> > MigrationState *s)
> >  info->ram->normal_bytes = ram_counters.normal * page_size;
> >  info->ram->mbps = s->mbps;
> >  info->ram->dirty_sync_count = ram_counters.dirty_sync_count;
> > +info->ram->dirty_sync_missed_zero_copy =
> > +ram_counters.dirty_sync_missed_zero_copy;
> >  info->ram->postcopy_requests = ram_counters.postcopy_requests;
> >  info->ram->page_size = page_size;
> >  info->ram->multifd_bytes = ram_counters.multifd_bytes;
> > diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> > index ca98df0495..5f3be9e405 100644
> > --- a/monitor/hmp-cmds.c
> > +++ b/monitor/hmp-cmds.c
> > @@ -307,6 +307,10 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
> >  monitor_printf(mon, "postcopy ram: %" PRIu64 " kbytes\n",
> > info->ram->postcopy_bytes >> 10);
> >  }
> > +if (info->ram->dirty_sync_missed_zero_copy) {
> > +monitor_printf(mon, "missed zero-copy on: %" PRIu64 " 
> > iterations\n",
> > +   info->ram->dirty_sync_missed_zero_copy);
>
> I suggest we don't call it "iterations" because it's not the generic mean
> of iterations.

Yeah, I thought that too, but I could not think of anything better.
What do you suggest instead?

Best regards,
Leo

>
> > +}
> >  }
> >
> >  if (info->has_disk) {
> > --
> > 2.36.1
> >
>
> --
> Peter Xu
>




Re: [PATCH v3 1/3] QIOChannelSocket: Fix zero-copy flush returning code 1 when nothing sent

2022-07-07 Thread Leonardo Bras Soares Passos
Hello Peter,

On Thu, Jul 7, 2022 at 2:47 PM Peter Xu  wrote:
>
> Hi, Leo,
>
> On Mon, Jul 04, 2022 at 05:23:13PM -0300, Leonardo Bras wrote:
> > If flush is called when no buffer was sent with MSG_ZEROCOPY, it currently
> > returns 1. This return code should be used only when Linux fails to use
> > MSG_ZEROCOPY on a lot of sendmsg().
> >
> > Fix this by returning early from flush if no sendmsg(...,MSG_ZEROCOPY)
> > was attempted.
> >
> > Fixes: 2bc58ffc2926 ("QIOChannelSocket: Implement io_writev zero copy flag 
> > & io_flush for CONFIG_LINUX")
> > Signed-off-by: Leonardo Bras 
> > ---
> >  io/channel-socket.c | 8 +++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/io/channel-socket.c b/io/channel-socket.c
> > index 4466bb1cd4..698c086b70 100644
> > --- a/io/channel-socket.c
> > +++ b/io/channel-socket.c
> > @@ -716,12 +716,18 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
> >  struct cmsghdr *cm;
> >  char control[CMSG_SPACE(sizeof(*serr))];
> >  int received;
> > -int ret = 1;
> > +int ret;
> > +
> > +if (!sioc->zero_copy_queued) {
>
> I think I asked this in the downstream review but didn't get a
> response.. shouldn't this check be "queued == sent"?

This is just supposed to skip the flush if nothing was queued for sending;
queued == sent is tested below, in the while loop.

Without this, the function could return 1 when nothing was sent with
zero-copy, which would be confusing, because the QIOChannel API says 1
should be returned only if all zero-copy sends fell back to copying.

Best regards,
Leo

>
> > +return 0;
> > +}
> >
> >  msg.msg_control = control;
> >  msg.msg_controllen = sizeof(control);
> >  memset(control, 0, sizeof(control));
> >
> > +ret = 1;
> > +
> >  while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
> >  received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
> >  if (received < 0) {
> > --
> > 2.36.1
> >
>
> --
> Peter Xu
>




Re: [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working

2022-06-28 Thread Leonardo Bras Soares Passos
On Tue, Jun 28, 2022 at 10:52 AM Dr. David Alan Gilbert
 wrote:
>
> * Daniel P. Berrangé (berra...@redhat.com) wrote:
> > On Tue, Jun 28, 2022 at 09:32:04AM -0300, Leonardo Bras Soares Passos wrote:
> > > On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé  
> > > wrote:
> > > >
> > > > On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > > > > Some errors, like the lack of Scatter-Gather support by the network
> > > > > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on 
> > > > > using
> > > > > zero-copy, which causes it to fall back to the default copying 
> > > > > mechanism.
> > > >
> > > > How common is this lack of SG support ? What NICs did you have that
> > > > were affected ?
> > >
> > > I am not aware of any NIC without SG available for testing, nor have
> > > any idea on how common they are.
> > > But since we can detect sendmsg() falling back to copying we should
> > > warn the user if this ever happens.
> > >
> > > There is also a case in IPv6 related to fragmentation that may cause
> > > MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
> > > covered.
> > >
> > > >
> > > > > After each full dirty-bitmap scan there should be a zero-copy flush
> > > > > happening, which checks for errors each of the previous calls to
> > > > > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, 
> > > > > then
> > > > > warn the user about it.
> > > > >
> > > > > Since it happens once each full dirty-bitmap scan, even in worst case
> > > > > scenario it should not print a lot of warnings, and will allow 
> > > > > tracking
> > > > > how many dirty-bitmap iterations were not able to use zero-copy send.
> > > >
> > > > For long running migrations which are not converging, or converging
> > > > very slowly there could be 100's of passes.
> > > >
> > >
> > > I could change it so it only warns once, if that is too much output.
> >
> > Well I'm mostly wondering what we're expecting the user todo with this
> > information.


My rationale on that:
- zero-copy-send is a feature that is supposed to improve send
throughput by reducing cpu usage.
- there is a chance the sendmsg(MSG_ZEROCOPY) fails to use zero-copy
- if this happens, there will be a potential throughput decrease on sendmsg()
- the user (or management app) needs to know when zero-copy-send is
degrading throughput, so it can be disabled
- this is also important for performance testing, given it can be
confusing having zero-copy-send improving throughput in some cases,
and degrading in others, without any apparent reason why.

> > Generally a log file containing warnings ends up turning
> > into a bug report. If we think it is important for users and/or mgmt
> > apps to be aware of this info, then it might be better to actually
> > put a field in the query-migrate stats to report if zero-copy is
> > being honoured or not,
>
> Yeh just a counter would work there I think.

The warning idea was mostly due to my inexperience with the mgmt-app
interface; I had no better idea of how to expose this.

I think having it in query-migrate is a much better idea than a
warning, since it should be much easier to parse and disable
zero-copy-send if desired.
Even in my current qemu test script, it's much better having it in
query-migrate.

>
> > and just have a trace point in this location
> > instead.
>
> Yeh.
>

Yeap, the counter idea seems great!
Will it always be printed there, or only when zero-copy-send is enabled?

Best regards,
Leo

> Dave
>
> > With regards,
> > Daniel
> > --
> > |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange 
> > :|
> > |: https://libvirt.org -o-https://fstop138.berrange.com 
> > :|
> > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange 
> > :|
> >
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>




Re: [PATCH v1 2/2] migration/multifd: Warn user when zerocopy not working

2022-06-28 Thread Leonardo Bras Soares Passos
On Tue, Jun 28, 2022 at 4:53 AM Daniel P. Berrangé  wrote:
>
> On Mon, Jun 27, 2022 at 10:09:09PM -0300, Leonardo Bras wrote:
> > Some errors, like the lack of Scatter-Gather support by the network
> > interface(NETIF_F_SG) may cause sendmsg(...,MSG_ZEROCOPY) to fail on using
> > zero-copy, which causes it to fall back to the default copying mechanism.
>
> How common is this lack of SG support ? What NICs did you have that
> were affected ?

I am not aware of any NIC without SG available for testing, nor do I
have any idea how common they are.
But since we can detect sendmsg() falling back to copying we should
warn the user if this ever happens.

There is also a case in IPv6 related to fragmentation that may cause
MSG_ZEROCOPY to fall back to the copying mechanism, so it's also
covered.

>
> > After each full dirty-bitmap scan there should be a zero-copy flush
> > happening, which checks for errors each of the previous calls to
> > sendmsg(...,MSG_ZEROCOPY). If all of them failed to use zero-copy, then
> > warn the user about it.
> >
> > Since it happens once each full dirty-bitmap scan, even in worst case
> > scenario it should not print a lot of warnings, and will allow tracking
> > how many dirty-bitmap iterations were not able to use zero-copy send.
>
> For long running migrations which are not converging, or converging
> very slowly there could be 100's of passes.
>

I could change it so it only warns once, if that is too much output.

Best regards,
Leo

>
>
> > Signed-off-by: Leonardo Bras 
> > ---
> >  migration/multifd.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 684c014c86..9c62aec84e 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -624,6 +624,9 @@ int multifd_send_sync_main(QEMUFile *f)
> >  if (ret < 0) {
> >  error_report_err(err);
> >  return -1;
> > +} else if (ret == 1) {
> > +warn_report("The network device is not able to use "
> > +"zero-copy-send: copying is being used");
> >  }
> >  }
> >  }
> > --
> > 2.36.1
> >
>
> With regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>




Re: [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability

2022-06-19 Thread Leonardo Bras Soares Passos
CC: Jiri Denemark 

On Mon, Jun 20, 2022 at 2:40 AM Leonardo Bras  wrote:
>
> When originally implemented, zero_copy_send was designed as a Migration
> parameter.
>
> But taking into account how it is supposed to work, and the difference
> between a capability and a parameter, it only makes sense that
> zero-copy-send would work better as a capability.
>
> Taking into account how recently the change got merged, it was decided
> that it's still time to make it right, and convert zero_copy_send into
> a Migration capability.
>
> Signed-off-by: Leonardo Bras 
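
(For illustration: after this change, enabling it follows the usual
capability flow instead of migrate-set-parameters. A hypothetical QMP
line, assuming the name lands as 'zero-copy-send':
{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"zero-copy-send","state":true}]}} )
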
> ---
>  qapi/migration.json   | 33 ---
>  migration/migration.c | 52 ---
>  monitor/hmp-cmds.c|  6 -
>  3 files changed, 28 insertions(+), 63 deletions(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 6130cd9fae..baf8d734de 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -461,6 +461,13 @@
>  #   procedure starts. The VM RAM is saved with running 
> VM.
>  #   (since 6.0)
>  #
> +# @zero-copy-send: Controls behavior on sending memory pages on migration.
> +#  When true, enables a zero-copy mechanism for sending
> +#  memory pages, if host supports it.
> +#  Requires that QEMU be permitted to use locked memory
> +#  for guest RAM pages.
> +#  (since 7.1)
> +#
>  # Features:
>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
>  #
> @@ -474,7 +481,8 @@
> 'block', 'return-path', 'pause-before-switchover', 'multifd',
> 'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
> { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
> -   'validate-uuid', 'background-snapshot'] }
> +   'validate-uuid', 'background-snapshot',
> +   { 'name': 'zero-copy-send', 'if' : 'CONFIG_LINUX'}] }
>
>  ##
>  # @MigrationCapabilityStatus:
> @@ -738,12 +746,6 @@
>  #  will consume more CPU.
>  #  Defaults to 1. (Since 5.0)
>  #
> -# @zero-copy-send: Controls behavior on sending memory pages on migration.
> -#  When true, enables a zero-copy mechanism for sending
> -#  memory pages, if host supports it.
> -#  Requires that QEMU be permitted to use locked memory
> -#  for guest RAM pages.
> -#  Defaults to false. (Since 7.1)
>  #
>  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
>  #aliases for the purpose of dirty bitmap migration.  
> Such
> @@ -784,7 +786,6 @@
> 'xbzrle-cache-size', 'max-postcopy-bandwidth',
> 'max-cpu-throttle', 'multifd-compression',
> 'multifd-zlib-level' ,'multifd-zstd-level',
> -   { 'name': 'zero-copy-send', 'if' : 'CONFIG_LINUX'},
> 'block-bitmap-mapping' ] }
>
>  ##
> @@ -911,13 +912,6 @@
>  #  will consume more CPU.
>  #  Defaults to 1. (Since 5.0)
>  #
> -# @zero-copy-send: Controls behavior on sending memory pages on migration.
> -#  When true, enables a zero-copy mechanism for sending
> -#  memory pages, if host supports it.
> -#  Requires that QEMU be permitted to use locked memory
> -#  for guest RAM pages.
> -#  Defaults to false. (Since 7.1)
> -#
>  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
>  #aliases for the purpose of dirty bitmap migration.  
> Such
>  #aliases may for example be the corresponding names 
> on the
> @@ -972,7 +966,6 @@
>  '*multifd-compression': 'MultiFDCompression',
>  '*multifd-zlib-level': 'uint8',
>  '*multifd-zstd-level': 'uint8',
> -'*zero-copy-send': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
>  '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
>
>  ##
> @@ -1119,13 +1112,6 @@
>  #  will consume more CPU.
>  #  Defaults to 1. (Since 5.0)
>  #
> -# @zero-copy-send: Controls behavior on sending memory pages on migration.
> -#  When true, enables a zero-copy mechanism for sending
> -#  memory pages, if host supports it.
> -#  Requires that QEMU be permitted to use locked memory
> -#  for guest RAM pages.
> -#  Defaults to false. (Since 7.1)
> -#
>  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
>  #aliases for the purpose of dirty bitmap migration.  
> Such
>  #aliases may for example be the corresponding names 
> on the
> @@ -1178,7 +1164,6 @@
>  '*multifd-compression': 'MultiFDCompression',
>  '*multifd-zlib-level': 

Re: [PATCH v2 1/2] QIOChannelSocket: Reduce ifdefs to improve readability

2022-06-14 Thread Leonardo Bras Soares Passos
On Tue, Jun 14, 2022 at 5:36 AM Daniel P. Berrangé  wrote:
>
> On Mon, Jun 13, 2022 at 06:21:18PM -0300, Leonardo Bras Soares Passos wrote:
> > On Fri, Jun 10, 2022 at 5:25 AM Daniel P. Berrangé  
> > wrote:
> > >
> >
> > [...]
> >
> > > Ok, so if it is checked earlier then we merely need an assert.
> > >
> > >  if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > >  #ifdef QEMU_MSG_ZEROCOPY
> > >  sflags = MSG_ZEROCOPY;
> > >  zero_copy_enabled = true;
> > >  #else
> > >  g_assert_unreachable();
> > >  #endif
> > > > }
> >
> > Ok, I will add that in the next version.
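
For the record, the shape I have in mind (sketch only; note that glib
spells the macro g_assert_not_reached()):

if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
#ifdef QEMU_MSG_ZEROCOPY
    sflags = MSG_ZEROCOPY;
#else
    g_assert_not_reached();
#endif
}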
> >
> > >
> > >
> > >
> > > > > > @@ -592,15 +594,13 @@ static ssize_t 
> > > > > > qio_channel_socket_writev(QIOChannel *ioc,
> > > > > >  return QIO_CHANNEL_ERR_BLOCK;
> > > > > >  case EINTR:
> > > > > >  goto retry;
> > > > > > -#ifdef QEMU_MSG_ZEROCOPY
> > > > > >  case ENOBUFS:
> > > > > > -if (sflags & MSG_ZEROCOPY) {
> > > > > > +if (zero_copy_enabled) {
> > > > >
> > > > > if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY)
> > > > >
> > > > > avoids the #ifdef without needing to add yet another
> > > > > variable expressing what's already expressed in both
> > > > > 'flags' and 'sflags'.
> > > >
> > > > Yes, it does, but at the cost of not compiling-out the zero-copy part
> > > > when it's not supported,
> > > > since the QIO_CHANNEL_WRITE_FLAG_ZERO_COPY comes as a parameter. This 
> > > > ends up
> > > > meaning there will be at least one extra test for every time this
> > > > function is called (the one in the next patch).
> > >
> > > The cost of a simple bit test is between negligible-and-non-existant
> > > with branch prediction. I doubt it would be possible to even measure
> > > it.
> >
> > Yeah, you are probably right on that.
> > So the main learning point here is that it's not worth creating a new
> > boolean for compiling-out
> > code that should not impact performance?
>
> As ever "it depends" so there's no hard rule, and sometimes it can
> verge on bikeshed colouring :-)
>
> I didn't like the variable in this case, because it introduces a 3rd
> variable to the method for representing whether zero copy is need,
> which is excessive. I'm not a fan of redundancy as it can often then
> lead to inconsistency. So it would need a compelling reason why it is
> better, which is difficult for such a simple method. If the code was
> more complex, a variable might have benefit of clarity, but in this
> case IMHO it was just overkill.

I see. Thanks for the clarification!

Best regards,
Leo

>
> With regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>




Re: [External] [PATCH v13 3/8] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-06-13 Thread Leonardo Bras Soares Passos
On Mon, Jun 13, 2022 at 7:53 PM Peter Xu  wrote:
>
> On Mon, Jun 13, 2022 at 05:58:44PM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Peter,
> >
> > On Wed, Jun 8, 2022 at 5:23 PM Peter Xu  wrote:
> > [...]
> > > > In a previous iteration of the patchset, it was made clear that it's
> > > > desirable to detect when the kernel falls back to copying mechanism,
> > > > so the user of 'QIOChannelSocket' can switch to copying and avoid the
> > > > overhead. This was done by the return value of flush(), which is 1 if
> > > > that occurs.
> > >
> > > Two questions..
> > >
> > >   1) When that happens, will MSG_ERRQUEUE keeps working just like zerocopy
> > >  is functional?
> >
> > I am not sure about what exactly you meant by 'like zerocopy is
> > functional', but the idea is that reading from MSG_ERRQUEUE should
> > return a msg for each sendmsg
> > syscall with MSG_ZEROCOPY that previously happened. This does not depend on
> > the outcome (like falling back to the copying mechanism).
> > btw, most of those messages may be batched to reduce overhead.
> >
> > At some point, zero-copy may fail, and fall back to copying, so in
> > those messages
> > an error code SO_EE_CODE_ZEROCOPY_COPIED can be seen. Having only
> > those messages in a flush will trigger the returning of 1 from the
> > flush function.
>
> Ah I think I missed the "reset ret==0 when !SO_EE_CODE_ZEROCOPY_COPIED"
> path..  Sorry.
>
> >
> > >
> > >  If the answer is yes, I don't see how ret=1 will ever be
> > >  returned.. because we'll also go into the same loop in
> > >  qio_channel_socket_flush() anyway.
> >
> >
> > We set ret to 1 at function entry and then for each message in the 
> > MSG_ERRQUEUE,
> > we test if it has an error code different from SO_EE_CODE_ZEROCOPY_COPIED.
> > If it ever has a different error code, we set ret=0.
> >
> > So, in our previous example, if we have a net device not supporting
> > the 'Scatter-Gather'
> > feature (NETIF_F_SG), every error message will be
> > SO_EE_CODE_ZEROCOPY_COPIED, and it will return 1.
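
In code, the per-message check is roughly this (field names from
linux/errqueue.h; cmsg parsing and error handling omitted):

    struct sock_extended_err *serr = (void *) CMSG_DATA(cm);

    /* each errqueue message covers sendmsg() calls ee_info..ee_data */
    sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;

    if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
        ret = 0;    /* at least one send really used zero-copy */
    }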
> >
> >
> > >
> > >  If the answer is no, then since we'll have non-zero zero_copy_queued,
> > >  will the loop in qio_channel_socket_flush() go into a dead one?  How
> > >  could it return?
> >
> > No, because it will go through all packets sent with MSG_ZEROCOPY,
> > including the ones that fell back to copying, so the counter should
> > be fine. If any code disables zero-copy, it will stop both sending
> > stuff with MSG_ZEROCOPY and flushing, so it should be fine.
> >
> > >
> > >   2) Even if we have the correct ret=1 returned when that happens, which
> > >  caller is detecting that ret==1 and warn the admin?
> > >
> >
> > No caller is using that right now.
> > It's supposed to be a QIOChannel interface feature, and any 
> > user/implementation
> > could use that information to warn if zero-copy is not being used, fall 
> > back to
> > copying directly (to avoid overhead of testing zero-copy) or even use
> > it to cancel the
> > sending if wanted.
> >
> > It was a suggestion of Daniel on top of [PATCH v5 1/6] IIRC.
>
> OK the detection makes sense, thanks for the details.
>
> Then now I'm wondering whether we should have warned the admin already if
> zero-copy send is not fully enabled in live migration.  Should we add a
> error_report_once() somewhere for the ret==1 already?  After all the user
> specify zero_copy_send=true explicitly.  Did I miss something again?
>

You are correct; I think warning the user is the right thing to do here.
At the end of the first iteration, where the first flush happens, I think
it's too late to fail the migration, since a large part of the data has
already been sent.

Best regards,
Leo




Re: [PATCH v2 1/2] QIOChannelSocket: Reduce ifdefs to improve readability

2022-06-13 Thread Leonardo Bras Soares Passos
On Fri, Jun 10, 2022 at 5:25 AM Daniel P. Berrangé  wrote:
>

[...]

> Ok, so if it is checked earlier then we merely need an assert.
>
>  if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
>  #ifdef QEMU_MSG_ZEROCOPY
>  sflags = MSG_ZEROCOPY;
>  zero_copy_enabled = true;
>  #else
>  g_assert_unreachable();
>  #endif
> > }

Ok, I will add that in the next version.

>
>
>
> > > > @@ -592,15 +594,13 @@ static ssize_t 
> > > > qio_channel_socket_writev(QIOChannel *ioc,
> > > >  return QIO_CHANNEL_ERR_BLOCK;
> > > >  case EINTR:
> > > >  goto retry;
> > > > -#ifdef QEMU_MSG_ZEROCOPY
> > > >  case ENOBUFS:
> > > > -if (sflags & MSG_ZEROCOPY) {
> > > > +if (zero_copy_enabled) {
> > >
> > > if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY)
> > >
> > > avoids the #ifdef without needing to add yet another
> > > variable expressing what's already expressed in both
> > > 'flags' and 'sflags'.
> >
> > Yes, it does, but at the cost of not compiling-out the zero-copy part
> > when it's not supported,
> > since the QIO_CHANNEL_WRITE_FLAG_ZERO_COPY comes as a parameter. This ends 
> > up
> > meaning there will be at least one extra test for every time this
> > function is called (the one in the next patch).
>
> The cost of a simple bit test is between negligible-and-non-existant
> with branch prediction. I doubt it would be possible to even measure
> it.

Yeah, you are probably right about that.
So the main learning point here is that it's not worth creating a new
boolean just to compile out code that should not impact performance?
I mean, if performance-wise they should be the same, then a new
variable would just be a bother for the programmer.

Best regards,
Leo






>
> With regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>




Re: [External] [PATCH v13 3/8] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-06-13 Thread Leonardo Bras Soares Passos
Hello Peter,

On Wed, Jun 8, 2022 at 5:23 PM Peter Xu  wrote:
[...]
> > In a previous iteration of the patchset, it was made clear that it's
> > desirable to detect when the kernel falls back to copying mechanism,
> > so the user of 'QIOChannelSocket' can switch to copying and avoid the
> > overhead. This was done by the return value of flush(), which is 1 if
> > that occurs.
>
> Two questions..
>
>   1) When that happens, will MSG_ERRQUEUE keeps working just like zerocopy
>  is functional?

I am not sure about what exactly you meant by 'like zerocopy is
functional', but the idea is that reading from MSG_ERRQUEUE should
return a msg for each sendmsg
syscall with MSG_ZEROCOPY that previously happened. This does not depend on
the outcome (like falling back to the copying mechanism).
btw, most of those messages may be batched to reduce overhead.

At some point, zero-copy may fail, and fall back to copying, so in
those messages
an error code SO_EE_CODE_ZEROCOPY_COPIED can be seen. Having only
those messages in a flush will trigger the returning of 1 from the
flush function.

>
>  If the answer is yes, I don't see how ret=1 will ever be
>  returned.. because we'll also go into the same loop in
>  qio_channel_socket_flush() anyway.


We set ret to 1 at function entry and then for each message in the MSG_ERRQUEUE,
we test if it has an error code different from SO_EE_CODE_ZEROCOPY_COPIED.
If it ever has a different error code, we set ret=0.

So, in our previous example, if we have a net device not supporting
the 'Scatter-Gather'
feature (NETIF_F_SG), every error message will be
SO_EE_CODE_ZEROCOPY_COPIED, and it will return 1.


>
>  If the answer is no, then since we'll have non-zero zero_copy_queued,
>  will the loop in qio_channel_socket_flush() go into a dead one?  How
>  could it return?

No, because it will go through all packets sent with MSG_ZEROCOPY,
including the ones that fell back to copying, so the counter should be
fine. If any code disables zero-copy, it will stop both sending stuff
with MSG_ZEROCOPY and flushing, so it should be fine.

>
>   2) Even if we have the correct ret=1 returned when that happens, which
>  caller is detecting that ret==1 and warn the admin?
>

No caller is using that right now.
It's supposed to be a QIOChannel interface feature: any user/implementation
could use that information to warn that zero-copy is not being used, to
fall back to copying directly (avoiding the overhead of trying zero-copy
first), or even to cancel the sending if desired.

It was a suggestion from Daniel on top of [PATCH v5 1/6], IIRC.
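
Just to illustrate, a hypothetical caller could consume that return value
roughly like this (a sketch only; send_done() is a made-up name, while
qio_channel_flush(), warn_report() and error_report_err() are the existing
interfaces):

static void send_done(QIOChannel *ioc)
{
    Error *local_err = NULL;
    int ret = qio_channel_flush(ioc, &local_err);

    if (ret < 0) {
        error_report_err(local_err); /* hard failure, abort the operation */
    } else if (ret == 1) {
        /* every queued send fell back to copying: zero-copy is not
         * paying off, so warn and/or disable it from here on */
        warn_report("zero-copy send requested but not used by the kernel");
    }
}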

Best regards,
Leo




Re: [PATCH v2 1/2] QIOChannelSocket: Reduce ifdefs to improve readability

2022-06-09 Thread Leonardo Bras Soares Passos
Hello Daniel,

On Thu, Jun 9, 2022 at 5:10 AM Daniel P. Berrangé  wrote:
>
> On Wed, Jun 08, 2022 at 06:04:02PM -0300, Leonardo Bras wrote:
> > During implementation of MSG_ZEROCOPY feature, a lot of #ifdefs were
> > introduced, particularly at qio_channel_socket_writev().
> >
> > Rewrite some of those changes so it's easier to read.
> >   ...
> > Signed-off-by: Leonardo Bras 
> > ---
> >  io/channel-socket.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/io/channel-socket.c b/io/channel-socket.c
> > index dc9c165de1..ef7c7cfbac 100644
> > --- a/io/channel-socket.c
> > +++ b/io/channel-socket.c
> > @@ -554,6 +554,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
> > *ioc,
> >  size_t fdsize = sizeof(int) * nfds;
> >  struct cmsghdr *cmsg;
> >  int sflags = 0;
> > +bool zero_copy_enabled = false;
> >
> >  memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
> >
> > @@ -581,6 +582,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
> > *ioc,
> >  #ifdef QEMU_MSG_ZEROCOPY
> >  if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> >  sflags = MSG_ZEROCOPY;
> > +zero_copy_enabled = true;
> >  }
>
> There should be a
>
>  #else
> error_setg(errp, "Zero copy not supported on this platform");
> return -1;
>  #endif
>

IIUC, if done exactly as suggested, it would break every non-zero-copy
call of qio_channel_socket_writev().

I think you are suggesting something like:

if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
#ifdef QEMU_MSG_ZEROCOPY
sflags = MSG_ZEROCOPY;
zero_copy_enabled = true; // I know you suggested leaving this out;
                          // kept here just for example purposes
#else
error_setg(errp, "Zero copy not supported on this platform");
return -1;
#endif
}

This is supposed to fail if QIO_CHANNEL_WRITE_FLAG_ZERO_COPY is specified
but QEMU does not support it at compile time.

If I get the part above correctly, it would not be necessary, as
qio_channel_socket_writev() is
called only by qio_channel_writev_full(), which tests:

if ((flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) &&
!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
error_setg_errno(errp, EINVAL,
 "Requested Zero Copy feature is not available");
return -1;
}

and QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY is only set in
qio_channel_socket_connect_sync(), conditional on QEMU_MSG_ZEROCOPY being
enabled at compile time. Meaning it's the same test as mentioned before,
but failing earlier.

> >  #endif
> >
> > @@ -592,15 +594,13 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
> > *ioc,
> >  return QIO_CHANNEL_ERR_BLOCK;
> >  case EINTR:
> >  goto retry;
> > -#ifdef QEMU_MSG_ZEROCOPY
> >  case ENOBUFS:
> > -if (sflags & MSG_ZEROCOPY) {
> > +if (zero_copy_enabled) {
>
> if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY)
>
> avoids the #ifdef without needing to add yet another
> variable expressing what's already expressed in both
> 'flags' and 'sflags'.

Yes, it does, but at the cost of not compiling out the zero-copy part
when it's not supported, since QIO_CHANNEL_WRITE_FLAG_ZERO_COPY comes in
as a parameter. This ends up meaning there will be at least one extra
test every time this function is called (the one in the next patch).

An option would be testing sflags & MSG_ZEROCOPY, which would compile
out the zero-copy code if it's not supported, but some distros ship
headers where MSG_ZEROCOPY is not defined, causing the build to fail.

I understand the idea of reusing those variables instead of creating a
new one, but this boolean will almost certainly be optimized away in this
function, and it allows compiling out the zero-copy code where it's not
supported.
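
In other words, the shape being argued for looks roughly like this (a
sketch assembled from the patch above; QEMU_MSG_ZEROCOPY stands in for
whatever compile-time guard ends up being used):

bool zero_copy_enabled = false;

#ifdef QEMU_MSG_ZEROCOPY
    if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
        sflags = MSG_ZEROCOPY;
        zero_copy_enabled = true;
    }
#endif
    ...
    case ENOBUFS:
        if (zero_copy_enabled) { /* constant false when compiled out,
                                    so the compiler drops this branch */
            error_setg_errno(errp, errno,
                             "Process can't lock enough memory for "
                             "using MSG_ZEROCOPY");
            return -1;
        }
        break;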

Best regards,
Leo


>
> >  error_setg_errno(errp, errno,
> >   "Process can't lock enough memory for 
> > using MSG_ZEROCOPY");
> >  return -1;
> >  }
> >  break;
> > -#endif
> >  }
> >
> >  error_setg_errno(errp, errno,
> > --
> > 2.36.1
> >
>
> With regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>




Re: [PATCH v1 1/1] QIOChannelSocket: Fix zero-copy send so socket flush works

2022-06-08 Thread Leonardo Bras Soares Passos
On Wed, Jun 8, 2022 at 5:55 PM Peter Xu  wrote:
>
> On Wed, Jun 08, 2022 at 04:26:10PM -0400, Peter Xu wrote:
> > On Wed, Jun 08, 2022 at 03:18:09PM -0300, Leonardo Bras wrote:
> > > Somewhere between v6 and v7 of the zero-copy-send patchset, a crucial
> > > part of the flushing mechanism went missing: incrementing zero_copy_queued.
> > >
> > > Without that, the flushing interface becomes a no-op, and there is no
> > > guarantee the buffer is really sent.
> > >
> > > This can go as bad as causing a corruption in RAM during migration.
> > >
> > > Fixes: 2bc58ffc2926 ("QIOChannelSocket: Implement io_writev zero copy 
> > > flag & io_flush for CONFIG_LINUX")
> > > Reported-by: 徐闯 
> > > Signed-off-by: Leonardo Bras 
> >
> > Copy Dave/Juan; Leo please remember to do so in the next posts, or no one
> > will be picking this up. :)
>

Thanks for letting me know.

> My fault, it's an io channel patch.  But still good to copy relevant
> developers..

Np. Sure, I will keep in mind to add them in the next version.

Oh, BTW: I will be sending a v2 shortly.

>
> --
> Peter Xu
>




Re: [PATCH v1 1/1] QIOChannelSocket: Fix zero-copy send so socket flush works

2022-06-08 Thread Leonardo Bras Soares Passos
Hello Daniel,

On Wed, Jun 8, 2022 at 3:46 PM Daniel P. Berrangé  wrote:
>
> On Wed, Jun 08, 2022 at 03:18:09PM -0300, Leonardo Bras wrote:
> > Somewhere between v6 and v7 of the zero-copy-send patchset, a crucial
> > part of the flushing mechanism went missing: incrementing zero_copy_queued.
> >
> > Without that, the flushing interface becomes a no-op, and there is no
> > guarantee the buffer is really sent.
> >
> > This can go as bad as causing a corruption in RAM during migration.
> >
> > Fixes: 2bc58ffc2926 ("QIOChannelSocket: Implement io_writev zero copy flag 
> > & io_flush for CONFIG_LINUX")
> > Reported-by: 徐闯 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  io/channel-socket.c | 11 ---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/io/channel-socket.c b/io/channel-socket.c
> > index dc9c165de1..ca4cae930f 100644
> > --- a/io/channel-socket.c
> > +++ b/io/channel-socket.c
> > @@ -554,6 +554,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
> > *ioc,
> >  size_t fdsize = sizeof(int) * nfds;
> >  struct cmsghdr *cmsg;
> >  int sflags = 0;
> > +bool zero_copy_enabled = false;
> >
> >  memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
> >
> > @@ -581,6 +582,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
> > *ioc,
> >  #ifdef QEMU_MSG_ZEROCOPY
> >  if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> >  sflags = MSG_ZEROCOPY;
> > +zero_copy_enabled = true;
> >  }
> >  #endif
> >
> > @@ -592,21 +594,24 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
> > *ioc,
> >  return QIO_CHANNEL_ERR_BLOCK;
> >  case EINTR:
> >  goto retry;
> > -#ifdef QEMU_MSG_ZEROCOPY
>
> Removing this ifdef appears incidental to the change. If this is
> redundant just remove it in its own patch.

The idea is to reduce the number of #ifdefs, as Peter suggested, because
adding another ifdef here would introduce extra noise. But sure, I see no
problem splitting this change out as a preceding patch.

>
> >  case ENOBUFS:
> > -if (sflags & MSG_ZEROCOPY) {
> > +if (zero_copy_enabled) {
> >  error_setg_errno(errp, errno,
> >   "Process can't lock enough memory for 
> > using MSG_ZEROCOPY");
> >  return -1;
> >  }
> >  break;
> > -#endif
> >  }
> >
> >  error_setg_errno(errp, errno,
> >   "Unable to write to socket");
> >  return -1;
> >  }
> > +
> > +if (zero_copy_enabled) {
>
> What's wrong with
>
>if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> sioc->zero_copy_queued++;
> }

There is nothing wrong with it, but using zero_copy_enabled as presented
here lets the compiler drop this 'if ()' block entirely when the build
does not support MSG_ZEROCOPY.

Best regards,
Leo

>
>
> Introducing another local variable doesn't really add value IMHO.
>
> With regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>




Re: [External] [PATCH v13 3/8] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-06-08 Thread Leonardo Bras Soares Passos
On Wed, Jun 8, 2022 at 8:41 AM Peter Xu  wrote:
>
> On Wed, Jun 08, 2022 at 02:37:28AM -0300, Leonardo Bras Soares Passos wrote:
> > (1) is not an option, as the interface currently uses ret=1 to indicate
> > whether MSG_ZEROCOPY is actually being used. I added that so the user of
> > qio_channel can switch off zero-copy when it's not being used, and save
> > some CPU.
>
> Yes (1) is not, but could you explain what do you mean by making sure
> MSG_ZEROCOPY being used?  Why is it relevant to the retval here?

If sendmsg() is called with MSG_ZEROCOPY, and everything is configured
correctly, the kernel will attempt to send the buffer using zero-copy.

Even with the right configuration on a recent enough kernel, there are
factors that can prevent zero-copy from happening, and the kernel will
fall back to the copying mechanism.
An example is the net device not supporting the 'Scatter-Gather' feature
(NETIF_F_SG).

When this happens, there is an overhead for 'trying zero-copy first',
instead of just opting for the copying mechanism.

In a previous iteration of the patchset, it was made clear that it's
desirable to detect when the kernel falls back to copying mechanism,
so the user of 'QIOChannelSocket' can switch to copying and avoid the
overhead. This was done by the return value of flush(), which is 1 if
that occurs.
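
For reference, the underlying pattern looks roughly like this (a sketch of
the kernel interface usage, not QEMU code; it assumes fd, iov and iovcnt
are already set up, and omits error handling):

int one = 1;
/* opt in once per socket */
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

struct msghdr msg = { .msg_iov = iov, .msg_iovlen = iovcnt };
/* "success" here only means the data was queued; the buffer must stay
 * untouched until the completion shows up on MSG_ERRQUEUE, and that
 * completion is also where a fallback to copying becomes visible */
ssize_t n = sendmsg(fd, &msg, MSG_ZEROCOPY);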

>
> I just figured it's a bit weird to return >0 here in flush().
>
> >
> > (2) is not a problem, but I fail to see how useful that would be. Is
> > the idea to manually keep track of flushes happening?
>
> Yes if we can check this up it'll be good enough to me.  The trace point
> could help in some case in the future too to monitor the behavior of kernel
> MSG_ERRQUEUE but if you don't like it then it's okay.
>

TBH I am not sure how those tracepoints work yet, and I am afraid they
could introduce some overhead in flush.
In any case, we can introduce this trace in a separate patch, since
fixing zero-copy flush seems more urgent right now.

Best regards,
Leo

> --
> Peter Xu
>




Re: [External] [PATCH v13 3/8] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-06-07 Thread Leonardo Bras Soares Passos
Hello Peter,

On Wed, Jun 1, 2022 at 10:58 AM Peter Xu  wrote:
>
[...]
> > Hi, Leonardo. I'm also paying attention to the application of MSG_ZEROCOPY
> > in live migration recently. I noticed that you defined a member
> > `zero_copy_queued` in the struct QIOChannelSocket, but I can't find out
> > where the value of this member has been changed in your patch. Can you
> > answer it for me?
> >
>
> Good point.. it should probably be increased when queuing the pages. We'd
> better fix it up, or it seems the flush() will literally be a no-op..

That's correct.
I am working on a fix right now.
The idea is to increment it in qio_channel_socket_writev() if sendmsg succeeds.
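
Something along these lines, that is (a sketch of the intended fix; the
exact placement follows the patch posted later in this thread):

    ret = sendmsg(sioc->fd, &msg, sflags);
    if (ret < 0) {
        /* EAGAIN / EINTR / ENOBUFS handling as in the existing code */
    }

    if (zero_copy_enabled) {
        /* only count sends actually queued with MSG_ZEROCOPY, so the
         * flush loop terminates once all completions are consumed */
        sioc->zero_copy_queued++;
    }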

>
> Two things in qio_channel_socket_flush() we can do to make sure it'll work
> as expected, imo:
>
>   1) make ret=-1 as initial value, rather than 1 - we only check negative
>  errors in the caller so we could have missed a positive "1"
>
>   2) add a tracepoint into the loop of updating zero_copy_sent
>
> Leo, what's your take?

(1) is not an option, as the interface currently uses ret=1 to indicate
whether MSG_ZEROCOPY is actually being used. I added that so the user of
qio_channel can switch off zero-copy when it's not being used, and save
some CPU.

(2) is not a problem, but I fail to see how useful that would be. Is
the idea to manually keep track of flushes happening?

Best regards,
Leo




Re: [External] [PATCH v13 3/8] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-06-07 Thread Leonardo Bras Soares Passos
Hello 徐闯,

Thanks for reviewing!

On Wed, Jun 1, 2022 at 6:37 AM 徐闯  wrote:
[...]
> Hi, Leonardo. I'm also paying attention to the application of
> MSG_ZEROCOPY in live migration recently. I noticed that you defined a
> member `zero_copy_queued` in the struct QIOChannelSocket, but I can't
> find out where the value of this member has been changed in your patch.
> Can you answer it for me?
>

You are right.
This was correctly implemented until v6, and then the increment just
vanished.

Since v6 there were a lot of changes both in the patch and in the base
repository, so I think I completely missed it in some change iteration.

I will send a fix shortly.
Is it OK if I include a "Reported-by: 徐闯" tag in the patch?

Best regards,
Leo




Re: [PULL 00/16] migration queue

2022-05-13 Thread Leonardo Bras Soares Passos
On Wed, May 11, 2022 at 5:55 AM Dr. David Alan Gilbert
 wrote:
>
> * Leonardo Bras Soares Passos (leob...@redhat.com) wrote:
> > From a previous thread:
> >
> > On Thu, Apr 28, 2022 at 1:20 PM Dr. David Alan Gilbert
> >  wrote:
> > >
> > > Leo:
> > >   Unfortunately this is failing a couple of CI tests; the MSG_ZEROCOPY
> > > one I guess is the simpler one; I think Stefanha managed to find the
> > > liburing fix for the __kernel_timespec case, but that looks like a bit
> > > more fun!
> > >
> > > Dave
> >
> > I thought Stefanha had fixed this bug, and we were just waiting for a
> > new alpine rootfs/image with that fixed.
> > Is that correct?
> >
> > On Tue, May 10, 2022 at 7:43 AM Dr. David Alan Gilbert
> >  wrote:
> > >
> > > * Daniel P. Berrangé (berra...@redhat.com) wrote:
> > > > On Tue, May 10, 2022 at 10:58:30AM +0100, Dr. David Alan Gilbert wrote:
> > [...]
> > > >
> > > > Yuk. That very much looks like a bug in liburing itself to me.
> > > >
> > > >
> > > > We've exposed the latent bug by including linux/errqueue.h
> > >
> > > Yes, I think there was a thread after the 1st pull where Leo identified
> > > the patch that fixed it; but it's not in that image.
> >
> > I only fixed the missing MSG_ZEROCOPY define bug, as I understood that
> > Stefanha had already fixed the issue in liburing/alpine.
> >
> > questions:
> > - Has Stefanha really fixed that, and we are just waiting for a new
> > image, or have I got that wrong?
> > - How should I proceed with that?
> >
> > - If we proceed with fixing this up in alpine, will that require this
> > patchset to be on pause until it's fixed there?
>
> It needs to pass in CI; so yes.
>
> > - If so, is there any suggestion on how to fix that in qemu code?
> > (this header is needed because of SO_EE_* defines)
>
> I've not actually looked at the detail of the failure; but yes I think
> we need a qemu workaround here.
>
> If there's no simple fix, then adding a test to meson.build to
> conditionally disable liburing might be best; like the test code for
> libcap_ng I guess (search in meson.build for libcap_ng.found  at around
> line 540.

Hello Dave,

I solved this issue by doing this:

+linux_io_uring_test = '''
+  #include <liburing.h>
+  #include <linux/errqueue.h>
+
+  int main(void) { return 0; }'''
+
 linux_io_uring = not_found
 if not get_option('linux_io_uring').auto() or have_block
   linux_io_uring = dependency('liburing', version: '>=0.3',
   required: get_option('linux_io_uring'),
   method: 'pkg-config', kwargs: static_kwargs)
+  if not cc.links(linux_io_uring_test)
+linux_io_uring = not_found
+  endif
 endif
+

Seems to work fine in CI, and now Alpine does not fail anymore.
(See pipeline https://gitlab.com/LeoBras/qemu/-/pipelines/538123933
for reference)

I am not sure if this is the right thing to do, but I will be sending
it as a full new patchset (v13), with the first patch being the one
with the above change and the rest just carrying the recommended
fixes.

I was also thinking I could instead send the single "fix" patch, and
recommend adding it before my v12. If that is the correct approach for
this case, please let me know so I can improve in the future. (I am
trying to figure out what is simpler/best for maintainers)

Best regards,
Leo

>
> Dave
>
> > Thank you all!
> >
> > Best regards,
> > Leo
> >
> > >
> > > Dave
> > >
> > > > With regards,
> > > > Daniel
> > > > --
> > > > |: https://berrange.com  -o-
> > > > https://www.flickr.com/photos/dberrange :|
> > > > |: https://libvirt.org -o-
> > > > https://fstop138.berrange.com :|
> > > > |: https://entangle-photo.org-o-
> > > > https://www.instagram.com/dberrange :|
> > > >
> > > --
> > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> > >
> >
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>




Re: [PULL 00/16] migration queue

2022-05-10 Thread Leonardo Bras Soares Passos
From a previous thread:

On Thu, Apr 28, 2022 at 1:20 PM Dr. David Alan Gilbert
 wrote:
>
> Leo:
>   Unfortunately this is failing a couple of CI tests; the MSG_ZEROCOPY
> one I guess is the simpler one; I think Stefanha managed to find the
> liburing fix for the __kernel_timespec case, but that looks like a bit
> more fun!
>
> Dave

I thought Stefanha had fixed this bug, and we were just waiting for a
new alpine rootfs/image with that fixed.
Is that correct?

On Tue, May 10, 2022 at 7:43 AM Dr. David Alan Gilbert
 wrote:
>
> * Daniel P. Berrangé (berra...@redhat.com) wrote:
> > On Tue, May 10, 2022 at 10:58:30AM +0100, Dr. David Alan Gilbert wrote:
[...]
> >
> > Yuk. That very much looks like a bug in liburing itself to me.
> >
> >
> > We've exposed the latent bug by including linux/errqueue.h
>
> Yes, I think there was a thread after the 1st pull where Leo identified
> the patch that fixed it; but it's not in that image.

I only fixed the missing MSG_ZEROCOPY define bug, as I understood that
Stefanha had already fixed the issue in liburing/alpine.

questions:
- Has Stefanha really fixed that, and we are just waiting for a new
image, or have I got that wrong?
- How should I proceed with that?
- If we proceed with fixing this up in alpine, will that require this
patchset to be on pause until it's fixed there?
- If so, is there any suggestion on how to fix that in qemu code?
(this header is needed because of SO_EE_* defines)

Thank you all!

Best regards,
Leo

>
> Dave
>
> > With regards,
> > Daniel
> > --
> > |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange 
> > :|
> > |: https://libvirt.org -o-https://fstop138.berrange.com 
> > :|
> > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange 
> > :|
> >
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>




Re: [PULL 12/16] migration: Add zero-copy-send parameter for QMP/HMP for Linux

2022-05-09 Thread Leonardo Bras Soares Passos
On Mon, May 9, 2022 at 4:45 PM Richard Henderson
 wrote:
>
> On 5/9/22 10:02, Dr. David Alan Gilbert (git) wrote:
> > diff --git a/roms/skiboot b/roms/skiboot
> > index 24a7eb3596..820d43c0a7 160000
> > --- a/roms/skiboot
> > +++ b/roms/skiboot
> > @@ -1 +1 @@
> > -Subproject commit 24a7eb35966d93455520bc2debdd7954314b638b
> > +Subproject commit 820d43c0a7751e75a8830561f35535dfffd522bd
>
> This is an error, probably in rebasing.

Yeah, that's odd. Should not be there.

David, could you please remove that?
Thanks!

>
>
> r~
>




Re: [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-05-05 Thread Leonardo Bras Soares Passos
On Thu, May 5, 2022 at 12:55 PM Daniel P. Berrangé  wrote:
>
> On Thu, May 05, 2022 at 12:42:47PM -0300, Leonardo Bras Soares Passos wrote:
> >
> > Hello Daniel,
> >
> > But what if this gets compiled in a Linux system without MSG_ZEROCOPY 
> > support?
> > As qapi will have zero-copy-send as an option we could have this scenario:
> >
> > - User request migration using zero-copy-send
> > - multifd_save_setup() will set write_flags = 
> > QIO_CHANNEL_WRITE_FLAG_ZERO_COPY
> > - In qio_channel_socket_connect_sync(): setsockopt() part will be
> > compiled-out, so no error here
> > - above part in qio_channel_socket_writev() will be commented-out,
> > which means write_flags will be ignored
> > - sflags will not contain MSG_ZEROCOPY, so sendmsg() will use copy-mode
> > - migration will succeed
> >
> > In the above case, the user has all the reason to think migration is
> > using MSG_ZEROCOPY, but in fact it's quietly falling back to
> > copy-mode.
>
> I think we're ok because qio_channel_writev_full() does
>
> if ((flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) &&
> !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
> error_setg_errno(errp, EINVAL,
>  "Requested Zero Copy feature is not available");
> return -1;
> }
>
> and since there's no way for QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY to
> get set when MSG_ZEROCOPY is compiled out, we'll trigger the error
> condition.

Oh, that's right. It will fail in the first writev(), I was just
considering failing during setup.

>
> > That's why I suggested creating a 'global' config using SO_ZEROCOPY &
> > MSG_ZEROCOPY & CONFIG_LINUX, so we can use it in qapi and have no chance
> > of even offering zero-copy-send if we don't have it.
> >
> > Another local option is to implement your suggestions, and also
> > change qio_channel_socket_connect_sync() so it returns an error if
> > MSG_ZEROCOPY && SO_ZEROCOPY is not present, such as:
> >
> > +#ifdef CONFIG_LINUX
> > +#if defined(MSG_ZEROCOPY)  && defined(SO_ZEROCOPY)
> > +int ret, v = 1;
> > +ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> > +if (ret == 0) {
> > +/* Zero copy available on host */
> > +qio_channel_set_feature(QIO_CHANNEL(ioc),
> > +QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
> > +}
> > +#else
> > +error_setg_errno(errp, errno, "MSG_ZEROCOPY not available");
> > +return -1;
> > +#endif
> > +#endif
>
> Do we actually need the ifdef CONFIG_LINUX bit at all ?
>
> Sufficient to just have the check for MSG_ZEROCOPY + SO_ZEROCOPY,
> which will fail on non-Linux anyway.

Through some include issue, or via future implementations, we could have
MSG_ZEROCOPY or SO_ZEROCOPY getting defined on an OS other than Linux,
which would introduce some headaches.

Since you pointed out that migration will fail on writev(), the above
piece of code is not necessary.
We could have a local define that equals (MSG_ZEROCOPY && SO_ZEROCOPY &&
CONFIG_LINUX), so that we can make the code simpler where needed.
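
For illustration, the conventional way to spell such a guard is with
nested #if rather than defined() inside a #define body (the latter is
undefined behavior when the macro is later used in an #if). A sketch,
using the QEMU_MSG_ZEROCOPY name floated elsewhere in this thread and
assuming the socket headers provide the two flags on Linux:

#ifdef CONFIG_LINUX
#include <sys/socket.h>
#include <linux/errqueue.h>

#if defined(MSG_ZEROCOPY) && defined(SO_ZEROCOPY)
#define QEMU_MSG_ZEROCOPY
#endif
#endif

Code guarded by #ifdef QEMU_MSG_ZEROCOPY then compiles only when all
three conditions hold.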

I will work on a v12 and send it here.

Best regards,
Leo




Re: [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-05-05 Thread Leonardo Bras Soares Passos
On Thu, May 5, 2022 at 5:05 AM Daniel P. Berrangé  wrote:
>
> On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote:
> > For CONFIG_LINUX, implement the new zero copy flag and the optional callback
> > io_flush on QIOChannelSocket, but enables it only when MSG_ZEROCOPY
> > feature is available in the host kernel, which is checked on
> > qio_channel_socket_connect_sync()
> >
> > qio_channel_socket_flush() was implemented by counting how many times
> > sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
> > socket's error queue, in order to find how many of them finished sending.
> > Flush will loop until those counters are the same, or until some error 
> > occurs.
> >
> > Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
> > 1: Buffer
> > - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid 
> > copying,
> > some caution is necessary to avoid overwriting any buffer before it's sent.
> > If something like this happen, a newer version of the buffer may be sent 
> > instead.
> > - If this is a problem, it's recommended to call qio_channel_flush() before 
> > freeing
> > or re-using the buffer.
> >
> > 2: Locked memory
> > - When using MSG_ZEROCOPY, the buffer memory will be locked after being queued, 
> > and
> > unlocked after it's sent.
> > - Depending on the size of each buffer, and how often it's sent, it may 
> > require
> > a larger amount of locked memory than usually available to non-root user.
> > - If the required amount of locked memory is not available, writev_zero_copy
> > will return an error, which can abort an operation like migration,
> > - Because of this, when user code wants to add zero copy as a feature, it
> > requires a mechanism to disable it, so it can still be accessible to less
> > privileged users.
> >
> > Signed-off-by: Leonardo Bras 
> > Reviewed-by: Peter Xu 
> > Reviewed-by: Daniel P. Berrangé 
> > Reviewed-by: Juan Quintela 
> > ---
> >  include/io/channel-socket.h |   2 +
> >  io/channel-socket.c | 120 ++--
> >  2 files changed, 118 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> > index e747e63514..513c428fe4 100644
> > --- a/include/io/channel-socket.h
> > +++ b/include/io/channel-socket.h
> > @@ -47,6 +47,8 @@ struct QIOChannelSocket {
> >  socklen_t localAddrLen;
> >  struct sockaddr_storage remoteAddr;
> >  socklen_t remoteAddrLen;
> > +ssize_t zero_copy_queued;
> > +ssize_t zero_copy_sent;
> >  };
> >
> >
> > diff --git a/io/channel-socket.c b/io/channel-socket.c
> > index 696a04dc9c..ae756ce166 100644
> > --- a/io/channel-socket.c
> > +++ b/io/channel-socket.c
> > @@ -25,9 +25,25 @@
> >  #include "io/channel-watch.h"
> >  #include "trace.h"
> >  #include "qapi/clone-visitor.h"
> > +#ifdef CONFIG_LINUX
> > +#include 
> > +#include 
> > +#endif
> >
> >  #define SOCKET_MAX_FDS 16
> >
> > +/*
> > + * Zero-copy defines below are included to avoid breaking builds on 
> > systems
> > + * that don't support MSG_ZEROCOPY, while keeping the functions more 
> > readable
> > + * (without a lot of ifdefs).
> > + */
> > +#ifndef MSG_ZEROCOPY
> > +#define MSG_ZEROCOPY 0x4000000
> > +#endif
> > +#ifndef SO_ZEROCOPY
> > +#define SO_ZEROCOPY 60
> > +#endif
>
> Please put these behind CONFIG_LINUX to make it clear to readers that
> this is entirely Linux specific
>
>
> > +
> >  SocketAddress *
> >  qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
> >   Error **errp)
> > @@ -54,6 +70,8 @@ qio_channel_socket_new(void)
> >
> >  sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
> >  sioc->fd = -1;
> > +sioc->zero_copy_queued = 0;
> > +sioc->zero_copy_sent = 0;
> >
> >  ioc = QIO_CHANNEL(sioc);
> >  qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> > @@ -153,6 +171,16 @@ int qio_channel_socket_connect_sync(QIOChannelSocket 
> > *ioc,
> >  return -1;
> >  }
> >
> > +#ifdef CONFIG_LINUX
> > +int ret, v = 1;
> > > +ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> > +if (ret == 0) {
> > +/* Zero copy available on host */
> > +qio_channel_set_feature(QIO_CHANNEL(ioc),
> > +QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
> > +}
> > +#endif
> > +
> >  return 0;
> >  }
> >
> > @@ -533,6 +561,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
> > *ioc,
> >  char control[CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)];
> >  size_t fdsize = sizeof(int) * nfds;
> >  struct cmsghdr *cmsg;
> > +int sflags = 0;
> >
> >  memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
> >
> > @@ -557,15 +586,27 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
> > *ioc,
> >  memcpy(CMSG_DATA(cmsg), fds, fdsize);
> >  }
> >
> > +if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > +sflags = MSG_ZEROCOPY;
> > 

Re: [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-05-04 Thread Leonardo Bras Soares Passos
On Wed, May 4, 2022 at 4:53 PM Peter Xu  wrote:
>
> On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote:
> > +/*
> > + * Zero-copy defines below are included to avoid breaking builds on 
> > systems
> > + * that don't support MSG_ZEROCOPY, while keeping the functions more 
> > readable
> > + * (without a lot of ifdefs).
> > + */
> > +#ifndef MSG_ZEROCOPY
> > +#define MSG_ZEROCOPY 0x4000000
> > +#endif
> > +#ifndef SO_ZEROCOPY
> > +#define SO_ZEROCOPY 60
> > +#endif
>
> So this will define these two values on e.g. FreeBSD, while they do not
> make sense at all there because these numbers are pure magics and
> meaningless outside Linux..

Correct.
But since the QIO_CHANNEL_WRITE_FLAG_ZERO_COPY flag can only be set on
Linux, sflags will always be zero elsewhere, and MSG_ZEROCOPY will never
be used outside Linux.

> I don't think it's anything dangerous, but IMHO it's another way of being
> not clean compared to using some "#ifdef"s.  Compared to this approach
> the "use #ifdef" approach is actually slightly cleaner to me. :)
>

This requires:
- Creating a define such as 'QEMU_MSG_ZEROCOPY', which needs to include
the socket headers to get those flags:
#define QEMU_MSG_ZEROCOPY defined(CONFIG_LINUX) &&
defined(MSG_ZEROCOPY) && defined(SO_ZEROCOPY)
- Making it available for all code in this patchset that does '#ifdef
CONFIG_LINUX'
(migration/migration.c/h, qapi/migration.json, monitor/hmp-cmds.c,
io/channel-socket.c)
- Replace current usage of CONFIG_LINUX in this patchset for QEMU_MSG_ZEROCOPY
- Change qio_channel_socket_writev() so the current 2 usages of
MSG_ZEROCOPY are surrounded by ifdef QEMU_MSG_ZEROCOPY.

Pros of the above approach (1):
- Smaller binary: the whole MSG_ZEROCOPY code is compiled out if the
build system does not support it.
- Since it's compiled out, a couple fewer lines of code run when the
build system does not support it.
- It's not even possible to set this option in MigrationSetParams,
which will return an error.

Pros of the current approach (2):
- The define is local to the file (I am not sure it's OK to create a
'global' define for the above approach, including the header bits).
- A build system that does not support MSG_ZEROCOPY can produce a
binary that can use MSG_ZEROCOPY if the target system supports it.
- There are no #ifdefs in qio_channel_socket_writev().

(2) is already implemented in v11, but I have no issue implementing
(1) for v12 if it's OK to create this 'global' define.

> Let's wait for some other inputs.

Agree.
With the pros of each approach laid out, I would like some input on
what is better for the project.

Best regards,
Leo




Re: [PATCH v9 3/7] migration: Add zero-copy-send parameter for QMP/HMP for Linux

2022-05-04 Thread Leonardo Bras Soares Passos
On Wed, May 4, 2022 at 7:48 AM Markus Armbruster  wrote:
>
> Leonardo Bras  writes:
>
> > Add property that allows zero-copy migration of memory pages
> > on the sending side, and also includes a helper function
> > migrate_use_zero_copy_send() to check if it's enabled.
> >
> > No code is introduced to actually do the migration, but it allow
> > future implementations to enable/disable this feature.
> >
> > On non-Linux builds this parameter is compiled-out.
> >
> > Signed-off-by: Leonardo Bras 
> > Reviewed-by: Peter Xu 
> > Reviewed-by: Daniel P. Berrangé 
> > Reviewed-by: Juan Quintela 
> > ---
> >  qapi/migration.json   | 24 
> >  migration/migration.h |  5 +
> >  migration/migration.c | 32 
> >  migration/socket.c| 11 +--
> >  monitor/hmp-cmds.c|  6 ++
> >  5 files changed, 76 insertions(+), 2 deletions(-)
> >
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 409eb086a2..04246481ce 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -741,6 +741,13 @@
> >  #  will consume more CPU.
> >  #  Defaults to 1. (Since 5.0)
> >  #
> > +# @zero-copy-send: Controls behavior on sending memory pages on migration.
> > +#  When true, enables a zero-copy mechanism for sending 
> > memory
> > +#  pages, if host supports it.
> > +#  Requires that QEMU be permitted to use locked memory 
> > for guest
> > +#  RAM pages.
>
> Please wrap lines around column 75.  More of the same below.
>
> > +#  Defaults to false. (Since 7.1)
> > +#
> >  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
> >  #aliases for the purpose of dirty bitmap 
> > migration.  Such
> >  #aliases may for example be the corresponding 
> > names on the
>
> With that, QAPI schema
> Acked-by: Markus Armbruster 

Thanks Markus!

Best regards,
Leo

>
> [...]
>




Re: [PULL 06/11] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-05-02 Thread Leonardo Bras Soares Passos
Hello Peter,

On Mon, May 2, 2022 at 8:52 PM Peter Xu  wrote:
>
> Leo,
>
> On Fri, Apr 29, 2022 at 11:40:44PM -0300, Leonardo Bras Soares Passos wrote:
> > Does anyone else have any further suggestions, or know how this kind of 
> > issue
> > is generally solved in qemu?
>
> I've no solid idea why it can't see MSG_ZEROCOPY defined in the specific
> environment, but when I was looking at bits/socket.h I saw this:
>
> #ifndef _SYS_SOCKET_H
> # error "Never include  directly; use  instead."
> #endif
>
> Maybe worth a shot to do a replacement in all cases?
>

Sure, no problem with this, I will update it for v11.
(Or should I send a separate patch, since Dave has already merged this in
his tree?)

But it should not interfere with the MSG_ZEROCOPY definition:

> > I tracked down how the test happened, downloaded the same docker image from 
> > the
> > tests(opensuse-leap-15.2), and took a look at the filesystem for the
> > MSG_ZEROCOPY define, which I could not find anywhere.

By this, I mean I did a 'grep MSG_ZEROCOPY -r /' and could not find anything, so
it's probably not defined anywhere in the fs.

> --
> Peter Xu
>

Thanks Peter!

Best regards,
Leo




Re: [PULL 06/11] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-04-29 Thread Leonardo Bras Soares Passos
Hello Dave,

On Thu, Apr 28, 2022 at 1:20 PM Dr. David Alan Gilbert
 wrote:
>
> Leo:
>   Unfortunately this is failing a couple of CI tests; the MSG_ZEROCOPY
> one I guess is the simpler one; I think Stefanha managed to find the
> liburing fix for the __kernel_timespec case, but that looks like a bit
> more fun!
>
> Dave

About MSG_ZEROCOPY error:

I tracked down how the test happened, downloaded the same docker image from the
tests(opensuse-leap-15.2), and took a look at the filesystem for the
MSG_ZEROCOPY define, which I could not find anywhere.

Then I took a look at /usr/include/bits/socket.h, which is where RHEL has
MSG_ZEROCOPY defined. Zypper lists it as provided by glibc-devel, which is
versioned at 2.26-lp152.26.12.1.

I then took a look at https://sourceware.org/git/glibc.git, and found commit
78cde19f62 that introduces MSG_ZEROCOPY. The first version that has this commit
is glibc-2.27.

So, basically, this means the opensuse-leap-15.2 glibc does not provide
MSG_ZEROCOPY. Based on that, I had a few ideas on how to solve the CI bug:
1 - Propose a backport of this patch (a few comments + a single define) for
leap-15.x, wait for them to accept, and update the version in qemu CI.
(TBH I have no idea how the opensuse community works; I just suppose it could
be a way of tackling this.)
2 - Include an #ifndef MSG_ZEROCOPY #define MSG_ZEROCOPY 0x4000000 #endif in
the code, which is ugly IMHO, but will be fast and clean.
3 - In CI, patch /usr/include/bits/socket.h before building, which will also
work fine, but defeats the purpose of keeping qemu building on the platform.

Among the above, I would go with (2), as it seems a reasonable way of dealing
with this.

Does anyone else have any further suggestions, or know how this kind of issue
is generally solved in qemu?

Best regards,
Leo


>
>
> Job #2390848140 ( https://gitlab.com/dagrh/qemu/-/jobs/2390848140/raw )
> Name: build-system-alpine
> In file included from /usr/include/linux/errqueue.h:6,
>  from ../io/channel-socket.c:29:
> /usr/include/linux/time_types.h:7:8: error: redefinition of 'struct 
> __kernel_timespec'
> 7 | struct __kernel_timespec {
>   |^
> In file included from /usr/include/liburing.h:19,
>  from /builds/dagrh/qemu/include/block/aio.h:18,
>  from /builds/dagrh/qemu/include/io/channel.h:26,
>  from /builds/dagrh/qemu/include/io/channel-socket.h:24,
>  from ../io/channel-socket.c:24:
> /usr/include/liburing/compat.h:9:8: note: originally defined here
> 9 | struct __kernel_timespec {
>   |^
>
> 
> Name: build-system-opensuse
>
> https://gitlab.com/dagrh/qemu/-/jobs/2390848160/raw
> ../io/channel-socket.c: In function ‘qio_channel_socket_writev’:
> ../io/channel-socket.c:578:18: error: ‘MSG_ZEROCOPY’ undeclared (first 
> use in this function); did you mean ‘SO_ZEROCOPY’?
>  sflags = MSG_ZEROCOPY;
>   ^~~~
>   SO_ZEROCOPY
> ../io/channel-socket.c:578:18: note: each undeclared identifier is reported 
> only once for each function it appears in
>
> * Dr. David Alan Gilbert (git) (dgilb...@redhat.com) wrote:
> > From: Leonardo Bras 
> >
> > For CONFIG_LINUX, implement the new zero copy flag and the optional callback
> > io_flush on QIOChannelSocket, but enables it only when MSG_ZEROCOPY
> > feature is available in the host kernel, which is checked on
> > qio_channel_socket_connect_sync()
> >
> > qio_channel_socket_flush() was implemented by counting how many times
> > sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
> > socket's error queue, in order to find how many of them finished sending.
> > Flush will loop until those counters are the same, or until some error 
> > occurs.
> >
> > Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
> > 1: Buffer
> > - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid 
> > copying,
> > some caution is necessary to avoid overwriting any buffer before it's sent.
> > If something like this happens, a newer version of the buffer may be sent 
> > instead.
> > - If this is a problem, it's recommended to call qio_channel_flush() before 
> > freeing
> > or re-using the buffer.
> >
> > 2: Locked memory
> > - When using MSG_ZEROCOPY, the buffer memory will be locked after being queued, 
> > and
> > unlocked after it's sent.
> > - Depending on the size of each buffer, and how often it's sent, it may 
> > require
> > a larger amount of locked memory than usually available to non-root user.
> > - If the required amount of locked memory is not available, writev_zero_copy
> > will return an error, which can abort an operation like migration,
> > - Because of this, when user code wants to add zero copy as a feature, it
> > requires a mechanism to disable it, so it can still be accessible to less
> > privileged users.
> >
> > Signed-off-by: Leonardo Bras 
> > 

Re: [PATCH v10 0/7] MSG_ZEROCOPY + multifd

2022-04-28 Thread Leonardo Bras Soares Passos
On Thu, Apr 28, 2022 at 11:08 AM Dr. David Alan Gilbert
 wrote:
>
> * Leonardo Bras (leob...@redhat.com) wrote:
> > This patch series intends to enable MSG_ZEROCOPY in QIOChannel, and make
> > use of it for multifd migration performance improvement, by reducing cpu
> > usage.
> >
> > Patch #1 creates new callbacks for QIOChannel, allowing the implementation
> > of zero copy writing.
> >
> > Patch #2 implements io_writev flags and io_flush() on QIOChannelSocket,
> > making use of MSG_ZEROCOPY on Linux.
> >
> > Patch #3 adds a "zero_copy_send" migration property, only available with
> > CONFIG_LINUX, and compiled-out in any other architectures.
> > This migration property has to be enabled before multifd migration starts.
> >
> > Patch #4 adds a helper function that allows to see if TLS is going to be 
> > used.
> > This helper will be later used in patch #5.
> >
> > Patch #5 changes multifd_send_sync_main() so it returns int instead of void.
> > The return value is used to understand if any error happened in the 
> > function,
> > allowing migration to possible fail earlier.
> >
> > Patch #6 implements an workaround: The behavior introduced in d48c3a0445 is
> > hard to deal with in zerocopy, so a workaround is introduced to send the
> > header in a different syscall, without MSG_ZEROCOPY.
> >
> > Patch #7 Makes use of QIOChannelSocket zero_copy implementation on
> > nocomp multifd migration.
>
> Queued.
>
> > Results:
> > In preliminary tests, the resource usage of __sys_sendmsg() reduced 15 
> > times,
> > and the overall migration took 13-22% less time, based in synthetic cpu
> > workload.
> >
> > In further tests, it was noted that, on multifd migration with 8 channels:
> > - On idle hosts, migration time was reduced by 10% to 21%.
> > - On hosts busy with heavy cpu stress (1 stress thread per cpu, but
> >   not cpu-pinned), migration time was reduced by ~25% by enabling zero-copy.
> > - On hosts with heavy cpu-pinned workloads (1 stress thread per cpu,
> >   cpu-pinned), migration time was reduced by ~66% by enabling zero-copy.
>
> Nice.
>
> > Above tests setup:
> > - Sending and Receiving hosts:
> >   - CPU : Intel(R) Xeon(R) Platinum 8276L CPU @ 2.20GHz (448 CPUS)
> >   - Network card: E810-C (100Gbps)
> >   - >1TB RAM
> >   - QEMU: Upstream master branch + This patchset
> >   - Linux: Upstream v5.15
>
> That configuration is particularly interesting because while it's a big
> machine with lots of cores, the individual cores are clocked relatively
> slowly; also having lots of cores probably means they're all fighting
> over memory bandwidth, so the less copies the better.
>
> Dave
>

Thanks Dave!

Best regards,
Leo

> > - VM configuration:
> >   - 28 VCPUs
> >   - 512GB RAM
> >
> >
> > ---
> > Changes since v9:
> > - Patch #6 got simplified and improved (thanks Daniel)
> > - Patch #7 got better comments (thanks Peter Xu)
> >
> > Changes since v8:
> > - Inserted two new patches #5 & #6, previous patch #5 is now #7.
> > - Workaround an optimization introduced in d48c3a0445
> > - Removed unnecessary assert in qio_channel_writev_full_all
> >
> > Changes since v7:
> > - Migration property renamed from zero-copy to zero-copy-send
> > - A few early tests added to help misconfigurations to fail earlier
> > - qio_channel_full*_flags() renamed back to qio_channel_full*()
> > - multifd_send_sync_main() reverted back to not receiving a flag,
> >   so it always sync zero-copy when enabled.
> > - Improve code quality on a few points
> >
> > Changes since v6:
> > - Remove io_writev_zero_copy(), and makes use of io_writev() new flags
> >   to achieve the same results.
> > - Rename io_flush_zero_copy() to io_flush()
> > - Previous patch #2 became too small, so it was squashed in previous
> >   patch #3 (now patch #2)
> >
> > Changes since v5:
> > - flush_zero_copy now returns -1 on fail, 0 on success, and 1 when all
> >   processed writes were not able to use zerocopy in kernel.
> > - qio_channel_socket_poll() removed, using qio_channel_wait() instead
> > - ENOBUFS is now processed inside qio_channel_socket_writev_flags()
> > - Most zerocopy parameter validation moved to migrate_params_check(),
> >   leaving only feature test to socket_outgoing_migration() callback
> > - Naming went from *zerocopy to *zero_copy or *zero-copy, due to QAPI/QMP
> >   preferences
> > - Improved docs
> >
> > Changes since v4:
> > - 3 patches got splitted in 6
> > - Flush is used for syncing after each iteration, instead of only at the end
> > - If zerocopy is not available, fail in connect instead of failing on write
> > - 'multifd-zerocopy' property renamed to 'zerocopy'
> > - Fail migrations that don't support zerocopy, if it's enabled.
> > - Instead of checking for zerocopy at each write, save the flags in
> >   MultiFDSendParams->write_flags and use them on write
> > - Reorganized flag usage in QIOChannelSocket
> > - A lot of typos fixed
> > - More doc on buffer restrictions
> >
> > Changes since v3:
> > - QIOChannel interface names changed from 

Re: [PATCH v10 7/7] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-04-27 Thread Leonardo Bras Soares Passos
On Tue, Apr 26, 2022 at 8:26 PM Peter Xu  wrote:
>
> On Tue, Apr 26, 2022 at 08:06:56PM -0300, Leonardo Bras wrote:
> > Implement zero copy send on nocomp_send_write(), by making use of QIOChannel
> > writev + flags & flush interface.
> >
> > Change multifd_send_sync_main() so flush_zero_copy() can be called
> > after each iteration in order to make sure all dirty pages are sent before
> > a new iteration is started. It will also flush at the beginning and at the
> > end of migration.
> >
> > Also make it return -1 if flush_zero_copy() fails, in order to cancel
> > the migration process, and avoid resuming the guest in the target host
> > without receiving all current RAM.
> >
> > This will work fine on RAM migration because the RAM pages are not usually 
> > freed,
> > and there is no problem on changing the pages content between 
> > writev_zero_copy() and
> > the actual sending of the buffer, because this change will dirty the page 
> > and
> > cause it to be re-sent on a next iteration anyway.
> >
> > A lot of locked memory may be needed in order to use multifd migration
> > with zero-copy enabled, so disabling the feature should be necessary for
> > low-privileged users trying to perform multifd migrations.
> >
> > Signed-off-by: Leonardo Bras 
>
> Reviewed-by: Peter Xu 
>
> --
> Peter Xu
>

Thanks for reviewing Peter!

Best regards,
Leo




Re: [PATCH v10 7/7] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-04-27 Thread Leonardo Bras Soares Passos
On Wed, Apr 27, 2022 at 5:46 AM Daniel P. Berrangé  wrote:
>
> On Tue, Apr 26, 2022 at 08:06:56PM -0300, Leonardo Bras wrote:
> > Implement zero copy send on nocomp_send_write(), by making use of QIOChannel
> > writev + flags & flush interface.
> >
> > Change multifd_send_sync_main() so flush_zero_copy() can be called
> > after each iteration in order to make sure all dirty pages are sent before
> > a new iteration is started. It will also flush at the beginning and at the
> > end of migration.
> >
> > Also make it return -1 if flush_zero_copy() fails, in order to cancel
> > the migration process, and avoid resuming the guest in the target host
> > without receiving all current RAM.
> >
> > This will work fine on RAM migration because the RAM pages are not usually 
> > freed,
> > and there is no problem on changing the pages content between 
> > writev_zero_copy() and
> > the actual sending of the buffer, because this change will dirty the page 
> > and
> > cause it to be re-sent on a next iteration anyway.
> >
> > A lot of locked memory may be needed in order to use multifd migration
> > with zero-copy enabled, so disabling the feature should be necessary for
> > low-privileged users trying to perform multifd migrations.
> >
> > Signed-off-by: Leonardo Bras 
> > ---
> >  migration/multifd.h   |  2 ++
> >  migration/migration.c | 11 ++-
> >  migration/multifd.c   | 37 +++--
> >  migration/socket.c|  5 +++--
> >  4 files changed, 50 insertions(+), 5 deletions(-)
>
> Reviewed-by: Daniel P. Berrangé 
>
>
> With regards,
> Daniel

Thanks for reviewing, Daniel!

Best regards,
Leo

> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>




Re: [PATCH v10 6/7] multifd: Send header packet without flags if zero-copy-send is enabled

2022-04-27 Thread Leonardo Bras Soares Passos
On Wed, Apr 27, 2022 at 5:44 AM Daniel P. Berrangé  wrote:
>
> On Tue, Apr 26, 2022 at 08:06:55PM -0300, Leonardo Bras wrote:
> > Since d48c3a0445 ("multifd: Use a single writev on the send side"),
> > sending the header packet and the memory pages happens in the same
> > writev, which can potentially make the migration faster.
> >
> > Using channel-socket as example, this works well with the default copying
> > mechanism of sendmsg(), but with zero-copy-send=true, it will cause
> > the migration to often break.
> >
> > This happens because the header packet buffer gets reused quite often,
> > and there is a high chance that by the time the MSG_ZEROCOPY mechanism get
> > to send the buffer, it has already changed, sending the wrong data and
> > causing the migration to abort.
> >
> > It means that, as it is, the buffer for the header packet is not suitable
> > for sending with MSG_ZEROCOPY.
> >
> > In order to enable zero copy for multifd, send the header packet on an
> > individual write(), without any flags, and the remaining pages with a
> > writev(), as it was happening before. This only changes how a migration
> > with zero-copy-send=true works, not changing any current behavior for
> > migrations with zero-copy-send=false.
> >
> > Signed-off-by: Leonardo Bras 
> > ---
> >  migration/multifd.c | 23 ---
> >  1 file changed, 20 insertions(+), 3 deletions(-)
>
> Reviewed-by: Daniel P. Berrangé 
>
>
> With regards,
> Daniel

Thanks for reviewing Daniel!

Best regards,
Leo

> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>




Re: [PATCH v10 6/7] multifd: Send header packet without flags if zero-copy-send is enabled

2022-04-27 Thread Leonardo Bras Soares Passos
On Tue, Apr 26, 2022 at 8:26 PM Peter Xu  wrote:
>
> On Tue, Apr 26, 2022 at 08:06:55PM -0300, Leonardo Bras wrote:
> > Since d48c3a0445 ("multifd: Use a single writev on the send side"),
> > sending the header packet and the memory pages happens in the same
> > writev, which can potentially make the migration faster.
> >
> > Using channel-socket as example, this works well with the default copying
> > mechanism of sendmsg(), but with zero-copy-send=true, it will cause
> > the migration to often break.
> >
> > This happens because the header packet buffer gets reused quite often,
> > and there is a high chance that by the time the MSG_ZEROCOPY mechanism get
> > to send the buffer, it has already changed, sending the wrong data and
> > causing the migration to abort.
> >
> > It means that, as it is, the buffer for the header packet is not suitable
> > for sending with MSG_ZEROCOPY.
> >
> > In order to enable zero copy for multifd, send the header packet on an
> > individual write(), without any flags, and the remaining pages with a
> > writev(), as it was happening before. This only changes how a migration
> > with zero-copy-send=true works, not changing any current behavior for
> > migrations with zero-copy-send=false.
> >
> > Signed-off-by: Leonardo Bras 
> > ---
> >  migration/multifd.c | 23 ---
> >  1 file changed, 20 insertions(+), 3 deletions(-)
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 15fb668e64..07b2e92d8d 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -617,6 +617,7 @@ static void *multifd_send_thread(void *opaque)
> >  MultiFDSendParams *p = opaque;
> >  Error *local_err = NULL;
> >  int ret = 0;
> > +bool use_zero_copy_send = migrate_use_zero_copy_send();
> >
> >  trace_multifd_send_thread_start(p->id);
> >  rcu_register_thread();
> > @@ -639,9 +640,14 @@ static void *multifd_send_thread(void *opaque)
> >  if (p->pending_job) {
> >  uint64_t packet_num = p->packet_num;
> >  uint32_t flags = p->flags;
> > -p->iovs_num = 1;
> >  p->normal_num = 0;
> >
> > +if (use_zero_copy_send) {
> > +p->iovs_num = 0;
> > +} else {
> > +p->iovs_num = 1;
> > +}
> > +
> >  for (int i = 0; i < p->pages->num; i++) {
> >  p->normal[p->normal_num] = p->pages->offset[i];
> >  p->normal_num++;
> > @@ -665,8 +671,19 @@ static void *multifd_send_thread(void *opaque)
> >  trace_multifd_send(p->id, packet_num, p->normal_num, flags,
> > p->next_packet_size);
> >
> > -p->iov[0].iov_len = p->packet_len;
> > -p->iov[0].iov_base = p->packet;
> > +if (use_zero_copy_send) {
> > +/* Send header first, without zerocopy */
> > +ret = qio_channel_write_all(p->c, (void *)p->packet,
> > +p->packet_len, &local_err);
> > +if (ret != 0) {
> > +break;
> > +}
> > +
>
> Extra but useless newline.. but not worth a repost.  Looks good here:
>
> Reviewed-by: Peter Xu 
>
> Thanks,

Thanks for reviewing Peter!

Best regards,
Leo

>
> > +} else {
> > +/* Send header using the same writev call */
> > +p->iov[0].iov_len = p->packet_len;
> > +p->iov[0].iov_base = p->packet;
> > +}
> >
> >  ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
> >   &local_err);
> > --
> > 2.36.0
> >
>
> --
> Peter Xu
>




Re: [PATCH v9 7/7] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-04-26 Thread Leonardo Bras Soares Passos
Hello Peter, thanks for helping!

On Tue, Apr 26, 2022 at 1:02 PM Peter Xu  wrote:
>
> Leo,
>
> This patch looks mostly good to me, a few nitpicks below.
>
> On Mon, Apr 25, 2022 at 06:50:56PM -0300, Leonardo Bras wrote:
[...]
> >  }
> > +
> > +/*
> > + * When using zero-copy, it's necessary to flush after each iteration 
> > to
> > + * make sure pages from earlier iterations don't end up replacing newer
> > + * pages.
> > + */
> > +flush_zero_copy = migrate_use_zero_copy_send();
>
> Would you mind inline it if it's only used once?

It's not obvious in the diff, but this is used in a loop below, so I
introduced the variable to avoid calling migrate_use_zero_copy_send() for
each multifd channel.
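
i.e., roughly like this (a sketch of the hoisting, with error handling
trimmed; multifd_send_state and the helpers are the existing ones in
migration/multifd.c):

bool flush_zero_copy = migrate_use_zero_copy_send();
Error *local_err = NULL;
int i;

for (i = 0; i < migrate_multifd_channels(); i++) {
    MultiFDSendParams *p = &multifd_send_state->params[i];

    /* ... queue the sync packet for this channel ... */

    if (flush_zero_copy && qio_channel_flush(p->c, &local_err) < 0) {
        error_report_err(local_err);
        return -1;
    }
}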

>
> It's great to have that comment, but IMHO it could be more explicit, even
> marking a TODO showing that maybe we could do better in the future:
>
>   /*
>* When using zero-copy, it's necessary to flush the pages before any of
>* the pages can be sent again, so we'll make sure the new version of the
>* pages will always arrive _later_ than the old pages.
>*
>* Currently we achieve this by flushing the zero-page requested writes
>* per ram iteration, but in the future we could potentially optimize it
>* to be less frequent, e.g. only after we finished one whole scanning of
>* all the dirty bitmaps.
>*/
>

Thanks! I will insert that in the next version.

The thing here is that I was under the impression that an iteration was
equivalent to a whole scan of all the dirty bitmaps. I see now that that
may not be the case.

[...]
> > @@ -688,10 +708,9 @@ static void *multifd_send_thread(void *opaque)
> >  p->iov[0].iov_base = p->packet;
> >  }
> >
> > -ret = qio_channel_writev_all(p->c, p->iov + iov_offset,
> > - p->iovs_num - iov_offset,
> > - &local_err);
> > -
> > +ret = qio_channel_writev_full_all(p->c, p->iov + iov_offset,
> > +  p->iovs_num - iov_offset, 
> > NULL,
> > +  0, p->write_flags, 
> > &local_err);
>
> I kind of agree with Dan in previous patch - this iov_offset is confusing,
> better drop it.

Sure, fixed for v10.

>
[...]
> --
> Peter Xu
>

Best regards,
Leo




Re: [PATCH v9 6/7] multifd: Send header packet without flags if zero-copy-send is enabled

2022-04-26 Thread Leonardo Bras Soares Passos
Hello Daniel, thank you for the feedback!

On Tue, Apr 26, 2022 at 5:11 AM Daniel P. Berrangé  wrote:
>
> On Mon, Apr 25, 2022 at 06:50:55PM -0300, Leonardo Bras wrote:
> > Since d48c3a0445 ("multifd: Use a single writev on the send side"),
> > sending the header packet and the memory pages happens in the same
> > writev, which can potentially make the migration faster.
> >
> > Using channel-socket as example, this works well with the default copying
> > mechanism of sendmsg(), but with zero-copy-send=true, it will cause
> > the migration to often break.
> >
> > This happens because the header packet buffer gets reused quite often,
> > and there is a high chance that by the time the MSG_ZEROCOPY mechanism get
> > to send the buffer, it has already changed, sending the wrong data and
> > causing the migration to abort.
> >
> > It means that, as it is, the buffer for the header packet is not suitable
> > for sending with MSG_ZEROCOPY.
> >
> > In order to enable zero copy for multifd, send the header packet on an
> > individual write(), without any flags, and the remaining pages with a
> > writev(), as it was happening before. This only changes how a migration
> > with zero-copy-send=true works, not changing any current behavior for
> > migrations with zero-copy-send=false.
> >
> > Signed-off-by: Leonardo Bras 
> > ---
> >  migration/multifd.c | 29 ++---
> >  1 file changed, 26 insertions(+), 3 deletions(-)
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 15fb668e64..6c940aaa98 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -639,6 +639,8 @@ static void *multifd_send_thread(void *opaque)
> >  if (p->pending_job) {
> >  uint64_t packet_num = p->packet_num;
> >  uint32_t flags = p->flags;
> > +int iov_offset = 0;
> > +
>
> No need for this if you change:
>
> >  p->iovs_num = 1;
>
>if (!migrate_use_zero_copy_send()) {
>   p->iovs_num = 1;
>}
>

I understand the point now: setting p->iovs_num = 0 before
multifd_send_state->ops->send_prepare() causes p->iov[0] to be used for
pages instead of the header. I was not aware, so thanks for pointing that out!

But it's also necessary to have an else clause with p->iovs_num = 0, right?
It seems like the variable is not set anywhere else, so it would keep growing
after the second loop iteration, causing prepare() to access p->iov[]
out of bounds.

Am I missing something here?
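
Just so we are looking at the same thing, this is the shape I have in mind
(a minimal sketch, assuming your suggestion grows an else branch; not the
final patch):

    /* Sketch only: reset iovs_num at the start of every loop iteration */
    if (!migrate_use_zero_copy_send()) {
        p->iovs_num = 1;   /* reserve iov[0] for the packet header */
    } else {
        p->iovs_num = 0;   /* header sent separately; pages start at iov[0] */
    }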

>
> >  p->normal_num = 0;
> >
> > @@ -665,15 +667,36 @@ static void *multifd_send_thread(void *opaque)
> >  trace_multifd_send(p->id, packet_num, p->normal_num, flags,
> > p->next_packet_size);
> >
> > -p->iov[0].iov_len = p->packet_len;
> > -p->iov[0].iov_base = p->packet;
> > +if (migrate_use_zero_copy_send()) {
> > +/* Send header without zerocopy */
> > +ret = qio_channel_write_all(p->c, (void *)p->packet,
> > +p->packet_len, &local_err);
> > +if (ret != 0) {
> > +break;
> > +}
> > +
> > +if (!p->normal_num) {
> > +/* No pages will be sent */
> > +goto skip_send;
> > +}
>
> Don't need this AFAIK, because the qio_channel_writev_all
> call will be a no-op if  iovs_num is zero
>

Oh, I see:
qio_channel_writev_all() will call qio_channel_writev_full_all() where
niov == 0 and thus nlocal_iov == 0, avoiding the loop that calls
qio_channel_writev_full().

I will remove that in v10


> >
> > -ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
> > +/* Skip first iov : header */
> > +iov_offset = 1;
>
> Don't need to set this

Agree, that makes sense since the offset part is discontinued.

>
> > +} else {
> > +/* Send header using the same writev call */
> > +p->iov[0].iov_len = p->packet_len;
> > +p->iov[0].iov_base = p->packet;
> > +}
> > +
> > +ret = qio_channel_writev_all(p->c, p->iov + iov_offset,
> > + p->iovs_num - iov_offset,
> >   &local_err);
>
> This wouldn't need changing if we don't reserve iovs[0] when
> not required.

Agree.

>
> > +
> >  if (ret != 0) {
> >  break;
> >  }
> >
> > +skip_send:
> > >  qemu_mutex_lock(&p->mutex);
> > >  p->pending_job--;
> > >  qemu_mutex_unlock(&p->mutex);
> > --
> > 2.36.0
> >
>
> With regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>

I will probably send 

Re: [PATCH v9 5/7] multifd: multifd_send_sync_main now returns negative on error

2022-04-26 Thread Leonardo Bras Soares Passos
On Tue, Apr 26, 2022 at 9:59 AM Peter Xu  wrote:
>
> On Mon, Apr 25, 2022 at 06:50:54PM -0300, Leonardo Bras wrote:
> > Even though multifd_send_sync_main() currently emits error_reports, its
> > callers don't really check it before continuing.
> >
> > Change multifd_send_sync_main() to return -1 on error and 0 on success.
> > Also change all its callers to make use of this change and possibly fail
> > earlier.
> >
> > (This change is important to the next patch on multifd zero copy
> > implementation, to make sure an error in zero-copy flush does not go
> > unnoticed.)
> >
> > Signed-off-by: Leonardo Bras 
>
> Reviewed-by: Peter Xu 
>
> --
> Peter Xu
>

Thanks Peter!

Best regards,
Leo




Re: [PATCH v4 3/3] multifd: Implement zerocopy write in multifd migration (multifd-zerocopy)

2022-04-13 Thread Leonardo Bras Soares Passos
Hello Juan,

Sorry to go back that early in discussion, but I was reviewing for v9
and I am not sure If I am unable to recall the reason, or I missed an
argument here.
Could you please help me with this?

On Tue, Nov 2, 2021 at 9:32 AM Juan Quintela  wrote:
>
> Leonardo Bras  wrote:
> > Implement zerocopy on nocomp_send_write(), by making use of QIOChannel
> > zerocopy interface.
> >
> > Change multifd_send_sync_main() so it can distinguish the last sync from
> > the setup and per-iteration ones, so a flush_zerocopy() can be called
> > at the last sync in order to make sure all RAM is sent before finishing
> > the migration.
>
> You need to do this after each iteration.  Otherwise it can happen that:
>
> channel 1:                  channel 2:
>
> send page 11
>
> next iteration
>                             send page 11
>
>                             this page arrives
>
> now arrives this old copy.
>
> After each iteration, one needs to be sure that no ram is inflight.
>
> This means that I think you don't need the last_sync parameter at all,
> as you have to do the flush() in every iteration.

The flush command is used to guarantee every packet queued before
flush is actually sent before flush returns.
I mean, flushing every iteration will not help with the situation
above, where the pages are sent in order, but arrive at target in a
different order.

There is a chance that in the above text you meant 'send page' as
"queue page for sending", and 'page arrives' as "actually send the
queued page".
If that is correct, then syncing every iteration should not be necessary:
- On page queue, Linux saves the page address and size for sending
- On actual send, Linux transmits whatever data is currently in the page.

So, in this example, if page 11 from iteration 'i' happens to be
'actually sent' after page 11 from iteration 'i+1', it would not be an
issue:
###
channel 1:                  channel 2:
Iteration i

queue page 11 (i)

iteration i+1
                            queue page 11 (i+1)
                            actually send page 11 (i+1)

actually send page 11 (i)
###

That's because the data actually sent for page 11 (i) is read at transmit
time, so it is at least as new as what was sent for page 11 (i+1).

tl;dr:
- The page content always depends on the send time, instead of queue time.
- The iteration count describes the queue time.
(on non-zerocopy it's the opposite: it will depend on queue time,
because it copies the memory content during enqueue)
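
To illustrate outside of QEMU, a hypothetical MSG_ZEROCOPY snippet (sketch
only; assumes fd is a connected TCP socket with SO_ZEROCOPY enabled):

    char page[4096];
    struct iovec iov = { .iov_base = page, .iov_len = sizeof(page) };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    memset(page, 'A', sizeof(page));   /* iteration i content */
    sendmsg(fd, &msg, MSG_ZEROCOPY);   /* queued: kernel pins the page, no copy */
    memset(page, 'B', sizeof(page));   /* iteration i+1 dirties the page */
    sendmsg(fd, &msg, MSG_ZEROCOPY);   /* queued again */
    /* Whichever request the NIC transmits later reads the page at transmit
     * time, so both may end up carrying the newer 'B' content. */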

>
[...]

Juan, could you please help me understand if I am missing a part of
your argument up there?
Also, is syncing every iteration still necessary / recommended?

Best regards,
Leo




Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-03-07 Thread Leonardo Bras Soares Passos
On Tue, Mar 1, 2022 at 12:57 AM Peter Xu  wrote:
>
> On Fri, Feb 18, 2022 at 05:57:13PM +0100, Juan Quintela wrote:
> > I did a change on:
> >
> > commit d48c3a044537689866fe44e65d24c7d39a68868a
> > Author: Juan Quintela 
> > Date:   Fri Nov 19 15:35:58 2021 +0100
> >
> > multifd: Use a single writev on the send side
> >
> > Until now, we wrote the packet header with write(), and the rest of the
> > pages with writev().  Just increase the size of the iovec and do a
> > single writev().
> >
> > Signed-off-by: Juan Quintela 
> > Reviewed-by: Dr. David Alan Gilbert 
> >
> > And now we need to "perserve" this header until we do the sync,
> > otherwise we are overwritting it with other things.
> >
> > What testing have you done after this commit?
> >
> > Notice that it is not _complicated_ to fix it, I will try to come with
> > some idea on monday, but basically is having an array of buffers for
> > each thread, and store them until we call a sync().
>
> Or can we conditionally merge the two write()s?  IMHO the array of buffers
> idea sounds too complicated, and I'm not extremely sure whether it'll pay
> off at last.  We could keep the two write()s with ZEROCOPY enabled, and use
> the merged version otherwise.

I think that's a great idea!
It would optimize the non-zerocopy version while letting us have a
simpler zerocopy implementation.
The array of buffers implementation would either require us to keep a
'large' amount of memory for the headers, or make flushes happen too
often.
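
Something along these lines is what I picture for the conditional version
(a rough sketch using the existing multifd_send_thread() variables; the
real patch may look different):

    if (migrate_use_zero_copy_send()) {
        /* Header via a plain copying write(): the kernel copies
         * p->packet, so the buffer can be reused right away. */
        ret = qio_channel_write_all(p->c, (void *)p->packet,
                                    p->packet_len, &local_err);
        /* Pages then go out with the zero-copy writev(); those buffers
         * must stay untouched until the flush. */
    } else {
        /* Non-zero-copy: keep the single writev() from d48c3a044537,
         * with the header in iov[0]. */
        p->iov[0].iov_len = p->packet_len;
        p->iov[0].iov_base = p->packet;
    }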

>
> Btw, is there any performance measurements for above commit d48c3a044537?
> I had a feeling that the single write() may not help that much, because for
> multifd the bottleneck should be on the nic not on the processor.

I am quite curious about those numbers too.

>
> IOW, we could find that the major time used does not fall into the
> user<->kernel switches (which is where the extra overhead of the write()
> syscall lies, iiuc), but we simply block on any of the write()s because the
> socket write buffer is full...  So we could have saved some cpu cycles by
> merging the calls, but performance-wise we may not get much.
>
> Thanks,
>
> --
> Peter Xu
>

Thanks Peter!




Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-02-21 Thread Leonardo Bras Soares Passos
On Mon, Feb 21, 2022 at 4:41 PM Leonardo Bras Soares Passos
 wrote:
>
> Hello Juan, thanks for the feedback!
>
> On Fri, Feb 18, 2022 at 1:57 PM Juan Quintela  wrote:
> >
> > Leonardo Bras  wrote:
> > > Implement zero copy send on nocomp_send_write(), by making use of 
> > > QIOChannel
> > > writev + flags & flush interface.
> > >
> > > Change multifd_send_sync_main() so flush_zero_copy() can be called
> > > after each iteration in order to make sure all dirty pages are sent before
> > > a new iteration is started. It will also flush at the beginning and at the
> > > end of migration.
> > >
> > > Also make it return -1 if flush_zero_copy() fails, in order to cancel
> > > the migration process, and avoid resuming the guest in the target host
> > > without receiving all current RAM.
> > >
> > > This will work fine on RAM migration because the RAM pages are not 
> > > usually freed,
> > > and there is no problem on changing the pages content between 
> > > writev_zero_copy() and
> > > the actual sending of the buffer, because this change will dirty the page 
> > > and
> > > cause it to be re-sent on a next iteration anyway.
> > >
> > > A lot of locked memory may be needed in order to use multid migration
> >^^
> > multifd.
> >
> > I can fix it on the commit.
>
> No worries, fixed for v9.
>
> >
> >
> > > @@ -1479,7 +1479,16 @@ static bool migrate_params_check(MigrationParameters *params, Error **errp)
> > >  error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
> > >  return false;
> > >  }
> > > -
> > > +#ifdef CONFIG_LINUX
> > > +if (params->zero_copy_send &&
> > > +(!migrate_use_multifd() ||
> > > + params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
> > > + (params->tls_creds && *params->tls_creds))) {
> > > +error_setg(errp,
> > > +   "Zero copy only available for non-compressed non-TLS multifd migration");
> > > +return false;
> > > +}
> > > +#endif
> > >  return true;
> > >  }
> >
> > Test is long, but it is exactly what we need.  Good.
>
> Thanks!
>
>
> >
> >
> > >
> > > diff --git a/migration/multifd.c b/migration/multifd.c
> > > index 43998ad117..2d68b9cf4f 100644
> > > --- a/migration/multifd.c
> > > +++ b/migration/multifd.c
> > > @@ -568,19 +568,28 @@ void multifd_save_cleanup(void)
> > >  multifd_send_state = NULL;
> > >  }
> > >
> > > -void multifd_send_sync_main(QEMUFile *f)
> > > +int multifd_send_sync_main(QEMUFile *f)
> > >  {
> > >  int i;
> > > +bool flush_zero_copy;
> > >
> > >  if (!migrate_use_multifd()) {
> > > -return;
> > > +return 0;
> > >  }
> > >  if (multifd_send_state->pages->num) {
> > >  if (multifd_send_pages(f) < 0) {
> > >  error_report("%s: multifd_send_pages fail", __func__);
> > > -return;
> > > +return 0;
> > >  }
> > >  }
> > > +
> > > +/*
> > > + * When using zero-copy, it's necessary to flush after each iteration to
> > > + * make sure pages from earlier iterations don't end up replacing newer
> > > + * pages.
> > > + */
> > > +flush_zero_copy = migrate_use_zero_copy_send();
> > > +
> > >  for (i = 0; i < migrate_multifd_channels(); i++) {
> > >  MultiFDSendParams *p = &multifd_send_state->params[i];
> > >
> > > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> > >  if (p->quit) {
> > >  error_report("%s: channel %d has already quit", __func__, i);
> > >  qemu_mutex_unlock(&p->mutex);
> > > -return;
> > > +return 0;
> > >  }
> > >
> > >  p->packet_num = multifd_send_state->packet_num++;
> > > @@ -602,6 +611,17 @@ void multifd_send_sync_main(QEMUFile *f)
> > >  ram_counters.transferred += p->packet_len;
> > >  qemu_mutex_unlock(&p->mutex);

Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-02-21 Thread Leonardo Bras Soares Passos
On Fri, Feb 18, 2022 at 2:36 PM Juan Quintela  wrote:
>
> Leonardo Bras Soares Passos  wrote:
> > Hello Peter, thanks for reviewing!
> >
> > On Mon, Feb 7, 2022 at 11:22 PM Peter Xu  wrote:
> >>
> >> On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
> >> > -void multifd_send_sync_main(QEMUFile *f)
> >> > +int multifd_send_sync_main(QEMUFile *f)
> >> >  {
> >> >  int i;
> >> > +bool flush_zero_copy;
> >> >
> >> >  if (!migrate_use_multifd()) {
> >> > -return;
> >> > +return 0;
> >> >  }
> >> >  if (multifd_send_state->pages->num) {
> >> >  if (multifd_send_pages(f) < 0) {
> >> >  error_report("%s: multifd_send_pages fail", __func__);
> >> > -return;
> >> > +return 0;
> >>
> >> I've not checked how it used to do if multifd_send_pages() failed, but..
> >> should it return -1 rather than 0 now that there is a return code?
> >
> > Yeah, that makes sense.
> > The point here is that I was trying not to modify much of the current 
> > behavior.
>
> if (qatomic_read(&multifd_send_state->exiting)) {
> return -1;
> }
>
> if (p->quit) {
> error_report("%s: channel %d has already quit!", __func__, i);
> qemu_mutex_unlock(&p->mutex);
> return -1;
> }
>
> These are the only two cases where the current code can return an
> error.  In the 1st case we are exiting: we are already in the middle of
> finishing, so we don't really care.
> In the second one, we have already quit, and the error is already quite big.
>
> But I agree with both comments:
> - we need to improve the error paths
> - leonardo changes don't affect what is already there.
>



>
> > I mean, multifd_send_sync_main() would previously return void, so any
> > other errors would not matter to the caller of this function, which
> > will continue to run as if nothing happened.
> >
> > Now, if it fails with flush_zero_copy, the operation needs to be aborted.
> >
> > Maybe, I should make it different:
> > - In any error, return -1.
> > - Create/use a specific error code in the case of a failing
> > flush_zero_copy, so I can test the return value for it on the caller
> > function and return early.
>
> We need to add the check.  It don't matter if the problem is zero_copy
> or the existing one, we are under a minor catastrophe and migration has
> to be aborted.

Ok, I will fix that so we can abort in case of any error.
Maybe it's better to do that on a separate patch, before 5/5, right?
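
A rough sketch of what I have in mind for that preparatory patch (names
from the existing code; the final form may differ):

    int multifd_send_sync_main(QEMUFile *f)
    {
        if (!migrate_use_multifd()) {
            return 0;
        }
        if (multifd_send_state->pages->num && multifd_send_pages(f) < 0) {
            error_report("%s: multifd_send_pages fail", __func__);
            return -1;   /* was "return 0": propagate the failure */
        }
        /* ... and likewise return -1 instead of 0 in the p->quit case ... */
    }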

>
> > Or alternatively, the other errors could also return early, but since
> > this will change how the code currently works, I would probably need
> > another patch for that change. (so it can be easily reverted if
> > needed)
> >
> > What do you think is better?
> >
> >
> >> >  }
> >> >  }
> >> > +
> >> > +/*
> >> > + * When using zero-copy, it's necessary to flush after each iteration to
> >> > + * make sure pages from earlier iterations don't end up replacing newer
> >> > + * pages.
> >> > + */
> >> > +flush_zero_copy = migrate_use_zero_copy_send();
> >> > +
> >> >  for (i = 0; i < migrate_multifd_channels(); i++) {
> >> >  MultiFDSendParams *p = &multifd_send_state->params[i];
> >> >
> >> > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> >> >  if (p->quit) {
> >> >  error_report("%s: channel %d has already quit", __func__, 
> >> > i);
> >> >  qemu_mutex_unlock(>mutex);
> >> > -return;
> >> > +return 0;
> >>
> >> Same question here.
> >
> > Please see above,
> >
> >>
> >> >  }
> >>
> >> The rest looks good.  Thanks,
>
> Later, Juan.
>

Thanks for the feedback!

Best regards,
Leo




Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-02-21 Thread Leonardo Bras Soares Passos
Hello Juan, thanks for the feedback!

On Fri, Feb 18, 2022 at 1:57 PM Juan Quintela  wrote:
>
> Leonardo Bras  wrote:
> > Implement zero copy send on nocomp_send_write(), by making use of QIOChannel
> > writev + flags & flush interface.
> >
> > Change multifd_send_sync_main() so flush_zero_copy() can be called
> > after each iteration in order to make sure all dirty pages are sent before
> > a new iteration is started. It will also flush at the beginning and at the
> > end of migration.
> >
> > Also make it return -1 if flush_zero_copy() fails, in order to cancel
> > the migration process, and avoid resuming the guest in the target host
> > without receiving all current RAM.
> >
> > This will work fine on RAM migration because the RAM pages are not usually 
> > freed,
> > and there is no problem on changing the pages content between 
> > writev_zero_copy() and
> > the actual sending of the buffer, because this change will dirty the page 
> > and
> > cause it to be re-sent on a next iteration anyway.
> >
> > A lot of locked memory may be needed in order to use multid migration
>^^
> multifd.
>
> I can fix it on the commit.

No worries, fixed for v9.

>
>
> > @@ -1479,7 +1479,16 @@ static bool migrate_params_check(MigrationParameters *params, Error **errp)
> >  error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
> >  return false;
> >  }
> > -
> > +#ifdef CONFIG_LINUX
> > +if (params->zero_copy_send &&
> > +(!migrate_use_multifd() ||
> > + params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
> > + (params->tls_creds && *params->tls_creds))) {
> > +error_setg(errp,
> > +   "Zero copy only available for non-compressed non-TLS multifd migration");
> > +return false;
> > +}
> > +#endif
> >  return true;
> >  }
>
> Test is long, but it is exactly what we need.  Good.

Thanks!


>
>
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 43998ad117..2d68b9cf4f 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -568,19 +568,28 @@ void multifd_save_cleanup(void)
> >  multifd_send_state = NULL;
> >  }
> >
> > -void multifd_send_sync_main(QEMUFile *f)
> > +int multifd_send_sync_main(QEMUFile *f)
> >  {
> >  int i;
> > +bool flush_zero_copy;
> >
> >  if (!migrate_use_multifd()) {
> > -return;
> > +return 0;
> >  }
> >  if (multifd_send_state->pages->num) {
> >  if (multifd_send_pages(f) < 0) {
> >  error_report("%s: multifd_send_pages fail", __func__);
> > -return;
> > +return 0;
> >  }
> >  }
> > +
> > +/*
> > + * When using zero-copy, it's necessary to flush after each iteration to
> > + * make sure pages from earlier iterations don't end up replacing newer
> > + * pages.
> > + */
> > +flush_zero_copy = migrate_use_zero_copy_send();
> > +
> >  for (i = 0; i < migrate_multifd_channels(); i++) {
> >  MultiFDSendParams *p = &multifd_send_state->params[i];
> >
> > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> >  if (p->quit) {
> >  error_report("%s: channel %d has already quit", __func__, i);
> >  qemu_mutex_unlock(&p->mutex);
> > -return;
> > +return 0;
> >  }
> >
> >  p->packet_num = multifd_send_state->packet_num++;
> > @@ -602,6 +611,17 @@ void multifd_send_sync_main(QEMUFile *f)
> >  ram_counters.transferred += p->packet_len;
> >  qemu_mutex_unlock(&p->mutex);
> >  qemu_sem_post(&p->sem);
> > +
> > +if (flush_zero_copy) {
> > +int ret;
> > +Error *err = NULL;
> > +
> > +ret = qio_channel_flush(p->c, &err);
> > +if (ret < 0) {
> > +error_report_err(err);
> > +return -1;
> > +}
> > +}
> >  }
> >  for (i = 0; i < migrate_multifd_channels(); i++) {
> >  MultiFDSendParams *p = &multifd_send_state->params[i];
> > @@ -610,6 +630,8 @@ void multifd_send_sync_main(QEMUFile *f)
> >  qemu_sem_wait(&p->sem_sync);
> >  }
> >  trace_multifd_send_sync_main(multifd_send_state->packet_num);
> > +
> > +return 0;
> >  }
>
> We are leaving pages is flight for potentially a lot of time. I *think*
> that we can sync shorter than that.
>
> >  static void *multifd_send_thread(void *opaque)
> > @@ -668,8 +690,8 @@ static void *multifd_send_thread(void *opaque)
> >  p->iov[0].iov_len = p->packet_len;
> >  p->iov[0].iov_base = p->packet;
> >
> > -ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
> > - &local_err);
> > +ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
> > +  0, p->write_flags, &local_err);

Re: [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback

2022-02-21 Thread Leonardo Bras Soares Passos
Thanks for reviewing, Juan!

On Fri, Feb 18, 2022 at 1:36 PM Juan Quintela  wrote:
>
> Leonardo Bras  wrote:
> > Add flags to io_writev and introduce io_flush as optional callback to
> > QIOChannelClass, allowing the implementation of zero copy writes by
> > subclasses.
> >
> > How to use them:
> > - Write data using 
> > qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
> > - Wait write completion with qio_channel_flush().
> >
> > Notes:
> > As some zero copy write implementations work asynchronously, it's
> > recommended to keep the write buffer untouched until the return of
> > qio_channel_flush(), to avoid the risk of sending an updated buffer
> > instead of the buffer state during write.
> >
> > As io_flush callback is optional, if a subclass does not implement it, then:
> > - io_flush will return 0 without changing anything.
> >
> > Also, some functions like qio_channel_writev_full_all() were adapted to
> > receive a flag parameter. That allows shared code between zero copy and
> > non-zero copy writev, and also an easier implementation on new flags.
> >
> > Signed-off-by: Leonardo Bras 
>
> Reviewed-by: Juan Quintela 
>
> As everybody pointed out about the missing assertion...
>




Re: [PATCH v7 3/5] migration: Add zero-copy parameter for QMP/HMP for Linux

2022-02-21 Thread Leonardo Bras Soares Passos
Hello Juan,
Thanks for thew feedback!


On Fri, Feb 18, 2022 at 1:31 PM Juan Quintela  wrote:
>
> Leonardo Bras Soares Passos  wrote:
> > On Wed, Jan 19, 2022 at 3:16 PM Daniel P. Berrangé  
> > wrote:
> >>
> >> On Wed, Jan 19, 2022 at 03:03:29PM -0300, Leonardo Bras Soares Passos 
> >> wrote:
> >> > Hello Daniel,
> >> >
> >> > On Thu, Jan 13, 2022 at 10:10 AM Daniel P. Berrangé 
> >> >  wrote:
> >> > >
> >> > > On Thu, Jan 06, 2022 at 07:13:40PM -0300, Leonardo Bras wrote:
> >> > > > Add property that allows zero-copy migration of memory pages,
> >> > > > and also includes a helper function migrate_use_zero_copy() to check
> >> > > > if it's enabled.
> >> > > >
> >> > > > No code is introduced to actually do the migration, but it allow
> >> > > > future implementations to enable/disable this feature.
> >> > > >
> >> > > > On non-Linux builds this parameter is compiled-out.
> >> > > >
> >> > > > Signed-off-by: Leonardo Bras 
> >> > > > ---
> >> > > >  qapi/migration.json   | 24 
> >> > > >  migration/migration.h |  5 +
> >> > > >  migration/migration.c | 32 
> >> > > >  migration/socket.c|  5 +
> >> > > >  monitor/hmp-cmds.c|  6 ++
> >> > > >  5 files changed, 72 insertions(+)
> >> > >
> >> > > Reviewed-by: Daniel P. Berrangé 
> >> >
> >> > Thanks!
> >>
> >
> > Ok, I see the point.
> > I will try to refactor the code changing zero-copy to zero-copy-send
> > or something like that.
>
> Hi
>
> I am late to the party, but I agree with Dan that we need two flags.
>
> Thre reason is that you can be the target of one migration, and later be
> the source of a next one.  If we only have one flag that means different
> things on the source and destination side, things become really
> complicated.
>
> Later, Juan.
>

Yeah, that makes sense. :)

Best regards,
Leo




Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-02-07 Thread Leonardo Bras Soares Passos
Hello Peter, thanks for reviewing!

On Mon, Feb 7, 2022 at 11:22 PM Peter Xu  wrote:
>
> On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
> > -void multifd_send_sync_main(QEMUFile *f)
> > +int multifd_send_sync_main(QEMUFile *f)
> >  {
> >  int i;
> > +bool flush_zero_copy;
> >
> >  if (!migrate_use_multifd()) {
> > -return;
> > +return 0;
> >  }
> >  if (multifd_send_state->pages->num) {
> >  if (multifd_send_pages(f) < 0) {
> >  error_report("%s: multifd_send_pages fail", __func__);
> > -return;
> > +return 0;
>
> I've not checked how it used to do if multifd_send_pages() failed, but..
> should it return -1 rather than 0 now that there is a return code?

Yeah, that makes sense.
The point here is that I was trying not to modify much of the current behavior.

I mean, multifd_send_sync_main() would previously return void, so any
other errors would not matter to the caller of this function, which
will continue to run as if nothing happened.

Now, if it fails with flush_zero_copy, the operation needs to be aborted.

Maybe, I should make it different:
- In any error, return -1.
- Create/use a specific error code in the case of a failing
flush_zero_copy, so I can test the return value for it on the caller
function and return early.

Or alternatively, the other errors could also return early, but since
this will change how the code currently works, I would probably need
another patch for that change. (so it can be easily reverted if
needed)

What do you think is better?


> >  }
> >  }
> > +
> > +/*
> > + * When using zero-copy, it's necessary to flush after each iteration to
> > + * make sure pages from earlier iterations don't end up replacing newer
> > + * pages.
> > + */
> > +flush_zero_copy = migrate_use_zero_copy_send();
> > +
> >  for (i = 0; i < migrate_multifd_channels(); i++) {
> >  MultiFDSendParams *p = &multifd_send_state->params[i];
> >
> > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> >  if (p->quit) {
> >  error_report("%s: channel %d has already quit", __func__, i);
> >  qemu_mutex_unlock(&p->mutex);
> > -return;
> > +return 0;
>
> Same question here.

Please see above,

>
> >  }
>
> The rest looks good.  Thanks,

Thank you!

Best regards,
Leo




Re: [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback

2022-02-07 Thread Leonardo Bras Soares Passos
Hello Peter,

On Mon, Feb 7, 2022 at 9:50 AM Peter Xu  wrote:
>
> On Tue, Feb 01, 2022 at 03:28:59AM -0300, Leonardo Bras wrote:
> > Add flags to io_writev and introduce io_flush as optional callback to
> > QIOChannelClass, allowing the implementation of zero copy writes by
> > subclasses.
> >
> > How to use them:
> > - Write data using 
> > qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
> > - Wait write completion with qio_channel_flush().
> >
> > Notes:
> > As some zero copy write implementations work asynchronously, it's
> > recommended to keep the write buffer untouched until the return of
> > qio_channel_flush(), to avoid the risk of sending an updated buffer
> > instead of the buffer state during write.
> >
> > As io_flush callback is optional, if a subclass does not implement it, then:
> > - io_flush will return 0 without changing anything.
> >
> > Also, some functions like qio_channel_writev_full_all() were adapted to
> > receive a flag parameter. That allows shared code between zero copy and
> > non-zero copy writev, and also an easier implementation on new flags.
> >
> > Signed-off-by: Leonardo Bras 
>
> With Dan's comment addressed on removing the redundant assertion:
>
> Reviewed-by: Peter Xu 
>

Thank you for reviewing!

I think I am now missing reviewing only on patch 5/5 before sending
the next version.
Could you and/or Daniel help me with that? Just to check if I am
missing anything?

Best regards,
Leo




Re: [PATCH v1 1/1] target/i386: Mask xstate_bv based on the cpu enabled features

2022-02-05 Thread Leonardo Bras Soares Passos
Hello David, thank you for the feedback.

On Wed, Feb 2, 2022 at 12:47 PM David Edmondson
 wrote:
>
> On Tuesday, 2022-02-01 at 16:09:57 -03, Leonardo Brás wrote:
>
> > Hello David, thanks for this feedback!
> >
> > On Mon, 2022-01-31 at 12:53 +, David Edmondson wrote:
> >> On Saturday, 2022-01-29 at 06:46:45 -03, Leonardo Bras wrote:
> >>
> >> > The following steps describe a migration bug:
> >> > 1 - Bring up a VM with -cpu EPYC on a host with EPYC-Milan cpu
> >> > 2 - Migrate to a host with EPYC-Naples cpu
> >> >
> >> > The guest kernel crashes shortly after the migration.
> >> >
> >> > The crash happens due to a fault caused by XRSTOR:
> >> > A set bit in XSTATE_BV is not set in XCR0.
> >> > The faulting bit is FEATURE_PKRU (enabled in Milan, but not in
> >> > Naples)
> >>
> >> I'm trying to understand how this happens.
> >>
> >> If we boot on EPYC-Milan with "-cpu EPYC", the PKRU feature should
> >> not
> >> be exposed to the VM (it is not available in the EPYC CPU).
> >>
> >> Given this, how would bit 0x200 (representing PKRU) end up set in
> >> xstate_bv?
> >
> > During my debugging, I noticed this bit gets set before the kernel even
> > starts.
> >
> > It's possible SeaBIOS and/or iPXE are somehow setting 0x200 using the
> > XRSTOR instruction. I am not sure if qemu is able to stop this in KVM mode.
>
> I don't believe that this should be possible.
>
> If the CPU is set to EPYC in QEMU then .features[FEAT_7_0_ECX] does not
> include CPUID_7_0_ECX_PKU, which in turn means that when
> x86_cpu_enable_xsave_components() generates FEAT_XSAVE_COMP_LO it should
> not set XSTATE_PKRU_BIT.
>
> Given that, KVM's vcpu->arch.guest_supported_xcr0 will not include
> XSTATE_PKRU_BIT, and __kvm_set_xcr() should not allow that bit to be
> set when it intercepts the guest xsetbv instruction.

Thanks for sharing those details, it helped me on the kernel side of this bug.

FWIW, i did send a patchset fixing this bug to kernel list:
https://patchwork.kernel.org/project/kvm/list/?series=611524=%2A=both
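
For reference, the invariant the fix protects can be summarized like this
(a simplified illustration of the XRSTOR check, not actual QEMU code):

    /* XRSTOR faults if a component is flagged as present in XSTATE_BV
     * while the corresponding bit is clear in XCR0: */
    if (header->xstate_bv & ~env->xcr0) {
        /* The guest takes a #GP here, e.g. for PKRU (bit 9) after a
         * Milan -> Naples migration. */
    }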


Best regards,
Leo




Re: [PATCH v1 1/1] target/i386: Mask xstate_bv based on the cpu enabled features

2022-02-01 Thread Leonardo Bras Soares Passos
Hello David Edmondson and Igor Memmedov,

Thank you for the feedback!

For some reason I did not get your comments in my email.
I only noticed them when I opened Patchwork to get the link.

Sorry for the delay. I will do my best to address them in a few minutes.

Best regards,
Leo

On Sat, Jan 29, 2022 at 6:47 AM Leonardo Bras  wrote:
>
> The following steps describe a migration bug:
> 1 - Bring up a VM with -cpu EPYC on a host with EPYC-Milan cpu
> 2 - Migrate to a host with EPYC-Naples cpu
>
> The guest kernel crashes shortly after the migration.
>
> The crash happens due to a fault caused by XRSTOR:
> A set bit in XSTATE_BV is not set in XCR0.
> The faulting bit is FEATURE_PKRU (enabled in Milan, but not in Naples)
>
> To avoid this kind of bug:
> In kvm_get_xsave, mask-out from xstate_bv any bits that are not set in
> current vcpu's features.
>
> This keeps cpu->env->xstate_bv with feature bits compatible with any
> host machine capable of running the vcpu model.
>
> Signed-off-by: Leonardo Bras 
> ---
>  target/i386/xsave_helper.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/target/i386/xsave_helper.c b/target/i386/xsave_helper.c
> index ac61a96344..0628226234 100644
> --- a/target/i386/xsave_helper.c
> +++ b/target/i386/xsave_helper.c
> @@ -167,7 +167,7 @@ void x86_cpu_xrstor_all_areas(X86CPU *cpu, const void 
> *buf, uint32_t buflen)
>  env->xmm_regs[i].ZMM_Q(1) = ldq_p(xmm + 8);
>  }
>
> -env->xstate_bv = header->xstate_bv;
> +env->xstate_bv = header->xstate_bv & env->features[FEAT_XSAVE_COMP_LO];
>
> >  e = &x86_ext_save_areas[XSTATE_YMM_BIT];
>  if (e->size && e->offset) {
> --
> 2.34.1
>




Re: [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback

2022-02-01 Thread Leonardo Bras Soares Passos
Hello Daniel, thanks for reviewing!

On Tue, Feb 1, 2022 at 6:35 AM Daniel P. Berrangé  wrote:
>
> On Tue, Feb 01, 2022 at 03:28:59AM -0300, Leonardo Bras wrote:
> > Add flags to io_writev and introduce io_flush as optional callback to
> > QIOChannelClass, allowing the implementation of zero copy writes by
> > subclasses.
> >
> > How to use them:
> > - Write data using 
> > qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
> > - Wait write completion with qio_channel_flush().
> >
> > Notes:
> > As some zero copy write implementations work asynchronously, it's
> > recommended to keep the write buffer untouched until the return of
> > qio_channel_flush(), to avoid the risk of sending an updated buffer
> > instead of the buffer state during write.
> >
> > As io_flush callback is optional, if a subclass does not implement it, then:
> > - io_flush will return 0 without changing anything.
> >
> > Also, some functions like qio_channel_writev_full_all() were adapted to
> > receive a flag parameter. That allows shared code between zero copy and
> > non-zero copy writev, and also an easier implementation on new flags.
> >
> > Signed-off-by: Leonardo Bras 
> > ---
> >  include/io/channel.h| 38 -
> >  chardev/char-io.c   |  2 +-
> >  hw/remote/mpqemu-link.c |  2 +-
> >  io/channel-buffer.c |  1 +
> >  io/channel-command.c|  1 +
> >  io/channel-file.c   |  1 +
> >  io/channel-socket.c |  2 ++
> >  io/channel-tls.c|  1 +
> >  io/channel-websock.c|  1 +
> >  io/channel.c| 53 +++--
> >  migration/rdma.c|  1 +
> >  scsi/pr-manager-helper.c|  2 +-
> >  tests/unit/test-io-channel-socket.c |  1 +
> >  13 files changed, 92 insertions(+), 14 deletions(-)
> >
> > diff --git a/io/channel.c b/io/channel.c
> > index e8b019dc36..b8b99fdc4c 100644
> > --- a/io/channel.c
> > +++ b/io/channel.c
> > @@ -72,18 +72,32 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
> >  size_t niov,
> >  int *fds,
> >  size_t nfds,
> > +int flags,
> >  Error **errp)
> >  {
> >  QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
> >
> > -if ((fds || nfds) &&
> > -!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {
> > +if (fds || nfds) {
> > +if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {
> > +error_setg_errno(errp, EINVAL,
> > + "Channel does not support file descriptor 
> > passing");
> > +return -1;
> > +}
> > +if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > +error_setg_errno(errp, EINVAL,
> > + "Zero Copy does not support file descriptor 
> > passing");
> > +return -1;
> > +}
>
> Here you gracefully reject FD passing when zero copy is requested
> which is good.
>
> > +}
> > +
>
> > @@ -235,10 +249,16 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
> >iov, niov,
> >0, iov_size(iov, niov));
> >
> > +if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > +assert(fds == NULL && nfds == 0);
> > +}
>
> But here you  abort QEMU if FD passing is requested when zero copy
> is set.
>
> AFAICT, if you just delete this assert, the code to gracefully
> report errors will do the right thing.

Yeah, that's right. This test is unnecessary, since qio_channel_writev_full()
will be called and will return an error if fds + zerocopy happens.

Good catch!

>
> Without the assert:
>
>   Reviewed-by: Daniel P. Berrangé 
>

Thanks!
I will wait for more feedback on other patches before sending the v9,
but it should not take too long this time.

Best regards,
Leo




Re: [PATCH v7 2/5] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-01-31 Thread Leonardo Bras Soares Passos
Hello Peter,

Re-reading everything before submitting the next version.
I think I finally got that you are suggesting to just add a break at
the end of the case, after the if :)
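
So the error path would end up looking like this (a sketch of what I
understand to be the final form):

    case ENOBUFS:
        if (sflags & MSG_ZEROCOPY) {
            error_setg_errno(errp, errno,
                             "Process can't lock enough memory for "
                             "using MSG_ZEROCOPY");
            return -1;
        }
        break;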

Sorry I misunderstood that before,

Best regards,
Leo

On Thu, Jan 13, 2022 at 3:48 AM Peter Xu  wrote:
>
> On Thu, Jan 06, 2022 at 07:13:39PM -0300, Leonardo Bras wrote:
> > @@ -558,15 +575,26 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
> > *ioc,
> >  memcpy(CMSG_DATA(cmsg), fds, fdsize);
> >  }
> >
> > +if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > +sflags = MSG_ZEROCOPY;
> > +}
> > +
> >   retry:
> > -ret = sendmsg(sioc->fd, , 0);
> > +ret = sendmsg(sioc->fd, , sflags);
> >  if (ret <= 0) {
> > -if (errno == EAGAIN) {
> > +switch (errno) {
> > +case EAGAIN:
> >  return QIO_CHANNEL_ERR_BLOCK;
> > -}
> > -if (errno == EINTR) {
> > +case EINTR:
> >  goto retry;
> > +case ENOBUFS:
> > +if (sflags & MSG_ZEROCOPY) {
> > +error_setg_errno(errp, errno,
> > + "Process can't lock enough memory for 
> > using MSG_ZEROCOPY");
> > +return -1;
> > +}
>
> I have no idea whether it'll make a real difference, but - should we
> better add a "break" here?  If you agree and with that fixed, feel free to add:
>
> Reviewed-by: Peter Xu 
>
> I also wonder whether you hit ENOBUFS in any of the environments.  On Fedora
> here it's by default unlimited, but just curious when we should keep an eye.
>
> Thanks,
>
> --
> Peter Xu
>




Re: [RFC PATCH 1/1] i386: Remove features from Epyc-Milan cpu

2022-01-31 Thread Leonardo Bras Soares Passos
On Mon, Jan 31, 2022 at 3:04 PM Daniel P. Berrangé  wrote:
>
> On Mon, Jan 31, 2022 at 02:56:38PM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Daniel,
> >
> > On Mon, Jan 31, 2022 at 6:08 AM Daniel P. Berrangé  
> > wrote:
> > >
> > > CC'ing  Babu Moger who aded the Milan CPU model.
> > >
> > > On Sat, Jan 29, 2022 at 07:23:37AM -0300, Leonardo Bras wrote:
> > > > While trying to bring a VM with EPYC-Milan cpu on a host with
> > > > EPYC-Milan cpu (EPYC 7313), the following warning can be seen:
> > > >
> > > > qemu-system-x86_64: warning: host doesn't support requested feature: 
> > > > CPUID.07H:EBX.erms [bit 9]
> > > > qemu-system-x86_64: warning: host doesn't support requested feature: 
> > > > CPUID.07H:EDX.fsrm [bit 4]
> > > >
> > > > Even with this warning, the host goes up.
> > > >
> > > > Then, grep'ing cpuid output on both guest and host, outputs:
> > > >
> > > > extended feature flags (7):
> > > >   enhanced REP MOVSB/STOSB = false
> > > >   fast short REP MOV   = false
> > > >   (simple synth)  = AMD EPYC (3rd Gen) (Milan B1) [Zen 3], 7nm
> > > >brand = "AMD EPYC 7313 16-Core Processor   "
> > > >
> > > > This means that for the same -cpu model (EPYC-Milan), the vcpu may or 
> > > > may
> > > > not have the above feature bits set, which is usually not a good idea 
> > > > for
> > > > live migration:
> > > > Migrating from a host with these features to a host without them can
> > > > be troublesome for the guest.
> > > >
> > > > Remove the "optional" features (erms, fsrm) from Epyc-Milan, in order to
> > > > avoid possible after-migration guest issues.
> > >
> > > Babu,  can you give some insight into availability of erms / fsrm
> > > features across the EPYC 3rd gen CPU line. Is this example missing
> > > erms/fsrm an exception, or common place ?
> > >
> > > >
> > > > Signed-off-by: Leonardo Bras 
> > > > ---
> > > >
> > > > Does this make sense? Or maybe I am missing something here.
> > > >
> > > > Having a kvm guest running with a feature bit, while the host
> > > > does not support it, seems like it could possibly break the guest.
> > >
> > > The guest won't see the feature bit - that warning message from QEMU
> > > is telling you that it did't honour the request to expose
> > > erms / fsrm - it has dropped them from the CPUO exposed to the guest.
> >
> > Exactly.
> > What I meant here is:
> > 1 - Host with these feature bits start a VM with EPYC-Milan cpu (and
> > thus have those bits enabled)
> > 2 - Guest is migrated to a host such as the above, which does not
> > support those features (bits disabled), but does support EPYC-Milan
> > cpus (without those features).
> > 3 - The migration should be allowed, given the same cpu types. Then
> > either we have:
> > 3a : The guest vcpu stays with the flag enabled (case I tried to
> > explain above), possibly crashing if the new feature is used, or
> > 3b: The guest vcpu disables the flag due to incompatibility, which
> > may make the guest confused due to the cpu change, and even end up trying
> > to use the new feature on the guest, even if it's disabled.
>
> Neither should happen with a correctly written mgmt app in charge.
>
> When launching a QEMU process for an incoming migration, it is expected
> that the mgmt app has first queried QEMU on the source to get the precise
> CPU model + flags that were added/removed on the source. The QEMU on
> the target is then launched with this exact set of flags, and the
> 'check' flag is also set for -cpu. That will cause QEMU on the target
> to refuse to start unless it can give the guest the 100% identical
> CPUID to what has been requested on the CLI, and thus matching the
> source.
>
> Libvirt will ensure all this is done correctly. If not using libvirt
> then you've got a bunch of work to do to achieve this. It certainly
> isn't sufficient to merely use the same plain '-cpu' arg that the
> soruce was original booted with, unless you have 100% identical
> hardware, microcode, and software on both hosts, or the target host
> offers a superset of features.

Oh, that is very interesting! Thanks for sharing!

Well, then at least one unexpected scenario should happen:
- VM with EPYC-Milan cpu, created

Re: [RFC PATCH 1/1] i386: Remove features from Epyc-Milan cpu

2022-01-31 Thread Leonardo Bras Soares Passos
Hello Daniel,

On Mon, Jan 31, 2022 at 6:08 AM Daniel P. Berrangé  wrote:
>
> CC'ing  Babu Moger who aded the Milan CPU model.
>
> On Sat, Jan 29, 2022 at 07:23:37AM -0300, Leonardo Bras wrote:
> > While trying to bring a VM with EPYC-Milan cpu on a host with
> > EPYC-Milan cpu (EPYC 7313), the following warning can be seen:
> >
> > qemu-system-x86_64: warning: host doesn't support requested feature: 
> > CPUID.07H:EBX.erms [bit 9]
> > qemu-system-x86_64: warning: host doesn't support requested feature: 
> > CPUID.07H:EDX.fsrm [bit 4]
> >
> > Even with this warning, the host goes up.
> >
> > Then, grep'ing cpuid output on both guest and host, outputs:
> >
> > extended feature flags (7):
> >   enhanced REP MOVSB/STOSB = false
> >   fast short REP MOV   = false
> >   (simple synth)  = AMD EPYC (3rd Gen) (Milan B1) [Zen 3], 7nm
> >brand = "AMD EPYC 7313 16-Core Processor   "
> >
> > This means that for the same -cpu model (EPYC-Milan), the vcpu may or may
> > not have the above feature bits set, which is usually not a good idea for
> > live migration:
> > Migrating from a host with these features to a host without them can
> > be troublesome for the guest.
> >
> > Remove the "optional" features (erms, fsrm) from Epyc-Milan, in order to
> > avoid possible after-migration guest issues.
>
> Babu,  can you give some insight into availability of erms / fsrm
> features across the EPYC 3rd gen CPU line. Is this example missing
> erms/fsrm an exception, or common place ?
>
> >
> > Signed-off-by: Leonardo Bras 
> > ---
> >
> > Does this make sense? Or maybe I am missing something here.
> >
> > Having a kvm guest running with a feature bit, while the host
does not support it, seems like it could possibly break the guest.
>
> The guest won't see the feature bit - that warning message from QEMU
> is telling you that it did't honour the request to expose
> erms / fsrm - it has dropped them from the CPUO exposed to the guest.

Exactly.
What I meant here is:
1 - Host with these feature bits start a VM with EPYC-Milan cpu (and
thus have those bits enabled)
2 - Guest is migrated to a host such as the above, which does not
support those features (bits disabled), but does support EPYC-Milan
cpus (without those features).
3 - The migration should be allowed, given the same cpu types. Then
either we have:
3a : The guest vcpu stays with the flag enabled (case I tried to
explain above), possibly crashing if the new feature is used, or
3b: The guest vcpu disables the flag due to incompatibility, which
may make the guest confused due to the cpu change, and even end up trying
to use the new feature on the guest, even if it's disabled.
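
As an illustration of 3a: a guest typically probes and caches feature bits
once at boot, so a bit that silently disappears after migration goes
unnoticed (a hypothetical probe using GCC's cpuid.h, not guest kernel code):

    #include <cpuid.h>
    #include <stdbool.h>

    static bool cpu_has_erms(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
            return false;
        }
        return ebx & (1u << 9);   /* CPUID.07H:EBX.erms */
    }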



>
> >
> >
> >  target/i386/cpu.c | 5 +
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> >
> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> > index aa9e636800..a4bbd38ed0 100644
> > --- a/target/i386/cpu.c
> > +++ b/target/i386/cpu.c
> > @@ -4160,12 +4160,9 @@ static const X86CPUDefinition builtin_x86_defs[] = {
> >  CPUID_7_0_EBX_FSGSBASE | CPUID_7_0_EBX_BMI1 | CPUID_7_0_EBX_AVX2 |
> >  CPUID_7_0_EBX_SMEP | CPUID_7_0_EBX_BMI2 | CPUID_7_0_EBX_RDSEED |
> >  CPUID_7_0_EBX_ADX | CPUID_7_0_EBX_SMAP | CPUID_7_0_EBX_CLFLUSHOPT |
> > -CPUID_7_0_EBX_SHA_NI | CPUID_7_0_EBX_CLWB | CPUID_7_0_EBX_ERMS |
> > -CPUID_7_0_EBX_INVPCID,
> > +CPUID_7_0_EBX_SHA_NI | CPUID_7_0_EBX_CLWB | CPUID_7_0_EBX_INVPCID,
> >  .features[FEAT_7_0_ECX] =
> >  CPUID_7_0_ECX_UMIP | CPUID_7_0_ECX_RDPID | CPUID_7_0_ECX_PKU,
> > -.features[FEAT_7_0_EDX] =
> > -CPUID_7_0_EDX_FSRM,
> >  .features[FEAT_XSAVE] =
> >  CPUID_XSAVE_XSAVEOPT | CPUID_XSAVE_XSAVEC |
> >  CPUID_XSAVE_XGETBV1 | CPUID_XSAVE_XSAVES,
> > --
> > 2.34.1
> >
> >
>
> Regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>




Re: [PATCH v7 3/5] migration: Add zero-copy parameter for QMP/HMP for Linux

2022-01-19 Thread Leonardo Bras Soares Passos
On Wed, Jan 19, 2022 at 3:16 PM Daniel P. Berrangé  wrote:
>
> On Wed, Jan 19, 2022 at 03:03:29PM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Daniel,
> >
> > On Thu, Jan 13, 2022 at 10:10 AM Daniel P. Berrangé  
> > wrote:
> > >
> > > On Thu, Jan 06, 2022 at 07:13:40PM -0300, Leonardo Bras wrote:
> > > > Add property that allows zero-copy migration of memory pages,
> > > > and also includes a helper function migrate_use_zero_copy() to check
> > > > if it's enabled.
> > > >
> > > > No code is introduced to actually do the migration, but it allow
> > > > future implementations to enable/disable this feature.
> > > >
> > > > On non-Linux builds this parameter is compiled-out.
> > > >
> > > > Signed-off-by: Leonardo Bras 
> > > > ---
> > > >  qapi/migration.json   | 24 
> > > >  migration/migration.h |  5 +
> > > >  migration/migration.c | 32 
> > > >  migration/socket.c|  5 +
> > > >  monitor/hmp-cmds.c|  6 ++
> > > >  5 files changed, 72 insertions(+)
> > >
> > > Reviewed-by: Daniel P. Berrangé 
> >
> > Thanks!
> >
> > >
> > > >
> > > > diff --git a/qapi/migration.json b/qapi/migration.json
> > > > index bbfd48cf0b..2e62ea6ebd 100644
> > > > --- a/qapi/migration.json
> > > > +++ b/qapi/migration.json
> > > > @@ -730,6 +730,13 @@
> > > >  #  will consume more CPU.
> > > >  #  Defaults to 1. (Since 5.0)
> > > >  #
> > > > +# @zero-copy: Controls behavior on sending memory pages on migration.
> > > > +# When true, enables a zero-copy mechanism for sending 
> > > > memory
> > > > +# pages, if host supports it.
> > > > +# Requires that QEMU be permitted to use locked memory for 
> > > > guest
> > > > +# RAM pages.
> > > > +# Defaults to false. (Since 7.0)
> > > > +#
> > > >  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
> > > >  #aliases for the purpose of dirty bitmap 
> > > > migration.  Such
> > > >  #aliases may for example be the corresponding 
> > > > names on the
> > > > @@ -769,6 +776,7 @@
> > > > 'xbzrle-cache-size', 'max-postcopy-bandwidth',
> > > > 'max-cpu-throttle', 'multifd-compression',
> > > > 'multifd-zlib-level' ,'multifd-zstd-level',
> > > > +   { 'name': 'zero-copy', 'if' : 'CONFIG_LINUX'},
> > > > 'block-bitmap-mapping' ] }
> > > >
> > > >  ##
> > > > @@ -895,6 +903,13 @@
> > > >  #  will consume more CPU.
> > > >  #  Defaults to 1. (Since 5.0)
> > > >  #
> > > > +# @zero-copy: Controls behavior on sending memory pages on migration.
> > > > +# When true, enables a zero-copy mechanism for sending 
> > > > memory
> > > > +# pages, if host supports it.
> > > > +# Requires that QEMU be permitted to use locked memory for 
> > > > guest
> > > > +# RAM pages.
> > > > +# Defaults to false. (Since 7.0)
> > > > +#
> > > >  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
> > > >  #aliases for the purpose of dirty bitmap 
> > > > migration.  Such
> > > >  #aliases may for example be the corresponding 
> > > > names on the
> > > > @@ -949,6 +964,7 @@
> > > >  '*multifd-compression': 'MultiFDCompression',
> > > >  '*multifd-zlib-level': 'uint8',
> > > >  '*multifd-zstd-level': 'uint8',
> > > > +'*zero-copy': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
> > > >  '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
> > >
> > > The current zerocopy impl is for the send path.
> > >
> > > Do you expect we might get zerocopy in the receive path
> > > later ?
> >
> > It's possible, but I haven't started the implementation yet.
> >
> > >
> > > If so then either call this 'send-zero-copy', or change it
> > > from a bool to an enum taking '["send", "recv", "both"]'.

Re: [PATCH v7 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback

2022-01-19 Thread Leonardo Bras Soares Passos
On Tue, Jan 18, 2022 at 10:58 PM Peter Xu  wrote:
>
> On Tue, Jan 18, 2022 at 05:45:09PM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Peter,
> >
> > On Thu, Jan 13, 2022 at 3:28 AM Peter Xu  wrote:
> > >
> > > On Thu, Jan 06, 2022 at 07:13:38PM -0300, Leonardo Bras wrote:
> > > > diff --git a/io/channel.c b/io/channel.c
> > > > index e8b019dc36..904855e16e 100644
> > > > --- a/io/channel.c
> > > > +++ b/io/channel.c
> > > > @@ -67,12 +67,13 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc,
> > > >  }
> > > >
> > > >
> > > > -ssize_t qio_channel_writev_full(QIOChannel *ioc,
> > > > -const struct iovec *iov,
> > > > -size_t niov,
> > > > -int *fds,
> > > > -size_t nfds,
> > > > -Error **errp)
> > > > +ssize_t qio_channel_writev_full_flags(QIOChannel *ioc,
> > > > +  const struct iovec *iov,
> > > > +  size_t niov,
> > > > +  int *fds,
> > > > +  size_t nfds,
> > > > +  int flags,
> > > > +  Error **errp)
> > > >  {
> > > >  QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
> > > >
> > > > @@ -83,7 +84,7 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
> > > >  return -1;
> > > >  }
> > >
> > > Should we better also check QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY here when
> > > QIO_CHANNEL_WRITE_FLAG_ZERO_COPY is set?  Just like what we do with:
> >
> > Yes, that's correct.
> > I will also test for fds + zerocopy_flag , which should also fail here.
> >
> > >
> > > if ((fds || nfds) &&
> > > !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {
> > > error_setg_errno(errp, EINVAL,
> > >  "Channel does not support file descriptor 
> > > passing");
> > > return -1;
> > > }
> > >
> > > I still think it's better to have the caller be crystal clear when to use
> > > zero_copy feature because it has implication on buffer lifetime.
> >
> > I don't disagree with that suggestion.
> >
> > But the buffer lifetime limitation is something on the socket
> > implementation, right?
> > There could be some synchronous zerocopy implementation that does not
> > require flush, and thus doesn't require the buffer to be treated
> > specially. Or am I missing something?
>
> Currently the flush() is required for zerocopy and not required for all the
> existing non-zerocopy use cases, that's already an API difference so the 
> caller
> needs to identify it anyway.  Then I think it's simpler we expose all of it to
> the user.

Yeah, I agree.
Since one ZC implementation uses flush, all should use it, even if
it's a no-op.
It was just an observation that not all ZC implementations have buffer
limitations, but I agree the user should expect them anyway, since
they will exist in some implementations.
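
For reference, the optional-callback behavior described in the commit
message could look roughly like this (a sketch, assuming the feature bit
from the earlier patches):

    int qio_channel_flush(QIOChannel *ioc, Error **errp)
    {
        QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);

        if (!klass->io_flush ||
            !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
            return 0;   /* no callback or no zero copy: flush is a no-op */
        }
        return klass->io_flush(ioc, errp);
    }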

>
> Not to mention IIUC if we don't fail here, it will just fail later when the
> code will unconditionally convert the flags=ZEROCOPY into MSG_ZEROCOPY in your
> next patch:
>
> if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> sflags = MSG_ZEROCOPY;
> }
>

Correct.

> So AFAIU it'll fail anyway, either here with the cap check I mentioned, or
> later in sendmsg().
>
> IOW, I think it fails cleaner here, rather than reaching sendmsg().

I Agree.

>
> >
> > >
> > > I might have commented similar things before, but I have missed a few 
> > > versions
> > > so I could also have missed some previous discussions..
> > >
> >
> > That's all great suggestions Peter!  Thanks for that!
> >
> > Some of the previous suggestions may have been missed because a lot of
> > code moved.
> > Sorry about that.
>
> Not a problem at all, I just want to make sure my question still makes
> sense. :)

Thanks for asking them!

>
> --
> Peter Xu
>

Best regards,
Leo




Re: [PATCH v7 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-01-19 Thread Leonardo Bras Soares Passos
Hello Peter,

On Thu, Jan 13, 2022 at 4:15 AM Peter Xu  wrote:
>
> On Thu, Jan 06, 2022 at 07:13:42PM -0300, Leonardo Bras wrote:
> > Implement zero copy on nocomp_send_write(), by making use of QIOChannel
> > writev + flags & flush interface.
> >
> > Change multifd_send_sync_main() so it can distinguish each iteration sync 
> > from
> > the setup and the completion, so a flush_zero_copy() can be called
> > after each iteration in order to make sure all dirty pages are sent
> > before a new iteration is started.
>
> Leo - could you remind me (since I remembered we've discussed something
> similar) on why we can't simply do the sync() unconditionally for zero copy?

In previous implementations, it would get stuck during setup, since it
was waiting for any movement on the error queue before even starting
the sending process.
At the time we would sync only at 'complete', so it would only need to
run once. Running every iteration seemed a waste at the time.

Then, after some talk with Juan, it was decided to sync after each
iteration, so the sync on 'complete' became unnecessary.
But sure, now it would add just 2 syncs in the whole migration, and
those should not even get to the syscall due to queued/sent counters.
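
The short-circuit I mean is something like this in the socket flush
implementation (a sketch, assuming the queued/sent counters from the
QIOChannelSocket patch):

    static int qio_channel_socket_flush(QIOChannel *ioc, Error **errp)
    {
        QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);

        if (sioc->zero_copy_queued == sioc->zero_copy_sent) {
            return 0;   /* nothing in flight: no error-queue recvmsg() needed */
        }
        /* ... otherwise poll MSG_ERRQUEUE until completions catch up ... */
    }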

>
> I remember why we need the sync(), but I can't remember what's the matter if 
> we
> simply sync() too during setup and complete of migration.
>



> Another trivial nit here:
>
> > -void multifd_send_sync_main(QEMUFile *f)
> > +int multifd_send_sync_main(QEMUFile *f, bool sync)
>
> I'd name it "bool full" or anything not called "sync", because the function
> already has a name that contains "sync", then it's werid to sync(sync==false).
>

Yeah, I agree.
But if we flush every time, then there is no need for such a parameter :).

> The rest looks good to me.  Thanks.
>

Thanks!

> --
> Peter Xu
>

Best regards,
Leo




Re: [PATCH v7 4/5] migration: Add migrate_use_tls() helper

2022-01-19 Thread Leonardo Bras Soares Passos
Hello Daniel,

On Thu, Jan 13, 2022 at 10:11 AM Daniel P. Berrangé  wrote:
>
> On Thu, Jan 06, 2022 at 07:13:41PM -0300, Leonardo Bras wrote:
> > A lot of places check parameters.tls_creds in order to evaluate if TLS is
> > in use, and sometimes call migrate_get_current() just for that test.
> >
> > Add new helper function migrate_use_tls() in order to simplify testing
> > for TLS usage.
> >
> > Signed-off-by: Leonardo Bras 
> > Reviewed-by: Juan Quintela 
> > ---
> >  migration/migration.h | 1 +
> >  migration/channel.c   | 6 +++---
> >  migration/migration.c | 9 +
> >  migration/multifd.c   | 5 +
> >  4 files changed, 14 insertions(+), 7 deletions(-)
>
> Reviewed-by: Daniel P. Berrangé 
>

Thanks!

> Regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>

Best regards,
Leo




Re: [PATCH v7 4/5] migration: Add migrate_use_tls() helper

2022-01-19 Thread Leonardo Bras Soares Passos
Hello Peter,

On Thu, Jan 13, 2022 at 4:02 AM Peter Xu  wrote:
>
> On Thu, Jan 06, 2022 at 07:13:41PM -0300, Leonardo Bras wrote:
> >  void migration_channel_process_incoming(QIOChannel *ioc)
> >  {
> > -MigrationState *s = migrate_get_current();
> >  Error *local_err = NULL;
> >
> >  trace_migration_set_incoming_channel(
> >  ioc, object_get_typename(OBJECT(ioc)));
> >
> > -if (s->parameters.tls_creds &&
> > -*s->parameters.tls_creds &&
> > +if (migrate_use_tls() &&
> >  !object_dynamic_cast(OBJECT(ioc),
> >   TYPE_QIO_CHANNEL_TLS)) {
> > +MigrationState *s = migrate_get_current();
> > +
>
> Trivial nit: I'd rather keep the line there; as the movement offers nothing,
> imho..

The idea of moving 's' inside the if block is to make it clear
it's only used in this case.

But if you think it's better to keep it at the beginning of the
function, sure, I can change that.
Just let me know.

>
> >  migration_tls_channel_process_incoming(s, ioc, _err);
> >  } else {
> >  migration_ioc_register_yank(ioc);
>
> Reviewed-by: Peter Xu 
>

Thanks!

> --
> Peter Xu
>

Best regards,
Leo




Re: [PATCH v7 3/5] migration: Add zero-copy parameter for QMP/HMP for Linux

2022-01-19 Thread Leonardo Bras Soares Passos
Hello Daniel,

On Thu, Jan 13, 2022 at 10:10 AM Daniel P. Berrangé  wrote:
>
> On Thu, Jan 06, 2022 at 07:13:40PM -0300, Leonardo Bras wrote:
> > Add property that allows zero-copy migration of memory pages,
> > and also includes a helper function migrate_use_zero_copy() to check
> > if it's enabled.
> >
> > No code is introduced to actually do the migration, but it allow
> > future implementations to enable/disable this feature.
> >
> > On non-Linux builds this parameter is compiled-out.
> >
> > Signed-off-by: Leonardo Bras 
> > ---
> >  qapi/migration.json   | 24 
> >  migration/migration.h |  5 +
> >  migration/migration.c | 32 
> >  migration/socket.c|  5 +
> >  monitor/hmp-cmds.c|  6 ++
> >  5 files changed, 72 insertions(+)
>
> Reviewed-by: Daniel P. Berrangé 

Thanks!

>
> >
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index bbfd48cf0b..2e62ea6ebd 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -730,6 +730,13 @@
> >  #  will consume more CPU.
> >  #  Defaults to 1. (Since 5.0)
> >  #
> > +# @zero-copy: Controls behavior on sending memory pages on migration.
> > +# When true, enables a zero-copy mechanism for sending memory
> > +# pages, if host supports it.
> > +# Requires that QEMU be permitted to use locked memory for 
> > guest
> > +# RAM pages.
> > +# Defaults to false. (Since 7.0)
> > +#
> >  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
> >  #aliases for the purpose of dirty bitmap 
> > migration.  Such
> >  #aliases may for example be the corresponding 
> > names on the
> > @@ -769,6 +776,7 @@
> > 'xbzrle-cache-size', 'max-postcopy-bandwidth',
> > 'max-cpu-throttle', 'multifd-compression',
> > 'multifd-zlib-level' ,'multifd-zstd-level',
> > +   { 'name': 'zero-copy', 'if' : 'CONFIG_LINUX'},
> > 'block-bitmap-mapping' ] }
> >
> >  ##
> > @@ -895,6 +903,13 @@
> >  #  will consume more CPU.
> >  #  Defaults to 1. (Since 5.0)
> >  #
> > +# @zero-copy: Controls behavior on sending memory pages on migration.
> > +# When true, enables a zero-copy mechanism for sending memory
> > +# pages, if host supports it.
> > +# Requires that QEMU be permitted to use locked memory for 
> > guest
> > +# RAM pages.
> > +# Defaults to false. (Since 7.0)
> > +#
> >  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
> >  #aliases for the purpose of dirty bitmap 
> > migration.  Such
> >  #aliases may for example be the corresponding 
> > names on the
> > @@ -949,6 +964,7 @@
> >  '*multifd-compression': 'MultiFDCompression',
> >  '*multifd-zlib-level': 'uint8',
> >  '*multifd-zstd-level': 'uint8',
> > +'*zero-copy': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
> >  '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
>
> The current zerocopy impl is for the send path.
>
> Do you expect we might get zerocopy in the receive path
> later ?

It's possible, but I haven't started the implementation yet.

>
> If so then either call this 'send-zero-copy', or change it
> from a bool to an enum taking '["send", "recv", "both"]'.
>
> I'd probably take the former and just rename it.
>

Well, my rationale:
- I want to set zero copy sending:
zero-copy is set in the sending host, start migration.

- I want to set zero copy receiving:
zero-copy is set in the receiving host, wait for migration.
(Of course host support is checked when setting the parameter).

The problem with the current approach is trying to enable zero-copy on
receive before it's implemented, which will 'fail' silently.
A possible solution would be to add a patch to check in the receiving
path if zero-copy is enabled, and fail for now.
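
Something like the following is what I have in mind (a rough sketch only;
the hook point and error handling are assumptions, not the actual patch):

/* Rough sketch: fail fast when the zero-copy parameter is set on the
 * receiving side, where it is not implemented yet. */
static void socket_accept_incoming_migration(QIONetListener *listener,
                                             QIOChannelSocket *cioc,
                                             gpointer opaque)
{
    if (migrate_use_zero_copy()) {
        error_report("zero-copy is not implemented for incoming migration");
        exit(EXIT_FAILURE);
    }

    migration_channel_process_incoming(QIO_CHANNEL(cioc));
}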

What do you think?

Best regards,
Leo

>
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>




Re: [PATCH v7 3/5] migration: Add zero-copy parameter for QMP/HMP for Linux

2022-01-19 Thread Leonardo Bras Soares Passos
Hello Peter,

On Thu, Jan 13, 2022 at 4:00 AM Peter Xu  wrote:
>
> On Thu, Jan 06, 2022 at 07:13:40PM -0300, Leonardo Bras wrote:
> > Add property that allows zero-copy migration of memory pages,
> > and also includes a helper function migrate_use_zero_copy() to check
> > if it's enabled.
> >
> > No code is introduced to actually do the migration, but it allows
> > future implementations to enable/disable this feature.
> >
> > On non-Linux builds this parameter is compiled-out.
>
> I feel sad every time seeing a new parameter needs to be mostly duplicated 3
> times in the code. :(
>
> > diff --git a/migration/socket.c b/migration/socket.c
> > index 05705a32d8..f7a77aafd3 100644
> > --- a/migration/socket.c
> > +++ b/migration/socket.c
> > @@ -77,6 +77,11 @@ static void socket_outgoing_migration(QIOTask *task,
> >  } else {
> >  trace_migration_socket_outgoing_connected(data->hostname);
> >  }
> > +
> > +if (migrate_use_zero_copy()) {
> > +error_setg(&err, "Zero copy not available in migration");
> > +}
>
> I got confused the 1st time looking at it..  I think this is not strongly
> needed, but that's okay:

The idea is to avoid future issues when testing migration while bisecting.

>
> Reviewed-by: Peter Xu 

Thanks Peter!

>
> Thanks,
>
> --
> Peter Xu
>




Re: [PATCH v7 2/5] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-01-19 Thread Leonardo Bras Soares Passos
On Thu, Jan 13, 2022 at 10:06 AM Daniel P. Berrangé  wrote:
>
> On Thu, Jan 06, 2022 at 07:13:39PM -0300, Leonardo Bras wrote:
> > For CONFIG_LINUX, implement the new zero copy flag and the optional callback
> > io_flush on QIOChannelSocket, but enables it only when MSG_ZEROCOPY
> > feature is available in the host kernel, which is checked on
> > qio_channel_socket_connect_sync()
> >
> > qio_channel_socket_flush() was implemented by counting how many times
> > sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
> > socket's error queue, in order to find how many of them finished sending.
> > Flush will loop until those counters are the same, or until some error 
> > occurs.
> >
> > Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
> > 1: Buffer
> > - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid 
> > copying,
> > some caution is necessary to avoid overwriting any buffer before it's sent.
> > If something like this happens, a newer version of the buffer may be sent 
> > instead.
> > - If this is a problem, it's recommended to call qio_channel_flush() before 
> > freeing
> > or re-using the buffer.
> >
> > 2: Locked memory
> > - When using MSG_ZEROCOPY, the buffer memory will be locked after queued, 
> > and
> > unlocked after it's sent.
> > - Depending on the size of each buffer, and how often it's sent, it may 
> > require
> > a larger amount of locked memory than usually available to non-root user.
> > - If the required amount of locked memory is not available, writev_zero_copy
> > will return an error, which can abort an operation like migration,
> > - Because of this, when user code wants to add zero copy as a feature, it
> > requires a mechanism to disable it, so it can still be accessible to less
> > privileged users.
> >
> > Signed-off-by: Leonardo Bras 
> > ---
> >  include/io/channel-socket.h |   2 +
> >  io/channel-socket.c | 107 ++--
> >  2 files changed, 105 insertions(+), 4 deletions(-)
>
> Reviewed-by: Daniel P. Berrangé 
>

Thanks!

>
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>




Re: [PATCH v7 2/5] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-01-19 Thread Leonardo Bras Soares Passos
Hello Daniel,

On Thu, Jan 13, 2022 at 7:42 AM Daniel P. Berrangé  wrote:
>
> On Thu, Jan 13, 2022 at 06:34:12PM +0800, Peter Xu wrote:
> > On Thu, Jan 13, 2022 at 10:06:14AM +, Daniel P. Berrangé wrote:
> > > On Thu, Jan 13, 2022 at 02:48:15PM +0800, Peter Xu wrote:
> > > > On Thu, Jan 06, 2022 at 07:13:39PM -0300, Leonardo Bras wrote:
> > > > > @@ -558,15 +575,26 @@ static ssize_t 
> > > > > qio_channel_socket_writev(QIOChannel *ioc,
> > > > >  memcpy(CMSG_DATA(cmsg), fds, fdsize);
> > > > >  }
> > > > >
> > > > > +if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > > > > +sflags = MSG_ZEROCOPY;
> > > > > +}
> > > > > +
> > > > >   retry:
> > > > > -ret = sendmsg(sioc->fd, &msg, 0);
> > > > > +ret = sendmsg(sioc->fd, &msg, sflags);
> > > > >  if (ret <= 0) {
> > > > > -if (errno == EAGAIN) {
> > > > > +switch (errno) {
> > > > > +case EAGAIN:
> > > > >  return QIO_CHANNEL_ERR_BLOCK;
> > > > > -}
> > > > > -if (errno == EINTR) {
> > > > > +case EINTR:
> > > > >  goto retry;
> > > > > +case ENOBUFS:
> > > > > +if (sflags & MSG_ZEROCOPY) {
> > > > > +error_setg_errno(errp, errno,
> > > > > + "Process can't lock enough memory 
> > > > > for using MSG_ZEROCOPY");
> > > > > +return -1;
> > > > > +}
> > > >
> > > > I have no idea whether it'll make a real difference, but - should we 
> > > > better add
> > > > a "break" here?  If you agree and with that fixed, feel free to add:
> > > >
> > > > Reviewed-by: Peter Xu 
> > > >
> > > > I also wonder whether you hit ENOBUFS in any of the environments.  On 
> > > > Fedora
> > > > here it's by default unlimited, but just curious when we should keep an 
> > > > eye.
> > >
> > > Fedora doesn't allow unlimited locked memory by default
> > >
> > > $ grep "locked memory" /proc/self/limits
> > > Max locked memory            65536            65536            bytes
> > >
> > > And  regardless of Fedora defaults, libvirt will set a limit
> > > for the guest. It will only be unlimited if requiring certain
> > > things like VFIO.
> >
> > Thanks, I obviously checked up the wrong host..
> >
> > Leo, do you know how much locked memory will be needed by zero copy?  Will
> > there be a limit?  Is it linear to the number of sockets/channels?
>
> IIRC we decided it would be limited by the socket send buffer size, rather
> than guest RAM, because writes will block once the send buffer is full.
>
> This has a default global setting, with per-socket override. On one box I
> have it is 200 Kb. With multifd you'll need  "num-sockets * send buffer".

Oh, I was not aware there was a send buffer size (or maybe I just
don't recall).
That sure makes things much easier.

>
> > It'll be better if we can fail at enabling the feature when we detected that
> > the specified locked memory limit may not suffice.

sure

>
> Checking this value against available locked memory though will always
> have an error margin because other things in QEMU can use locked memory
> too

We can use the current limit (before zerocopy) as the error margin:
req_lock_mem = num_sockets * send_buffer + BASE_LOCKED

where BASE_LOCKED is the limit libvirt already sets, for example.
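
As a rough sketch of what I mean (the names are made up for
illustration, not from the actual series):

#include <stddef.h>
#include <sys/socket.h>

/* Illustrative sketch: estimate the locked memory zero-copy may need,
 * assuming one send buffer per multifd channel, plus whatever limit is
 * already in place (base_locked). */
static size_t estimate_required_locked_mem(int sockfd, unsigned num_channels,
                                           size_t base_locked)
{
    int sndbuf = 0;
    socklen_t optlen = sizeof(sndbuf);

    if (getsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &optlen) < 0) {
        return 0; /* unknown: caller should skip the check */
    }

    return (size_t)num_channels * (size_t)sndbuf + base_locked;
}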

What do you think?

Best regards,
Leo




Re: [PATCH v7 2/5] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-01-19 Thread Leonardo Bras Soares Passos
On Thu, Jan 13, 2022 at 9:12 AM Peter Xu  wrote:
>
> On Thu, Jan 13, 2022 at 10:42:39AM +, Daniel P. Berrangé wrote:
> > On Thu, Jan 13, 2022 at 06:34:12PM +0800, Peter Xu wrote:
> > > On Thu, Jan 13, 2022 at 10:06:14AM +, Daniel P. Berrangé wrote:
> > > > On Thu, Jan 13, 2022 at 02:48:15PM +0800, Peter Xu wrote:
> > > > > On Thu, Jan 06, 2022 at 07:13:39PM -0300, Leonardo Bras wrote:
> > > > > > @@ -558,15 +575,26 @@ static ssize_t 
> > > > > > qio_channel_socket_writev(QIOChannel *ioc,
> > > > > >  memcpy(CMSG_DATA(cmsg), fds, fdsize);
> > > > > >  }
> > > > > >
> > > > > > +if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > > > > > +sflags = MSG_ZEROCOPY;
> > > > > > +}
> > > > > > +
> > > > > >   retry:
> > > > > > -ret = sendmsg(sioc->fd, &msg, 0);
> > > > > > +ret = sendmsg(sioc->fd, &msg, sflags);
> > > > > >  if (ret <= 0) {
> > > > > > -if (errno == EAGAIN) {
> > > > > > +switch (errno) {
> > > > > > +case EAGAIN:
> > > > > >  return QIO_CHANNEL_ERR_BLOCK;
> > > > > > -}
> > > > > > -if (errno == EINTR) {
> > > > > > +case EINTR:
> > > > > >  goto retry;
> > > > > > +case ENOBUFS:
> > > > > > +if (sflags & MSG_ZEROCOPY) {
> > > > > > +error_setg_errno(errp, errno,
> > > > > > + "Process can't lock enough memory 
> > > > > > for using MSG_ZEROCOPY");
> > > > > > +return -1;
> > > > > > +}
> > > > >
> > > > > I have no idea whether it'll make a real difference, but - should we 
> > > > > better add
> > > > > a "break" here?  If you agree and with that fixed, feel free to add:
> > > > >
> > > > > Reviewed-by: Peter Xu 
> > > > >
> > > > > I also wonder whether you hit ENOBUFS in any of the environments.  On 
> > > > > Fedora
> > > > > here it's by default unlimited, but just curious when we should keep 
> > > > > an eye.
> > > >
> > > > Fedora doesn't allow unlimited locked memory by default
> > > >
> > > > $ grep "locked memory" /proc/self/limits
> > > > Max locked memory            65536            65536            bytes
> > > >
> > > > And  regardless of Fedora defaults, libvirt will set a limit
> > > > for the guest. It will only be unlimited if requiring certain
> > > > things like VFIO.
> > >
> > > Thanks, I obviously checked up the wrong host..
> > >
> > > Leo, do you know how much locked memory will be needed by zero copy?  Will
> > > there be a limit?  Is it linear to the number of sockets/channels?
> >
> > IIRC we decided it would be limited by the socket send buffer size, rather
> > than guest RAM, because writes will block once the send buffer is full.
> >
> > This has a default global setting, with per-socket override. On one box I
> > have it is 200 Kb. With multifd you'll need  "num-sockets * send buffer".
> >
> > > It'll be better if we can fail at enabling the feature when we detected 
> > > that
> > > the specified locked memory limit may not suffice.
> >
> > Checking this value against available locked memory though will always
> > have an error margin because other things in QEMU can use locked memory
> > too
>
> We could always still allow false positive in this check, so we can fail if we
> have a solid clue to know we'll fail later (e.g. minimum locked_vm needed is
> already less than total).  But no strong opinion; we could have this merged 
> and
> see whether that's needed in real life.  Thanks,

I agree, this is a good approach.

Leo




Re: [PATCH v7 2/5] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-01-19 Thread Leonardo Bras Soares Passos
Hello Peter,

On Thu, Jan 13, 2022 at 3:48 AM Peter Xu  wrote:
>
> On Thu, Jan 06, 2022 at 07:13:39PM -0300, Leonardo Bras wrote:
> > @@ -558,15 +575,26 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
> > *ioc,
> >  memcpy(CMSG_DATA(cmsg), fds, fdsize);
> >  }
> >
> > +if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > +sflags = MSG_ZEROCOPY;
> > +}
> > +
> >   retry:
> > -ret = sendmsg(sioc->fd, &msg, 0);
> > +ret = sendmsg(sioc->fd, &msg, sflags);
> >  if (ret <= 0) {
> > -if (errno == EAGAIN) {
> > +switch (errno) {
> > +case EAGAIN:
> >  return QIO_CHANNEL_ERR_BLOCK;
> > -}
> > -if (errno == EINTR) {
> > +case EINTR:
> >  goto retry;
> > +case ENOBUFS:
> > +if (sflags & MSG_ZEROCOPY) {
> > +error_setg_errno(errp, errno,
> > + "Process can't lock enough memory for 
> > using MSG_ZEROCOPY");
> > +return -1;
> > +}
>
> I have no idea whether it'll make a real difference, but - should we better 
> add
> a "break" here?

Here I followed the pattern of the EAGAIN error; that's why I just returned -1.

IIUC, a break here would cause errp to be set again with the default
message after the switch.
Another option would be to add a 'default' clause, move the default
error message there, and return -1 after the switch.

In the end I thought the current way was simpler, but it's no issue
to change if you think the 'default' idea would be better.
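
For reference, the 'default' variant would turn the error path of the
sendmsg() hunk quoted above into something like this (a sketch only, not
a tested patch):

switch (errno) {
case EAGAIN:
    return QIO_CHANNEL_ERR_BLOCK;
case EINTR:
    goto retry;
case ENOBUFS:
    if (sflags & MSG_ZEROCOPY) {
        error_setg_errno(errp, errno,
                         "Process can't lock enough memory for using MSG_ZEROCOPY");
        return -1;
    }
    /* fall through */
default:
    error_setg_errno(errp, errno, "Unable to write to socket");
    return -1;
}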

>  If you agree and with that fixed, feel free to add:
>
> Reviewed-by: Peter Xu 
>

Thanks!

> I also wonder whether you hit ENOBUFS in any of the environments.  On Fedora
> here it's by default unlimited, but just curious when we should keep an eye.

It's unlimited if you run as root IIRC.

>
> Thanks,
>
> --
> Peter Xu
>




Re: [PATCH v7 2/5] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-01-19 Thread Leonardo Bras Soares Passos
On Thu, Jan 13, 2022 at 7:34 AM Peter Xu  wrote:
>
> On Thu, Jan 13, 2022 at 10:06:14AM +, Daniel P. Berrangé wrote:
> > On Thu, Jan 13, 2022 at 02:48:15PM +0800, Peter Xu wrote:
> > > On Thu, Jan 06, 2022 at 07:13:39PM -0300, Leonardo Bras wrote:
> > > > @@ -558,15 +575,26 @@ static ssize_t 
> > > > qio_channel_socket_writev(QIOChannel *ioc,
> > > >  memcpy(CMSG_DATA(cmsg), fds, fdsize);
> > > >  }
> > > >
> > > > +if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > > > +sflags = MSG_ZEROCOPY;
> > > > +}
> > > > +
> > > >   retry:
> > > > -ret = sendmsg(sioc->fd, &msg, 0);
> > > > +ret = sendmsg(sioc->fd, &msg, sflags);
> > > >  if (ret <= 0) {
> > > > -if (errno == EAGAIN) {
> > > > +switch (errno) {
> > > > +case EAGAIN:
> > > >  return QIO_CHANNEL_ERR_BLOCK;
> > > > -}
> > > > -if (errno == EINTR) {
> > > > +case EINTR:
> > > >  goto retry;
> > > > +case ENOBUFS:
> > > > +if (sflags & MSG_ZEROCOPY) {
> > > > +error_setg_errno(errp, errno,
> > > > + "Process can't lock enough memory for 
> > > > using MSG_ZEROCOPY");
> > > > +return -1;
> > > > +}
> > >
> > > I have no idea whether it'll make a real difference, but - should we 
> > > better add
> > > a "break" here?  If you agree and with that fixed, feel free to add:
> > >
> > > Reviewed-by: Peter Xu 
> > >
> > > I also wonder whether you hit ENOBUFS in any of the environments.  On 
> > > Fedora
> > > here it's by default unlimited, but just curious when we should keep an 
> > > eye.
> >
> > Fedora doesn't allow unlimited locked memory by default
> >
> > $ grep "locked memory" /proc/self/limits
> > Max locked memory            65536            65536            bytes
> >
> > And  regardless of Fedora defaults, libvirt will set a limit
> > for the guest. It will only be unlimited if requiring certain
> > things like VFIO.
>
> Thanks, I obviously checked up the wrong host..
>
> Leo, do you know how much locked memory will be needed by zero copy?  Will
> there be a limit?  Is it linear to the number of sockets/channels?

It depends on the number of channels, of course, but there are other
influencing factors, like network bandwidth & usage, CPU speed & usage,
network queue size, VM page size, and so on.

A simple example:
if the CPU is free/fast, but there are other applications using the
network, we may enqueue a lot of buffers for sending, and end up needing
a lot of locked memory.

I don't think it's easy to calculate a good reference value for locked
memory here.

>
> It'll be better if we can fail at enabling the feature when we detected that
> the specified locked memory limit may not suffice.

I agree it's a good idea, but calculating this reference value is not
that simple, IIUC.

>
> --
> Peter Xu
>




Re: [PATCH v7 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback

2022-01-18 Thread Leonardo Bras Soares Passos
Hello Daniel,

On Thu, Jan 13, 2022 at 7:53 AM Daniel P. Berrangé  wrote:
>
> On Thu, Jan 06, 2022 at 07:13:38PM -0300, Leonardo Bras wrote:
> > Add flags to io_writev and introduce io_flush as optional callback to
> > QIOChannelClass, allowing the implementation of zero copy writes by
> > subclasses.
> >
> > How to use them:
> > - Write data using qio_channel_writev(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
> > - Wait write completion with qio_channel_flush().
> >
> > Notes:
> > As some zero copy implementations work asynchronously, it's
> > recommended to keep the write buffer untouched until the return of
> > qio_channel_flush(), to avoid the risk of sending an updated buffer
> > instead of the buffer state during write.
> >
> > As io_flush callback is optional, if a subclass does not implement it, then:
> > - io_flush will return 0 without changing anything.
> >
> > Also, some functions like qio_channel_writev_full_all() were adapted to
> > receive a flag parameter. That allows shared code between zero copy and
> > non-zero copy writev, and also an easier implementation on new flags.
> >
> > Signed-off-by: Leonardo Bras 
> > ---
> >  include/io/channel.h | 67 +++-
> >  io/channel-buffer.c  |  1 +
> >  io/channel-command.c |  1 +
> >  io/channel-file.c|  1 +
> >  io/channel-socket.c  |  2 ++
> >  io/channel-tls.c |  1 +
> >  io/channel-websock.c |  1 +
> >  io/channel.c | 51 +++--
> >  migration/rdma.c |  1 +
> >  9 files changed, 98 insertions(+), 28 deletions(-)
> >
> > diff --git a/include/io/channel.h b/include/io/channel.h
> > index 88988979f8..343766ce5b 100644
> > --- a/include/io/channel.h
> > +++ b/include/io/channel.h
> > @@ -32,12 +32,15 @@ OBJECT_DECLARE_TYPE(QIOChannel, QIOChannelClass,
> >
> >  #define QIO_CHANNEL_ERR_BLOCK -2
> >
> > +#define QIO_CHANNEL_WRITE_FLAG_ZERO_COPY 0x1
> > +
> >  typedef enum QIOChannelFeature QIOChannelFeature;
> >
> >  enum QIOChannelFeature {
> >  QIO_CHANNEL_FEATURE_FD_PASS,
> >  QIO_CHANNEL_FEATURE_SHUTDOWN,
> >  QIO_CHANNEL_FEATURE_LISTEN,
> > +QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY,
> >  };
> >
> >
> > @@ -104,6 +107,7 @@ struct QIOChannelClass {
> >   size_t niov,
> >   int *fds,
> >   size_t nfds,
> > + int flags,
> >   Error **errp);
> >  ssize_t (*io_readv)(QIOChannel *ioc,
> >  const struct iovec *iov,
> > @@ -136,6 +140,8 @@ struct QIOChannelClass {
> >IOHandler *io_read,
> >IOHandler *io_write,
> >void *opaque);
> > +int (*io_flush)(QIOChannel *ioc,
> > +Error **errp);
> >  };
> >
> >  /* General I/O handling functions */
> > @@ -222,12 +228,13 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc,
> >
> >
> >  /**
> > - * qio_channel_writev_full:
> > + * qio_channel_writev_full_flags:
> >   * @ioc: the channel object
> >   * @iov: the array of memory regions to write data from
> >   * @niov: the length of the @iov array
> >   * @fds: an array of file handles to send
> >   * @nfds: number of file handles in @fds
> > + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*)
> >   * @errp: pointer to a NULL-initialized error object
> >   *
> >   * Write data to the IO channel, reading it from the
> > @@ -255,12 +262,16 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc,
> >   * or QIO_CHANNEL_ERR_BLOCK if no data is can be sent
> >   * and the channel is non-blocking
> >   */
> > -ssize_t qio_channel_writev_full(QIOChannel *ioc,
> > -const struct iovec *iov,
> > -size_t niov,
> > -int *fds,
> > -size_t nfds,
> > -Error **errp);
> > +ssize_t qio_channel_writev_full_flags(QIOChannel *ioc,
> > +  const struct iovec *iov,
> > +  size_t niov,
> > +  int *fds,
> > +  size_t nfds,
> > +  int flags,
> > +  Error **errp);
> > +
> > +#define qio_channel_writev_full(ioc, iov, niov, fds, nfds, errp) \
> > +qio_channel_writev_full_flags(ioc, iov, niov, fds, nfds, 0, errp)
>
> Don't introduce yet another API variant here. Just add flags to
> all the existing write APIs with "full" in their name. The word
> "full" in their name was intended to indicate that they are
> accepting all possible parameters, so it doesn't make sense to
> add APIs which take even more possible parameters.

Oh, I was not aware of that. Thanks for letting me know!

Sure, I will do this change for v8.


>
> > +int qio_channel_writev_full_all_flags(QIOChannel 

Re: [PATCH v7 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback

2022-01-18 Thread Leonardo Bras Soares Passos
Hello Peter,

On Thu, Jan 13, 2022 at 3:28 AM Peter Xu  wrote:
>
> On Thu, Jan 06, 2022 at 07:13:38PM -0300, Leonardo Bras wrote:
> > diff --git a/io/channel.c b/io/channel.c
> > index e8b019dc36..904855e16e 100644
> > --- a/io/channel.c
> > +++ b/io/channel.c
> > @@ -67,12 +67,13 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc,
> >  }
> >
> >
> > -ssize_t qio_channel_writev_full(QIOChannel *ioc,
> > -const struct iovec *iov,
> > -size_t niov,
> > -int *fds,
> > -size_t nfds,
> > -Error **errp)
> > +ssize_t qio_channel_writev_full_flags(QIOChannel *ioc,
> > +  const struct iovec *iov,
> > +  size_t niov,
> > +  int *fds,
> > +  size_t nfds,
> > +  int flags,
> > +  Error **errp)
> >  {
> >  QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
> >
> > @@ -83,7 +84,7 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
> >  return -1;
> >  }
>
> Should we better also check QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY here when
> QIO_CHANNEL_WRITE_FLAG_ZERO_COPY is set?  Just like what we do with:

Yes, that's correct.
I will also test for fds + zerocopy_flag, which should also fail here.
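
Something like this is what I plan to add (a sketch; the exact error
wording is a placeholder):

if ((flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) &&
    !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
    error_setg_errno(errp, EINVAL,
                     "Channel does not support zero-copy writes");
    return -1;
}

if ((flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) && (fds || nfds)) {
    error_setg_errno(errp, EINVAL,
                     "Zero-copy writes cannot be combined with fd passing");
    return -1;
}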

>
> if ((fds || nfds) &&
> !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {
> error_setg_errno(errp, EINVAL,
>  "Channel does not support file descriptor passing");
> return -1;
> }
>
> I still think it's better to have the caller be crystal clear when to use
> zero_copy feature because it has implication on buffer lifetime.

I don't disagree with that suggestion.

But the buffer lifetime limitation is something in the socket
implementation, right?
There could be a synchronous zerocopy implementation that does not
require a flush, and thus
doesn't require the buffer to be treated specially. Or am I missing something?
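
Just so we are talking about the same thing, the caller-side contract I
have in mind looks like this (a sketch only; it assumes the flags-based
API from this series, so names and argument order may differ):

/* Buffers passed with the zero-copy flag must stay untouched until the
 * flush returns. */
ret = qio_channel_writev_full_all_flags(ioc, iov, niov, NULL, 0,
                                        QIO_CHANNEL_WRITE_FLAG_ZERO_COPY,
                                        errp);
if (ret == 0) {
    ret = qio_channel_flush(ioc, errp); /* blocks until queued writes finish */
}
/* only now is it safe to reuse or free the iov buffers */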

>
> I might have commented similar things before, but I have missed a few versions
> so I could also have missed some previous discussions..
>

Those are all great suggestions, Peter! Thanks for that!

Some of the previous suggestions may have been missed because a lot of
code moved.
Sorry about that.

Best regards,
Leo




Re: [PATCH v5 6/6] multifd: Implement zerocopy write in multifd migration (multifd-zerocopy)

2021-12-09 Thread Leonardo Bras Soares Passos
On Thu, Dec 9, 2021 at 5:51 AM Leonardo Bras Soares Passos
 wrote:
>
> Hello Juan,
>
> On Tue, Nov 16, 2021 at 1:08 PM Juan Quintela  wrote:
> >
> > Leonardo Bras  wrote:
> > > Implement zerocopy on nocomp_send_write(), by making use of QIOChannel
> > > zerocopy interface.
> > >
> > > Change multifd_send_sync_main() so it can distinguish each iteration sync 
> > > from
> > > the setup and the completion, so a flush_zerocopy() can be called
> > > after each iteration, in order to make sure all dirty pages are sent
> > > before a new iteration is started.
> > >
> > > Also make it return -1 if flush_zerocopy() fails, in order to cancel
> > > the migration process, and avoid resuming the guest in the target host
> > > without receiving all current RAM.
> > >
> > > This will work fine on RAM migration because the RAM pages are not 
> > > usually freed,
> > > and there is no problem on changing the pages content between 
> > > async_send() and
> > > the actual sending of the buffer, because this change will dirty the page 
> > > and
> > > cause it to be re-sent on a next iteration anyway.
> > >
> > > Given a lot of locked memory may be needed in order to use multifd 
> > > migration
> > > with zerocopy enabled, make it optional by creating a new migration 
> > > parameter
> > > "zerocopy" on qapi, so low-privileged users can still perform multifd
> > > migrations.
> >
> > How much memory can a non-root program use by default?
> >
> >
> > >  static void *multifd_send_thread(void *opaque)
> > > @@ -853,6 +875,10 @@ static void multifd_new_send_channel_async(QIOTask 
> > > *task, gpointer opaque)
> > >  goto cleanup;
> > >  }
> > >
> > > +if (migrate_use_zerocopy()) {
> > > +p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZEROCOPY;
> > > +}
> >
> > This belongs
> >
> >
> > >  p->c = QIO_CHANNEL(sioc);
> > >  qio_channel_set_delay(p->c, false);
> > >  p->running = true;
> > > @@ -918,6 +944,7 @@ int multifd_save_setup(Error **errp)
> > >  p->packet->version = cpu_to_be32(MULTIFD_VERSION);
> > >  p->name = g_strdup_printf("multifdsend_%d", i);
> > >  p->tls_hostname = g_strdup(s->hostname);
> > > +p->write_flags = 0;
> >
> > here?
> >
> > >  socket_send_channel_create(multifd_new_send_channel_async, p);
> > >  }
> > > diff --git a/migration/socket.c b/migration/socket.c
> > > index e26e94aa0c..8e40e0a3fd 100644
> > > --- a/migration/socket.c
> > > +++ b/migration/socket.c
> > > @@ -78,8 +78,13 @@ static void socket_outgoing_migration(QIOTask *task,
> > >  trace_migration_socket_outgoing_connected(data->hostname);
> > >  }
> > >
> > > -if (migrate_use_zerocopy()) {
> > > -error_setg(&err, "Zerocopy not available in migration");
> > > +if (migrate_use_zerocopy() &&
> > > +(!migrate_use_multifd() ||
> > > + !qio_channel_has_feature(sioc, 
> > > QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY) ||
> > > +  migrate_multifd_compression() != MULTIFD_COMPRESSION_NONE ||
> > > +  migrate_use_tls())) {
> > > +error_setg(&err,
> > > +   "Zerocopy only available for non-compressed non-TLS 
> > > multifd migration");
> > >  }
> > >
> > >  migration_channel_connect(data->s, sioc, data->hostname, err);
> >
> > Do we really want to do this check here?  I think this is really too
> > late.
> >
> > You are not patching migrate_params_check().
> >
> > I think that the proper way of doing this is something like:
> >
> > if (params->zerocopy &&
> > (params->parameters.multifd_compression != MULTIFD_COMPRESSION_NONE 
> > ||
> >  migrate_use_tls())) {
> >error_setg(&err,
> >  "Zerocopy only available for non-compressed non-TLS 
> > multifd migration");
> > return false;
> > }
>
> Don't we also need a check for multifd enabled here?
> We could have zerocopy, multifd_compression=none, tls=disabled but it
> will not fail if multifd=disabled.
>
> Is this correct?
>

I did some tests, and this case indeed does not fail, even though
it should.
So IIUC we really need to check for multifd here.

Sending v6.

>
> >
> > You have to do the equivalent of multifd_compression and tls enablement,
> > to see that zerocopy is not enabled, of course.
> >
> > I would prefer to check for QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY there, but
> > I can't see a way of doing that without a qio.
> >
> > Later, Juan.
> >




Re: [PATCH v5 6/6] multifd: Implement zerocopy write in multifd migration (multifd-zerocopy)

2021-12-09 Thread Leonardo Bras Soares Passos
Hello Juan,

On Tue, Nov 16, 2021 at 1:08 PM Juan Quintela  wrote:
>
> Leonardo Bras  wrote:
> > Implement zerocopy on nocomp_send_write(), by making use of QIOChannel
> > zerocopy interface.
> >
> > Change multifd_send_sync_main() so it can distinguish each iteration sync 
> > from
> > the setup and the completion, so a flush_zerocopy() can be called
> > after each iteration, in order to make sure all dirty pages are sent
> > before a new iteration is started.
> >
> > Also make it return -1 if flush_zerocopy() fails, in order to cancel
> > the migration process, and avoid resuming the guest in the target host
> > without receiving all current RAM.
> >
> > This will work fine on RAM migration because the RAM pages are not usually 
> > freed,
> > and there is no problem on changing the pages content between async_send() 
> > and
> > the actual sending of the buffer, because this change will dirty the page 
> > and
> > cause it to be re-sent on a next iteration anyway.
> >
> > Given a lot of locked memory may be needed in order to use multifd migration
> > with zerocopy enabled, make it optional by creating a new migration 
> > parameter
> > "zerocopy" on qapi, so low-privileged users can still perform multifd
> > migrations.
>
> How much memory can a non-root program use by default?
>
>
> >  static void *multifd_send_thread(void *opaque)
> > @@ -853,6 +875,10 @@ static void multifd_new_send_channel_async(QIOTask 
> > *task, gpointer opaque)
> >  goto cleanup;
> >  }
> >
> > +if (migrate_use_zerocopy()) {
> > +p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZEROCOPY;
> > +}
>
> This belongs
>
>
> >  p->c = QIO_CHANNEL(sioc);
> >  qio_channel_set_delay(p->c, false);
> >  p->running = true;
> > @@ -918,6 +944,7 @@ int multifd_save_setup(Error **errp)
> >  p->packet->version = cpu_to_be32(MULTIFD_VERSION);
> >  p->name = g_strdup_printf("multifdsend_%d", i);
> >  p->tls_hostname = g_strdup(s->hostname);
> > +p->write_flags = 0;
>
> here?
>
> >  socket_send_channel_create(multifd_new_send_channel_async, p);
> >  }
> > diff --git a/migration/socket.c b/migration/socket.c
> > index e26e94aa0c..8e40e0a3fd 100644
> > --- a/migration/socket.c
> > +++ b/migration/socket.c
> > @@ -78,8 +78,13 @@ static void socket_outgoing_migration(QIOTask *task,
> >  trace_migration_socket_outgoing_connected(data->hostname);
> >  }
> >
> > -if (migrate_use_zerocopy()) {
> > -error_setg(&err, "Zerocopy not available in migration");
> > +if (migrate_use_zerocopy() &&
> > +(!migrate_use_multifd() ||
> > + !qio_channel_has_feature(sioc, 
> > QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY) ||
> > +  migrate_multifd_compression() != MULTIFD_COMPRESSION_NONE ||
> > +  migrate_use_tls())) {
> > +error_setg(&err,
> > +   "Zerocopy only available for non-compressed non-TLS 
> > multifd migration");
> >  }
> >
> >  migration_channel_connect(data->s, sioc, data->hostname, err);
>
> Do we really want to do this check here?  I think this is really too
> late.
>
> You are not patching migrate_params_check().
>
> I think that the proper way of doing this is something like:
>
> if (params->zerocopy &&
> (params->parameters.multifd_compression != MULTIFD_COMPRESSION_NONE ||
>  migrate_use_tls())) {
>error_setg(&err,
>  "Zerocopy only available for non-compressed non-TLS 
> multifd migration");
> return false;
> }

Don't we also need a check for multifd enabled here?
We could have zerocopy, multifd_compression=none, tls=disabled but it
will not fail if multifd=disabled.

Is this correct?


>
> You have to do the equivalent of multifd_compression and tls enablement,
> to see that zerocopy is not enabled, of course.
>
> I would prefer to check for QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY there, but
> I can't see a way of doing that without a qio.
>
> Later, Juan.
>




Re: [PATCH v5 3/6] QIOChannelSocket: Implement io_writev_zerocopy & io_flush_zerocopy for CONFIG_LINUX

2021-12-09 Thread Leonardo Bras Soares Passos
On Thu, Dec 9, 2021 at 5:38 AM Leonardo Bras Soares Passos
 wrote:
>
> Hello Daniel,
>
> On Fri, Dec 3, 2021 at 6:18 AM Daniel P. Berrangé  wrote:
> >
> > On Fri, Dec 03, 2021 at 02:42:19AM -0300, Leonardo Bras Soares Passos wrote:
> > > Hello Daniel,
> > >
> > > On Tue, Nov 23, 2021 at 6:56 AM Daniel P. Berrangé  
> > > wrote:
> > > >
> > > > On Tue, Nov 23, 2021 at 01:46:44AM -0300, Leonardo Bras Soares Passos 
> > > > wrote:
> > > > > Hello Daniel,
> > > > >
> > > > > On Fri, Nov 12, 2021 at 7:54 AM Daniel P. Berrangé 
> > > > >  wrote:
> > > > > > > +
> > > > > > > +#ifdef CONFIG_LINUX
> > > > > > > +
> > > > > > > +static int qio_channel_socket_poll(QIOChannelSocket *sioc, bool 
> > > > > > > zerocopy,
> > > > > > > +   Error **errp)
> > > > > >
> > > > > > There's only one caller and it always passes zerocopy=true,
> > > > > > so this parmeter looks pointless.
> > > > >
> > > > > I did that for possible reuse of this function in the future:
> > > > > - As of today, this is certainly compiled out, but if at some point
> > > > > someone wants to use poll for something other
> > > > > than the reading of a zerocopy errqueue, it could be reused.
> > > > >
> > > > > But sure, if that's not desirable, I can remove the parameter (and the
> > > > > if clause for !zerocopy).
> > > > >
> > > > > >
> > > > > > > +{
> > > > > > > +struct pollfd pfd;
> > > > > > > +int ret;
> > > > > > > +
> > > > > > > +pfd.fd = sioc->fd;
> > > > > > > +pfd.events = 0;
> > > > > > > +
> > > > > > > + retry:
> > > > > > > +ret = poll(&pfd, 1, -1);
> > > > > > > +if (ret < 0) {
> > > > > > > +switch (errno) {
> > > > > > > +case EAGAIN:
> > > > > > > +case EINTR:
> > > > > > > +goto retry;
> > > > > > > +default:
> > > > > > > +error_setg_errno(errp, errno,
> > > > > > > + "Poll error");
> > > > > > > +return ret;
> > > > > >
> > > > > >return -1;
> > > > > >
> > > > > > > +}
> > > > > > > +}
> > > > > > > +
> > > > > > > +if (pfd.revents & (POLLHUP | POLLNVAL)) {
> > > > > > > +error_setg(errp, "Poll error: Invalid or disconnected 
> > > > > > > fd");
> > > > > > > +return -1;
> > > > > > > +}
> > > > > > > +
> > > > > > > +if (!zerocopy && (pfd.revents & POLLERR)) {
> > > > > > > +error_setg(errp, "Poll error: Errors present in 
> > > > > > > errqueue");
> > > > > > > +return -1;
> > > > > > > +}
> > > > > >
> > > > > > > +
> > > > > > > +return ret;
> > > > > >
> > > > > >   return 0;
> > > > >
> > > > > In the idea of future reuse I spoke above, returning zero here would
> > > > > make this function always look like the poll timed out. Some future
> > > > > users may want to repeat the waiting if poll() timed out, or if
> > > > > (return > 0) stop polling.
> > > >
> > > > Now that I'm looking again, we should not really use poll() at all,
> > > > as GLib provides us higher level APIs. We in fact already have the
> > > > qio_channel_wait() method as a general purpose helper for waiting
> > > > for an I/O condition to occur.;
> > > >
> > >
> > > So you suggest using
> > > qio_channel_wait(sioc, G_IO_IN);
> > > instead of creating the new qio_channel_socket_poll().
> > >
> > > Is the above correct? I mean, is it as simple as that?
> >
> > Yes, hopefully it is that

Re: [PATCH v5 3/6] QIOChannelSocket: Implement io_writev_zerocopy & io_flush_zerocopy for CONFIG_LINUX

2021-12-09 Thread Leonardo Bras Soares Passos
Hello Daniel,

On Fri, Dec 3, 2021 at 6:18 AM Daniel P. Berrangé  wrote:
>
> On Fri, Dec 03, 2021 at 02:42:19AM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Daniel,
> >
> > On Tue, Nov 23, 2021 at 6:56 AM Daniel P. Berrangé  
> > wrote:
> > >
> > > On Tue, Nov 23, 2021 at 01:46:44AM -0300, Leonardo Bras Soares Passos 
> > > wrote:
> > > > Hello Daniel,
> > > >
> > > > On Fri, Nov 12, 2021 at 7:54 AM Daniel P. Berrangé 
> > > >  wrote:
> > > > > > +
> > > > > > +#ifdef CONFIG_LINUX
> > > > > > +
> > > > > > +static int qio_channel_socket_poll(QIOChannelSocket *sioc, bool 
> > > > > > zerocopy,
> > > > > > +   Error **errp)
> > > > >
> > > > > There's only one caller and it always passes zerocopy=true,
> > > > > so this parmeter looks pointless.
> > > >
> > > > I did that for possible reuse of this function in the future:
> > > > - As of today, this is certainly compiled out, but if at some point
> > > > someone wants to use poll for something other
> > > > than the reading of a zerocopy errqueue, it could be reused.
> > > >
> > > > But sure, if that's not desirable, I can remove the parameter (and the
> > > > if clause for !zerocopy).
> > > >
> > > > >
> > > > > > +{
> > > > > > +struct pollfd pfd;
> > > > > > +int ret;
> > > > > > +
> > > > > > +pfd.fd = sioc->fd;
> > > > > > +pfd.events = 0;
> > > > > > +
> > > > > > + retry:
> > > > > > +ret = poll(&pfd, 1, -1);
> > > > > > +if (ret < 0) {
> > > > > > +switch (errno) {
> > > > > > +case EAGAIN:
> > > > > > +case EINTR:
> > > > > > +goto retry;
> > > > > > +default:
> > > > > > +error_setg_errno(errp, errno,
> > > > > > + "Poll error");
> > > > > > +return ret;
> > > > >
> > > > >return -1;
> > > > >
> > > > > > +}
> > > > > > +}
> > > > > > +
> > > > > > +if (pfd.revents & (POLLHUP | POLLNVAL)) {
> > > > > > +error_setg(errp, "Poll error: Invalid or disconnected fd");
> > > > > > +return -1;
> > > > > > +}
> > > > > > +
> > > > > > +if (!zerocopy && (pfd.revents & POLLERR)) {
> > > > > > +error_setg(errp, "Poll error: Errors present in errqueue");
> > > > > > +return -1;
> > > > > > +}
> > > > >
> > > > > > +
> > > > > > +return ret;
> > > > >
> > > > >   return 0;
> > > >
> > > > In the idea of future reuse I spoke above, returning zero here would
> > > > make this function always look like the poll timed out. Some future
> > > > users may want to repeat the waiting if poll() timed out, or if
> > > > (return > 0) stop polling.
> > >
> > > Now that I'm looking again, we should not really use poll() at all,
> > > as GLib provides us higher level APIs. We in fact already have the
> > > qio_channel_wait() method as a general purpose helper for waiting
> > > for an I/O condition to occur.;
> > >
> >
> > So you suggest using
> > qio_channel_wait(sioc, G_IO_IN);
> > instead of creating the new qio_channel_socket_poll().
> >
> > Is the above correct? I mean, is it as simple as that?
>
> Yes, hopefully it is that simple.

It seems not to be the case.
After some testing, I found out that using this stalls the migration.

This happens when multifd_send_sync_main() calls flush_zerocopy() while
the migration threads are
in multifd_send_thread() calling qemu_sem_wait(&p->sem);

I don't really understand enough of GLib to know how much this differs
from a poll(), but it seems to make a difference.

>
> > > > I understand the idea of testing SO_EE_CODE_ZEROCOPY_COPIED to be able
> > > > to tell whenever zerocopy fell back to copying for some reason, but I
> > > > don't see how this can be helpful here.
> > > >
> > > > Other than that I would do rv++ instead of rv=1 here, if I want to
> > > > keep track of how many buffers were sent with zerocopy and how many
> > > > ended up being copied.
> > >
> > > Sure, we could do   "ret > 0 == number of buffers that were copied"
> > > as the API contract, rather than just treating it as a boolean.
> >
> > Ok, then you suggest the responsibility of checking the number of
> > writes with SO_EE_CODE_ZEROCOPY_COPIED, comparing it with the total
> > number of writes, and deciding whether or not to disable zerocopy
> > should be on the caller.
>
> Yep, it's a usage policy, so it's nicer to allow the caller to decide the
> policy.
>
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>




Re: [PATCH v5 3/6] QIOChannelSocket: Implement io_writev_zerocopy & io_flush_zerocopy for CONFIG_LINUX

2021-12-02 Thread Leonardo Bras Soares Passos
Hello Daniel,

On Tue, Nov 23, 2021 at 6:56 AM Daniel P. Berrangé  wrote:
>
> On Tue, Nov 23, 2021 at 01:46:44AM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Daniel,
> >
> > On Fri, Nov 12, 2021 at 7:54 AM Daniel P. Berrangé  
> > wrote:
> > [...]
> > > > @@ -561,12 +577,15 @@ static ssize_t 
> > > > qio_channel_socket_writev_flags(QIOChannel *ioc,
> > > >   retry:
> > > >  ret = sendmsg(sioc->fd, &msg, flags);
> > > >  if (ret <= 0) {
> > > > -if (errno == EAGAIN) {
> > > > +switch (errno) {
> > > > +case EAGAIN:
> > > >  return QIO_CHANNEL_ERR_BLOCK;
> > > > -}
> > > > -if (errno == EINTR) {
> > > > +case EINTR:
> > > >  goto retry;
> > > > +case ENOBUFS:
> > > > +return QIO_CHANNEL_ERR_NOBUFS;
> > >
> > > Why does ENOBUFS need handling separately instead of letting
> > > the error_setg_errno below handle it ?
> > >
> > > The caller immediately invokes error_setg_errno() again,
> > > just with different error message.
> > >
> > > No code in this series ever looks at QIO_CHANNEL_ERR_NOBUFS
> > > either, so we don't even need that special error return code
> > > added AFAICT ?
> > >
> >
> > The idea was to add a custom message for ENOBUFS return when sending
> > with MSG_ZEROCOPY.
> > I mean, having this message is important for the user to understand
> > why the migration is failing, but it would
> > not make any sense to have this message while a non-zerocopy sendmsg()
> > returns with ENOBUFS.
> >
> > ENOBUFS : The output queue for a network interface was full.  This
> > generally indicates that the interface has stopped sending, but may be
> > caused by transient congestion.
> >
> > As an alternative, I could add this message inside the switch, inside
> > an if (flags & MSG_ZEROCOPY) on qio_channel_socket_writev_flags()
> > instead of in its caller.
> > But for me it looks bloated, I mean, dealing with an error for
> > ZEROCOPY only in the general function.
>
> It is perfectly reasonable to check flags in this method.
>
> > OTOH, if you think that it's a better idea to deal with every error in
> > qio_channel_socket_writev_flags() instead of in the caller, I will
> > change it for v6. Please let me know.
>
> Yes, this method is already taking an Error **errp parameter and
> reporting a user facing error. If we need to report different
> message text for ENOBUFS, it should be done in this method too.
>
> The reason QIO_CHANNEL_ERR_BLOCK is special is because we are
> explicitly not treating it as an error scenario at all.  That's
> different to the ENOBUFS case.
>

Ok, I will change it for v6.

>
> >
> > > >  }
> > > > +
> > > >  error_setg_errno(errp, errno,
> > > >   "Unable to write to socket");
> > > >  return -1;
> > > > @@ -670,6 +689,127 @@ static ssize_t 
> > > > qio_channel_socket_writev(QIOChannel *ioc,
> > > >  }
> > > >  #endif /* WIN32 */
> > > >
> > > > +
> > > > +#ifdef CONFIG_LINUX
> > > > +
> > > > +static int qio_channel_socket_poll(QIOChannelSocket *sioc, bool 
> > > > zerocopy,
> > > > +   Error **errp)
> > >
> > > There's only one caller and it always passes zerocopy=true,
> > > so this parmeter looks pointless.
> >
> > I did that for possible reuse of this function in the future:
> > - As of today, this is certainly compiled out, but if at some point
> > someone wants to use poll for something other
> > than the reading of a zerocopy errqueue, it could be reused.
> >
> > But sure, if that's not desirable, I can remove the parameter (and the
> > if clause for !zerocopy).
> >
> > >
> > > > +{
> > > > +struct pollfd pfd;
> > > > +int ret;
> > > > +
> > > > +pfd.fd = sioc->fd;
> > > > +pfd.events = 0;
> > > > +
> > > > + retry:
> > > > > > +ret = poll(&pfd, 1, -1);
> > > > +if (ret < 0) {
> > > > +switch (errno) {
> > > > +case EAGAIN:
> > > > +case EINTR:
> > > > +goto retry;
> > > > +def

Re: [PATCH v5 1/6] QIOChannel: Add io_writev_zerocopy & io_flush_zerocopy callbacks

2021-12-02 Thread Leonardo Bras Soares Passos
Hello Daniel,

On Tue, Nov 23, 2021 at 6:45 AM Daniel P. Berrangé  wrote:
>
> On Mon, Nov 22, 2021 at 08:18:09PM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Daniel,
> > Thanks for the feedback!
> >
> > On Fri, Nov 12, 2021 at 7:13 AM Daniel P. Berrangé  
> > wrote:
> > >
> > > On Fri, Nov 12, 2021 at 02:10:36AM -0300, Leonardo Bras wrote:
> > > > -int qio_channel_writev_all(QIOChannel *ioc,
> > > > -   const struct iovec *iov,
> > > > -   size_t niov,
> > > > -   Error **erp);
> > > > +int qio_channel_writev_all_flags(QIOChannel *ioc,
> > > > + const struct iovec *iov,
> > > > + size_t niov,
> > > > + int flags,
> > > > + Error **errp);
> > > > +#define qio_channel_writev_all(ioc, iov, niov, errp) \
> > > > +qio_channel_writev_all_flags(ioc, iov, niov, 0, errp)
> > >
> > > We already have separate methods for zerocopy, instead of adding
> > > flags, so we shouldn't add flags to this either.
> > >
> > > Add a qio_channel_writev_zerocopy_all method instead.
> > >
> > > Internally, we can still make both qio_channel_writev_zerocopy_all
> > > and qio_channel_writev_all use the same helper method, just don't
> > > expose flags in the public API. Even internally we don't really
> > > need flags, just a bool
> >
> > I see.
> > The idea of having a flag was to make it easier to expand the
> > interface in the future.
> > I got some feedback on v1 that would suggest it would be desired:
> > http://patchwork.ozlabs.org/project/qemu-devel/patch/20210831110238.299458-2-leob...@redhat.com/
> >
> >
> > >
> > [...]
> > > > +#define qio_channel_writev_full_all(ioc, iov, niov, fds, nfds, errp) \
> > > > +qio_channel_writev_full_all_flags(ioc, iov, niov, fds, nfds, 0, 
> > > > errp)
> > >
> > > There's no need for this at all. Since fd passing is not supported
> > > with zerocopy, there's no reason to ever use this method.
> > >
> > > > +/**
> > > > + * qio_channel_writev_zerocopy:
> > > > + * @ioc: the channel object
> > > > + * @iov: the array of memory regions to write data from
> > > > + * @niov: the length of the @iov array
> > > > + * @errp: pointer to a NULL-initialized error object
> > > > + *
> > > > + * Behaves like qio_channel_writev_full_all_flags, but may write
> > >
> > > qio_channel_writev
> > >
> > > > + * data asynchronously while avoiding unnecessary data copy.
> > > > + * This function may return before any data is actually written,
> > > > + * but should queue every buffer for writing.
> > >
> > > Callers mustn't rely on "should" docs - they must rely on the
> > > return value indicating how many bytes were accepted.
> > >
> > > Also mention that this requires locked memory and can/will fail if
> > > insufficient locked memory is available.
> > >
> >
> > Sure, I will update that.
> >
> > > > +/**
> > > > + * qio_channel_flush_zerocopy:
> > > > + * @ioc: the channel object
> > > > + * @errp: pointer to a NULL-initialized error object
> > > > + *
> > > > + * Will block until every packet queued with
> > > > + * qio_channel_writev_zerocopy() is sent, or return
> > > > + * in case of any error.
> > > > + *
> > > > + * Returns -1 if any error is found, 0 otherwise.
> > >
> > >   Returns -1 if any error is found, 0 if all data was sent,
> > >or 1 if all data was sent but at least some was copied.
> > >
> >
> > I don't really get the return 1 part, I mean, per description it will
> > 'block until every queued packet was sent, so "at least some was
> > copied" doesn't seem to fit here.
> > Could you elaborate?
>
> Passing the ZEROCOPY flag to the kernel does not guarantee
> that the copy is avoided, it is merely a hint to the kernel
>
> When getting the notification, the ee_code  field in the
> notification struct will have the flag
> SO_EE_CODE_ZEROCOPY_COPIED  set if the kernel could not
> avoid the copy.
>

Correct,

> In this case, it is better for the application to stop
> using the ZEROCOPY flag and ju

Re: [PATCH v5 6/6] multifd: Implement zerocopy write in multifd migration (multifd-zerocopy)

2021-12-01 Thread Leonardo Bras Soares Passos
On Tue, Nov 16, 2021 at 1:40 PM Daniel P. Berrangé  wrote:
>
> On Tue, Nov 16, 2021 at 05:34:50PM +0100, Juan Quintela wrote:
> > Daniel P. Berrangé  wrote:
> >
> > >>
> > >> if (params->zerocopy &&
> > >> (params->parameters.multifd_compression != 
> > >> MULTIFD_COMPRESSION_NONE ||
> > >>  migrate_use_tls())) {
> > >>error_setg(&err,
> > >>  "Zerocopy only available for non-compressed non-TLS 
> > >> multifd migration");
> > >> return false;
> > >> }
> > >>
> > >> You have to do the equivalent of multifd_compression and tls enablement,
> > >> to see that zerocopy is not enabled, of course.
> > >>
> > >> I would prefer to check for QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY there, but
> > >> I can't see a way of doing that without a qio.
> > >
> > > I don't think you need to check that feature flag.
> >
> > Oh, I mean other thing.
> >
> > When you set "zerocopy" capability, you don't know if the kernel support
> > it.  My understanding is that the only way to check if it supported is
> > here.
>
> If you request it and it isn't supported you'll get an error back from
> qio_channel_writev_zerocopy(). That's a bit too late though.
>
> Ideally we should report an error straight after the migration code
> creates the I/O channel, by querying for the feature.
>
>

Agreed.
I suggested checking for the feature's presence where the test already
happens in v5, and checking the other combinations of migration
parameters in migrate_params_check(), as Juan suggested.
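
In other words, keep only the feature half in socket.c, something like
this (a sketch; it reuses the v5 naming from the hunks quoted in this
thread):

/* Feature test stays where the channel exists; parameter combinations
 * move to migrate_params_check(). */
if (migrate_use_zerocopy() &&
    !qio_channel_has_feature(sioc, QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY)) {
    error_setg(&err, "Host does not support zerocopy sends");
}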

What do you think?

> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>

Best regards,
Leo




Re: [PATCH v5 6/6] multifd: Implement zerocopy write in multifd migration (multifd-zerocopy)

2021-12-01 Thread Leonardo Bras Soares Passos
On Tue, Nov 16, 2021 at 1:35 PM Daniel P. Berrangé  wrote:
>
> On Tue, Nov 16, 2021 at 04:17:47PM +, Daniel P. Berrangé wrote:
> > On Tue, Nov 16, 2021 at 05:08:06PM +0100, Juan Quintela wrote:
> > > Leonardo Bras  wrote:
> > > > Implement zerocopy on nocomp_send_write(), by making use of QIOChannel
> > > > zerocopy interface.
> > > >
> > > > Change multifd_send_sync_main() so it can distinguish each iteration 
> > > > sync from
> > > > the setup and the completion, so a flush_zerocopy() can be called
> > > > after each iteration, in order to make sure all dirty pages are 
> > > > sent
> > > > before a new iteration is started.
> > > >
> > > > Also make it return -1 if flush_zerocopy() fails, in order to cancel
> > > > the migration process, and avoid resuming the guest in the target host
> > > > without receiving all current RAM.
> > > >
> > > > This will work fine on RAM migration because the RAM pages are not 
> > > > usually freed,
> > > > and there is no problem on changing the pages content between 
> > > > async_send() and
> > > > the actual sending of the buffer, because this change will dirty the 
> > > > page and
> > > > cause it to be re-sent on a next iteration anyway.
> > > >
> > > > Given a lot of locked memory may be needed in order to use multifd 
> > > > migration
> > > > with zerocopy enabled, make it optional by creating a new migration 
> > > > parameter
> > > > "zerocopy" on qapi, so low-privileged users can still perform multifd
> > > > migrations.
> > >
> > > How much memory can a non-root program use by default?
> > >
> > >
> > > >  static void *multifd_send_thread(void *opaque)
> > > > @@ -853,6 +875,10 @@ static void multifd_new_send_channel_async(QIOTask 
> > > > *task, gpointer opaque)
> > > >  goto cleanup;
> > > >  }
> > > >
> > > > +if (migrate_use_zerocopy()) {
> > > > +p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZEROCOPY;
> > > > +}
> > >
> > > This belongs
> > >
> > >
> > > >  p->c = QIO_CHANNEL(sioc);
> > > >  qio_channel_set_delay(p->c, false);
> > > >  p->running = true;
> > > > @@ -918,6 +944,7 @@ int multifd_save_setup(Error **errp)
> > > >  p->packet->version = cpu_to_be32(MULTIFD_VERSION);
> > > >  p->name = g_strdup_printf("multifdsend_%d", i);
> > > >  p->tls_hostname = g_strdup(s->hostname);
> > > > +p->write_flags = 0;
> > >
> > > here?
> > >
> > > >  socket_send_channel_create(multifd_new_send_channel_async, p);
> > > >  }
> > > > diff --git a/migration/socket.c b/migration/socket.c
> > > > index e26e94aa0c..8e40e0a3fd 100644
> > > > --- a/migration/socket.c
> > > > +++ b/migration/socket.c
> > > > @@ -78,8 +78,13 @@ static void socket_outgoing_migration(QIOTask *task,
> > > >  trace_migration_socket_outgoing_connected(data->hostname);
> > > >  }
> > > >
> > > > -if (migrate_use_zerocopy()) {
> > > > -error_setg(&err, "Zerocopy not available in migration");
> > > > +if (migrate_use_zerocopy() &&
> > > > +(!migrate_use_multifd() ||
> > > > + !qio_channel_has_feature(sioc, 
> > > > QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY) ||
> > > > +  migrate_multifd_compression() != MULTIFD_COMPRESSION_NONE ||
> > > > +  migrate_use_tls())) {
> > > > +error_setg(&err,
> > > > +   "Zerocopy only available for non-compressed non-TLS 
> > > > multifd migration");
> > > >  }
> > > >
> > > >  migration_channel_connect(data->s, sioc, data->hostname, err);
> > >
> > > Do we really want to do this check here?  I think this is really too
> > > late.
> > >
> > > You are not patching migrate_params_check().
> > >
> > > I think that the proper way of doing this is something like:
> > >
> > > if (params->zerocopy &&
> > > (params->parameters.multifd_compression != 
> > > MULTIFD_COMPRESSION_NONE ||
> > >  migrate_use_tls())) {
> > >error_setg(&err,
> > >  "Zerocopy only available for non-compressed non-TLS 
> > > multifd migration");
> > > return false;
> > > }
> > >
> > > You have to do the equivalent of multifd_compression and tls enablement,
> > > to see that zerocopy is not enabled, of course.
> > >
> > > I would prefer to check for QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY there, but
> > > I can't see a way of doing that without a qio.
> >
> > I don't think you need to check that feature flag.
> >
> > The combination of zerocopy and compression is simply illogical
> > and can be rejected unconditionally.
>
> Or we could think of "zerocopy" in a more targeted way.
> It is only "zerocopy" in terms of the final I/O operation.
> Earlier parts of the process may involve copies. IOW, we
> can copy as part of the compress operation, but skip the
> copy when then sending the compressed page.
>
> In practice though this is still unlikely to be something
> we can practically do, as we would need to keep compressed
> pages around for an entire migration iteration until we can
> call 

Re: [PATCH v5 6/6] multifd: Implement zerocopy write in multifd migration (multifd-zerocopy)

2021-12-01 Thread Leonardo Bras Soares Passos
On Tue, Nov 16, 2021 at 1:08 PM Juan Quintela  wrote:
>
> Leonardo Bras  wrote:
> > Implement zerocopy on nocomp_send_write(), by making use of QIOChannel
> > zerocopy interface.
> >
> > Change multifd_send_sync_main() so it can distinguish each iteration sync 
> > from
> > the setup and the completion, so a flush_zerocopy() can be called
> > after each iteration, in order to make sure all dirty pages are sent
> > before a new iteration is started.
> >
> > Also make it return -1 if flush_zerocopy() fails, in order to cancel
> > the migration process, and avoid resuming the guest in the target host
> > without receiving all current RAM.
> >
> > This will work fine on RAM migration because the RAM pages are not usually
> > freed, and there is no problem with changing the page content between
> > async_send() and the actual sending of the buffer, because such a change
> > will dirty the page and cause it to be re-sent in a later iteration anyway.
> >
> > Given that a lot of locked memory may be needed in order to use multifd
> > migration with zerocopy enabled, make it optional by creating a new
> > migration parameter "zerocopy" in QAPI, so low-privileged users can still
> > perform multifd migrations.
>
> How much memory can a non-root program use by default?

On RHEL 8, a standard user is created with a 64 kB max-locked-memory limit
(overall memory seems 'unlimited', though).
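
For reference, that limit can be inspected with getrlimit(); a small
standalone sketch (not part of the patch):

    /* Print the caller's RLIMIT_MEMLOCK, the limit relevant to
     * zero-copy page pinning. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0) {
            /* Many distros default the soft limit to 64 KiB (65536 bytes). */
            printf("RLIMIT_MEMLOCK soft=%llu hard=%llu\n",
                   (unsigned long long)rl.rlim_cur,
                   (unsigned long long)rl.rlim_max);
        }
        return 0;
    }

From a shell, `ulimit -l` shows the same soft limit.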


>
>
> >  static void *multifd_send_thread(void *opaque)
> > @@ -853,6 +875,10 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
> >          goto cleanup;
> >      }
> >
> > +    if (migrate_use_zerocopy()) {
> > +        p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZEROCOPY;
> > +    }
>
> This belongs
>
>
> >      p->c = QIO_CHANNEL(sioc);
> >      qio_channel_set_delay(p->c, false);
> >      p->running = true;
> > @@ -918,6 +944,7 @@ int multifd_save_setup(Error **errp)
> >          p->packet->version = cpu_to_be32(MULTIFD_VERSION);
> >          p->name = g_strdup_printf("multifdsend_%d", i);
> >          p->tls_hostname = g_strdup(s->hostname);
> > +        p->write_flags = 0;
>
> here?

Yeah, makes sense.
I will move it in v6.

>
>          socket_send_channel_create(multifd_new_send_channel_async, p);
>      }
> > diff --git a/migration/socket.c b/migration/socket.c
> > index e26e94aa0c..8e40e0a3fd 100644
> > --- a/migration/socket.c
> > +++ b/migration/socket.c
> > @@ -78,8 +78,13 @@ static void socket_outgoing_migration(QIOTask *task,
> >          trace_migration_socket_outgoing_connected(data->hostname);
> >      }
> >
> > -    if (migrate_use_zerocopy()) {
> > -        error_setg(&err, "Zerocopy not available in migration");
> > +    if (migrate_use_zerocopy() &&
> > +        (!migrate_use_multifd() ||
> > +         !qio_channel_has_feature(sioc, QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY) ||
> > +         migrate_multifd_compression() != MULTIFD_COMPRESSION_NONE ||
> > +         migrate_use_tls())) {
> > +        error_setg(&err,
> > +                   "Zerocopy only available for non-compressed non-TLS multifd migration");
> >      }
> >
> >      migration_channel_connect(data->s, sioc, data->hostname, err);
>
> Do we really want to do this check here?  I think this is really too
> late.
>
> You are not patching migrate_params_check().
>
> I think that the proper way of doing this is something like:
>
> if (params->zerocopy &&
>     (params->parameters.multifd_compression != MULTIFD_COMPRESSION_NONE ||
>      migrate_use_tls())) {
>     error_setg(&err,
>                "Zerocopy only available for non-compressed non-TLS multifd migration");
>     return false;
> }
>
> You have to do the equivalent checks for multifd_compression and TLS
> enablement, to see whether zerocopy can be enabled, of course.

IIUC, following your suggestion and doing this check in
migrate_params_check() instead would allow the misconfiguration to be
caught before migration is attempted.
That seems the best thing to do here.
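
Roughly, I imagine the early check looking something like this (an
untested sketch; the zerocopy field and the helper name are illustrative
and may differ in v6, the other identifiers are existing QEMU ones):

    /* Untested sketch of the early validation; names are illustrative. */
    static bool migrate_params_check_zero_copy(MigrationParameters *params,
                                               Error **errp)
    {
        if (params->zerocopy &&
            (params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
             migrate_use_tls())) {
            error_setg(errp, "Zerocopy only available for non-compressed "
                       "non-TLS multifd migration");
            return false;
        }
        return true;
    }

The QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY probe would then stay in
socket_outgoing_migration(), where a channel actually exists.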


>
> I would prefer to check for QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY there, but
> I can't see a way of doing that without a QIOChannel.

Yeah, I think I should leave the feature testing in here, and move the
parameter testing to migrate_params_check() as commented before.

What do you think?

>
> Later, Juan.
>

Best regards,
Leo




Re: [PATCH v5 4/6] migration: Add zerocopy parameter for QMP/HMP for Linux

2021-12-01 Thread Leonardo Bras Soares Passos
Hello Markus,

On Fri, Nov 12, 2021 at 9:01 AM Markus Armbruster  wrote:
>
> Juan Quintela  writes:
>
> > Leonardo Bras  wrote:
> >> Add a property that allows zerocopy migration of memory pages,
> >> and also include a helper function migrate_use_zerocopy() to check
> >> if it's enabled.
> >>
> >> No code is introduced to actually do the migration, but this allows
> >> future implementations to enable/disable this feature.
> >>
> >> On non-Linux builds this parameter is compiled-out.
> >>
> >> Signed-off-by: Leonardo Bras 
> >
> > Hi
> >
> >> +# @zerocopy: Controls behavior on sending memory pages on migration.
> >> +#            When true, enables a zerocopy mechanism for sending memory
> >> +#            pages, if host supports it.
> >> +#            Defaults to false. (Since 6.2)
> >> +#
> >
> > This needs to be changed to the next release, but that's not a big deal.
>
> Rename to zero-copy while there.  QAPI/QMP strongly prefer separating
> words with dashes.  "zerocopy" is not a word, "zero" and "copy" are.
>
> [...]
>

Fine then.
To make sure it does not look strange, I will change the naming throughout
the code (zerocopy becomes zero-copy or zero_copy depending on the
context).
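
For instance, the QAPI generator maps the dashed wire name onto an
underscored C identifier, so the split would look roughly like the sketch
below (illustrative only, not the generated code; the struct name is made
up):

    /* Illustrative sketch of the dash/underscore split: "zero-copy" on
     * the QMP wire becomes zero_copy in the generated C struct. */
    #include <stdbool.h>

    typedef struct MigrationParametersSketch {
        bool has_zero_copy;   /* presence flag for the optional member */
        bool zero_copy;       /* QMP/HMP name: "zero-copy" */
    } MigrationParametersSketch;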

Thanks for reviewing!

Best regards,
Leo



