[virtio-dev] RE: [PATCH] pci-iov: Add support for unmanaged SR-IOV

2018-03-02 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Saturday, March 3, 2018 2:14 AM
> 
> On Fri, 2 Mar 2018 06:54:17 +
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson
> > > Sent: Friday, March 2, 2018 4:22 AM
> > > >
> > > > I am pretty sure that you are describing is true of some, but not for
> > > > all. I think the Amazon solutions and the virtio solution are doing
> > > > hard partitioning of the part. I will leave it to those guys to speak
> > > > for themselves since I don't know anything about the hardware design
> > > > of those parts.
> > >
> > > I think we'd need device specific knowledge and enablement to be able
> > > to take advantage of any hardware partitioning, otherwise we need to
> > > assume the pf is privileged, as implemented in other sriov devices.
> > >
> > > I'm also trying to imagine whether there's a solution via the new
> > > vfio/mdev interface, where the mdev vendor driver would bind to the
> pf
> > > and effectively present itself as the mdev device.  The vendor driver
> > > could provide sriov capabilities and bear the burden of ensuring that
> > > the pf is used cooperatively.  The only existing mdev vendor drivers are
> > > vGPUs and rely on on-device DMA translation and isolation, such as
> > > through GTTs, but there have been some thoughts on providing IOMMU
> > > based
> > > isolation of mdev/sriov mixed devices (assuming DMA is even required
> > > for userspace management of the pf in this use case).  [Cc Kirti]
> > > Thanks,
> > >
> >
> > Hope not distracting this thread, but above sounds like an interesting
> > idea. Actually we ever brainstormed similar thought for another
> > potential usage - supporting VF live migration. We are already working
> > on an generic extension to allow state save/restore of mdev instance.
> > If vendor driver could further wrap pf as a mdev instance, it could
> > leverage the same framework for a clean state migration on VF. based
> > on mmap callback the vendor driver can easily switch back-and-forth
> > between pass through and trap/emulation of the VF resources. Of
> > course doing so alone doesn't address all the demands of VF live
> > migration (e.g. dirty page tracking still requires other techniques),
> > but it does pave a way toward a general framework to support VF
> > live migration.
> >
> > If above is feasible, finally we could use one mdev framework to
> > manage both mdev and pf/vf assignment, while providing added
> > values which are difficult to achieve today. :-)
> 
> mdev drivers may be the first to support migration, but I wonder if a
> full mdev implementation is necessary for it.  Once the vfio api is
> define, device specific extensions to vfio-pci might be able to
> implement migration more directly.  Thanks,
> 
> Alex

yes technically a full mdev implementation is not necessary. If
device specific extensions will be placed within vfio module, it's
obviously straightforward. What I thought earlier was in case vfio
wants to stay device-agnostic then we probably want device
specific extensions in vendor driver which is loaded but in a 
dummy mode which simply do basic PCI initialization as vfio-pci
and then wrap vf as mdev (since vfio-pci is not the vf driver in
this scenario). It's especially useful for vendor drivers which aim
to support both mdev and sr-iov by sharing common state mgmt.
knowledge, but looks an overkill to other vendor drivers. Possibly
finally we'll allow both - simple device extensions in vfio-pci and 
complex device extensions in vendor drivers through vfio-mdev.

Thanks
Kevin

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH 1/3] pci-iov: Add support for unmanaged SR-IOV

2018-03-02 Thread Alexander Duyck
On Fri, Mar 2, 2018 at 3:59 PM, Alex Williamson
 wrote:
> On Fri, 02 Mar 2018 15:44:25 -0800
> Alexander Duyck  wrote:
>
>> From: Alexander Duyck 
>>
>> This patch is meant to add some basic functionality to support for SR-IOV
>> on devices when the VFs are not managed by the kernel. The functions
>> provided here can be used by drivers such as vfio-pci and virtio to enable
>> SR-IOV on devices that are either managed by userspace, or by some sort of
>> firmware entity respectively.
>>
>> A new sysfs value called sriov_unmanaged_autoprobe has been added. This
>> value is used as the drivers_autoprobe setting of the VFs when they are
>> being managed by an external entity such as userspace or device firmware
>> instead of being managed by the kernel.
>>
>> One side effect of this change is that the sriov_drivers_autoprobe and
>> sriov_unmanaged_autoprobe will only apply their updates when SR-IOV is
>> disabled. Attempts to update them when SR-IOV is in use will only update
>> the local value and will not update sriov->autoprobe.
>>
>> Signed-off-by: Alexander Duyck 
>> ---
>>  Documentation/ABI/testing/sysfs-bus-pci |   17 ++
>>  drivers/pci/iov.c   |   37 
>> +++
>>  drivers/pci/pci-driver.c|2 +-
>>  drivers/pci/pci-sysfs.c |   29 
>>  drivers/pci/pci.h   |4 +++
>>  include/linux/pci.h |1 +
>>  6 files changed, 88 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
>> b/Documentation/ABI/testing/sysfs-bus-pci
>> index 44d4b2be92fd..ff0b6c19cb1a 100644
>> --- a/Documentation/ABI/testing/sysfs-bus-pci
>> +++ b/Documentation/ABI/testing/sysfs-bus-pci
>> @@ -323,3 +323,20 @@ Description:
>>
>>   This is similar to /sys/bus/pci/drivers_autoprobe, but
>>   affects only the VFs associated with a specific PF.
>> +
>> +What:/sys/bus/pci/devices/.../sriov_unmanaged_autoprobe
>> +Date:March 2018
>> +Contact: Alexander Duyck 
>> +Description:
>> + This file is associated with the PF of a device that
>> + supports SR-IOV.  It determines whether newly-enabled VFs
>> + are immediately bound to a driver when the PF driver does
>> + not manage the VFs itself.  It initially contains 0, which
>> + means the kernel will not automatically bind VFs to a driver.
>> + If an application writes 1 to the file before enabling VFs,
>> + the kernel will bind VFs to a compatible driver immediately
>> + after they are enabled.
>> +
>> + This overrides /sys/bus/pci/devices/.../sriov_drivers_autoprobe
>> + when a PF driver is not present to manage a device, or the PF
>> + driver does not provide functionality to support SR-IOV.
>
>
> Given a pf, how does a user determine whether it is managed or unmanaged
> and therefore which autoprobe attributes are in effect?  Thanks,
>
> Alex

Basically it comes down to what driver is loaded on it. For now
vfio-pci and virtio would be the only two using the "unmanaged"
version of things.

Really you don't know which autoprobe is in effect until SR-IOV is
enabled by whatever driver. As such you should really be setting both
the drivers_autoprobe and the unmanaged_autoprobe based on the
expected use case.

- Alex

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Siwei Liu
On Fri, Mar 2, 2018 at 3:12 PM, Samudrala, Sridhar
 wrote:
> On 3/2/2018 1:11 PM, Siwei Liu wrote:
>>
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>>  wrote:
>>>
>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>> netdev is present with the same MAC address. It allows live migration
>>> of a VM with a direct attached VF without the need to setup a bond/team
>>> between a VF and virtio net device in the guest.
>>>
>>> The hypervisor needs to enable only one datapath at any time so that
>>> packets don't get looped back to the VM over the other datapath. When a
>>> VF
>>> is plugged, the virtio datapath link state can be marked as down. The
>>> hypervisor needs to unplug the VF device from the guest on the source
>>> host
>>> and reset the MAC filter of the VF to initiate failover of datapath to
>>> virtio before starting the migration. After the migration is completed,
>>> the destination hypervisor sets the MAC filter on the VF and plugs it
>>> back
>>> to the guest to switch over to VF datapath.
>>>
>>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>> created that acts as a master device and tracks the state of the 2 lower
>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and
>>> a
>>> passthru device with the same MAC is registered as 'active' netdev.
>>>
>>> This patch is based on the discussion initiated by Jesse on this thread.
>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>
>>> Signed-off-by: Sridhar Samudrala 
>>> Signed-off-by: Alexander Duyck 
>>> Reviewed-by: Jesse Brandeburg 
>>> ---
>>>   drivers/net/virtio_net.c | 683
>>> ++-
>>>   1 file changed, 682 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index bcd13fe906ca..f2860d86c952 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -30,6 +30,8 @@
>>>   #include 
>>>   #include 
>>>   #include 
>>> +#include 
>>> +#include 
>>>   #include 
>>>   #include 
>>>
>>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>>  u32 speed;
>>>
>>>  unsigned long guest_offloads;
>>> +
>>> +   /* upper netdev created when BACKUP feature enabled */
>>> +   struct net_device *bypass_netdev;
>>>   };
>>>
>>>   struct padded_vnet_hdr {
>>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev,
>>> struct netdev_bpf *xdp)
>>>  }
>>>   }
>>>
>>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>>> + size_t len)
>>> +{
>>> +   struct virtnet_info *vi = netdev_priv(dev);
>>> +   int ret;
>>> +
>>> +   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>>> +   return -EOPNOTSUPP;
>>> +
>>> +   ret = snprintf(buf, len, "_bkup");
>>> +   if (ret >= len)
>>> +   return -EOPNOTSUPP;
>>> +
>>> +   return 0;
>>> +}
>>> +
>>
>> What if the systemd/udevd is not new enough to enforce the
>> n naming? Would virtio_bypass get a different name
>> than the original virtio_net? Should we detect this earlier and fall
>> back to legacy mode without creating the bypass netdev and ensalving
>> the VF?
>
>
> If udev doesn't support renaming of the devices,  then the upper bypass
> device
> should get the original name and the lower virtio netdev will get the next
> name.

If you got two virtio-net's (say e.g. eth0 and eth1) before the
update, the virtio-bypass interface on the first virtio gets eth0 and
the backup netdev would get eth1? Then the IP address originally on
eth1 gets configurd on the backup interface?

> Hopefully the distros updating the kernel will also move to the new
> systemd/udev.

This is not reliable. I'd opt for a new udev API for this, and fall
back to what it was if don't see fit.

>
>
>
>>
>>>   static const struct net_device_ops virtnet_netdev = {
>>>  .ndo_open= virtnet_open,
>>>  .ndo_stop= virtnet_close,
>>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev =
>>> {
>>>  .ndo_xdp_xmit   = virtnet_xdp_xmit,
>>>  .ndo_xdp_flush  = virtnet_xdp_flush,
>>>  .ndo_features_check = passthru_features_check,
>>> +   .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>>>   };
>>>
>>>   static void virtnet_config_changed_work(struct work_struct *work)
>>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device
>>> *vdev)
>>>  return 0;
>>>   }
>>>
>>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>>> + * is created that acts as a master device and tracks the state of the
>>> + * 2 lower netdevs. The original virtio_net netdev is registered as
>>> + * 'backup' netdev and a passthru device with the same MAC is registered
>>> + * as 'active' netdev.
>>> + *

[virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Siwei Liu
On Fri, Mar 2, 2018 at 1:36 PM, Michael S. Tsirkin  wrote:
> On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote:
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>>  wrote:
>> > This patch enables virtio_net to switch over to a VF datapath when a VF
>> > netdev is present with the same MAC address. It allows live migration
>> > of a VM with a direct attached VF without the need to setup a bond/team
>> > between a VF and virtio net device in the guest.
>> >
>> > The hypervisor needs to enable only one datapath at any time so that
>> > packets don't get looped back to the VM over the other datapath. When a VF
>> > is plugged, the virtio datapath link state can be marked as down. The
>> > hypervisor needs to unplug the VF device from the guest on the source host
>> > and reset the MAC filter of the VF to initiate failover of datapath to
>> > virtio before starting the migration. After the migration is completed,
>> > the destination hypervisor sets the MAC filter on the VF and plugs it back
>> > to the guest to switch over to VF datapath.
>> >
>> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> > created that acts as a master device and tracks the state of the 2 lower
>> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> > passthru device with the same MAC is registered as 'active' netdev.
>> >
>> > This patch is based on the discussion initiated by Jesse on this thread.
>> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>> >
>> > Signed-off-by: Sridhar Samudrala 
>> > Signed-off-by: Alexander Duyck 
>> > Reviewed-by: Jesse Brandeburg 
>> > ---
>> >  drivers/net/virtio_net.c | 683 
>> > ++-
>> >  1 file changed, 682 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> > index bcd13fe906ca..f2860d86c952 100644
>> > --- a/drivers/net/virtio_net.c
>> > +++ b/drivers/net/virtio_net.c
>> > @@ -30,6 +30,8 @@
>> >  #include 
>> >  #include 
>> >  #include 
>> > +#include 
>> > +#include 
>> >  #include 
>> >  #include 
>> >
>> > @@ -206,6 +208,9 @@ struct virtnet_info {
>> > u32 speed;
>> >
>> > unsigned long guest_offloads;
>> > +
>> > +   /* upper netdev created when BACKUP feature enabled */
>> > +   struct net_device *bypass_netdev;
>> >  };
>> >
>> >  struct padded_vnet_hdr {
>> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, 
>> > struct netdev_bpf *xdp)
>> > }
>> >  }
>> >
>> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>> > + size_t len)
>> > +{
>> > +   struct virtnet_info *vi = netdev_priv(dev);
>> > +   int ret;
>> > +
>> > +   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>> > +   return -EOPNOTSUPP;
>> > +
>> > +   ret = snprintf(buf, len, "_bkup");
>> > +   if (ret >= len)
>> > +   return -EOPNOTSUPP;
>> > +
>> > +   return 0;
>> > +}
>> > +
>>
>> What if the systemd/udevd is not new enough to enforce the
>> n naming? Would virtio_bypass get a different name
>> than the original virtio_net?
>
> You mean people using ethX names? Any hardware config change breaks
> these, I don't think that can be helped.

I don't like the way to rely on .ndo_get_phys_port_name - it's fragile
and it does not completely solve the problem it tries to address.
Imagine what can end up with if getting an old udevd, or users already
have exsiting explicit udev rules around phys_port_name. It does not
give you the an ack in saying "yes, I know you're the bypass and
you're the backup, please continue and I will give you both correct
names", or an unacknowlegment saying "no, I don't know what these
extra interfaces are, please go back and leave the VF device alone".
We need new udev API for both feature negotiation and naming, or may
even completely hide the lower interfaces.

>
>> Should we detect this earlier and fall
>> back to legacy mode without creating the bypass netdev and ensalving
>> the VF?
>
> I don't think we can do this with existing kernel/userspace APIs.

That's why I ever said to make udev aware of this new type of combined
device instead of doing hacks here and there around.

Regards,
-Siwei

>
> --
> MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH 3/3] virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices

2018-03-02 Thread Alexander Duyck
From: Alexander Duyck 

Hardware-realized virtio_pci devices can implement SR-IOV, so this
patch enables its use. The device in question is an upcoming Intel
NIC that implements both a virtio_net PF and virtio_net VFs. These
are hardware realizations of what has been up to now been a software
interface.

The device in question has the following 4-part PCI IDs:

PF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 15fe
VF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 05fe

The patch currently needs no check for device ID, because the callback
will never be made for devices that do not assert the capability or
when run on a platform incapable of SR-IOV.

One reason for this patch is because the hardware requires the
vendor ID of a VF to be the same as the vendor ID of the PF that
created it. So it seemed logical to simply have a fully-functioning
virtio_net PF create the VFs. This patch makes that possible.

Signed-off-by: Mark Rustad 
Signed-off-by: Alexander Duyck 
---
 drivers/virtio/virtio_pci_common.c |4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/virtio/virtio_pci_common.c 
b/drivers/virtio/virtio_pci_common.c
index 48d4d1cf1cb6..ca1549393255 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -584,6 +584,7 @@ static void virtio_pci_remove(struct pci_dev *pci_dev)
else
virtio_pci_modern_remove(vp_dev);
 
+   pci_disable_sriov(pci_dev);
pci_disable_device(pci_dev);
put_device(dev);
 }
@@ -596,6 +597,9 @@ static void virtio_pci_remove(struct pci_dev *pci_dev)
 #ifdef CONFIG_PM_SLEEP
.driver.pm  = &virtio_pci_pm_ops,
 #endif
+#ifdef CONFIG_PCI_IOV
+   .sriov_configure = pci_sriov_configure_unmanaged,
+#endif
 };
 
 module_pci_driver(virtio_pci_driver);


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH 2/3] vfio: Add support for unmanaged or userspace managed SR-IOV

2018-03-02 Thread Alexander Duyck
From: Alexander Duyck 

This patch is meant to allow assignment of an SR-IOV enabled PF, as in VFs
have been generated, with vfio-pci. My understanding is the primary use
case for this is something like DPDK running the PF while the VFs are all
assigned to guests.

A secondary effect of this is that it provides an interface through which
it would be possible to enable SR-IOV on drivers that may not have a
physical function that actually manages the device.

Enabling SR-IOV should be pretty straight forward. As long as there are no
userspace processes currently controlling the interface the number of VFs
can be changed, and VFs will be generated without drivers being loaded on
the host. Once the userspace process begins controlling the interface the
number of VFs cannot be updated via the sysfs until the control is
released.

Note the VFs will have drivers load on them in the host if the
sriov_unmanaged_autoprobe is updated to a value of 1. However the behavior
of the VFs in such a setup cannot be guaranteed as the PF will not be
available until the userspace process starts and begins to manage the
device.

For now I am leaving the value as locked when the PF is being controlled
from userspace as a form of synchronization. Basically this way we cannot
have the number of VFs change out from under the process so it should not
require any notification framework, and the configuration can just be read
out via configuration space accesses.

Signed-off-by: Alexander Duyck 
---
 drivers/vfio/pci/vfio_pci.c |   57 +++
 1 file changed, 57 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index b0f759476900..3023bda39aa9 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1224,6 +1224,8 @@ static void vfio_pci_remove(struct pci_dev *pdev)
VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM);
}
 
+   pci_disable_sriov(pdev);
+
if (!disable_idle_d3)
pci_set_power_state(pdev, PCI_D0);
 }
@@ -1260,12 +1262,67 @@ static pci_ers_result_t 
vfio_pci_aer_err_detected(struct pci_dev *pdev,
.error_detected = vfio_pci_aer_err_detected,
 };
 
+static int vfio_pci_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
+{
+   struct vfio_pci_device *vdev;
+   struct vfio_device *device;
+   int err;
+
+   device = vfio_device_get_from_dev(&pdev->dev);
+   if (device == NULL)
+   return -ENODEV;
+
+   vdev = vfio_device_data(device);
+   if (vdev == NULL) {
+   vfio_device_put(device);
+   return -ENODEV;
+   }
+
+   /*
+* If a userspace process is already using this device just return
+* busy and don't allow for any changes.
+*/
+   if (vdev->refcnt) {
+   pci_warn(pdev,
+"PF is currently in use, blocked until released by 
user\n");
+   return -EBUSY;
+   }
+
+   err = pci_sriov_configure_unmanaged(pdev, nr_virtfn);
+   if (err <= 0)
+   return err;
+
+   /*
+* We are now leaving VFs in the control of some unknown PF entity.
+*
+* Best case is a well behaved userspace PF is expected and any VMs
+* that the VFs will be assigned to are dependent on the userspace
+* entity anyway. An example being NFV where maybe the PF is acting
+* as an accelerated interface for a firewall or switch.
+*
+* Worst case is somebody really messed up and just enabled SR-IOV
+* on a device they were planning to assign to a VM somwhere.
+*
+* In either case it is probably best for us to set the taint flag
+* and warn the user since this could get really ugly really quick
+* if this wasn't what they were planning to do.
+*/
+   add_taint(TAINT_USER, LOCKDEP_STILL_OK);
+   pci_warn(pdev,
+"Adding kernel taint for vfio-pci now managing SR-IOV PF 
device\n");
+
+   return nr_virtfn;
+}
+
 static struct pci_driver vfio_pci_driver = {
.name   = "vfio-pci",
.id_table   = NULL, /* only dynamic ids */
.probe  = vfio_pci_probe,
.remove = vfio_pci_remove,
.err_handler= &vfio_err_handlers,
+#ifdef CONFIG_PCI_IOV
+   .sriov_configure = vfio_pci_sriov_configure,
+#endif
 };
 
 struct vfio_devices {


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH 0/3] pci-iov: Add support for unmanaged SR-IOV

2018-03-02 Thread Alexander Duyck
This series is meant to add support for SR-IOV on devices when the VFs are
not managed by the kernel. Examples of recent patches attempting to do this
include:
virto - https://patchwork.kernel.org/patch/10241225/
pci-stub - https://patchwork.kernel.org/patch/10109935/
vfio - https://patchwork.kernel.org/patch/10103353/
uio - https://patchwork.kernel.org/patch/9974031/

Since this is quickly blowing up into a multi-driver problem it is probably
best to implement this solution as generically as possible.

This series is an attempt to do that. What we do with this patch set is 
provide a generic framework to enable SR-IOV in the case that the PF driver 
doesn't support managing the VFs itself.

I based my patch set originally on the patch by Mark Rustad but there isn't
much left after going through and cleaning out the bits that were no longer
needed, and after incorporating the feedback from David Miller. At this point
the only items to be fully reused was his patch description which is now 
present in patch 3 of the set.

I have included the authors of the original 4 patches above in the Cc here.
My hope is to get feedback and/or review on if this works for their use
cases.

My hope is that for now the pci-stub and uio driver approaches can be 
addressed using the current patch that enables vfio-pci support. The only
limitation is that it is also setting the taint flag until we have a better 
solution.

v2: Reduced scope back to just virtio_pci and vfio-pci
Broke into 3 patch set from single patch
Changed autoprobe behavior to always set when num_vfs is set non-zero

Cc: Mark Rustad 
Cc: Maximilian Heyne 
Cc: Liang-Min Wang 
Cc: David Woodhouse 

---

Alexander Duyck (3):
  pci-iov: Add support for unmanaged SR-IOV
  vfio: Add support for unmanaged or userspace managed SR-IOV
  virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices


 Documentation/ABI/testing/sysfs-bus-pci |   17 +
 drivers/pci/iov.c   |   37 
 drivers/pci/pci-driver.c|2 +
 drivers/pci/pci-sysfs.c |   29 
 drivers/pci/pci.h   |4 ++
 drivers/vfio/pci/vfio_pci.c |   57 +++
 drivers/virtio/virtio_pci_common.c  |4 ++
 include/linux/pci.h |1 +
 8 files changed, 149 insertions(+), 2 deletions(-)

--

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH 1/3] pci-iov: Add support for unmanaged SR-IOV

2018-03-02 Thread Alexander Duyck
From: Alexander Duyck 

This patch is meant to add some basic functionality to support for SR-IOV
on devices when the VFs are not managed by the kernel. The functions
provided here can be used by drivers such as vfio-pci and virtio to enable
SR-IOV on devices that are either managed by userspace, or by some sort of
firmware entity respectively.

A new sysfs value called sriov_unmanaged_autoprobe has been added. This
value is used as the drivers_autoprobe setting of the VFs when they are
being managed by an external entity such as userspace or device firmware
instead of being managed by the kernel.

One side effect of this change is that the sriov_drivers_autoprobe and
sriov_unmanaged_autoprobe will only apply their updates when SR-IOV is
disabled. Attempts to update them when SR-IOV is in use will only update
the local value and will not update sriov->autoprobe.

Signed-off-by: Alexander Duyck 
---
 Documentation/ABI/testing/sysfs-bus-pci |   17 ++
 drivers/pci/iov.c   |   37 +++
 drivers/pci/pci-driver.c|2 +-
 drivers/pci/pci-sysfs.c |   29 
 drivers/pci/pci.h   |4 +++
 include/linux/pci.h |1 +
 6 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
b/Documentation/ABI/testing/sysfs-bus-pci
index 44d4b2be92fd..ff0b6c19cb1a 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -323,3 +323,20 @@ Description:
 
This is similar to /sys/bus/pci/drivers_autoprobe, but
affects only the VFs associated with a specific PF.
+
+What:  /sys/bus/pci/devices/.../sriov_unmanaged_autoprobe
+Date:  March 2018
+Contact:   Alexander Duyck 
+Description:
+   This file is associated with the PF of a device that
+   supports SR-IOV.  It determines whether newly-enabled VFs
+   are immediately bound to a driver when the PF driver does
+   not manage the VFs itself.  It initially contains 0, which
+   means the kernel will not automatically bind VFs to a driver.
+   If an application writes 1 to the file before enabling VFs,
+   the kernel will bind VFs to a compatible driver immediately
+   after they are enabled.
+
+   This overrides /sys/bus/pci/devices/.../sriov_drivers_autoprobe
+   when a PF driver is not present to manage a device, or the PF
+   driver does not provide functionality to support SR-IOV.
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 677924ae0350..3dcec1fa86bd 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -446,6 +446,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
pci_read_config_word(dev, pos + PCI_SRIOV_VF_DID, &iov->vf_device);
iov->pgsz = pgsz;
iov->self = dev;
+   iov->autoprobe = true;
iov->drivers_autoprobe = true;
pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap);
pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link);
@@ -683,6 +684,9 @@ int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
if (!dev->is_physfn)
return -ENOSYS;
 
+   /* Update autoprobe setting to reflect managed device */
+   dev->sriov->autoprobe = dev->sriov->drivers_autoprobe;
+
return sriov_enable(dev, nr_virtfn);
 }
 EXPORT_SYMBOL_GPL(pci_enable_sriov);
@@ -807,3 +811,36 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
return dev->sriov->total_VFs;
 }
 EXPORT_SYMBOL_GPL(pci_sriov_get_totalvfs);
+
+/**
+ * pci_sriov_configure_unmanaged - helper to configure unmanaged SR-IOV
+ * @dev: the PCI device
+ * @nr_virtfn: number of virtual functions to enable, 0 to disable
+ *
+ * Used to provide generic enable/disable SR-IOV option for devices
+ * that do not manage the VFs generated by their driver, or have no
+ * driver present.
+ */
+int pci_sriov_configure_unmanaged(struct pci_dev *dev, int nr_virtfn)
+{
+   int err;
+
+   might_sleep();
+
+   if (!dev->is_physfn)
+   return -ENODEV;
+
+   if (!nr_virtfn) {
+   sriov_disable(dev);
+
+   return 0;
+   }
+
+   /* Update autoprobe setting to reflect unmanaged device */
+   dev->sriov->autoprobe = dev->sriov->unmanaged_autoprobe;
+
+   err = sriov_enable(dev, nr_virtfn);
+
+   return err ? err : nr_virtfn;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_configure_unmanaged);
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 3bed6beda051..2cc68dff6130 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -398,7 +398,7 @@ void __weak pcibios_free_irq(struct pci_dev *dev)
 #ifdef CONFIG_PCI_IOV
 static inline bool pci_device_can_probe(struct pci_dev *pdev)
 {
-   return (!pdev->is_virtfn || pdev->ph

Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Samudrala, Sridhar

On 3/2/2018 1:11 PM, Siwei Liu wrote:

On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
 wrote:

This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

When BACKUP feature is enabled, an additional netdev(bypass netdev) is
created that acts as a master device and tracks the state of the 2 lower
netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
passthru device with the same MAC is registered as 'active' netdev.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala 
Signed-off-by: Alexander Duyck 
Reviewed-by: Jesse Brandeburg 
---
  drivers/net/virtio_net.c | 683 ++-
  1 file changed, 682 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bcd13fe906ca..f2860d86c952 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,6 +30,8 @@
  #include 
  #include 
  #include 
+#include 
+#include 
  #include 
  #include 

@@ -206,6 +208,9 @@ struct virtnet_info {
 u32 speed;

 unsigned long guest_offloads;
+
+   /* upper netdev created when BACKUP feature enabled */
+   struct net_device *bypass_netdev;
  };

  struct padded_vnet_hdr {
@@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
 }
  }

+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
+ size_t len)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   int ret;
+
+   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
+   return -EOPNOTSUPP;
+
+   ret = snprintf(buf, len, "_bkup");
+   if (ret >= len)
+   return -EOPNOTSUPP;
+
+   return 0;
+}
+

What if the systemd/udevd is not new enough to enforce the
n naming? Would virtio_bypass get a different name
than the original virtio_net? Should we detect this earlier and fall
back to legacy mode without creating the bypass netdev and ensalving
the VF?


If udev doesn't support renaming of the devices,  then the upper bypass device
should get the original name and the lower virtio netdev will get the next name.
Hopefully the distros updating the kernel will also move to the new 
systemd/udev.





  static const struct net_device_ops virtnet_netdev = {
 .ndo_open= virtnet_open,
 .ndo_stop= virtnet_close,
@@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
 .ndo_xdp_xmit   = virtnet_xdp_xmit,
 .ndo_xdp_flush  = virtnet_xdp_flush,
 .ndo_features_check = passthru_features_check,
+   .ndo_get_phys_port_name = virtnet_get_phys_port_name,
  };

  static void virtnet_config_changed_work(struct work_struct *work)
@@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev)
 return 0;
  }

+/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
+ * When BACKUP feature is enabled, an additional netdev(bypass netdev)
+ * is created that acts as a master device and tracks the state of the
+ * 2 lower netdevs. The original virtio_net netdev is registered as
+ * 'backup' netdev and a passthru device with the same MAC is registered
+ * as 'active' netdev.
+ */
+
+/* bypass state maintained when BACKUP feature is enabled */
+struct virtnet_bypass_info {
+   /* passthru netdev with same MAC */
+   struct net_device __rcu *active_netdev;
+
+   /* virtio_net netdev */
+   struct net_device __rcu *backup_netdev;
+
+   /* active netdev stats */
+   struct rtnl_link_stats64 active_stats;
+
+   /* backup netdev stats */
+   struct rtnl_link_stats64 backup_stats;
+
+   /* aggregated stats */
+   struct rtnl_link_stats64 bypass_stats;
+
+   /* spinlock while updating stats */
+   spinlock_t stats_lock;
+};
+
+static void virtnet_bypass_child_open(struct net_device *dev,
+ struct net_device *child_netdev)
+{
+   int err = dev_open(child_netdev);
+
+   if (err)
+   netdev_warn(dev, "unable to open slave: %s: %

Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Siwei Liu
On Fri, Mar 2, 2018 at 1:31 PM, Michael S. Tsirkin  wrote:
> On Fri, Mar 02, 2018 at 12:44:56PM -0800, Siwei Liu wrote:
>> On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin  wrote:
>> > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
>> >>
>> >>
>> >> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
>> >> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
>> >> > > The design limits things to a 1:1 relationship since we just have the
>> >> > > child and backup pointers, but I don't think I am seeing exception
>> >> > > handling to prevent us from overwriting the child pointers so there
>> >> > > may be a leak there.
>> >> > >
>> >> > > Thanks.
>> >> > >
>> >> > > - Alex
>> >> > In fact maintaining a list in that case would be nicer, and
>> >> > just use an arbitrary one.
>> >> > E.g. one can see how a user wanting to swap device 1 for device 2
>> >> > might first add device 2 with same MAC then drop device 1.
>> >>
>> >> It should be possible to swap VF1 with VF2 by
>> >> 1.- enabling virtio link
>> >> 2.- unplugging VF1
>> >> 3.- plugging VF2
>> >> 4.- disabling virtio link
>> >>
>> >
>> > True, but it isn't hard to avoid breakage if user
>> > swapped steps 2 and 3. No need to make it more
>> > fragile that it has to be.
>>
>> The migration case, VF2 is associated with another PF on another
>> machine (destination), I wonder how it is possible.
>
> E.g. you want to remove the PF so you unplug the VF
> then add another VF of the same PF.
>
>> Even with local plugging of VF2 on the same PF, the MAC address
>> requirement (VF1's == VF2's) would fail the MAC address assignment on
>> VF2.
>>
>> -Siwei
>
> Why would it fail? These are separate cards.

OK. I realized that you may talk about assigning a VF on a diffferent
PF (VF1 on PF1 while VF2 on PF2). And we might assign a pass-through
device rather than a VF. Yes, it's indeed possible that may happen but
I take it as a further step down (another patch maybe) as it would
involve changes to notify the network with gratuituious ARP and/or
unsolicited ND advertisement of the MAC address association with the
new port.

-Siwei

>
>> >
>> > --
>> > MST
>> >
>> > -
>> > To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
>> > For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
>> >

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Michael S. Tsirkin
On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote:
> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>  wrote:
> > This patch enables virtio_net to switch over to a VF datapath when a VF
> > netdev is present with the same MAC address. It allows live migration
> > of a VM with a direct attached VF without the need to setup a bond/team
> > between a VF and virtio net device in the guest.
> >
> > The hypervisor needs to enable only one datapath at any time so that
> > packets don't get looped back to the VM over the other datapath. When a VF
> > is plugged, the virtio datapath link state can be marked as down. The
> > hypervisor needs to unplug the VF device from the guest on the source host
> > and reset the MAC filter of the VF to initiate failover of datapath to
> > virtio before starting the migration. After the migration is completed,
> > the destination hypervisor sets the MAC filter on the VF and plugs it back
> > to the guest to switch over to VF datapath.
> >
> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> > created that acts as a master device and tracks the state of the 2 lower
> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> > passthru device with the same MAC is registered as 'active' netdev.
> >
> > This patch is based on the discussion initiated by Jesse on this thread.
> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
> >
> > Signed-off-by: Sridhar Samudrala 
> > Signed-off-by: Alexander Duyck 
> > Reviewed-by: Jesse Brandeburg 
> > ---
> >  drivers/net/virtio_net.c | 683 
> > ++-
> >  1 file changed, 682 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index bcd13fe906ca..f2860d86c952 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -30,6 +30,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >  #include 
> >  #include 
> >
> > @@ -206,6 +208,9 @@ struct virtnet_info {
> > u32 speed;
> >
> > unsigned long guest_offloads;
> > +
> > +   /* upper netdev created when BACKUP feature enabled */
> > +   struct net_device *bypass_netdev;
> >  };
> >
> >  struct padded_vnet_hdr {
> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, 
> > struct netdev_bpf *xdp)
> > }
> >  }
> >
> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
> > + size_t len)
> > +{
> > +   struct virtnet_info *vi = netdev_priv(dev);
> > +   int ret;
> > +
> > +   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> > +   return -EOPNOTSUPP;
> > +
> > +   ret = snprintf(buf, len, "_bkup");
> > +   if (ret >= len)
> > +   return -EOPNOTSUPP;
> > +
> > +   return 0;
> > +}
> > +
> 
> What if the systemd/udevd is not new enough to enforce the
> n naming? Would virtio_bypass get a different name
> than the original virtio_net?

You mean people using ethX names? Any hardware config change breaks
these, I don't think that can be helped.

> Should we detect this earlier and fall
> back to legacy mode without creating the bypass netdev and ensalving
> the VF?

I don't think we can do this with existing kernel/userspace APIs.

-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Michael S. Tsirkin
On Fri, Mar 02, 2018 at 12:56:21PM -0800, Samudrala, Sridhar wrote:
> 
> 
> On 3/2/2018 12:44 PM, Siwei Liu wrote:
> > On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin  wrote:
> > > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
> > > > 
> > > > On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
> > > > > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> > > > > > The design limits things to a 1:1 relationship since we just have 
> > > > > > the
> > > > > > child and backup pointers, but I don't think I am seeing exception
> > > > > > handling to prevent us from overwriting the child pointers so there
> > > > > > may be a leak there.
> > > > > > 
> > > > > > Thanks.
> > > > > > 
> > > > > > - Alex
> > > > > In fact maintaining a list in that case would be nicer, and
> > > > > just use an arbitrary one.
> > > > > E.g. one can see how a user wanting to swap device 1 for device 2
> > > > > might first add device 2 with same MAC then drop device 1.
> > > > It should be possible to swap VF1 with VF2 by
> > > > 1.- enabling virtio link
> > > > 2.- unplugging VF1
> > > > 3.- plugging VF2
> > > > 4.- disabling virtio link
> > > > 
> > > True, but it isn't hard to avoid breakage if user
> > > swapped steps 2 and 3. No need to make it more
> > > fragile that it has to be.
> > The migration case, VF2 is associated with another PF on another
> > machine (destination), I wonder how it is possible.
> > 
> > Even with local plugging of VF2 on the same PF, the MAC address
> > requirement (VF1's == VF2's) would fail the MAC address assignment on
> > VF2.
> > 
> > 
> I didn't include updating the MAC filter step in the above sequence.
> So definitely plugging 2 VFs with the same MAC address will be an issue.

If these are two separate PFs then I don't see why -
each has its own MAC filter.

> Here is the more complete sequence of steps that are required to
> enable live migration.

Replacing VF1 by VF2 is not about migration. It's to remove PF
from host e.g. for service.

-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Michael S. Tsirkin
On Fri, Mar 02, 2018 at 12:44:56PM -0800, Siwei Liu wrote:
> On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin  wrote:
> > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
> >>
> >>
> >> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
> >> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> >> > > The design limits things to a 1:1 relationship since we just have the
> >> > > child and backup pointers, but I don't think I am seeing exception
> >> > > handling to prevent us from overwriting the child pointers so there
> >> > > may be a leak there.
> >> > >
> >> > > Thanks.
> >> > >
> >> > > - Alex
> >> > In fact maintaining a list in that case would be nicer, and
> >> > just use an arbitrary one.
> >> > E.g. one can see how a user wanting to swap device 1 for device 2
> >> > might first add device 2 with same MAC then drop device 1.
> >>
> >> It should be possible to swap VF1 with VF2 by
> >> 1.- enabling virtio link
> >> 2.- unplugging VF1
> >> 3.- plugging VF2
> >> 4.- disabling virtio link
> >>
> >
> > True, but it isn't hard to avoid breakage if user
> > swapped steps 2 and 3. No need to make it more
> > fragile that it has to be.
> 
> The migration case, VF2 is associated with another PF on another
> machine (destination), I wonder how it is possible.

E.g. you want to remove the PF so you unplug the VF
then add another VF of the same PF.

> Even with local plugging of VF2 on the same PF, the MAC address
> requirement (VF1's == VF2's) would fail the MAC address assignment on
> VF2.
> 
> -Siwei

Why would it fail? These are separate cards.

> >
> > --
> > MST
> >
> > -
> > To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
> >

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Siwei Liu
On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
 wrote:
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to enable only one datapath at any time so that
> packets don't get looped back to the VM over the other datapath. When a VF
> is plugged, the virtio datapath link state can be marked as down. The
> hypervisor needs to unplug the VF device from the guest on the source host
> and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed,
> the destination hypervisor sets the MAC filter on the VF and plugs it back
> to the guest to switch over to VF datapath.
>
> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> created that acts as a master device and tracks the state of the 2 lower
> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> passthru device with the same MAC is registered as 'active' netdev.
>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> Signed-off-by: Sridhar Samudrala 
> Signed-off-by: Alexander Duyck 
> Reviewed-by: Jesse Brandeburg 
> ---
>  drivers/net/virtio_net.c | 683 
> ++-
>  1 file changed, 682 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index bcd13fe906ca..f2860d86c952 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -30,6 +30,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>
> @@ -206,6 +208,9 @@ struct virtnet_info {
> u32 speed;
>
> unsigned long guest_offloads;
> +
> +   /* upper netdev created when BACKUP feature enabled */
> +   struct net_device *bypass_netdev;
>  };
>
>  struct padded_vnet_hdr {
> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct 
> netdev_bpf *xdp)
> }
>  }
>
> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
> + size_t len)
> +{
> +   struct virtnet_info *vi = netdev_priv(dev);
> +   int ret;
> +
> +   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> +   return -EOPNOTSUPP;
> +
> +   ret = snprintf(buf, len, "_bkup");
> +   if (ret >= len)
> +   return -EOPNOTSUPP;
> +
> +   return 0;
> +}
> +

What if the systemd/udevd is not new enough to enforce the
n naming? Would virtio_bypass get a different name
than the original virtio_net? Should we detect this earlier and fall
back to legacy mode without creating the bypass netdev and ensalving
the VF?

>  static const struct net_device_ops virtnet_netdev = {
> .ndo_open= virtnet_open,
> .ndo_stop= virtnet_close,
> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
> .ndo_xdp_xmit   = virtnet_xdp_xmit,
> .ndo_xdp_flush  = virtnet_xdp_flush,
> .ndo_features_check = passthru_features_check,
> +   .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>  };
>
>  static void virtnet_config_changed_work(struct work_struct *work)
> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device 
> *vdev)
> return 0;
>  }
>
> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
> + * is created that acts as a master device and tracks the state of the
> + * 2 lower netdevs. The original virtio_net netdev is registered as
> + * 'backup' netdev and a passthru device with the same MAC is registered
> + * as 'active' netdev.
> + */
> +
> +/* bypass state maintained when BACKUP feature is enabled */
> +struct virtnet_bypass_info {
> +   /* passthru netdev with same MAC */
> +   struct net_device __rcu *active_netdev;
> +
> +   /* virtio_net netdev */
> +   struct net_device __rcu *backup_netdev;
> +
> +   /* active netdev stats */
> +   struct rtnl_link_stats64 active_stats;
> +
> +   /* backup netdev stats */
> +   struct rtnl_link_stats64 backup_stats;
> +
> +   /* aggregated stats */
> +   struct rtnl_link_stats64 bypass_stats;
> +
> +   /* spinlock while updating stats */
> +   spinlock_t stats_lock;
> +};
> +
> +static void virtnet_bypass_child_open(struct net_device *dev,
> + struct net_device *child_netdev)
> +{
> +   int err = dev_open(child_netdev);
> +
> +   if (err)
> +   netdev_warn(dev, "unable to open slave: %s: %d\n",
> +   child_netdev->name, err);
> +}
> +

Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Samudrala, Sridhar



On 3/2/2018 12:44 PM, Siwei Liu wrote:

On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin  wrote:

On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:


On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:

On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:

The design limits things to a 1:1 relationship since we just have the
child and backup pointers, but I don't think I am seeing exception
handling to prevent us from overwriting the child pointers so there
may be a leak there.

Thanks.

- Alex

In fact maintaining a list in that case would be nicer, and
just use an arbitrary one.
E.g. one can see how a user wanting to swap device 1 for device 2
might first add device 2 with same MAC then drop device 1.

It should be possible to swap VF1 with VF2 by
1.- enabling virtio link
2.- unplugging VF1
3.- plugging VF2
4.- disabling virtio link


True, but it isn't hard to avoid breakage if user
swapped steps 2 and 3. No need to make it more
fragile that it has to be.

The migration case, VF2 is associated with another PF on another
machine (destination), I wonder how it is possible.

Even with local plugging of VF2 on the same PF, the MAC address
requirement (VF1's == VF2's) would fail the MAC address assignment on
VF2.



I didn't include updating the MAC filter step in the above sequence.
So definitely plugging 2 VFs with the same MAC address will be an issue.

Here is the more complete sequence of steps that are required to
enable live migration.

Source Hypervisor
- Bring up the virtio link
- Hot Unplug VF from the VM
- Delete FDB entry for the MAC on the Bridge associated with virtio/tap
- Remove the MAC filter associated with the VF
- Migrate VM to destination

Destination Hypervisor (after migration is completed)
- Set MAC filter for the VF
- Hot Plug VF to the VM
- Bring down the virtio link

Thanks
Sridhar




-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Siwei Liu
On Fri, Mar 2, 2018 at 11:42 AM, Michael S. Tsirkin  wrote:
> On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>> >Yeah, this code essentially calls out the "shareable" code with a
>> >comment at the start and end of the section what defines the
>> >virtio_bypass functionality. It would just be a matter of mostly
>> >cutting and pasting to put it into a separate driver module.
>>
>> Please put it there and unite the use of it with netvsc.
>
> Surely, adding this to other drivers (e.g. might this be handy for xen
> too?) can be left for a separate patchset. Let's get one device merged
> first.
Agreed.

-Siwei

>
> --
> MST
>
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
>

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Siwei Liu
On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin  wrote:
> On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
>>
>>
>> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
>> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
>> > > The design limits things to a 1:1 relationship since we just have the
>> > > child and backup pointers, but I don't think I am seeing exception
>> > > handling to prevent us from overwriting the child pointers so there
>> > > may be a leak there.
>> > >
>> > > Thanks.
>> > >
>> > > - Alex
>> > In fact maintaining a list in that case would be nicer, and
>> > just use an arbitrary one.
>> > E.g. one can see how a user wanting to swap device 1 for device 2
>> > might first add device 2 with same MAC then drop device 1.
>>
>> It should be possible to swap VF1 with VF2 by
>> 1.- enabling virtio link
>> 2.- unplugging VF1
>> 3.- plugging VF2
>> 4.- disabling virtio link
>>
>
> True, but it isn't hard to avoid breakage if user
> swapped steps 2 and 3. No need to make it more
> fragile that it has to be.

The migration case, VF2 is associated with another PF on another
machine (destination), I wonder how it is possible.

Even with local plugging of VF2 on the same PF, the MAC address
requirement (VF1's == VF2's) would fail the MAC address assignment on
VF2.

-Siwei

>
> --
> MST
>
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
>

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH 4/4] virtio-net: add linkspeed and duplex settings to virtio-net

2018-03-02 Thread Michael S. Tsirkin
On Fri, Mar 02, 2018 at 11:59:00AM -0500, Jason Baron wrote:
> On 03/02/2018 02:14 AM, Jason Wang wrote:
> > 
> > 
> > On 2018年03月02日 11:46, Jason Baron wrote:
> >> Although linkspeed and duplex can be set in a linux guest via 'ethtool
> >> -s',
> >> this requires custom ethtool commands for virtio-net by default.
> >>
> >> Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
> >> the hypervisor to export a linkspeed and duplex setting. The user can
> >> subsequently overwrite it later if desired via: 'ethtool -s'.
> >>
> >> Linkspeed and duplex settings can be set as:
> >> '-device virtio-net,speed=1,duplex=full'
> > 
> > I was thinking whether or not it's better to decide the duplex by the
> > type of backends.
> > 
> > E.g userspace and vhost-kernel implement a in fact half duplex. But dpdk
> > implement a full duplex.
> 
> Interesting - could this be derived only from the backend 'type'. IE:
> NET_CLIENT_DRIVER_TAP, NET_CLIENT_DRIVER_VHOST_USER...
> 
> 
> I was also thinking this could be specified as 'duplex=backend', in
> addition to the proposed 'duplex=full' or 'duplex=half'?
> 
> Thanks,
> 
> -Jason

I'd say it would make more sense to teach backends to obey what's
specified by the user. E.g. if vhost gets a duplex config,
create two threads.

But I think all that's for future, we can just fake it for
now - the current uses don't seem to particularly care about whether
virtio actually is or isn't a duplex.



> > 
> > Thanks
> > 
> >>
> >> where speed is [0...INT_MAX], and duplex is ["half"|"full"].
> >>
> >> Signed-off-by: Jason Baron
> >> Cc: "Michael S. Tsirkin"
> >> Cc: Jason Wang
> >> Cc:virtio-dev@lists.oasis-open.org
> >> ---
> > 

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH 4/4] virtio-net: add linkspeed and duplex settings to virtio-net

2018-03-02 Thread Michael S. Tsirkin
On Fri, Mar 02, 2018 at 03:14:01PM +0800, Jason Wang wrote:
> 
> 
> On 2018年03月02日 11:46, Jason Baron wrote:
> > Although linkspeed and duplex can be set in a linux guest via 'ethtool -s',
> > this requires custom ethtool commands for virtio-net by default.
> > 
> > Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
> > the hypervisor to export a linkspeed and duplex setting. The user can
> > subsequently overwrite it later if desired via: 'ethtool -s'.
> > 
> > Linkspeed and duplex settings can be set as:
> > '-device virtio-net,speed=1,duplex=full'
> 
> I was thinking whether or not it's better to decide the duplex by the type
> of backends.
> 
> E.g userspace and vhost-kernel implement a in fact half duplex. But dpdk
> implement a full duplex.
> 
> Thanks

OTOH it's a priority for some people to be able to support migration
between different backend types. Breaking that won't be nice.

> > 
> > where speed is [0...INT_MAX], and duplex is ["half"|"full"].
> > 
> > Signed-off-by: Jason Baron
> > Cc: "Michael S. Tsirkin"
> > Cc: Jason Wang
> > Cc:virtio-dev@lists.oasis-open.org
> > ---

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Michael S. Tsirkin
On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
> 
> 
> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> > > The design limits things to a 1:1 relationship since we just have the
> > > child and backup pointers, but I don't think I am seeing exception
> > > handling to prevent us from overwriting the child pointers so there
> > > may be a leak there.
> > > 
> > > Thanks.
> > > 
> > > - Alex
> > In fact maintaining a list in that case would be nicer, and
> > just use an arbitrary one.
> > E.g. one can see how a user wanting to swap device 1 for device 2
> > might first add device 2 with same MAC then drop device 1.
> 
> It should be possible to swap VF1 with VF2 by
> 1.- enabling virtio link
> 2.- unplugging VF1
> 3.- plugging VF2
> 4.- disabling virtio link
> 

True, but it isn't hard to avoid breakage if user
swapped steps 2 and 3. No need to make it more
fragile that it has to be.

-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH] README.md: add hints for contributors

2018-03-02 Thread Michael S. Tsirkin
Add links to admin repo and mailing lists.

Signed-off-by: Michael S. Tsirkin 
---
 README.md | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 7dadb68..a4fdc4e 100644
--- a/README.md
+++ b/README.md
@@ -27,13 +27,19 @@
 
 
 
+
 Further Description of this Repository
+Providing Feedback
+See https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=virtio#feedback";>
+https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=virtio#feedback
+Note for Contributors
+Contributors should review TC specific
+process rules under "Further Description of this Repository"
+in https://github.com/oasis-tcs/virtio-admin";>https://github.com/oasis-tcs/virtio-admin.
 
-[Any narrative content may be provided here by the TC, for example, if the 
Members wish to provide an extended statement of purpose.]
 
-
-
-
 Contact
 Please send questions or comments about https://www.oasis-open.org/resources/tcadmin/github-repositories-for-oasis-tc-members-chartered-work";>OASIS
 TC GitHub repositories to mailto:ro...@oasis-open.org";>Robin 
Cover and mailto:chet.ens...@oasis-open.org";>Chet Ensign.  For 
questions about content in this repository, please contact the TC Chair or 
Co-Chairs as listed on the the virtio TC's https://www.oasis-open.org/committees/virtio/";>home page.
 
-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Samudrala, Sridhar



On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:

On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:

The design limits things to a 1:1 relationship since we just have the
child and backup pointers, but I don't think I am seeing exception
handling to prevent us from overwriting the child pointers so there
may be a leak there.

Thanks.

- Alex

In fact maintaining a list in that case would be nicer, and
just use an arbitrary one.
E.g. one can see how a user wanting to swap device 1 for device 2
might first add device 2 with same MAC then drop device 1.


It should be possible to swap VF1 with VF2 by
- enabling virtio link
- unplugging VF1
- plugging VF2
- disabling virtio link



-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Michael S. Tsirkin
On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
> >Yeah, this code essentially calls out the "shareable" code with a
> >comment at the start and end of the section what defines the
> >virtio_bypass functionality. It would just be a matter of mostly
> >cutting and pasting to put it into a separate driver module.
> 
> Please put it there and unite the use of it with netvsc.

Surely, adding this to other drivers (e.g. might this be handy for xen
too?) can be left for a separate patchset. Let's get one device merged
first.

-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Michael S. Tsirkin
On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
> The design limits things to a 1:1 relationship since we just have the
> child and backup pointers, but I don't think I am seeing exception
> handling to prevent us from overwriting the child pointers so there
> may be a leak there.
> 
> Thanks.
> 
> - Alex

In fact maintaining a list in that case would be nicer, and
just use an arbitrary one.
E.g. one can see how a user wanting to swap device 1 for device 2
might first add device 2 with same MAC then drop device 1.

-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v3 2/3] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-03-02 Thread Michael S. Tsirkin
On Fri, Mar 02, 2018 at 04:47:29PM +0800, Wei Wang wrote:
> diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
> index af49e19..16a2aae 100644
> --- a/include/sysemu/balloon.h
> +++ b/include/sysemu/balloon.h

...

> +typedef void (QEMUBalloonFreePageStart)(void *opaque);
> +typedef void (QEMUBalloonFreePageStop)(void *opaque);

So I think the rule is that no bitmap sync must happen
between these two, otherwise a hint might arrive and
override the sync output.

Should be documented I think.

-- 
MST

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v3 2/3] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-03-02 Thread Michael S. Tsirkin
On Fri, Mar 02, 2018 at 04:47:29PM +0800, Wei Wang wrote:
> The new feature enables the virtio-balloon device to receive hints of
> guest free pages from the free page vq. Callers call the
> free_page_start API to start the reporting, which creates a thread to
> poll for free page hints. The free_page_stop API stops the reporting and
> makes the thread exit.
> 
> Signed-off-by: Wei Wang 
> Signed-off-by: Liang Li 
> CC: Michael S. Tsirkin 
> CC: Dr. David Alan Gilbert 
> CC: Juan Quintela 
> ---
>  balloon.c   |  49 +--
>  hw/virtio/virtio-balloon.c  | 172 
> +---
>  include/hw/virtio/virtio-balloon.h  |  14 +-
>  include/standard-headers/linux/virtio_balloon.h |   7 +
>  include/sysemu/balloon.h|  15 ++-
>  5 files changed, 228 insertions(+), 29 deletions(-)
> 
> diff --git a/balloon.c b/balloon.c
> index d8dd6fe..b0b0749 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -36,6 +36,9 @@
>  
>  static QEMUBalloonEvent *balloon_event_fn;
>  static QEMUBalloonStatus *balloon_stat_fn;
> +static QEMUBalloonFreePageSupport *balloon_free_page_support_fn;
> +static QEMUBalloonFreePageStart *balloon_free_page_start_fn;
> +static QEMUBalloonFreePageStop *balloon_free_page_stop_fn;
>  static void *balloon_opaque;
>  static bool balloon_inhibited;
>  
> @@ -64,19 +67,42 @@ static bool have_balloon(Error **errp)
>  return true;
>  }
>  
> -int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> - QEMUBalloonStatus *stat_func, void *opaque)
> +bool balloon_free_page_support(void)
>  {
> -if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
> -/* We're already registered one balloon handler.  How many can
> - * a guest really have?
> - */
> -return -1;
> +return balloon_free_page_support_fn &&
> +   balloon_free_page_support_fn(balloon_opaque);
> +}
> +
> +void balloon_free_page_start(void)
> +{
> +balloon_free_page_start_fn(balloon_opaque);
> +}
> +
> +void balloon_free_page_stop(void)
> +{
> +balloon_free_page_stop_fn(balloon_opaque);
> +}
> +
> +void qemu_add_balloon_handler(QEMUBalloonEvent *event_fn,
> +  QEMUBalloonStatus *stat_fn,
> +  QEMUBalloonFreePageSupport 
> *free_page_support_fn,
> +  QEMUBalloonFreePageStart *free_page_start_fn,
> +  QEMUBalloonFreePageStop *free_page_stop_fn,
> +  void *opaque)
> +{
> +if (balloon_event_fn || balloon_stat_fn || balloon_free_page_support_fn 
> ||
> +balloon_free_page_start_fn || balloon_free_page_stop_fn ||
> +balloon_opaque) {
> +/* We already registered one balloon handler. */
> +return;
>  }
> -balloon_event_fn = event_func;
> -balloon_stat_fn = stat_func;
> +
> +balloon_event_fn = event_fn;
> +balloon_stat_fn = stat_fn;
> +balloon_free_page_support_fn = free_page_support_fn;
> +balloon_free_page_start_fn = free_page_start_fn;
> +balloon_free_page_stop_fn = free_page_stop_fn;
>  balloon_opaque = opaque;
> -return 0;
>  }
>  
>  void qemu_remove_balloon_handler(void *opaque)
> @@ -86,6 +112,9 @@ void qemu_remove_balloon_handler(void *opaque)
>  }
>  balloon_event_fn = NULL;
>  balloon_stat_fn = NULL;
> +balloon_free_page_support_fn = NULL;
> +balloon_free_page_start_fn = NULL;
> +balloon_free_page_stop_fn = NULL;
>  balloon_opaque = NULL;
>  }
>  
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index 4822449..4607879 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -31,6 +31,7 @@
>  
>  #include "hw/virtio/virtio-bus.h"
>  #include "hw/virtio/virtio-access.h"
> +#include "migration/misc.h"
>  
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>  
> @@ -308,6 +309,100 @@ out:
>  }
>  }
>  
> +static void *virtio_balloon_poll_free_page_hints(void *opaque)
> +{
> +VirtQueueElement *elem;
> +VirtIOBalloon *dev = opaque;
> +VirtQueue *vq = dev->free_page_vq;
> +uint32_t id;
> +size_t size;
> +
> +/*
> + * Poll the vq till the status changed to STOP. This happens when
> + * the guest finishes reporting hints or the migration thread actively
> + * stops the reporting.
> + */
> +while (dev->free_page_report_status != FREE_PAGE_REPORT_S_STOP) {
> +elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> +if (!elem) {
> +continue;
> +}
> +
> +if (elem->out_num) {
> +size = iov_to_buf(elem->out_sg, elem->out_num, 0, &id, 
> sizeof(id));
> +virtqueue_push(vq, elem, size);
> +g_free(elem);
> +if (unlikely(size != sizeof(id))) {
> +warn_report("%s: received an incorrect cmd id", __func__);
> +break;

[virtio-dev] Re: [PATCH 2/4] rocker: drop local duplex definitions

2018-03-02 Thread Michael S. Tsirkin
On Thu, Mar 01, 2018 at 10:46:34PM -0500, Jason Baron wrote:
> Make use of duplex definitions from net/eth.h.
> 
> Signed-off-by: Jason Baron 
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: Jiri Pirko 
> Cc: virtio-dev@lists.oasis-open.org
> ---
>  hw/net/rocker/rocker_fp.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/hw/net/rocker/rocker_fp.c b/hw/net/rocker/rocker_fp.c
> index 4b3c984..13a14a0 100644
> --- a/hw/net/rocker/rocker_fp.c
> +++ b/hw/net/rocker/rocker_fp.c
> @@ -16,17 +16,13 @@
>  
>  #include "qemu/osdep.h"
>  #include "net/clients.h"
> +#include "net/eth.h"
>  
>  #include "rocker.h"
>  #include "rocker_hw.h"
>  #include "rocker_fp.h"
>  #include "rocker_world.h"
>  
> -enum duplex {
> -DUPLEX_HALF = 0,
> -DUPLEX_FULL
> -};
> -

This is hardware emulation, so I actually suspect it
should be renamed ROCKER_FP_HALF/ROCKER_FP_FULL.

>  struct fp_port {
>  Rocker *r;
>  World *world;
> -- 
> 2.7.4

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH 1/4] eth: add speed and duplex definitions

2018-03-02 Thread Michael S. Tsirkin
On Thu, Mar 01, 2018 at 10:46:33PM -0500, Jason Baron wrote:
> Pull in definitions for SPEED_UNKNOWN, DUPLEX_UNKNOWN, DUPLEX_HALF,
> and DUPLEX_FULL.
> 
> Signed-off-by: Jason Baron 
> Cc: "Michael S. Tsirkin" 
> Cc: Jason Wang 
> Cc: virtio-dev@lists.oasis-open.org
> ---
>  include/net/eth.h | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/include/net/eth.h b/include/net/eth.h
> index 09054a5..9843678 100644
> --- a/include/net/eth.h
> +++ b/include/net/eth.h
> @@ -417,4 +417,11 @@ bool
>  eth_parse_ipv6_hdr(const struct iovec *pkt, int pkt_frags,
> size_t ip6hdr_off, eth_ip6_hdr_info *info);
>  
> +/* ethtool defines - from linux/ethtool.h */
> +#define SPEED_UNKNOWN   -1
> +
> +#define DUPLEX_HALF 0x00
> +#define DUPLEX_FULL 0x01
> +#define DUPLEX_UNKNOWN  0xff
> +
>  #endif

While that's not a lot, I think we should import linux/ethtool.h into
include/standard-headers/linux/ using scripts/update-linux-headers.sh

> -- 
> 2.7.4

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Alexander Duyck
On Fri, Mar 2, 2018 at 8:37 AM, Samudrala, Sridhar
 wrote:
>
>
> On 3/2/2018 8:20 AM, Jiri Pirko wrote:
>
> Fri, Mar 02, 2018 at 04:26:25PM CET, alexander.du...@gmail.com wrote:
>
> On Fri, Mar 2, 2018 at 12:36 AM, Jiri Pirko  wrote:
>
> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudr...@intel.com wrote:
>
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to enable only one datapath at any time so that
> packets don't get looped back to the VM over the other datapath. When a VF
> is plugged, the virtio datapath link state can be marked as down. The
> hypervisor needs to unplug the VF device from the guest on the source host
> and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed,
> the destination hypervisor sets the MAC filter on the VF and plugs it back
> to the guest to switch over to VF datapath.
>
> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> created that acts as a master device and tracks the state of the 2 lower
> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> passthru device with the same MAC is registered as 'active' netdev.
>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> Signed-off-by: Sridhar Samudrala 
> Signed-off-by: Alexander Duyck 
> Reviewed-by: Jesse Brandeburg 
> ---
> drivers/net/virtio_net.c | 683
> ++-
>
> As I wrote to the discussion of the other version of this patchset,
> I strongly believe you need to have a common part in net/core/ that will
> be shared by both virtio_net and netvsc. There there should be explicit
> limits for this in-driver bonding solution, like max 1 vf slave, etc.
>
> Yeah, this code essentially calls out the "shareable" code with a
> comment at the start and end of the section what defines the
> virtio_bypass functionality. It would just be a matter of mostly
> cutting and pasting to put it into a separate driver module.
>
> Please put it there and unite the use of it with netvsc.
>
> Based on Stephen's responses, it is not clear if netvsc would be OK to move
> to
> the 3 netdev solution at this time. When another paravirtual driver would
> like
> to use this solution, we can do refactoring at that time.
>
> The design limits things to a 1:1 relationship since we just have the
> child and backup pointers, but I don't think I am seeing exception
> handling to prevent us from overwriting the child pointers so there
> may be a leak there.
>
>
> We only allow one active netdev registered as a child. There is a check to
> make sure that if an active netdev is registered, another one with same MAC
> is not allowed. I don't think there is a leak.
>
> vbi = netdev_priv(dev);
> backup = (child_netdev->dev.parent == dev->dev.parent);
> if (backup ? rtnl_dereference(vbi->backup_netdev) :
> rtnl_dereference(vbi->active_netdev)) {
> netdev_info(dev,
> "%s attempting to join bypass dev when %s
> already present\n",
> child_netdev->name, backup ? "backup" :
> "active");
> return NOTIFY_DONE;
> }
>
> /* Avoid non pci devices as active netdev */
> if (!backup && (!child_netdev->dev.parent ||
> !dev_is_pci(child_netdev->dev.parent)))
> return NOTIFY_DONE;
>
> Thanks
> Sridhar

I overlooked that part.

Thanks.

- Alex

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH 4/4] virtio-net: add linkspeed and duplex settings to virtio-net

2018-03-02 Thread Jason Baron
On 03/02/2018 02:14 AM, Jason Wang wrote:
> 
> 
> On 2018年03月02日 11:46, Jason Baron wrote:
>> Although linkspeed and duplex can be set in a linux guest via 'ethtool
>> -s',
>> this requires custom ethtool commands for virtio-net by default.
>>
>> Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
>> the hypervisor to export a linkspeed and duplex setting. The user can
>> subsequently overwrite it later if desired via: 'ethtool -s'.
>>
>> Linkspeed and duplex settings can be set as:
>> '-device virtio-net,speed=1,duplex=full'
> 
> I was thinking whether or not it's better to decide the duplex by the
> type of backends.
> 
> E.g userspace and vhost-kernel implement a in fact half duplex. But dpdk
> implement a full duplex.

Interesting - could this be derived only from the backend 'type'. IE:
NET_CLIENT_DRIVER_TAP, NET_CLIENT_DRIVER_VHOST_USER...


I was also thinking this could be specified as 'duplex=backend', in
addition to the proposed 'duplex=full' or 'duplex=half'?

Thanks,

-Jason

> 
> Thanks
> 
>>
>> where speed is [0...INT_MAX], and duplex is ["half"|"full"].
>>
>> Signed-off-by: Jason Baron
>> Cc: "Michael S. Tsirkin"
>> Cc: Jason Wang
>> Cc:virtio-dev@lists.oasis-open.org
>> ---
> 

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Samudrala, Sridhar



On 3/2/2018 8:20 AM, Jiri Pirko wrote:

Fri, Mar 02, 2018 at 04:26:25PM CET, alexander.du...@gmail.com wrote:

On Fri, Mar 2, 2018 at 12:36 AM, Jiri Pirko  wrote:

Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudr...@intel.com wrote:

This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

When BACKUP feature is enabled, an additional netdev(bypass netdev) is
created that acts as a master device and tracks the state of the 2 lower
netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
passthru device with the same MAC is registered as 'active' netdev.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala 
Signed-off-by: Alexander Duyck 
Reviewed-by: Jesse Brandeburg 
---
drivers/net/virtio_net.c | 683 ++-

As I wrote to the discussion of the other version of this patchset,
I strongly believe you need to have a common part in net/core/ that will
be shared by both virtio_net and netvsc. There there should be explicit
limits for this in-driver bonding solution, like max 1 vf slave, etc.

Yeah, this code essentially calls out the "shareable" code with a
comment at the start and end of the section what defines the
virtio_bypass functionality. It would just be a matter of mostly
cutting and pasting to put it into a separate driver module.

Please put it there and unite the use of it with netvsc.


Based on Stephen's responses, it is not clear if netvsc would be OK to move to
the 3 netdev solution at this time. When another paravirtual driver would like
to use this solution, we can do refactoring at that time.




The design limits things to a 1:1 relationship since we just have the
child and backup pointers, but I don't think I am seeing exception
handling to prevent us from overwriting the child pointers so there
may be a leak there.



We only allow one active netdev registered as a child. There is a check to
make sure that if an active netdev is registered, another one with same MAC
is not allowed. I don't think there is a leak.

vbi = netdev_priv(dev);
backup = (child_netdev->dev.parent == dev->dev.parent);
if (backup ? rtnl_dereference(vbi->backup_netdev) :
rtnl_dereference(vbi->active_netdev)) {
netdev_info(dev,
"%s attempting to join bypass dev when %s already 
present\n",
child_netdev->name, backup ? "backup" : "active");
return NOTIFY_DONE;
}

/* Avoid non pci devices as active netdev */
if (!backup && (!child_netdev->dev.parent ||
!dev_is_pci(child_netdev->dev.parent)))
return NOTIFY_DONE;

Thanks
Sridhar



[virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Alexander Duyck
On Fri, Mar 2, 2018 at 12:36 AM, Jiri Pirko  wrote:
> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudr...@intel.com wrote:
>>This patch enables virtio_net to switch over to a VF datapath when a VF
>>netdev is present with the same MAC address. It allows live migration
>>of a VM with a direct attached VF without the need to setup a bond/team
>>between a VF and virtio net device in the guest.
>>
>>The hypervisor needs to enable only one datapath at any time so that
>>packets don't get looped back to the VM over the other datapath. When a VF
>>is plugged, the virtio datapath link state can be marked as down. The
>>hypervisor needs to unplug the VF device from the guest on the source host
>>and reset the MAC filter of the VF to initiate failover of datapath to
>>virtio before starting the migration. After the migration is completed,
>>the destination hypervisor sets the MAC filter on the VF and plugs it back
>>to the guest to switch over to VF datapath.
>>
>>When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>created that acts as a master device and tracks the state of the 2 lower
>>netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>>passthru device with the same MAC is registered as 'active' netdev.
>>
>>This patch is based on the discussion initiated by Jesse on this thread.
>>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>>Signed-off-by: Sridhar Samudrala 
>>Signed-off-by: Alexander Duyck 
>>Reviewed-by: Jesse Brandeburg 
>>---
>> drivers/net/virtio_net.c | 683 
>> ++-
>
> As I wrote to the discussion of the other version of this patchset,
> I strongly believe you need to have a common part in net/core/ that will
> be shared by both virtio_net and netvsc. There there should be explicit
> limits for this in-driver bonding solution, like max 1 vf slave, etc.

Yeah, this code essentially calls out the "shareable" code with a
comment at the start and end of the section what defines the
virtio_bypass functionality. It would just be a matter of mostly
cutting and pasting to put it into a separate driver module.

The design limits things to a 1:1 relationship since we just have the
child and backup pointers, but I don't think I am seeing exception
handling to prevent us from overwriting the child pointers so there
may be a leak there.

Thanks.

- Alex

-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] Re: [PATCH v2 1/3] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-03-02 Thread Wei Wang

On 02/07/2018 09:04 AM, Michael S. Tsirkin wrote:

On Tue, Feb 06, 2018 at 07:08:17PM +0800, Wei Wang wrote:

The new feature enables the virtio-balloon device to receive the hint of
guest free pages from the free page vq, and clears the corresponding bits
of the free page from the dirty bitmap, so that those free pages are not
transferred by the migration thread.

Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
CC: Michael S. Tsirkin 
CC: Juan Quintela 
---
  balloon.c   |  39 +--
  hw/virtio/virtio-balloon.c  | 145 +---
  include/hw/virtio/virtio-balloon.h  |  11 +-
  include/migration/misc.h|   3 +
  include/standard-headers/linux/virtio_balloon.h |   7 ++
  include/sysemu/balloon.h|  12 +-
  migration/ram.c |  10 ++
  7 files changed, 198 insertions(+), 29 deletions(-)

diff --git a/balloon.c b/balloon.c
index 1d720ff..0f0b30c 100644
--- a/balloon.c
+++ b/balloon.c
@@ -36,6 +36,8 @@
  
  static QEMUBalloonEvent *balloon_event_fn;

  static QEMUBalloonStatus *balloon_stat_fn;
+static QEMUBalloonFreePageSupport *balloon_free_page_support_fn;
+static QEMUBalloonFreePagePoll *balloon_free_page_poll_fn;
  static void *balloon_opaque;
  static bool balloon_inhibited;
  
@@ -64,19 +66,34 @@ static bool have_balloon(Error **errp)

  return true;
  }
  
-int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,

- QEMUBalloonStatus *stat_func, void *opaque)
+bool balloon_free_page_support(void)
  {
-if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
-/* We're already registered one balloon handler.  How many can
- * a guest really have?
- */
-return -1;
+return balloon_free_page_support_fn &&
+   balloon_free_page_support_fn(balloon_opaque);
+}
+
+void balloon_free_page_poll(void)
+{
+balloon_free_page_poll_fn(balloon_opaque);
+}
+
+void qemu_add_balloon_handler(QEMUBalloonEvent *event_fn,
+  QEMUBalloonStatus *stat_fn,
+  QEMUBalloonFreePageSupport *free_page_support_fn,
+  QEMUBalloonFreePagePoll *free_page_poll_fn,
+  void *opaque)
+{
+if (balloon_event_fn || balloon_stat_fn || balloon_free_page_support_fn ||
+balloon_free_page_poll_fn || balloon_opaque) {
+/* We already registered one balloon handler. */
+return;
  }
-balloon_event_fn = event_func;
-balloon_stat_fn = stat_func;
+
+balloon_event_fn = event_fn;
+balloon_stat_fn = stat_fn;
+balloon_free_page_support_fn = free_page_support_fn;
+balloon_free_page_poll_fn = free_page_poll_fn;
  balloon_opaque = opaque;
-return 0;
  }
  
  void qemu_remove_balloon_handler(void *opaque)

@@ -86,6 +103,8 @@ void qemu_remove_balloon_handler(void *opaque)
  }
  balloon_event_fn = NULL;
  balloon_stat_fn = NULL;
+balloon_free_page_support_fn = NULL;
+balloon_free_page_poll_fn = NULL;
  balloon_opaque = NULL;
  }
  
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c

index 14e08d2..b424d4e 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -23,6 +23,7 @@
  #include "hw/virtio/virtio-balloon.h"
  #include "sysemu/kvm.h"
  #include "exec/address-spaces.h"
+#include "exec/ram_addr.h"
  #include "qapi/visitor.h"
  #include "qapi-event.h"
  #include "trace.h"
@@ -30,6 +31,7 @@
  
  #include "hw/virtio/virtio-bus.h"

  #include "hw/virtio/virtio-access.h"
+#include "migration/misc.h"
  
  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
  
@@ -305,6 +307,87 @@ out:

  }
  }
  
+static void virtio_balloon_poll_free_page_hints(VirtIOBalloon *dev)

+{
+VirtQueueElement *elem;
+VirtQueue *vq = dev->free_page_vq;
+VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+bool page_poisoning = virtio_vdev_has_feature(vdev,
+  VIRTIO_BALLOON_F_PAGE_POISON);
+uint32_t id;
+
+/* Poll the vq till a stop cmd id is received */
+while (dev->free_page_report_status != FREE_PAGE_REPORT_S_STOP) {
+elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+if (!elem) {
+continue;
+}
+
+if (elem->out_num) {
+iov_to_buf(elem->out_sg, elem->out_num, 0, &id, sizeof(uint32_t));
+virtqueue_push(vq, elem, sizeof(id));
+g_free(elem);
+if (id == dev->free_page_report_cmd_id) {
+dev->free_page_report_status = FREE_PAGE_REPORT_S_START;
+} else {
+dev->free_page_report_status = FREE_PAGE_REPORT_S_STOP;


So if migration on source times out, then you migrate to
another destination, this will cancel the in-progress reporting
due to while loop above.  Probably not what was intended.


+break;

[virtio-dev] [PATCH v3 2/3] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-03-02 Thread Wei Wang
The new feature enables the virtio-balloon device to receive hints of
guest free pages from the free page vq. Callers call the
free_page_start API to start the reporting, which creates a thread to
poll for free page hints. The free_page_stop API stops the reporting and
makes the thread exit.

Signed-off-by: Wei Wang 
Signed-off-by: Liang Li 
CC: Michael S. Tsirkin 
CC: Dr. David Alan Gilbert 
CC: Juan Quintela 
---
 balloon.c   |  49 +--
 hw/virtio/virtio-balloon.c  | 172 +---
 include/hw/virtio/virtio-balloon.h  |  14 +-
 include/standard-headers/linux/virtio_balloon.h |   7 +
 include/sysemu/balloon.h|  15 ++-
 5 files changed, 228 insertions(+), 29 deletions(-)

diff --git a/balloon.c b/balloon.c
index d8dd6fe..b0b0749 100644
--- a/balloon.c
+++ b/balloon.c
@@ -36,6 +36,9 @@
 
 static QEMUBalloonEvent *balloon_event_fn;
 static QEMUBalloonStatus *balloon_stat_fn;
+static QEMUBalloonFreePageSupport *balloon_free_page_support_fn;
+static QEMUBalloonFreePageStart *balloon_free_page_start_fn;
+static QEMUBalloonFreePageStop *balloon_free_page_stop_fn;
 static void *balloon_opaque;
 static bool balloon_inhibited;
 
@@ -64,19 +67,42 @@ static bool have_balloon(Error **errp)
 return true;
 }
 
-int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
- QEMUBalloonStatus *stat_func, void *opaque)
+bool balloon_free_page_support(void)
 {
-if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
-/* We're already registered one balloon handler.  How many can
- * a guest really have?
- */
-return -1;
+return balloon_free_page_support_fn &&
+   balloon_free_page_support_fn(balloon_opaque);
+}
+
+void balloon_free_page_start(void)
+{
+balloon_free_page_start_fn(balloon_opaque);
+}
+
+void balloon_free_page_stop(void)
+{
+balloon_free_page_stop_fn(balloon_opaque);
+}
+
+void qemu_add_balloon_handler(QEMUBalloonEvent *event_fn,
+  QEMUBalloonStatus *stat_fn,
+  QEMUBalloonFreePageSupport *free_page_support_fn,
+  QEMUBalloonFreePageStart *free_page_start_fn,
+  QEMUBalloonFreePageStop *free_page_stop_fn,
+  void *opaque)
+{
+if (balloon_event_fn || balloon_stat_fn || balloon_free_page_support_fn ||
+balloon_free_page_start_fn || balloon_free_page_stop_fn ||
+balloon_opaque) {
+/* We already registered one balloon handler. */
+return;
 }
-balloon_event_fn = event_func;
-balloon_stat_fn = stat_func;
+
+balloon_event_fn = event_fn;
+balloon_stat_fn = stat_fn;
+balloon_free_page_support_fn = free_page_support_fn;
+balloon_free_page_start_fn = free_page_start_fn;
+balloon_free_page_stop_fn = free_page_stop_fn;
 balloon_opaque = opaque;
-return 0;
 }
 
 void qemu_remove_balloon_handler(void *opaque)
@@ -86,6 +112,9 @@ void qemu_remove_balloon_handler(void *opaque)
 }
 balloon_event_fn = NULL;
 balloon_stat_fn = NULL;
+balloon_free_page_support_fn = NULL;
+balloon_free_page_start_fn = NULL;
+balloon_free_page_stop_fn = NULL;
 balloon_opaque = NULL;
 }
 
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 4822449..4607879 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -31,6 +31,7 @@
 
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
+#include "migration/misc.h"
 
 #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
 
@@ -308,6 +309,100 @@ out:
 }
 }
 
+static void *virtio_balloon_poll_free_page_hints(void *opaque)
+{
+VirtQueueElement *elem;
+VirtIOBalloon *dev = opaque;
+VirtQueue *vq = dev->free_page_vq;
+uint32_t id;
+size_t size;
+
+/*
+ * Poll the vq till the status changed to STOP. This happens when
+ * the guest finishes reporting hints or the migration thread actively
+ * stops the reporting.
+ */
+while (dev->free_page_report_status != FREE_PAGE_REPORT_S_STOP) {
+elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+if (!elem) {
+continue;
+}
+
+if (elem->out_num) {
+size = iov_to_buf(elem->out_sg, elem->out_num, 0, &id, sizeof(id));
+virtqueue_push(vq, elem, size);
+g_free(elem);
+if (unlikely(size != sizeof(id))) {
+warn_report("%s: received an incorrect cmd id", __func__);
+break;
+}
+if (id == dev->free_page_report_cmd_id) {
+dev->free_page_report_status = FREE_PAGE_REPORT_S_START;
+} else if (dev->free_page_report_status ==
+   FREE_PAGE_REPORT_S_START) {
+/*
+ * Stop the optimization only when it has started. This 

[virtio-dev] [PATCH v3 1/3] migration: API to clear bits of guest free pages from the dirty bitmap

2018-03-02 Thread Wei Wang
This patch adds an API to clear bits corresponding to guest free pages
from the dirty bitmap. Spilt the free page block if it crosses the QEMU
RAMBlock boundary.

Signed-off-by: Wei Wang 
CC: Dr. David Alan Gilbert 
CC: Juan Quintela 
CC: Michael S. Tsirkin 
---
 include/migration/misc.h |  2 ++
 migration/ram.c  | 20 
 2 files changed, 22 insertions(+)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index 77fd4f5..fae1acf 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -14,11 +14,13 @@
 #ifndef MIGRATION_MISC_H
 #define MIGRATION_MISC_H
 
+#include "exec/cpu-common.h"
 #include "qemu/notify.h"
 
 /* migration/ram.c */
 
 void ram_mig_init(void);
+void qemu_guest_free_page_hint(void *addr, size_t len);
 
 /* migration/block.c */
 
diff --git a/migration/ram.c b/migration/ram.c
index 5e33e5c..769a0f6 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2189,6 +2189,26 @@ static int ram_init_all(RAMState **rsp)
 return 0;
 }
 
+void qemu_guest_free_page_hint(void *addr, size_t len)
+{
+RAMBlock *block;
+ram_addr_t offset;
+size_t used_len, start, npages;
+
+for (used_len = len; len > 0; len -= used_len) {
+block = qemu_ram_block_from_host(addr, false, &offset);
+if (unlikely(offset + len > block->used_length)) {
+used_len = block->used_length - offset;
+addr += used_len;
+}
+
+start = offset >> TARGET_PAGE_BITS;
+npages = used_len >> TARGET_PAGE_BITS;
+bitmap_clear(block->bmap, start, npages);
+ram_state->migration_dirty_pages -= npages;
+}
+}
+
 /*
  * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
  * long-running RCU critical section.  When rcu-reclaims in the code
-- 
1.8.3.1


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v3 0/3] virtio-balloon: free page hint reporting support

2018-03-02 Thread Wei Wang
This is the deivce part implementation to add a new feature,
VIRTIO_BALLOON_F_FREE_PAGE_HINT to the virtio-balloon device. The device
receives the guest free page hints from the driver and clears the
corresponding bits in the dirty bitmap, so that those free pages are
not transferred by the migration thread to the destination.

- Test Environment
Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Guest: 8G RAM, 4 vCPU
Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second

- Test Results
- Idle Guest Live Migration Time (results are averaged over 10 runs):
- Optimization v.s. Legacy = 259ms vs 1769ms --> ~86% reduction
- Guest with Linux Compilation Workload (make bzImage -j4):
- Live Migration Time (average)
  Optimization v.s. Legacy = 1290ms v.s. 2634ms --> ~51% reduction
- Linux Compilation Time
  Optimization v.s. Legacy = 5min v.s. 5min3s
  --> no obvious difference

- Source Code
- QEMU:  https://github.com/wei-w-wang/qemu-free-page-lm.git
- Linux: https://github.com/wei-w-wang/linux-free-page-lm.git

ChangeLog:
v2->v3:
1) virtio-balloon
- virtio_balloon_free_page_start: poll the hints using a new
  thread;
- use cmd id between [0x8000, UINT_MAX];
- virtio_balloon_poll_free_page_hints:
- stop the optimization only when it has started;
- don't skip free pages when !poison_val;
- add poison_val to vmsd to migrate;
- virtio_balloon_get_features: add the F_PAGE_POISON feature when
  host has F_FREE_PAGE_HINT;
- remove the timer patch which is not needed now.
2) migration
   - new api, qemu_guest_free_page_hint;
   - rs->free_page_support set only in the precopy case;
   - use the new balloon APIs.
v1->v2: 
1) virtio-balloon
- use subsections to save free_page_report_cmd_id;
- poll the free page vq after sending a cmd id to the driver;
- change the free page vq size to VIRTQUEUE_MAX_SIZE;
- virtio_balloon_poll_free_page_hints: handle the corner case
  that the free page block reported from the driver may cross
  the RAMBlock boundary.
2) migration/ram.c
- use balloon_free_page_poll to start the optimization

Wei Wang (3):
  migration: API to clear bits of guest free pages from the dirty bitmap
  virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
  migration: use the free page hint feature from balloon

 balloon.c   |  49 +--
 hw/virtio/virtio-balloon.c  | 172 +---
 include/hw/virtio/virtio-balloon.h  |  14 +-
 include/migration/misc.h|   2 +
 include/standard-headers/linux/virtio_balloon.h |   7 +
 include/sysemu/balloon.h|  15 ++-
 migration/ram.c |  39 +-
 7 files changed, 268 insertions(+), 30 deletions(-)

-- 
1.8.3.1


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org



[virtio-dev] [PATCH v3 3/3] migration: use the free page hint feature from balloon

2018-03-02 Thread Wei Wang
Start the free page optimization when the bulk stage starts. In case the
guest is slow in reporting, actively stops it when the bulk stage ends.
The optimization avoids sending guest free pages during the bulk stage.
Currently, the optimization is added to precopy only.

Signed-off-by: Wei Wang 
CC: Dr. David Alan Gilbert 
CC: Juan Quintela 
CC: Michael S. Tsirkin 
---
 migration/ram.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index 769a0f6..f6af7e6 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -51,6 +51,7 @@
 #include "qemu/rcu_queue.h"
 #include "migration/colo.h"
 #include "migration/block.h"
+#include "sysemu/balloon.h"
 
 /***/
 /* ram save/restore */
@@ -208,6 +209,8 @@ struct RAMState {
 uint32_t last_version;
 /* We are in the first round */
 bool ram_bulk_stage;
+/* The free pages optimization feature is supported */
+bool free_page_support;
 /* How many times we have dirty too many pages */
 int dirty_rate_high_cnt;
 /* these variables are used for bitmap sync */
@@ -775,7 +778,7 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, 
RAMBlock *rb,
 unsigned long *bitmap = rb->bmap;
 unsigned long next;
 
-if (rs->ram_bulk_stage && start > 0) {
+if (rs->ram_bulk_stage && start > 0 && !rs->free_page_support) {
 next = start + 1;
 } else {
 next = find_next_bit(bitmap, size, start);
@@ -1225,6 +1228,10 @@ static bool find_dirty_block(RAMState *rs, 
PageSearchStatus *pss, bool *again)
 /* Flag that we've looped */
 pss->complete_round = true;
 rs->ram_bulk_stage = false;
+if (rs->free_page_support) {
+balloon_free_page_stop();
+rs->free_page_support = false;
+}
 if (migrate_use_xbzrle()) {
 /* If xbzrle is on, stop using the data compression at this
  * point. In theory, xbzrle can do better than compression.
@@ -1656,6 +1663,8 @@ static void ram_state_reset(RAMState *rs)
 rs->last_page = 0;
 rs->last_version = ram_list.version;
 rs->ram_bulk_stage = true;
+rs->free_page_support = balloon_free_page_support() &
+!migration_in_postcopy();
 }
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
@@ -2235,6 +2244,11 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 return -1;
 }
 }
+
+if ((*rsp)->free_page_support) {
+balloon_free_page_start();
+}
+
 (*rsp)->f = f;
 
 rcu_read_lock();
@@ -2329,6 +2343,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 
 ret = qemu_file_get_error(f);
 if (ret < 0) {
+if (rs->ram_bulk_stage && rs->free_page_support) {
+balloon_free_page_stop();
+}
 return ret;
 }
 
-- 
1.8.3.1


-
To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org