Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 28, 2018 at 08:25:01PM +0100, Jiri Pirko wrote: > Wed, Feb 28, 2018 at 04:45:39PM CET, m...@redhat.com wrote: > >On Wed, Feb 28, 2018 at 04:11:31PM +0100, Jiri Pirko wrote: > >> Wed, Feb 28, 2018 at 03:32:44PM CET, m...@redhat.com wrote: > >> >On Wed, Feb 28, 2018 at 08:08:39AM +0100, Jiri Pirko wrote: > >> >> Tue, Feb 27, 2018 at 10:41:49PM CET, kubak...@wp.pl wrote: > >> >> >On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote: > >> >> >> Basically we need some sort of PCI or PCIe topology mapping for the > >> >> >> devices that can be translated into something we can communicate over > >> >> >> the communication channel. > >> >> > > >> >> >Hm. This is probably a completely stupid idea, but if we need to > >> >> >start marshalling configuration requests/hints maybe the entire problem > >> >> >could be solved by opening a netlink socket from the hypervisor? Even make > >> >> >teamd run on the hypervisor side... > >> >> > >> >> Interesting. That would be trickier than just forwarding one genetlink > >> >> socket to the hypervisor. > >> >> > >> >> Also, I think that the solution should handle multiple guest OSes. What > >> >> I'm thinking about is some generic bonding description passed over some > >> >> communication channel into the VM. The VM either uses it for configuration, > >> >> or ignores it if it is not smart enough/updated enough. > >> > > >> >For sure, we could build virtio-bond to pass that info to guests. > >> > >> What do you mean by "virtio-bond"? A virtio_net extension? > > > >I mean a new device supplying topology information to guests, > >with updates whenever VMs are started, stopped or migrated. > > Good. Any idea what that device would look like? Also, any idea how to > handle it in the kernel and how to pass this info along to userspace? > Is there anything similar out there? > > Thanks! E.g. the balloon device is used to pass hints about the amount of memory the guest should use. We could do something similar.
I imagine the device could send a configuration interrupt on each topology change. The kernel wakes up userspace pollers. Userspace then reads from a char device and figures out what changed. Which info is needed there? I am not sure. How about a list of MAC/VLAN addresses coupled to a list of devices to queue on (specified by MAC? by PCI address?)? Or do we ever need to go higher level and make decisions based on IP addresses as well? -- MST
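Here is a minimal userspace sketch of the flow MST describes. It assumes a hypothetical char device that becomes readable after the device's configuration interrupt and returns fixed-size records. The record layout (MAC, VLAN, and the PCI address of the device to queue on) is an assumption picked from the options raised above. No such device or format exists today.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical record layout: the thread leaves the format open, so this
 * pairs a MAC/VLAN with the device to queue on, named by PCI address. */
struct topo_entry {
    uint8_t  mac[6];
    uint16_t vlan;          /* 0 = untagged */
    char     pci_addr[16];  /* e.g. "0000:00:05.0", NUL-terminated */
};

/* Wait until fd is readable (the kernel would wake pollers after the
 * device's configuration interrupt), then read as many whole entries as
 * fit in buf. Returns the number of entries read, 0 on timeout, -1 on error. */
static int wait_for_topology(int fd, struct topo_entry *buf, size_t max,
                             int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc <= 0)
        return rc;                      /* 0: timed out, -1: poll failed */
    if (!(pfd.revents & POLLIN))
        return -1;                      /* error/hangup without data */

    ssize_t n = read(fd, buf, max * sizeof(*buf));
    if (n < 0)
        return -1;
    return (int)(n / (ssize_t)sizeof(*buf));
}
```

For testing, a pipe can stand in for the hypothetical char device, because poll() and read() behave the same way on both. Comparing the returned entries with the previous set is left to the caller.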
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Wed, Feb 28, 2018 at 04:45:39PM CET, m...@redhat.com wrote: >On Wed, Feb 28, 2018 at 04:11:31PM +0100, Jiri Pirko wrote: >> Wed, Feb 28, 2018 at 03:32:44PM CET, m...@redhat.com wrote: >> >On Wed, Feb 28, 2018 at 08:08:39AM +0100, Jiri Pirko wrote: >> >> Tue, Feb 27, 2018 at 10:41:49PM CET, kubak...@wp.pl wrote: >> >> >On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote: >> >> >> Basically we need some sort of PCI or PCIe topology mapping for the >> >> >> devices that can be translated into something we can communicate over >> >> >> the communication channel. >> >> > >> >> >Hm. This is probably a completely stupid idea, but if we need to >> >> >start marshalling configuration requests/hints maybe the entire problem >> >> >could be solved by opening a netlink socket from the hypervisor? Even make >> >> >teamd run on the hypervisor side... >> >> >> >> Interesting. That would be trickier than just forwarding one genetlink >> >> socket to the hypervisor. >> >> >> >> Also, I think that the solution should handle multiple guest OSes. What >> >> I'm thinking about is some generic bonding description passed over some >> >> communication channel into the VM. The VM either uses it for configuration, >> >> or ignores it if it is not smart enough/updated enough. >> > >> >For sure, we could build virtio-bond to pass that info to guests. >> >> What do you mean by "virtio-bond"? A virtio_net extension? > >I mean a new device supplying topology information to guests, >with updates whenever VMs are started, stopped or migrated. Good. Any idea what that device would look like? Also, any idea how to handle it in the kernel and how to pass this info along to userspace? Is there anything similar out there? Thanks!
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 28, 2018 at 04:11:31PM +0100, Jiri Pirko wrote: > Wed, Feb 28, 2018 at 03:32:44PM CET, m...@redhat.com wrote: > >On Wed, Feb 28, 2018 at 08:08:39AM +0100, Jiri Pirko wrote: > >> Tue, Feb 27, 2018 at 10:41:49PM CET, kubak...@wp.pl wrote: > >> >On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote: > >> >> Basically we need some sort of PCI or PCIe topology mapping for the > >> >> devices that can be translated into something we can communicate over > >> >> the communication channel. > >> > > >> >Hm. This is probably a completely stupid idea, but if we need to > >> >start marshalling configuration requests/hints maybe the entire problem > >> >could be solved by opening a netlink socket from the hypervisor? Even make > >> >teamd run on the hypervisor side... > >> > >> Interesting. That would be trickier than just forwarding one genetlink > >> socket to the hypervisor. > >> > >> Also, I think that the solution should handle multiple guest OSes. What > >> I'm thinking about is some generic bonding description passed over some > >> communication channel into the VM. The VM either uses it for configuration, > >> or ignores it if it is not smart enough/updated enough. > > > >For sure, we could build virtio-bond to pass that info to guests. > > What do you mean by "virtio-bond"? A virtio_net extension? I mean a new device supplying topology information to guests, with updates whenever VMs are started, stopped or migrated. > > > >Such an advisory mechanism would not be a replacement for the mandatory > >passthrough fallback flag proposed, but OTOH it's much more flexible. > > > >-- > >MST
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Wed, Feb 28, 2018 at 03:32:44PM CET, m...@redhat.com wrote: >On Wed, Feb 28, 2018 at 08:08:39AM +0100, Jiri Pirko wrote: >> Tue, Feb 27, 2018 at 10:41:49PM CET, kubak...@wp.pl wrote: >> >On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote: >> >> Basically we need some sort of PCI or PCIe topology mapping for the >> >> devices that can be translated into something we can communicate over >> >> the communication channel. >> > >> >Hm. This is probably a completely stupid idea, but if we need to >> >start marshalling configuration requests/hints maybe the entire problem >> >could be solved by opening a netlink socket from the hypervisor? Even make >> >teamd run on the hypervisor side... >> >> Interesting. That would be trickier than just forwarding one genetlink >> socket to the hypervisor. >> >> Also, I think that the solution should handle multiple guest OSes. What >> I'm thinking about is some generic bonding description passed over some >> communication channel into the VM. The VM either uses it for configuration, >> or ignores it if it is not smart enough/updated enough. > >For sure, we could build virtio-bond to pass that info to guests. What do you mean by "virtio-bond"? A virtio_net extension? > >Such an advisory mechanism would not be a replacement for the mandatory >passthrough fallback flag proposed, but OTOH it's much more flexible. > >-- >MST
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 28, 2018 at 08:08:39AM +0100, Jiri Pirko wrote: > Tue, Feb 27, 2018 at 10:41:49PM CET, kubak...@wp.pl wrote: > >On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote: > >> Basically we need some sort of PCI or PCIe topology mapping for the > >> devices that can be translated into something we can communicate over > >> the communication channel. > > > >Hm. This is probably a completely stupid idea, but if we need to > >start marshalling configuration requests/hints maybe the entire problem > >could be solved by opening a netlink socket from the hypervisor? Even make > >teamd run on the hypervisor side... > > Interesting. That would be trickier than just forwarding one genetlink > socket to the hypervisor. > > Also, I think that the solution should handle multiple guest OSes. What > I'm thinking about is some generic bonding description passed over some > communication channel into the VM. The VM either uses it for configuration, > or ignores it if it is not smart enough/updated enough. For sure, we could build virtio-bond to pass that info to guests. Such an advisory mechanism would not be a replacement for the mandatory passthrough fallback flag proposed, but OTOH it's much more flexible. -- MST
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Tue, Feb 27, 2018 at 10:41:49PM CET, kubak...@wp.pl wrote: >On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote: >> Basically we need some sort of PCI or PCIe topology mapping for the >> devices that can be translated into something we can communicate over >> the communication channel. > >Hm. This is probably a completely stupid idea, but if we need to >start marshalling configuration requests/hints maybe the entire problem >could be solved by opening a netlink socket from the hypervisor? Even make >teamd run on the hypervisor side... Interesting. That would be trickier than just forwarding one genetlink socket to the hypervisor. Also, I think that the solution should handle multiple guest OSes. What I'm thinking about is some generic bonding description passed over some communication channel into the VM. The VM either uses it for configuration or ignores it if it is not smart enough/updated enough.
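One way a guest could ignore a description "if it is not smart enough" is a versioned message that older parsers skip. The sketch below assumes a hypothetical wire format: a version byte, a mode byte, a member count, then one MAC address per member. None of these names come from an existing spec.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire format for a "generic bonding description":
 *   byte 0: version, byte 1: mode, byte 2: member count,
 *   then 6 bytes (a MAC address) per member. */
#define BOND_DESC_VERSION     1   /* the only version this guest understands */
#define BOND_DESC_MAX_MEMBERS 8

enum bond_mode { BOND_MODE_ACTIVE_BACKUP = 1, BOND_MODE_LACP = 2 };

struct bond_desc {
    enum bond_mode mode;
    size_t n_members;
    uint8_t members[BOND_DESC_MAX_MEMBERS][6];
};

/* Returns 1 if desc was filled in, 0 if the message should be ignored
 * because this guest does not understand it, -1 if it is truncated. */
static int parse_bond_desc(const uint8_t *msg, size_t len, struct bond_desc *desc)
{
    if (len < 3)
        return -1;

    uint8_t version = msg[0], mode = msg[1], n = msg[2];

    /* A guest that is not "updated enough" simply ignores the message. */
    if (version != BOND_DESC_VERSION)
        return 0;
    if (mode != BOND_MODE_ACTIVE_BACKUP && mode != BOND_MODE_LACP)
        return 0;
    if (n > BOND_DESC_MAX_MEMBERS)
        return 0;
    if (len < 3 + (size_t)n * 6)
        return -1;

    desc->mode = (enum bond_mode)mode;
    desc->n_members = n;
    memcpy(desc->members, msg + 3, (size_t)n * 6);
    return 1;
}
```

Returning 0 instead of an error for unknown versions keeps the mechanism advisory: an old guest keeps whatever configuration it already has.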
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote: > Basically we need some sort of PCI or PCIe topology mapping for the > devices that can be translated into something we can communicate over > the communication channel. Hm. This is probably a completely stupid idea, but if we need to start marshalling configuration requests/hints maybe the entire problem could be solved by opening a netlink socket from the hypervisor? Even make teamd run on the hypervisor side...
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Tue, Feb 27, 2018 at 09:49:59AM +0100, Jiri Pirko wrote: > Now the question is: is it possible to merge the demands you have and > the generic needs I described into a single solution? From what I see, > that would be quite hard/impossible. So in the end, I think that we have > to end up with 2 solutions: > 1) virtio_net, netvsc in-driver bonding - very limited, stupid, 0config >solution that works for all (no matter what OS you use in the VM) > 2) team/bond solution with assistance of, preferably, a userspace daemon >getting info from baremetal. This is not 0config, but minimal config >- the user just has to define that this "magic bonding" should be on. >This covers all possible use cases, including multiple VFs, RDMA, etc. > > Thoughts? I think I agree. This RFC is trying to do 1 above. Looks like we now all agree 1 and 2 are not exclusive; both have a place in the kernel. Is that right? -- MST
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Tue, Feb 27, 2018 at 01:16:21PM -0800, Alexander Duyck wrote: > The other thing I am looking at is trying to find a good way to do > dirty page tracking in the hypervisor using something like a > para-virtual IOMMU. However I don't have any ETA on that as I am just > starting out and have limited development time. If we get that in > place we can leave the VF in the guest until the very last moments > instead of having to remove it before we start the live migration. > > - Alex I actually think your old RFC would be a good starting point: https://lkml.org/lkml/2016/1/5/104 What is missing, I think, is enabling/disabling it dynamically. That seems easier than tracking by the hypervisor. -- MST
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Tue, Feb 27, 2018 at 12:49 AM, Jiri Pirko wrote: > Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote: >>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko wrote: >>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote: Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be used by the hypervisor to indicate that the virtio_net interface should act as a backup for another device with the same MAC address. Patch 2 is in response to the community request for a 3 netdev solution. However, it creates some issues we'll get into in a moment. It extends virtio_net to use an alternate datapath when available and registered. When the BACKUP feature is enabled, the virtio_net driver creates an additional 'bypass' netdev that acts as a master device and controls 2 slave devices. The original virtio_net netdev is registered as the 'backup' netdev and a passthru/vf device with the same MAC gets registered as the 'active' netdev. Both 'bypass' and 'backup' netdevs are associated with the same 'pci' device. The user accesses the network interface via the 'bypass' netdev. The 'bypass' netdev chooses the 'active' netdev as default for transmits when it is available with link up and running. >>> >>> Sorry, but this is ridiculous. You are apparently re-implementing part >>> of the bonding driver as a part of a NIC driver. Bond and team drivers >>> are mature solutions, well tested, broadly used, with lots of issues >>> resolved in the past. What you try to introduce is a weird shortcut >>> that already has a couple of issues as you mentioned and will certainly >>> have many more. Also, I'm pretty sure that in the future, someone will come up >>> with ideas like multiple VFs, LACP and similar bonding things. >> >>The problem with the bond and team drivers is they are too large and >>have too many interfaces available for configuration so as a result >>they can really screw this interface up.
>> >>Essentially this is meant to be a bond that is more-or-less managed by >>the host, not the guest. We want the host to be able to configure it >>and have it automatically kick in on the guest. For now we want to >>avoid adding too much complexity as this is meant to be just the first >>step. Trying to go in and implement the whole solution right from the >>start based on existing drivers is going to be a massive time sink and >>will likely never get completed due to the fact that there is always >>going to be some other thing that will interfere. >> >>My personal hope is that we can look at doing a virtio-bond sort of >>device that will handle all this as well as providing a communication >>channel, but that is much further down the road. For now we only have >>a single bit so the goal for now is trying to keep this as simple as >>possible. > > I have another use case that would require the solution to be different > than what you suggest. Consider the following scenario: > - baremetal has 2 SR-IOV NICs > - there is a VM that has 1 VF from each NIC: vf0, vf1. No virtio_net > - baremetal would like to somehow tell the VM to bond vf0 and vf1 > together and how this bonding should be configured, according to how > the VF representors are configured on the baremetal (LACP for example) > > The baremetal could decide to remove any VF during the VM runtime; it > can add another VF there. For migration, it can add virtio_net. The VM > should be instructed to bond all interfaces together according to how > baremetal decided - as it knows better. > > For this we need a separate communication channel from baremetal to VM > (perhaps something reusable already exists), we need something to > listen to the events coming from this channel (kernel/userspace) and to > react accordingly (create bond/team, enslave, etc). > > Now the question is: is it possible to merge the demands you have and > the generic needs I described into a single solution?
From what I see, > that would be quite hard/impossible. So in the end, I think that we have > to end up with 2 solutions: > 1) virtio_net, netvsc in-driver bonding - very limited, stupid, 0config >solution that works for all (no matter what OS you use in the VM) > 2) team/bond solution with assistance of, preferably, a userspace daemon >getting info from baremetal. This is not 0config, but minimal config >- the user just has to define that this "magic bonding" should be on. >This covers all possible use cases, including multiple VFs, RDMA, etc. > > Thoughts? So that is about what I had in mind. We end up having to do something completely different to support this more complex solution. I think we might have referred to it as v2/v3 in a different thread, and virt-bond in this thread. Basically we need some sort of PCI or PCIe topology mapping for the devices that can be translated into something we can communicate over the communication channel. After that we also have the added complexity of how do we figure out
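The transmit policy in the quoted cover letter comes down to a small selection function: prefer 'active' when it is present with link up and running, otherwise use 'backup'. This is a simplified, hypothetical model, not code from the RFC. In the kernel, the two flags would roughly correspond to netif_running() and netif_carrier_ok().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a lower netdev. */
struct lower_dev {
    const char *name;
    bool running;   /* device is up */
    bool carrier;   /* link detected */
};

struct bypass_dev {
    struct lower_dev *active;   /* passthru VF; NULL while unplugged */
    struct lower_dev *backup;   /* the original virtio_net netdev */
};

static bool lower_usable(const struct lower_dev *d)
{
    return d != NULL && d->running && d->carrier;
}

/* Prefer the 'active' VF whenever it is present with link up and
 * running; otherwise fall back to the 'backup' virtio_net netdev.
 * Returns NULL when neither can transmit (the packet would be dropped). */
static struct lower_dev *bypass_select_tx(const struct bypass_dev *b)
{
    if (lower_usable(b->active))
        return b->active;
    if (lower_usable(b->backup))
        return b->backup;
    return NULL;
}
```

Because the decision is made on every transmit, unplugging the VF before live migration switches traffic to virtio_net without any guest configuration change.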
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote: >On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko wrote: >> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote: >>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be >>>used by the hypervisor to indicate that the virtio_net interface should act as >>>a backup for another device with the same MAC address. >>> >>>Patch 2 is in response to the community request for a 3 netdev >>>solution. However, it creates some issues we'll get into in a moment. >>>It extends virtio_net to use an alternate datapath when available and >>>registered. When the BACKUP feature is enabled, the virtio_net driver creates >>>an additional 'bypass' netdev that acts as a master device and controls >>>2 slave devices. The original virtio_net netdev is registered as the >>>'backup' netdev and a passthru/vf device with the same MAC gets >>>registered as the 'active' netdev. Both 'bypass' and 'backup' netdevs are >>>associated with the same 'pci' device. The user accesses the network >>>interface via the 'bypass' netdev. The 'bypass' netdev chooses the 'active' netdev >>>as default for transmits when it is available with link up and running. >> >> Sorry, but this is ridiculous. You are apparently re-implementing part >> of the bonding driver as a part of a NIC driver. Bond and team drivers >> are mature solutions, well tested, broadly used, with lots of issues >> resolved in the past. What you try to introduce is a weird shortcut >> that already has a couple of issues as you mentioned and will certainly >> have many more. Also, I'm pretty sure that in the future, someone will come up >> with ideas like multiple VFs, LACP and similar bonding things. > >The problem with the bond and team drivers is they are too large and >have too many interfaces available for configuration so as a result >they can really screw this interface up. > >Essentially this is meant to be a bond that is more-or-less managed by >the host, not the guest.
We want the host to be able to configure it >and have it automatically kick in on the guest. For now we want to >avoid adding too much complexity as this is meant to be just the first >step. Trying to go in and implement the whole solution right from the >start based on existing drivers is going to be a massive time sink and >will likely never get completed due to the fact that there is always >going to be some other thing that will interfere. > >My personal hope is that we can look at doing a virtio-bond sort of >device that will handle all this as well as providing a communication >channel, but that is much further down the road. For now we only have >a single bit so the goal for now is trying to keep this as simple as >possible.
I have another use case that would require the solution to be different than what you suggest. Consider the following scenario:
- baremetal has 2 SR-IOV NICs
- there is a VM that has 1 VF from each NIC: vf0, vf1. No virtio_net
- baremetal would like to somehow tell the VM to bond vf0 and vf1 together and how this bonding should be configured, according to how the VF representors are configured on the baremetal (LACP for example)
The baremetal could decide to remove any VF during the VM runtime; it can add another VF there. For migration, it can add virtio_net. The VM should be instructed to bond all interfaces together according to how baremetal decided - as it knows better.
For this we need a separate communication channel from baremetal to VM (perhaps something reusable already exists), we need something to listen to the events coming from this channel (kernel/userspace) and to react accordingly (create bond/team, enslave, etc).
Now the question is: is it possible to merge the demands you have and the generic needs I described into a single solution? From what I see, that would be quite hard/impossible.
So in the end, I think that we have to end up with 2 solutions:
1) virtio_net, netvsc in-driver bonding - very limited, stupid, 0config solution that works for all (no matter what OS you use in the VM)
2) team/bond solution with assistance of, preferably, a userspace daemon getting info from baremetal. This is not 0config, but minimal config - the user just has to define that this "magic bonding" should be on. This covers all possible use cases, including multiple VFs, RDMA, etc.
Thoughts?
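Solution 2 needs a daemon that reacts when the host changes its view of the bond. Assume the daemon receives the desired member list, held here in a hypothetical `struct if_set`. The core of the daemon is then a diff between that list and the bond's current members. This is only a sketch of that step.

```c
#include <stddef.h>
#include <string.h>

#define MAX_IFS 8

/* Hypothetical: a small set of interface names. */
struct if_set {
    const char *names[MAX_IFS];
    size_t n;
};

static int if_set_contains(const struct if_set *s, const char *name)
{
    for (size_t i = 0; i < s->n; i++)
        if (strcmp(s->names[i], name) == 0)
            return 1;
    return 0;
}

/* Compare what the host says the bond should contain with what the
 * guest's bond currently contains: members only in `desired` must be
 * enslaved, members only in `current` must be released. */
static void plan_bond_changes(const struct if_set *desired,
                              const struct if_set *current,
                              struct if_set *to_enslave,
                              struct if_set *to_release)
{
    to_enslave->n = 0;
    to_release->n = 0;
    for (size_t i = 0; i < desired->n; i++)
        if (!if_set_contains(current, desired->names[i]))
            to_enslave->names[to_enslave->n++] = desired->names[i];
    for (size_t i = 0; i < current->n; i++)
        if (!if_set_contains(desired, current->names[i]))
            to_release->names[to_release->n++] = current->names[i];
}
```

Take the migration case above, with illustrative names: the host replaces vf1 with a virtio_net device eth2. The plan is to release vf1 and enslave eth2. A daemon could apply that with iproute2, e.g. `ip link set vf1 nomaster` and `ip link set eth2 master bond0`, or through teamd for a team device.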
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Tue, Feb 27, 2018 at 02:18:12AM CET, m...@redhat.com wrote: >On Mon, Feb 26, 2018 at 05:02:18PM -0800, Stephen Hemminger wrote: >> On Mon, 26 Feb 2018 08:19:24 +0100 >> Jiri Pirko wrote: >> >> > Sat, Feb 24, 2018 at 12:59:04AM CET, step...@networkplumber.org wrote: >> > >On Thu, 22 Feb 2018 13:30:12 -0800 >> > >Alexander Duyck wrote: >> > > >> > >> > Again, I understand your motivation. Yet I don't like your solution. >> > >> > But if the decision is made to do this in-driver bonding, I would like >> > >> > to see it being done in some generic way: >> > >> > 1) share the same "in-driver bonding core" code with netvsc >> > >> >put to net/core. >> > >> > 2) the "in-driver bonding core" will strictly limit the functionality, >> > >> >like active-backup mode only, one vf, one backup, vf netdev type >> > >> >check (so no one could enslave a tap or anything else) >> > >> > If the user needs something more, he should employ team/bond. >> > > >> > >Sharing would be good, but netvsc world would really like to only have >> > >one visible network device. >> > >> > Why do you mind? All would be the same; there would just be another >> > netdevice unused by the VM user (same as the VF netdev). >> > >> >> I mind because our requirement is no changes to userspace. >> No special udev rules, no bonding script, no setup. > >Agreed. It is mostly fine from this point of view, except that you need >to know to skip the slaves. Maybe we could look at some kind of >trick, e.g. pretending the link is down for the slaves? :O Another hack. Please, don't. > >> Things like cloud-init running on current distros expect to see a single >> eth0. The VF device showing up can also be an issue because distros have >> stupid rules like Network Manager trying to start DHCP on every interface. >> We deal with that now by doing stuff like udev rules to get it to stop >> but that is still causing user errors. So that means that with an extra netdev for "virtio_net bypass" you will face exactly the same problems.
That should not be an issue for you then. > >So the ideal of a single net device isn't achieved by netvsc. > >Since you have scripts to skip the PT device, can't they >hide the PV slave too? How do they identify the device to skip? > >I agree it would be nice to have a way to hide the extra netdev >from userspace. "A hidden netdevice", hmm. I believe that instead of doing hacks like this, we should fix userspace to treat particular netdevices correctly. > >The benefit of the separation is that each slave device can >be configured with e.g. its own native ethtool commands for >optimum performance. > >-- >MST
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Mon, Feb 26, 2018 at 05:02:18PM -0800, Stephen Hemminger wrote: > On Mon, 26 Feb 2018 08:19:24 +0100 > Jiri Pirko wrote: > > > Sat, Feb 24, 2018 at 12:59:04AM CET, step...@networkplumber.org wrote: > > >On Thu, 22 Feb 2018 13:30:12 -0800 > > >Alexander Duyck wrote: > > > > > >> > Again, I understand your motivation. Yet I don't like your solution. > > >> > But if the decision is made to do this in-driver bonding, I would like > > >> > to see it being done in some generic way: > > >> > 1) share the same "in-driver bonding core" code with netvsc > > >> >put to net/core. > > >> > 2) the "in-driver bonding core" will strictly limit the functionality, > > >> >like active-backup mode only, one vf, one backup, vf netdev type > > >> >check (so no one could enslave a tap or anything else) > > >> > If the user needs something more, he should employ team/bond. > > > > > >Sharing would be good, but netvsc world would really like to only have > > >one visible network device. > > > > Why do you mind? All would be the same; there would just be another > > netdevice unused by the VM user (same as the VF netdev). > > > > I mind because our requirement is no changes to userspace. > No special udev rules, no bonding script, no setup. Agreed. It is mostly fine from this point of view, except that you need to know to skip the slaves. Maybe we could look at some kind of trick, e.g. pretending the link is down for the slaves? > Things like cloud-init running on current distros expect to see a single > eth0. The VF device showing up can also be an issue because distros have > stupid rules like Network Manager trying to start DHCP on every interface. > We deal with that now by doing stuff like udev rules to get it to stop > but that is still causing user errors. So the ideal of a single net device isn't achieved by netvsc. Since you have scripts to skip the PT device, can't they hide the PV slave too? How do they identify the device to skip?
I agree it would be nice to have a way to hide the extra netdev from userspace. The benefit of the separation is that each slave device can be configured with e.g. its own native ethtool commands for optimum performance. -- MST
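On the question of how userspace could identify the device to skip, one possibility is to look for the `master` symlink that the kernel creates under `/sys/class/net/<ifname>/` for a netdev linked to a master upper device. This assumes the bypass netdev links its slaves with netdev_master_upper_dev_link(), the way bond and team do. In the sketch below, the sysfs root is a parameter so the check can be tested against a fake directory tree.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if <sysfs_net>/<ifname>/master exists, i.e. this netdev is
 * linked under a master (upper) device such as a bond or team; 0
 * otherwise. lstat() checks the symlink itself without resolving it. */
static int netdev_is_enslaved(const char *sysfs_net, const char *ifname)
{
    char path[512];
    struct stat st;
    int n = snprintf(path, sizeof(path), "%s/%s/master", sysfs_net, ifname);

    if (n < 0 || (size_t)n >= sizeof(path))
        return 0;               /* name too long: treat as not enslaved */
    return lstat(path, &st) == 0;
}
```

A cloud-init or DHCP hook could call `netdev_is_enslaved("/sys/class/net", name)` and skip the interface when it returns 1. That is the kind of "fix userspace" approach raised in the thread, as opposed to hiding netdevs in the kernel.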
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Mon, 26 Feb 2018 08:19:24 +0100 Jiri Pirko wrote: > Sat, Feb 24, 2018 at 12:59:04AM CET, step...@networkplumber.org wrote: > >On Thu, 22 Feb 2018 13:30:12 -0800 > >Alexander Duyck wrote: > > > >> > Again, I understand your motivation. Yet I don't like your solution. > >> > But if the decision is made to do this in-driver bonding, I would like > >> > to see it being done in some generic way: > >> > 1) share the same "in-driver bonding core" code with netvsc > >> >put to net/core. > >> > 2) the "in-driver bonding core" will strictly limit the functionality, > >> >like active-backup mode only, one vf, one backup, vf netdev type > >> >check (so no one could enslave a tap or anything else) > >> > If the user needs something more, he should employ team/bond. > > > >Sharing would be good, but netvsc world would really like to only have > >one visible network device. > > Why do you mind? All would be the same; there would just be another > netdevice unused by the VM user (same as the VF netdev). > I mind because our requirement is no changes to userspace. No special udev rules, no bonding script, no setup. Things like cloud-init running on current distros expect to see a single eth0. The VF device showing up can also be an issue because distros have stupid rules like Network Manager trying to start DHCP on every interface. We deal with that now by doing stuff like udev rules to get it to stop, but that is still causing user errors.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Sat, Feb 24, 2018 at 12:59:04AM CET, step...@networkplumber.org wrote: >On Thu, 22 Feb 2018 13:30:12 -0800 >Alexander Duyck wrote: > >> > Again, I understand your motivation. Yet I don't like your solution. >> > But if the decision is made to do this in-driver bonding, I would like >> > to see it being done in some generic way: >> > 1) share the same "in-driver bonding core" code with netvsc >> >put to net/core. >> > 2) the "in-driver bonding core" will strictly limit the functionality, >> >like active-backup mode only, one vf, one backup, vf netdev type >> >check (so no one could enslave a tap or anything else) >> > If the user needs something more, he should employ team/bond. > >Sharing would be good, but netvsc world would really like to only have >one visible network device. Why do you mind? All would be the same; there would just be another netdevice unused by the VM user (same as the VF netdev).
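The limits proposed for a shared "in-driver bonding core" could be enforced at enslave time: active-backup only, one VF, one backup, and a type check that rejects e.g. a tap. The sketch below is hypothetical. It leaves open how a real implementation would classify a lower device as a VF.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical classification of a candidate lower device. */
enum lower_kind { LOWER_VF, LOWER_VIRTIO_BACKUP, LOWER_OTHER };

enum failover_mode { FAILOVER_ACTIVE_BACKUP };

struct failover_core {
    bool has_vf;        /* at most one VF ... */
    bool has_backup;    /* ... and at most one backup */
};

/* Only active-backup is offered; anything else belongs in team/bond. */
static int failover_set_mode(int mode)
{
    return mode == FAILOVER_ACTIVE_BACKUP ? 0 : -EOPNOTSUPP;
}

static int failover_enslave(struct failover_core *c, enum lower_kind kind)
{
    switch (kind) {
    case LOWER_VF:
        if (c->has_vf)
            return -EBUSY;      /* a second VF needs team/bond */
        c->has_vf = true;
        return 0;
    case LOWER_VIRTIO_BACKUP:
        if (c->has_backup)
            return -EBUSY;
        c->has_backup = true;
        return 0;
    default:
        return -EINVAL;         /* e.g. a tap device: refuse */
    }
}
```

Rejecting at enslave time keeps the core deliberately small. Anything outside these limits fails early and points the user at team/bond.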
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Fri, Feb 23, 2018 at 3:59 PM, Stephen Hemminger wrote: > On Thu, 22 Feb 2018 13:30:12 -0800 > Alexander Duyck wrote: > >> > Again, I understand your motivation. Yet I don't like your solution. >> > But if the decision is made to do this in-driver bonding, I would like >> > to see it being done in some generic way: >> > 1) share the same "in-driver bonding core" code with netvsc >> >put to net/core. >> > 2) the "in-driver bonding core" will strictly limit the functionality, >> >like active-backup mode only, one vf, one backup, vf netdev type >> >check (so no one could enslave a tap or anything else) >> > If the user needs something more, he should employ team/bond. > > Sharing would be good, but netvsc world would really like to only have > one visible network device. Other than the netdev count, are there any other issues we need to be thinking about? If I am not mistaken, your netvsc doesn't put any broadcast/multicast filters on the VF. If we ended up doing that in order to support the virtio-based solution, would that cause any issues? I just realized we had overlooked dealing with multicast in our current solution, so we will probably be looking at syncing the multicast list like what occurs in netvsc; however, we will need to do it for both the VF and the virtio interfaces. Thanks. - Alex
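This is a userspace model of what syncing the multicast list to both lower devices would involve, assuming simple fixed-size lists. In the kernel, dev_mc_sync() is the existing helper for propagating an upper device's multicast list to a lower device. The structures below are hypothetical and only illustrate the add/remove bookkeeping.

```c
#include <stdint.h>
#include <string.h>

#define MAX_MC 16

/* Hypothetical stand-in for a netdev's multicast address list. */
struct mc_list {
    uint8_t addr[MAX_MC][6];
    int n;
};

static int mc_find(const struct mc_list *l, const uint8_t *a)
{
    for (int i = 0; i < l->n; i++)
        if (memcmp(l->addr[i], a, 6) == 0)
            return i;
    return -1;
}

/* Make one lower device's list match the upper (bypass) device's list:
 * drop addresses the upper no longer has, add the ones it gained.
 * Returns the number of entries that changed. */
static int mc_sync_one(struct mc_list *lower, const struct mc_list *upper)
{
    int changes = 0;

    for (int i = 0; i < lower->n;) {
        if (mc_find(upper, lower->addr[i]) < 0) {
            /* stale: replace with the last entry and shrink */
            memmove(lower->addr[i], lower->addr[lower->n - 1], 6);
            lower->n--;
            changes++;
        } else {
            i++;
        }
    }
    for (int i = 0; i < upper->n; i++) {
        if (lower->n < MAX_MC && mc_find(lower, upper->addr[i]) < 0) {
            memcpy(lower->addr[lower->n++], upper->addr[i], 6);
            changes++;
        }
    }
    return changes;
}

/* Both lower devices get the list; the VF may be absent (NULL) while it
 * is unplugged for migration. */
static void mc_sync_all(const struct mc_list *upper, struct mc_list *vf,
                        struct mc_list *virtio)
{
    if (vf)
        mc_sync_one(vf, upper);
    mc_sync_one(virtio, upper);
}
```

Keeping virtio in sync even while the VF is active means no multicast state needs to be rebuilt when traffic fails over to the backup.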
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Fri, Feb 23, 2018 at 4:03 PM, Stephen Hemminger wrote: > (pruned to reduce thread) > > On Wed, 21 Feb 2018 16:17:19 -0800 > Alexander Duyck wrote: > >> >>> FWIW two solutions that immediately come to mind are to export "backup" >> >>> as phys_port_name of the backup virtio link and/or assign a name to the >> >>> master like you are doing already. I think team uses team%d and bond >> >>> uses bond%d, soft naming of master devices seems quite natural in this >> >>> case. >> >> >> >> I figured I had overlooked something like that. Thanks for pointing >> >> this out. Okay so I think the phys_port_name approach might resolve >> >> the original issue. If I am reading things correctly what we end up >> >> with is the master showing up as "ens1" for example and the backup >> >> showing up as "ens1nbackup". Am I understanding that right? >> >> >> >> The problem with the team/bond%d approach is that it creates a new >> >> netdevice and so it would require guest configuration changes. >> >> >> >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio >> >>> link is quite neat. >> >> >> >> I agree. For non-"backup" virtio_net devices would it be okay for us to >> >> just return -EOPNOTSUPP? I assume it would be and that way the legacy >> >> behavior could be maintained although the function still exists. >> >> >> - When the 'active' netdev is unplugged OR not present on a destination >> system after live migration, the user will see 2 virtio_net netdevs. >> >>> >> >>> That's necessary and expected, all configuration applies to the master >> >>> so master must exist. >> >> >> >> With the naming issue resolved this is the only item left outstanding. >> >> This becomes a matter of form vs function. >> >> >> >> The main complaint about the "3 netdev" solution is that it is a bit confusing to >> >> have the 2 netdevs present if the VF isn't there. The idea is that >> >> having the extra "master" netdev there if there isn't really a bond is >> >> a bit ugly.
>> > >> > Is it this uglier in terms of user experience rather than >> > functionality? I don't want it dynamically changed between 2-netdev >> > and 3-netdev depending on the presence of VF. That gets back to my >> > original question and suggestion earlier: why not just hide the lower >> > netdevs from udev renaming and such? Which important observability >> > benefits users may get if exposing the lower netdevs? >> > >> > Thanks, >> > -Siwei >> >> The only real advantage to a 2 netdev solution is that it looks like >> the netvsc solution, however it doesn't behave like it since there are >> some features like XDP that may not function correctly if they are >> left enabled in the virtio_net interface. >> >> As far as functionality the advantage of not hiding the lower devices >> is that they are free to be managed. The problem with pushing all of >> the configuration into the upper device is that you are limited to the >> intersection of the features of the lower devices. This can be >> limiting for some setups as some VFs support things like more queues, >> or better interrupt moderation options than others so trying to make >> everything work with one config would be ugly. >> > > > Let's not make XDP the blocker for doing the best solution > from the end user point of view. XDP is just yet another offload > thing which needs to be handled. The current backup device solution > used in netvsc doesn't handle the full range of offload options > (things like flow direction, DCB, etc); no one but the HW vendors > seems to care. XDP isn't the blocker here. As far as I am concerned we can go either way, with a 2 netdev or a 3 netdev solution. We just need to make sure we are aware of all the trade-offs, and make a decision one way or the other. This is quickly turning into a bikeshed and I would prefer us to all agree, or at least disagree and commit, on which way to go before we burn more cycles on a patch set that seems to be getting tied up in debate. 
With the 2 netdev solution we have to limit the functionality so that we don't break things when we bypass the guts of the driver to hand traffic off to the VF. That ends up meaning that we are stuck with an extra qdisc and Tx queue lock in the transmit path of the VF, and we cannot rely on any in-driver Rx functionality to work, such as in-driver XDP. However, the advantage here is that this is how netvsc is already doing things. The issue with the 3 netdev solution is that you are stuck with 2 netdevs ("ens1", "ens1nbackup") when the VF is not present. It could be argued this isn't a very elegant looking solution, especially when the VF is not present. With virtio this makes more sense though, as you are still able to expose the full functionality of the lower device so you don't have to strip or drop any of the existing net device ops if the "backup" bit is present. Ultimately I would have preferred to have the 3 netdev solution go with virtio
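To make the feature trade-off quoted above concrete (an upper device that hides its lower devices can only advertise what all of them support), here is a minimal Python model. The feature names and sets are hypothetical placeholders, not real virtio_net or VF driver capability lists; the kernel tracks these as netdev feature flags, not Python sets.

```python
# Minimal model of feature negotiation for an upper netdev that hides its
# lower devices. Feature names below are hypothetical placeholders.
from typing import FrozenSet, Iterable, Optional

VIRTIO_FEATURES = frozenset({"rx-csum", "tx-csum", "tso", "xdp"})
VF_FEATURES = frozenset({"rx-csum", "tx-csum", "tso", "extra-queues", "irq-moderation"})


def upper_features(lowers: Iterable[FrozenSet[str]]) -> FrozenSet[str]:
    """Features the upper device can advertise: only those every lower
    device supports (the set intersection)."""
    result: Optional[FrozenSet[str]] = None
    for features in lowers:
        result = frozenset(features) if result is None else result & features
    return result if result is not None else frozenset()


def hidden_features(lower: FrozenSet[str], upper: FrozenSet[str]) -> FrozenSet[str]:
    """Lower-device features users lose if only the upper device is configurable."""
    return lower - upper
```

With these sets, `upper_features([VIRTIO_FEATURES, VF_FEATURES])` keeps only the three checksum/TSO entries, and the VF's extra queue and interrupt-moderation capabilities become unreachable, which is the limitation Alex describes.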
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Fri, Feb 23, 2018 at 2:38 PM, Jiri Pirko wrote: > Fri, Feb 23, 2018 at 11:22:36PM CET, losewe...@gmail.com wrote: > > [...] > No, that's not what I was talking about of course. I thought you mentioned the upgrade scenario this patch would like to address is to use the bypass interface "to take the place of the original virtio, and get udev to rename the bypass to what the original virtio_net was". That is one of the possible upgrade paths for sure. However the upgrade path I was seeking is to use the bypass interface to take the place of original VF interface while retaining the name and network configs, which generally can be done simply with kernel upgrade. It would become limiting as this patch makes the bypass interface share the same virtio pci device with virtio backup. Can this bypass interface be made general to take the place of any pci device other than virtio-net? This will be more helpful as the cloud users who have existing setup on VF interface don't have to recreate it on virtio-net and VF separately again. > > How could that work? If you have the VF netdev with all configuration > including IPs and routes and whatever - now you want to do migration > so you add virtio_net and do some weird in-driver bonding with it. But > then, VF disappears and the VF netdev with that and also all > configuration it had. > I don't think this scenario is valid. We are talking about making udev aware of the new virtio-bypass so that it rebinds the name of the old VF interface to the virtio-bypass interface *after the kernel upgrade*. Of course, this needs the virtio-net backend to supply the [bdf] info where the VF/PT device was located. -Siwei > > >>> >>> >>> Yes. This sounds interesting. Looks like you want an existing VM image with >>> VF only configuration to get transparent live migration support by adding >>> virtio_net with BACKUP feature. We may need another feature bit to switch >>> between these 2 options. >> >>Yes, that's what I was thinking about.
I have been building something >>like this before, and would like to get back after merging with your >>patch.
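Siwei's proposal depends on the virtio-net backend telling the guest the PCI [bdf] location where the VF/PT device used to live, so that udev can give the bypass master the old VF's name. No such interface existed at the time of this thread. The sketch below models only the guest-side matching step, under stated assumptions: `parse_bdf`, `name_for_bypass`, `PciAddr` and the `old_names` mapping are hypothetical names, and the hint is assumed to arrive as a domain:bus:device.function string.

```python
import re
from typing import Dict, NamedTuple, Optional

# PCI address in domain:bus:device.function form, e.g. "0000:00:05.0".
# Device numbers are 0x00-0x1f and function numbers 0-7.
_BDF_RE = re.compile(
    r"^(?P<domain>[0-9a-fA-F]{4}):(?P<bus>[0-9a-fA-F]{2}):"
    r"(?P<dev>[0-1][0-9a-fA-F])\.(?P<fn>[0-7])$"
)


class PciAddr(NamedTuple):
    domain: int
    bus: int
    dev: int
    fn: int


def parse_bdf(text: str) -> PciAddr:
    """Parse a bdf string into numbers so comparisons ignore letter case."""
    m = _BDF_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a PCI address: {text!r}")
    return PciAddr(*(int(m.group(g), 16) for g in ("domain", "bus", "dev", "fn")))


def name_for_bypass(bdf_hint: str, old_names: Dict[str, str]) -> Optional[str]:
    """Return the name the VF at bdf_hint used to have (hypothetical
    old_names: bdf -> ifname), or None if no VF was recorded there."""
    wanted = parse_bdf(bdf_hint)
    for addr, name in old_names.items():
        if parse_bdf(addr) == wanted:
            return name
    return None
```

Comparing parsed addresses instead of raw strings keeps "0000:00:0A.0" and "0000:00:0a.0" equivalent. As Sridhar notes later in the thread, the hard part is the host-to-guest message carrying the hint, not this lookup.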
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
(pruned to reduce thread) On Wed, 21 Feb 2018 16:17:19 -0800 Alexander Duyck wrote: > >>> FWIW two solutions that immediately come to mind is to export "backup" > >>> as phys_port_name of the backup virtio link and/or assign a name to the > >>> master like you are doing already. I think team uses team%d and bond > >>> uses bond%d, soft naming of master devices seems quite natural in this > >>> case. > >> > >> I figured I had overlooked something like that.. Thanks for pointing > >> this out. Okay so I think the phys_port_name approach might resolve > >> the original issue. If I am reading things correctly what we end up > >> with is the master showing up as "ens1" for example and the backup > >> showing up as "ens1nbackup". Am I understanding that right? > >> > >> The problem with the team/bond%d approach is that it creates a new > >> netdevice and so it would require guest configuration changes. > >> > >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio > >>> link is quite neat. > >> > >> I agree. For non-"backup" virtio_net devices would it be okay for us to > >> just return -EOPNOTSUPP? I assume it would be and that way the legacy > >> behavior could be maintained although the function still exists. > >> > - When the 'active' netdev is unplugged OR not present on a destination > system after live migration, the user will see 2 virtio_net netdevs. > >>> > >>> That's necessary and expected, all configuration applies to the master > >>> so master must exist. > >> > >> With the naming issue resolved this is the only item left outstanding. > >> This becomes a matter of form vs function. > >> > >> The main complaint about the "3 netdev" solution is a bit confusing to > >> have the 2 netdevs present if the VF isn't there. The idea is that > >> having the extra "master" netdev there if there isn't really a bond is > >> a bit ugly. > > > > Is it this uglier in terms of user experience rather than > > functionality?
I don't want it dynamically changed between 2-netdev > > and 3-netdev depending on the presence of VF. That gets back to my > > original question and suggestion earlier: why not just hide the lower > > netdevs from udev renaming and such? Which important observability > > benefits users may get if exposing the lower netdevs? > > > > Thanks, > > -Siwei > > The only real advantage to a 2 netdev solution is that it looks like > the netvsc solution, however it doesn't behave like it since there are > some features like XDP that may not function correctly if they are > left enabled in the virtio_net interface. > > As far as functionality the advantage of not hiding the lower devices > is that they are free to be managed. The problem with pushing all of > the configuration into the upper device is that you are limited to the > intersection of the features of the lower devices. This can be > limiting for some setups as some VFs support things like more queues, > or better interrupt moderation options than others so trying to make > everything work with one config would be ugly. > Let's not make XDP the blocker for doing the best solution from the end user point of view. XDP is just yet another offload thing which needs to be handled. The current backup device solution used in netvsc doesn't handle the full range of offload options (things like flow direction, DCB, etc); no one but the HW vendors seems to care.
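The phys_port_name idea quoted above can be modelled in a few lines. This is a simplified Python sketch, not kernel or udev code. It assumes a device reports "backup" only when the BACKUP feature is negotiated and otherwise fails with EOPNOTSUPP (as Alex proposes for legacy virtio_net). It also assumes the naming policy appends "n" plus the port name to the slot-based name, which is how the thread arrives at "ens1nbackup"; real udev output depends on its configured naming scheme.

```python
import errno


def get_phys_port_name(backup_feature: bool) -> str:
    """Model of the proposed behaviour: report "backup" only when the
    BACKUP feature bit is set; legacy virtio_net devices keep failing
    with EOPNOTSUPP, so their naming is unchanged."""
    if not backup_feature:
        raise OSError(errno.EOPNOTSUPP, "phys_port_name not supported")
    return "backup"


def predictable_name(slot_name: str, backup_feature: bool) -> str:
    """Append "n<phys_port_name>" to the slot-based name when available."""
    try:
        port = get_phys_port_name(backup_feature)
    except OSError as exc:
        if exc.errno != errno.EOPNOTSUPP:
            raise
        return slot_name
    return f"{slot_name}n{port}"
```

The master, which has no phys_port_name, keeps the slot name ("ens1"), so existing guest configuration still applies to it, while the backup slave becomes "ens1nbackup".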
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Thu, 22 Feb 2018 13:30:12 -0800 Alexander Duyck wrote: > > Again, I understand your motivation. Yet I don't like your solution. > > But if the decision is made to do this in-driver bonding. I would like > > to see it being done some generic way: > > 1) share the same "in-driver bonding core" code with netvsc > >put to net/core. > > 2) the "in-driver bonding core" will strictly limit the functionality, > >like active-backup mode only, one vf, one backup, vf netdev type > >check (so no one could enslave a tap or anything else) > > If user would need something more, he should employ team/bond. Sharing would be good, but the netvsc world would really like to have only one visible network device.
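The restrictions Jiri lists for a shared "in-driver bonding core" (active-backup only, one VF, one backup, and a type check on the VF netdev) amount to a small admission check at enslave time. A minimal Python sketch follows. It assumes a hypothetical `kind` tag on each netdev and the same-MAC pairing described in the patch cover letter; `Netdev`, `BypassCore`, `enslave` and `release` are illustrative names, not code from netvsc or virtio_net.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Netdev:
    name: str
    kind: str  # hypothetical type tag: "vf", "virtio", "tap", ...
    mac: str


class BypassCore:
    """Active-backup only: one backup slave (fixed at creation) and at
    most one active VF slave."""

    ACTIVE_KINDS = frozenset({"vf"})

    def __init__(self, backup: Netdev) -> None:
        self.backup = backup
        self.active: Optional[Netdev] = None

    def enslave(self, dev: Netdev) -> None:
        # Type check: refuse taps and anything else that is not a VF.
        if dev.kind not in self.ACTIVE_KINDS:
            raise ValueError(f"{dev.name}: type {dev.kind!r} cannot be enslaved")
        # Pair only with a VF carrying the backup's MAC address.
        if dev.mac.lower() != self.backup.mac.lower():
            raise ValueError(f"{dev.name}: MAC does not match {self.backup.name}")
        # At most one VF.
        if self.active is not None:
            raise ValueError(f"{self.active.name} is already the active slave")
        self.active = dev

    def release(self, dev: Netdev) -> None:
        if self.active is dev:
            self.active = None
```

Anything beyond this (more slaves, other modes, other device types) is rejected by design; per Jiri's proposal, users who need more would use team or bond.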
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Fri, Feb 23, 2018 at 11:22:36PM CET, losewe...@gmail.com wrote: [...] >>> >>> No, that's not what I was talking about of course. I thought you >>> mentioned the upgrade scenario this patch would like to address is to >>> use the bypass interface "to take the place of the original virtio, >>> and get udev to rename the bypass to what the original virtio_net >>> was". That is one of the possible upgrade paths for sure. However the >>> upgrade path I was seeking is to use the bypass interface to take the >>> place of original VF interface while retaining the name and network >>> configs, which generally can be done simply with kernel upgrade. It >>> would become limiting as this patch makes the bypass interface share >>> the same virtio pci device with virtio backup. Can this bypass >>> interface be made general to take place of any pci device other than >>> virtio-net? This will be more helpful as the cloud users who have >>> existing setup on VF interface don't have to recreate it on virtio-net >>> and VF separately again. How could that work? If you have the VF netdev with all configuration including IPs and routes and whatever - now you want to do migration so you add virtio_net and do some weird in-driver bonding with it. But then, VF disappears and the VF netdev with that and also all configuration it had. I don't think this scenario is valid. >> >> >> Yes. This sounds interesting. Looks like you want an existing VM image with >> VF only configuration to get transparent live migration support by adding >> virtio_net with BACKUP feature. We may need another feature bit to switch >> between these 2 options. > >Yes, that's what I was thinking about. I have been building something >like this before, and would like to get back after merging with your >patch.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 21, 2018 at 6:35 PM, Samudrala, Sridhar wrote: > On 2/21/2018 5:59 PM, Siwei Liu wrote: >> >> On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck >> wrote: >>> >>> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu wrote: I haven't checked emails for days and did not realize the new revision had already come out. And thank you for the effort, this revision really looks to be a step forward towards our use case and is close to what we wanted to do. A few questions in line. On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck wrote: > > On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: >> >> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: >>> >>> Patch 2 is in response to the community request for a 3 netdev >>> solution. However, it creates some issues we'll get into in a >>> moment. >>> It extends virtio_net to use alternate datapath when available and >>> registered. When BACKUP feature is enabled, virtio_net driver creates >>> an additional 'bypass' netdev that acts as a master device and >>> controls >>> 2 slave devices. The original virtio_net netdev is registered as >>> 'backup' netdev and a passthru/vf device with the same MAC gets >>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are >>> associated with the same 'pci' device. The user accesses the network >>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' >>> netdev >>> as default for transmits when it is available with link up and >>> running. >> >> Thank you for doing this. >> >>> We noticed a couple of issues with this approach during testing. >>> - As both 'bypass' and 'backup' netdevs are associated with the same >>>virtio pci device, udev tries to rename both of them with the same >>> name >>>and the 2nd rename will fail. This would be OK as long as the >>> first netdev >>>to be renamed is the 'bypass' netdev, but the order in which udev >>> gets >>>to rename the 2 netdevs is not reliable.
>> >> Out of curiosity - why do you link the master netdev to the virtio >> struct device? > > The basic idea of all this is that we wanted this to work with an > existing VM image that was using virtio. As such we were trying to > make it so that the bypass interface takes the place of the original > virtio and get udev to rename the bypass to what the original > virtio_net was. Could it made it also possible to take over the config from VF instead of virtio on an existing VM image? And get udev rename the bypass netdev to what the original VF was. I don't say tightly binding the bypass master to only virtio or VF, but I think we should provide both options to support different upgrade paths. Possibly we could tweak the device tree layout to reuse the same PCI slot for the master bypass netdev, such that udev would not get confused when renaming the device. The VF needs to use a different function slot afterwards. Perhaps we might need to a special multiseat like QEMU device for that purpose? Our case we'll upgrade the config from VF to virtio-bypass directly. >>> >>> So if I am understanding what you are saying you are wanting to flip >>> the backup interface from the virtio to a VF. The problem is that >>> becomes a bit of a vendor lock-in solution since it would rely on a >>> specific VF driver. I would agree with Jiri that we don't want to go >>> down that path. We don't want every VF out there firing up its own >>> separate bond. Ideally you want the hypervisor to be able to manage >>> all of this which is why it makes sense to have virtio manage this and >>> why this is associated with the virtio_net interface. >> >> No, that's not what I was talking about of course. I thought you >> mentioned the upgrade scenario this patch would like to address is to >> use the bypass interface "to take the place of the original virtio, >> and get udev to rename the bypass to what the original virtio_net >> was". That is one of the possible upgrade paths for sure. 
However the >> upgrade path I was seeking is to use the bypass interface to take the >> place of original VF interface while retaining the name and network >> configs, which generally can be done simply with kernel upgrade. It >> would become limiting as this patch makes the bypass interface share >> the same virtio pci device with virtio backup. Can this bypass >> interface be made general to take place of any pci device other than >> virtio-net? This will be more helpful as the cloud users who have >> existing setup on VF interface don't have to recreate it on virtio-net >> and VF separately again. > > > Yes. This sounds interesting. Looks like
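The rename problem reported in the cover letter is order-dependent: both 'bypass' and 'backup' hang off the same virtio PCI device, so udev computes the same name for both, and whichever rename runs second fails. A toy Python model makes the dependence visible. It is not udev: `slot_names` is a hypothetical stand-in for udev's naming policy, the kernel names eth0/eth1 are illustrative, and a failed rename simply leaves the kernel name in place.

```python
from typing import Dict, List, Tuple

# (kernel name, parent PCI address, role); values are illustrative only.
Netdev = Tuple[str, str, str]


def udev_rename(order: List[Netdev], slot_names: Dict[str, str]) -> Dict[str, str]:
    """Rename devices in event order; a device asking for a name that is
    already taken keeps its kernel name (the rename fails with EEXIST)."""
    taken = set()
    result: Dict[str, str] = {}
    for kernel_name, pci_addr, role in order:
        wanted = slot_names[pci_addr]
        if wanted in taken:
            result[role] = kernel_name  # rename fails
        else:
            taken.add(wanted)
            result[role] = wanted
    return result


SLOT_NAMES = {"0000:00:03.0": "ens3"}
BACKUP = ("eth0", "0000:00:03.0", "backup")
BYPASS = ("eth1", "0000:00:03.0", "bypass")
```

Processing `[BYPASS, BACKUP]` gives the user-facing 'bypass' master the expected name "ens3"; processing `[BACKUP, BYPASS]` hands "ens3" to the backup slave and leaves the master as "eth1", breaking existing guest configuration.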
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Thu, Feb 22, 2018 at 12:11 AM, Jiri Pirko wrote: > Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.du...@gmail.com wrote: >>On Wed, Feb 21, 2018 at 11:38 AM, Jiri Pirko wrote: >>> Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.du...@gmail.com wrote: On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko wrote: > Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote: >>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko wrote: >>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote: On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko wrote: > Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote: >>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote: >>> Yeah, I can see it now :( I guess that the ship has sailed and we >>> are >>> stuck with this ugly thing forever... >>> >>> Could you at least make some common code that is shared in between >>> netvsc and virtio_net so this is handled in exactly the same way in >>> both? >> >>IMHO netvsc is a vendor specific driver which made a mistake on what >>behaviour it provides (or tried to align itself with Windows SR-IOV). >>Let's not make a far, far more commonly deployed and important driver >>(virtio) bug-compatible with netvsc. > > Yeah. netvsc solution is a dangerous precedent here and in my > opinion > it was a huge mistake to merge it. I personally would vote to unmerge > it > and make the solution based on team/bond. > > >> >>To Jiri's initial comments, I feel the same way, in fact I've talked >>to >>the NetworkManager guys to get auto-bonding based on MACs handled in >>user space. I think it may very well get done in next versions of NM, >>but isn't done yet. Stephen also raised the point that not everybody >>is >>using NM. > > Can be done in NM, networkd or other network management tools. > Even easier to do this in teamd and let them all benefit. > > Actually, I took a stab to implement this in teamd. Took me like an > hour > and half.
> > You can just run teamd with config option "kidnap" like this: > # teamd/teamd -c '{"kidnap": true }' > > Whenever teamd sees another netdev to appear with the same mac as his, > or whenever teamd sees another netdev to change mac to his, > it enslaves it. > > Here's the patch (quick and dirty): > > Subject: [patch teamd] teamd: introduce kidnap feature > > Signed-off-by: Jiri Pirko So this doesn't really address the original problem we were trying to solve. You asked earlier why the netdev name mattered and it mostly has to do with configuration. Specifically what our patch is attempting to resolve is the issue of how to allow a cloud provider to upgrade their customer to SR-IOV support and live migration without requiring them to reconfigure their guest. So the general idea with our patch is to take a VM that is running with virtio_net only and allow it to instead spawn a virtio_bypass master using the same netdev name as the original virtio, and then have the virtio_net and VF come up and be enslaved by the bypass interface. Doing it this way we can allow for multi-vendor SR-IOV live migration support using a guest that was originally configured for virtio only. The problem with your solution is we already have teaming and bonding as you said. There is already a write-up from Red Hat on how to do it (https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts). That is all well and good as long as you are willing to keep around two VM images, one for virtio, and one for SR-IOV with live migration. >>> >>> You don't need 2 images. You need only one. The one with the team setup. >>> That's it. If another netdev with the same mac appears, teamd will >>> enslave it and run traffic on it. If not, ok, you'll go only through >>> virtio_net. >> >>Isn't that going to cause the routing table to get messed up when we >>rearrange the netdevs? 
We don't want to have any significant disruption >> in traffic when we are adding/removing the VF. It seems like we would >>need to invalidate any entries that were configured for the virtio_net >>and reestablish them on the new team interface. Part of the criteria >>we have been working with is that we should be able to transition from
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Thu, Feb 22, 2018 at 5:07 AM, Jiri Pirko wrote: > Thu, Feb 22, 2018 at 12:54:45PM CET, gerlitz...@gmail.com wrote: >>On Thu, Feb 22, 2018 at 10:11 AM, Jiri Pirko wrote: >>> Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.du...@gmail.com wrote: >> The signaling isn't too much of an issue since we can just tweak the link state of the VF or virtio manually to report the link up or down prior to the hot-plug. Now that we are on the same page with the team0 >> >>> Oh, so you just do "ip link set vfrepresentor down" in the host. >>> That makes sense. I'm pretty sure that this is not implemented for all >>> drivers now. >> >>mlx5 supports that, on the representor close ndo we take the VF link >>operational v-link down >> >>We should probably also put into the picture some/more aspects >>from the host side of things. The provisioning of the v-switch now >>have to deal with two channels going into the VM, the PV (virtio) >>one and the PT (VF) one. >> >>This should probably boil down to apply teaming/bonding between >>the VF representor and a PV backend device, e.g. TAP. > > Yes, that is correct. That was my thought on it. If you wanted to, you could probably even look at making the PV the active one in the pair from the host side if you wanted to avoid the PCIe overhead for things like broadcast/multicast. The only limitation is that you might need to have the bond take care of the appropriate switchdev bits so that you still program rules into the hardware even if you are transmitting down the PV side of the device. For legacy setups I still need to work on putting together a source mode macvlan based setup to handle acting like port representors for the VFs and uplink. - Alex
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Thu, Feb 22, 2018 at 12:54:45PM CET, gerlitz...@gmail.com wrote: >On Thu, Feb 22, 2018 at 10:11 AM, Jiri Pirko wrote: >> Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.du...@gmail.com wrote: > >>>The signaling isn't too much of an issue since we can just tweak the >>>link state of the VF or virtio manually to report the link up or down >>>prior to the hot-plug. Now that we are on the same page with the team0 > >> Oh, so you just do "ip link set vfrepresentor down" in the host. >> That makes sense. I'm pretty sure that this is not implemented for all >> drivers now. > >mlx5 supports that, on the representor close ndo we take the VF link >operational v-link down > >We should probably also put into the picture some/more aspects >from the host side of things. The provisioning of the v-switch now >have to deal with two channels going into the VM, the PV (virtio) >one and the PT (VF) one. > >This should probably boil down to apply teaming/bonding between >the VF representor and a PV backend device, e.g. TAP. Yes, that is correct.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Thu, Feb 22, 2018 at 10:11 AM, Jiri Pirko wrote: > Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.du...@gmail.com wrote: >>The signaling isn't too much of an issue since we can just tweak the >>link state of the VF or virtio manually to report the link up or down >>prior to the hot-plug. Now that we are on the same page with the team0 > Oh, so you just do "ip link set vfrepresentor down" in the host. > That makes sense. I'm pretty sure that this is not implemented for all > drivers now. mlx5 supports that: on the representor close ndo we take the VF link operational v-link down. We should probably also put into the picture some/more aspects from the host side of things. The provisioning of the v-switch now has to deal with two channels going into the VM, the PV (virtio) one and the PT (VF) one. This should probably boil down to applying teaming/bonding between the VF representor and a PV backend device, e.g. TAP.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.du...@gmail.com wrote: >On Wed, Feb 21, 2018 at 11:38 AM, Jiri Pirko wrote: >> Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.du...@gmail.com wrote: >>>On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko wrote: Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote: >On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko wrote: >> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote: >>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko wrote: Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote: >On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote: >> Yeah, I can see it now :( I guess that the ship has sailed and we are >> stuck with this ugly thing forever... >> >> Could you at least make some common code that is shared in between >> netvsc and virtio_net so this is handled in exactly the same way in >> both? > >IMHO netvsc is a vendor specific driver which made a mistake on what >behaviour it provides (or tried to align itself with Windows SR-IOV). >Let's not make a far, far more commonly deployed and important driver >(virtio) bug-compatible with netvsc. Yeah. netvsc solution is a dangerous precedent here and in my opinion it was a huge mistake to merge it. I personally would vote to unmerge it and make the solution based on team/bond. > >To Jiri's initial comments, I feel the same way, in fact I've talked to >the NetworkManager guys to get auto-bonding based on MACs handled in >user space. I think it may very well get done in next versions of NM, >but isn't done yet. Stephen also raised the point that not everybody >is >using NM. Can be done in NM, networkd or other network management tools. Even easier to do this in teamd and let them all benefit. Actually, I took a stab to implement this in teamd. Took me like an hour and half.
You can just run teamd with config option "kidnap" like this: # teamd/teamd -c '{"kidnap": true }' Whenever teamd sees another netdev to appear with the same mac as his, or whenever teamd sees another netdev to change mac to his, it enslaves it. Here's the patch (quick and dirty): Subject: [patch teamd] teamd: introduce kidnap feature Signed-off-by: Jiri Pirko >>> >>>So this doesn't really address the original problem we were trying to >>>solve. You asked earlier why the netdev name mattered and it mostly >>>has to do with configuration. Specifically what our patch is >>>attempting to resolve is the issue of how to allow a cloud provider to >>>upgrade their customer to SR-IOV support and live migration without >>>requiring them to reconfigure their guest. So the general idea with >>>our patch is to take a VM that is running with virtio_net only and >>>allow it to instead spawn a virtio_bypass master using the same netdev >>>name as the original virtio, and then have the virtio_net and VF come >>>up and be enslaved by the bypass interface. Doing it this way we can >>>allow for multi-vendor SR-IOV live migration support using a guest >>>that was originally configured for virtio only. >>> >>>The problem with your solution is we already have teaming and bonding >>>as you said. There is already a write-up from Red Hat on how to do it >>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts). >>>That is all well and good as long as you are willing to keep around >>>two VM images, one for virtio, and one for SR-IOV with live migration. >> >> You don't need 2 images. You need only one. The one with the team setup. >> That's it. If another netdev with the same mac appears, teamd will >> enslave it and run traffic on it. If not, ok, you'll go only through >> virtio_net. > >Isn't that going to cause the routing table to get messed up when we >rearrange the netdevs? 
We don't want to have any significant disruption > in traffic when we are adding/removing the VF. It seems like we would >need to invalidate any entries that were configured for the virtio_net >and reestablish them on the new team interface. Part of the criteria >we have been working with is that we should be able to transition from >having a VF to not or vice versa without seeing any significant >disruption in the traffic. What? You have routes on the team netdev. virtio_net and VF are only slaves. What are
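Jiri's teamd "kidnap" option, as he describes it, enslaves any netdev that appears with the team's MAC address or later changes its MAC to match. His patch is not reproduced in this thread, so the following is only a Python model of that rule, not teamd code. `Team`, `on_new_link` and `on_mac_change` are hypothetical names standing in for teamd's netlink event handling.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Team:
    name: str
    mac: str
    kidnap: bool = True  # models the proposed '{"kidnap": true}' option
    ports: Set[str] = field(default_factory=set)

    def _maybe_enslave(self, ifname: str, mac: str) -> bool:
        """Enslave ifname if kidnapping is on and its MAC equals ours."""
        if not self.kidnap or ifname == self.name or ifname in self.ports:
            return False
        if mac.lower() != self.mac.lower():
            return False
        self.ports.add(ifname)
        return True

    def on_new_link(self, ifname: str, mac: str) -> bool:
        """A new netdev appeared (e.g. a hot-plugged VF)."""
        return self._maybe_enslave(ifname, mac)

    def on_mac_change(self, ifname: str, new_mac: str) -> bool:
        """An existing netdev changed its MAC address."""
        return self._maybe_enslave(ifname, new_mac)
```

Because routes and addresses live on the team netdev, a hot-plugged VF that matches the MAC only becomes an extra port; this is the basis of Jiri's reply that the slaves coming and going does not touch the routing configuration.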
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On 2/21/2018 6:35 PM, Samudrala, Sridhar wrote: On 2/21/2018 5:59 PM, Siwei Liu wrote: On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck wrote: On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu wrote: I haven't checked emails for days and did not realize the new revision had already come out. And thank you for the effort, this revision really looks to be a step forward towards our use case and is close to what we wanted to do. A few questions in line. On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck wrote: On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: Patch 2 is in response to the community request for a 3 netdev solution. However, it creates some issues we'll get into in a moment. It extends virtio_net to use alternate datapath when available and registered. When BACKUP feature is enabled, virtio_net driver creates an additional 'bypass' netdev that acts as a master device and controls 2 slave devices. The original virtio_net netdev is registered as 'backup' netdev and a passthru/vf device with the same MAC gets registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are associated with the same 'pci' device. The user accesses the network interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev as default for transmits when it is available with link up and running. Thank you for doing this. We noticed a couple of issues with this approach during testing. - As both 'bypass' and 'backup' netdevs are associated with the same virtio pci device, udev tries to rename both of them with the same name and the 2nd rename will fail. This would be OK as long as the first netdev to be renamed is the 'bypass' netdev, but the order in which udev gets to rename the 2 netdevs is not reliable. Out of curiosity - why do you link the master netdev to the virtio struct device? The basic idea of all this is that we wanted this to work with an existing VM image that was using virtio.
As such we were trying to make it so that the bypass interface takes the place of the original virtio and get udev to rename the bypass to what the original virtio_net was. Could it made it also possible to take over the config from VF instead of virtio on an existing VM image? And get udev rename the bypass netdev to what the original VF was. I don't say tightly binding the bypass master to only virtio or VF, but I think we should provide both options to support different upgrade paths. Possibly we could tweak the device tree layout to reuse the same PCI slot for the master bypass netdev, such that udev would not get confused when renaming the device. The VF needs to use a different function slot afterwards. Perhaps we might need to a special multiseat like QEMU device for that purpose? Our case we'll upgrade the config from VF to virtio-bypass directly. So if I am understanding what you are saying you are wanting to flip the backup interface from the virtio to a VF. The problem is that becomes a bit of a vendor lock-in solution since it would rely on a specific VF driver. I would agree with Jiri that we don't want to go down that path. We don't want every VF out there firing up its own separate bond. Ideally you want the hypervisor to be able to manage all of this which is why it makes sense to have virtio manage this and why this is associated with the virtio_net interface. No, that's not what I was talking about of course. I thought you mentioned the upgrade scenario this patch would like to address is to use the bypass interface "to take the place of the original virtio, and get udev to rename the bypass to what the original virtio_net was". That is one of the possible upgrade paths for sure. However the upgrade path I was seeking is to use the bypass interface to take the place of original VF interface while retaining the name and network configs, which generally can be done simply with kernel upgrade. 
It would become limiting as this patch makes the bypass interface share the same virtio pci device with virito backup. Can this bypass interface be made general to take place of any pci device other than virtio-net? This will be more helpful as the cloud users who has existing setup on VF interface don't have to recreate it on virtio-net and VF separately again. Yes. This sounds interesting. Looks like you want an existing VM image with VF only configuration to get transparent live migration support by adding virtio_net with BACKUP feature. We may need another feature bit to switch between these 2 options. After thinking some more, this may be more involved than adding a new feature bit. This requires a netdev created by virtio to take over the name of a VF netdev associated with a PCI device that may not be plugged in when the virtio driver is coming up. This definitely requires some new messages
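The transmit policy in the patch summary quoted above (the 'bypass' master prefers the 'active' VF while it is present with link up and running, and otherwise uses the 'backup' virtio_net) can be sketched in plain C. This is a hypothetical user-space model for illustration only: `struct slave_sketch`, `slave_usable()` and `bypass_select_tx()` are invented names, and the `running`/`carrier` flags merely stand in for the kernel's `netif_running()` and `netif_carrier_ok()` checks. It is not the patch's code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a slave netdev (not struct net_device). */
struct slave_sketch {
    bool present;  /* registered with the bypass master */
    bool running;  /* interface is up; stands in for netif_running() */
    bool carrier;  /* link is up; stands in for netif_carrier_ok() */
};

/* A slave can carry traffic only if it exists, is up, and has link. */
static bool slave_usable(const struct slave_sketch *s)
{
    return s != NULL && s->present && s->running && s->carrier;
}

/*
 * Pick the transmit slave: prefer 'active' (the VF) whenever it is usable,
 * fall back to 'backup' (virtio_net) otherwise. Returns NULL if neither
 * slave can transmit; a real driver would typically drop the packet then.
 */
static const struct slave_sketch *
bypass_select_tx(const struct slave_sketch *active,
                 const struct slave_sketch *backup)
{
    if (slave_usable(active))
        return active;
    if (slave_usable(backup))
        return backup;
    return NULL;
}
```

Falling back to the backup slave rather than failing is what keeps the datapath alive while the VF is hot-unplugged around a live migration.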
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On 2/21/2018 5:59 PM, Siwei Liu wrote: On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck wrote: On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu wrote: I haven't checked emails for days and did not realize the new revision had already come out. And thank you for the effort, this revision really looks to be a step forward towards our use case and is close to what we wanted to do. A few questions in line. On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck wrote: On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: Patch 2 is in response to the community request for a 3 netdev solution. However, it creates some issues we'll get into in a moment. It extends virtio_net to use alternate datapath when available and registered. When BACKUP feature is enabled, virtio_net driver creates an additional 'bypass' netdev that acts as a master device and controls 2 slave devices. The original virtio_net netdev is registered as 'backup' netdev and a passthru/vf device with the same MAC gets registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are associated with the same 'pci' device. The user accesses the network interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev as default for transmits when it is available with link up and running. Thank you for doing this. We noticed a couple of issues with this approach during testing. - As both 'bypass' and 'backup' netdevs are associated with the same virtio pci device, udev tries to rename both of them with the same name and the 2nd rename will fail. This would be OK as long as the first netdev to be renamed is the 'bypass' netdev, but the order in which udev gets to rename the 2 netdevs is not reliable. Out of curiosity - why do you link the master netdev to the virtio struct device? The basic idea of all this is that we wanted this to work with an existing VM image that was using virtio.
As such we were trying to make it so that the bypass interface takes the place of the original virtio and get udev to rename the bypass to what the original virtio_net was. Could it also be made possible to take over the config from the VF instead of virtio on an existing VM image, and have udev rename the bypass netdev to what the original VF was? I'm not saying we should tightly bind the bypass master to only virtio or VF, but I think we should provide both options to support different upgrade paths. Possibly we could tweak the device tree layout to reuse the same PCI slot for the master bypass netdev, such that udev would not get confused when renaming the device. The VF needs to use a different function slot afterwards. Perhaps we might need a special multiseat-like QEMU device for that purpose? In our case we'll upgrade the config from VF to virtio-bypass directly. So if I am understanding what you are saying, you are wanting to flip the backup interface from the virtio to a VF. The problem is that this becomes a bit of a vendor lock-in solution since it would rely on a specific VF driver. I would agree with Jiri that we don't want to go down that path. We don't want every VF out there firing up its own separate bond. Ideally you want the hypervisor to be able to manage all of this, which is why it makes sense to have virtio manage this and why this is associated with the virtio_net interface. No, that's not what I was talking about of course. I thought you mentioned the upgrade scenario this patch would like to address is to use the bypass interface "to take the place of the original virtio, and get udev to rename the bypass to what the original virtio_net was". That is one of the possible upgrade paths for sure. However the upgrade path I was seeking is to use the bypass interface to take the place of original VF interface while retaining the name and network configs, which generally can be done simply with kernel upgrade.
It would become limiting as this patch makes the bypass interface share the same virtio pci device with the virtio backup. Can this bypass interface be made general to take the place of any pci device other than virtio-net? This would be more helpful, as cloud users who have an existing setup on a VF interface wouldn't have to recreate it on virtio-net and VF separately again. Yes. This sounds interesting. It looks like you want an existing VM image with a VF-only configuration to get transparent live migration support by adding virtio_net with the BACKUP feature. We may need another feature bit to switch between these 2 options. The other bits get into more complexity than we are ready to handle for now. I think I might have talked about something similar that I was referring to as a "virtio-bond" where you would have a PCI/PCIe tree topology that makes this easier to sort out, and the "virtio-bond" would be used to handle coordination/configuration of a much more complex interface. That was
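The udev problem reported above comes down to two netdevs deriving the same name from the same parent PCI device: whichever rename runs first wins, and the second is rejected because the kernel refuses to rename an interface to a name that is already in use (`-EEXIST`). The toy model below is a hypothetical illustration; the `ifname_table` type and the `find_if()`/`rename_if()` helpers are invented and are not udev or kernel code. It only shows that the outcome depends purely on ordering.

```c
#include <errno.h>
#include <string.h>

#define NAME_SZ 16   /* same size as the kernel's IFNAMSIZ */
#define MAX_IFS 8

/* Hypothetical table of interface names currently in use. */
struct ifname_table {
    char names[MAX_IFS][NAME_SZ];
    int count;
};

/* Return the index of 'name' in the table, or -1 if it is not present. */
static int find_if(const struct ifname_table *t, const char *name)
{
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0)
            return i;
    return -1;
}

/*
 * Rename 'oldname' to 'newname'. Like the kernel, refuse a name that is
 * already taken; here that is reported as -EEXIST.
 */
static int rename_if(struct ifname_table *t, const char *oldname,
                     const char *newname)
{
    int idx = find_if(t, oldname);

    if (idx < 0)
        return -ENODEV;
    if (find_if(t, newname) >= 0)
        return -EEXIST;
    strncpy(t->names[idx], newname, NAME_SZ - 1);
    t->names[idx][NAME_SZ - 1] = '\0';
    return 0;
}
```

For example, if the rename of the 'backup' netdev happens to run first it takes the slot-derived name, and the 'bypass' master, which is the interface users are meant to configure, keeps its kernel-assigned name.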
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On 2/21/2018 6:02 PM, Jakub Kicinski wrote: On Wed, 21 Feb 2018 12:57:09 -0800, Alexander Duyck wrote: I don't see why the team cannot be there always. It is more the logistical nightmare. Part of the goal here was to work with the cloud base images that are out there such as https://alt.fedoraproject.org/cloud/. With just the kernel changes the overhead for this stays fairly small and would be pulled in as just a standard part of the kernel update process. The virtio bypass only pops up if the backup bit is present. With the team solution it requires that the base image use the team driver on virtio_net when it sees one. I doubt the OSVs would want to do that just because SR-IOV isn't that popular of a case. IIUC we need to monitor for a "backup hint", spawn the master, rename it to maintain backwards compatibility with no-VF setups and enslave the VF if it appears. All those sound possible from user space, the advantage of the kernel solution right now is that it has more complete code. Am I misunderstanding? I think there is some misunderstanding about the exact requirement and the use case we are trying to solve. If the guest is allowed to do this configuration, we already have a solution with either bond/team based user space configuration. This is to enable cloud service providers to provide an accelerated datapath by simply letting tenants bring their own images, with the only requirement being that their kernels include the newer virtio_net driver with BACKUP support and the VF driver. To recap from an earlier thread, here is a response from Stephen that talks about the requirement for the netvsc solution, and we would like to provide a similar solution for KVM based cloud deployments. > The requirement with Azure accelerated network was that a stock distribution image from the > store must be able to run unmodified and get accelerated networking. > Not sure if other environments need to work the same, but it would be nice.
> That meant no additional setup scripts (aka no bonding) and also it must > work transparently with hot-plug. Also there is a diverse set of environments: > openstack, cloudinit, network manager and systemd. The solution had to not depend > on any one of them, but also not break any of them. Thanks Sridhar
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, 21 Feb 2018 12:57:09 -0800, Alexander Duyck wrote: > > I don't see why the team cannot be there always. > > It is more the logistical nightmare. Part of the goal here was to work > with the cloud base images that are out there such as > https://alt.fedoraproject.org/cloud/. With just the kernel changes the > overhead for this stays fairly small and would be pulled in as just a > standard part of the kernel update process. The virtio bypass only > pops up if the backup bit is present. With the team solution it > requires that the base image use the team driver on virtio_net when it > sees one. I doubt the OSVs would want to do that just because SR-IOV > isn't that popular of a case. IIUC we need to monitor for a "backup hint", spawn the master, rename it to maintain backwards compatibility with no-VF setups and enslave the VF if it appears. All those sound possible from user space, the advantage of the kernel solution right now is that it has more complete code. Am I misunderstanding?
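Jakub's summary of what a user-space agent would need to do (watch for a "backup hint", create the master, give it the original name, and enslave a VF when one appears) can be expressed as a small event handler. This is a hypothetical sketch, not teamd, NetworkManager or kernel code: the `link_event` structure, the `backup_hint` flag and the action names are all assumptions, and a real agent would receive these events over rtnetlink and then carry out the chosen actions itself.

```c
#include <stdbool.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical link event; a real agent would decode RTM_NEWLINK/RTM_DELLINK. */
struct link_event {
    bool added;                  /* true: link appeared, false: link removed */
    unsigned char mac[MAC_LEN];
    bool backup_hint;            /* assumed flag: virtio_net advertises BACKUP */
};

enum agent_action {
    ACT_NONE,
    ACT_CREATE_MASTER,   /* spawn master, give it the original name, enslave virtio */
    ACT_ENSLAVE_VF,      /* a second link with the master's MAC appeared */
    ACT_RELEASE_VF,      /* that VF went away; traffic stays on virtio */
};

struct agent_state {
    bool have_master;
    bool have_vf;
    unsigned char mac[MAC_LEN];
};

/*
 * Decide what to do for one event. Events generated by the agent's own
 * master device are assumed to be filtered out by the caller, and removal
 * of the virtio link itself is not modelled.
 */
static enum agent_action agent_handle(struct agent_state *st,
                                      const struct link_event *ev)
{
    bool same_mac = st->have_master && memcmp(ev->mac, st->mac, MAC_LEN) == 0;

    if (ev->added) {
        if (!st->have_master && ev->backup_hint) {
            st->have_master = true;
            memcpy(st->mac, ev->mac, MAC_LEN);
            return ACT_CREATE_MASTER;
        }
        if (same_mac && !st->have_vf) {
            st->have_vf = true;
            return ACT_ENSLAVE_VF;
        }
        return ACT_NONE;
    }
    if (same_mac && st->have_vf) {
        st->have_vf = false;
        return ACT_RELEASE_VF;
    }
    return ACT_NONE;
}
```

The logic itself is small; as Sridhar's reply points out, the harder requirement is that a stock guest image does this without any extra user-space setup.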
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck wrote: > On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu wrote: >> I haven't checked emails for days and did not realize the new revision >> had already come out. And thank you for the effort, this revision >> really looks to be a step forward towards our use case and is close to >> what we wanted to do. A few questions in line. >> >> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck >> wrote: >>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: > Patch 2 is in response to the community request for a 3 netdev > solution. However, it creates some issues we'll get into in a moment. > It extends virtio_net to use alternate datapath when available and > registered. When BACKUP feature is enabled, virtio_net driver creates > an additional 'bypass' netdev that acts as a master device and controls > 2 slave devices. The original virtio_net netdev is registered as > 'backup' netdev and a passthru/vf device with the same MAC gets > registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are > associated with the same 'pci' device. The user accesses the network > interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev > as default for transmits when it is available with link up and running. Thank you for doing this. > We noticed a couple of issues with this approach during testing. > - As both 'bypass' and 'backup' netdevs are associated with the same > virtio pci device, udev tries to rename both of them with the same name > and the 2nd rename will fail. This would be OK as long as the first > netdev > to be renamed is the 'bypass' netdev, but the order in which udev gets > to rename the 2 netdevs is not reliable. Out of curiosity - why do you link the master netdev to the virtio struct device? >>> >>> The basic idea of all this is that we wanted this to work with an >>> existing VM image that was using virtio.
As such we were trying to >>> make it so that the bypass interface takes the place of the original >>> virtio and get udev to rename the bypass to what the original >>> virtio_net was. >> >> Could it also be made possible to take over the config from the VF instead >> of virtio on an existing VM image, and have udev rename the bypass >> netdev to what the original VF was? I'm not saying we should tightly bind the >> bypass master to only virtio or VF, but I think we should provide both >> options to support different upgrade paths. Possibly we could tweak >> the device tree layout to reuse the same PCI slot for the master >> bypass netdev, such that udev would not get confused when renaming the >> device. The VF needs to use a different function slot afterwards. >> Perhaps we might need a special multiseat-like QEMU device for that >> purpose? >> >> In our case we'll upgrade the config from VF to virtio-bypass directly. > > So if I am understanding what you are saying, you are wanting to flip > the backup interface from the virtio to a VF. The problem is that this > becomes a bit of a vendor lock-in solution since it would rely on a > specific VF driver. I would agree with Jiri that we don't want to go > down that path. We don't want every VF out there firing up its own > separate bond. Ideally you want the hypervisor to be able to manage > all of this, which is why it makes sense to have virtio manage this and > why this is associated with the virtio_net interface. No, that's not what I was talking about of course. I thought you mentioned the upgrade scenario this patch would like to address is to use the bypass interface "to take the place of the original virtio, and get udev to rename the bypass to what the original virtio_net was". That is one of the possible upgrade paths for sure.
However the upgrade path I was seeking is to use the bypass interface to take the place of original VF interface while retaining the name and network configs, which generally can be done simply with kernel upgrade. It would become limiting as this patch makes the bypass interface share the same virtio pci device with the virtio backup. Can this bypass interface be made general to take the place of any pci device other than virtio-net? This would be more helpful, as cloud users who have an existing setup on a VF interface wouldn't have to recreate it on virtio-net and VF separately again. > > The other bits get into more complexity than we are ready to handle > for now. I think I might have talked about something similar that I > was referring to as a "virtio-bond" where you would have a PCI/PCIe > tree topology that makes this easier to sort out, and the "virtio-bond" > would be used to handle coordination/configuration of a much more > complex interface. That was one way to solve this problem but I'd like to see
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu wrote: > I haven't checked emails for days and did not realize the new revision > had already come out. And thank you for the effort, this revision > really looks to be a step forward towards our use case and is close to > what we wanted to do. A few questions in line. > > On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck > wrote: >> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: >>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: Patch 2 is in response to the community request for a 3 netdev solution. However, it creates some issues we'll get into in a moment. It extends virtio_net to use alternate datapath when available and registered. When BACKUP feature is enabled, virtio_net driver creates an additional 'bypass' netdev that acts as a master device and controls 2 slave devices. The original virtio_net netdev is registered as 'backup' netdev and a passthru/vf device with the same MAC gets registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are associated with the same 'pci' device. The user accesses the network interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev as default for transmits when it is available with link up and running. >>> >>> Thank you for doing this. >>> We noticed a couple of issues with this approach during testing. - As both 'bypass' and 'backup' netdevs are associated with the same virtio pci device, udev tries to rename both of them with the same name and the 2nd rename will fail. This would be OK as long as the first netdev to be renamed is the 'bypass' netdev, but the order in which udev gets to rename the 2 netdevs is not reliable. >>> >>> Out of curiosity - why do you link the master netdev to the virtio >>> struct device? >> >> The basic idea of all this is that we wanted this to work with an >> existing VM image that was using virtio.
As such we were trying to >> make it so that the bypass interface takes the place of the original >> virtio and get udev to rename the bypass to what the original >> virtio_net was. > > Could it also be made possible to take over the config from the VF instead > of virtio on an existing VM image, and have udev rename the bypass > netdev to what the original VF was? I'm not saying we should tightly bind the > bypass master to only virtio or VF, but I think we should provide both > options to support different upgrade paths. Possibly we could tweak > the device tree layout to reuse the same PCI slot for the master > bypass netdev, such that udev would not get confused when renaming the > device. The VF needs to use a different function slot afterwards. > Perhaps we might need a special multiseat-like QEMU device for that > purpose? > > In our case we'll upgrade the config from VF to virtio-bypass directly. So if I am understanding what you are saying, you are wanting to flip the backup interface from the virtio to a VF. The problem is that this becomes a bit of a vendor lock-in solution since it would rely on a specific VF driver. I would agree with Jiri that we don't want to go down that path. We don't want every VF out there firing up its own separate bond. Ideally you want the hypervisor to be able to manage all of this, which is why it makes sense to have virtio manage this and why this is associated with the virtio_net interface. The other bits get into more complexity than we are ready to handle for now. I think I might have talked about something similar that I was referring to as a "virtio-bond" where you would have a PCI/PCIe tree topology that makes this easier to sort out, and the "virtio-bond" would be used to handle coordination/configuration of a much more complex interface. >> >>> FWIW two solutions that immediately come to mind are to export "backup" >>> as phys_port_name of the backup virtio link and/or assign a name to the >>> master like you are doing already.
I think team uses team%d and bond >>> uses bond%d, soft naming of master devices seems quite natural in this >>> case. >> >> I figured I had overlooked something like that. Thanks for pointing >> this out. Okay so I think the phys_port_name approach might resolve >> the original issue. If I am reading things correctly what we end up >> with is the master showing up as "ens1" for example and the backup >> showing up as "ens1nbackup". Am I understanding that right? >> >> The problem with the team/bond%d approach is that it creates a new >> netdevice and so it would require guest configuration changes. >> >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio >>> link is quite neat. >> >> I agree. For non-"backup" virtio_net devices would it be okay for us to >> just return -EOPNOTSUPP? I assume it would be and that way the legacy >> behavior could be maintained although the function still exists. >> - When
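The phys_port_name idea discussed in this message can be sketched as follows. In the kernel the hook is the `ndo_get_phys_port_name` callback in `struct net_device_ops`, which fills a caller-supplied buffer of a given length; the sketch below only mirrors that buffer-and-length shape in a standalone function, and `struct vnet_sketch` with its `backup_feature` field is a hypothetical stand-in for virtio_net's private state. It reports "backup" only when the BACKUP bit is negotiated and returns `-EOPNOTSUPP` otherwise, as Alexander suggests, so legacy devices keep their current names. Returning `-EINVAL` for a too-small buffer is an assumption of this sketch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for virtio_net's private data. */
struct vnet_sketch {
    bool backup_feature;   /* BACKUP feature bit negotiated with the host */
};

/* Same buffer/length contract as ndo_get_phys_port_name(dev, name, len). */
static int vnet_get_phys_port_name(const struct vnet_sketch *vi,
                                   char *name, size_t len)
{
    int n;

    /* Legacy (non-backup) devices: no port name, so naming is unchanged. */
    if (!vi->backup_feature)
        return -EOPNOTSUPP;

    n = snprintf(name, len, "backup");
    if (n < 0 || (size_t)n >= len)
        return -EINVAL;    /* buffer too small (this sketch's choice of errno) */
    return 0;
}
```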
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
I haven't checked emails for days and did not realize the new revision had already come out. And thank you for the effort, this revision really looks to be a step forward towards our use case and is close to what we wanted to do. A few questions in line. On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck wrote: > On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: >> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: >>> Patch 2 is in response to the community request for a 3 netdev >>> solution. However, it creates some issues we'll get into in a moment. >>> It extends virtio_net to use alternate datapath when available and >>> registered. When BACKUP feature is enabled, virtio_net driver creates >>> an additional 'bypass' netdev that acts as a master device and controls >>> 2 slave devices. The original virtio_net netdev is registered as >>> 'backup' netdev and a passthru/vf device with the same MAC gets >>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are >>> associated with the same 'pci' device. The user accesses the network >>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev >>> as default for transmits when it is available with link up and running. >> >> Thank you for doing this. >> >>> We noticed a couple of issues with this approach during testing. >>> - As both 'bypass' and 'backup' netdevs are associated with the same >>> virtio pci device, udev tries to rename both of them with the same name >>> and the 2nd rename will fail. This would be OK as long as the first netdev >>> to be renamed is the 'bypass' netdev, but the order in which udev gets >>> to rename the 2 netdevs is not reliable. >> >> Out of curiosity - why do you link the master netdev to the virtio >> struct device? > > The basic idea of all this is that we wanted this to work with an > existing VM image that was using virtio.
As such we were trying to > make it so that the bypass interface takes the place of the original > virtio and get udev to rename the bypass to what the original > virtio_net was. Could it also be made possible to take over the config from the VF instead of virtio on an existing VM image, and have udev rename the bypass netdev to what the original VF was? I'm not saying we should tightly bind the bypass master to only virtio or VF, but I think we should provide both options to support different upgrade paths. Possibly we could tweak the device tree layout to reuse the same PCI slot for the master bypass netdev, such that udev would not get confused when renaming the device. The VF needs to use a different function slot afterwards. Perhaps we might need a special multiseat-like QEMU device for that purpose? In our case we'll upgrade the config from VF to virtio-bypass directly. > >> FWIW two solutions that immediately come to mind are to export "backup" >> as phys_port_name of the backup virtio link and/or assign a name to the >> master like you are doing already. I think team uses team%d and bond >> uses bond%d, soft naming of master devices seems quite natural in this >> case. > > I figured I had overlooked something like that. Thanks for pointing > this out. Okay so I think the phys_port_name approach might resolve > the original issue. If I am reading things correctly what we end up > with is the master showing up as "ens1" for example and the backup > showing up as "ens1nbackup". Am I understanding that right? > > The problem with the team/bond%d approach is that it creates a new > netdevice and so it would require guest configuration changes. > >> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio >> link is quite neat. > > I agree. For non-"backup" virtio_net devices would it be okay for us to > just return -EOPNOTSUPP? I assume it would be and that way the legacy > behavior could be maintained although the function still exists.
> >>> - When the 'active' netdev is unplugged OR not present on a destination >>> system after live migration, the user will see 2 virtio_net netdevs. >> >> That's necessary and expected, all configuration applies to the master >> so master must exist. > > With the naming issue resolved this is the only item left outstanding. > This becomes a matter of form vs function. > > The main complaint about the "3 netdev" solution is that it is a bit confusing to > have the 2 netdevs present if the VF isn't there. The idea is that > having the extra "master" netdev there if there isn't really a bond is > a bit ugly. Is it uglier in terms of user experience rather than functionality? I don't want it to change dynamically between 2-netdev and 3-netdev depending on the presence of the VF. That gets back to my original question and suggestion earlier: why not just hide the lower netdevs from udev renaming and such? What important observability benefits do users get from exposing the lower netdevs? Thanks, -Siwei > > The downside of the "2 netdev" solution is that you have to deal
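Alexander's reading in this exchange, that the master would show up as "ens1" and the backup as "ens1nbackup", is consistent with udev's predictable naming appending an `n<phys_port_name>` suffix to the slot-derived name when a device reports a port name. The helper below is a hypothetical simplification of that rule, not udev's code; it shows that the two netdevs stop competing for the same name, and that a long slot name plus the suffix can exceed the 15-character interface name limit.

```c
#include <stddef.h>
#include <stdio.h>

#define IFNAMSIZ_SKETCH 16   /* kernel IFNAMSIZ: 15 characters plus NUL */

/*
 * Build "<slot name>[n<phys_port_name>]". Returns 0 on success, -1 if the
 * result would not fit in an interface name.
 */
static int compose_ifname(char out[IFNAMSIZ_SKETCH], const char *slot_name,
                          const char *port_name)
{
    int n;

    if (port_name != NULL && port_name[0] != '\0')
        n = snprintf(out, IFNAMSIZ_SKETCH, "%sn%s", slot_name, port_name);
    else
        n = snprintf(out, IFNAMSIZ_SKETCH, "%s", slot_name);

    return (n < 0 || n >= IFNAMSIZ_SKETCH) ? -1 : 0;
}
```

The master, which reports no port name, keeps the plain slot name, so existing configuration written against "ens1" continues to apply to it.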
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 21, 2018 at 11:38 AM, Jiri Pirko wrote: > Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.du...@gmail.com wrote: >>On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko wrote: >>> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote: On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko wrote: > Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote: >>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko wrote: >>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote: On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote: > Yeah, I can see it now :( I guess that the ship has sailed and we are > stuck with this ugly thing forever... > > Could you at least make some common code that is shared in between > netvsc and virtio_net so this is handled in exactly the same way in both? IMHO netvsc is a vendor specific driver which made a mistake on what behaviour it provides (or tried to align itself with Windows SR-IOV). Let's not make a far, far more commonly deployed and important driver (virtio) bug-compatible with netvsc. >>> >>> Yeah. netvsc solution is a dangerous precedent here and in my opinion >>> it was a huge mistake to merge it. I personally would vote to unmerge it >>> and make the solution based on team/bond. >>> >>> To Jiri's initial comments, I feel the same way, in fact I've talked to the NetworkManager guys to get auto-bonding based on MACs handled in user space. I think it may very well get done in next versions of NM, but isn't done yet. Stephen also raised the point that not everybody is using NM. >>> >>> Can be done in NM, networkd or other network management tools. >>> Even easier to do this in teamd and let them all benefit. >>> >>> Actually, I took a stab at implementing this in teamd. Took me like an hour >>> and a half.
>>> >>> You can just run teamd with config option "kidnap" like this: >>> # teamd/teamd -c '{"kidnap": true }' >>> >>> Whenever teamd sees another netdev appear with the same mac as his, >>> or whenever teamd sees another netdev change its mac to his, >>> it enslaves it. >>> >>> Here's the patch (quick and dirty): >>> >>> Subject: [patch teamd] teamd: introduce kidnap feature >>> >>> Signed-off-by: Jiri Pirko >> >>So this doesn't really address the original problem we were trying to >>solve. You asked earlier why the netdev name mattered and it mostly >>has to do with configuration. Specifically what our patch is >>attempting to resolve is the issue of how to allow a cloud provider to >>upgrade their customer to SR-IOV support and live migration without >>requiring them to reconfigure their guest. So the general idea with >>our patch is to take a VM that is running with virtio_net only and >>allow it to instead spawn a virtio_bypass master using the same netdev >>name as the original virtio, and then have the virtio_net and VF come >>up and be enslaved by the bypass interface. Doing it this way we can >>allow for multi-vendor SR-IOV live migration support using a guest >>that was originally configured for virtio only. >> >>The problem with your solution is we already have teaming and bonding >>as you said. There is already a write-up from Red Hat on how to do it >>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts). >>That is all well and good as long as you are willing to keep around >>two VM images, one for virtio, and one for SR-IOV with live migration. > > You don't need 2 images. You need only one. The one with the team setup. > That's it. If another netdev with the same mac appears, teamd will > enslave it and run traffic on it. If not, ok, you'll go only through > virtio_net.
Isn't that going to cause the routing table to get messed up when we rearrange the netdevs? We don't want to have a significant disruption in traffic when we are adding/removing the VF. It seems like we would need to invalidate any entries that were configured for the virtio_net and reestablish them on the new team interface. Part of the criteria we have been working with is that we should be able to transition from having a VF to not or vice versa without seeing any significant disruption in the traffic. >>> >>> What? You have routes on the team netdev. virtio_net and VF are only >>> slaves. What are you talking about? I don't get it :/ >> >>So let's walk through this by example. The general idea of the base case >>for all this is somebody starting with virtio_net, we will call the
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.du...@gmail.com wrote: >On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirkowrote: >> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote: >>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko wrote: Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote: >On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko wrote: >> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote: >>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote: Yeah, I can see it now :( I guess that the ship has sailed and we are stuck with this ugly thing forever... Could you at least make some common code that is shared in between netvsc and virtio_net so this is handled in exacly the same way in both? >>> >>>IMHO netvsc is a vendor specific driver which made a mistake on what >>>behaviour it provides (or tried to align itself with Windows SR-IOV). >>>Let's not make a far, far more commonly deployed and important driver >>>(virtio) bug-compatible with netvsc. >> >> Yeah. netvsc solution is a dangerous precedent here and in my opinition >> it was a huge mistake to merge it. I personally would vote to unmerge it >> and make the solution based on team/bond. >> >> >>> >>>To Jiri's initial comments, I feel the same way, in fact I've talked to >>>the NetworkManager guys to get auto-bonding based on MACs handled in >>>user space. I think it may very well get done in next versions of NM, >>>but isn't done yet. Stephen also raised the point that not everybody is >>>using NM. >> >> Can be done in NM, networkd or other network management tools. >> Even easier to do this in teamd and let them all benefit. >> >> Actually, I took a stab to implement this in teamd. Took me like an hour >> and half. 
>> >> You can just run teamd with config option "kidnap" like this: >> # teamd/teamd -c '{"kidnap": true }' >> >> Whenever teamd sees another netdev to appear with the same mac as his, >> or whenever teamd sees another netdev to change mac to his, >> it enslaves it. >> >> Here's the patch (quick and dirty): >> >> Subject: [patch teamd] teamd: introduce kidnap feature >> >> Signed-off-by: Jiri Pirko > >So this doesn't really address the original problem we were trying to >solve. You asked earlier why the netdev name mattered and it mostly >has to do with configuration. Specifically what our patch is >attempting to resolve is the issue of how to allow a cloud provider to >upgrade their customer to SR-IOV support and live migration without >requiring them to reconfigure their guest. So the general idea with >our patch is to take a VM that is running with virtio_net only and >allow it to instead spawn a virtio_bypass master using the same netdev >name as the original virtio, and then have the virtio_net and VF come >up and be enslaved by the bypass interface. Doing it this way we can >allow for multi-vendor SR-IOV live migration support using a guest >that was originally configured for virtio only. > >The problem with your solution is we already have teaming and bonding >as you said. There is already a write-up from Red Hat on how to do it >(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts). >That is all well and good as long as you are willing to keep around >two VM images, one for virtio, and one for SR-IOV with live migration. You don't need 2 images. You need only one. The one with the team setup. That's it. If another netdev with the same mac appears, teamd will enslave it and run traffic on it. If not, ok, you'll go only through virtio_net. >>> >>>Isn't that going to cause the routing table to get messed up when we >>>rearrange the netdevs? 
We don't want to have a significant disruption >> in traffic when we are adding/removing the VF. It seems like we would >>need to invalidate any entries that were configured for the virtio_net >>and reestablish them on the new team interface. Part of the criteria >>we have been working with is that we should be able to transition from >>having a VF to not or vice versa without seeing any significant >>disruption in the traffic. >> >> What? You have routes on the team netdev. virtio_net and VF are only >> slaves. What are you talking about? I don't get it :/ > >So let's walk through this by example. The general idea of the base case >for all this is somebody starting with virtio_net; we will call the >interface "ens1" for now. It comes up, is assigned a DHCP address, >and everything works as expected. Now, in order to get better >performance, we want to add a VF
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirkowrote: > Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote: >>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko wrote: >>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote: On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko wrote: > Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote: >>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote: >>> Yeah, I can see it now :( I guess that the ship has sailed and we are >>> stuck with this ugly thing forever... >>> >>> Could you at least make some common code that is shared in between >>> netvsc and virtio_net so this is handled in exacly the same way in both? >> >>IMHO netvsc is a vendor specific driver which made a mistake on what >>behaviour it provides (or tried to align itself with Windows SR-IOV). >>Let's not make a far, far more commonly deployed and important driver >>(virtio) bug-compatible with netvsc. > > Yeah. netvsc solution is a dangerous precedent here and in my opinition > it was a huge mistake to merge it. I personally would vote to unmerge it > and make the solution based on team/bond. > > >> >>To Jiri's initial comments, I feel the same way, in fact I've talked to >>the NetworkManager guys to get auto-bonding based on MACs handled in >>user space. I think it may very well get done in next versions of NM, >>but isn't done yet. Stephen also raised the point that not everybody is >>using NM. > > Can be done in NM, networkd or other network management tools. > Even easier to do this in teamd and let them all benefit. > > Actually, I took a stab to implement this in teamd. Took me like an hour > and half. > > You can just run teamd with config option "kidnap" like this: > # teamd/teamd -c '{"kidnap": true }' > > Whenever teamd sees another netdev to appear with the same mac as his, > or whenever teamd sees another netdev to change mac to his, > it enslaves it. 
> > Here's the patch (quick and dirty): > > Subject: [patch teamd] teamd: introduce kidnap feature > > Signed-off-by: Jiri Pirko So this doesn't really address the original problem we were trying to solve. You asked earlier why the netdev name mattered and it mostly has to do with configuration. Specifically what our patch is attempting to resolve is the issue of how to allow a cloud provider to upgrade their customer to SR-IOV support and live migration without requiring them to reconfigure their guest. So the general idea with our patch is to take a VM that is running with virtio_net only and allow it to instead spawn a virtio_bypass master using the same netdev name as the original virtio, and then have the virtio_net and VF come up and be enslaved by the bypass interface. Doing it this way we can allow for multi-vendor SR-IOV live migration support using a guest that was originally configured for virtio only. The problem with your solution is we already have teaming and bonding as you said. There is already a write-up from Red Hat on how to do it (https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts). That is all well and good as long as you are willing to keep around two VM images, one for virtio, and one for SR-IOV with live migration. >>> >>> You don't need 2 images. You need only one. The one with the team setup. >>> That's it. If another netdev with the same mac appears, teamd will >>> enslave it and run traffic on it. If not, ok, you'll go only through >>> virtio_net. >> >>Isn't that going to cause the routing table to get messed up when we >>rearrange the netdevs? We don't want to have an significant disruption >> in traffic when we are adding/removing the VF. It seems like we would >>need to invalidate any entries that were configured for the virtio_net >>and reestablish them on the new team interface. 
Part of the criteria >>we have been working with is that we should be able to transition from >>having a VF to not or vice versa without seeing any significant >>disruption in the traffic. > > What? You have routes on the team netdev. virtio_net and VF are only > slaves. What are you talking about? I don't get it :/ So let's walk through this by example. The general idea of the base case for all this is somebody starting with virtio_net; we will call the interface "ens1" for now. It comes up, is assigned a DHCP address, and everything works as expected. Now, in order to get better performance, we want to add a VF "ens2", but we don't want a new IP address. Now, if I understand correctly, what will happen is that when "ens2" appears on the system, teamd will then create a new team
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote: >On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirkowrote: >> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote: >>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko wrote: Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote: >On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote: >> Yeah, I can see it now :( I guess that the ship has sailed and we are >> stuck with this ugly thing forever... >> >> Could you at least make some common code that is shared in between >> netvsc and virtio_net so this is handled in exacly the same way in both? > >IMHO netvsc is a vendor specific driver which made a mistake on what >behaviour it provides (or tried to align itself with Windows SR-IOV). >Let's not make a far, far more commonly deployed and important driver >(virtio) bug-compatible with netvsc. Yeah. netvsc solution is a dangerous precedent here and in my opinition it was a huge mistake to merge it. I personally would vote to unmerge it and make the solution based on team/bond. > >To Jiri's initial comments, I feel the same way, in fact I've talked to >the NetworkManager guys to get auto-bonding based on MACs handled in >user space. I think it may very well get done in next versions of NM, >but isn't done yet. Stephen also raised the point that not everybody is >using NM. Can be done in NM, networkd or other network management tools. Even easier to do this in teamd and let them all benefit. Actually, I took a stab to implement this in teamd. Took me like an hour and half. You can just run teamd with config option "kidnap" like this: # teamd/teamd -c '{"kidnap": true }' Whenever teamd sees another netdev to appear with the same mac as his, or whenever teamd sees another netdev to change mac to his, it enslaves it. 
Here's the patch (quick and dirty): Subject: [patch teamd] teamd: introduce kidnap feature Signed-off-by: Jiri Pirko >>> >>>So this doesn't really address the original problem we were trying to >>>solve. You asked earlier why the netdev name mattered and it mostly >>>has to do with configuration. Specifically what our patch is >>>attempting to resolve is the issue of how to allow a cloud provider to >>>upgrade their customer to SR-IOV support and live migration without >>>requiring them to reconfigure their guest. So the general idea with >>>our patch is to take a VM that is running with virtio_net only and >>>allow it to instead spawn a virtio_bypass master using the same netdev >>>name as the original virtio, and then have the virtio_net and VF come >>>up and be enslaved by the bypass interface. Doing it this way we can >>>allow for multi-vendor SR-IOV live migration support using a guest >>>that was originally configured for virtio only. >>> >>>The problem with your solution is we already have teaming and bonding >>>as you said. There is already a write-up from Red Hat on how to do it >>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts). >>>That is all well and good as long as you are willing to keep around >>>two VM images, one for virtio, and one for SR-IOV with live migration. >> >> You don't need 2 images. You need only one. The one with the team setup. >> That's it. If another netdev with the same mac appears, teamd will >> enslave it and run traffic on it. If not, ok, you'll go only through >> virtio_net. > >Isn't that going to cause the routing table to get messed up when we >rearrange the netdevs? We don't want to have an significant disruption > in traffic when we are adding/removing the VF. It seems like we would >need to invalidate any entries that were configured for the virtio_net >and reestablish them on the new team interface. 
Part of the criteria >we have been working with is that we should be able to transition from >having a VF to not or vice versa without seeing any significant >disruption in the traffic. What? You have routes on the team netdev. virtio_net and VF are only slaves. What are you talking about? I don't get it :/ > >Also how does this handle any static configuration? I am assuming that >everything here assumes the team will be brought up as soon as it is >seen and assigned a DHCP address. Again. You configure whatever you need on the team netdev. > >The solution as you have proposed seems problematic at best. I don't >see how the team solution works without introducing some sort of >traffic disruption to either add/remove the VF and bring up/tear down >the team interface. At that point we might as well just give up on >this piece of live migration support entirely since the disruption was >what we were trying to avoid. We
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirkowrote: > Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote: >>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko wrote: >>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote: On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote: > Yeah, I can see it now :( I guess that the ship has sailed and we are > stuck with this ugly thing forever... > > Could you at least make some common code that is shared in between > netvsc and virtio_net so this is handled in exacly the same way in both? IMHO netvsc is a vendor specific driver which made a mistake on what behaviour it provides (or tried to align itself with Windows SR-IOV). Let's not make a far, far more commonly deployed and important driver (virtio) bug-compatible with netvsc. >>> >>> Yeah. netvsc solution is a dangerous precedent here and in my opinition >>> it was a huge mistake to merge it. I personally would vote to unmerge it >>> and make the solution based on team/bond. >>> >>> To Jiri's initial comments, I feel the same way, in fact I've talked to the NetworkManager guys to get auto-bonding based on MACs handled in user space. I think it may very well get done in next versions of NM, but isn't done yet. Stephen also raised the point that not everybody is using NM. >>> >>> Can be done in NM, networkd or other network management tools. >>> Even easier to do this in teamd and let them all benefit. >>> >>> Actually, I took a stab to implement this in teamd. Took me like an hour >>> and half. >>> >>> You can just run teamd with config option "kidnap" like this: >>> # teamd/teamd -c '{"kidnap": true }' >>> >>> Whenever teamd sees another netdev to appear with the same mac as his, >>> or whenever teamd sees another netdev to change mac to his, >>> it enslaves it. 
>>> >>> Here's the patch (quick and dirty): >>> >>> Subject: [patch teamd] teamd: introduce kidnap feature >>> >>> Signed-off-by: Jiri Pirko >> >>So this doesn't really address the original problem we were trying to >>solve. You asked earlier why the netdev name mattered and it mostly >>has to do with configuration. Specifically what our patch is >>attempting to resolve is the issue of how to allow a cloud provider to >>upgrade their customer to SR-IOV support and live migration without >>requiring them to reconfigure their guest. So the general idea with >>our patch is to take a VM that is running with virtio_net only and >>allow it to instead spawn a virtio_bypass master using the same netdev >>name as the original virtio, and then have the virtio_net and VF come >>up and be enslaved by the bypass interface. Doing it this way we can >>allow for multi-vendor SR-IOV live migration support using a guest >>that was originally configured for virtio only. >> >>The problem with your solution is we already have teaming and bonding >>as you said. There is already a write-up from Red Hat on how to do it >>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts). >>That is all well and good as long as you are willing to keep around >>two VM images, one for virtio, and one for SR-IOV with live migration. > > You don't need 2 images. You need only one. The one with the team setup. > That's it. If another netdev with the same mac appears, teamd will > enslave it and run traffic on it. If not, ok, you'll go only through > virtio_net. Isn't that going to cause the routing table to get messed up when we rearrange the netdevs? We don't want to have an significant disruption in traffic when we are adding/removing the VF. It seems like we would need to invalidate any entries that were configured for the virtio_net and reestablish them on the new team interface. 
Part of the criteria we have been working with is that we should be able to transition from having a VF to not or vice versa without seeing any significant disruption in the traffic. Also how does this handle any static configuration? I am assuming that everything here assumes the team will be brought up as soon as it is seen and assigned a DHCP address. The solution as you have proposed seems problematic at best. I don't see how the team solution works without introducing some sort of traffic disruption to either add/remove the VF and bring up/tear down the team interface. At that point we might as well just give up on this piece of live migration support entirely since the disruption was what we were trying to avoid. We might as well just hotplug out the VF and hotplug in a virtio at the same bus device and function number and just let udev take care of renaming it for us. The idea was supposed to be a seamless transition between the two interfaces.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote: >On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirkowrote: >> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote: >>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote: Yeah, I can see it now :( I guess that the ship has sailed and we are stuck with this ugly thing forever... Could you at least make some common code that is shared in between netvsc and virtio_net so this is handled in exacly the same way in both? >>> >>>IMHO netvsc is a vendor specific driver which made a mistake on what >>>behaviour it provides (or tried to align itself with Windows SR-IOV). >>>Let's not make a far, far more commonly deployed and important driver >>>(virtio) bug-compatible with netvsc. >> >> Yeah. netvsc solution is a dangerous precedent here and in my opinition >> it was a huge mistake to merge it. I personally would vote to unmerge it >> and make the solution based on team/bond. >> >> >>> >>>To Jiri's initial comments, I feel the same way, in fact I've talked to >>>the NetworkManager guys to get auto-bonding based on MACs handled in >>>user space. I think it may very well get done in next versions of NM, >>>but isn't done yet. Stephen also raised the point that not everybody is >>>using NM. >> >> Can be done in NM, networkd or other network management tools. >> Even easier to do this in teamd and let them all benefit. >> >> Actually, I took a stab to implement this in teamd. Took me like an hour >> and half. >> >> You can just run teamd with config option "kidnap" like this: >> # teamd/teamd -c '{"kidnap": true }' >> >> Whenever teamd sees another netdev to appear with the same mac as his, >> or whenever teamd sees another netdev to change mac to his, >> it enslaves it. >> >> Here's the patch (quick and dirty): >> >> Subject: [patch teamd] teamd: introduce kidnap feature >> >> Signed-off-by: Jiri Pirko > >So this doesn't really address the original problem we were trying to >solve. 
You asked earlier why the netdev name mattered and it mostly >has to do with configuration. Specifically what our patch is >attempting to resolve is the issue of how to allow a cloud provider to >upgrade their customer to SR-IOV support and live migration without >requiring them to reconfigure their guest. So the general idea with >our patch is to take a VM that is running with virtio_net only and >allow it to instead spawn a virtio_bypass master using the same netdev >name as the original virtio, and then have the virtio_net and VF come >up and be enslaved by the bypass interface. Doing it this way we can >allow for multi-vendor SR-IOV live migration support using a guest >that was originally configured for virtio only. > >The problem with your solution is we already have teaming and bonding >as you said. There is already a write-up from Red Hat on how to do it >(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts). >That is all well and good as long as you are willing to keep around >two VM images, one for virtio, and one for SR-IOV with live migration. You don't need 2 images. You need only one. The one with the team setup. That's it. If another netdev with the same mac appears, teamd will enslave it and run traffic on it. If not, ok, you'll go only through virtio_net. >The problem is nobody wants to do that. What they want is to maintain >one guest image and if they decide to upgrade to SR-IOV they still >want their live migration and they don't want to have to reconfigure >the guest. > >That said it does seem to make the existing Red Hat solution easier to >manage since you wouldn't be guessing at ifname so I have provided >some feedback below. 
> >> --- >> include/team.h | 7 +++ >> libteam/ifinfo.c | 20 >> teamd/teamd.c | 17 + >> teamd/teamd.h | 5 + >> teamd/teamd_events.c | 17 + >> teamd/teamd_ifinfo_watch.c | 9 + >> teamd/teamd_per_port.c | 7 ++- >> 7 files changed, 81 insertions(+), 1 deletion(-) >> >> diff --git a/include/team.h b/include/team.h >> index 9ae517d..b0c19c8 100644 >> --- a/include/team.h >> +++ b/include/team.h >> @@ -137,6 +137,13 @@ struct team_ifinfo *team_get_next_ifinfo(struct >> team_handle *th, >> #define team_for_each_ifinfo(ifinfo, th) \ >> for (ifinfo = team_get_next_ifinfo(th, NULL); ifinfo; \ >> ifinfo = team_get_next_ifinfo(th, ifinfo)) >> + >> +struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th, >> + struct team_ifinfo >> *ifinfo); >> +#define team_for_each_unlinked_ifinfo(ifinfo, th) \ >> + for (ifinfo = team_get_next_unlinked_ifinfo(th, NULL); ifinfo; \ >> +ifinfo =
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirkowrote: > Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote: >>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote: >>> Yeah, I can see it now :( I guess that the ship has sailed and we are >>> stuck with this ugly thing forever... >>> >>> Could you at least make some common code that is shared in between >>> netvsc and virtio_net so this is handled in exacly the same way in both? >> >>IMHO netvsc is a vendor specific driver which made a mistake on what >>behaviour it provides (or tried to align itself with Windows SR-IOV). >>Let's not make a far, far more commonly deployed and important driver >>(virtio) bug-compatible with netvsc. > > Yeah. netvsc solution is a dangerous precedent here and in my opinition > it was a huge mistake to merge it. I personally would vote to unmerge it > and make the solution based on team/bond. > > >> >>To Jiri's initial comments, I feel the same way, in fact I've talked to >>the NetworkManager guys to get auto-bonding based on MACs handled in >>user space. I think it may very well get done in next versions of NM, >>but isn't done yet. Stephen also raised the point that not everybody is >>using NM. > > Can be done in NM, networkd or other network management tools. > Even easier to do this in teamd and let them all benefit. > > Actually, I took a stab to implement this in teamd. Took me like an hour > and half. > > You can just run teamd with config option "kidnap" like this: > # teamd/teamd -c '{"kidnap": true }' > > Whenever teamd sees another netdev to appear with the same mac as his, > or whenever teamd sees another netdev to change mac to his, > it enslaves it. > > Here's the patch (quick and dirty): > > Subject: [patch teamd] teamd: introduce kidnap feature > > Signed-off-by: Jiri Pirko So this doesn't really address the original problem we were trying to solve. You asked earlier why the netdev name mattered and it mostly has to do with configuration. 
Specifically what our patch is attempting to resolve is the issue of how to allow a cloud provider to upgrade their customer to SR-IOV support and live migration without requiring them to reconfigure their guest. So the general idea with our patch is to take a VM that is running with virtio_net only and allow it to instead spawn a virtio_bypass master using the same netdev name as the original virtio, and then have the virtio_net and VF come up and be enslaved by the bypass interface. Doing it this way we can allow for multi-vendor SR-IOV live migration support using a guest that was originally configured for virtio only. The problem with your solution is we already have teaming and bonding as you said. There is already a write-up from Red Hat on how to do it (https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts). That is all well and good as long as you are willing to keep around two VM images, one for virtio, and one for SR-IOV with live migration. The problem is nobody wants to do that. What they want is to maintain one guest image and if they decide to upgrade to SR-IOV they still want their live migration and they don't want to have to reconfigure the guest. That said it does seem to make the existing Red Hat solution easier to manage since you wouldn't be guessing at ifname so I have provided some feedback below. 
> --- > include/team.h | 7 +++ > libteam/ifinfo.c | 20 > teamd/teamd.c | 17 + > teamd/teamd.h | 5 + > teamd/teamd_events.c | 17 + > teamd/teamd_ifinfo_watch.c | 9 + > teamd/teamd_per_port.c | 7 ++- > 7 files changed, 81 insertions(+), 1 deletion(-) > > diff --git a/include/team.h b/include/team.h > index 9ae517d..b0c19c8 100644 > --- a/include/team.h > +++ b/include/team.h > @@ -137,6 +137,13 @@ struct team_ifinfo *team_get_next_ifinfo(struct > team_handle *th, > #define team_for_each_ifinfo(ifinfo, th) \ > for (ifinfo = team_get_next_ifinfo(th, NULL); ifinfo; \ > ifinfo = team_get_next_ifinfo(th, ifinfo)) > + > +struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th, > + struct team_ifinfo *ifinfo); > +#define team_for_each_unlinked_ifinfo(ifinfo, th) \ > + for (ifinfo = team_get_next_unlinked_ifinfo(th, NULL); ifinfo; \ > +ifinfo = team_get_next_unlinked_ifinfo(th, ifinfo)) > + > /* ifinfo getters */ > bool team_is_ifinfo_removed(struct team_ifinfo *ifinfo); > uint32_t team_get_ifinfo_ifindex(struct team_ifinfo *ifinfo); > diff --git a/libteam/ifinfo.c b/libteam/ifinfo.c > index 5c32a9c..8f9548e 100644 > --- a/libteam/ifinfo.c > +++ b/libteam/ifinfo.c > @@ -494,6 +494,26 @@ struct team_ifinfo
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>> stuck with this ugly thing forever...
>>
>> Could you at least make some common code that is shared in between
>> netvsc and virtio_net so this is handled in exactly the same way in both?
>
>IMHO netvsc is a vendor specific driver which made a mistake on what
>behaviour it provides (or tried to align itself with Windows SR-IOV).
>Let's not make a far, far more commonly deployed and important driver
>(virtio) bug-compatible with netvsc.

Yeah. The netvsc solution is a dangerous precedent here, and in my opinion
it was a huge mistake to merge it. I personally would vote to unmerge it
and make the solution based on team/bond.

>
>To Jiri's initial comments, I feel the same way, in fact I've talked to
>the NetworkManager guys to get auto-bonding based on MACs handled in
>user space. I think it may very well get done in next versions of NM,
>but isn't done yet. Stephen also raised the point that not everybody is
>using NM.

Can be done in NM, networkd or other network management tools.
Even easier to do this in teamd and let them all benefit.

Actually, I took a stab at implementing this in teamd. It took me about an
hour and a half.

You can just run teamd with the config option "kidnap" like this:
# teamd/teamd -c '{"kidnap": true }'

Whenever teamd sees another netdev appear with the same MAC as its own,
or sees another netdev change its MAC to match, it enslaves it.
Here's the patch (quick and dirty):

Subject: [patch teamd] teamd: introduce kidnap feature

Signed-off-by: Jiri Pirko
---
 include/team.h             |  7 +++++++
 libteam/ifinfo.c           | 20 ++++++++++++++++++++
 teamd/teamd.c              | 17 +++++++++++++++++
 teamd/teamd.h              |  5 +++++
 teamd/teamd_events.c       | 17 +++++++++++++++++
 teamd/teamd_ifinfo_watch.c |  9 +++++++++
 teamd/teamd_per_port.c     |  7 ++++++-
 7 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/include/team.h b/include/team.h
index 9ae517d..b0c19c8 100644
--- a/include/team.h
+++ b/include/team.h
@@ -137,6 +137,13 @@ struct team_ifinfo *team_get_next_ifinfo(struct team_handle *th,
 #define team_for_each_ifinfo(ifinfo, th)				\
 	for (ifinfo = team_get_next_ifinfo(th, NULL); ifinfo;		\
 	     ifinfo = team_get_next_ifinfo(th, ifinfo))
+
+struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
+						   struct team_ifinfo *ifinfo);
+#define team_for_each_unlinked_ifinfo(ifinfo, th)			\
+	for (ifinfo = team_get_next_unlinked_ifinfo(th, NULL); ifinfo;	\
+	     ifinfo = team_get_next_unlinked_ifinfo(th, ifinfo))
+
 /* ifinfo getters */
 bool team_is_ifinfo_removed(struct team_ifinfo *ifinfo);
 uint32_t team_get_ifinfo_ifindex(struct team_ifinfo *ifinfo);
diff --git a/libteam/ifinfo.c b/libteam/ifinfo.c
index 5c32a9c..8f9548e 100644
--- a/libteam/ifinfo.c
+++ b/libteam/ifinfo.c
@@ -494,6 +494,26 @@ struct team_ifinfo *team_get_next_ifinfo(struct team_handle *th,
 	return NULL;
 }
 
+/**
+ * @param th		libteam library context
+ * @param ifinfo	ifinfo structure
+ *
+ * @details Get next unlinked ifinfo in list.
+ *
+ * @return Ifinfo next to ifinfo passed.
+ **/
+TEAM_EXPORT
+struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
+						   struct team_ifinfo *ifinfo)
+{
+	do {
+		ifinfo = list_get_next_node_entry(&th->ifinfo_list, ifinfo, list);
+		if (ifinfo && !ifinfo->linked)
+			return ifinfo;
+	} while (ifinfo);
+	return NULL;
+}
+
 /**
  * @param ifinfo	ifinfo structure
  *
diff --git a/teamd/teamd.c b/teamd/teamd.c
index aac2511..069c7f0 100644
--- a/teamd/teamd.c
+++ b/teamd/teamd.c
@@ -926,8 +926,25 @@ static int teamd_event_watch_port_added(struct teamd_context *ctx,
 	return 0;
 }
 
+static int teamd_event_watch_unlinked_hwaddr_changed(struct teamd_context *ctx,
+						     struct team_ifinfo *ifinfo,
+						     void *priv)
+{
+	int err;
+	bool kidnap;
+
+	err = teamd_config_bool_get(ctx, &kidnap, "$.kidnap");
+	if (err || !kidnap ||
+	    ctx->hwaddr_len != team_get_ifinfo_hwaddr_len(ifinfo) ||
+	    memcmp(team_get_ifinfo_hwaddr(ifinfo),
+		   ctx->hwaddr, ctx->hwaddr_len))
+		return 0;
+	return teamd_port_add(ctx, team_get_ifinfo_ifindex(ifinfo));
+}
+
 static const struct teamd_event_watch_ops teamd_port_watch_ops = {
 	.port_added = teamd_event_watch_port_added,
+	.unlinked_hwaddr_changed = teamd_event_watch_unlinked_hwaddr_changed,
 };
 
 static int teamd_port_watch_init(struct teamd_context *ctx)
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
> Yeah, I can see it now :( I guess that the ship has sailed and we are
> stuck with this ugly thing forever...
>
> Could you at least make some common code that is shared in between
> netvsc and virtio_net so this is handled in exactly the same way in both?

IMHO netvsc is a vendor specific driver which made a mistake on what
behaviour it provides (or tried to align itself with Windows SR-IOV).
Let's not make a far, far more commonly deployed and important driver
(virtio) bug-compatible with netvsc.

To Jiri's initial comments, I feel the same way, in fact I've talked to
the NetworkManager guys to get auto-bonding based on MACs handled in
user space. I think it may very well get done in next versions of NM,
but isn't done yet. Stephen also raised the point that not everybody is
using NM.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Tue, Feb 20, 2018 at 12:14 PM, Jiri Pirkowrote: > Tue, Feb 20, 2018 at 06:14:32PM CET, sridhar.samudr...@intel.com wrote: >>On 2/20/2018 8:29 AM, Jiri Pirko wrote: >>> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote: >>> > On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko wrote: >>> > > Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote: >>> > > > Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be >>> > > > used by hypervisor to indicate that virtio_net interface should act as >>> > > > a backup for another device with the same MAC address. >>> > > > >>> > > > Ppatch 2 is in response to the community request for a 3 netdev >>> > > > solution. However, it creates some issues we'll get into in a moment. >>> > > > It extends virtio_net to use alternate datapath when available and >>> > > > registered. When BACKUP feature is enabled, virtio_net driver creates >>> > > > an additional 'bypass' netdev that acts as a master device and >>> > > > controls >>> > > > 2 slave devices. The original virtio_net netdev is registered as >>> > > > 'backup' netdev and a passthru/vf device with the same MAC gets >>> > > > registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are >>> > > > associated with the same 'pci' device. The user accesses the network >>> > > > interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' >>> > > > netdev >>> > > > as default for transmits when it is available with link up and >>> > > > running. >>> > > Sorry, but this is ridiculous. You are apparently re-implemeting part >>> > > of bonding driver as a part of NIC driver. Bond and team drivers >>> > > are mature solutions, well tested, broadly used, with lots of issues >>> > > resolved in the past. What you try to introduce is a weird shortcut >>> > > that already has couple of issues as you mentioned and will certanly >>> > > have many more. 
Also, I'm pretty sure that in future, someone comes up >>> > > with ideas like multiple VFs, LACP and similar bonding things. >>> > The problem with the bond and team drivers is they are too large and >>> > have too many interfaces available for configuration so as a result >>> > they can really screw this interface up. >>> What? Too large in which sense? Why "too many interfaces" is a problem? >>> Also, team has only one interface to userspace team-generic-netlink. >>> >>> >>> > Essentially this is meant to be a bond that is more-or-less managed by >>> > the host, not the guest. We want the host to be able to configure it >>> How is it managed by the host? In your usecase the guest has 2 netdevs: >>> virtio_net, pci vf. >>> I don't see how host can do any managing of that, other than the >>> obvious. But still, the active/backup decision is done in guest. This is >>> a simple bond/team usecase. As I said, there is something needed to be >>> implemented in userspace in order to handle re-appear of vf netdev. >>> But that should be fairly easy to do in teamd. >> >>The host manages the active/backup decision by >>- assigning the same MAC address to both VF and virtio interfaces >>- setting a BACKUP feature bit on virtio that enables virtio to transparently >>take >> over the VF's datapath. >>- only enable one datapath at any time so that packets don't get looped back >>- during live migration enable virtio datapath, unplug vf on the source and >>replug >> vf on the destination. >> >>The VM is not expected and doesn't have any control of setting the MAC >>address >>or bringing up/down the links. >> >>This is the model that is currently supported with netvsc driver on Azure. > > Yeah, I can see it now :( I guess that the ship has sailed and we are > stuck with this ugly thing forever... > > Could you at least make some common code that is shared in between > netvsc and virtio_net so this is handled in exactly the same way in both?
> > The fact that the netvsc/virtio_net kidnaps a netdev only because it > has the same mac is going to give me some serious nightmares... > I think we need to introduce some more strict checks. In order for that to work we need to settle on a model for these. The issue is that netvsc is using what we refer to as the "2 netdev" model where they don't expose the paravirtual interface as its own netdev. The opinion of Jakub and others has been that we should do a "3 netdev" model in the case of virtio_net since otherwise we will lose functionality such as in-driver XDP and have to deal with an extra set of qdiscs and Tx queue locks on the transmit path. Really at this point I am good either way, but we probably need to have Stephen, Jakub, and whoever else had an opinion on the matter sort out the 2 vs 3 argument before we can proceed on that. Most of patch 2 in the set can easily be broken out into a separate file later if we decide to go that route. Thanks. - Alex
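The transmit rule described in the quoted cover letter (the 'bypass' master prefers the 'active' VF when it is present with link up and running, and otherwise falls back to the 'backup' virtio_net netdev) fits in a few lines. This is a minimal userspace C sketch, not code from the patch set: `struct slave_dev`, `struct bypass_dev` and `bypass_select_tx()` are hypothetical names, and the three boolean flags stand in for the kernel's registration, carrier and operational state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified slave netdev: only the state the 'bypass'
 * master needs in order to pick a transmit path. */
struct slave_dev {
    const char *name;
    bool present;   /* registered, e.g. the VF has not been hot-unplugged */
    bool link_up;   /* carrier is present */
    bool running;   /* interface is up and running */
};

/* Hypothetical 'bypass' master controlling the two slaves. */
struct bypass_dev {
    struct slave_dev *active; /* passthru VF, may be NULL */
    struct slave_dev *backup; /* original virtio_net netdev */
};

static bool slave_usable(const struct slave_dev *s)
{
    return s != NULL && s->present && s->link_up && s->running;
}

/* Prefer the 'active' VF when usable, otherwise fall back to 'backup'.
 * Returns NULL when neither slave can transmit. */
struct slave_dev *bypass_select_tx(const struct bypass_dev *b)
{
    assert(b != NULL);
    if (slave_usable(b->active))
        return b->active;
    if (slave_usable(b->backup))
        return b->backup;
    return NULL;
}
```

Because only one slave is ever returned, only one datapath carries traffic at a time, which matches the "only enable one datapath at any time" point in the quoted host model.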
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Tue, Feb 20, 2018 at 06:14:32PM CET, sridhar.samudr...@intel.com wrote: >On 2/20/2018 8:29 AM, Jiri Pirko wrote: >> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote: >> > On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko wrote: >> > > Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote: >> > > > Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be >> > > > used by hypervisor to indicate that virtio_net interface should act as >> > > > a backup for another device with the same MAC address. >> > > > >> > > > Patch 2 is in response to the community request for a 3 netdev >> > > > solution. However, it creates some issues we'll get into in a moment. >> > > > It extends virtio_net to use alternate datapath when available and >> > > > registered. When BACKUP feature is enabled, virtio_net driver creates >> > > > an additional 'bypass' netdev that acts as a master device and controls >> > > > 2 slave devices. The original virtio_net netdev is registered as >> > > > 'backup' netdev and a passthru/vf device with the same MAC gets >> > > > registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are >> > > > associated with the same 'pci' device. The user accesses the network >> > > > interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' >> > > > netdev >> > > > as default for transmits when it is available with link up and running. >> > > Sorry, but this is ridiculous. You are apparently re-implementing part >> > > of bonding driver as a part of NIC driver. Bond and team drivers >> > > are mature solutions, well tested, broadly used, with lots of issues >> > > resolved in the past. What you try to introduce is a weird shortcut >> > > that already has couple of issues as you mentioned and will certainly >> > > have many more. Also, I'm pretty sure that in future, someone comes up >> > > with ideas like multiple VFs, LACP and similar bonding things.
>> > The problem with the bond and team drivers is they are too large and >> > have too many interfaces available for configuration so as a result >> > they can really screw this interface up. >> What? Too large in which sense? Why "too many interfaces" is a problem? >> Also, team has only one interface to userspace team-generic-netlink. >> >> >> > Essentially this is meant to be a bond that is more-or-less managed by >> > the host, not the guest. We want the host to be able to configure it >> How is it managed by the host? In your usecase the guest has 2 netdevs: >> virtio_net, pci vf. >> I don't see how host can do any managing of that, other than the >> obvious. But still, the active/backup decision is done in guest. This is >> a simple bond/team usecase. As I said, there is something needed to be >> implemented in userspace in order to handle re-appear of vf netdev. >> But that should be fairly easy to do in teamd. > >The host manages the active/backup decision by >- assigning the same MAC address to both VF and virtio interfaces >- setting a BACKUP feature bit on virtio that enables virtio to transparently >take > over the VF's datapath. >- only enable one datapath at any time so that packets don't get looped back >- during live migration enable virtio datapath, unplug vf on the source and >replug > vf on the destination. > >The VM is not expected and doesn't have any control of setting the MAC >address >or bringing up/down the links. > >This is the model that is currently supported with netvsc driver on Azure. Yeah, I can see it now :( I guess that the ship has sailed and we are stuck with this ugly thing forever... Could you at least make some common code that is shared in between netvsc and virtio_net so this is handled in exactly the same way in both? The fact that the netvsc/virtio_net kidnaps a netdev only because it has the same mac is going to give me some serious nightmares... I think we need to introduce some more strict checks.
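The takeover criterion Jiri objects to can be written out explicitly: a newly registered netdev is taken over only when the hypervisor has offered the BACKUP feature and the netdev's MAC address equals the virtio_net MAC. The sketch below is hypothetical userspace C. `HYPOTHETICAL_F_BACKUP_BIT` is an illustrative bit position, not the value defined for VIRTIO_NET_F_BACKUP in the RFC, and `should_enslave()` is not a kernel function. It mainly shows how little is checked before a device is "kidnapped".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6
/* Hypothetical bit position, for illustration only; the RFC defines
 * VIRTIO_NET_F_BACKUP but this value is an assumption. */
#define HYPOTHETICAL_F_BACKUP_BIT 62

static bool feature_negotiated(uint64_t features, unsigned int bit)
{
    assert(bit < 64);
    return (features >> bit) & 1u;
}

/* The criterion under discussion: take over a newly registered netdev
 * only if BACKUP was negotiated and its MAC equals the virtio MAC.
 * Nothing else about the candidate device is inspected. */
bool should_enslave(uint64_t virtio_features,
                    const uint8_t virtio_mac[MAC_LEN],
                    const uint8_t candidate_mac[MAC_LEN])
{
    if (!feature_negotiated(virtio_features, HYPOTHETICAL_F_BACKUP_BIT))
        return false;
    return memcmp(virtio_mac, candidate_mac, MAC_LEN) == 0;
}
```

Any "more strict checks" would be additional conditions in `should_enslave()`; the thread does not specify what they should be, so none are sketched here.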
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Tue, Feb 20, 2018 at 06:23:49PM CET, alexander.du...@gmail.com wrote: >On Tue, Feb 20, 2018 at 8:29 AM, Jiri Pirko wrote: >> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote: >>>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko wrote: Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote: >Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be >used by hypervisor to indicate that virtio_net interface should act as >a backup for another device with the same MAC address. > >Patch 2 is in response to the community request for a 3 netdev >solution. However, it creates some issues we'll get into in a moment. >It extends virtio_net to use alternate datapath when available and >registered. When BACKUP feature is enabled, virtio_net driver creates >an additional 'bypass' netdev that acts as a master device and controls >2 slave devices. The original virtio_net netdev is registered as >'backup' netdev and a passthru/vf device with the same MAC gets >registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are >associated with the same 'pci' device. The user accesses the network >interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev >as default for transmits when it is available with link up and running. Sorry, but this is ridiculous. You are apparently re-implementing part of bonding driver as a part of NIC driver. Bond and team drivers are mature solutions, well tested, broadly used, with lots of issues resolved in the past. What you try to introduce is a weird shortcut that already has couple of issues as you mentioned and will certainly have many more. Also, I'm pretty sure that in future, someone comes up with ideas like multiple VFs, LACP and similar bonding things. >>> >>>The problem with the bond and team drivers is they are too large and >>>have too many interfaces available for configuration so as a result >>>they can really screw this interface up. >> >> What? Too large in which sense?
Why "too many interfaces" is a problem? >> Also, team has only one interface to userspace team-generic-netlink. > >Specifically I was working with bond. I had overlooked team for the >most part since it required an additional userspace daemon which >basically broke our requirement of no user-space intervention. Why? That sounds artificial. Why can't userspace be part of the solution? > >I was trying to focus on just doing an active/backup setup. The >problem is there are debugfs, sysfs, and procfs interfaces exposed >that we don't need and/or want. Adding any sort of interface to >exclude these would just bloat up the bonding driver, and leaving them >in would just be confusing since they would all need to be ignored. In >addition the steps needed to get the name to come out the same as the >original virtio interface would just bloat up bonding. Why do you care about "name"? It's a netdev; isn't that all that matters? The viewpoint of the user inside the vm boils down to: 1) I have 2 netdevs 2) One is preferred 3) I set up team on top of them That should be it. It is the user's responsibility to do it this way. > >>> >>>Essentially this is meant to be a bond that is more-or-less managed by >>>the host, not the guest. We want the host to be able to configure it >> >> How is it managed by the host? In your usecase the guest has 2 netdevs: >> virtio_net, pci vf. >> I don't see how host can do any managing of that, other than the >> obvious. But still, the active/backup decision is done in guest. This is >> a simple bond/team usecase. As I said, there is something needed to be >> implemented in userspace in order to handle re-appear of vf netdev. >> But that should be fairly easy to do in teamd. >> >> >>>and have it automatically kick in on the guest. For now we want to >>>avoid adding too much complexity as this is meant to be just the first >> >> That's what I fear, "for now"..
> >I used the expression "for now" as I see this being the first stage of >a multi-stage process. That is what I fear... > >Step 1 is to get a basic virtio-bypass driver added to virtio so that >it is at least comparable to netvsc in terms of feature set and >enables basic network live migration. > >Step 2 is adding some sort of dirty page tracking, preferably via >something like a paravirtual iommu interface. Once we have that we can >defer the eviction of the VF until the very last moment of the live >migration. For now I need to work on testing a modification to allow >mapping the entire guest as being pass-through for DMA to the device, >and requiring dynamic for any DMA that is bidirectional or from the >device. That is purely on the host side. Does not really matter if your solution or standard bond/team is in use, right? > >Step 3 will be to start looking at advanced configuration. That is >where we drop the
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Tue, Feb 20, 2018 at 8:29 AM, Jiri Pirko wrote: > Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote: >>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko wrote: >>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote: Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be used by hypervisor to indicate that virtio_net interface should act as a backup for another device with the same MAC address. Patch 2 is in response to the community request for a 3 netdev solution. However, it creates some issues we'll get into in a moment. It extends virtio_net to use alternate datapath when available and registered. When BACKUP feature is enabled, virtio_net driver creates an additional 'bypass' netdev that acts as a master device and controls 2 slave devices. The original virtio_net netdev is registered as 'backup' netdev and a passthru/vf device with the same MAC gets registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are associated with the same 'pci' device. The user accesses the network interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev as default for transmits when it is available with link up and running. >>> >>> Sorry, but this is ridiculous. You are apparently re-implementing part >>> of bonding driver as a part of NIC driver. Bond and team drivers >>> are mature solutions, well tested, broadly used, with lots of issues >>> resolved in the past. What you try to introduce is a weird shortcut >>> that already has couple of issues as you mentioned and will certainly >>> have many more. Also, I'm pretty sure that in future, someone comes up >>> with ideas like multiple VFs, LACP and similar bonding things. >> >>The problem with the bond and team drivers is they are too large and >>have too many interfaces available for configuration so as a result >>they can really screw this interface up. > > What? Too large in which sense? Why "too many interfaces" is a problem?
> Also, team has only one interface to userspace team-generic-netlink. Specifically I was working with bond. I had overlooked team for the most part since it required an additional userspace daemon which basically broke our requirement of no user-space intervention. I was trying to focus on just doing an active/backup setup. The problem is there are debugfs, sysfs, and procfs interfaces exposed that we don't need and/or want. Adding any sort of interface to exclude these would just bloat up the bonding driver, and leaving them in would just be confusing since they would all need to be ignored. In addition the steps needed to get the name to come out the same as the original virtio interface would just bloat up bonding. >> >>Essentially this is meant to be a bond that is more-or-less managed by >>the host, not the guest. We want the host to be able to configure it > > How is it managed by the host? In your usecase the guest has 2 netdevs: > virtio_net, pci vf. > I don't see how host can do any managing of that, other than the > obvious. But still, the active/backup decision is done in guest. This is > a simple bond/team usecase. As I said, there is something needed to be > implemented in userspace in order to handle re-appear of vf netdev. > But that should be fairly easy to do in teamd. > > >>and have it automatically kick in on the guest. For now we want to >>avoid adding too much complexity as this is meant to be just the first > > That's what I fear, "for now".. I used the expression "for now" as I see this being the first stage of a multi-stage process. Step 1 is to get a basic virtio-bypass driver added to virtio so that it is at least comparable to netvsc in terms of feature set and enables basic network live migration. Step 2 is adding some sort of dirty page tracking, preferably via something like a paravirtual iommu interface. Once we have that we can defer the eviction of the VF until the very last moment of the live migration. 
For now I need to work on testing a modification to allow mapping the entire guest as being pass-through for DMA to the device, and requiring dynamic for any DMA that is bidirectional or from the device. Step 3 will be to start looking at advanced configuration. That is where we drop the implementation in step 1 and instead look at spawning something that looks more like the team type interface, however instead of working with a user-space daemon we would likely need to work with some sort of mailbox or message queue coming up from the hypervisor. Then we can start looking at doing things like passing up blocks of eBPF code to handle Tx port selection or whatever we need. > >>step. Trying to go in and implement the whole solution right from the >>start based on existing drivers is going to be a massive time sink and >>will likely never get completed due to the fact that there is always >>going to be some other thing that will
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On 2/20/2018 8:29 AM, Jiri Pirko wrote: Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote: On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko wrote: Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote: Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be used by hypervisor to indicate that virtio_net interface should act as a backup for another device with the same MAC address. Patch 2 is in response to the community request for a 3 netdev solution. However, it creates some issues we'll get into in a moment. It extends virtio_net to use alternate datapath when available and registered. When BACKUP feature is enabled, virtio_net driver creates an additional 'bypass' netdev that acts as a master device and controls 2 slave devices. The original virtio_net netdev is registered as 'backup' netdev and a passthru/vf device with the same MAC gets registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are associated with the same 'pci' device. The user accesses the network interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev as default for transmits when it is available with link up and running. Sorry, but this is ridiculous. You are apparently re-implementing part of bonding driver as a part of NIC driver. Bond and team drivers are mature solutions, well tested, broadly used, with lots of issues resolved in the past. What you try to introduce is a weird shortcut that already has couple of issues as you mentioned and will certainly have many more. Also, I'm pretty sure that in future, someone comes up with ideas like multiple VFs, LACP and similar bonding things. The problem with the bond and team drivers is they are too large and have too many interfaces available for configuration so as a result they can really screw this interface up. What? Too large in which sense? Why "too many interfaces" is a problem? Also, team has only one interface to userspace team-generic-netlink.
Essentially this is meant to be a bond that is more-or-less managed by the host, not the guest. We want the host to be able to configure it How is it managed by the host? In your usecase the guest has 2 netdevs: virtio_net, pci vf. I don't see how host can do any managing of that, other than the obvious. But still, the active/backup decision is done in guest. This is a simple bond/team usecase. As I said, there is something needed to be implemented in userspace in order to handle re-appear of vf netdev. But that should be fairly easy to do in teamd. The host manages the active/backup decision by - assigning the same MAC address to both VF and virtio interfaces - setting a BACKUP feature bit on virtio that enables virtio to transparently take over the VF's datapath. - only enable one datapath at any time so that packets don't get looped back - during live migration enable virtio datapath, unplug vf on the source and replug vf on the destination. The VM is not expected and doesn't have any control of setting the MAC address or bringing up/down the links. This is the model that is currently supported with netvsc driver on Azure. and have it automatically kick in on the guest. For now we want to avoid adding too much complexity as this is meant to be just the first That's what I fear, "for now".. step. Trying to go in and implement the whole solution right from the start based on existing drivers is going to be a massive time sink and will likely never get completed due to the fact that there is always going to be some other thing that will interfere. "implement the whole solution right from the start based on existing drivers" - what solution are you talking about? I don't understand this para. My personal hope is that we can look at doing a virtio-bond sort of device that will handle all this as well as providing a communication channel, but that is much further down the road.
For now we only have a single bit so the goal for now is trying to keep this as simple as possible. Oh. So there is really intention to do re-implementation of bonding in virtio. That is plain-wrong in my opinion. Could you just use bond/team, please, and don't reinvent the wheel with this abomination? What is the reason for this abomination? According to: https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 The reason is quite weak. User in the vm sees 2 (or more) netdevices, he puts them in bond/team and that's it. This works now! If the vm lacks some userspace features, let's fix it there! For example the MAC changes is something that could be easily handled in teamd userspace daemon. I think you might have missed the point of this. This is meant to be a simple interface so the guest should not be able to change the MAC address, and it shouldn't require any userspace daemon to setup or tear down. Ideally with this solution the virtio bypass will come up and be assigned the name of the original virtio, and the
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote: >On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko wrote: >> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote: >>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be >>>used by hypervisor to indicate that virtio_net interface should act as >>>a backup for another device with the same MAC address. >>> >>>Patch 2 is in response to the community request for a 3 netdev >>>solution. However, it creates some issues we'll get into in a moment. >>>It extends virtio_net to use alternate datapath when available and >>>registered. When BACKUP feature is enabled, virtio_net driver creates >>>an additional 'bypass' netdev that acts as a master device and controls >>>2 slave devices. The original virtio_net netdev is registered as >>>'backup' netdev and a passthru/vf device with the same MAC gets >>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are >>>associated with the same 'pci' device. The user accesses the network >>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev >>>as default for transmits when it is available with link up and running. >> >> Sorry, but this is ridiculous. You are apparently re-implementing part >> of bonding driver as a part of NIC driver. Bond and team drivers >> are mature solutions, well tested, broadly used, with lots of issues >> resolved in the past. What you try to introduce is a weird shortcut >> that already has couple of issues as you mentioned and will certainly >> have many more. Also, I'm pretty sure that in future, someone comes up >> with ideas like multiple VFs, LACP and similar bonding things. > >The problem with the bond and team drivers is they are too large and >have too many interfaces available for configuration so as a result >they can really screw this interface up. What? Too large in which sense? Why "too many interfaces" is a problem?
Also, team has only one interface to userspace team-generic-netlink. > >Essentially this is meant to be a bond that is more-or-less managed by >the host, not the guest. We want the host to be able to configure it How is it managed by the host? In your usecase the guest has 2 netdevs: virtio_net, pci vf. I don't see how host can do any managing of that, other than the obvious. But still, the active/backup decision is done in guest. This is a simple bond/team usecase. As I said, there is something needed to be implemented in userspace in order to handle re-appear of vf netdev. But that should be fairly easy to do in teamd. >and have it automatically kick in on the guest. For now we want to >avoid adding too much complexity as this is meant to be just the first That's what I fear, "for now".. >step. Trying to go in and implement the whole solution right from the >start based on existing drivers is going to be a massive time sink and >will likely never get completed due to the fact that there is always >going to be some other thing that will interfere. "implement the whole solution right from the start based on existing drivers" - what solution are you talking about? I don't understand this para. > >My personal hope is that we can look at doing a virtio-bond sort of >device that will handle all this as well as providing a communication >channel, but that is much further down the road. For now we only have >a single bit so the goal for now is trying to keep this as simple as >possible. Oh. So there is really intention to do re-implementation of bonding in virtio. That is plain-wrong in my opinion. Could you just use bond/team, please, and don't reinvent the wheel with this abomination? > >> What is the reason for this abomination? According to: >> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 >> The reason is quite weak. >> User in the vm sees 2 (or more) netdevices, he puts them in bond/team >> and that's it. This works now!
If the vm lacks some userspace features, >> let's fix it there! For example the MAC changes is something that could >> be easily handled in teamd userspace daemon. > >I think you might have missed the point of this. This is meant to be a >simple interface so the guest should not be able to change the MAC >address, and it shouldn't require any userspace daemon to setup or >tear down. Ideally with this solution the virtio bypass will come up >and be assigned the name of the original virtio, and the "backup" >interface will come up and be assigned the name of the original virtio >with an additional "nbackup" tacked on via the phys_port_name, and >then whenever a VF is added it will automatically be enslaved by the >bypass interface, and it will be removed when the VF is hotplugged >out. > >In my mind the difference between this and bond or team is where the >configuration interface lies. In the case of bond it is in the kernel. >If my understanding is correct team is mostly in user space. With this >the configuration interface
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On 2/18/2018 10:11 PM, Jakub Kicinski wrote: On Sat, 17 Feb 2018 09:12:01 -0800, Alexander Duyck wrote: We noticed a couple of issues with this approach during testing. - As both 'bypass' and 'backup' netdevs are associated with the same virtio pci device, udev tries to rename both of them with the same name and the 2nd rename will fail. This would be OK as long as the first netdev to be renamed is the 'bypass' netdev, but the order in which udev gets to rename the 2 netdevs is not reliable. Out of curiosity - why do you link the master netdev to the virtio struct device? The basic idea of all this is that we wanted this to work with an existing VM image that was using virtio. As such we were trying to make it so that the bypass interface takes the place of the original virtio and get udev to rename the bypass to what the original virtio_net was. That makes sense. Is it udev/naming that you're most concerned about here? I.e. what's the user space app that expects the netdev to be linked? This is just out of curiosity, the linking of netdevs to devices is a bit of a PITA in the switchdev eswitch mode world, with libvirt expecting only certain devices to be there.. Right now we're not linking VF reprs, which breaks naming. I wanted to revisit that. For live migration usecase, userspace is only aware of one virtio_net interface and it doesn't expect it to be linked with any lower dev. So it should be fine even if the lower netdev is not present. Only the master netdev should be assigned the same name so that userspace configuration scripts in the VM don't need to change. FWIW two solutions that immediately come to mind is to export "backup" as phys_port_name of the backup virtio link and/or assign a name to the master like you are doing already. I think team uses team%d and bond uses bond%d, soft naming of master devices seems quite natural in this case. I figured I had overlooked something like that.. Thanks for pointing this out. 
Okay so I think the phys_port_name approach might resolve the original issue. If I am reading things correctly what we end up with is the master showing up as "ens1" for example and the backup showing up as "ens1nbackup". Am I understanding that right? Yes, provided systemd is new enough. Yes. I did a quick test to confirm that adding ndo_phys_port_name() to virtio_net ndo_ops fixes the udev naming issue with 2 virtio netdevs. This is on Fedora 27. The problem with the team/bond%d approach is that it creates a new netdevice and so it would require guest configuration changes. IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio link is quite neat. I agree. For non-"backup" virtio_net devices would it be okay for us to just return -EOPNOTSUPP? I assume it would be and that way the legacy behavior could be maintained although the function still exists. That's my understanding too.
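The naming scheme agreed above can be modelled in userspace: the backup virtio_net link reports "backup" as its physical port name only when the BACKUP bit is negotiated, returns -EOPNOTSUPP otherwise so legacy devices keep their names, and udev/systemd appends "n" plus the port name to the base name. This is a hedged sketch. The callback only mimics the shape of the kernel's `ndo_get_phys_port_name(dev, name, len)` hook; `struct fake_vnet`, `fake_get_phys_port_name()` and `compose_ifname()` are hypothetical stand-ins, and `compose_ifname()` approximates, rather than reproduces, systemd's naming logic.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for virtio_net private state. */
struct fake_vnet {
    bool backup_negotiated; /* BACKUP feature bit offered by the hypervisor */
};

/* Same shape as ndo_get_phys_port_name(dev, name, len): report "backup"
 * for the backup link, -EOPNOTSUPP for legacy virtio_net devices. */
int fake_get_phys_port_name(const struct fake_vnet *vi, char *buf, size_t len)
{
    int n;

    if (!vi->backup_negotiated)
        return -EOPNOTSUPP;
    n = snprintf(buf, len, "backup");
    if (n < 0 || (size_t)n >= len)
        return -EINVAL; /* buffer too small */
    return 0;
}

/* Rough approximation of the udev/systemd rule discussed above: append
 * "n<phys_port_name>" to the base name (e.g. "ens1") when one exists. */
int compose_ifname(const char *base, const struct fake_vnet *vi,
                   char *out, size_t outlen)
{
    char port[16];
    int n;

    assert(base != NULL && out != NULL);
    if (fake_get_phys_port_name(vi, port, sizeof(port)) == 0)
        n = snprintf(out, outlen, "%sn%s", base, port);
    else
        n = snprintf(out, outlen, "%s", base);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With the BACKUP bit set, `compose_ifname("ens1", ...)` yields "ens1nbackup"; without it the legacy name "ens1" is kept, which is the behaviour Sridhar and Alex converge on.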
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko wrote: > Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote: >>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be >>used by hypervisor to indicate that virtio_net interface should act as >>a backup for another device with the same MAC address. >> >>Patch 2 is in response to the community request for a 3 netdev >>solution. However, it creates some issues we'll get into in a moment. >>It extends virtio_net to use alternate datapath when available and >>registered. When BACKUP feature is enabled, virtio_net driver creates >>an additional 'bypass' netdev that acts as a master device and controls >>2 slave devices. The original virtio_net netdev is registered as >>'backup' netdev and a passthru/vf device with the same MAC gets >>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are >>associated with the same 'pci' device. The user accesses the network >>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev >>as default for transmits when it is available with link up and running. > > Sorry, but this is ridiculous. You are apparently re-implementing part > of bonding driver as a part of NIC driver. Bond and team drivers > are mature solutions, well tested, broadly used, with lots of issues > resolved in the past. What you try to introduce is a weird shortcut > that already has couple of issues as you mentioned and will certainly > have many more. Also, I'm pretty sure that in future, someone comes up > with ideas like multiple VFs, LACP and similar bonding things. The problem with the bond and team drivers is they are too large and have too many interfaces available for configuration so as a result they can really screw this interface up. Essentially this is meant to be a bond that is more-or-less managed by the host, not the guest. We want the host to be able to configure it and have it automatically kick in on the guest.
For now we want to avoid adding too much complexity as this is meant to be just the first step. Trying to go in and implement the whole solution right from the start based on existing drivers is going to be a massive time sink and will likely never get completed due to the fact that there is always going to be some other thing that will interfere. My personal hope is that we can look at doing a virtio-bond sort of device that will handle all this as well as providing a communication channel, but that is much further down the road. For now we only have a single bit so the goal for now is trying to keep this as simple as possible. > What is the reason for this abomination? According to: > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 > The reason is quite weak. > User in the vm sees 2 (or more) netdevices, he puts them in bond/team > and that's it. This works now! If the vm lacks some userspace features, > let's fix it there! For example the MAC changes is something that could > be easily handled in teamd userspace daemon. I think you might have missed the point of this. This is meant to be a simple interface so the guest should not be able to change the MAC address, and it shouldn't require any userspace daemon to set up or tear down. Ideally with this solution the virtio bypass will come up and be assigned the name of the original virtio, and the "backup" interface will come up and be assigned the name of the original virtio with an additional "nbackup" tacked on via the phys_port_name, and then whenever a VF is added it will automatically be enslaved by the bypass interface, and it will be removed when the VF is hotplugged out. In my mind the difference between this and bond or team is where the configuration interface lies. In the case of bond it is in the kernel. If my understanding is correct team is mostly in user space. With this the configuration interface is really down in the hypervisor and requests are communicated up to the guest.
I would prefer not to make virtio_net dependent on the bonding or team drivers, or worse yet a userspace daemon in the guest. For now I would argue we should keep this as simple as possible just to support basic live migration. There have already been discussions of refactoring this after it is in so that we can start to combine the functionality here with what is there in bonding/team, but the differences in configuration interface and the size of the code bases will make it challenging to outright merge this into something like that.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote: >Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be >used by hypervisor to indicate that virtio_net interface should act as >a backup for another device with the same MAC address. > >Patch 2 is in response to the community request for a 3 netdev >solution. However, it creates some issues we'll get into in a moment. >It extends virtio_net to use alternate datapath when available and >registered. When BACKUP feature is enabled, virtio_net driver creates >an additional 'bypass' netdev that acts as a master device and controls >2 slave devices. The original virtio_net netdev is registered as >'backup' netdev and a passthru/vf device with the same MAC gets >registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are >associated with the same 'pci' device. The user accesses the network >interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev >as default for transmits when it is available with link up and running. Sorry, but this is ridiculous. You are apparently re-implementing part of bonding driver as a part of NIC driver. Bond and team drivers are mature solutions, well tested, broadly used, with lots of issues resolved in the past. What you try to introduce is a weird shortcut that already has a couple of issues as you mentioned and will certainly have many more. Also, I'm pretty sure that in future, someone comes up with ideas like multiple VFs, LACP and similar bonding things. What is the reason for this abomination? According to: https://marc.info/?l=linux-virtualization=151189725224231=2 The reason is quite weak. User in the vm sees 2 (or more) netdevices, he puts them in bond/team and that's it. This works now! If the vm lacks some userspace features, let's fix it there! For example the MAC changes are something that could be easily handled in teamd userspace daemon.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Sat, 17 Feb 2018 09:12:01 -0800, Alexander Duyck wrote: > >> We noticed a couple of issues with this approach during testing. > >> - As both 'bypass' and 'backup' netdevs are associated with the same > >> virtio pci device, udev tries to rename both of them with the same name > >> and the 2nd rename will fail. This would be OK as long as the first > >> netdev > >> to be renamed is the 'bypass' netdev, but the order in which udev gets > >> to rename the 2 netdevs is not reliable. > > > > Out of curiosity - why do you link the master netdev to the virtio > > struct device? > > The basic idea of all this is that we wanted this to work with an > existing VM image that was using virtio. As such we were trying to > make it so that the bypass interface takes the place of the original > virtio and get udev to rename the bypass to what the original > virtio_net was. That makes sense. Is it udev/naming that you're most concerned about here? I.e. what's the user space app that expects the netdev to be linked? This is just out of curiosity; the linking of netdevs to devices is a bit of a PITA in the switchdev eswitch mode world, with libvirt expecting only certain devices to be there. Right now we're not linking VF reprs, which breaks naming. I wanted to revisit that. > > FWIW two solutions that immediately come to mind are to export "backup" > > as phys_port_name of the backup virtio link and/or assign a name to the > > master like you are doing already. I think team uses team%d and bond > > uses bond%d, soft naming of master devices seems quite natural in this > > case. > > I figured I had overlooked something like that. Thanks for pointing > this out. Okay so I think the phys_port_name approach might resolve > the original issue. If I am reading things correctly what we end up > with is the master showing up as "ens1" for example and the backup > showing up as "ens1nbackup". Am I understanding that right? Yes, provided systemd is new enough.
> The problem with the team/bond%d approach is that it creates a new > netdevice and so it would require guest configuration changes. > > > IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio > > link is quite neat. > > I agree. For non-"backup" virtio_net devices would it be okay for us to > just return -EOPNOTSUPP? I assume it would be and that way the legacy > behavior could be maintained although the function still exists. That's my understanding too.
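The phys_port_name scheme agreed on above can be sketched in user-space C. This is a hedged model, not the driver change: `struct virtio_link`, `backup_phys_port_name()` and `compose_ifname()` are hypothetical names. The first function only follows the shape of the kernel's `ndo_get_phys_port_name(dev, name, len)` callback (report "backup" when the BACKUP bit is set, otherwise return `-EOPNOTSUPP` to keep legacy behaviour), and the second roughly models how systemd's predictable naming appends `n<phys_port_name>` to a slot-based name such as "ens1".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of the only virtio state that matters here. */
struct virtio_link {
    bool backup_feature; /* assumed: VIRTIO_NET_F_BACKUP was negotiated */
};

/* Shaped like ndo_get_phys_port_name(dev, name, len): report "backup" only
 * when the BACKUP bit is set; otherwise keep legacy behaviour with -EOPNOTSUPP. */
static int backup_phys_port_name(const struct virtio_link *l, char *buf, size_t len)
{
    int n;

    assert(l != NULL && buf != NULL);
    if (!l->backup_feature)
        return -EOPNOTSUPP;
    n = snprintf(buf, len, "backup");
    if (n < 0 || (size_t)n >= len)
        return -EINVAL; /* buffer too small */
    return 0;
}

/* Rough model of systemd naming: "<base>n<phys_port_name>" when a port name
 * is available, otherwise just "<base>". */
static void compose_ifname(const char *base, const struct virtio_link *l,
                           char *out, size_t len)
{
    char port[16];

    if (backup_phys_port_name(l, port, sizeof(port)) == 0)
        snprintf(out, len, "%sn%s", base, port);
    else
        snprintf(out, len, "%s", base);
}
```

With distinct names for the master ("ens1") and the backup ("ens1nbackup"), udev no longer tries to give both netdevs the same name, which is the rename race described in the cover letter.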
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski wrote: > On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: >> Patch 2 is in response to the community request for a 3 netdev >> solution. However, it creates some issues we'll get into in a moment. >> It extends virtio_net to use alternate datapath when available and >> registered. When BACKUP feature is enabled, virtio_net driver creates >> an additional 'bypass' netdev that acts as a master device and controls >> 2 slave devices. The original virtio_net netdev is registered as >> 'backup' netdev and a passthru/vf device with the same MAC gets >> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are >> associated with the same 'pci' device. The user accesses the network >> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev >> as default for transmits when it is available with link up and running. > > Thank you for doing this. > >> We noticed a couple of issues with this approach during testing. >> - As both 'bypass' and 'backup' netdevs are associated with the same >> virtio pci device, udev tries to rename both of them with the same name >> and the 2nd rename will fail. This would be OK as long as the first netdev >> to be renamed is the 'bypass' netdev, but the order in which udev gets >> to rename the 2 netdevs is not reliable. > > Out of curiosity - why do you link the master netdev to the virtio > struct device? The basic idea of all this is that we wanted this to work with an existing VM image that was using virtio. As such we were trying to make it so that the bypass interface takes the place of the original virtio and get udev to rename the bypass to what the original virtio_net was. > FWIW two solutions that immediately come to mind are to export "backup" > as phys_port_name of the backup virtio link and/or assign a name to the > master like you are doing already.
I think team uses team%d and bond > uses bond%d, soft naming of master devices seems quite natural in this > case. I figured I had overlooked something like that. Thanks for pointing this out. Okay so I think the phys_port_name approach might resolve the original issue. If I am reading things correctly what we end up with is the master showing up as "ens1" for example and the backup showing up as "ens1nbackup". Am I understanding that right? The problem with the team/bond%d approach is that it creates a new netdevice and so it would require guest configuration changes. > IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio > link is quite neat. I agree. For non-"backup" virtio_net devices would it be okay for us to just return -EOPNOTSUPP? I assume it would be and that way the legacy behavior could be maintained although the function still exists. >> - When the 'active' netdev is unplugged OR not present on a destination >> system after live migration, the user will see 2 virtio_net netdevs. > > That's necessary and expected, all configuration applies to the master > so master must exist. With the naming issue resolved this is the only item left outstanding. This becomes a matter of form vs function. The main complaint about the "3 netdev" solution is that it is a bit confusing to have the 2 netdevs present if the VF isn't there. The idea is that having the extra "master" netdev there if there isn't really a bond is a bit ugly. The downside of the "2 netdev" solution is that you have to deal with an extra layer of locking/queueing to get to the VF and you lose some functionality since things like in-driver XDP have to be disabled in order to maintain the same functionality when the VF is present or not. However it looks more like classic virtio_net when the VF is not present.
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote: > Patch 2 is in response to the community request for a 3 netdev > solution. However, it creates some issues we'll get into in a moment. > It extends virtio_net to use alternate datapath when available and > registered. When BACKUP feature is enabled, virtio_net driver creates > an additional 'bypass' netdev that acts as a master device and controls > 2 slave devices. The original virtio_net netdev is registered as > 'backup' netdev and a passthru/vf device with the same MAC gets > registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are > associated with the same 'pci' device. The user accesses the network > interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev > as default for transmits when it is available with link up and running. Thank you for doing this. > We noticed a couple of issues with this approach during testing. > - As both 'bypass' and 'backup' netdevs are associated with the same > virtio pci device, udev tries to rename both of them with the same name > and the 2nd rename will fail. This would be OK as long as the first netdev > to be renamed is the 'bypass' netdev, but the order in which udev gets > to rename the 2 netdevs is not reliable. Out of curiosity - why do you link the master netdev to the virtio struct device? FWIW two solutions that immediately come to mind are to export "backup" as phys_port_name of the backup virtio link and/or assign a name to the master like you are doing already. I think team uses team%d and bond uses bond%d, soft naming of master devices seems quite natural in this case. IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio link is quite neat. > - When the 'active' netdev is unplugged OR not present on a destination > system after live migration, the user will see 2 virtio_net netdevs. That's necessary and expected, all configuration applies to the master so master must exist.
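The "soft naming" alternative mentioned above (team%d, bond%d) means the master gets a template name whose "%d" is replaced by the lowest unused index, which is roughly what the kernel's dev_alloc_name() does. The sketch below is a hypothetical user-space model of that idea: `alloc_soft_name()` and its cap of 1000 candidates are assumptions for illustration, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return the lowest index i such that "<prefix><i>" is not in 'existing',
 * writing that name to 'out'; return -1 if the (arbitrary) cap is reached. */
static int alloc_soft_name(const char *prefix, const char *const existing[],
                           size_t n_existing, char *out, size_t len)
{
    assert(prefix != NULL && out != NULL && len > 0);
    for (int i = 0; i < 1000; i++) {
        bool taken = false;

        snprintf(out, len, "%s%d", prefix, i);
        for (size_t j = 0; j < n_existing; j++) {
            if (strcmp(existing[j], out) == 0) {
                taken = true;
                break;
            }
        }
        if (!taken)
            return i;
    }
    return -1;
}
```

Because the resulting name (for example "bond1") is new, any guest configuration that refers to the original virtio interface name would have to change, which is the objection raised in the reply above.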
[RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be used by the hypervisor to indicate that the virtio_net interface should act as a backup for another device with the same MAC address. Patch 2 is in response to the community request for a 3 netdev solution. However, it creates some issues we'll get into in a moment. It extends virtio_net to use an alternate datapath when available and registered. When the BACKUP feature is enabled, the virtio_net driver creates an additional 'bypass' netdev that acts as a master device and controls 2 slave devices. The original virtio_net netdev is registered as 'backup' netdev and a passthru/vf device with the same MAC gets registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are associated with the same 'pci' device. The user accesses the network interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev as default for transmits when it is available with link up and running. We noticed a couple of issues with this approach during testing. - As both 'bypass' and 'backup' netdevs are associated with the same virtio pci device, udev tries to rename both of them with the same name and the 2nd rename will fail. This would be OK as long as the first netdev to be renamed is the 'bypass' netdev, but the order in which udev gets to rename the 2 netdevs is not reliable. - When the 'active' netdev is unplugged OR not present on a destination system after live migration, the user will see 2 virtio_net netdevs. Patch 3 refactors much of the changes made in patch 2, which was done on purpose just to show the solution we recommend as part of one patch set. If we submit a final version of this, we would combine patches 2 and 3. This patch removes the creation of an additional netdev. Instead, it uses a new virtnet_bypass_info struct added to the original 'backup' netdev to track the 'bypass' information and introduces an additional set of ndo and ethtool ops that are used when the BACKUP feature is enabled.
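The feature-bit check and the transmit choice described above can be sketched together. This is a hypothetical user-space model, not the patch: the bit number 62 is an assumption for illustration (the real value lives in the patch's include/uapi/linux/virtio_net.h change), and `struct slave_state`, `enum tx_path` and `bypass_select_tx()` are invented names. The flags loosely mirror what a driver would read via netif_running() and netif_carrier_ok().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit number for illustration only; check the patch's uapi header. */
#define VIRTIO_NET_F_BACKUP 62

struct slave_state {
    bool present; /* device registered and enslaved */
    bool running; /* administratively up (cf. netif_running()) */
    bool carrier; /* link up (cf. netif_carrier_ok()) */
};

enum tx_path { TX_NONE, TX_ACTIVE_VF, TX_BACKUP_VIRTIO };

static bool has_feature(uint64_t features, unsigned int bit)
{
    assert(bit < 64);
    return ((features >> bit) & 1ULL) != 0;
}

/* Without the BACKUP bit, behave like classic virtio_net. With it, prefer the
 * 'active' VF when present, running and with carrier; otherwise fall back to
 * the 'backup' virtio slave. */
static enum tx_path bypass_select_tx(uint64_t features,
                                     const struct slave_state *active,
                                     const struct slave_state *backup)
{
    if (!has_feature(features, VIRTIO_NET_F_BACKUP))
        return backup->present ? TX_BACKUP_VIRTIO : TX_NONE;
    if (active->present && active->running && active->carrier)
        return TX_ACTIVE_VF;
    return backup->present ? TX_BACKUP_VIRTIO : TX_NONE;
}
```

Falling back to the backup whenever it is present keeps the model simple; a driver could additionally check the backup's carrier before choosing it.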
One difference between the 3 netdev model and the 2 netdev model is that the 'bypass' netdev is created with 'noqueue' qdisc marked as 'NETIF_F_LLTX'. This avoids going through an additional qdisc and acquiring an additional qdisc and tx lock during transmits. If we can replace the qdisc of virtio netdev dynamically, it should be possible to get these optimizations enabled even with the 2 netdev model when the BACKUP feature is enabled. As this patch series is initially focusing on use cases where the hypervisor fully controls the VM networking and the guest is not expected to directly configure any hardware settings, it doesn't expose all the ndo/ethtool ops that are supported by virtio_net at this time. To support additional use cases, it should be possible to enable additional ops later by caching the state in virtio netdev and replaying it when the 'active' netdev gets registered. The hypervisor needs to enable only one datapath at any time so that packets don't get looped back to the VM over the other datapath. When a VF is plugged, the virtio datapath link state can be marked as down. At the time of live migration, the hypervisor needs to unplug the VF device from the guest on the source host and reset the MAC filter of the VF to initiate failover of the datapath to virtio before starting the migration. After the migration is completed, the destination hypervisor sets the MAC filter on the VF and plugs it back into the guest to switch over to the VF datapath. This patch is based on the discussion initiated by Jesse on this thread. https://marc.info/?l=linux-virtualization=151189725224231=2 Sridhar Samudrala (3): virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit virtio_net: Extend virtio to use VF datapath when available virtio_net: Enable alternate datapath without creating an additional netdev drivers/net/virtio_net.c | 564 +++- include/uapi/linux/virtio_net.h | 3 + 2 files changed, 563 insertions(+), 4 deletions(-) -- 2.14.3
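The migration sequence described above, together with the rule that only one datapath is enabled at a time, can be modelled as a small host-side state machine. This is a hypothetical sketch: `struct host_view`, `one_datapath()`, `source_prepare_migration()` and `dest_finish_migration()` are invented names, and bringing the virtio link back up on failover is an assumption (the cover letter only states that the virtio link can be marked down while a VF is plugged).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical host-side view of the guest's two datapaths. */
struct host_view {
    bool vf_plugged;     /* VF hotplugged into the guest */
    bool vf_mac_filter;  /* guest MAC programmed in the VF's MAC filter */
    bool virtio_link_up; /* virtio link state advertised to the guest */
};

/* Invariant from the cover letter: never enable both datapaths at once,
 * otherwise packets could loop back to the VM over the other path. */
static bool one_datapath(const struct host_view *h)
{
    bool vf_path = h->vf_plugged && h->vf_mac_filter;

    return !(vf_path && h->virtio_link_up);
}

/* Source host, before migration: unplug the VF and reset its MAC filter,
 * then (assumed) bring the virtio link up so traffic fails over to virtio. */
static void source_prepare_migration(struct host_view *h)
{
    h->vf_plugged = false;
    h->vf_mac_filter = false;
    h->virtio_link_up = true;
    assert(one_datapath(h));
}

/* Destination host, after migration: mark the virtio link down, set the MAC
 * filter on the VF and plug it back into the guest to use the VF datapath. */
static void dest_finish_migration(struct host_view *h)
{
    h->virtio_link_up = false;
    h->vf_mac_filter = true;
    h->vf_plugged = true;
    assert(one_datapath(h));
}
```

Each step disables one path before enabling the other, so there may be a brief window with no datapath, but never one with both enabled.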