Re: [PATCH net] failover: eliminate callback hell
On Mon, Jun 11, 2018 at 11:56:56AM -0700, Siwei Liu wrote: > The current implementation may only work with new userspace, even so > the eth0/eth0nsby naming is not consistenly persisted due to races in > bus probing. Which race do you mean exactly? -- MST
Re: [PATCH net] failover: eliminate callback hell
On 6/11/2018 12:34 PM, Samudrala, Sridhar wrote: On 6/11/2018 11:10 AM, Michael S. Tsirkin wrote: On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote: * Set permanent and current address of net_failover device to match the primary. We copy the dev_addr of standby dev to failover_dev in net_failover_create() before calling register_netdev(). register_netdev() does a copy of dev_addr to perm_addr. So i don't think this is an issue. * Carrier should be marked off before registering device the net_failover device. Will fix this and also a couple of places dev_err() needs to be replaced with netdev_err() Sridhar, do we want to address this? If yes, could you please take a look at addressing these meanwhile, while we keep arguing about making API changes? Sure. I will submit patches to address these issues raised by Stephen.
Re: [PATCH net] failover: eliminate callback hell
On 6/11/2018 11:10 AM, Michael S. Tsirkin wrote: On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote: * Set permanent and current address of net_failover device to match the primary. * Carrier should be marked off before registering device the net_failover device. Sridhar, do we want to address this? If yes, could you please take a look at addressing these meanwhile, while we keep arguing about making API changes? Sure. I will submit patches to address these issues raised by Stephen.
Re: [PATCH net] failover: eliminate callback hell
On Mon, Jun 11, 2018 at 8:22 AM, Stephen Hemminger wrote: > On Fri, 8 Jun 2018 17:42:21 -0700 > Siwei Liu wrote: > >> On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger >> wrote: >> > On Fri, 8 Jun 2018 16:44:12 -0700 >> > Siwei Liu wrote: >> > >> >> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger >> >> wrote: >> >> > On Fri, 8 Jun 2018 15:25:59 -0700 >> >> > Siwei Liu wrote: >> >> > >> >> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger >> >> >> wrote: >> >> >> > On Wed, 6 Jun 2018 15:30:27 +0300 >> >> >> > "Michael S. Tsirkin" wrote: >> >> >> > >> >> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: >> >> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org >> >> >> >> > wrote: >> >> >> >> > >The net failover should be a simple library, not a virtual >> >> >> >> > >object with function callbacks (see callback hell). >> >> >> >> > >> >> >> >> > Why just a library? It should do a common things. I think it >> >> >> >> > should be a >> >> >> >> > virtual object. Looks like your patch again splits the common >> >> >> >> > functionality into multiple drivers. That is kind of backwards >> >> >> >> > attitude. >> >> >> >> > I don't get it. We should rather focus on fixing the mess the >> >> >> >> > introduction of netvsc-bonding caused and switch netvsc to >> >> >> >> > 3-netdev >> >> >> >> > model. >> >> >> >> >> >> >> >> So it seems that at least one benefit for netvsc would be better >> >> >> >> handling of renames. >> >> >> >> >> >> >> >> Question is how can this change to 3-netdev happen? Stephen is >> >> >> >> concerned about risk of breaking some userspace. >> >> >> >> >> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to >> >> >> >> address, and you said then "why not use existing network namespaces >> >> >> >> rather than inventing a new abstraction". So how about it then? Do >> >> >> >> you >> >> >> >> want to find a way to use namespaces to hide the PV device for >> >> >> >> netvsc >> >> >> >> compatibility? >> >> >> >> >> >> >> > >> >> >> > Netvsc can't work with 3 dev model. MS has worked with enough >> >> >> > distro's and >> >> >> > startups that all demand eth0 always be present. And VF may come and >> >> >> > go. >> >> >> > After this history, there is a strong motivation not to change how >> >> >> > kernel >> >> >> > behaves. Switching to 3 device model would be perceived as breaking >> >> >> > existing userspace. >> >> >> > >> >> >> > With virtio you can work it out with the distro's yourself. >> >> >> > There is no pre-existing semantics to deal with. >> >> >> > >> >> >> > For the virtio, I don't see the need for IFF_HIDDEN. >> >> >> >> >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of >> >> >> that flag, as well as the 1-netdev model, is to have a means to >> >> >> inherit the interface name from the VF, and to eliminate playing hacks >> >> >> around renaming devices, customizing udev rules and et al. Why >> >> >> inheriting VF's name important? To allow existing config/setup around >> >> >> VF continues to work across kernel feature upgrade. Most of network >> >> >> config files in all distros are based on interface names. Few are MAC >> >> >> address based but making lower slaves hidden would cover the rest. And >> >> >> most importantly, preserving the same level of user experience as >> >> >> using raw VF interface once getting all ndo_ops and ethtool_ops >> >> >> exposed. This is essential to realize transparent live migration that >> >> >> users dont have to learn and be aware of the undertaken. >> >> > >> >> > Inheriting the VF name will fail in the migration scenario. >> >> > It is perfectly reasonable to migrate a guest to another machine where >> >> > the VF PCI address is different. And since current udev/systemd model >> >> > is to base network device name off of PCI address, the device will >> >> > change >> >> > name when guest is migrated. >> >> > >> >> The scenario of having VF on a different PCI address on post migration >> >> is essentially equal to plugging in a new NIC. Why it has to pair with >> >> the original PV? A sepearte PV device should be in place to pair the >> >> new VF. >> > >> > The host only guarantees that the PV device will be on the same network. >> > It does not make any PCI guarantees. The way Windows works is to find >> > the device based on "serial number" which is an Hyper-V specific attribute >> > of PCI devices. >> > >> > I considered naming off of serial number but that won't work for the >> > case where PV device is present first and VF arrives later. The serial >> > number is attribute of VF, not the PV which is there first. >> >> I assume the PV can get that information ahead of time before VF >> arrives? Without it how do you match the device when you see a VF >> coming with some serial number? Is it possible for PV to get the >> matching SN even earlier during probe time? Or it has to depend on the >> presence of vPCI bridge to
Re: [PATCH net] failover: eliminate callback hell
On Fri, Jun 8, 2018 at 6:29 PM, Jakub Kicinski wrote: > On Fri, 8 Jun 2018 16:44:12 -0700, Siwei Liu wrote: >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of >> >> that flag, as well as the 1-netdev model, is to have a means to >> >> inherit the interface name from the VF, and to eliminate playing hacks >> >> around renaming devices, customizing udev rules and et al. Why >> >> inheriting VF's name important? To allow existing config/setup around >> >> VF continues to work across kernel feature upgrade. Most of network >> >> config files in all distros are based on interface names. Few are MAC >> >> address based but making lower slaves hidden would cover the rest. And >> >> most importantly, preserving the same level of user experience as >> >> using raw VF interface once getting all ndo_ops and ethtool_ops >> >> exposed. This is essential to realize transparent live migration that >> >> users dont have to learn and be aware of the undertaken. >> > >> > Inheriting the VF name will fail in the migration scenario. >> > It is perfectly reasonable to migrate a guest to another machine where >> > the VF PCI address is different. And since current udev/systemd model >> > is to base network device name off of PCI address, the device will change >> > name when guest is migrated. >> > >> The scenario of having VF on a different PCI address on post migration >> is essentially equal to plugging in a new NIC. Why it has to pair with >> the original PV? A sepearte PV device should be in place to pair the >> new VF. > > IMHO it may be a better idea to look at the VF as acceleration for the > PV rather than PV a migration vehicle from the VF. Hence we should I'm basically talking about two use cases not about solutions or implementations specifically. As said, the one I'm looking into needs to migrate a pre-failover VF setup to 1-netdev failover model in a transparent manner. There's no point to switch PCI address back and forth in the backend to set where to bind the PV or the VF, as you have no ways to predict what guest kernel will be running until its fully loaded. Supporting a VF on new location binding to existing PV might be nice, but not directly relevant to those who don't need this side feature than migration itself. Having said that, while I somewhat agree both use cases should have its own place in the picture, I don't think judging one better than the other or vice versa is logical IMHO. > continue to follow the naming of PV, like the current implementation > does implicitly by linking to PV's struct device. The current implementation may only work with new userspace, even so the eth0/eth0nsby naming is not consistenly persisted due to races in bus probing. The naming part should be fixed. -Siwei
Re: [PATCH net] failover: eliminate callback hell
On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote: > * Set permanent and current address of net_failover device > to match the primary. > > * Carrier should be marked off before registering device > the net_failover device. Sridhar, do we want to address this? If yes, could you please take a look at addressing these meanwhile, while we keep arguing about making API changes? -- MST
Re: [PATCH net] failover: eliminate callback hell
On Wed, Jun 06, 2018 at 03:21:21PM -0700, Stephen Hemminger wrote: > > > > > > I think you need to get rid of triggering off of the name change. > > > > Worth considering down the road since it's a bit of a hack but it's one > > we won't have trouble supporting, unlike the delayed uplink. > > You can't depend on userspace doing rename. There's only so much we can do to add new functionality to old userspace. You can always just use PV and all will work. -- MST
Re: [PATCH net] failover: eliminate callback hell
On Fri, 8 Jun 2018 17:42:21 -0700 Siwei Liu wrote: > On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger > wrote: > > On Fri, 8 Jun 2018 16:44:12 -0700 > > Siwei Liu wrote: > > > >> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger > >> wrote: > >> > On Fri, 8 Jun 2018 15:25:59 -0700 > >> > Siwei Liu wrote: > >> > > >> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger > >> >> wrote: > >> >> > On Wed, 6 Jun 2018 15:30:27 +0300 > >> >> > "Michael S. Tsirkin" wrote: > >> >> > > >> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > >> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org > >> >> >> > wrote: > >> >> >> > >The net failover should be a simple library, not a virtual > >> >> >> > >object with function callbacks (see callback hell). > >> >> >> > > >> >> >> > Why just a library? It should do a common things. I think it > >> >> >> > should be a > >> >> >> > virtual object. Looks like your patch again splits the common > >> >> >> > functionality into multiple drivers. That is kind of backwards > >> >> >> > attitude. > >> >> >> > I don't get it. We should rather focus on fixing the mess the > >> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > >> >> >> > model. > >> >> >> > >> >> >> So it seems that at least one benefit for netvsc would be better > >> >> >> handling of renames. > >> >> >> > >> >> >> Question is how can this change to 3-netdev happen? Stephen is > >> >> >> concerned about risk of breaking some userspace. > >> >> >> > >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > >> >> >> address, and you said then "why not use existing network namespaces > >> >> >> rather than inventing a new abstraction". So how about it then? Do > >> >> >> you > >> >> >> want to find a way to use namespaces to hide the PV device for netvsc > >> >> >> compatibility? > >> >> >> > >> >> > > >> >> > Netvsc can't work with 3 dev model. MS has worked with enough > >> >> > distro's and > >> >> > startups that all demand eth0 always be present. And VF may come and > >> >> > go. > >> >> > After this history, there is a strong motivation not to change how > >> >> > kernel > >> >> > behaves. Switching to 3 device model would be perceived as breaking > >> >> > existing userspace. > >> >> > > >> >> > With virtio you can work it out with the distro's yourself. > >> >> > There is no pre-existing semantics to deal with. > >> >> > > >> >> > For the virtio, I don't see the need for IFF_HIDDEN. > >> >> > >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of > >> >> that flag, as well as the 1-netdev model, is to have a means to > >> >> inherit the interface name from the VF, and to eliminate playing hacks > >> >> around renaming devices, customizing udev rules and et al. Why > >> >> inheriting VF's name important? To allow existing config/setup around > >> >> VF continues to work across kernel feature upgrade. Most of network > >> >> config files in all distros are based on interface names. Few are MAC > >> >> address based but making lower slaves hidden would cover the rest. And > >> >> most importantly, preserving the same level of user experience as > >> >> using raw VF interface once getting all ndo_ops and ethtool_ops > >> >> exposed. This is essential to realize transparent live migration that > >> >> users dont have to learn and be aware of the undertaken. > >> > > >> > Inheriting the VF name will fail in the migration scenario. > >> > It is perfectly reasonable to migrate a guest to another machine where > >> > the VF PCI address is different. And since current udev/systemd model > >> > is to base network device name off of PCI address, the device will change > >> > name when guest is migrated. > >> > > >> The scenario of having VF on a different PCI address on post migration > >> is essentially equal to plugging in a new NIC. Why it has to pair with > >> the original PV? A sepearte PV device should be in place to pair the > >> new VF. > > > > The host only guarantees that the PV device will be on the same network. > > It does not make any PCI guarantees. The way Windows works is to find > > the device based on "serial number" which is an Hyper-V specific attribute > > of PCI devices. > > > > I considered naming off of serial number but that won't work for the > > case where PV device is present first and VF arrives later. The serial > > number is attribute of VF, not the PV which is there first. > > I assume the PV can get that information ahead of time before VF > arrives? Without it how do you match the device when you see a VF > coming with some serial number? Is it possible for PV to get the > matching SN even earlier during probe time? Or it has to depend on the > presence of vPCI bridge to generate this SN? NO. the PV device does not know ahead of time and there are scenario where the serial and PCI info can change when it does arrive. These are test
Re: [PATCH net] failover: eliminate callback hell
On Fri, 8 Jun 2018 15:54:38 -0700 Siwei Liu wrote: > On Wed, Jun 6, 2018 at 2:54 PM, Samudrala, Sridhar > wrote: > > > > > > On 6/6/2018 2:24 PM, Stephen Hemminger wrote: > >> > >> On Wed, 6 Jun 2018 15:30:27 +0300 > >> "Michael S. Tsirkin" wrote: > >> > >>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: > > > > The net failover should be a simple library, not a virtual > > object with function callbacks (see callback hell). > > Why just a library? It should do a common things. I think it should be a > virtual object. Looks like your patch again splits the common > functionality into multiple drivers. That is kind of backwards attitude. > I don't get it. We should rather focus on fixing the mess the > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > model. > >>> > >>> So it seems that at least one benefit for netvsc would be better > >>> handling of renames. > >>> > >>> Question is how can this change to 3-netdev happen? Stephen is > >>> concerned about risk of breaking some userspace. > >>> > >>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > >>> address, and you said then "why not use existing network namespaces > >>> rather than inventing a new abstraction". So how about it then? Do you > >>> want to find a way to use namespaces to hide the PV device for netvsc > >>> compatibility? > >>> > >> Netvsc can't work with 3 dev model. MS has worked with enough distro's and > >> startups that all demand eth0 always be present. And VF may come and go. > >> After this history, there is a strong motivation not to change how kernel > >> behaves. Switching to 3 device model would be perceived as breaking > >> existing userspace. > > > > > > I think it should be possible for netvsc to work with 3 dev model if the > > only > > requirement is that eth0 will always be present. With net_failover, you will > > see eth0 and eth0nsby OR with older distros eth0 and eth1. It may be an > > issue > > if somehow there is userspace requirement that there can be only 2 netdevs, > > not 3 > > when VF is plugged. > > > > eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc > > device > > and the IP address gets configured on eth0. Will this be an issue? > > > Did you realize that the eth0 name in the current 3-netdev code can't > be consistently persisted across reboot, if you have more than one VFs > to pair with? On one boot it got eth0/eth0nsby, on the next it may get > eth1/eth1nsby on the same interface. > > It won't be useable by default until you add some custom udev rules. > I think there is no reason to break things by moving netvsc to 3 device model. The first device probed is always the same on Hyper-V/Azure, and always comes up as eth0. The order comes from the fact that they are reported to guest in that order and currently vmbus probe is single threaded.
Re: [PATCH net] failover: eliminate callback hell
On Fri, Jun 08, 2018 at 05:02:35PM -0700, Stephen Hemminger wrote: > On Fri, 8 Jun 2018 16:44:12 -0700 > Siwei Liu wrote: > > > On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger > > wrote: > > > On Fri, 8 Jun 2018 15:25:59 -0700 > > > Siwei Liu wrote: > > > > > >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger > > >> wrote: > > >> > On Wed, 6 Jun 2018 15:30:27 +0300 > > >> > "Michael S. Tsirkin" wrote: > > >> > > > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > > >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org > > >> >> > wrote: > > >> >> > >The net failover should be a simple library, not a virtual > > >> >> > >object with function callbacks (see callback hell). > > >> >> > > > >> >> > Why just a library? It should do a common things. I think it should > > >> >> > be a > > >> >> > virtual object. Looks like your patch again splits the common > > >> >> > functionality into multiple drivers. That is kind of backwards > > >> >> > attitude. > > >> >> > I don't get it. We should rather focus on fixing the mess the > > >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > > >> >> > model. > > >> >> > > >> >> So it seems that at least one benefit for netvsc would be better > > >> >> handling of renames. > > >> >> > > >> >> Question is how can this change to 3-netdev happen? Stephen is > > >> >> concerned about risk of breaking some userspace. > > >> >> > > >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > > >> >> address, and you said then "why not use existing network namespaces > > >> >> rather than inventing a new abstraction". So how about it then? Do you > > >> >> want to find a way to use namespaces to hide the PV device for netvsc > > >> >> compatibility? > > >> >> > > >> > > > >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's > > >> > and > > >> > startups that all demand eth0 always be present. And VF may come and > > >> > go. > > >> > After this history, there is a strong motivation not to change how > > >> > kernel > > >> > behaves. Switching to 3 device model would be perceived as breaking > > >> > existing userspace. > > >> > > > >> > With virtio you can work it out with the distro's yourself. > > >> > There is no pre-existing semantics to deal with. > > >> > > > >> > For the virtio, I don't see the need for IFF_HIDDEN. > > >> > > >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of > > >> that flag, as well as the 1-netdev model, is to have a means to > > >> inherit the interface name from the VF, and to eliminate playing hacks > > >> around renaming devices, customizing udev rules and et al. Why > > >> inheriting VF's name important? To allow existing config/setup around > > >> VF continues to work across kernel feature upgrade. Most of network > > >> config files in all distros are based on interface names. Few are MAC > > >> address based but making lower slaves hidden would cover the rest. And > > >> most importantly, preserving the same level of user experience as > > >> using raw VF interface once getting all ndo_ops and ethtool_ops > > >> exposed. This is essential to realize transparent live migration that > > >> users dont have to learn and be aware of the undertaken. > > > > > > Inheriting the VF name will fail in the migration scenario. > > > It is perfectly reasonable to migrate a guest to another machine where > > > the VF PCI address is different. And since current udev/systemd model > > > is to base network device name off of PCI address, the device will change > > > name when guest is migrated. > > > > > The scenario of having VF on a different PCI address on post migration > > is essentially equal to plugging in a new NIC. Why it has to pair with > > the original PV? A sepearte PV device should be in place to pair the > > new VF. > > The host only guarantees that the PV device will be on the same network. > It does not make any PCI guarantees. The way Windows works is to find > the device based on "serial number" which is an Hyper-V specific attribute > of PCI devices. > > I considered naming off of serial number but that won't work for the > case where PV device is present first and VF arrives later. The serial > number is attribute of VF, not the PV which is there first. > > Your ideas about having the PCI information of the VF form the name > of the failover device have the same problem. The PV device may > be the only one present on boot. We plan to add the serial number to the PV. > > > > On Azure, the VF maybe removed (by host) at any time and then later > > > reattached. There is no guarantee that VF will show back up at > > > the same synthetic PCI address. It will likely have a different > > > PCI domain value. > > > > This is something QEMU can do and make sure the PCI address is > > consistent after migration. > > > > -Siwei
Re: [PATCH net] failover: eliminate callback hell
On Fri, 8 Jun 2018 16:44:12 -0700, Siwei Liu wrote: > >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of > >> that flag, as well as the 1-netdev model, is to have a means to > >> inherit the interface name from the VF, and to eliminate playing hacks > >> around renaming devices, customizing udev rules and et al. Why > >> inheriting VF's name important? To allow existing config/setup around > >> VF continues to work across kernel feature upgrade. Most of network > >> config files in all distros are based on interface names. Few are MAC > >> address based but making lower slaves hidden would cover the rest. And > >> most importantly, preserving the same level of user experience as > >> using raw VF interface once getting all ndo_ops and ethtool_ops > >> exposed. This is essential to realize transparent live migration that > >> users dont have to learn and be aware of the undertaken. > > > > Inheriting the VF name will fail in the migration scenario. > > It is perfectly reasonable to migrate a guest to another machine where > > the VF PCI address is different. And since current udev/systemd model > > is to base network device name off of PCI address, the device will change > > name when guest is migrated. > > > The scenario of having VF on a different PCI address on post migration > is essentially equal to plugging in a new NIC. Why it has to pair with > the original PV? A sepearte PV device should be in place to pair the > new VF. IMHO it may be a better idea to look at the VF as acceleration for the PV rather than PV a migration vehicle from the VF. Hence we should continue to follow the naming of PV, like the current implementation does implicitly by linking to PV's struct device.
Re: [PATCH net] failover: eliminate callback hell
On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger wrote: > On Fri, 8 Jun 2018 16:44:12 -0700 > Siwei Liu wrote: > >> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger >> wrote: >> > On Fri, 8 Jun 2018 15:25:59 -0700 >> > Siwei Liu wrote: >> > >> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger >> >> wrote: >> >> > On Wed, 6 Jun 2018 15:30:27 +0300 >> >> > "Michael S. Tsirkin" wrote: >> >> > >> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: >> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org >> >> >> > wrote: >> >> >> > >The net failover should be a simple library, not a virtual >> >> >> > >object with function callbacks (see callback hell). >> >> >> > >> >> >> > Why just a library? It should do a common things. I think it should >> >> >> > be a >> >> >> > virtual object. Looks like your patch again splits the common >> >> >> > functionality into multiple drivers. That is kind of backwards >> >> >> > attitude. >> >> >> > I don't get it. We should rather focus on fixing the mess the >> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev >> >> >> > model. >> >> >> >> >> >> So it seems that at least one benefit for netvsc would be better >> >> >> handling of renames. >> >> >> >> >> >> Question is how can this change to 3-netdev happen? Stephen is >> >> >> concerned about risk of breaking some userspace. >> >> >> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to >> >> >> address, and you said then "why not use existing network namespaces >> >> >> rather than inventing a new abstraction". So how about it then? Do you >> >> >> want to find a way to use namespaces to hide the PV device for netvsc >> >> >> compatibility? >> >> >> >> >> > >> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's >> >> > and >> >> > startups that all demand eth0 always be present. And VF may come and go. >> >> > After this history, there is a strong motivation not to change how >> >> > kernel >> >> > behaves. Switching to 3 device model would be perceived as breaking >> >> > existing userspace. >> >> > >> >> > With virtio you can work it out with the distro's yourself. >> >> > There is no pre-existing semantics to deal with. >> >> > >> >> > For the virtio, I don't see the need for IFF_HIDDEN. >> >> >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of >> >> that flag, as well as the 1-netdev model, is to have a means to >> >> inherit the interface name from the VF, and to eliminate playing hacks >> >> around renaming devices, customizing udev rules and et al. Why >> >> inheriting VF's name important? To allow existing config/setup around >> >> VF continues to work across kernel feature upgrade. Most of network >> >> config files in all distros are based on interface names. Few are MAC >> >> address based but making lower slaves hidden would cover the rest. And >> >> most importantly, preserving the same level of user experience as >> >> using raw VF interface once getting all ndo_ops and ethtool_ops >> >> exposed. This is essential to realize transparent live migration that >> >> users dont have to learn and be aware of the undertaken. >> > >> > Inheriting the VF name will fail in the migration scenario. >> > It is perfectly reasonable to migrate a guest to another machine where >> > the VF PCI address is different. And since current udev/systemd model >> > is to base network device name off of PCI address, the device will change >> > name when guest is migrated. >> > >> The scenario of having VF on a different PCI address on post migration >> is essentially equal to plugging in a new NIC. Why it has to pair with >> the original PV? A sepearte PV device should be in place to pair the >> new VF. > > The host only guarantees that the PV device will be on the same network. > It does not make any PCI guarantees. The way Windows works is to find > the device based on "serial number" which is an Hyper-V specific attribute > of PCI devices. > > I considered naming off of serial number but that won't work for the > case where PV device is present first and VF arrives later. The serial > number is attribute of VF, not the PV which is there first. I assume the PV can get that information ahead of time before VF arrives? Without it how do you match the device when you see a VF coming with some serial number? Is it possible for PV to get the matching SN even earlier during probe time? Or it has to depend on the presence of vPCI bridge to generate this SN? > > Your ideas about having the PCI information of the VF form the name > of the failover device have the same problem. The PV device may > be the only one present on boot. Yeah, this is a chicken-egg problem indeed, and that was the reason why I supply the BDF info for PV to name the master interface. However, the ACPI PCI slot needs to depend on the PCI bus enumeration so that can't be predictable. Would it make sense to only rename
Re: [PATCH net] failover: eliminate callback hell
On Fri, 8 Jun 2018 16:44:12 -0700 Siwei Liu wrote: > On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger > wrote: > > On Fri, 8 Jun 2018 15:25:59 -0700 > > Siwei Liu wrote: > > > >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger > >> wrote: > >> > On Wed, 6 Jun 2018 15:30:27 +0300 > >> > "Michael S. Tsirkin" wrote: > >> > > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org > >> >> > wrote: > >> >> > >The net failover should be a simple library, not a virtual > >> >> > >object with function callbacks (see callback hell). > >> >> > > >> >> > Why just a library? It should do a common things. I think it should > >> >> > be a > >> >> > virtual object. Looks like your patch again splits the common > >> >> > functionality into multiple drivers. That is kind of backwards > >> >> > attitude. > >> >> > I don't get it. We should rather focus on fixing the mess the > >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > >> >> > model. > >> >> > >> >> So it seems that at least one benefit for netvsc would be better > >> >> handling of renames. > >> >> > >> >> Question is how can this change to 3-netdev happen? Stephen is > >> >> concerned about risk of breaking some userspace. > >> >> > >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > >> >> address, and you said then "why not use existing network namespaces > >> >> rather than inventing a new abstraction". So how about it then? Do you > >> >> want to find a way to use namespaces to hide the PV device for netvsc > >> >> compatibility? > >> >> > >> > > >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's > >> > and > >> > startups that all demand eth0 always be present. And VF may come and go. > >> > After this history, there is a strong motivation not to change how kernel > >> > behaves. Switching to 3 device model would be perceived as breaking > >> > existing userspace. > >> > > >> > With virtio you can work it out with the distro's yourself. > >> > There is no pre-existing semantics to deal with. > >> > > >> > For the virtio, I don't see the need for IFF_HIDDEN. > >> > >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of > >> that flag, as well as the 1-netdev model, is to have a means to > >> inherit the interface name from the VF, and to eliminate playing hacks > >> around renaming devices, customizing udev rules and et al. Why > >> inheriting VF's name important? To allow existing config/setup around > >> VF continues to work across kernel feature upgrade. Most of network > >> config files in all distros are based on interface names. Few are MAC > >> address based but making lower slaves hidden would cover the rest. And > >> most importantly, preserving the same level of user experience as > >> using raw VF interface once getting all ndo_ops and ethtool_ops > >> exposed. This is essential to realize transparent live migration that > >> users dont have to learn and be aware of the undertaken. > > > > Inheriting the VF name will fail in the migration scenario. > > It is perfectly reasonable to migrate a guest to another machine where > > the VF PCI address is different. And since current udev/systemd model > > is to base network device name off of PCI address, the device will change > > name when guest is migrated. > > > The scenario of having VF on a different PCI address on post migration > is essentially equal to plugging in a new NIC. Why it has to pair with > the original PV? A sepearte PV device should be in place to pair the > new VF. The host only guarantees that the PV device will be on the same network. It does not make any PCI guarantees. The way Windows works is to find the device based on "serial number" which is an Hyper-V specific attribute of PCI devices. I considered naming off of serial number but that won't work for the case where PV device is present first and VF arrives later. The serial number is attribute of VF, not the PV which is there first. Your ideas about having the PCI information of the VF form the name of the failover device have the same problem. The PV device may be the only one present on boot. > > On Azure, the VF maybe removed (by host) at any time and then later > > reattached. There is no guarantee that VF will show back up at > > the same synthetic PCI address. It will likely have a different > > PCI domain value. > > This is something QEMU can do and make sure the PCI address is > consistent after migration. > > -Siwei
Re: [PATCH net] failover: eliminate callback hell
On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger wrote: > On Fri, 8 Jun 2018 15:25:59 -0700 > Siwei Liu wrote: > >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger >> wrote: >> > On Wed, 6 Jun 2018 15:30:27 +0300 >> > "Michael S. Tsirkin" wrote: >> > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: >> >> > >The net failover should be a simple library, not a virtual >> >> > >object with function callbacks (see callback hell). >> >> > >> >> > Why just a library? It should do a common things. I think it should be a >> >> > virtual object. Looks like your patch again splits the common >> >> > functionality into multiple drivers. That is kind of backwards attitude. >> >> > I don't get it. We should rather focus on fixing the mess the >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev >> >> > model. >> >> >> >> So it seems that at least one benefit for netvsc would be better >> >> handling of renames. >> >> >> >> Question is how can this change to 3-netdev happen? Stephen is >> >> concerned about risk of breaking some userspace. >> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to >> >> address, and you said then "why not use existing network namespaces >> >> rather than inventing a new abstraction". So how about it then? Do you >> >> want to find a way to use namespaces to hide the PV device for netvsc >> >> compatibility? >> >> >> > >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and >> > startups that all demand eth0 always be present. And VF may come and go. >> > After this history, there is a strong motivation not to change how kernel >> > behaves. Switching to 3 device model would be perceived as breaking >> > existing userspace. >> > >> > With virtio you can work it out with the distro's yourself. >> > There is no pre-existing semantics to deal with. >> > >> > For the virtio, I don't see the need for IFF_HIDDEN. >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of >> that flag, as well as the 1-netdev model, is to have a means to >> inherit the interface name from the VF, and to eliminate playing hacks >> around renaming devices, customizing udev rules and et al. Why >> inheriting VF's name important? To allow existing config/setup around >> VF continues to work across kernel feature upgrade. Most of network >> config files in all distros are based on interface names. Few are MAC >> address based but making lower slaves hidden would cover the rest. And >> most importantly, preserving the same level of user experience as >> using raw VF interface once getting all ndo_ops and ethtool_ops >> exposed. This is essential to realize transparent live migration that >> users dont have to learn and be aware of the undertaken. > > Inheriting the VF name will fail in the migration scenario. > It is perfectly reasonable to migrate a guest to another machine where > the VF PCI address is different. And since current udev/systemd model > is to base network device name off of PCI address, the device will change > name when guest is migrated. > The scenario of having VF on a different PCI address on post migration is essentially equal to plugging in a new NIC. Why it has to pair with the original PV? A sepearte PV device should be in place to pair the new VF. > On Azure, the VF maybe removed (by host) at any time and then later > reattached. There is no guarantee that VF will show back up at > the same synthetic PCI address. It will likely have a different > PCI domain value. This is something QEMU can do and make sure the PCI address is consistent after migration. -Siwei
Re: [PATCH net] failover: eliminate callback hell
On Fri, 8 Jun 2018 15:25:59 -0700 Siwei Liu wrote: > On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger > wrote: > > On Wed, 6 Jun 2018 15:30:27 +0300 > > "Michael S. Tsirkin" wrote: > > > >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: > >> > >The net failover should be a simple library, not a virtual > >> > >object with function callbacks (see callback hell). > >> > > >> > Why just a library? It should do a common things. I think it should be a > >> > virtual object. Looks like your patch again splits the common > >> > functionality into multiple drivers. That is kind of backwards attitude. > >> > I don't get it. We should rather focus on fixing the mess the > >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > >> > model. > >> > >> So it seems that at least one benefit for netvsc would be better > >> handling of renames. > >> > >> Question is how can this change to 3-netdev happen? Stephen is > >> concerned about risk of breaking some userspace. > >> > >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > >> address, and you said then "why not use existing network namespaces > >> rather than inventing a new abstraction". So how about it then? Do you > >> want to find a way to use namespaces to hide the PV device for netvsc > >> compatibility? > >> > > > > Netvsc can't work with 3 dev model. MS has worked with enough distro's and > > startups that all demand eth0 always be present. And VF may come and go. > > After this history, there is a strong motivation not to change how kernel > > behaves. Switching to 3 device model would be perceived as breaking > > existing userspace. > > > > With virtio you can work it out with the distro's yourself. > > There is no pre-existing semantics to deal with. > > > > For the virtio, I don't see the need for IFF_HIDDEN. > > I have a somewhat different view regarding IFF_HIDDEN. The purpose of > that flag, as well as the 1-netdev model, is to have a means to > inherit the interface name from the VF, and to eliminate playing hacks > around renaming devices, customizing udev rules and et al. Why > inheriting VF's name important? To allow existing config/setup around > VF continues to work across kernel feature upgrade. Most of network > config files in all distros are based on interface names. Few are MAC > address based but making lower slaves hidden would cover the rest. And > most importantly, preserving the same level of user experience as > using raw VF interface once getting all ndo_ops and ethtool_ops > exposed. This is essential to realize transparent live migration that > users dont have to learn and be aware of the undertaken. Inheriting the VF name will fail in the migration scenario. It is perfectly reasonable to migrate a guest to another machine where the VF PCI address is different. And since current udev/systemd model is to base network device name off of PCI address, the device will change name when guest is migrated. On Azure, the VF maybe removed (by host) at any time and then later reattached. There is no guarantee that VF will show back up at the same synthetic PCI address. It will likely have a different PCI domain value.
Re: [PATCH net] failover: eliminate callback hell
On Wed, Jun 6, 2018 at 2:54 PM, Samudrala, Sridhar wrote: > > > On 6/6/2018 2:24 PM, Stephen Hemminger wrote: >> >> On Wed, 6 Jun 2018 15:30:27 +0300 >> "Michael S. Tsirkin" wrote: >> >>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: > > The net failover should be a simple library, not a virtual > object with function callbacks (see callback hell). Why just a library? It should do a common things. I think it should be a virtual object. Looks like your patch again splits the common functionality into multiple drivers. That is kind of backwards attitude. I don't get it. We should rather focus on fixing the mess the introduction of netvsc-bonding caused and switch netvsc to 3-netdev model. >>> >>> So it seems that at least one benefit for netvsc would be better >>> handling of renames. >>> >>> Question is how can this change to 3-netdev happen? Stephen is >>> concerned about risk of breaking some userspace. >>> >>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to >>> address, and you said then "why not use existing network namespaces >>> rather than inventing a new abstraction". So how about it then? Do you >>> want to find a way to use namespaces to hide the PV device for netvsc >>> compatibility? >>> >> Netvsc can't work with 3 dev model. MS has worked with enough distro's and >> startups that all demand eth0 always be present. And VF may come and go. >> After this history, there is a strong motivation not to change how kernel >> behaves. Switching to 3 device model would be perceived as breaking >> existing userspace. > > > I think it should be possible for netvsc to work with 3 dev model if the > only > requirement is that eth0 will always be present. With net_failover, you will > see eth0 and eth0nsby OR with older distros eth0 and eth1. It may be an > issue > if somehow there is userspace requirement that there can be only 2 netdevs, > not 3 > when VF is plugged. > > eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc > device > and the IP address gets configured on eth0. Will this be an issue? > Did you realize that the eth0 name in the current 3-netdev code can't be consistently persisted across reboot, if you have more than one VFs to pair with? On one boot it got eth0/eth0nsby, on the next it may get eth1/eth1nsby on the same interface. It won't be useable by default until you add some custom udev rules. -Siwei > > >> >> With virtio you can work it out with the distro's yourself. >> There is no pre-existing semantics to deal with. >> >> For the virtio, I don't see the need for IFF_HIDDEN. >> With 3-dev model as long as you mark the PV and VF devices >> as slaves, then userspace knows to leave them alone. Assuming userspace >> is already able to deal with team and bond devices. >> Any time you introduce new UAPI behavior something breaks. >> >> On the rename front, I really don't care if VF can be renamed. And for >> netvsc want to allow the PV device to be renamed. Udev developers want >> that >> but have not found a stable/persistent value to expose to userspace >> to allow it. > >
Re: [PATCH net] failover: eliminate callback hell
On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger wrote: > On Wed, 6 Jun 2018 15:30:27 +0300 > "Michael S. Tsirkin" wrote: > >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: >> > >The net failover should be a simple library, not a virtual >> > >object with function callbacks (see callback hell). >> > >> > Why just a library? It should do a common things. I think it should be a >> > virtual object. Looks like your patch again splits the common >> > functionality into multiple drivers. That is kind of backwards attitude. >> > I don't get it. We should rather focus on fixing the mess the >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev >> > model. >> >> So it seems that at least one benefit for netvsc would be better >> handling of renames. >> >> Question is how can this change to 3-netdev happen? Stephen is >> concerned about risk of breaking some userspace. >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to >> address, and you said then "why not use existing network namespaces >> rather than inventing a new abstraction". So how about it then? Do you >> want to find a way to use namespaces to hide the PV device for netvsc >> compatibility? >> > > Netvsc can't work with 3 dev model. MS has worked with enough distro's and > startups that all demand eth0 always be present. And VF may come and go. > After this history, there is a strong motivation not to change how kernel > behaves. Switching to 3 device model would be perceived as breaking > existing userspace. > > With virtio you can work it out with the distro's yourself. > There is no pre-existing semantics to deal with. > > For the virtio, I don't see the need for IFF_HIDDEN. I have a somewhat different view regarding IFF_HIDDEN. The purpose of that flag, as well as the 1-netdev model, is to have a means to inherit the interface name from the VF, and to eliminate playing hacks around renaming devices, customizing udev rules and et al. Why inheriting VF's name important? To allow existing config/setup around VF continues to work across kernel feature upgrade. Most of network config files in all distros are based on interface names. Few are MAC address based but making lower slaves hidden would cover the rest. And most importantly, preserving the same level of user experience as using raw VF interface once getting all ndo_ops and ethtool_ops exposed. This is essential to realize transparent live migration that users dont have to learn and be aware of the undertaken. It's unfair to say all virtio use cases don't need IFF_HIDDEN. A few use cases don't care about getting slaves exposed, the 3-netdev model would work for them. For the rest, the pre-existing semantics to them is the single VF device model they've already dealt with. This is nothing different than having Azure stick to the 2-netdev model because of existing user base IMHO. -Siwei > With 3-dev model as long as you mark the PV and VF devices > as slaves, then userspace knows to leave them alone. Assuming userspace > is already able to deal with team and bond devices. > Any time you introduce new UAPI behavior something breaks. > > On the rename front, I really don't care if VF can be renamed. And for > netvsc want to allow the PV device to be renamed. Udev developers want that > but have not found a stable/persistent value to expose to userspace > to allow it.
Re: [PATCH net] failover: eliminate callback hell
On Fri, Jun 08, 2018 at 11:30:08AM -0700, Stephen Hemminger wrote: > * what about nested KVM on Hyper-V? Would it make sense to > have a way to pass subset of VF queues to guest? No as long as hyper-v doesn't have a vIOMMU. -- MST
Re: [PATCH net] failover: eliminate callback hell
On Thu, 7 Jun 2018 20:22:15 +0300 "Michael S. Tsirkin" wrote: > On Thu, Jun 07, 2018 at 09:17:42AM -0700, Stephen Hemminger wrote: > > On Thu, 7 Jun 2018 18:41:31 +0300 > > "Michael S. Tsirkin" wrote: > > > > > > > Why would DPDK care what we do in the kernel? Isn't it just slapping > > > > > vfio-pci on the netdevs it sees? > > > > > > > > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel > > > > based.,. > > > > The DPDK support uses: > > > > * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to > > > >userspace. This means VF netdev device must exist and be visible. > > > > * Slow path using kernel netvsc device, TAP and BPF to get exception > > > >path packets to userspace. > > > > * A autodiscovery mechanism that to set all this up that relies on > > > >2 device model and sysfs. > > > > > > Could you describe what does it look for exactly? What will break if > > > instead of MLX5 being a child of the PV, it's a child of the failover > > > device? > > > > So in DPDK there is an internal four device model: > > 1. failsafe is like failover in your model > > 2. TAP is used like netvsc in kernel > > 3. MLX5 is the VF > > 4. vdev_netvsc is a pseudo device whose only reason to exist > >is to glue everything together. > > > > Digging deeper inside... > > > > Vdev_netvsc does: > >* driver is started in a convuluted way off device arguments > >* probe routine for driver runs > > - scans list of kernel interfaces in sysfs > > - matches those using VMBUS > > Could you tell a bit more what does this step entail? Quick code high/low lights. ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, 1, name, kvargs, specified, ); static int vdev_netvsc_foreach_iface(int (*func)(const struct if_nameindex *iface, const struct ether_addr *eth_addr, va_list ap), int is_netvsc, ...) { struct if_nameindex *iface = if_nameindex(); for (i = 0; iface[i].if_name; ++i) { is_netvsc_ret = vdev_netvsc_iface_is_netvsc([i]) ? 1 : 0; if (is_netvsc ^ is_netvsc_ret) continue; strlcpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name)); if (ioctl(s, SIOCGIFHWADDR, ) == -1) { } memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data, RTE_DIM(eth_addr.addr_bytes)); ret = func([i], _addr, ap); << func is vdev_netvsc_netvsc_probe static int vdev_netvsc_netvsc_probe(const struct if_nameindex *iface, const struct ether_addr *eth_addr, va_list ap) { /* Routed NetVSC should not be probed. */ if (vdev_netvsc_has_route(iface, AF_INET) || vdev_netvsc_has_route(iface, AF_INET6)) { if (!specified) return 0; DRV_LOG(WARNING, "probably using routed NetVSC interface \"%s\"" " (index %u)", iface->if_name, iface->if_index); } /* Create interface context. */ ctx = calloc(1, sizeof(*ctx)); ... > > > - skip netvsc devices that have an IPV4 route > >* scan for PCI devices that have same MAC address as kernel netvsc > > devices discovered in previous step > >* add these interfaces to arguments to failsafe > > > > Then failsafe configures based on arguments on device > > > > The code works but is specific to the Azure hardware model, and exposes lots > > of things to application that it should not have to care about. > > > > If you try and walk through this code in DPDK, you will see why I have > > developed > > a dislike for high levels of indirection. > > > > > > > > Thanks that was helpful! I'll try to poke at it next week. Just from > the description it seems the kernel is merely used to locate the MAC > address through sysfs and that for this DPDK code to keep working the > hidden device must be hidden from it in sysfs - is that a fair summary? What is the point of the 3 device model? What value does it have to userspace? How would userspace use each of the three devices. Going back to 3 device model really doesn't make sense to me if there is not visible benefit. Some other considerations: * there is ongoing development to support RDMA failover as well in netvsc. * there is a new driver which implements the VMBUS protocol in userspace for DPDK. This gets rid of several layers and removes any special scanning code. The vmbus device is unbound from netvsc and bound to UIO device. Then the user space DPDK driver manages all the host signalling events including VF discovery. It is really 2 device model done all in userspace. The kernel device is still needed when
Re: [PATCH net] failover: eliminate callback hell
On Thu, Jun 07, 2018 at 09:17:42AM -0700, Stephen Hemminger wrote: > On Thu, 7 Jun 2018 18:41:31 +0300 > "Michael S. Tsirkin" wrote: > > > > > Why would DPDK care what we do in the kernel? Isn't it just slapping > > > > vfio-pci on the netdevs it sees? > > > > > > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel > > > based.,. > > > The DPDK support uses: > > > * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to > > >userspace. This means VF netdev device must exist and be visible. > > > * Slow path using kernel netvsc device, TAP and BPF to get exception > > >path packets to userspace. > > > * A autodiscovery mechanism that to set all this up that relies on > > >2 device model and sysfs. > > > > Could you describe what does it look for exactly? What will break if > > instead of MLX5 being a child of the PV, it's a child of the failover > > device? > > So in DPDK there is an internal four device model: > 1. failsafe is like failover in your model > 2. TAP is used like netvsc in kernel > 3. MLX5 is the VF > 4. vdev_netvsc is a pseudo device whose only reason to exist > is to glue everything together. > > Digging deeper inside... > > Vdev_netvsc does: >* driver is started in a convuluted way off device arguments >* probe routine for driver runs > - scans list of kernel interfaces in sysfs > - matches those using VMBUS Could you tell a bit more what does this step entail? > - skip netvsc devices that have an IPV4 route >* scan for PCI devices that have same MAC address as kernel netvsc > devices discovered in previous step >* add these interfaces to arguments to failsafe > > Then failsafe configures based on arguments on device > > The code works but is specific to the Azure hardware model, and exposes lots > of things to application that it should not have to care about. > > If you try and walk through this code in DPDK, you will see why I have > developed > a dislike for high levels of indirection. > > > Thanks that was helpful! I'll try to poke at it next week. Just from the description it seems the kernel is merely used to locate the MAC address through sysfs and that for this DPDK code to keep working the hidden device must be hidden from it in sysfs - is that a fair summary? -- MST
Re: [PATCH net] failover: eliminate callback hell
On Thu, 7 Jun 2018 18:41:31 +0300 "Michael S. Tsirkin" wrote: > > > Why would DPDK care what we do in the kernel? Isn't it just slapping > > > vfio-pci on the netdevs it sees? > > > > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel > > based.,. > > The DPDK support uses: > > * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to > >userspace. This means VF netdev device must exist and be visible. > > * Slow path using kernel netvsc device, TAP and BPF to get exception > >path packets to userspace. > > * A autodiscovery mechanism that to set all this up that relies on > >2 device model and sysfs. > > Could you describe what does it look for exactly? What will break if > instead of MLX5 being a child of the PV, it's a child of the failover > device? So in DPDK there is an internal four device model: 1. failsafe is like failover in your model 2. TAP is used like netvsc in kernel 3. MLX5 is the VF 4. vdev_netvsc is a pseudo device whose only reason to exist is to glue everything together. Digging deeper inside... Vdev_netvsc does: * driver is started in a convuluted way off device arguments * probe routine for driver runs - scans list of kernel interfaces in sysfs - matches those using VMBUS - skip netvsc devices that have an IPV4 route * scan for PCI devices that have same MAC address as kernel netvsc devices discovered in previous step * add these interfaces to arguments to failsafe Then failsafe configures based on arguments on device The code works but is specific to the Azure hardware model, and exposes lots of things to application that it should not have to care about. If you try and walk through this code in DPDK, you will see why I have developed a dislike for high levels of indirection.
Re: [PATCH net] failover: eliminate callback hell
On Thu, Jun 07, 2018 at 07:51:12AM -0700, Stephen Hemminger wrote: > On Thu, 7 Jun 2018 07:17:51 -0700 > Alexander Duyck wrote: > > > On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger > > wrote: > > > On Wed, 6 Jun 2018 14:54:04 -0700 > > > "Samudrala, Sridhar" wrote: > > > > > >> On 6/6/2018 2:24 PM, Stephen Hemminger wrote: > > >> > On Wed, 6 Jun 2018 15:30:27 +0300 > > >> > "Michael S. Tsirkin" wrote: > > >> > > > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > > >> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org > > >> >>> wrote: > > >> The net failover should be a simple library, not a virtual > > >> object with function callbacks (see callback hell). > > >> >>> Why just a library? It should do a common things. I think it should > > >> >>> be a > > >> >>> virtual object. Looks like your patch again splits the common > > >> >>> functionality into multiple drivers. That is kind of backwards > > >> >>> attitude. > > >> >>> I don't get it. We should rather focus on fixing the mess the > > >> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev > > >> >>> model. > > >> >> So it seems that at least one benefit for netvsc would be better > > >> >> handling of renames. > > >> >> > > >> >> Question is how can this change to 3-netdev happen? Stephen is > > >> >> concerned about risk of breaking some userspace. > > >> >> > > >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > > >> >> address, and you said then "why not use existing network namespaces > > >> >> rather than inventing a new abstraction". So how about it then? Do you > > >> >> want to find a way to use namespaces to hide the PV device for netvsc > > >> >> compatibility? > > >> >> > > >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's > > >> > and > > >> > startups that all demand eth0 always be present. And VF may come and > > >> > go. > > >> > After this history, there is a strong motivation not to change how > > >> > kernel > > >> > behaves. Switching to 3 device model would be perceived as breaking > > >> > existing userspace. > > >> > > >> I think it should be possible for netvsc to work with 3 dev model if the > > >> only > > >> requirement is that eth0 will always be present. With net_failover, you > > >> will > > >> see eth0 and eth0nsby OR with older distros eth0 and eth1. It may be an > > >> issue > > >> if somehow there is userspace requirement that there can be only 2 > > >> netdevs, not 3 > > >> when VF is plugged. > > >> > > >> eth0 will be the net_failover device and eth0nsby/eth1 will be the > > >> netvsc device > > >> and the IP address gets configured on eth0. Will this be an issue? > > > > > > DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess > > > but that is the way it is. > > > > Why would DPDK care what we do in the kernel? Isn't it just slapping > > vfio-pci on the netdevs it sees? > > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel > based.,. > The DPDK support uses: > * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to >userspace. This means VF netdev device must exist and be visible. > * Slow path using kernel netvsc device, TAP and BPF to get exception >path packets to userspace. > * A autodiscovery mechanism that to set all this up that relies on >2 device model and sysfs. Could you describe what does it look for exactly? What will break if instead of MLX5 being a child of the PV, it's a child of the failover device? > In this version, there is no VFIO-PCI. And also Hyper-V does not have virtual > IOMMU so VFIO will not work there at all. >
Re: [PATCH net] failover: eliminate callback hell
On Thu, 7 Jun 2018 17:57:50 +0300 "Michael S. Tsirkin" wrote: > On Wed, Jun 06, 2018 at 03:24:08PM -0700, Stephen Hemminger wrote: > > On Thu, 7 Jun 2018 00:47:52 +0300 > > "Michael S. Tsirkin" wrote: > > > > > On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote: > > > > On Wed, 6 Jun 2018 15:30:27 +0300 > > > > "Michael S. Tsirkin" wrote: > > > > > > > > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > > > > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org > > > > > > wrote: > > > > > > >The net failover should be a simple library, not a virtual > > > > > > >object with function callbacks (see callback hell). > > > > > > > > > > > > Why just a library? It should do a common things. I think it should > > > > > > be a > > > > > > virtual object. Looks like your patch again splits the common > > > > > > functionality into multiple drivers. That is kind of backwards > > > > > > attitude. > > > > > > I don't get it. We should rather focus on fixing the mess the > > > > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > > > > > > model. > > > > > > > > > > So it seems that at least one benefit for netvsc would be better > > > > > handling of renames. > > > > > > > > > > Question is how can this change to 3-netdev happen? Stephen is > > > > > concerned about risk of breaking some userspace. > > > > > > > > > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > > > > > address, and you said then "why not use existing network namespaces > > > > > rather than inventing a new abstraction". So how about it then? Do you > > > > > want to find a way to use namespaces to hide the PV device for netvsc > > > > > compatibility? > > > > > > > > > > > > > Netvsc can't work with 3 dev model. MS has worked with enough distro's > > > > and > > > > startups that all demand eth0 always be present. And VF may come and > > > > go. > > > > > > Well failover seems to maintain this invariant with the 3 dev model. > > > > > > > After this history, there is a strong motivation not to change how > > > > kernel > > > > behaves. Switching to 3 device model would be perceived as breaking > > > > existing userspace. > > > > > > I feel I'm misunderstood. I was asking whether a 3-rd device can be > > > hidden so that userspace does not know that you switched to a 3 device > > > model. It will think there are 2 devices and will keep working. > > > > > > If you do that, then there won't be anything that > > > would be perceived as breaking existing userspace, will there? > > > > DPDK now knows about the netvsc 2 device model and drivers in userspace > > depend on it. > > Interesting but I'm not sure how this answers the question. How would > DPDK care that there's a hidden device? If you can point out the > code in question, maybe a way can be found to make changes while > keeping it working. See http://dpdk.org/browse/dpdk/tree/drivers/net/vdev_netvsc/vdev_netvsc.c I am working to eliminate the necessity of this complex model in DPDK. Having a 3 device model inside DPDK has just as many problems as in the kernel.
Re: [PATCH net] failover: eliminate callback hell
On Wed, Jun 06, 2018 at 03:24:08PM -0700, Stephen Hemminger wrote: > On Thu, 7 Jun 2018 00:47:52 +0300 > "Michael S. Tsirkin" wrote: > > > On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote: > > > On Wed, 6 Jun 2018 15:30:27 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > > > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org > > > > > wrote: > > > > > >The net failover should be a simple library, not a virtual > > > > > >object with function callbacks (see callback hell). > > > > > > > > > > Why just a library? It should do a common things. I think it should > > > > > be a > > > > > virtual object. Looks like your patch again splits the common > > > > > functionality into multiple drivers. That is kind of backwards > > > > > attitude. > > > > > I don't get it. We should rather focus on fixing the mess the > > > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > > > > > model. > > > > > > > > So it seems that at least one benefit for netvsc would be better > > > > handling of renames. > > > > > > > > Question is how can this change to 3-netdev happen? Stephen is > > > > concerned about risk of breaking some userspace. > > > > > > > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > > > > address, and you said then "why not use existing network namespaces > > > > rather than inventing a new abstraction". So how about it then? Do you > > > > want to find a way to use namespaces to hide the PV device for netvsc > > > > compatibility? > > > > > > > > > > Netvsc can't work with 3 dev model. MS has worked with enough distro's and > > > startups that all demand eth0 always be present. And VF may come and go. > > > > Well failover seems to maintain this invariant with the 3 dev model. > > > > > After this history, there is a strong motivation not to change how kernel > > > behaves. Switching to 3 device model would be perceived as breaking > > > existing userspace. > > > > I feel I'm misunderstood. I was asking whether a 3-rd device can be > > hidden so that userspace does not know that you switched to a 3 device > > model. It will think there are 2 devices and will keep working. > > > > If you do that, then there won't be anything that > > would be perceived as breaking existing userspace, will there? > > DPDK now knows about the netvsc 2 device model and drivers in userspace > depend on it. Interesting but I'm not sure how this answers the question. How would DPDK care that there's a hidden device? If you can point out the code in question, maybe a way can be found to make changes while keeping it working. > > > > > > > With virtio you can work it out with the distro's yourself. > > > There is no pre-existing semantics to deal with. > > > > > > For the virtio, I don't see the need for IFF_HIDDEN. > > > With 3-dev model as long as you mark the PV and VF devices > > > as slaves, then userspace knows to leave them alone. Assuming userspace > > > is already able to deal with team and bond devices. > > > > That's clear enough. > > > > > Any time you introduce new UAPI behavior something breaks. > > > > Not if we do it right. > > > > > On the rename front, I really don't care if VF can be renamed. > > > > OK that's nice. > > > > > And for > > > netvsc want to allow the PV device to be renamed. > > > > That's because of the 2 device model, right? So that explains why even > > if the delayed hack is good for the goose it might not be good for the > > gander :) > > You are bringing up the VF right away. How does the 3-device initialization > state machine work? Do you give a window for udev to possibly rename the > VF? Do you rely on that? > > > > > > Udev developers want that > > > but have not found a stable/persistent value to expose to userspace > > > to allow it.
Re: [PATCH net] failover: eliminate callback hell
On Thu, 7 Jun 2018 07:17:51 -0700 Alexander Duyck wrote: > On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger > wrote: > > On Wed, 6 Jun 2018 14:54:04 -0700 > > "Samudrala, Sridhar" wrote: > > > >> On 6/6/2018 2:24 PM, Stephen Hemminger wrote: > >> > On Wed, 6 Jun 2018 15:30:27 +0300 > >> > "Michael S. Tsirkin" wrote: > >> > > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > >> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org > >> >>> wrote: > >> The net failover should be a simple library, not a virtual > >> object with function callbacks (see callback hell). > >> >>> Why just a library? It should do a common things. I think it should be > >> >>> a > >> >>> virtual object. Looks like your patch again splits the common > >> >>> functionality into multiple drivers. That is kind of backwards > >> >>> attitude. > >> >>> I don't get it. We should rather focus on fixing the mess the > >> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev > >> >>> model. > >> >> So it seems that at least one benefit for netvsc would be better > >> >> handling of renames. > >> >> > >> >> Question is how can this change to 3-netdev happen? Stephen is > >> >> concerned about risk of breaking some userspace. > >> >> > >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > >> >> address, and you said then "why not use existing network namespaces > >> >> rather than inventing a new abstraction". So how about it then? Do you > >> >> want to find a way to use namespaces to hide the PV device for netvsc > >> >> compatibility? > >> >> > >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's > >> > and > >> > startups that all demand eth0 always be present. And VF may come and go. > >> > After this history, there is a strong motivation not to change how kernel > >> > behaves. Switching to 3 device model would be perceived as breaking > >> > existing userspace. > >> > >> I think it should be possible for netvsc to work with 3 dev model if the > >> only > >> requirement is that eth0 will always be present. With net_failover, you > >> will > >> see eth0 and eth0nsby OR with older distros eth0 and eth1. It may be an > >> issue > >> if somehow there is userspace requirement that there can be only 2 > >> netdevs, not 3 > >> when VF is plugged. > >> > >> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc > >> device > >> and the IP address gets configured on eth0. Will this be an issue? > > > > DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess > > but that is the way it is. > > Why would DPDK care what we do in the kernel? Isn't it just slapping > vfio-pci on the netdevs it sees? Alex, you are correct for Intel devices; but DPDK on Azure is not Intel based.,. The DPDK support uses: * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to userspace. This means VF netdev device must exist and be visible. * Slow path using kernel netvsc device, TAP and BPF to get exception path packets to userspace. * A autodiscovery mechanism that to set all this up that relies on 2 device model and sysfs. In this version, there is no VFIO-PCI. And also Hyper-V does not have virtual IOMMU so VFIO will not work there at all.
Re: [PATCH net] failover: eliminate callback hell
On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger wrote: > On Wed, 6 Jun 2018 14:54:04 -0700 > "Samudrala, Sridhar" wrote: > >> On 6/6/2018 2:24 PM, Stephen Hemminger wrote: >> > On Wed, 6 Jun 2018 15:30:27 +0300 >> > "Michael S. Tsirkin" wrote: >> > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: >> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: >> The net failover should be a simple library, not a virtual >> object with function callbacks (see callback hell). >> >>> Why just a library? It should do a common things. I think it should be a >> >>> virtual object. Looks like your patch again splits the common >> >>> functionality into multiple drivers. That is kind of backwards attitude. >> >>> I don't get it. We should rather focus on fixing the mess the >> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev >> >>> model. >> >> So it seems that at least one benefit for netvsc would be better >> >> handling of renames. >> >> >> >> Question is how can this change to 3-netdev happen? Stephen is >> >> concerned about risk of breaking some userspace. >> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to >> >> address, and you said then "why not use existing network namespaces >> >> rather than inventing a new abstraction". So how about it then? Do you >> >> want to find a way to use namespaces to hide the PV device for netvsc >> >> compatibility? >> >> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and >> > startups that all demand eth0 always be present. And VF may come and go. >> > After this history, there is a strong motivation not to change how kernel >> > behaves. Switching to 3 device model would be perceived as breaking >> > existing userspace. >> >> I think it should be possible for netvsc to work with 3 dev model if the only >> requirement is that eth0 will always be present. With net_failover, you will >> see eth0 and eth0nsby OR with older distros eth0 and eth1. It may be an >> issue >> if somehow there is userspace requirement that there can be only 2 netdevs, >> not 3 >> when VF is plugged. >> >> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc >> device >> and the IP address gets configured on eth0. Will this be an issue? > > DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess > but that is the way it is. Why would DPDK care what we do in the kernel? Isn't it just slapping vfio-pci on the netdevs it sees?
Re: [PATCH net] failover: eliminate callback hell
On Wed, 6 Jun 2018 15:19:30 +0300 "Michael S. Tsirkin" wrote: > On Tue, Jun 05, 2018 at 08:51:18PM -0700, Stephen Hemminger wrote: > > > I think the push back was with the usage of the delay, not bringing up > > > the primary/standby > > > device in the name change event handler. > > > Can't netvsc use this mechanism instead of depending on the delay? > > > > > > > > > > The patch that was rejected for netvsc was about using name change. > > So failover is now doing exactly what you wanted netvsc to do. Rather > than reverting everyone to old behaviour how about using more pieces > from failover? > > > Also, you can't depend on name change; you still need a timer. Not all > > distributions > > change name of devices. > > So failover chose not to implement the delayed open so far. > If it does I suspect we'll have to keep it around forever - > kind of like netvsc seems to be stuck with it. > But let's see if there's enough actual demand from people running > ancient distros with latest kernels to even start looking for > a solution for failover. > > And this kind of behaviour change really should be split out > so we can discuss it separately. > > > Or user has blocked that by udev rules. > > Don't do that then? > If you don't want to allow udev to rename the device, then just pull the name change hook.
Re: [PATCH net] failover: eliminate callback hell
On Wed, 6 Jun 2018 14:54:04 -0700 "Samudrala, Sridhar" wrote: > On 6/6/2018 2:24 PM, Stephen Hemminger wrote: > > On Wed, 6 Jun 2018 15:30:27 +0300 > > "Michael S. Tsirkin" wrote: > > > >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: > The net failover should be a simple library, not a virtual > object with function callbacks (see callback hell). > >>> Why just a library? It should do a common things. I think it should be a > >>> virtual object. Looks like your patch again splits the common > >>> functionality into multiple drivers. That is kind of backwards attitude. > >>> I don't get it. We should rather focus on fixing the mess the > >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev > >>> model. > >> So it seems that at least one benefit for netvsc would be better > >> handling of renames. > >> > >> Question is how can this change to 3-netdev happen? Stephen is > >> concerned about risk of breaking some userspace. > >> > >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > >> address, and you said then "why not use existing network namespaces > >> rather than inventing a new abstraction". So how about it then? Do you > >> want to find a way to use namespaces to hide the PV device for netvsc > >> compatibility? > >> > > Netvsc can't work with 3 dev model. MS has worked with enough distro's and > > startups that all demand eth0 always be present. And VF may come and go. > > After this history, there is a strong motivation not to change how kernel > > behaves. Switching to 3 device model would be perceived as breaking > > existing userspace. > > I think it should be possible for netvsc to work with 3 dev model if the only > requirement is that eth0 will always be present. With net_failover, you will > see eth0 and eth0nsby OR with older distros eth0 and eth1. It may be an issue > if somehow there is userspace requirement that there can be only 2 netdevs, > not 3 > when VF is plugged. > > eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc > device > and the IP address gets configured on eth0. Will this be an issue? DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess but that is the way it is.
Re: [PATCH net] failover: eliminate callback hell
On Thu, 7 Jun 2018 00:47:52 +0300 "Michael S. Tsirkin" wrote: > On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote: > > On Wed, 6 Jun 2018 15:30:27 +0300 > > "Michael S. Tsirkin" wrote: > > > > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: > > > > > > > > >The net failover should be a simple library, not a virtual > > > > >object with function callbacks (see callback hell). > > > > > > > > Why just a library? It should do a common things. I think it should be a > > > > virtual object. Looks like your patch again splits the common > > > > functionality into multiple drivers. That is kind of backwards attitude. > > > > I don't get it. We should rather focus on fixing the mess the > > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > > > > model. > > > > > > So it seems that at least one benefit for netvsc would be better > > > handling of renames. > > > > > > Question is how can this change to 3-netdev happen? Stephen is > > > concerned about risk of breaking some userspace. > > > > > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > > > address, and you said then "why not use existing network namespaces > > > rather than inventing a new abstraction". So how about it then? Do you > > > want to find a way to use namespaces to hide the PV device for netvsc > > > compatibility? > > > > > > > Netvsc can't work with 3 dev model. MS has worked with enough distro's and > > startups that all demand eth0 always be present. And VF may come and go. > > Well failover seems to maintain this invariant with the 3 dev model. > > > After this history, there is a strong motivation not to change how kernel > > behaves. Switching to 3 device model would be perceived as breaking > > existing userspace. > > I feel I'm misunderstood. I was asking whether a 3-rd device can be > hidden so that userspace does not know that you switched to a 3 device > model. It will think there are 2 devices and will keep working. > > If you do that, then there won't be anything that > would be perceived as breaking existing userspace, will there? DPDK now knows about the netvsc 2 device model and drivers in userspace depend on it. > > > > With virtio you can work it out with the distro's yourself. > > There is no pre-existing semantics to deal with. > > > > For the virtio, I don't see the need for IFF_HIDDEN. > > With 3-dev model as long as you mark the PV and VF devices > > as slaves, then userspace knows to leave them alone. Assuming userspace > > is already able to deal with team and bond devices. > > That's clear enough. > > > Any time you introduce new UAPI behavior something breaks. > > Not if we do it right. > > > On the rename front, I really don't care if VF can be renamed. > > OK that's nice. > > > And for > > netvsc want to allow the PV device to be renamed. > > That's because of the 2 device model, right? So that explains why even > if the delayed hack is good for the goose it might not be good for the > gander :) You are bringing up the VF right away. How does the 3-device initialization state machine work? Do you give a window for udev to possibly rename the VF? Do you rely on that? > > > Udev developers want that > > but have not found a stable/persistent value to expose to userspace > > to allow it.
Re: [PATCH net] failover: eliminate callback hell
On Thu, 7 Jun 2018 00:30:21 +0300 "Michael S. Tsirkin" wrote: > On Wed, Jun 06, 2018 at 02:16:20PM -0700, Stephen Hemminger wrote: > > On Tue, 5 Jun 2018 23:11:37 -0700 > > "Samudrala, Sridhar" wrote: > > > > > On 6/5/2018 11:00 PM, Stephen Hemminger wrote: > > > > On Tue, 5 Jun 2018 22:39:12 -0700 > > > > "Samudrala, Sridhar" wrote: > > > > > > > >> On 6/5/2018 8:51 PM, Stephen Hemminger wrote: > > > >>> On Tue, 5 Jun 2018 16:52:22 -0700 > > > >>> "Samudrala, Sridhar" wrote: > > > >>> > > > On 6/5/2018 2:52 PM, Stephen Hemminger wrote: > > > > On Tue, 5 Jun 2018 22:38:43 +0300 > > > > "Michael S. Tsirkin" wrote: > > > > > > > >>> See: > > > >>> https://patchwork.ozlabs.org/patch/851711/ > > > >> Let me try to summarize that: > > > >> > > > >>You wanted to speed up the delayed link up. You had an idea to > > > >>additionally take link up when userspace renames the interface > > > >> (standby > > > >>one which is also the failover for netvsc). > > > >> > > > >>But userspace might not do any renames, in which case there will > > > >>still be the delay, and so this never got applied. > > > >> > > > >>Is this a good summary? > > > >> > > > >> Davem said delay should go away completely as it's not robust, and > > > >> I > > > >> think I agree. So I don't think we should make all failover users > > > >> use > > > >> delay. IIUC failover kept a delay option especially for netvsc to > > > >> minimize the surprise factor. Hopefully we can come up with > > > >> something more robust and drop that option completely. > > > > The timeout was the original solution to how to complete setup after > > > > userspace has had a chance to rename the device. Unfortunately, the > > > > whole network > > > > device initialization (cooperation with udev and userspace) is a a > > > > mess because > > > > there is no well defined specification, and there are multiple ways > > > > userspace > > > > does this in old and new distributions. The timeout has its own > > > > issues > > > > (how long, handling errors during that window, what if userspace > > > > modifies other > > > > device state); and open to finding a better solution. > > > > > > > > My point was that if name change can not be relied on (or used) by > > > > netvsc, > > > > then we can't allow it for net_failover either. > > > I think the push back was with the usage of the delay, not bringing > > > up the primary/standby > > > device in the name change event handler. > > > Can't netvsc use this mechanism instead of depending on the delay? > > > > > > > > > >>> The patch that was rejected for netvsc was about using name change. > > > >>> Also, you can't depend on name change; you still need a timer. Not > > > >>> all distributions > > > >>> change name of devices. Or user has blocked that by udev rules. > > > >> In the net_failover_slave_register() we do a dev_open() and ignore any > > > >> failure due to > > > >> EBUSY and do another dev_open() in the name change event handler. > > > >> If the name is not expected to change, i would think the dev_open() at > > > >> the time of > > > >> register will succeed. > > > > The problem is your first dev_open will bring device up and lockout > > > > udev from changing the network device name. > > > > > > I have tried with/without udev and didn't see any issue with the naming > > > of the primary/standby > > > devices in my testing. > > > > > > With the 3-netdev failover model, we are only interested in setting the > > > right name for the failover > > > netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does > > > it really matter if udev fails > > > to rename the lower primary/standby netdevs, i don't think it will > > > matter? The user is not expected > > > to touch the lower netdevs. > > > > Renaming matters to some users and the udev maintainers. They want the VF > > to be named enpXXX > > The primary in virtio case needs to be ens3 with some cloud platforms. > > Confused. VF can't be a standby, of it's used in a failover it's a > primary, you can't call it both enpXXX amd ens3. Could you describe the > use case in a bit more detail? Sorry, got things backwards. The primary is VF and it should be possible to have it renamed based on PCI info. The standby PV (in 3 dev model) would get renamed by udev to ens3. So the failover device would just be ethX right? > > > > I think you need to get rid of triggering off of the name change. > > Worth considering down the road since it's a bit of a hack but it's one > we won't have trouble supporting, unlike the delayed uplink. You can't depend on userspace doing rename. > > > Long term, let's open the discussion about allowing network
Re: [PATCH net] failover: eliminate callback hell
On 6/6/2018 2:24 PM, Stephen Hemminger wrote: On Wed, 6 Jun 2018 15:30:27 +0300 "Michael S. Tsirkin" wrote: On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: The net failover should be a simple library, not a virtual object with function callbacks (see callback hell). Why just a library? It should do a common things. I think it should be a virtual object. Looks like your patch again splits the common functionality into multiple drivers. That is kind of backwards attitude. I don't get it. We should rather focus on fixing the mess the introduction of netvsc-bonding caused and switch netvsc to 3-netdev model. So it seems that at least one benefit for netvsc would be better handling of renames. Question is how can this change to 3-netdev happen? Stephen is concerned about risk of breaking some userspace. Stephen, this seems to be the usecase that IFF_HIDDEN was trying to address, and you said then "why not use existing network namespaces rather than inventing a new abstraction". So how about it then? Do you want to find a way to use namespaces to hide the PV device for netvsc compatibility? Netvsc can't work with 3 dev model. MS has worked with enough distro's and startups that all demand eth0 always be present. And VF may come and go. After this history, there is a strong motivation not to change how kernel behaves. Switching to 3 device model would be perceived as breaking existing userspace. I think it should be possible for netvsc to work with 3 dev model if the only requirement is that eth0 will always be present. With net_failover, you will see eth0 and eth0nsby OR with older distros eth0 and eth1. It may be an issue if somehow there is userspace requirement that there can be only 2 netdevs, not 3 when VF is plugged. eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc device and the IP address gets configured on eth0. Will this be an issue? With virtio you can work it out with the distro's yourself. There is no pre-existing semantics to deal with. For the virtio, I don't see the need for IFF_HIDDEN. With 3-dev model as long as you mark the PV and VF devices as slaves, then userspace knows to leave them alone. Assuming userspace is already able to deal with team and bond devices. Any time you introduce new UAPI behavior something breaks. On the rename front, I really don't care if VF can be renamed. And for netvsc want to allow the PV device to be renamed. Udev developers want that but have not found a stable/persistent value to expose to userspace to allow it.
Re: [PATCH net] failover: eliminate callback hell
On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote: > On Wed, 6 Jun 2018 15:30:27 +0300 > "Michael S. Tsirkin" wrote: > > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: > > > >The net failover should be a simple library, not a virtual > > > >object with function callbacks (see callback hell). > > > > > > Why just a library? It should do a common things. I think it should be a > > > virtual object. Looks like your patch again splits the common > > > functionality into multiple drivers. That is kind of backwards attitude. > > > I don't get it. We should rather focus on fixing the mess the > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > > > model. > > > > So it seems that at least one benefit for netvsc would be better > > handling of renames. > > > > Question is how can this change to 3-netdev happen? Stephen is > > concerned about risk of breaking some userspace. > > > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > > address, and you said then "why not use existing network namespaces > > rather than inventing a new abstraction". So how about it then? Do you > > want to find a way to use namespaces to hide the PV device for netvsc > > compatibility? > > > > Netvsc can't work with 3 dev model. MS has worked with enough distro's and > startups that all demand eth0 always be present. And VF may come and go. Well failover seems to maintain this invariant with the 3 dev model. > After this history, there is a strong motivation not to change how kernel > behaves. Switching to 3 device model would be perceived as breaking > existing userspace. I feel I'm misunderstood. I was asking whether a 3-rd device can be hidden so that userspace does not know that you switched to a 3 device model. It will think there are 2 devices and will keep working. If you do that, then there won't be anything that would be perceived as breaking existing userspace, will there? > With virtio you can work it out with the distro's yourself. > There is no pre-existing semantics to deal with. > > For the virtio, I don't see the need for IFF_HIDDEN. > With 3-dev model as long as you mark the PV and VF devices > as slaves, then userspace knows to leave them alone. Assuming userspace > is already able to deal with team and bond devices. That's clear enough. > Any time you introduce new UAPI behavior something breaks. Not if we do it right. > On the rename front, I really don't care if VF can be renamed. OK that's nice. > And for > netvsc want to allow the PV device to be renamed. That's because of the 2 device model, right? So that explains why even if the delayed hack is good for the goose it might not be good for the gander :) > Udev developers want that > but have not found a stable/persistent value to expose to userspace > to allow it.
Re: [PATCH net] failover: eliminate callback hell
On Wed, Jun 06, 2018 at 02:16:20PM -0700, Stephen Hemminger wrote: > On Tue, 5 Jun 2018 23:11:37 -0700 > "Samudrala, Sridhar" wrote: > > > On 6/5/2018 11:00 PM, Stephen Hemminger wrote: > > > On Tue, 5 Jun 2018 22:39:12 -0700 > > > "Samudrala, Sridhar" wrote: > > > > > >> On 6/5/2018 8:51 PM, Stephen Hemminger wrote: > > >>> On Tue, 5 Jun 2018 16:52:22 -0700 > > >>> "Samudrala, Sridhar" wrote: > > >>> > > On 6/5/2018 2:52 PM, Stephen Hemminger wrote: > > > On Tue, 5 Jun 2018 22:38:43 +0300 > > > "Michael S. Tsirkin" wrote: > > > > > >>> See: > > >>> https://patchwork.ozlabs.org/patch/851711/ > > >> Let me try to summarize that: > > >> > > >> You wanted to speed up the delayed link up. You had an idea to > > >> additionally take link up when userspace renames the interface > > >> (standby > > >> one which is also the failover for netvsc). > > >> > > >> But userspace might not do any renames, in which case there will > > >> still be the delay, and so this never got applied. > > >> > > >> Is this a good summary? > > >> > > >> Davem said delay should go away completely as it's not robust, and I > > >> think I agree. So I don't think we should make all failover users > > >> use > > >> delay. IIUC failover kept a delay option especially for netvsc to > > >> minimize the surprise factor. Hopefully we can come up with > > >> something more robust and drop that option completely. > > > The timeout was the original solution to how to complete setup after > > > userspace has had a chance to rename the device. Unfortunately, the > > > whole network > > > device initialization (cooperation with udev and userspace) is a a > > > mess because > > > there is no well defined specification, and there are multiple ways > > > userspace > > > does this in old and new distributions. The timeout has its own > > > issues > > > (how long, handling errors during that window, what if userspace > > > modifies other > > > device state); and open to finding a better solution. > > > > > > My point was that if name change can not be relied on (or used) by > > > netvsc, > > > then we can't allow it for net_failover either. > > I think the push back was with the usage of the delay, not bringing up > > the primary/standby > > device in the name change event handler. > > Can't netvsc use this mechanism instead of depending on the delay? > > > > > > >>> The patch that was rejected for netvsc was about using name change. > > >>> Also, you can't depend on name change; you still need a timer. Not all > > >>> distributions > > >>> change name of devices. Or user has blocked that by udev rules. > > >> In the net_failover_slave_register() we do a dev_open() and ignore any > > >> failure due to > > >> EBUSY and do another dev_open() in the name change event handler. > > >> If the name is not expected to change, i would think the dev_open() at > > >> the time of > > >> register will succeed. > > > The problem is your first dev_open will bring device up and lockout > > > udev from changing the network device name. > > > > I have tried with/without udev and didn't see any issue with the naming of > > the primary/standby > > devices in my testing. > > > > With the 3-netdev failover model, we are only interested in setting the > > right name for the failover > > netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does it > > really matter if udev fails > > to rename the lower primary/standby netdevs, i don't think it will matter? > > The user is not expected > > to touch the lower netdevs. > > Renaming matters to some users and the udev maintainers. They want the VF to > be named enpXXX > The primary in virtio case needs to be ens3 with some cloud platforms. Confused. VF can't be a standby, of it's used in a failover it's a primary, you can't call it both enpXXX amd ens3. Could you describe the use case in a bit more detail? > > I think you need to get rid of triggering off of the name change. Worth considering down the road since it's a bit of a hack but it's one we won't have trouble supporting, unlike the delayed uplink. > Long term, let's open the discussion about allowing network devices to change > name when up? > Maybe with NETIF_LIVENAME_CHANGE flag? That's probably the cleanest approach assuming it can be made to work without races. I suspect we can just live with what we have until then. -- MST
Re: [PATCH net] failover: eliminate callback hell
On Wed, 6 Jun 2018 09:25:12 +0200 Jiri Pirko wrote: > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: > >The net failover should be a simple library, not a virtual > >object with function callbacks (see callback hell). > > Why just a library? It should do a common things. I think it should be a > virtual object. Looks like your patch again splits the common > functionality into multiple drivers. That is kind of backwards attitude. > I don't get it. We should rather focus on fixing the mess the > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > model. To me, callbacks are worse than a simple notifier. Is there some other agenda going on? do you want to get rid of network notifier. Netvsc is not switching to 3 device model in near future. Too much userspace is wrapped around it already.
Re: [PATCH net] failover: eliminate callback hell
On Wed, 6 Jun 2018 15:30:27 +0300 "Michael S. Tsirkin" wrote: > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: > > >The net failover should be a simple library, not a virtual > > >object with function callbacks (see callback hell). > > > > Why just a library? It should do a common things. I think it should be a > > virtual object. Looks like your patch again splits the common > > functionality into multiple drivers. That is kind of backwards attitude. > > I don't get it. We should rather focus on fixing the mess the > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > > model. > > So it seems that at least one benefit for netvsc would be better > handling of renames. > > Question is how can this change to 3-netdev happen? Stephen is > concerned about risk of breaking some userspace. > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to > address, and you said then "why not use existing network namespaces > rather than inventing a new abstraction". So how about it then? Do you > want to find a way to use namespaces to hide the PV device for netvsc > compatibility? > Netvsc can't work with 3 dev model. MS has worked with enough distro's and startups that all demand eth0 always be present. And VF may come and go. After this history, there is a strong motivation not to change how kernel behaves. Switching to 3 device model would be perceived as breaking existing userspace. With virtio you can work it out with the distro's yourself. There is no pre-existing semantics to deal with. For the virtio, I don't see the need for IFF_HIDDEN. With 3-dev model as long as you mark the PV and VF devices as slaves, then userspace knows to leave them alone. Assuming userspace is already able to deal with team and bond devices. Any time you introduce new UAPI behavior something breaks. On the rename front, I really don't care if VF can be renamed. And for netvsc want to allow the PV device to be renamed. Udev developers want that but have not found a stable/persistent value to expose to userspace to allow it.
Re: [PATCH net] failover: eliminate callback hell
On Tue, 5 Jun 2018 23:11:37 -0700 "Samudrala, Sridhar" wrote: > On 6/5/2018 11:00 PM, Stephen Hemminger wrote: > > On Tue, 5 Jun 2018 22:39:12 -0700 > > "Samudrala, Sridhar" wrote: > > > >> On 6/5/2018 8:51 PM, Stephen Hemminger wrote: > >>> On Tue, 5 Jun 2018 16:52:22 -0700 > >>> "Samudrala, Sridhar" wrote: > >>> > On 6/5/2018 2:52 PM, Stephen Hemminger wrote: > > On Tue, 5 Jun 2018 22:38:43 +0300 > > "Michael S. Tsirkin" wrote: > > > >>> See: > >>> https://patchwork.ozlabs.org/patch/851711/ > >> Let me try to summarize that: > >> > >>You wanted to speed up the delayed link up. You had an idea to > >>additionally take link up when userspace renames the interface > >> (standby > >>one which is also the failover for netvsc). > >> > >>But userspace might not do any renames, in which case there will > >>still be the delay, and so this never got applied. > >> > >>Is this a good summary? > >> > >> Davem said delay should go away completely as it's not robust, and I > >> think I agree. So I don't think we should make all failover users use > >> delay. IIUC failover kept a delay option especially for netvsc to > >> minimize the surprise factor. Hopefully we can come up with > >> something more robust and drop that option completely. > > The timeout was the original solution to how to complete setup after > > userspace has had a chance to rename the device. Unfortunately, the > > whole network > > device initialization (cooperation with udev and userspace) is a a mess > > because > > there is no well defined specification, and there are multiple ways > > userspace > > does this in old and new distributions. The timeout has its own issues > > (how long, handling errors during that window, what if userspace > > modifies other > > device state); and open to finding a better solution. > > > > My point was that if name change can not be relied on (or used) by > > netvsc, > > then we can't allow it for net_failover either. > I think the push back was with the usage of the delay, not bringing up > the primary/standby > device in the name change event handler. > Can't netvsc use this mechanism instead of depending on the delay? > > > >>> The patch that was rejected for netvsc was about using name change. > >>> Also, you can't depend on name change; you still need a timer. Not all > >>> distributions > >>> change name of devices. Or user has blocked that by udev rules. > >> In the net_failover_slave_register() we do a dev_open() and ignore any > >> failure due to > >> EBUSY and do another dev_open() in the name change event handler. > >> If the name is not expected to change, i would think the dev_open() at the > >> time of > >> register will succeed. > > The problem is your first dev_open will bring device up and lockout > > udev from changing the network device name. > > I have tried with/without udev and didn't see any issue with the naming of > the primary/standby > devices in my testing. > > With the 3-netdev failover model, we are only interested in setting the right > name for the failover > netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does it > really matter if udev fails > to rename the lower primary/standby netdevs, i don't think it will matter? > The user is not expected > to touch the lower netdevs. Renaming matters to some users and the udev maintainers. They want the VF to be named enpXXX The primary in virtio case needs to be ens3 with some cloud platforms. I think you need to get rid of triggering off of the name change. Long term, let's open the discussion about allowing network devices to change name when up? Maybe with NETIF_LIVENAME_CHANGE flag?
Re: [PATCH net] failover: eliminate callback hell
On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote: > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: > >The net failover should be a simple library, not a virtual > >object with function callbacks (see callback hell). > > Why just a library? It should do a common things. I think it should be a > virtual object. Looks like your patch again splits the common > functionality into multiple drivers. That is kind of backwards attitude. > I don't get it. We should rather focus on fixing the mess the > introduction of netvsc-bonding caused and switch netvsc to 3-netdev > model. So it seems that at least one benefit for netvsc would be better handling of renames. Question is how can this change to 3-netdev happen? Stephen is concerned about risk of breaking some userspace. Stephen, this seems to be the usecase that IFF_HIDDEN was trying to address, and you said then "why not use existing network namespaces rather than inventing a new abstraction". So how about it then? Do you want to find a way to use namespaces to hide the PV device for netvsc compatibility? -- MST
Re: [PATCH net] failover: eliminate callback hell
On Tue, Jun 05, 2018 at 08:51:18PM -0700, Stephen Hemminger wrote: > > I think the push back was with the usage of the delay, not bringing up the > > primary/standby > > device in the name change event handler. > > Can't netvsc use this mechanism instead of depending on the delay? > > > > > > The patch that was rejected for netvsc was about using name change. So failover is now doing exactly what you wanted netvsc to do. Rather than reverting everyone to old behaviour how about using more pieces from failover? > Also, you can't depend on name change; you still need a timer. Not all > distributions > change name of devices. So failover chose not to implement the delayed open so far. If it does I suspect we'll have to keep it around forever - kind of like netvsc seems to be stuck with it. But let's see if there's enough actual demand from people running ancient distros with latest kernels to even start looking for a solution for failover. And this kind of behaviour change really should be split out so we can discuss it separately. > Or user has blocked that by udev rules. Don't do that then? -- MST
Re: [PATCH net] failover: eliminate callback hell
Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: >The net failover should be a simple library, not a virtual >object with function callbacks (see callback hell). Why just a library? It should do a common things. I think it should be a virtual object. Looks like your patch again splits the common functionality into multiple drivers. That is kind of backwards attitude. I don't get it. We should rather focus on fixing the mess the introduction of netvsc-bonding caused and switch netvsc to 3-netdev model.
Re: [PATCH net] failover: eliminate callback hell
On 6/5/2018 11:00 PM, Stephen Hemminger wrote: On Tue, 5 Jun 2018 22:39:12 -0700 "Samudrala, Sridhar" wrote: On 6/5/2018 8:51 PM, Stephen Hemminger wrote: On Tue, 5 Jun 2018 16:52:22 -0700 "Samudrala, Sridhar" wrote: On 6/5/2018 2:52 PM, Stephen Hemminger wrote: On Tue, 5 Jun 2018 22:38:43 +0300 "Michael S. Tsirkin" wrote: See: https://patchwork.ozlabs.org/patch/851711/ Let me try to summarize that: You wanted to speed up the delayed link up. You had an idea to additionally take link up when userspace renames the interface (standby one which is also the failover for netvsc). But userspace might not do any renames, in which case there will still be the delay, and so this never got applied. Is this a good summary? Davem said delay should go away completely as it's not robust, and I think I agree. So I don't think we should make all failover users use delay. IIUC failover kept a delay option especially for netvsc to minimize the surprise factor. Hopefully we can come up with something more robust and drop that option completely. The timeout was the original solution to how to complete setup after userspace has had a chance to rename the device. Unfortunately, the whole network device initialization (cooperation with udev and userspace) is a a mess because there is no well defined specification, and there are multiple ways userspace does this in old and new distributions. The timeout has its own issues (how long, handling errors during that window, what if userspace modifies other device state); and open to finding a better solution. My point was that if name change can not be relied on (or used) by netvsc, then we can't allow it for net_failover either. I think the push back was with the usage of the delay, not bringing up the primary/standby device in the name change event handler. Can't netvsc use this mechanism instead of depending on the delay? The patch that was rejected for netvsc was about using name change. Also, you can't depend on name change; you still need a timer. Not all distributions change name of devices. Or user has blocked that by udev rules. In the net_failover_slave_register() we do a dev_open() and ignore any failure due to EBUSY and do another dev_open() in the name change event handler. If the name is not expected to change, i would think the dev_open() at the time of register will succeed. The problem is your first dev_open will bring device up and lockout udev from changing the network device name. I have tried with/without udev and didn't see any issue with the naming of the primary/standby devices in my testing. With the 3-netdev failover model, we are only interested in setting the right name for the failover netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does it really matter if udev fails to rename the lower primary/standby netdevs, i don't think it will matter? The user is not expected to touch the lower netdevs.
Re: [PATCH net] failover: eliminate callback hell
On Tue, 5 Jun 2018 22:39:12 -0700 "Samudrala, Sridhar" wrote: > On 6/5/2018 8:51 PM, Stephen Hemminger wrote: > > On Tue, 5 Jun 2018 16:52:22 -0700 > > "Samudrala, Sridhar" wrote: > > > >> On 6/5/2018 2:52 PM, Stephen Hemminger wrote: > >>> On Tue, 5 Jun 2018 22:38:43 +0300 > >>> "Michael S. Tsirkin" wrote: > >>> > > See: > > https://patchwork.ozlabs.org/patch/851711/ > Let me try to summarize that: > > You wanted to speed up the delayed link up. You had an idea to > additionally take link up when userspace renames the interface (standby > one which is also the failover for netvsc). > > But userspace might not do any renames, in which case there will > still be the delay, and so this never got applied. > > Is this a good summary? > > Davem said delay should go away completely as it's not robust, and I > think I agree. So I don't think we should make all failover users use > delay. IIUC failover kept a delay option especially for netvsc to > minimize the surprise factor. Hopefully we can come up with > something more robust and drop that option completely. > >>> The timeout was the original solution to how to complete setup after > >>> userspace has had a chance to rename the device. Unfortunately, the whole > >>> network > >>> device initialization (cooperation with udev and userspace) is a a mess > >>> because > >>> there is no well defined specification, and there are multiple ways > >>> userspace > >>> does this in old and new distributions. The timeout has its own issues > >>> (how long, handling errors during that window, what if userspace modifies > >>> other > >>> device state); and open to finding a better solution. > >>> > >>> My point was that if name change can not be relied on (or used) by netvsc, > >>> then we can't allow it for net_failover either. > >> I think the push back was with the usage of the delay, not bringing up the > >> primary/standby > >> device in the name change event handler. > >> Can't netvsc use this mechanism instead of depending on the delay? > >> > >> > > The patch that was rejected for netvsc was about using name change. > > Also, you can't depend on name change; you still need a timer. Not all > > distributions > > change name of devices. Or user has blocked that by udev rules. > > In the net_failover_slave_register() we do a dev_open() and ignore any > failure due to > EBUSY and do another dev_open() in the name change event handler. > If the name is not expected to change, i would think the dev_open() at the > time of > register will succeed. The problem is your first dev_open will bring device up and lockout udev from changing the network device name.
Re: [PATCH net] failover: eliminate callback hell
On 6/5/2018 8:51 PM, Stephen Hemminger wrote: On Tue, 5 Jun 2018 16:52:22 -0700 "Samudrala, Sridhar" wrote: On 6/5/2018 2:52 PM, Stephen Hemminger wrote: On Tue, 5 Jun 2018 22:38:43 +0300 "Michael S. Tsirkin" wrote: See: https://patchwork.ozlabs.org/patch/851711/ Let me try to summarize that: You wanted to speed up the delayed link up. You had an idea to additionally take link up when userspace renames the interface (standby one which is also the failover for netvsc). But userspace might not do any renames, in which case there will still be the delay, and so this never got applied. Is this a good summary? Davem said delay should go away completely as it's not robust, and I think I agree. So I don't think we should make all failover users use delay. IIUC failover kept a delay option especially for netvsc to minimize the surprise factor. Hopefully we can come up with something more robust and drop that option completely. The timeout was the original solution to how to complete setup after userspace has had a chance to rename the device. Unfortunately, the whole network device initialization (cooperation with udev and userspace) is a a mess because there is no well defined specification, and there are multiple ways userspace does this in old and new distributions. The timeout has its own issues (how long, handling errors during that window, what if userspace modifies other device state); and open to finding a better solution. My point was that if name change can not be relied on (or used) by netvsc, then we can't allow it for net_failover either. I think the push back was with the usage of the delay, not bringing up the primary/standby device in the name change event handler. Can't netvsc use this mechanism instead of depending on the delay? The patch that was rejected for netvsc was about using name change. Also, you can't depend on name change; you still need a timer. Not all distributions change name of devices. Or user has blocked that by udev rules. In the net_failover_slave_register() we do a dev_open() and ignore any failure due to EBUSY and do another dev_open() in the name change event handler. If the name is not expected to change, i would think the dev_open() at the time of register will succeed.
Re: [PATCH net] failover: eliminate callback hell
On Tue, 5 Jun 2018 16:52:22 -0700 "Samudrala, Sridhar" wrote: > On 6/5/2018 2:52 PM, Stephen Hemminger wrote: > > On Tue, 5 Jun 2018 22:38:43 +0300 > > "Michael S. Tsirkin" wrote: > > > >>> See: > >>> https://patchwork.ozlabs.org/patch/851711/ > >> Let me try to summarize that: > >> > >>You wanted to speed up the delayed link up. You had an idea to > >>additionally take link up when userspace renames the interface (standby > >>one which is also the failover for netvsc). > >> > >>But userspace might not do any renames, in which case there will > >>still be the delay, and so this never got applied. > >> > >>Is this a good summary? > >> > >> Davem said delay should go away completely as it's not robust, and I > >> think I agree. So I don't think we should make all failover users use > >> delay. IIUC failover kept a delay option especially for netvsc to > >> minimize the surprise factor. Hopefully we can come up with > >> something more robust and drop that option completely. > > The timeout was the original solution to how to complete setup after > > userspace has had a chance to rename the device. Unfortunately, the whole > > network > > device initialization (cooperation with udev and userspace) is a a mess > > because > > there is no well defined specification, and there are multiple ways > > userspace > > does this in old and new distributions. The timeout has its own issues > > (how long, handling errors during that window, what if userspace modifies > > other > > device state); and open to finding a better solution. > > > > My point was that if name change can not be relied on (or used) by netvsc, > > then we can't allow it for net_failover either. > > I think the push back was with the usage of the delay, not bringing up the > primary/standby > device in the name change event handler. > Can't netvsc use this mechanism instead of depending on the delay? > > The patch that was rejected for netvsc was about using name change. Also, you can't depend on name change; you still need a timer. Not all distributions change name of devices. Or user has blocked that by udev rules.
Re: [PATCH net] failover: eliminate callback hell
On 6/5/2018 2:52 PM, Stephen Hemminger wrote: On Tue, 5 Jun 2018 22:38:43 +0300 "Michael S. Tsirkin" wrote: See: https://patchwork.ozlabs.org/patch/851711/ Let me try to summarize that: You wanted to speed up the delayed link up. You had an idea to additionally take link up when userspace renames the interface (standby one which is also the failover for netvsc). But userspace might not do any renames, in which case there will still be the delay, and so this never got applied. Is this a good summary? Davem said delay should go away completely as it's not robust, and I think I agree. So I don't think we should make all failover users use delay. IIUC failover kept a delay option especially for netvsc to minimize the surprise factor. Hopefully we can come up with something more robust and drop that option completely. The timeout was the original solution to how to complete setup after userspace has had a chance to rename the device. Unfortunately, the whole network device initialization (cooperation with udev and userspace) is a a mess because there is no well defined specification, and there are multiple ways userspace does this in old and new distributions. The timeout has its own issues (how long, handling errors during that window, what if userspace modifies other device state); and open to finding a better solution. My point was that if name change can not be relied on (or used) by netvsc, then we can't allow it for net_failover either. I think the push back was with the usage of the delay, not bringing up the primary/standby device in the name change event handler. Can't netvsc use this mechanism instead of depending on the delay?
Re: [PATCH net] failover: eliminate callback hell
On Tue, 5 Jun 2018 22:38:43 +0300 "Michael S. Tsirkin" wrote: > > > > See: > >https://patchwork.ozlabs.org/patch/851711/ > > Let me try to summarize that: > > You wanted to speed up the delayed link up. You had an idea to > additionally take link up when userspace renames the interface (standby > one which is also the failover for netvsc). > > But userspace might not do any renames, in which case there will > still be the delay, and so this never got applied. > > Is this a good summary? > > Davem said delay should go away completely as it's not robust, and I > think I agree. So I don't think we should make all failover users use > delay. IIUC failover kept a delay option especially for netvsc to > minimize the surprise factor. Hopefully we can come up with > something more robust and drop that option completely. The timeout was the original solution to how to complete setup after userspace has had a chance to rename the device. Unfortunately, the whole network device initialization (cooperation with udev and userspace) is a a mess because there is no well defined specification, and there are multiple ways userspace does this in old and new distributions. The timeout has its own issues (how long, handling errors during that window, what if userspace modifies other device state); and open to finding a better solution. My point was that if name change can not be relied on (or used) by netvsc, then we can't allow it for net_failover either.
Re: [PATCH net] failover: eliminate callback hell
On Tue, Jun 05, 2018 at 11:53:05AM -0700, Stephen Hemminger wrote: > > > * Now, netvsc and net_failover use the same delayed work type > > > mechanism for setup. Previously, net_failover code was triggering off > > > name change but a similar policy was rejected for netvsc. > > > "what is good for the goose is good for the gander" > > > > I don't really understand what you are saying here. I think the delayed > > hack is kind of ugly and seems racy. Current failover code was rejected > > by whom? Why is new one good and for whom? Did you want to do a name > > change in netvsc but it was rejected? Could you clarify please? > > See: >https://patchwork.ozlabs.org/patch/851711/ Let me try to summarize that: You wanted to speed up the delayed link up. You had an idea to additionally take link up when userspace renames the interface (standby one which is also the failover for netvsc). But userspace might not do any renames, in which case there will still be the delay, and so this never got applied. Is this a good summary? Davem said delay should go away completely as it's not robust, and I think I agree. So I don't think we should make all failover users use delay. IIUC failover kept a delay option especially for netvsc to minimize the surprise factor. Hopefully we can come up with something more robust and drop that option completely. > > > * Set permanent and current address of net_failover device > > > to match the primary. > > > > > > * Carrier should be marked off before registering device > > > the net_failover device. > > > > Are above two bugfixes? > > Yes. Maybe fix these two as first patches in the set? > > > Although this patch needs to go into 4.18 (linux-net), > > > > I'd rather we focused on fixing bugs in 4.18, and left refactoring to > > 4.19. > > > > Either we fix or revert the current code in 4.18. > Sorry, I am not having callback hell code in any vendor or upstream kernel. I agree callbacks add complexity which often isn't necessary, so removing them where possible is a good cleanup. But maybe a patch shouldn't mix bugfixes, cleanups and behaviour changes all together. If nothing else it makes review harder. Splitting patches up might make it more likely they can go into 4.18 which seems to be what you want. HTH, -- MST
Re: [PATCH net] failover: eliminate callback hell
On Tue, 5 Jun 2018 21:35:26 +0300 "Michael S. Tsirkin" wrote: > Thanks, I think this is nice patch but I wonder whether it can be split > up somewhat. Not all of it is uncontroversial. I started that way, but then I was fixing code that was later deleted. The big change was eliminating the callbacks. > > On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote: > > * The matching of secondary device to primary device policy > > is up to the network device. Both net_failover and netvsc > > will use MAC for now but can change separately. > > I actually suspect both will change to a serial number > down the road. > > > * The match policy is only used during initial discovery; after > > that the secondary device knows what the upper device is because > > of the parent/child relationship; no searching is required. > > That would obviously be an improvement - does it have to be tied with > rest of changes? This was not possible with the version of the common code that is in net now. > > > * Now, netvsc and net_failover use the same delayed work type > > mechanism for setup. Previously, net_failover code was triggering off > > name change but a similar policy was rejected for netvsc. > > "what is good for the goose is good for the gander" > > I don't really understand what you are saying here. I think the delayed > hack is kind of ugly and seems racy. Current failover code was rejected > by whom? Why is new one good and for whom? Did you want to do a name > change in netvsc but it was rejected? Could you clarify please? See: https://patchwork.ozlabs.org/patch/851711/ > > > * The net_failover private device info 'struct net_failover_info' > > should have been private to the driver file, not a visible > > API. > > > > * The net_failover device should use SET_NETDEV_DEV > > that is intended only for physical devices not virtual devices. > > You mean should not. Yes. Virtual device should not set device parent. > > > * No point in having DocBook style comments on a driver file. > > They only make sense on an external exposed API. > > > > * net_failover only supports Ethernet, so use ether_addr_copy. > > It is since you need to know about all the things you need to copy, and > because of mac matching. But it isn't too much effort to add more > transports and I don't see value in going in the reverse direction and > making it more ethernet specific that it already is. Sure, then do memcpy base on addr_len > > > * Set permanent and current address of net_failover device > > to match the primary. > > > > * Carrier should be marked off before registering device > > the net_failover device. > > Are above two bugfixes? Yes. > > > * Use netdev_XXX for log messages, in net_failover (not dev_xxx) > > > > * Since failover infrastructure is about linking devices just > > use RTNL no need for other locking in init and teardown. > > > > * Don't bother with ERR_PTR() style return if only possible > > return is success or no memory. > > > > * As much as possible, the terms master and slave should be avoided > > because of their cultural connotations. > > Also for consistency, failover is calling these primary and standby now. Good, let's standardize on that. > > > Note; this code has been tested on Hyper-V > > but is compile tested only on virtio. > > > > Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module") > > Signed-off-by: Stephen Hemminger > > --- > > > > Although this patch needs to go into 4.18 (linux-net), > > I'd rather we focused on fixing bugs in 4.18, and left refactoring to > 4.19. > Either we fix or revert the current code in 4.18. Sorry, I am not having callback hell code in any vendor or upstream kernel.
Re: [PATCH net] failover: eliminate callback hell
Thanks, I think this is nice patch but I wonder whether it can be split up somewhat. Not all of it is uncontroversial. On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote: > * The matching of secondary device to primary device policy > is up to the network device. Both net_failover and netvsc > will use MAC for now but can change separately. I actually suspect both will change to a serial number down the road. > * The match policy is only used during initial discovery; after > that the secondary device knows what the upper device is because > of the parent/child relationship; no searching is required. That would obviously be an improvement - does it have to be tied with rest of changes? > * Now, netvsc and net_failover use the same delayed work type > mechanism for setup. Previously, net_failover code was triggering off > name change but a similar policy was rejected for netvsc. > "what is good for the goose is good for the gander" I don't really understand what you are saying here. I think the delayed hack is kind of ugly and seems racy. Current failover code was rejected by whom? Why is new one good and for whom? Did you want to do a name change in netvsc but it was rejected? Could you clarify please? > * The net_failover private device info 'struct net_failover_info' > should have been private to the driver file, not a visible > API. > > * The net_failover device should use SET_NETDEV_DEV > that is intended only for physical devices not virtual devices. You mean should not. > * No point in having DocBook style comments on a driver file. > They only make sense on an external exposed API. > > * net_failover only supports Ethernet, so use ether_addr_copy. It is since you need to know about all the things you need to copy, and because of mac matching. But it isn't too much effort to add more transports and I don't see value in going in the reverse direction and making it more ethernet specific that it already is. > * Set permanent and current address of net_failover device > to match the primary. > > * Carrier should be marked off before registering device > the net_failover device. Are above two bugfixes? > * Use netdev_XXX for log messages, in net_failover (not dev_xxx) > > * Since failover infrastructure is about linking devices just > use RTNL no need for other locking in init and teardown. > > * Don't bother with ERR_PTR() style return if only possible > return is success or no memory. > > * As much as possible, the terms master and slave should be avoided > because of their cultural connotations. Also for consistency, failover is calling these primary and standby now. > Note; this code has been tested on Hyper-V > but is compile tested only on virtio. > > Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module") > Signed-off-by: Stephen Hemminger > --- > > Although this patch needs to go into 4.18 (linux-net), I'd rather we focused on fixing bugs in 4.18, and left refactoring to 4.19. At some point you said refactoring is needed to support matching using the serial number, but I see this didn't make 4.18. So no rush IMHO. > this version is based against net-next because net-next > hasn't been merged into linux-net yet. > > > drivers/net/hyperv/hyperv_net.h | 3 +- > drivers/net/hyperv/netvsc_drv.c | 173 +++-- > drivers/net/net_failover.c | 312 --- > drivers/net/virtio_net.c| 9 +- > include/net/failover.h | 31 +--- > include/net/net_failover.h | 32 +--- > net/Kconfig | 13 +- > net/core/failover.c | 316 > 8 files changed, 373 insertions(+), 516 deletions(-) > > diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h > index 99d8e7398a5b..c7d25d10765e 100644 > --- a/drivers/net/hyperv/hyperv_net.h > +++ b/drivers/net/hyperv/hyperv_net.h > @@ -902,6 +902,8 @@ struct net_device_context { > struct hv_device *device_ctx; > /* netvsc_device */ > struct netvsc_device __rcu *nvdev; > + /* list of netvsc net_devices */ > + struct list_head list; > /* reconfigure work */ > struct delayed_work dwork; > /* last reconfig time */ > @@ -933,7 +935,6 @@ struct net_device_context { > /* Serial number of the VF to team with */ > u32 vf_serial; > > - struct failover *failover; > }; > > /* Per channel data */ > diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c > index bef4d55a108c..074e6b8578df 100644 > --- a/drivers/net/hyperv/netvsc_drv.c > +++ b/drivers/net/hyperv/netvsc_drv.c > @@ -70,6 +70,8 @@ static int debug = -1; > module_param(debug, int, 0444); > MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)"); > > +static LIST_HEAD(netvsc_dev_list); > + > static void netvsc_change_rx_flags(struct net_device
Re: [PATCH net] failover: eliminate callback hell
From: Stephen Hemminger Date: Tue, 5 Jun 2018 10:45:10 -0700 > I said it wasn't tested. Not surprising. Don't have a version of KVM > that supports standby (and not going to build KVM from scratch for > this). It would definitely help me if you put "RFC" in the subject line for patches which aren't tested :-) Thanks.
Re: [PATCH net] failover: eliminate callback hell
On Tue, 5 Jun 2018 10:22:13 -0700 "Samudrala, Sridhar" wrote: > On 6/4/2018 8:42 PM, Stephen Hemminger wrote: > > The net failover should be a simple library, not a virtual > > object with function callbacks (see callback hell). > > The code is simpler is smaller both for the netvsc and virtio use case. > > I quickly tried this patch and it breaks virtio-net in standby mode. > I don't see failover netdev, unloading virtio-net causes a crash. I said it wasn't tested. Not surprising. Don't have a version of KVM that supports standby (and not going to build KVM from scratch for this). > > With these changes, there is very minimal code that is shared between > netvsc and virtio-net. The notifier and event handling code and the > lookup_bymac routines are now duplicated in both the drivers. I thought > we wanted to keep this code common between the 2 drivers and we went through > multiple revisions to make sure that it works with both netvsc's 2 netdev > and virtio-net's 3 netdev models. The sharing is what needs to be shared; it turns out not much is common really. The total code size is less with this version. Also, since even with nested virtualization, both drivers will not be present on same guest at same time. > The reason for the indirect ops is to support these 2 different models and > i am not sure if the overhead of the callbacks is that significant considering > that they are not called in the hot path. The callbacks invert the functionality. It makes everything bottom up.
Re: [PATCH net] failover: eliminate callback hell
On 6/4/2018 8:42 PM, Stephen Hemminger wrote: The net failover should be a simple library, not a virtual object with function callbacks (see callback hell). The code is simpler is smaller both for the netvsc and virtio use case. I quickly tried this patch and it breaks virtio-net in standby mode. I don't see failover netdev, unloading virtio-net causes a crash. With these changes, there is very minimal code that is shared between netvsc and virtio-net. The notifier and event handling code and the lookup_bymac routines are now duplicated in both the drivers. I thought we wanted to keep this code common between the 2 drivers and we went through multiple revisions to make sure that it works with both netvsc's 2 netdev and virtio-net's 3 netdev models. The reason for the indirect ops is to support these 2 different models and i am not sure if the overhead of the callbacks is that significant considering that they are not called in the hot path. The code is restructured in many ways. I should have given these as review comments to net_failover during review but did not want to overwhelm the original submitter. Therefore it was merged prematurely. Some of the many items changed are: * The support routines should just be selected as needed in kernel config, no need for them to be visible config items. * Both netvsc and net_failover should keep their list of their own devices. Not a common list. * The matching of secondary device to primary device policy is up to the network device. Both net_failover and netvsc will use MAC for now but can change separately. * The match policy is only used during initial discovery; after that the secondary device knows what the upper device is because of the parent/child relationship; no searching is required. * Now, netvsc and net_failover use the same delayed work type mechanism for setup. Previously, net_failover code was triggering off name change but a similar policy was rejected for netvsc. "what is good for the goose is good for the gander" * The net_failover private device info 'struct net_failover_info' should have been private to the driver file, not a visible API. * The net_failover device should use SET_NETDEV_DEV that is intended only for physical devices not virtual devices. * No point in having DocBook style comments on a driver file. They only make sense on an external exposed API. * net_failover only supports Ethernet, so use ether_addr_copy. * Set permanent and current address of net_failover device to match the primary. * Carrier should be marked off before registering device the net_failover device. * Use netdev_XXX for log messages, in net_failover (not dev_xxx) * Since failover infrastructure is about linking devices just use RTNL no need for other locking in init and teardown. * Don't bother with ERR_PTR() style return if only possible return is success or no memory. * As much as possible, the terms master and slave should be avoided because of their cultural connotations. Note; this code has been tested on Hyper-V but is compile tested only on virtio. Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module") Signed-off-by: Stephen Hemminger --- Although this patch needs to go into 4.18 (linux-net), this version is based against net-next because net-next hasn't been merged into linux-net yet. drivers/net/hyperv/hyperv_net.h | 3 +- drivers/net/hyperv/netvsc_drv.c | 173 +++-- drivers/net/net_failover.c | 312 --- drivers/net/virtio_net.c| 9 +- include/net/failover.h | 31 +--- include/net/net_failover.h | 32 +--- net/Kconfig | 13 +- net/core/failover.c | 316 8 files changed, 373 insertions(+), 516 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 99d8e7398a5b..c7d25d10765e 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -902,6 +902,8 @@ struct net_device_context { struct hv_device *device_ctx; /* netvsc_device */ struct netvsc_device __rcu *nvdev; + /* list of netvsc net_devices */ + struct list_head list; /* reconfigure work */ struct delayed_work dwork; /* last reconfig time */ @@ -933,7 +935,6 @@ struct net_device_context { /* Serial number of the VF to team with */ u32 vf_serial; - struct failover *failover; }; /* Per channel data */ diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c index bef4d55a108c..074e6b8578df 100644 --- a/drivers/net/hyperv/netvsc_drv.c +++ b/drivers/net/hyperv/netvsc_drv.c @@ -70,6 +70,8 @@ static int debug = -1; module_param(debug, int, 0444); MODULE_PARM_DESC(debug, "Debug level