Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Michael S. Tsirkin
On Mon, Jun 11, 2018 at 11:56:56AM -0700, Siwei Liu wrote:
> The current implementation may only work with new userspace, even so
> the eth0/eth0nsby naming is not consistenly persisted due to races in
> bus probing.

Which race do you mean exactly?

-- 
MST


Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Samudrala, Sridhar

On 6/11/2018 12:34 PM, Samudrala, Sridhar wrote:


On 6/11/2018 11:10 AM, Michael S. Tsirkin wrote:

On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:

   * Set permanent and current address of net_failover device
 to match the primary.


We copy the dev_addr of standby dev to failover_dev in net_failover_create()
before calling register_netdev().
register_netdev() does a copy of dev_addr to perm_addr.

So i don't think this is an issue.



   * Carrier should be marked off before registering device
 the net_failover device.


Will fix this and also a couple of places dev_err() needs to be replaced with 
netdev_err()


Sridhar, do we want to address this?
If yes, could you please take a look at addressing these
meanwhile, while we keep arguing about making API changes?


Sure. I will submit patches to address these issues raised by Stephen.






Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Samudrala, Sridhar



On 6/11/2018 11:10 AM, Michael S. Tsirkin wrote:

On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:

   * Set permanent and current address of net_failover device
 to match the primary.

   * Carrier should be marked off before registering device
 the net_failover device.

Sridhar, do we want to address this?
If yes, could you please take a look at addressing these
meanwhile, while we keep arguing about making API changes?


Sure. I will submit patches to address these issues raised by Stephen.




Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Siwei Liu
On Mon, Jun 11, 2018 at 8:22 AM, Stephen Hemminger
 wrote:
> On Fri, 8 Jun 2018 17:42:21 -0700
> Siwei Liu  wrote:
>
>> On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
>>  wrote:
>> > On Fri, 8 Jun 2018 16:44:12 -0700
>> > Siwei Liu  wrote:
>> >
>> >> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
>> >>  wrote:
>> >> > On Fri, 8 Jun 2018 15:25:59 -0700
>> >> > Siwei Liu  wrote:
>> >> >
>> >> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> >> >>  wrote:
>> >> >> > On Wed, 6 Jun 2018 15:30:27 +0300
>> >> >> > "Michael S. Tsirkin"  wrote:
>> >> >> >
>> >> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
>> >> >> >> > wrote:
>> >> >> >> > >The net failover should be a simple library, not a virtual
>> >> >> >> > >object with function callbacks (see callback hell).
>> >> >> >> >
>> >> >> >> > Why just a library? It should do a common things. I think it 
>> >> >> >> > should be a
>> >> >> >> > virtual object. Looks like your patch again splits the common
>> >> >> >> > functionality into multiple drivers. That is kind of backwards 
>> >> >> >> > attitude.
>> >> >> >> > I don't get it. We should rather focus on fixing the mess the
>> >> >> >> > introduction of netvsc-bonding caused and switch netvsc to 
>> >> >> >> > 3-netdev
>> >> >> >> > model.
>> >> >> >>
>> >> >> >> So it seems that at least one benefit for netvsc would be better
>> >> >> >> handling of renames.
>> >> >> >>
>> >> >> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> >> >> concerned about risk of breaking some userspace.
>> >> >> >>
>> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> >> >> address, and you said then "why not use existing network namespaces
>> >> >> >> rather than inventing a new abstraction". So how about it then? Do 
>> >> >> >> you
>> >> >> >> want to find a way to use namespaces to hide the PV device for 
>> >> >> >> netvsc
>> >> >> >> compatibility?
>> >> >> >>
>> >> >> >
>> >> >> > Netvsc can't work with 3 dev model. MS has worked with enough 
>> >> >> > distro's and
>> >> >> > startups that all demand eth0 always be present. And VF may come and 
>> >> >> > go.
>> >> >> > After this history, there is a strong motivation not to change how 
>> >> >> > kernel
>> >> >> > behaves. Switching to 3 device model would be perceived as breaking
>> >> >> > existing userspace.
>> >> >> >
>> >> >> > With virtio you can  work it out with the distro's yourself.
>> >> >> > There is no pre-existing semantics to deal with.
>> >> >> >
>> >> >> > For the virtio, I don't see the need for IFF_HIDDEN.
>> >> >>
>> >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> >> around renaming devices, customizing udev rules and et al. Why
>> >> >> inheriting VF's name important? To allow existing config/setup around
>> >> >> VF continues to work across kernel feature upgrade. Most of network
>> >> >> config files in all distros are based on interface names. Few are MAC
>> >> >> address based but making lower slaves hidden would cover the rest. And
>> >> >> most importantly, preserving the same level of user experience as
>> >> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> >> exposed. This is essential to realize transparent live migration that
>> >> >> users dont have to learn and be aware of the undertaken.
>> >> >
>> >> > Inheriting the VF name will fail in the migration scenario.
>> >> > It is perfectly reasonable to migrate a guest to another machine where
>> >> > the VF PCI address is different. And since current udev/systemd model
>> >> > is to base network device name off of PCI address, the device will 
>> >> > change
>> >> > name when guest is migrated.
>> >> >
>> >> The scenario of having VF on a different PCI address on post migration
>> >> is essentially equal to plugging in a new NIC. Why it has to pair with
>> >> the original PV? A sepearte PV device should be in place to pair the
>> >> new VF.
>> >
>> > The host only guarantees that the PV device will be on the same network.
>> > It does not make any PCI guarantees. The way Windows works is to find
>> > the device based on "serial number" which is an Hyper-V specific attribute
>> > of PCI devices.
>> >
>> > I considered naming off of serial number but that won't work for the
>> > case where PV device is present first and VF arrives later. The serial
>> > number is attribute of VF, not the PV which is there first.
>>
>> I assume the PV can get that information ahead of time before VF
>> arrives? Without it how do you match the device when you see a VF
>> coming with some serial number? Is it possible for PV to get the
>> matching SN even earlier during probe time? Or it has to depend on the
>> presence of vPCI bridge to 

Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Siwei Liu
On Fri, Jun 8, 2018 at 6:29 PM, Jakub Kicinski  wrote:
> On Fri, 8 Jun 2018 16:44:12 -0700, Siwei Liu wrote:
>> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> around renaming devices, customizing udev rules and et al. Why
>> >> inheriting VF's name important? To allow existing config/setup around
>> >> VF continues to work across kernel feature upgrade. Most of network
>> >> config files in all distros are based on interface names. Few are MAC
>> >> address based but making lower slaves hidden would cover the rest. And
>> >> most importantly, preserving the same level of user experience as
>> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> exposed. This is essential to realize transparent live migration that
>> >> users dont have to learn and be aware of the undertaken.
>> >
>> > Inheriting the VF name will fail in the migration scenario.
>> > It is perfectly reasonable to migrate a guest to another machine where
>> > the VF PCI address is different. And since current udev/systemd model
>> > is to base network device name off of PCI address, the device will change
>> > name when guest is migrated.
>> >
>> The scenario of having VF on a different PCI address on post migration
>> is essentially equal to plugging in a new NIC. Why it has to pair with
>> the original PV? A sepearte PV device should be in place to pair the
>> new VF.
>
> IMHO it may be a better idea to look at the VF as acceleration for the
> PV rather than PV a migration vehicle from the VF.  Hence we should

I'm basically talking about two use cases not about solutions or
implementations specifically. As said, the one I'm looking into needs
to migrate a pre-failover VF setup to 1-netdev failover model in a
transparent manner. There's no point to switch PCI address back and
forth in the backend to set where to bind the PV or the VF, as you
have no ways to predict what guest kernel will be running until its
fully loaded. Supporting a VF on new location binding to existing PV
might be nice, but not directly relevant to those who don't need this
side feature than migration itself.

Having said that, while I somewhat agree both use cases should have
its own place in the picture, I don't think judging one better than
the other or vice versa is logical IMHO.

> continue to follow the naming of PV, like the current implementation
> does implicitly by linking to PV's struct device.

The current implementation may only work with new userspace, even so
the eth0/eth0nsby naming is not consistenly persisted due to races in
bus probing. The naming part should be fixed.

-Siwei


Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Michael S. Tsirkin
On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
>   * Set permanent and current address of net_failover device
> to match the primary.
> 
>   * Carrier should be marked off before registering device
> the net_failover device.

Sridhar, do we want to address this?
If yes, could you please take a look at addressing these
meanwhile, while we keep arguing about making API changes?

-- 
MST


Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Michael S. Tsirkin
On Wed, Jun 06, 2018 at 03:21:21PM -0700, Stephen Hemminger wrote:
> > > 
> > > I think you need to get rid of triggering off of the name change.  
> > 
> > Worth considering down the road since it's a bit of a hack but it's one
> > we won't have trouble supporting, unlike the delayed uplink.
> 
> You can't depend on userspace doing rename.

There's only so much we can do to add new functionality to old
userspace. You can always just use PV and all will work.

-- 
MST


Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Stephen Hemminger
On Fri, 8 Jun 2018 17:42:21 -0700
Siwei Liu  wrote:

> On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
>  wrote:
> > On Fri, 8 Jun 2018 16:44:12 -0700
> > Siwei Liu  wrote:
> >  
> >> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
> >>  wrote:  
> >> > On Fri, 8 Jun 2018 15:25:59 -0700
> >> > Siwei Liu  wrote:
> >> >  
> >> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
> >> >>  wrote:  
> >> >> > On Wed, 6 Jun 2018 15:30:27 +0300
> >> >> > "Michael S. Tsirkin"  wrote:
> >> >> >  
> >> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
> >> >> >> > wrote:  
> >> >> >> > >The net failover should be a simple library, not a virtual
> >> >> >> > >object with function callbacks (see callback hell).  
> >> >> >> >
> >> >> >> > Why just a library? It should do a common things. I think it 
> >> >> >> > should be a
> >> >> >> > virtual object. Looks like your patch again splits the common
> >> >> >> > functionality into multiple drivers. That is kind of backwards 
> >> >> >> > attitude.
> >> >> >> > I don't get it. We should rather focus on fixing the mess the
> >> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> >> >> > model.  
> >> >> >>
> >> >> >> So it seems that at least one benefit for netvsc would be better
> >> >> >> handling of renames.
> >> >> >>
> >> >> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> >> >> concerned about risk of breaking some userspace.
> >> >> >>
> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> >> >> address, and you said then "why not use existing network namespaces
> >> >> >> rather than inventing a new abstraction". So how about it then? Do 
> >> >> >> you
> >> >> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> >> >> compatibility?
> >> >> >>  
> >> >> >
> >> >> > Netvsc can't work with 3 dev model. MS has worked with enough 
> >> >> > distro's and
> >> >> > startups that all demand eth0 always be present. And VF may come and 
> >> >> > go.
> >> >> > After this history, there is a strong motivation not to change how 
> >> >> > kernel
> >> >> > behaves. Switching to 3 device model would be perceived as breaking
> >> >> > existing userspace.
> >> >> >
> >> >> > With virtio you can  work it out with the distro's yourself.
> >> >> > There is no pre-existing semantics to deal with.
> >> >> >
> >> >> > For the virtio, I don't see the need for IFF_HIDDEN.  
> >> >>
> >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> >> >> that flag, as well as the 1-netdev model, is to have a means to
> >> >> inherit the interface name from the VF, and to eliminate playing hacks
> >> >> around renaming devices, customizing udev rules and et al. Why
> >> >> inheriting VF's name important? To allow existing config/setup around
> >> >> VF continues to work across kernel feature upgrade. Most of network
> >> >> config files in all distros are based on interface names. Few are MAC
> >> >> address based but making lower slaves hidden would cover the rest. And
> >> >> most importantly, preserving the same level of user experience as
> >> >> using raw VF interface once getting all ndo_ops and ethtool_ops
> >> >> exposed. This is essential to realize transparent live migration that
> >> >> users dont have to learn and be aware of the undertaken.  
> >> >
> >> > Inheriting the VF name will fail in the migration scenario.
> >> > It is perfectly reasonable to migrate a guest to another machine where
> >> > the VF PCI address is different. And since current udev/systemd model
> >> > is to base network device name off of PCI address, the device will change
> >> > name when guest is migrated.
> >> >  
> >> The scenario of having VF on a different PCI address on post migration
> >> is essentially equal to plugging in a new NIC. Why it has to pair with
> >> the original PV? A sepearte PV device should be in place to pair the
> >> new VF.  
> >
> > The host only guarantees that the PV device will be on the same network.
> > It does not make any PCI guarantees. The way Windows works is to find
> > the device based on "serial number" which is an Hyper-V specific attribute
> > of PCI devices.
> >
> > I considered naming off of serial number but that won't work for the
> > case where PV device is present first and VF arrives later. The serial
> > number is attribute of VF, not the PV which is there first.  
> 
> I assume the PV can get that information ahead of time before VF
> arrives? Without it how do you match the device when you see a VF
> coming with some serial number? Is it possible for PV to get the
> matching SN even earlier during probe time? Or it has to depend on the
> presence of vPCI bridge to generate this SN?



NO. the PV device does not know ahead of time and there are scenario
where the serial and PCI info can change when it does arrive. These
are test 

Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Stephen Hemminger
On Fri, 8 Jun 2018 15:54:38 -0700
Siwei Liu  wrote:

> On Wed, Jun 6, 2018 at 2:54 PM, Samudrala, Sridhar
>  wrote:
> >
> >
> > On 6/6/2018 2:24 PM, Stephen Hemminger wrote:  
> >>
> >> On Wed, 6 Jun 2018 15:30:27 +0300
> >> "Michael S. Tsirkin"  wrote:
> >>  
> >>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> 
>  Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:  
> >
> > The net failover should be a simple library, not a virtual
> > object with function callbacks (see callback hell).  
> 
>  Why just a library? It should do a common things. I think it should be a
>  virtual object. Looks like your patch again splits the common
>  functionality into multiple drivers. That is kind of backwards attitude.
>  I don't get it. We should rather focus on fixing the mess the
>  introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>  model.  
> >>>
> >>> So it seems that at least one benefit for netvsc would be better
> >>> handling of renames.
> >>>
> >>> Question is how can this change to 3-netdev happen?  Stephen is
> >>> concerned about risk of breaking some userspace.
> >>>
> >>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >>> address, and you said then "why not use existing network namespaces
> >>> rather than inventing a new abstraction". So how about it then? Do you
> >>> want to find a way to use namespaces to hide the PV device for netvsc
> >>> compatibility?
> >>>  
> >> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> >> startups that all demand eth0 always be present. And VF may come and go.
> >> After this history, there is a strong motivation not to change how kernel
> >> behaves. Switching to 3 device model would be perceived as breaking
> >> existing userspace.  
> >
> >
> > I think it should be possible for netvsc to work with 3 dev model if the
> > only
> > requirement is that eth0 will always be present. With net_failover, you will
> > see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an
> > issue
> > if somehow there is userspace requirement that there can be only 2 netdevs,
> > not 3
> > when VF is plugged.
> >
> > eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc
> > device
> > and the IP address gets configured on eth0. Will this be an issue?
> >  
> Did you realize that the eth0 name in the current 3-netdev code can't
> be consistently persisted across reboot, if you have more than one VFs
> to pair with? On one boot it got eth0/eth0nsby, on the next it may get
> eth1/eth1nsby on the same interface.
> 
> It won't be useable by default until you add some custom udev rules.
> 

I think there is no reason to break things by moving netvsc to 3 device
model. 

The first device probed is always the same on Hyper-V/Azure, and always
comes up as eth0. The order comes from the fact that they are reported
to guest in that order and currently vmbus probe is single threaded.


Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Michael S. Tsirkin
On Fri, Jun 08, 2018 at 05:02:35PM -0700, Stephen Hemminger wrote:
> On Fri, 8 Jun 2018 16:44:12 -0700
> Siwei Liu  wrote:
> 
> > On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
> >  wrote:
> > > On Fri, 8 Jun 2018 15:25:59 -0700
> > > Siwei Liu  wrote:
> > >  
> > >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
> > >>  wrote:  
> > >> > On Wed, 6 Jun 2018 15:30:27 +0300
> > >> > "Michael S. Tsirkin"  wrote:
> > >> >  
> > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> > >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
> > >> >> > wrote:  
> > >> >> > >The net failover should be a simple library, not a virtual
> > >> >> > >object with function callbacks (see callback hell).  
> > >> >> >
> > >> >> > Why just a library? It should do a common things. I think it should 
> > >> >> > be a
> > >> >> > virtual object. Looks like your patch again splits the common
> > >> >> > functionality into multiple drivers. That is kind of backwards 
> > >> >> > attitude.
> > >> >> > I don't get it. We should rather focus on fixing the mess the
> > >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > >> >> > model.  
> > >> >>
> > >> >> So it seems that at least one benefit for netvsc would be better
> > >> >> handling of renames.
> > >> >>
> > >> >> Question is how can this change to 3-netdev happen?  Stephen is
> > >> >> concerned about risk of breaking some userspace.
> > >> >>
> > >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > >> >> address, and you said then "why not use existing network namespaces
> > >> >> rather than inventing a new abstraction". So how about it then? Do you
> > >> >> want to find a way to use namespaces to hide the PV device for netvsc
> > >> >> compatibility?
> > >> >>  
> > >> >
> > >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's 
> > >> > and
> > >> > startups that all demand eth0 always be present. And VF may come and 
> > >> > go.
> > >> > After this history, there is a strong motivation not to change how 
> > >> > kernel
> > >> > behaves. Switching to 3 device model would be perceived as breaking
> > >> > existing userspace.
> > >> >
> > >> > With virtio you can  work it out with the distro's yourself.
> > >> > There is no pre-existing semantics to deal with.
> > >> >
> > >> > For the virtio, I don't see the need for IFF_HIDDEN.  
> > >>
> > >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> > >> that flag, as well as the 1-netdev model, is to have a means to
> > >> inherit the interface name from the VF, and to eliminate playing hacks
> > >> around renaming devices, customizing udev rules and et al. Why
> > >> inheriting VF's name important? To allow existing config/setup around
> > >> VF continues to work across kernel feature upgrade. Most of network
> > >> config files in all distros are based on interface names. Few are MAC
> > >> address based but making lower slaves hidden would cover the rest. And
> > >> most importantly, preserving the same level of user experience as
> > >> using raw VF interface once getting all ndo_ops and ethtool_ops
> > >> exposed. This is essential to realize transparent live migration that
> > >> users dont have to learn and be aware of the undertaken.  
> > >
> > > Inheriting the VF name will fail in the migration scenario.
> > > It is perfectly reasonable to migrate a guest to another machine where
> > > the VF PCI address is different. And since current udev/systemd model
> > > is to base network device name off of PCI address, the device will change
> > > name when guest is migrated.
> > >  
> > The scenario of having VF on a different PCI address on post migration
> > is essentially equal to plugging in a new NIC. Why it has to pair with
> > the original PV? A sepearte PV device should be in place to pair the
> > new VF.
> 
> The host only guarantees that the PV device will be on the same network.
> It does not make any PCI guarantees. The way Windows works is to find
> the device based on "serial number" which is an Hyper-V specific attribute
> of PCI devices.
> 
> I considered naming off of serial number but that won't work for the
> case where PV device is present first and VF arrives later. The serial
> number is attribute of VF, not the PV which is there first.
> 
> Your ideas about having the PCI information of the VF form the name
> of the failover device have the same problem. The PV device may
> be the only one present on boot.

We plan to add the serial number to the PV.


> 
> > > On Azure, the VF maybe removed (by host) at any time and then later
> > > reattached. There is no guarantee that VF will show back up at
> > > the same synthetic PCI address. It will likely have a different
> > > PCI domain value.  
> > 
> > This is something QEMU can do and make sure the PCI address is
> > consistent after migration.
> > 
> > -Siwei


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Jakub Kicinski
On Fri, 8 Jun 2018 16:44:12 -0700, Siwei Liu wrote:
> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> >> that flag, as well as the 1-netdev model, is to have a means to
> >> inherit the interface name from the VF, and to eliminate playing hacks
> >> around renaming devices, customizing udev rules and et al. Why
> >> inheriting VF's name important? To allow existing config/setup around
> >> VF continues to work across kernel feature upgrade. Most of network
> >> config files in all distros are based on interface names. Few are MAC
> >> address based but making lower slaves hidden would cover the rest. And
> >> most importantly, preserving the same level of user experience as
> >> using raw VF interface once getting all ndo_ops and ethtool_ops
> >> exposed. This is essential to realize transparent live migration that
> >> users dont have to learn and be aware of the undertaken.  
> >
> > Inheriting the VF name will fail in the migration scenario.
> > It is perfectly reasonable to migrate a guest to another machine where
> > the VF PCI address is different. And since current udev/systemd model
> > is to base network device name off of PCI address, the device will change
> > name when guest is migrated.
> >  
> The scenario of having VF on a different PCI address on post migration
> is essentially equal to plugging in a new NIC. Why it has to pair with
> the original PV? A sepearte PV device should be in place to pair the
> new VF.

IMHO it may be a better idea to look at the VF as acceleration for the
PV rather than PV a migration vehicle from the VF.  Hence we should
continue to follow the naming of PV, like the current implementation
does implicitly by linking to PV's struct device.


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Siwei Liu
On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
 wrote:
> On Fri, 8 Jun 2018 16:44:12 -0700
> Siwei Liu  wrote:
>
>> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
>>  wrote:
>> > On Fri, 8 Jun 2018 15:25:59 -0700
>> > Siwei Liu  wrote:
>> >
>> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> >>  wrote:
>> >> > On Wed, 6 Jun 2018 15:30:27 +0300
>> >> > "Michael S. Tsirkin"  wrote:
>> >> >
>> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
>> >> >> > wrote:
>> >> >> > >The net failover should be a simple library, not a virtual
>> >> >> > >object with function callbacks (see callback hell).
>> >> >> >
>> >> >> > Why just a library? It should do a common things. I think it should 
>> >> >> > be a
>> >> >> > virtual object. Looks like your patch again splits the common
>> >> >> > functionality into multiple drivers. That is kind of backwards 
>> >> >> > attitude.
>> >> >> > I don't get it. We should rather focus on fixing the mess the
>> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >> >> > model.
>> >> >>
>> >> >> So it seems that at least one benefit for netvsc would be better
>> >> >> handling of renames.
>> >> >>
>> >> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> >> concerned about risk of breaking some userspace.
>> >> >>
>> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> >> address, and you said then "why not use existing network namespaces
>> >> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> >> compatibility?
>> >> >>
>> >> >
>> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's 
>> >> > and
>> >> > startups that all demand eth0 always be present. And VF may come and go.
>> >> > After this history, there is a strong motivation not to change how 
>> >> > kernel
>> >> > behaves. Switching to 3 device model would be perceived as breaking
>> >> > existing userspace.
>> >> >
>> >> > With virtio you can  work it out with the distro's yourself.
>> >> > There is no pre-existing semantics to deal with.
>> >> >
>> >> > For the virtio, I don't see the need for IFF_HIDDEN.
>> >>
>> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> around renaming devices, customizing udev rules and et al. Why
>> >> inheriting VF's name important? To allow existing config/setup around
>> >> VF continues to work across kernel feature upgrade. Most of network
>> >> config files in all distros are based on interface names. Few are MAC
>> >> address based but making lower slaves hidden would cover the rest. And
>> >> most importantly, preserving the same level of user experience as
>> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> exposed. This is essential to realize transparent live migration that
>> >> users dont have to learn and be aware of the undertaken.
>> >
>> > Inheriting the VF name will fail in the migration scenario.
>> > It is perfectly reasonable to migrate a guest to another machine where
>> > the VF PCI address is different. And since current udev/systemd model
>> > is to base network device name off of PCI address, the device will change
>> > name when guest is migrated.
>> >
>> The scenario of having VF on a different PCI address on post migration
>> is essentially equal to plugging in a new NIC. Why it has to pair with
>> the original PV? A sepearte PV device should be in place to pair the
>> new VF.
>
> The host only guarantees that the PV device will be on the same network.
> It does not make any PCI guarantees. The way Windows works is to find
> the device based on "serial number" which is an Hyper-V specific attribute
> of PCI devices.
>
> I considered naming off of serial number but that won't work for the
> case where PV device is present first and VF arrives later. The serial
> number is attribute of VF, not the PV which is there first.

I assume the PV can get that information ahead of time before VF
arrives? Without it how do you match the device when you see a VF
coming with some serial number? Is it possible for PV to get the
matching SN even earlier during probe time? Or it has to depend on the
presence of vPCI bridge to generate this SN?

>
> Your ideas about having the PCI information of the VF form the name
> of the failover device have the same problem. The PV device may
> be the only one present on boot.

Yeah, this is a chicken-egg problem indeed, and that was the reason
why I supply the BDF info for PV to name the master interface.
However, the ACPI PCI slot needs to depend on the PCI bus enumeration
so that can't be predictable.  Would it make sense to only rename 

Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Stephen Hemminger
On Fri, 8 Jun 2018 16:44:12 -0700
Siwei Liu  wrote:

> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
>  wrote:
> > On Fri, 8 Jun 2018 15:25:59 -0700
> > Siwei Liu  wrote:
> >  
> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
> >>  wrote:  
> >> > On Wed, 6 Jun 2018 15:30:27 +0300
> >> > "Michael S. Tsirkin"  wrote:
> >> >  
> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
> >> >> > wrote:  
> >> >> > >The net failover should be a simple library, not a virtual
> >> >> > >object with function callbacks (see callback hell).  
> >> >> >
> >> >> > Why just a library? It should do a common things. I think it should 
> >> >> > be a
> >> >> > virtual object. Looks like your patch again splits the common
> >> >> > functionality into multiple drivers. That is kind of backwards 
> >> >> > attitude.
> >> >> > I don't get it. We should rather focus on fixing the mess the
> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> >> > model.  
> >> >>
> >> >> So it seems that at least one benefit for netvsc would be better
> >> >> handling of renames.
> >> >>
> >> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> >> concerned about risk of breaking some userspace.
> >> >>
> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> >> address, and you said then "why not use existing network namespaces
> >> >> rather than inventing a new abstraction". So how about it then? Do you
> >> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> >> compatibility?
> >> >>  
> >> >
> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's 
> >> > and
> >> > startups that all demand eth0 always be present. And VF may come and go.
> >> > After this history, there is a strong motivation not to change how kernel
> >> > behaves. Switching to 3 device model would be perceived as breaking
> >> > existing userspace.
> >> >
> >> > With virtio you can  work it out with the distro's yourself.
> >> > There is no pre-existing semantics to deal with.
> >> >
> >> > For the virtio, I don't see the need for IFF_HIDDEN.  
> >>
> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> >> that flag, as well as the 1-netdev model, is to have a means to
> >> inherit the interface name from the VF, and to eliminate playing hacks
> >> around renaming devices, customizing udev rules and et al. Why
> >> inheriting VF's name important? To allow existing config/setup around
> >> VF continues to work across kernel feature upgrade. Most of network
> >> config files in all distros are based on interface names. Few are MAC
> >> address based but making lower slaves hidden would cover the rest. And
> >> most importantly, preserving the same level of user experience as
> >> using raw VF interface once getting all ndo_ops and ethtool_ops
> >> exposed. This is essential to realize transparent live migration that
> >> users dont have to learn and be aware of the undertaken.  
> >
> > Inheriting the VF name will fail in the migration scenario.
> > It is perfectly reasonable to migrate a guest to another machine where
> > the VF PCI address is different. And since current udev/systemd model
> > is to base network device name off of PCI address, the device will change
> > name when guest is migrated.
> >  
> The scenario of having VF on a different PCI address on post migration
> is essentially equal to plugging in a new NIC. Why it has to pair with
> the original PV? A sepearte PV device should be in place to pair the
> new VF.

The host only guarantees that the PV device will be on the same network.
It does not make any PCI guarantees. The way Windows works is to find
the device based on "serial number" which is an Hyper-V specific attribute
of PCI devices.

I considered naming off of serial number but that won't work for the
case where PV device is present first and VF arrives later. The serial
number is attribute of VF, not the PV which is there first.

Your ideas about having the PCI information of the VF form the name
of the failover device have the same problem. The PV device may
be the only one present on boot.


> > On Azure, the VF maybe removed (by host) at any time and then later
> > reattached. There is no guarantee that VF will show back up at
> > the same synthetic PCI address. It will likely have a different
> > PCI domain value.  
> 
> This is something QEMU can do and make sure the PCI address is
> consistent after migration.
> 
> -Siwei



Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Siwei Liu
On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
 wrote:
> On Fri, 8 Jun 2018 15:25:59 -0700
> Siwei Liu  wrote:
>
>> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>>  wrote:
>> > On Wed, 6 Jun 2018 15:30:27 +0300
>> > "Michael S. Tsirkin"  wrote:
>> >
>> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>> >> > >The net failover should be a simple library, not a virtual
>> >> > >object with function callbacks (see callback hell).
>> >> >
>> >> > Why just a library? It should do a common things. I think it should be a
>> >> > virtual object. Looks like your patch again splits the common
>> >> > functionality into multiple drivers. That is kind of backwards attitude.
>> >> > I don't get it. We should rather focus on fixing the mess the
>> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >> > model.
>> >>
>> >> So it seems that at least one benefit for netvsc would be better
>> >> handling of renames.
>> >>
>> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> concerned about risk of breaking some userspace.
>> >>
>> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> address, and you said then "why not use existing network namespaces
>> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> compatibility?
>> >>
>> >
>> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> > startups that all demand eth0 always be present. And VF may come and go.
>> > After this history, there is a strong motivation not to change how kernel
>> > behaves. Switching to 3 device model would be perceived as breaking
>> > existing userspace.
>> >
>> > With virtio you can  work it out with the distro's yourself.
>> > There is no pre-existing semantics to deal with.
>> >
>> > For the virtio, I don't see the need for IFF_HIDDEN.
>>
>> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> that flag, as well as the 1-netdev model, is to have a means to
>> inherit the interface name from the VF, and to eliminate playing hacks
>> around renaming devices, customizing udev rules and et al. Why
>> inheriting VF's name important? To allow existing config/setup around
>> VF continues to work across kernel feature upgrade. Most of network
>> config files in all distros are based on interface names. Few are MAC
>> address based but making lower slaves hidden would cover the rest. And
>> most importantly, preserving the same level of user experience as
>> using raw VF interface once getting all ndo_ops and ethtool_ops
>> exposed. This is essential to realize transparent live migration that
>> users dont have to learn and be aware of the undertaken.
>
> Inheriting the VF name will fail in the migration scenario.
> It is perfectly reasonable to migrate a guest to another machine where
> the VF PCI address is different. And since current udev/systemd model
> is to base network device name off of PCI address, the device will change
> name when guest is migrated.
>
The scenario of having VF on a different PCI address on post migration
is essentially equal to plugging in a new NIC. Why it has to pair with
the original PV? A sepearte PV device should be in place to pair the
new VF.


> On Azure, the VF maybe removed (by host) at any time and then later
> reattached. There is no guarantee that VF will show back up at
> the same synthetic PCI address. It will likely have a different
> PCI domain value.

This is something QEMU can do and make sure the PCI address is
consistent after migration.

-Siwei


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Stephen Hemminger
On Fri, 8 Jun 2018 15:25:59 -0700
Siwei Liu  wrote:

> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>  wrote:
> > On Wed, 6 Jun 2018 15:30:27 +0300
> > "Michael S. Tsirkin"  wrote:
> >  
> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:  
> >> > >The net failover should be a simple library, not a virtual
> >> > >object with function callbacks (see callback hell).  
> >> >
> >> > Why just a library? It should do a common things. I think it should be a
> >> > virtual object. Looks like your patch again splits the common
> >> > functionality into multiple drivers. That is kind of backwards attitude.
> >> > I don't get it. We should rather focus on fixing the mess the
> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> > model.  
> >>
> >> So it seems that at least one benefit for netvsc would be better
> >> handling of renames.
> >>
> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> concerned about risk of breaking some userspace.
> >>
> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> address, and you said then "why not use existing network namespaces
> >> rather than inventing a new abstraction". So how about it then? Do you
> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> compatibility?
> >>  
> >
> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > startups that all demand eth0 always be present. And VF may come and go.
> > After this history, there is a strong motivation not to change how kernel
> > behaves. Switching to 3 device model would be perceived as breaking
> > existing userspace.
> >
> > With virtio you can  work it out with the distro's yourself.
> > There is no pre-existing semantics to deal with.
> >
> > For the virtio, I don't see the need for IFF_HIDDEN.  
> 
> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> that flag, as well as the 1-netdev model, is to have a means to
> inherit the interface name from the VF, and to eliminate playing hacks
> around renaming devices, customizing udev rules and et al. Why
> inheriting VF's name important? To allow existing config/setup around
> VF continues to work across kernel feature upgrade. Most of network
> config files in all distros are based on interface names. Few are MAC
> address based but making lower slaves hidden would cover the rest. And
> most importantly, preserving the same level of user experience as
> using raw VF interface once getting all ndo_ops and ethtool_ops
> exposed. This is essential to realize transparent live migration that
> users dont have to learn and be aware of the undertaken.

Inheriting the VF name will fail in the migration scenario.
It is perfectly reasonable to migrate a guest to another machine where
the VF PCI address is different. And since current udev/systemd model
is to base network device name off of PCI address, the device will change
name when guest is migrated.

On Azure, the VF maybe removed (by host) at any time and then later
reattached. There is no guarantee that VF will show back up at
the same synthetic PCI address. It will likely have a different
PCI domain value.


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Siwei Liu
On Wed, Jun 6, 2018 at 2:54 PM, Samudrala, Sridhar
 wrote:
>
>
> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
>>
>> On Wed, 6 Jun 2018 15:30:27 +0300
>> "Michael S. Tsirkin"  wrote:
>>
>>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:

 Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>
> The net failover should be a simple library, not a virtual
> object with function callbacks (see callback hell).

 Why just a library? It should do a common things. I think it should be a
 virtual object. Looks like your patch again splits the common
 functionality into multiple drivers. That is kind of backwards attitude.
 I don't get it. We should rather focus on fixing the mess the
 introduction of netvsc-bonding caused and switch netvsc to 3-netdev
 model.
>>>
>>> So it seems that at least one benefit for netvsc would be better
>>> handling of renames.
>>>
>>> Question is how can this change to 3-netdev happen?  Stephen is
>>> concerned about risk of breaking some userspace.
>>>
>>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>>> address, and you said then "why not use existing network namespaces
>>> rather than inventing a new abstraction". So how about it then? Do you
>>> want to find a way to use namespaces to hide the PV device for netvsc
>>> compatibility?
>>>
>> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> startups that all demand eth0 always be present. And VF may come and go.
>> After this history, there is a strong motivation not to change how kernel
>> behaves. Switching to 3 device model would be perceived as breaking
>> existing userspace.
>
>
> I think it should be possible for netvsc to work with 3 dev model if the
> only
> requirement is that eth0 will always be present. With net_failover, you will
> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an
> issue
> if somehow there is userspace requirement that there can be only 2 netdevs,
> not 3
> when VF is plugged.
>
> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc
> device
> and the IP address gets configured on eth0. Will this be an issue?
>
Did you realize that the eth0 name in the current 3-netdev code can't
be consistently persisted across reboot, if you have more than one VFs
to pair with? On one boot it got eth0/eth0nsby, on the next it may get
eth1/eth1nsby on the same interface.

It won't be useable by default until you add some custom udev rules.

-Siwei

>
>
>>
>> With virtio you can  work it out with the distro's yourself.
>> There is no pre-existing semantics to deal with.
>>
>> For the virtio, I don't see the need for IFF_HIDDEN.
>> With 3-dev model as long as you mark the PV and VF devices
>> as slaves, then userspace knows to leave them alone. Assuming userspace
>> is already able to deal with team and bond devices.
>> Any time you introduce new UAPI behavior something breaks.
>>
>> On the rename front, I really don't care if VF can be renamed. And for
>> netvsc want to allow the PV device to be renamed. Udev developers want
>> that
>> but have not found a stable/persistent value to expose to userspace
>> to allow it.
>
>


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Siwei Liu
On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
 wrote:
> On Wed, 6 Jun 2018 15:30:27 +0300
> "Michael S. Tsirkin"  wrote:
>
>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>> > >The net failover should be a simple library, not a virtual
>> > >object with function callbacks (see callback hell).
>> >
>> > Why just a library? It should do a common things. I think it should be a
>> > virtual object. Looks like your patch again splits the common
>> > functionality into multiple drivers. That is kind of backwards attitude.
>> > I don't get it. We should rather focus on fixing the mess the
>> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> > model.
>>
>> So it seems that at least one benefit for netvsc would be better
>> handling of renames.
>>
>> Question is how can this change to 3-netdev happen?  Stephen is
>> concerned about risk of breaking some userspace.
>>
>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> address, and you said then "why not use existing network namespaces
>> rather than inventing a new abstraction". So how about it then? Do you
>> want to find a way to use namespaces to hide the PV device for netvsc
>> compatibility?
>>
>
> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> startups that all demand eth0 always be present. And VF may come and go.
> After this history, there is a strong motivation not to change how kernel
> behaves. Switching to 3 device model would be perceived as breaking
> existing userspace.
>
> With virtio you can  work it out with the distro's yourself.
> There is no pre-existing semantics to deal with.
>
> For the virtio, I don't see the need for IFF_HIDDEN.

I have a somewhat different view regarding IFF_HIDDEN. The purpose of
that flag, as well as the 1-netdev model, is to have a means to
inherit the interface name from the VF, and to eliminate playing hacks
around renaming devices, customizing udev rules and et al. Why
inheriting VF's name important? To allow existing config/setup around
VF continues to work across kernel feature upgrade. Most of network
config files in all distros are based on interface names. Few are MAC
address based but making lower slaves hidden would cover the rest. And
most importantly, preserving the same level of user experience as
using raw VF interface once getting all ndo_ops and ethtool_ops
exposed. This is essential to realize transparent live migration that
users dont have to learn and be aware of the undertaken.

It's unfair to say all virtio use cases don't need IFF_HIDDEN. A few
use cases don't care about getting slaves exposed, the 3-netdev model
would work for them. For the rest, the pre-existing semantics to them
is the single VF device model they've already dealt with. This is
nothing different than having Azure stick to the 2-netdev model
because of existing user base IMHO.

-Siwei


> With 3-dev model as long as you mark the PV and VF devices
> as slaves, then userspace knows to leave them alone. Assuming userspace
> is already able to deal with team and bond devices.
> Any time you introduce new UAPI behavior something breaks.
>
> On the rename front, I really don't care if VF can be renamed. And for
> netvsc want to allow the PV device to be renamed. Udev developers want that
> but have not found a stable/persistent value to expose to userspace
> to allow it.


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Michael S. Tsirkin
On Fri, Jun 08, 2018 at 11:30:08AM -0700, Stephen Hemminger wrote:
>   * what about nested KVM on Hyper-V? Would it make sense to
> have a way to pass subset of VF queues to guest?

No as long as hyper-v doesn't have a vIOMMU.

-- 
MST


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Stephen Hemminger
On Thu, 7 Jun 2018 20:22:15 +0300
"Michael S. Tsirkin"  wrote:

> On Thu, Jun 07, 2018 at 09:17:42AM -0700, Stephen Hemminger wrote:
> > On Thu, 7 Jun 2018 18:41:31 +0300
> > "Michael S. Tsirkin"  wrote:
> >   
> > > > > Why would DPDK care what we do in the kernel? Isn't it just slapping
> > > > > vfio-pci on the netdevs it sees?
> > > > 
> > > > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel 
> > > > based.,.
> > > > The DPDK support uses:
> > > >  * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
> > > >userspace. This means VF netdev device must exist and be visible.
> > > >  * Slow path using kernel netvsc device, TAP and BPF to get exception
> > > >path packets to userspace.
> > > >  * A autodiscovery mechanism that to set all this up that relies on
> > > >2 device model and sysfs.
> > > 
> > > Could you describe what does it look for exactly? What will break if
> > > instead of MLX5 being a child of the PV, it's a child of the failover
> > > device?  
> > 
> > So in DPDK there is an internal four device model:
> > 1. failsafe is like failover in your model
> > 2. TAP is used like netvsc in kernel
> > 3. MLX5 is the VF
> > 4. vdev_netvsc is a pseudo device whose only reason to exist
> >is to glue everything together.
> > 
> > Digging deeper inside...
> > 
> > Vdev_netvsc does:
> >* driver is started in a convuluted way off device arguments
> >* probe routine for driver runs
> >   - scans list of kernel interfaces in sysfs
> >   - matches those using VMBUS   
> 
> Could you tell a bit more what does this step entail?

Quick code high/low lights.


ret = vdev_netvsc_foreach_iface(vdev_netvsc_netvsc_probe, 1, name,
kvargs, specified, );
static int
vdev_netvsc_foreach_iface(int (*func)(const struct if_nameindex *iface,
  const struct ether_addr *eth_addr,
  va_list ap), int is_netvsc, ...)
{
struct if_nameindex *iface = if_nameindex();


for (i = 0; iface[i].if_name; ++i) {

is_netvsc_ret = vdev_netvsc_iface_is_netvsc([i]) ? 1 : 0;
if (is_netvsc ^ is_netvsc_ret)
continue;

strlcpy(req.ifr_name, iface[i].if_name, sizeof(req.ifr_name));
if (ioctl(s, SIOCGIFHWADDR, ) == -1) {
}

memcpy(eth_addr.addr_bytes, req.ifr_hwaddr.sa_data,
   RTE_DIM(eth_addr.addr_bytes));

ret = func([i], _addr, ap);  << func is 
vdev_netvsc_netvsc_probe


static int
vdev_netvsc_netvsc_probe(const struct if_nameindex *iface,
 const struct ether_addr *eth_addr,
 va_list ap)
{

/* Routed NetVSC should not be probed. */
if (vdev_netvsc_has_route(iface, AF_INET) ||
vdev_netvsc_has_route(iface, AF_INET6)) {
if (!specified)
return 0;
DRV_LOG(WARNING, "probably using routed NetVSC interface \"%s\""
" (index %u)", iface->if_name, iface->if_index);
}
/* Create interface context. */
ctx = calloc(1, sizeof(*ctx));
...


> 
> >   - skip netvsc devices that have an IPV4 route
> >* scan for PCI devices that have same MAC address as kernel netvsc
> >  devices discovered in previous step
> >* add these interfaces to arguments to failsafe
> > 
> > Then failsafe configures based on arguments on device
> > 
> > The code works but is specific to the Azure hardware model, and exposes lots
> > of things to application that it should not have to care about.
> > 
> > If you  try and walk through this code in DPDK, you will see why I have 
> > developed
> > a dislike for high levels of indirection.
> > 
> > 
> >  
> 
> Thanks that was helpful!  I'll try to poke at it next week.  Just from
> the description it seems the kernel is merely used to locate the MAC
> address through sysfs and that for this DPDK code to keep working the
> hidden device must be hidden from it in sysfs - is that a fair summary?

What is the point of the 3 device model? What value does it have
to userspace? How would userspace use each of the three devices.
Going back to 3 device model really doesn't make sense to me if
there is not visible benefit.

Some other considerations:
   * there is ongoing development to support RDMA failover as
 well in netvsc.

   * there is a new driver which implements the VMBUS protocol
 in userspace for DPDK. This gets rid of several layers and
 removes any special scanning code. The vmbus device is
 unbound from netvsc and bound to UIO device.  Then the user
 space DPDK driver manages all the host signalling events
 including VF discovery. It is really 2 device model done
 all in userspace. The kernel device is still needed when
 

Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Michael S. Tsirkin
On Thu, Jun 07, 2018 at 09:17:42AM -0700, Stephen Hemminger wrote:
> On Thu, 7 Jun 2018 18:41:31 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > > > Why would DPDK care what we do in the kernel? Isn't it just slapping
> > > > vfio-pci on the netdevs it sees?  
> > > 
> > > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel 
> > > based.,.
> > > The DPDK support uses:
> > >  * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
> > >userspace. This means VF netdev device must exist and be visible.
> > >  * Slow path using kernel netvsc device, TAP and BPF to get exception
> > >path packets to userspace.
> > >  * A autodiscovery mechanism that to set all this up that relies on
> > >2 device model and sysfs.  
> > 
> > Could you describe what does it look for exactly? What will break if
> > instead of MLX5 being a child of the PV, it's a child of the failover
> > device?
> 
> So in DPDK there is an internal four device model:
>   1. failsafe is like failover in your model
>   2. TAP is used like netvsc in kernel
>   3. MLX5 is the VF
>   4. vdev_netvsc is a pseudo device whose only reason to exist
>  is to glue everything together.
> 
> Digging deeper inside...
> 
> Vdev_netvsc does:
>* driver is started in a convuluted way off device arguments
>* probe routine for driver runs
>   - scans list of kernel interfaces in sysfs
>   - matches those using VMBUS 

Could you tell a bit more what does this step entail?

>   - skip netvsc devices that have an IPV4 route
>* scan for PCI devices that have same MAC address as kernel netvsc
>  devices discovered in previous step
>* add these interfaces to arguments to failsafe
> 
> Then failsafe configures based on arguments on device
> 
> The code works but is specific to the Azure hardware model, and exposes lots
> of things to application that it should not have to care about.
> 
> If you  try and walk through this code in DPDK, you will see why I have 
> developed
> a dislike for high levels of indirection.
> 
> 
>  

Thanks that was helpful!  I'll try to poke at it next week.  Just from
the description it seems the kernel is merely used to locate the MAC
address through sysfs and that for this DPDK code to keep working the
hidden device must be hidden from it in sysfs - is that a fair summary?

-- 
MST


Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Stephen Hemminger
On Thu, 7 Jun 2018 18:41:31 +0300
"Michael S. Tsirkin"  wrote:

> > > Why would DPDK care what we do in the kernel? Isn't it just slapping
> > > vfio-pci on the netdevs it sees?  
> > 
> > Alex, you are correct for Intel devices; but DPDK on Azure is not Intel 
> > based.,.
> > The DPDK support uses:
> >  * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
> >userspace. This means VF netdev device must exist and be visible.
> >  * Slow path using kernel netvsc device, TAP and BPF to get exception
> >path packets to userspace.
> >  * A autodiscovery mechanism that to set all this up that relies on
> >2 device model and sysfs.  
> 
> Could you describe what does it look for exactly? What will break if
> instead of MLX5 being a child of the PV, it's a child of the failover
> device?

So in DPDK there is an internal four device model:
1. failsafe is like failover in your model
2. TAP is used like netvsc in kernel
3. MLX5 is the VF
4. vdev_netvsc is a pseudo device whose only reason to exist
   is to glue everything together.

Digging deeper inside...

Vdev_netvsc does:
   * driver is started in a convuluted way off device arguments
   * probe routine for driver runs
  - scans list of kernel interfaces in sysfs
  - matches those using VMBUS 
  - skip netvsc devices that have an IPV4 route
   * scan for PCI devices that have same MAC address as kernel netvsc
 devices discovered in previous step
   * add these interfaces to arguments to failsafe

Then failsafe configures based on arguments on device

The code works but is specific to the Azure hardware model, and exposes lots
of things to application that it should not have to care about.

If you  try and walk through this code in DPDK, you will see why I have 
developed
a dislike for high levels of indirection.


   


Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Michael S. Tsirkin
On Thu, Jun 07, 2018 at 07:51:12AM -0700, Stephen Hemminger wrote:
> On Thu, 7 Jun 2018 07:17:51 -0700
> Alexander Duyck  wrote:
> 
> > On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger
> >  wrote:
> > > On Wed, 6 Jun 2018 14:54:04 -0700
> > > "Samudrala, Sridhar"  wrote:
> > >  
> > >> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:  
> > >> > On Wed, 6 Jun 2018 15:30:27 +0300
> > >> > "Michael S. Tsirkin"  wrote:
> > >> >  
> > >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> > >> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
> > >> >>> wrote:  
> > >>  The net failover should be a simple library, not a virtual
> > >>  object with function callbacks (see callback hell).  
> > >> >>> Why just a library? It should do a common things. I think it should 
> > >> >>> be a
> > >> >>> virtual object. Looks like your patch again splits the common
> > >> >>> functionality into multiple drivers. That is kind of backwards 
> > >> >>> attitude.
> > >> >>> I don't get it. We should rather focus on fixing the mess the
> > >> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > >> >>> model.  
> > >> >> So it seems that at least one benefit for netvsc would be better
> > >> >> handling of renames.
> > >> >>
> > >> >> Question is how can this change to 3-netdev happen?  Stephen is
> > >> >> concerned about risk of breaking some userspace.
> > >> >>
> > >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > >> >> address, and you said then "why not use existing network namespaces
> > >> >> rather than inventing a new abstraction". So how about it then? Do you
> > >> >> want to find a way to use namespaces to hide the PV device for netvsc
> > >> >> compatibility?
> > >> >>  
> > >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's 
> > >> > and
> > >> > startups that all demand eth0 always be present. And VF may come and 
> > >> > go.
> > >> > After this history, there is a strong motivation not to change how 
> > >> > kernel
> > >> > behaves. Switching to 3 device model would be perceived as breaking
> > >> > existing userspace.  
> > >>
> > >> I think it should be possible for netvsc to work with 3 dev model if the 
> > >> only
> > >> requirement is that eth0 will always be present. With net_failover, you 
> > >> will
> > >> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an 
> > >> issue
> > >> if somehow there is userspace requirement that there can be only 2 
> > >> netdevs, not 3
> > >> when VF is plugged.
> > >>
> > >> eth0 will be the net_failover device and eth0nsby/eth1 will be the 
> > >> netvsc device
> > >> and the IP address gets configured on eth0. Will this be an issue?  
> > >
> > > DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess
> > > but that is the way it is.  
> > 
> > Why would DPDK care what we do in the kernel? Isn't it just slapping
> > vfio-pci on the netdevs it sees?
> 
> Alex, you are correct for Intel devices; but DPDK on Azure is not Intel 
> based.,.
> The DPDK support uses:
>  * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
>userspace. This means VF netdev device must exist and be visible.
>  * Slow path using kernel netvsc device, TAP and BPF to get exception
>path packets to userspace.
>  * A autodiscovery mechanism that to set all this up that relies on
>2 device model and sysfs.

Could you describe what does it look for exactly? What will break if
instead of MLX5 being a child of the PV, it's a child of the failover
device?

> In this version, there is no VFIO-PCI. And also Hyper-V does not have virtual
> IOMMU so VFIO will not work there at all.
>


Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Stephen Hemminger
On Thu, 7 Jun 2018 17:57:50 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 06, 2018 at 03:24:08PM -0700, Stephen Hemminger wrote:
> > On Thu, 7 Jun 2018 00:47:52 +0300
> > "Michael S. Tsirkin"  wrote:
> >   
> > > On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote:  
> > > > On Wed, 6 Jun 2018 15:30:27 +0300
> > > > "Michael S. Tsirkin"  wrote:
> > > > 
> > > > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
> > > > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
> > > > > > wrote:  
> > > > > > >The net failover should be a simple library, not a virtual
> > > > > > >object with function callbacks (see callback hell).  
> > > > > > 
> > > > > > Why just a library? It should do a common things. I think it should 
> > > > > > be a
> > > > > > virtual object. Looks like your patch again splits the common
> > > > > > functionality into multiple drivers. That is kind of backwards 
> > > > > > attitude.
> > > > > > I don't get it. We should rather focus on fixing the mess the
> > > > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > > > > > model.  
> > > > > 
> > > > > So it seems that at least one benefit for netvsc would be better
> > > > > handling of renames.
> > > > > 
> > > > > Question is how can this change to 3-netdev happen?  Stephen is
> > > > > concerned about risk of breaking some userspace.
> > > > > 
> > > > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > > > > address, and you said then "why not use existing network namespaces
> > > > > rather than inventing a new abstraction". So how about it then? Do you
> > > > > want to find a way to use namespaces to hide the PV device for netvsc
> > > > > compatibility?
> > > > > 
> > > > 
> > > > Netvsc can't work with 3 dev model. MS has worked with enough distro's 
> > > > and
> > > > startups that all demand eth0 always be present. And VF may come and 
> > > > go.
> > > 
> > > Well failover seems to maintain this invariant with the 3 dev model.
> > >   
> > > > After this history, there is a strong motivation not to change how 
> > > > kernel
> > > > behaves. Switching to 3 device model would be perceived as breaking
> > > > existing userspace.
> > > 
> > > I feel I'm misunderstood. I was asking whether a 3-rd device can be
> > > hidden so that userspace does not know that you switched to a 3 device
> > > model. It will think there are 2 devices and will keep working.
> > > 
> > > If you do that, then there won't be anything that
> > > would be perceived as breaking existing userspace, will there?  
> > 
> > DPDK now knows about the netvsc 2 device model and drivers in userspace
> > depend on it.  
> 
> Interesting but I'm not sure how this answers the question. How would
> DPDK care that there's a hidden device? If you can point out the
> code in question, maybe a way can be found to make changes while
> keeping it working.

See http://dpdk.org/browse/dpdk/tree/drivers/net/vdev_netvsc/vdev_netvsc.c

I am working to eliminate the necessity of this complex model in DPDK.
Having a 3 device model inside DPDK has just as many problems as in the
kernel.



Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Michael S. Tsirkin
On Wed, Jun 06, 2018 at 03:24:08PM -0700, Stephen Hemminger wrote:
> On Thu, 7 Jun 2018 00:47:52 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote:
> > > On Wed, 6 Jun 2018 15:30:27 +0300
> > > "Michael S. Tsirkin"  wrote:
> > >   
> > > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> > > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
> > > > > wrote:
> > > > > >The net failover should be a simple library, not a virtual
> > > > > >object with function callbacks (see callback hell).
> > > > > 
> > > > > Why just a library? It should do a common things. I think it should 
> > > > > be a
> > > > > virtual object. Looks like your patch again splits the common
> > > > > functionality into multiple drivers. That is kind of backwards 
> > > > > attitude.
> > > > > I don't get it. We should rather focus on fixing the mess the
> > > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > > > > model.
> > > > 
> > > > So it seems that at least one benefit for netvsc would be better
> > > > handling of renames.
> > > > 
> > > > Question is how can this change to 3-netdev happen?  Stephen is
> > > > concerned about risk of breaking some userspace.
> > > > 
> > > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > > > address, and you said then "why not use existing network namespaces
> > > > rather than inventing a new abstraction". So how about it then? Do you
> > > > want to find a way to use namespaces to hide the PV device for netvsc
> > > > compatibility?
> > > >   
> > > 
> > > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > > startups that all demand eth0 always be present. And VF may come and go.  
> > 
> > Well failover seems to maintain this invariant with the 3 dev model.
> > 
> > > After this history, there is a strong motivation not to change how kernel
> > > behaves. Switching to 3 device model would be perceived as breaking
> > > existing userspace.  
> > 
> > I feel I'm misunderstood. I was asking whether a 3-rd device can be
> > hidden so that userspace does not know that you switched to a 3 device
> > model. It will think there are 2 devices and will keep working.
> > 
> > If you do that, then there won't be anything that
> > would be perceived as breaking existing userspace, will there?
> 
> DPDK now knows about the netvsc 2 device model and drivers in userspace
> depend on it.

Interesting but I'm not sure how this answers the question. How would
DPDK care that there's a hidden device? If you can point out the
code in question, maybe a way can be found to make changes while
keeping it working.

> > 
> > 
> > > With virtio you can  work it out with the distro's yourself.
> > > There is no pre-existing semantics to deal with.
> > > 
> > > For the virtio, I don't see the need for IFF_HIDDEN.
> > > With 3-dev model as long as you mark the PV and VF devices
> > > as slaves, then userspace knows to leave them alone. Assuming userspace
> > > is already able to deal with team and bond devices.  
> > 
> > That's clear enough.
> > 
> > > Any time you introduce new UAPI behavior something breaks.  
> > 
> > Not if we do it right.
> > 
> > > On the rename front, I really don't care if VF can be renamed.  
> > 
> > OK that's nice.
> > 
> > > And for
> > > netvsc want to allow the PV device to be renamed.  
> > 
> > That's because of the 2 device model, right?  So that explains why even
> > if the delayed hack is good for the goose it might not be good for the
> > gander :)
> 
> You are bringing up the VF right away. How does the 3-device initialization
> state machine work?  Do you give a window for udev to possibly rename the
> VF? Do you rely on that?
> 
> > 
> > > Udev developers want that
> > > but have not found a stable/persistent value to expose to userspace
> > > to allow it.  


Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Stephen Hemminger
On Thu, 7 Jun 2018 07:17:51 -0700
Alexander Duyck  wrote:

> On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger
>  wrote:
> > On Wed, 6 Jun 2018 14:54:04 -0700
> > "Samudrala, Sridhar"  wrote:
> >  
> >> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:  
> >> > On Wed, 6 Jun 2018 15:30:27 +0300
> >> > "Michael S. Tsirkin"  wrote:
> >> >  
> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
> >> >>> wrote:  
> >>  The net failover should be a simple library, not a virtual
> >>  object with function callbacks (see callback hell).  
> >> >>> Why just a library? It should do a common things. I think it should be 
> >> >>> a
> >> >>> virtual object. Looks like your patch again splits the common
> >> >>> functionality into multiple drivers. That is kind of backwards 
> >> >>> attitude.
> >> >>> I don't get it. We should rather focus on fixing the mess the
> >> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> >>> model.  
> >> >> So it seems that at least one benefit for netvsc would be better
> >> >> handling of renames.
> >> >>
> >> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> >> concerned about risk of breaking some userspace.
> >> >>
> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> >> address, and you said then "why not use existing network namespaces
> >> >> rather than inventing a new abstraction". So how about it then? Do you
> >> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> >> compatibility?
> >> >>  
> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's 
> >> > and
> >> > startups that all demand eth0 always be present. And VF may come and go.
> >> > After this history, there is a strong motivation not to change how kernel
> >> > behaves. Switching to 3 device model would be perceived as breaking
> >> > existing userspace.  
> >>
> >> I think it should be possible for netvsc to work with 3 dev model if the 
> >> only
> >> requirement is that eth0 will always be present. With net_failover, you 
> >> will
> >> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an 
> >> issue
> >> if somehow there is userspace requirement that there can be only 2 
> >> netdevs, not 3
> >> when VF is plugged.
> >>
> >> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc 
> >> device
> >> and the IP address gets configured on eth0. Will this be an issue?  
> >
> > DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess
> > but that is the way it is.  
> 
> Why would DPDK care what we do in the kernel? Isn't it just slapping
> vfio-pci on the netdevs it sees?

Alex, you are correct for Intel devices; but DPDK on Azure is not Intel based.,.
The DPDK support uses:
 * Mellanox MLX5 which uses the Infinband hooks to do DMA directly to
   userspace. This means VF netdev device must exist and be visible.
 * Slow path using kernel netvsc device, TAP and BPF to get exception
   path packets to userspace.
 * A autodiscovery mechanism that to set all this up that relies on
   2 device model and sysfs.

In this version, there is no VFIO-PCI. And also Hyper-V does not have virtual
IOMMU so VFIO will not work there at all.
   


Re: [PATCH net] failover: eliminate callback hell

2018-06-07 Thread Alexander Duyck
On Wed, Jun 6, 2018 at 3:25 PM, Stephen Hemminger
 wrote:
> On Wed, 6 Jun 2018 14:54:04 -0700
> "Samudrala, Sridhar"  wrote:
>
>> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
>> > On Wed, 6 Jun 2018 15:30:27 +0300
>> > "Michael S. Tsirkin"  wrote:
>> >
>> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>>  The net failover should be a simple library, not a virtual
>>  object with function callbacks (see callback hell).
>> >>> Why just a library? It should do a common things. I think it should be a
>> >>> virtual object. Looks like your patch again splits the common
>> >>> functionality into multiple drivers. That is kind of backwards attitude.
>> >>> I don't get it. We should rather focus on fixing the mess the
>> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >>> model.
>> >> So it seems that at least one benefit for netvsc would be better
>> >> handling of renames.
>> >>
>> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> concerned about risk of breaking some userspace.
>> >>
>> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> address, and you said then "why not use existing network namespaces
>> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> compatibility?
>> >>
>> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> > startups that all demand eth0 always be present. And VF may come and go.
>> > After this history, there is a strong motivation not to change how kernel
>> > behaves. Switching to 3 device model would be perceived as breaking
>> > existing userspace.
>>
>> I think it should be possible for netvsc to work with 3 dev model if the only
>> requirement is that eth0 will always be present. With net_failover, you will
>> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an 
>> issue
>> if somehow there is userspace requirement that there can be only 2 netdevs, 
>> not 3
>> when VF is plugged.
>>
>> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc 
>> device
>> and the IP address gets configured on eth0. Will this be an issue?
>
> DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess
> but that is the way it is.

Why would DPDK care what we do in the kernel? Isn't it just slapping
vfio-pci on the netdevs it sees?


Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Stephen Hemminger
On Wed, 6 Jun 2018 15:19:30 +0300
"Michael S. Tsirkin"  wrote:

> On Tue, Jun 05, 2018 at 08:51:18PM -0700, Stephen Hemminger wrote:
> > > I think the push back was with the usage of the delay, not bringing up 
> > > the primary/standby
> > > device in the name change event handler.
> > > Can't netvsc use this mechanism instead of depending on the delay?
> > > 
> > >   
> > 
> > The patch that was rejected for netvsc was about using name change.  
> 
> So failover is now doing exactly what you wanted netvsc to do.  Rather
> than reverting everyone to old behaviour how about using more pieces
> from failover?
> 
> > Also, you can't depend on name change; you still need a timer. Not all 
> > distributions
> > change name of devices.  
> 
> So failover chose not to implement the delayed open so far.
> If it does I suspect we'll have to keep it around forever -
> kind of like netvsc seems to be stuck with it.
> But let's see if there's enough actual demand from people running
> ancient distros with latest kernels to even start looking for
> a solution for failover.
> 
> And this kind of behaviour change really should be split out
> so we can discuss it separately.
> 
> > Or user has blocked that by udev rules.  
> 
> Don't do that then?
> 

If you don't want to allow udev to rename the device, then just pull
the name change hook.


Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Stephen Hemminger
On Wed, 6 Jun 2018 14:54:04 -0700
"Samudrala, Sridhar"  wrote:

> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
> > On Wed, 6 Jun 2018 15:30:27 +0300
> > "Michael S. Tsirkin"  wrote:
> >  
> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:  
>  The net failover should be a simple library, not a virtual
>  object with function callbacks (see callback hell).  
> >>> Why just a library? It should do a common things. I think it should be a
> >>> virtual object. Looks like your patch again splits the common
> >>> functionality into multiple drivers. That is kind of backwards attitude.
> >>> I don't get it. We should rather focus on fixing the mess the
> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >>> model.  
> >> So it seems that at least one benefit for netvsc would be better
> >> handling of renames.
> >>
> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> concerned about risk of breaking some userspace.
> >>
> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> address, and you said then "why not use existing network namespaces
> >> rather than inventing a new abstraction". So how about it then? Do you
> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> compatibility?
> >>  
> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > startups that all demand eth0 always be present. And VF may come and go.
> > After this history, there is a strong motivation not to change how kernel
> > behaves. Switching to 3 device model would be perceived as breaking
> > existing userspace.  
> 
> I think it should be possible for netvsc to work with 3 dev model if the only
> requirement is that eth0 will always be present. With net_failover, you will
> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an issue
> if somehow there is userspace requirement that there can be only 2 netdevs, 
> not 3
> when VF is plugged.
> 
> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc 
> device
> and the IP address gets configured on eth0. Will this be an issue?

DPDK drivers in 18.05 depend on 2 device model. Yes it is a bit of mess
but that is the way it is.



Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Stephen Hemminger
On Thu, 7 Jun 2018 00:47:52 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote:
> > On Wed, 6 Jun 2018 15:30:27 +0300
> > "Michael S. Tsirkin"  wrote:
> >   
> > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote: 
> > > >
> > > > >The net failover should be a simple library, not a virtual
> > > > >object with function callbacks (see callback hell).
> > > > 
> > > > Why just a library? It should do a common things. I think it should be a
> > > > virtual object. Looks like your patch again splits the common
> > > > functionality into multiple drivers. That is kind of backwards attitude.
> > > > I don't get it. We should rather focus on fixing the mess the
> > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > > > model.
> > > 
> > > So it seems that at least one benefit for netvsc would be better
> > > handling of renames.
> > > 
> > > Question is how can this change to 3-netdev happen?  Stephen is
> > > concerned about risk of breaking some userspace.
> > > 
> > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > > address, and you said then "why not use existing network namespaces
> > > rather than inventing a new abstraction". So how about it then? Do you
> > > want to find a way to use namespaces to hide the PV device for netvsc
> > > compatibility?
> > >   
> > 
> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > startups that all demand eth0 always be present. And VF may come and go.  
> 
> Well failover seems to maintain this invariant with the 3 dev model.
> 
> > After this history, there is a strong motivation not to change how kernel
> > behaves. Switching to 3 device model would be perceived as breaking
> > existing userspace.  
> 
> I feel I'm misunderstood. I was asking whether a 3-rd device can be
> hidden so that userspace does not know that you switched to a 3 device
> model. It will think there are 2 devices and will keep working.
> 
> If you do that, then there won't be anything that
> would be perceived as breaking existing userspace, will there?

DPDK now knows about the netvsc 2 device model and drivers in userspace
depend on it.

> 
> 
> > With virtio you can  work it out with the distro's yourself.
> > There is no pre-existing semantics to deal with.
> > 
> > For the virtio, I don't see the need for IFF_HIDDEN.
> > With 3-dev model as long as you mark the PV and VF devices
> > as slaves, then userspace knows to leave them alone. Assuming userspace
> > is already able to deal with team and bond devices.  
> 
> That's clear enough.
> 
> > Any time you introduce new UAPI behavior something breaks.  
> 
> Not if we do it right.
> 
> > On the rename front, I really don't care if VF can be renamed.  
> 
> OK that's nice.
> 
> > And for
> > netvsc want to allow the PV device to be renamed.  
> 
> That's because of the 2 device model, right?  So that explains why even
> if the delayed hack is good for the goose it might not be good for the
> gander :)

You are bringing up the VF right away. How does the 3-device initialization
state machine work?  Do you give a window for udev to possibly rename the
VF? Do you rely on that?

> 
> > Udev developers want that
> > but have not found a stable/persistent value to expose to userspace
> > to allow it.  



Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Stephen Hemminger
On Thu, 7 Jun 2018 00:30:21 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 06, 2018 at 02:16:20PM -0700, Stephen Hemminger wrote:
> > On Tue, 5 Jun 2018 23:11:37 -0700
> > "Samudrala, Sridhar"  wrote:
> >   
> > > On 6/5/2018 11:00 PM, Stephen Hemminger wrote:  
> > > > On Tue, 5 Jun 2018 22:39:12 -0700
> > > > "Samudrala, Sridhar"  wrote:
> > > >
> > > >> On 6/5/2018 8:51 PM, Stephen Hemminger wrote:
> > > >>> On Tue, 5 Jun 2018 16:52:22 -0700
> > > >>> "Samudrala, Sridhar"  wrote:
> > > >>>   
> > >  On 6/5/2018 2:52 PM, Stephen Hemminger wrote:
> > > > On Tue, 5 Jun 2018 22:38:43 +0300
> > > > "Michael S. Tsirkin"  wrote:
> > > >  
> > > >>> See:
> > > >>>   https://patchwork.ozlabs.org/patch/851711/
> > > >> Let me try to summarize that:
> > > >>
> > > >>You wanted to speed up the delayed link up.  You had an idea to
> > > >>additionally take link up when userspace renames the interface 
> > > >> (standby
> > > >>one which is also the failover for netvsc).
> > > >>
> > > >>But userspace might not do any renames, in which case there will
> > > >>still be the delay, and so this never got applied.
> > > >>
> > > >>Is this a good summary?
> > > >>
> > > >> Davem said delay should go away completely as it's not robust, and 
> > > >> I
> > > >> think I agree.  So I don't think we should make all failover users 
> > > >> use
> > > >> delay. IIUC failover kept a delay option especially for netvsc to
> > > >> minimize the surprise factor. Hopefully we can come up with
> > > >> something more robust and drop that option completely.
> > > > The timeout was the original solution to how to complete setup after
> > > > userspace has had a chance to rename the device. Unfortunately, the 
> > > > whole network
> > > > device initialization (cooperation with udev and userspace) is a a 
> > > > mess because
> > > > there is no well defined specification, and there are multiple ways 
> > > > userspace
> > > > does this in old and new distributions.  The timeout has its own 
> > > > issues
> > > > (how long, handling errors during that window, what if userspace 
> > > > modifies other
> > > > device state); and open to finding a better solution.
> > > >
> > > > My point was that if name change can not be relied on (or used) by 
> > > > netvsc,
> > > > then we can't allow it for net_failover either.
> > >  I think the push back was with the usage of the delay, not bringing 
> > >  up the primary/standby
> > >  device in the name change event handler.
> > >  Can't netvsc use this mechanism instead of depending on the delay?
> > > 
> > >    
> > > >>> The patch that was rejected for netvsc was about using name change.
> > > >>> Also, you can't depend on name change; you still need a timer. Not 
> > > >>> all distributions
> > > >>> change name of devices. Or user has blocked that by udev rules.
> > > >> In the net_failover_slave_register() we do a dev_open() and ignore any 
> > > >> failure due to
> > > >> EBUSY and do another dev_open() in the name change event handler.
> > > >> If the name is not expected to change, i would think the dev_open() at 
> > > >> the time of
> > > >> register will succeed.
> > > > The problem is your first dev_open will bring device up and lockout
> > > > udev from changing the network device name.
> > > 
> > > I have tried with/without udev and didn't see any issue with the naming 
> > > of the primary/standby
> > > devices in my testing.
> > > 
> > > With the 3-netdev failover model, we are only interested in setting the 
> > > right name for the failover
> > > netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does 
> > > it really matter if udev fails
> > > to rename the lower primary/standby netdevs, i don't think it will 
> > > matter? The user is not expected
> > > to touch the lower netdevs.  
> > 
> > Renaming matters to some users and the udev maintainers. They want the VF 
> > to be named enpXXX
> > The primary in virtio case needs to be ens3 with some cloud platforms.  
> 
> Confused. VF can't be a standby, of it's used in a failover it's a
> primary, you can't call it both enpXXX amd ens3. Could you describe the
> use case in a bit more detail?

Sorry, got things backwards.  The primary is VF and it should be possible
to have it renamed based on PCI info.
The standby PV (in 3 dev model) would get renamed by udev to ens3.
So the failover device would just be ethX right?

> > 
> > I think you need to get rid of triggering off of the name change.  
> 
> Worth considering down the road since it's a bit of a hack but it's one
> we won't have trouble supporting, unlike the delayed uplink.

You can't depend on userspace doing rename.

> 
> > Long term, let's open the discussion about allowing network 

Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Samudrala, Sridhar




On 6/6/2018 2:24 PM, Stephen Hemminger wrote:

On Wed, 6 Jun 2018 15:30:27 +0300
"Michael S. Tsirkin"  wrote:


On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:

Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:

The net failover should be a simple library, not a virtual
object with function callbacks (see callback hell).

Why just a library? It should do a common things. I think it should be a
virtual object. Looks like your patch again splits the common
functionality into multiple drivers. That is kind of backwards attitude.
I don't get it. We should rather focus on fixing the mess the
introduction of netvsc-bonding caused and switch netvsc to 3-netdev
model.

So it seems that at least one benefit for netvsc would be better
handling of renames.

Question is how can this change to 3-netdev happen?  Stephen is
concerned about risk of breaking some userspace.

Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
address, and you said then "why not use existing network namespaces
rather than inventing a new abstraction". So how about it then? Do you
want to find a way to use namespaces to hide the PV device for netvsc
compatibility?


Netvsc can't work with 3 dev model. MS has worked with enough distro's and
startups that all demand eth0 always be present. And VF may come and go.
After this history, there is a strong motivation not to change how kernel
behaves. Switching to 3 device model would be perceived as breaking
existing userspace.


I think it should be possible for netvsc to work with 3 dev model if the only
requirement is that eth0 will always be present. With net_failover, you will
see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an issue
if somehow there is userspace requirement that there can be only 2 netdevs, not 
3
when VF is plugged.

eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc device
and the IP address gets configured on eth0. Will this be an issue?




With virtio you can  work it out with the distro's yourself.
There is no pre-existing semantics to deal with.

For the virtio, I don't see the need for IFF_HIDDEN.
With 3-dev model as long as you mark the PV and VF devices
as slaves, then userspace knows to leave them alone. Assuming userspace
is already able to deal with team and bond devices.
Any time you introduce new UAPI behavior something breaks.

On the rename front, I really don't care if VF can be renamed. And for
netvsc want to allow the PV device to be renamed. Udev developers want that
but have not found a stable/persistent value to expose to userspace
to allow it.




Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Michael S. Tsirkin
On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote:
> On Wed, 6 Jun 2018 15:30:27 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
> > > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:  
> > > >The net failover should be a simple library, not a virtual
> > > >object with function callbacks (see callback hell).  
> > > 
> > > Why just a library? It should do a common things. I think it should be a
> > > virtual object. Looks like your patch again splits the common
> > > functionality into multiple drivers. That is kind of backwards attitude.
> > > I don't get it. We should rather focus on fixing the mess the
> > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > > model.  
> > 
> > So it seems that at least one benefit for netvsc would be better
> > handling of renames.
> > 
> > Question is how can this change to 3-netdev happen?  Stephen is
> > concerned about risk of breaking some userspace.
> > 
> > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > address, and you said then "why not use existing network namespaces
> > rather than inventing a new abstraction". So how about it then? Do you
> > want to find a way to use namespaces to hide the PV device for netvsc
> > compatibility?
> > 
> 
> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> startups that all demand eth0 always be present. And VF may come and go.

Well failover seems to maintain this invariant with the 3 dev model.

> After this history, there is a strong motivation not to change how kernel
> behaves. Switching to 3 device model would be perceived as breaking
> existing userspace.

I feel I'm misunderstood. I was asking whether a 3-rd device can be
hidden so that userspace does not know that you switched to a 3 device
model. It will think there are 2 devices and will keep working.

If you do that, then there won't be anything that
would be perceived as breaking existing userspace, will there?


> With virtio you can  work it out with the distro's yourself.
> There is no pre-existing semantics to deal with.
> 
> For the virtio, I don't see the need for IFF_HIDDEN.
> With 3-dev model as long as you mark the PV and VF devices
> as slaves, then userspace knows to leave them alone. Assuming userspace
> is already able to deal with team and bond devices.

That's clear enough.

> Any time you introduce new UAPI behavior something breaks.

Not if we do it right.

> On the rename front, I really don't care if VF can be renamed.

OK that's nice.

> And for
> netvsc want to allow the PV device to be renamed.

That's because of the 2 device model, right?  So that explains why even
if the delayed hack is good for the goose it might not be good for the
gander :)

> Udev developers want that
> but have not found a stable/persistent value to expose to userspace
> to allow it.


Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Michael S. Tsirkin
On Wed, Jun 06, 2018 at 02:16:20PM -0700, Stephen Hemminger wrote:
> On Tue, 5 Jun 2018 23:11:37 -0700
> "Samudrala, Sridhar"  wrote:
> 
> > On 6/5/2018 11:00 PM, Stephen Hemminger wrote:
> > > On Tue, 5 Jun 2018 22:39:12 -0700
> > > "Samudrala, Sridhar"  wrote:
> > >  
> > >> On 6/5/2018 8:51 PM, Stephen Hemminger wrote:  
> > >>> On Tue, 5 Jun 2018 16:52:22 -0700
> > >>> "Samudrala, Sridhar"  wrote:
> > >>> 
> >  On 6/5/2018 2:52 PM, Stephen Hemminger wrote:  
> > > On Tue, 5 Jun 2018 22:38:43 +0300
> > > "Michael S. Tsirkin"  wrote:
> > >
> > >>> See:
> > >>>   https://patchwork.ozlabs.org/patch/851711/  
> > >> Let me try to summarize that:
> > >>
> > >>  You wanted to speed up the delayed link up.  You had an idea to
> > >>  additionally take link up when userspace renames the interface 
> > >> (standby
> > >>  one which is also the failover for netvsc).
> > >>
> > >>  But userspace might not do any renames, in which case there will
> > >>  still be the delay, and so this never got applied.
> > >>
> > >>  Is this a good summary?
> > >>
> > >> Davem said delay should go away completely as it's not robust, and I
> > >> think I agree.  So I don't think we should make all failover users 
> > >> use
> > >> delay. IIUC failover kept a delay option especially for netvsc to
> > >> minimize the surprise factor. Hopefully we can come up with
> > >> something more robust and drop that option completely.  
> > > The timeout was the original solution to how to complete setup after
> > > userspace has had a chance to rename the device. Unfortunately, the 
> > > whole network
> > > device initialization (cooperation with udev and userspace) is a a 
> > > mess because
> > > there is no well defined specification, and there are multiple ways 
> > > userspace
> > > does this in old and new distributions.  The timeout has its own 
> > > issues
> > > (how long, handling errors during that window, what if userspace 
> > > modifies other
> > > device state); and open to finding a better solution.
> > >
> > > My point was that if name change can not be relied on (or used) by 
> > > netvsc,
> > > then we can't allow it for net_failover either.  
> >  I think the push back was with the usage of the delay, not bringing up 
> >  the primary/standby
> >  device in the name change event handler.
> >  Can't netvsc use this mechanism instead of depending on the delay?
> > 
> >  
> > >>> The patch that was rejected for netvsc was about using name change.
> > >>> Also, you can't depend on name change; you still need a timer. Not all 
> > >>> distributions
> > >>> change name of devices. Or user has blocked that by udev rules.  
> > >> In the net_failover_slave_register() we do a dev_open() and ignore any 
> > >> failure due to
> > >> EBUSY and do another dev_open() in the name change event handler.
> > >> If the name is not expected to change, i would think the dev_open() at 
> > >> the time of
> > >> register will succeed.  
> > > The problem is your first dev_open will bring device up and lockout
> > > udev from changing the network device name.  
> > 
> > I have tried with/without udev and didn't see any issue with the naming of 
> > the primary/standby
> > devices in my testing.
> > 
> > With the 3-netdev failover model, we are only interested in setting the 
> > right name for the failover
> > netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does it 
> > really matter if udev fails
> > to rename the lower primary/standby netdevs, i don't think it will matter? 
> > The user is not expected
> > to touch the lower netdevs.
> 
> Renaming matters to some users and the udev maintainers. They want the VF to 
> be named enpXXX
> The primary in virtio case needs to be ens3 with some cloud platforms.

Confused. VF can't be a standby, of it's used in a failover it's a
primary, you can't call it both enpXXX amd ens3. Could you describe the
use case in a bit more detail?

> 
> I think you need to get rid of triggering off of the name change.

Worth considering down the road since it's a bit of a hack but it's one
we won't have trouble supporting, unlike the delayed uplink.

> Long term, let's open the discussion about allowing network devices to change 
> name when up?
> Maybe with NETIF_LIVENAME_CHANGE flag?

That's probably the cleanest approach assuming it can be made to work
without races. I suspect we can just live with what we have until then.


-- 
MST


Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Stephen Hemminger
On Wed, 6 Jun 2018 09:25:12 +0200
Jiri Pirko  wrote:

> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
> >The net failover should be a simple library, not a virtual
> >object with function callbacks (see callback hell).  
> 
> Why just a library? It should do a common things. I think it should be a
> virtual object. Looks like your patch again splits the common
> functionality into multiple drivers. That is kind of backwards attitude.
> I don't get it. We should rather focus on fixing the mess the
> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> model.

To me, callbacks are worse than a simple notifier. Is there some other
agenda going on? do you want to get rid of network notifier.

Netvsc is not switching to 3 device model in near future.
Too much userspace is wrapped around it already.


Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Stephen Hemminger
On Wed, 6 Jun 2018 15:30:27 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:  
> > >The net failover should be a simple library, not a virtual
> > >object with function callbacks (see callback hell).  
> > 
> > Why just a library? It should do a common things. I think it should be a
> > virtual object. Looks like your patch again splits the common
> > functionality into multiple drivers. That is kind of backwards attitude.
> > I don't get it. We should rather focus on fixing the mess the
> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > model.  
> 
> So it seems that at least one benefit for netvsc would be better
> handling of renames.
> 
> Question is how can this change to 3-netdev happen?  Stephen is
> concerned about risk of breaking some userspace.
> 
> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> address, and you said then "why not use existing network namespaces
> rather than inventing a new abstraction". So how about it then? Do you
> want to find a way to use namespaces to hide the PV device for netvsc
> compatibility?
> 

Netvsc can't work with 3 dev model. MS has worked with enough distro's and
startups that all demand eth0 always be present. And VF may come and go.
After this history, there is a strong motivation not to change how kernel
behaves. Switching to 3 device model would be perceived as breaking
existing userspace.

With virtio you can  work it out with the distro's yourself.
There is no pre-existing semantics to deal with.

For the virtio, I don't see the need for IFF_HIDDEN.
With 3-dev model as long as you mark the PV and VF devices
as slaves, then userspace knows to leave them alone. Assuming userspace
is already able to deal with team and bond devices.
Any time you introduce new UAPI behavior something breaks.

On the rename front, I really don't care if VF can be renamed. And for
netvsc want to allow the PV device to be renamed. Udev developers want that
but have not found a stable/persistent value to expose to userspace
to allow it.


Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Stephen Hemminger
On Tue, 5 Jun 2018 23:11:37 -0700
"Samudrala, Sridhar"  wrote:

> On 6/5/2018 11:00 PM, Stephen Hemminger wrote:
> > On Tue, 5 Jun 2018 22:39:12 -0700
> > "Samudrala, Sridhar"  wrote:
> >  
> >> On 6/5/2018 8:51 PM, Stephen Hemminger wrote:  
> >>> On Tue, 5 Jun 2018 16:52:22 -0700
> >>> "Samudrala, Sridhar"  wrote:
> >>> 
>  On 6/5/2018 2:52 PM, Stephen Hemminger wrote:  
> > On Tue, 5 Jun 2018 22:38:43 +0300
> > "Michael S. Tsirkin"  wrote:
> >
> >>> See:
> >>>   https://patchwork.ozlabs.org/patch/851711/  
> >> Let me try to summarize that:
> >>
> >>You wanted to speed up the delayed link up.  You had an idea to
> >>additionally take link up when userspace renames the interface 
> >> (standby
> >>one which is also the failover for netvsc).
> >>
> >>But userspace might not do any renames, in which case there will
> >>still be the delay, and so this never got applied.
> >>
> >>Is this a good summary?
> >>
> >> Davem said delay should go away completely as it's not robust, and I
> >> think I agree.  So I don't think we should make all failover users use
> >> delay. IIUC failover kept a delay option especially for netvsc to
> >> minimize the surprise factor. Hopefully we can come up with
> >> something more robust and drop that option completely.  
> > The timeout was the original solution to how to complete setup after
> > userspace has had a chance to rename the device. Unfortunately, the 
> > whole network
> > device initialization (cooperation with udev and userspace) is a a mess 
> > because
> > there is no well defined specification, and there are multiple ways 
> > userspace
> > does this in old and new distributions.  The timeout has its own issues
> > (how long, handling errors during that window, what if userspace 
> > modifies other
> > device state); and open to finding a better solution.
> >
> > My point was that if name change can not be relied on (or used) by 
> > netvsc,
> > then we can't allow it for net_failover either.  
>  I think the push back was with the usage of the delay, not bringing up 
>  the primary/standby
>  device in the name change event handler.
>  Can't netvsc use this mechanism instead of depending on the delay?
> 
>  
> >>> The patch that was rejected for netvsc was about using name change.
> >>> Also, you can't depend on name change; you still need a timer. Not all 
> >>> distributions
> >>> change name of devices. Or user has blocked that by udev rules.  
> >> In the net_failover_slave_register() we do a dev_open() and ignore any 
> >> failure due to
> >> EBUSY and do another dev_open() in the name change event handler.
> >> If the name is not expected to change, i would think the dev_open() at the 
> >> time of
> >> register will succeed.  
> > The problem is your first dev_open will bring device up and lockout
> > udev from changing the network device name.  
> 
> I have tried with/without udev and didn't see any issue with the naming of 
> the primary/standby
> devices in my testing.
> 
> With the 3-netdev failover model, we are only interested in setting the right 
> name for the failover
> netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does it 
> really matter if udev fails
> to rename the lower primary/standby netdevs, i don't think it will matter? 
> The user is not expected
> to touch the lower netdevs.

Renaming matters to some users and the udev maintainers. They want the VF to be 
named enpXXX
The primary in virtio case needs to be ens3 with some cloud platforms.

I think you need to get rid of triggering off of the name change.

Long term, let's open the discussion about allowing network devices to change 
name when up?
Maybe with NETIF_LIVENAME_CHANGE flag?






Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Michael S. Tsirkin
On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
> Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
> >The net failover should be a simple library, not a virtual
> >object with function callbacks (see callback hell).
> 
> Why just a library? It should do a common things. I think it should be a
> virtual object. Looks like your patch again splits the common
> functionality into multiple drivers. That is kind of backwards attitude.
> I don't get it. We should rather focus on fixing the mess the
> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> model.

So it seems that at least one benefit for netvsc would be better
handling of renames.

Question is how can this change to 3-netdev happen?  Stephen is
concerned about risk of breaking some userspace.

Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
address, and you said then "why not use existing network namespaces
rather than inventing a new abstraction". So how about it then? Do you
want to find a way to use namespaces to hide the PV device for netvsc
compatibility?

-- 
MST


Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Michael S. Tsirkin
On Tue, Jun 05, 2018 at 08:51:18PM -0700, Stephen Hemminger wrote:
> > I think the push back was with the usage of the delay, not bringing up the 
> > primary/standby
> > device in the name change event handler.
> > Can't netvsc use this mechanism instead of depending on the delay?
> > 
> > 
> 
> The patch that was rejected for netvsc was about using name change.

So failover is now doing exactly what you wanted netvsc to do.  Rather
than reverting everyone to old behaviour how about using more pieces
from failover?

> Also, you can't depend on name change; you still need a timer. Not all 
> distributions
> change name of devices.

So failover chose not to implement the delayed open so far.
If it does I suspect we'll have to keep it around forever -
kind of like netvsc seems to be stuck with it.
But let's see if there's enough actual demand from people running
ancient distros with latest kernels to even start looking for
a solution for failover.

And this kind of behaviour change really should be split out
so we can discuss it separately.

> Or user has blocked that by udev rules.

Don't do that then?

-- 
MST


Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Jiri Pirko
Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>The net failover should be a simple library, not a virtual
>object with function callbacks (see callback hell).

Why just a library? It should do a common things. I think it should be a
virtual object. Looks like your patch again splits the common
functionality into multiple drivers. That is kind of backwards attitude.
I don't get it. We should rather focus on fixing the mess the
introduction of netvsc-bonding caused and switch netvsc to 3-netdev
model.


Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Samudrala, Sridhar




On 6/5/2018 11:00 PM, Stephen Hemminger wrote:

On Tue, 5 Jun 2018 22:39:12 -0700
"Samudrala, Sridhar"  wrote:


On 6/5/2018 8:51 PM, Stephen Hemminger wrote:

On Tue, 5 Jun 2018 16:52:22 -0700
"Samudrala, Sridhar"  wrote:
  

On 6/5/2018 2:52 PM, Stephen Hemminger wrote:

On Tue, 5 Jun 2018 22:38:43 +0300
"Michael S. Tsirkin"  wrote:
 

See:
  https://patchwork.ozlabs.org/patch/851711/

Let me try to summarize that:

You wanted to speed up the delayed link up.  You had an idea to
additionally take link up when userspace renames the interface (standby
one which is also the failover for netvsc).

But userspace might not do any renames, in which case there will
still be the delay, and so this never got applied.

Is this a good summary?

Davem said delay should go away completely as it's not robust, and I
think I agree.  So I don't think we should make all failover users use
delay. IIUC failover kept a delay option especially for netvsc to
minimize the surprise factor. Hopefully we can come up with
something more robust and drop that option completely.

The timeout was the original solution to how to complete setup after
userspace has had a chance to rename the device. Unfortunately, the whole 
network
device initialization (cooperation with udev and userspace) is a a mess because
there is no well defined specification, and there are multiple ways userspace
does this in old and new distributions.  The timeout has its own issues
(how long, handling errors during that window, what if userspace modifies other
device state); and open to finding a better solution.

My point was that if name change can not be relied on (or used) by netvsc,
then we can't allow it for net_failover either.

I think the push back was with the usage of the delay, not bringing up the 
primary/standby
device in the name change event handler.
Can't netvsc use this mechanism instead of depending on the delay?

  

The patch that was rejected for netvsc was about using name change.
Also, you can't depend on name change; you still need a timer. Not all 
distributions
change name of devices. Or user has blocked that by udev rules.

In the net_failover_slave_register() we do a dev_open() and ignore any failure 
due to
EBUSY and do another dev_open() in the name change event handler.
If the name is not expected to change, i would think the dev_open() at the time 
of
register will succeed.

The problem is your first dev_open will bring device up and lockout
udev from changing the network device name.


I have tried with/without udev and didn't see any issue with the naming of the 
primary/standby
devices in my testing.

With the 3-netdev failover model, we are only interested in setting the right 
name for the failover
netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does it 
really matter if udev fails
to rename the lower primary/standby netdevs, i don't think it will matter? The 
user is not expected
to touch the lower netdevs.




Re: [PATCH net] failover: eliminate callback hell

2018-06-06 Thread Stephen Hemminger
On Tue, 5 Jun 2018 22:39:12 -0700
"Samudrala, Sridhar"  wrote:

> On 6/5/2018 8:51 PM, Stephen Hemminger wrote:
> > On Tue, 5 Jun 2018 16:52:22 -0700
> > "Samudrala, Sridhar"  wrote:
> >  
> >> On 6/5/2018 2:52 PM, Stephen Hemminger wrote:  
> >>> On Tue, 5 Jun 2018 22:38:43 +0300
> >>> "Michael S. Tsirkin"  wrote:
> >>> 
> > See:
> >  https://patchwork.ozlabs.org/patch/851711/  
>  Let me try to summarize that:
> 
>   You wanted to speed up the delayed link up.  You had an idea to
>   additionally take link up when userspace renames the interface (standby
>   one which is also the failover for netvsc).
> 
>   But userspace might not do any renames, in which case there will
>   still be the delay, and so this never got applied.
> 
>   Is this a good summary?
> 
>  Davem said delay should go away completely as it's not robust, and I
>  think I agree.  So I don't think we should make all failover users use
>  delay. IIUC failover kept a delay option especially for netvsc to
>  minimize the surprise factor. Hopefully we can come up with
>  something more robust and drop that option completely.  
> >>> The timeout was the original solution to how to complete setup after
> >>> userspace has had a chance to rename the device. Unfortunately, the whole 
> >>> network
> >>> device initialization (cooperation with udev and userspace) is a a mess 
> >>> because
> >>> there is no well defined specification, and there are multiple ways 
> >>> userspace
> >>> does this in old and new distributions.  The timeout has its own issues
> >>> (how long, handling errors during that window, what if userspace modifies 
> >>> other
> >>> device state); and open to finding a better solution.
> >>>
> >>> My point was that if name change can not be relied on (or used) by netvsc,
> >>> then we can't allow it for net_failover either.  
> >> I think the push back was with the usage of the delay, not bringing up the 
> >> primary/standby
> >> device in the name change event handler.
> >> Can't netvsc use this mechanism instead of depending on the delay?
> >>
> >>  
> > The patch that was rejected for netvsc was about using name change.
> > Also, you can't depend on name change; you still need a timer. Not all 
> > distributions
> > change name of devices. Or user has blocked that by udev rules.  
> 
> In the net_failover_slave_register() we do a dev_open() and ignore any 
> failure due to
> EBUSY and do another dev_open() in the name change event handler.
> If the name is not expected to change, i would think the dev_open() at the 
> time of
> register will succeed.

The problem is your first dev_open will bring device up and lockout
udev from changing the network device name.


Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Samudrala, Sridhar

On 6/5/2018 8:51 PM, Stephen Hemminger wrote:

On Tue, 5 Jun 2018 16:52:22 -0700
"Samudrala, Sridhar"  wrote:


On 6/5/2018 2:52 PM, Stephen Hemminger wrote:

On Tue, 5 Jun 2018 22:38:43 +0300
"Michael S. Tsirkin"  wrote:
  

See:
 https://patchwork.ozlabs.org/patch/851711/

Let me try to summarize that:

You wanted to speed up the delayed link up.  You had an idea to
additionally take link up when userspace renames the interface (standby
one which is also the failover for netvsc).

But userspace might not do any renames, in which case there will
still be the delay, and so this never got applied.

Is this a good summary?

Davem said delay should go away completely as it's not robust, and I
think I agree.  So I don't think we should make all failover users use
delay. IIUC failover kept a delay option especially for netvsc to
minimize the surprise factor. Hopefully we can come up with
something more robust and drop that option completely.

The timeout was the original solution to how to complete setup after
userspace has had a chance to rename the device. Unfortunately, the whole 
network
device initialization (cooperation with udev and userspace) is a a mess because
there is no well defined specification, and there are multiple ways userspace
does this in old and new distributions.  The timeout has its own issues
(how long, handling errors during that window, what if userspace modifies other
device state); and open to finding a better solution.

My point was that if name change can not be relied on (or used) by netvsc,
then we can't allow it for net_failover either.

I think the push back was with the usage of the delay, not bringing up the 
primary/standby
device in the name change event handler.
Can't netvsc use this mechanism instead of depending on the delay?



The patch that was rejected for netvsc was about using name change.
Also, you can't depend on name change; you still need a timer. Not all 
distributions
change name of devices. Or user has blocked that by udev rules.


In the net_failover_slave_register() we do a dev_open() and ignore any failure 
due to
EBUSY and do another dev_open() in the name change event handler.
If the name is not expected to change, i would think the dev_open() at the time 
of
register will succeed.




Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Stephen Hemminger
On Tue, 5 Jun 2018 16:52:22 -0700
"Samudrala, Sridhar"  wrote:

> On 6/5/2018 2:52 PM, Stephen Hemminger wrote:
> > On Tue, 5 Jun 2018 22:38:43 +0300
> > "Michael S. Tsirkin"  wrote:
> >  
> >>> See:
> >>> https://patchwork.ozlabs.org/patch/851711/  
> >> Let me try to summarize that:
> >>
> >>You wanted to speed up the delayed link up.  You had an idea to
> >>additionally take link up when userspace renames the interface (standby
> >>one which is also the failover for netvsc).
> >>
> >>But userspace might not do any renames, in which case there will
> >>still be the delay, and so this never got applied.
> >>
> >>Is this a good summary?
> >>
> >> Davem said delay should go away completely as it's not robust, and I
> >> think I agree.  So I don't think we should make all failover users use
> >> delay. IIUC failover kept a delay option especially for netvsc to
> >> minimize the surprise factor. Hopefully we can come up with
> >> something more robust and drop that option completely.  
> > The timeout was the original solution to how to complete setup after
> > userspace has had a chance to rename the device. Unfortunately, the whole 
> > network
> > device initialization (cooperation with udev and userspace) is a a mess 
> > because
> > there is no well defined specification, and there are multiple ways 
> > userspace
> > does this in old and new distributions.  The timeout has its own issues
> > (how long, handling errors during that window, what if userspace modifies 
> > other
> > device state); and open to finding a better solution.
> >
> > My point was that if name change can not be relied on (or used) by netvsc,
> > then we can't allow it for net_failover either.  
> 
> I think the push back was with the usage of the delay, not bringing up the 
> primary/standby
> device in the name change event handler.
> Can't netvsc use this mechanism instead of depending on the delay?
> 
> 

The patch that was rejected for netvsc was about using name change.
Also, you can't depend on name change; you still need a timer. Not all 
distributions
change name of devices. Or user has blocked that by udev rules.


Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Samudrala, Sridhar

On 6/5/2018 2:52 PM, Stephen Hemminger wrote:

On Tue, 5 Jun 2018 22:38:43 +0300
"Michael S. Tsirkin"  wrote:


See:
https://patchwork.ozlabs.org/patch/851711/

Let me try to summarize that:

You wanted to speed up the delayed link up.  You had an idea to
additionally take link up when userspace renames the interface (standby
one which is also the failover for netvsc).

But userspace might not do any renames, in which case there will
still be the delay, and so this never got applied.

Is this a good summary?

Davem said delay should go away completely as it's not robust, and I
think I agree.  So I don't think we should make all failover users use
delay. IIUC failover kept a delay option especially for netvsc to
minimize the surprise factor. Hopefully we can come up with
something more robust and drop that option completely.

The timeout was the original solution to how to complete setup after
userspace has had a chance to rename the device. Unfortunately, the whole 
network
device initialization (cooperation with udev and userspace) is a a mess because
there is no well defined specification, and there are multiple ways userspace
does this in old and new distributions.  The timeout has its own issues
(how long, handling errors during that window, what if userspace modifies other
device state); and open to finding a better solution.

My point was that if name change can not be relied on (or used) by netvsc,
then we can't allow it for net_failover either.


I think the push back was with the usage of the delay, not bringing up the 
primary/standby
device in the name change event handler.
Can't netvsc use this mechanism instead of depending on the delay?




Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Stephen Hemminger
On Tue, 5 Jun 2018 22:38:43 +0300
"Michael S. Tsirkin"  wrote:

> > 
> > See:
> >https://patchwork.ozlabs.org/patch/851711/  
> 
> Let me try to summarize that:
> 
>   You wanted to speed up the delayed link up.  You had an idea to
>   additionally take link up when userspace renames the interface (standby
>   one which is also the failover for netvsc).
> 
>   But userspace might not do any renames, in which case there will
>   still be the delay, and so this never got applied.
> 
>   Is this a good summary?
> 
> Davem said delay should go away completely as it's not robust, and I
> think I agree.  So I don't think we should make all failover users use
> delay. IIUC failover kept a delay option especially for netvsc to
> minimize the surprise factor. Hopefully we can come up with
> something more robust and drop that option completely.

The timeout was the original solution to how to complete setup after
userspace has had a chance to rename the device. Unfortunately, the whole 
network
device initialization (cooperation with udev and userspace) is a a mess because
there is no well defined specification, and there are multiple ways userspace
does this in old and new distributions.  The timeout has its own issues
(how long, handling errors during that window, what if userspace modifies other
device state); and open to finding a better solution.

My point was that if name change can not be relied on (or used) by netvsc,
then we can't allow it for net_failover either.


Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Michael S. Tsirkin
On Tue, Jun 05, 2018 at 11:53:05AM -0700, Stephen Hemminger wrote:
> > >   * Now, netvsc and net_failover use the same delayed work type
> > > mechanism for setup. Previously, net_failover code was triggering off
> > > name change but a similar policy was rejected for netvsc.
> > > "what is good for the goose is good for the gander"  
> > 
> > I don't really understand what you are saying here.  I think the delayed
> > hack is kind of ugly and seems racy.  Current failover code was rejected
> > by whom?  Why is new one good and for whom?  Did you want to do a name
> > change in netvsc but it was rejected? Could you clarify please?
> 
> See:
>https://patchwork.ozlabs.org/patch/851711/

Let me try to summarize that:

You wanted to speed up the delayed link up.  You had an idea to
additionally take link up when userspace renames the interface (standby
one which is also the failover for netvsc).

But userspace might not do any renames, in which case there will
still be the delay, and so this never got applied.

Is this a good summary?

Davem said delay should go away completely as it's not robust, and I
think I agree.  So I don't think we should make all failover users use
delay. IIUC failover kept a delay option especially for netvsc to
minimize the surprise factor. Hopefully we can come up with
something more robust and drop that option completely.


> > >   * Set permanent and current address of net_failover device
> > > to match the primary.
> > > 
> > >   * Carrier should be marked off before registering device
> > > the net_failover device.  
> > 
> > Are above two bugfixes?
> 
> Yes.

Maybe fix these two as first patches in the set?

> > > Although this patch needs to go into 4.18 (linux-net),  
> > 
> > I'd rather we focused on fixing bugs in 4.18, and left refactoring to
> > 4.19.
> >
> 
> Either we fix or revert the current code in 4.18.
> Sorry, I am not having callback hell code in any vendor or upstream kernel.

I agree callbacks add complexity which often isn't necessary, so
removing them where possible is a good cleanup.  But maybe a patch
shouldn't mix bugfixes, cleanups and behaviour changes all together.  If
nothing else it makes review harder.  Splitting patches up might make it
more likely they can go into 4.18 which seems to be what you want.

HTH,
-- 
MST


Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Stephen Hemminger
On Tue, 5 Jun 2018 21:35:26 +0300
"Michael S. Tsirkin"  wrote:

> Thanks, I think this is nice patch but I wonder whether it can be split
> up somewhat. Not all of it is uncontroversial.

I started that way, but then I was fixing code that was later deleted.
The big change was eliminating the callbacks.

> 
> On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
> >   * The matching of secondary device to primary device policy
> > is up to the network device. Both net_failover and netvsc
> > will use MAC for now but can change separately.  
> 
> I actually suspect both will change to a serial number
> down the road.
> 
> >   * The match policy is only used during initial discovery; after
> > that the secondary device knows what the upper device is because
> > of the parent/child relationship; no searching is required.  
> 
> That would obviously be an improvement - does it have to be tied with
> rest of changes?

This was not possible with the version of the common code that
is in net now.

> 
> >   * Now, netvsc and net_failover use the same delayed work type
> > mechanism for setup. Previously, net_failover code was triggering off
> > name change but a similar policy was rejected for netvsc.
> > "what is good for the goose is good for the gander"  
> 
> I don't really understand what you are saying here.  I think the delayed
> hack is kind of ugly and seems racy.  Current failover code was rejected
> by whom?  Why is new one good and for whom?  Did you want to do a name
> change in netvsc but it was rejected? Could you clarify please?

See:
   https://patchwork.ozlabs.org/patch/851711/

> 
> >   * The net_failover private device info 'struct net_failover_info'
> > should have been private to the driver file, not a visible
> > API.
> > 
> >   * The net_failover device should use SET_NETDEV_DEV
> > that is intended only for physical devices not virtual devices.  
> 
> You mean should not.

Yes. Virtual device should not set device parent.

> 
> >   * No point in having DocBook style comments on a driver file.
> > They only make sense on an external exposed API.
> > 
> >   * net_failover only supports Ethernet, so use ether_addr_copy.  
> 
> It is since you need to know about all the things you need to copy, and
> because of mac matching.  But it isn't too much effort to add more
> transports and I don't see value in going in the reverse direction and
> making it more ethernet specific that it already is.

Sure, then do memcpy base on addr_len

> 
> >   * Set permanent and current address of net_failover device
> > to match the primary.
> > 
> >   * Carrier should be marked off before registering device
> > the net_failover device.  
> 
> Are above two bugfixes?

Yes.

> 
> >   * Use netdev_XXX for log messages, in net_failover (not dev_xxx)
> > 
> >   * Since failover infrastructure is about linking devices just
> > use RTNL no need for other locking in init and teardown.
> > 
> >   * Don't bother with ERR_PTR() style return if only possible
> > return is success or no memory.
> > 
> >   * As much as possible, the terms master and slave should be avoided
> > because of their cultural connotations.  
> 
> Also for consistency, failover is calling these primary and standby now.

Good, let's standardize on that. 

> 
> > Note; this code has been tested on Hyper-V
> > but is compile tested only on virtio.
> > 
> > Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
> > Signed-off-by: Stephen Hemminger 
> > ---
> > 
> > Although this patch needs to go into 4.18 (linux-net),  
> 
> I'd rather we focused on fixing bugs in 4.18, and left refactoring to
> 4.19.
>

Either we fix or revert the current code in 4.18.
Sorry, I am not having callback hell code in any vendor or upstream kernel.


Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Michael S. Tsirkin
Thanks, I think this is nice patch but I wonder whether it can be split
up somewhat. Not all of it is uncontroversial.

On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
>   * The matching of secondary device to primary device policy
> is up to the network device. Both net_failover and netvsc
> will use MAC for now but can change separately.

I actually suspect both will change to a serial number
down the road.

>   * The match policy is only used during initial discovery; after
> that the secondary device knows what the upper device is because
> of the parent/child relationship; no searching is required.

That would obviously be an improvement - does it have to be tied with
rest of changes?

>   * Now, netvsc and net_failover use the same delayed work type
> mechanism for setup. Previously, net_failover code was triggering off
> name change but a similar policy was rejected for netvsc.
> "what is good for the goose is good for the gander"

I don't really understand what you are saying here.  I think the delayed
hack is kind of ugly and seems racy.  Current failover code was rejected
by whom?  Why is new one good and for whom?  Did you want to do a name
change in netvsc but it was rejected? Could you clarify please?

>   * The net_failover private device info 'struct net_failover_info'
> should have been private to the driver file, not a visible
> API.
> 
>   * The net_failover device should use SET_NETDEV_DEV
> that is intended only for physical devices not virtual devices.

You mean should not.

>   * No point in having DocBook style comments on a driver file.
> They only make sense on an external exposed API.
> 
>   * net_failover only supports Ethernet, so use ether_addr_copy.

It is since you need to know about all the things you need to copy, and
because of mac matching.  But it isn't too much effort to add more
transports and I don't see value in going in the reverse direction and
making it more ethernet specific that it already is.

>   * Set permanent and current address of net_failover device
> to match the primary.
> 
>   * Carrier should be marked off before registering device
> the net_failover device.

Are above two bugfixes?

>   * Use netdev_XXX for log messages, in net_failover (not dev_xxx)
> 
>   * Since failover infrastructure is about linking devices just
> use RTNL no need for other locking in init and teardown.
> 
>   * Don't bother with ERR_PTR() style return if only possible
> return is success or no memory.
> 
>   * As much as possible, the terms master and slave should be avoided
> because of their cultural connotations.

Also for consistency, failover is calling these primary and standby now.

> Note; this code has been tested on Hyper-V
> but is compile tested only on virtio.
> 
> Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
> Signed-off-by: Stephen Hemminger 
> ---
> 
> Although this patch needs to go into 4.18 (linux-net),

I'd rather we focused on fixing bugs in 4.18, and left refactoring to
4.19.

At some point you said refactoring is needed to support matching using
the serial number, but I see this didn't make 4.18. So no rush IMHO.

> this version is based against net-next because net-next
> hasn't been merged into linux-net yet.
> 
> 
>  drivers/net/hyperv/hyperv_net.h |   3 +-
>  drivers/net/hyperv/netvsc_drv.c | 173 +++--
>  drivers/net/net_failover.c  | 312 ---
>  drivers/net/virtio_net.c|   9 +-
>  include/net/failover.h  |  31 +---
>  include/net/net_failover.h  |  32 +---
>  net/Kconfig |  13 +-
>  net/core/failover.c | 316 
>  8 files changed, 373 insertions(+), 516 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 99d8e7398a5b..c7d25d10765e 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -902,6 +902,8 @@ struct net_device_context {
>   struct hv_device *device_ctx;
>   /* netvsc_device */
>   struct netvsc_device __rcu *nvdev;
> + /* list of netvsc net_devices */
> + struct list_head list;
>   /* reconfigure work */
>   struct delayed_work dwork;
>   /* last reconfig time */
> @@ -933,7 +935,6 @@ struct net_device_context {
>   /* Serial number of the VF to team with */
>   u32 vf_serial;
>  
> - struct failover *failover;
>  };
>  
>  /* Per channel data */
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index bef4d55a108c..074e6b8578df 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -70,6 +70,8 @@ static int debug = -1;
>  module_param(debug, int, 0444);
>  MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
>  
> +static LIST_HEAD(netvsc_dev_list);
> +
>  static void netvsc_change_rx_flags(struct net_device 

Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread David Miller
From: Stephen Hemminger 
Date: Tue, 5 Jun 2018 10:45:10 -0700

> I said it wasn't tested. Not surprising. Don't have a version of KVM
> that supports standby (and not going to build KVM from scratch for
> this).

It would definitely help me if you put "RFC" in the subject line
for patches which aren't tested :-)

Thanks.


Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Stephen Hemminger
On Tue, 5 Jun 2018 10:22:13 -0700
"Samudrala, Sridhar"  wrote:

> On 6/4/2018 8:42 PM, Stephen Hemminger wrote:
> > The net failover should be a simple library, not a virtual
> > object with function callbacks (see callback hell).
> > The code is simpler is smaller both for the netvsc and virtio use case.  
> 
> I quickly tried this patch and it breaks virtio-net in standby mode.
> I don't see failover netdev, unloading virtio-net causes a crash.

I said it wasn't tested. Not surprising. Don't have a version of KVM
that supports standby (and not going to build KVM from scratch for this).

> 
> With these changes, there is very minimal code that is shared between
> netvsc and virtio-net. The notifier and event handling code and the
> lookup_bymac routines are now duplicated in both the drivers. I thought
> we wanted to keep this code common between the 2 drivers and we went through
> multiple revisions to make sure that it works with both netvsc's 2 netdev
> and virtio-net's 3 netdev models.

The sharing is what needs to be shared; it turns out not much
is common really. The total code size is less with this version.
Also, since even with nested virtualization, both drivers will not
be present on same guest at same time.

> The reason for the indirect ops is to support these 2 different models and
> i am not sure if the overhead of the callbacks is that significant considering
> that they are not called in the hot path.

The callbacks invert the functionality. It makes everything bottom up.


Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Samudrala, Sridhar

On 6/4/2018 8:42 PM, Stephen Hemminger wrote:

The net failover should be a simple library, not a virtual
object with function callbacks (see callback hell).
The code is simpler is smaller both for the netvsc and virtio use case.


I quickly tried this patch and it breaks virtio-net in standby mode.
I don't see failover netdev, unloading virtio-net causes a crash.

With these changes, there is very minimal code that is shared between
netvsc and virtio-net. The notifier and event handling code and the
lookup_bymac routines are now duplicated in both the drivers. I thought
we wanted to keep this code common between the 2 drivers and we went through
multiple revisions to make sure that it works with both netvsc's 2 netdev
and virtio-net's 3 netdev models.

The reason for the indirect ops is to support these 2 different models and
i am not sure if the overhead of the callbacks is that significant considering
that they are not called in the hot path.






The code is restructured in many ways. I should have given these
as review comments to net_failover during review
but did not want to overwhelm the original submitter.
Therefore it was merged prematurely.

Some of the many items changed are:

   * The support routines should just be selected as needed in
 kernel config, no need for them to be visible config items.

   * Both netvsc and net_failover should keep their list of their
 own devices. Not a common list.

  * The matching of secondary device to primary device policy
 is up to the network device. Both net_failover and netvsc
 will use MAC for now but can change separately.

   * The match policy is only used during initial discovery; after
 that the secondary device knows what the upper device is because
 of the parent/child relationship; no searching is required.

   * Now, netvsc and net_failover use the same delayed work type
 mechanism for setup. Previously, net_failover code was triggering off
 name change but a similar policy was rejected for netvsc.
 "what is good for the goose is good for the gander"

   * The net_failover private device info 'struct net_failover_info'
 should have been private to the driver file, not a visible
 API.

   * The net_failover device should use SET_NETDEV_DEV
 that is intended only for physical devices not virtual devices.

   * No point in having DocBook style comments on a driver file.
 They only make sense on an external exposed API.

   * net_failover only supports Ethernet, so use ether_addr_copy.

   * Set permanent and current address of net_failover device
 to match the primary.

   * Carrier should be marked off before registering device
 the net_failover device.

   * Use netdev_XXX for log messages, in net_failover (not dev_xxx)

   * Since failover infrastructure is about linking devices just
 use RTNL no need for other locking in init and teardown.

   * Don't bother with ERR_PTR() style return if only possible
 return is success or no memory.

   * As much as possible, the terms master and slave should be avoided
 because of their cultural connotations.

Note; this code has been tested on Hyper-V
but is compile tested only on virtio.

Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
Signed-off-by: Stephen Hemminger 
---

Although this patch needs to go into 4.18 (linux-net),
this version is based against net-next because net-next
hasn't been merged into linux-net yet.


  drivers/net/hyperv/hyperv_net.h |   3 +-
  drivers/net/hyperv/netvsc_drv.c | 173 +++--
  drivers/net/net_failover.c  | 312 ---
  drivers/net/virtio_net.c|   9 +-
  include/net/failover.h  |  31 +---
  include/net/net_failover.h  |  32 +---
  net/Kconfig |  13 +-
  net/core/failover.c | 316 
  8 files changed, 373 insertions(+), 516 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 99d8e7398a5b..c7d25d10765e 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -902,6 +902,8 @@ struct net_device_context {
struct hv_device *device_ctx;
/* netvsc_device */
struct netvsc_device __rcu *nvdev;
+   /* list of netvsc net_devices */
+   struct list_head list;
/* reconfigure work */
struct delayed_work dwork;
/* last reconfig time */
@@ -933,7 +935,6 @@ struct net_device_context {
/* Serial number of the VF to team with */
u32 vf_serial;
  
-	struct failover *failover;

  };
  
  /* Per channel data */

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index bef4d55a108c..074e6b8578df 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -70,6 +70,8 @@ static int debug = -1;
  module_param(debug, int, 0444);
  MODULE_PARM_DESC(debug, "Debug level