Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 19:12, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 06:31:35PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 17:50, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
>>>> On Thu, 21 Mar 2019 15:04:37 +0200
>>>> Liran Alon  wrote:
>>>> 
>>>>>> 
>>>>>> OK. Now what happens if master is moved to another namespace? Do we need
>>>>>> to move the slaves too?  
>>>>> 
>>>>> No. Why would we move the slaves? The whole point is to make most 
>>>>> customers ignore the net-failover slaves and keep them “hidden” in their 
>>>>> dedicated netns.
>>>>> We won’t prevent a customer from explicitly moving the net-failover slaves 
>>>>> out of this netns, but we will not move them out of there automatically.
>>>> 
>>>> 
>>>> The 2-device netvsc already handles the case where the master changes namespace.
>>> 
>>> Is it by moving the slave with it?
>> 
>> See c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device").
>> It seems that when the NetVSC master netdev changes netns, the VF is moved to 
>> the same netns by the NetVSC driver.
>> Kind of the opposite of what we are suggesting here, which is to make sure 
>> that the net-failover master netdev is in a separate
>> netns from its slaves...
>> 
>> -Liran
>> 
>>> 
>>> -- 
>>> MST
> 
> Not exactly opposite I'd say.
> 
> If failover is in host ns, slaves in /primary and /standby, then moving
> failover to /container should move slaves to /container/primary and
> /container/standby.

Yes I agree.
I meant that they tried to keep the VF in the same netns as the NetVSC netdev.
But of course, what you just described is exactly the functionality I would have 
wanted in our net-failover mechanism.

-Liran
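To make the agreed-upon semantics concrete, here is a tiny, purely illustrative sketch (not kernel code; the netns paths, class, and method names are all made up for this example): each slave lives in a per-role netns nested under the master's netns, so moving the failover master re-homes both slaves automatically.

```python
# Conceptual model of the desired behaviour: slaves follow the master
# across netns moves, staying in nested per-role namespaces.
# All paths and names here are illustrative, not a real kernel interface.

class FailoverDevice:
    def __init__(self, name):
        self.name = name
        self.netns = "/"          # master starts in the host netns
        self.slave_netns = {}
        self._place_slaves()

    def _place_slaves(self):
        base = self.netns.rstrip("/")
        self.slave_netns = {
            "primary": f"{base}/primary",   # the VF slave
            "standby": f"{base}/standby",   # the virtio-net slave
        }

    def move_to_netns(self, netns):
        # Moving the master implicitly moves both slaves with it.
        self.netns = netns
        self._place_slaves()

fo = FailoverDevice("ens3")
fo.move_to_netns("/container")
print(fo.slave_netns["primary"])   # /container/primary
```

This mirrors Michael's description: failover in the host ns has slaves in /primary and /standby, and after a move to /container they end up in /container/primary and /container/standby.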

> 
> 
> -- 
> MST

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 15:51, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 03:24:39PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 15:12, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin  wrote:
>>>>> 
>>>>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>>>>>> 
>>>>>> 
>>>>>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
>>>>>>> 
>>>>>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>>>>>>>> 2) It brings a non-intuitive customer experience. For example, a 
>>>>>>>>>>>> customer may attempt to analyse a connectivity issue by checking 
>>>>>>>>>>>> connectivity
>>>>>>>>>>>> on a net-failover slave (e.g. the VF) and see no connectivity, 
>>>>>>>>>>>> when in fact checking connectivity on the net-failover master 
>>>>>>>>>>>> netdev shows correct connectivity.
>>>>>>>>>>>> 
>>>>>>>>>>>> The set of changes I envision to fix our issues is:
>>>>>>>>>>>> 1) Hide net-failover slaves in a different netns created and 
>>>>>>>>>>>> managed by the kernel, but one that the user can enter and manage 
>>>>>>>>>>>> the netdevs in if they wish to do so explicitly.
>>>>>>>>>>>> (E.g. configure the net-failover VF slave in some special way).
>>>>>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead 
>>>>>>>>>>>> of MAC (similar to what is done in NetVSC), e.g. provide a 
>>>>>>>>>>>> virtio-net interface to get the PCI slot where the matching VF 
>>>>>>>>>>>> will be hot-plugged by the hypervisor.
>>>>>>>>>>>> 3) Have an explicit virtio-net control message to command the 
>>>>>>>>>>>> hypervisor to switch the data-path from virtio-net to the VF and 
>>>>>>>>>>>> vice-versa, instead of relying on intercepting the PCI master 
>>>>>>>>>>>> enable-bit
>>>>>>>>>>>> as an indicator of when the VF is about to be set up (similar to 
>>>>>>>>>>>> what is done in NetVSC).
>>>>>>>>>>>> 
>>>>>>>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>>>>>>>> 
>>>>>>>>>>>> -Liran
>>>>>>>>>>> 
>>>>>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>>>>>>>> created by users?
>>>>>>>>>> 
>>>>>>>>>> This is kinda controversial, but maybe separate netns names into 2 
>>>>>>>>>> groups: hidden and normal.
>>>>>>>>>> To reference a hidden netns, you need to do it explicitly. 
>>>>>>>>>> Hidden and normal netns names can collide as they will be maintained 
>>>>>>>>>> in different namespaces (Yes I’m overloading the term namespace 
>>>>>>>>>> here…).
>>>>>>>>> 
>>>>>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a 
>>>>>>>>> name?
>>>>>>>> 
>>>>>>>> This is also a good idea that will solve the issue. Yes.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Does this seem reasonable?
>>>>>>>>>> 
>>>>>>>>>> -Liran
>>>>>>>>> 
>>>>>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>>>>>>>> missed a trick or two.
>>>>>>>> 
>>>>>>>> BTW, from a practical point of view, I think that even until we figure 
>>>>>>>> out a solution on how to implement this,
>>>&

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 17:50, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
>> On Thu, 21 Mar 2019 15:04:37 +0200
>> Liran Alon  wrote:
>> 
>>>> 
>>>> OK. Now what happens if master is moved to another namespace? Do we need
>>>> to move the slaves too?  
>>> 
>>> No. Why would we move the slaves? The whole point is to make most customers 
>>> ignore the net-failover slaves and keep them “hidden” in their dedicated 
>>> netns.
>>> We won’t prevent a customer from explicitly moving the net-failover slaves 
>>> out of this netns, but we will not move them out of there automatically.
>> 
>> 
>> The 2-device netvsc already handles the case where the master changes namespace.
> 
> Is it by moving the slave with it?

See c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device").
It seems that when the NetVSC master netdev changes netns, the VF is moved to the 
same netns by the NetVSC driver.
Kind of the opposite of what we are suggesting here, which is to make sure that 
the net-failover master netdev is in a separate
netns from its slaves...

-Liran

> 
> -- 
> MST


Re: [PATCH net v2] failover: allow name change on IFF_UP slave interfaces

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 16:04, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 06, 2019 at 10:08:32PM -0500, Si-Wei Liu wrote:
>> When a netdev appears through hot plug then gets enslaved by a failover
>> master that is already up and running, the slave will be opened
>> right away after getting enslaved. Today there's a race that userspace
>> (udev) may fail to rename the slave if the kernel (net_failover)
>> opens the slave earlier than when the userspace rename happens.
>> Unlike bond or team, the primary slave of failover can't be renamed by
>> userspace ahead of time, since the kernel initiated auto-enslavement is
>> unable to, or rather, is never meant to be synchronized with the rename
>> request from userspace.
>> 
>> The failover slave interfaces are not designed to be operated
>> directly by userspace apps: IP configuration, filter rules with
>> regard to network traffic passing, etc., should all be done on the master
>> interface. In general, userspace apps only care about the
>> name of the master interface, while slave names are less important as long
>> as admin users can see reliable names that may carry
>> other information describing the netdev. For example, they can infer that
>> "ens3nsby" is a standby slave of "ens3", while for a
>> name like "eth0" they can't tell which master it belongs to.
>> 
>> Historically the name of IFF_UP interface can't be changed because
>> there might be admin script or management software that is already
>> relying on such behavior and assumes that the slave name can't be
>> changed once UP. But failover is special: with the in-kernel
>> auto-enslavement mechanism, the userspace expectation for device
>> enumeration and bring-up order is already broken. Previously initramfs
>> and various userspace config tools were modified to bypass failover
>> slaves because of auto-enslavement and duplicate MAC address. Similarly,
>> in case that users care about seeing reliable slave name, the new type
>> of failover slaves needs to be taken care of specifically in userspace
>> anyway.
>> 
>> It's less risky to lift the rename restriction on a failover slave
>> which is already UP. Although this change may potentially
>> break a userspace component (most likely configuration scripts or
>> management software) that assumes the slave name can't be changed while
>> UP, that is a relatively limited and controllable set among all userspace
>> components, which can be fixed specifically to work with the new naming
>> behavior of failover slaves. Userspace components interacting with
>> slaves should be changed to operate on the failover master instead, as the
>> failover slave is dynamic in nature and may come and go at any point.
>> The goal is to make the role of failover slaves less relevant, and
>> all userspace should only deal with master in the long run.
>> 
>> Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
>> Signed-off-by: Si-Wei Liu 
>> Reviewed-by: Liran Alon 
>> Acked-by: Michael S. Tsirkin 
> 
> I worry that userspace might have made a bunch of assumptions
> that names never change as long as interface is up.
> So listening for up events ensures that interface
> is not renamed.

That’s true. This is exactly what is described in the 3rd paragraph of the 
commit message.
However, as the commit message claims, net-failover slaves can be treated 
specially
because userspace is already broken in its handling of them and needs to be 
modified
to treat those slaves specially anyway. Therefore, it’s less risky to lift the
rename restriction on a failover slave which is already UP.

> 
> How about sending down and up events around such renames?

You mean that dev_change_name() will behave as proposed in this patch, but in 
addition
send fake DOWN and UP uevents to userspace?

-Liran
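For readers following along, here is a simplified model (in Python, not the actual kernel C) of the check the patch relaxes in dev_change_name(): renaming an IFF_UP device normally fails with -EBUSY, and the patch exempts failover slaves. IFF_UP is the real flag value; the IFF_FAILOVER_SLAVE bit position and the surrounding structure are illustrative stand-ins.

```python
# Sketch of the rename-restriction logic under discussion, not the real
# kernel implementation.
EBUSY = 16

IFF_UP = 0x1                    # net_device->flags bit (real value)
IFF_FAILOVER_SLAVE = 1 << 28    # priv_flags bit (illustrative value)

class NetDevice:
    def __init__(self, name, flags=0, priv_flags=0):
        self.name = name
        self.flags = flags
        self.priv_flags = priv_flags

def dev_change_name(dev, new_name):
    # Unchanged behavior for ordinary devices: no rename while UP.
    if dev.flags & IFF_UP and not (dev.priv_flags & IFF_FAILOVER_SLAVE):
        return -EBUSY
    # Failover slaves may be renamed while UP, closing the udev race.
    dev.name = new_name
    return 0

vf = NetDevice("eth0", flags=IFF_UP, priv_flags=IFF_FAILOVER_SLAVE)
assert dev_change_name(vf, "ens3nsby") == 0 and vf.name == "ens3nsby"
```

The uevent question above would then amount to emitting synthetic DOWN/UP notifications around the `dev.name = new_name` step so that listeners keyed on up events are not surprised.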

> 
> 
> 
>> ---
>> v1 -> v2:
>> - Drop configurable module parameter (Sridhar)
>> 
>> 
>> include/linux/netdevice.h | 3 +++
>> net/core/dev.c            | 3 ++-
>> net/core/failover.c       | 6 +++---
>> 3 files changed, 8 insertions(+), 4 deletions(-)
>> 
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 857f8ab..6d9e4e0 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1487,6 +1487,7 @@ struct net_device_ops {
>>  * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
>>  * @IFF_FAILOVER: device is a failover master device
>>  * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
>> + * @IFF_SLAVE_RENAME_

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 15:12, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
>>>>> 
>>>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>>>>>> 2) It brings a non-intuitive customer experience. For example, a 
>>>>>>>>>> customer may attempt to analyse a connectivity issue by checking 
>>>>>>>>>> connectivity
>>>>>>>>>> on a net-failover slave (e.g. the VF) and see no connectivity, 
>>>>>>>>>> when in fact checking connectivity on the net-failover master 
>>>>>>>>>> netdev shows correct connectivity.
>>>>>>>>>> 
>>>>>>>>>> The set of changes I envision to fix our issues is:
>>>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed 
>>>>>>>>>> by the kernel, but one that the user can enter and manage the netdevs 
>>>>>>>>>> in if they wish to do so explicitly.
>>>>>>>>>> (E.g. configure the net-failover VF slave in some special way).
>>>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead 
>>>>>>>>>> of MAC (similar to what is done in NetVSC), e.g. provide a virtio-net 
>>>>>>>>>> interface to get the PCI slot where the matching VF will be hot-plugged 
>>>>>>>>>> by the hypervisor.
>>>>>>>>>> 3) Have an explicit virtio-net control message to command the hypervisor 
>>>>>>>>>> to switch the data-path from virtio-net to the VF and vice-versa, instead 
>>>>>>>>>> of relying on intercepting the PCI master enable-bit
>>>>>>>>>> as an indicator of when the VF is about to be set up (similar to what is 
>>>>>>>>>> done in NetVSC).
>>>>>>>>>> 
>>>>>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>>>>>> 
>>>>>>>>>> -Liran
>>>>>>>>> 
>>>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>>>>>> created by users?
>>>>>>>> 
>>>>>>>> This is kinda controversial, but maybe separate netns names into 2 
>>>>>>>> groups: hidden and normal.
>>>>>>>> To reference a hidden netns, you need to do it explicitly. 
>>>>>>>> Hidden and normal netns names can collide as they will be maintained 
>>>>>>>> in different namespaces (Yes I’m overloading the term namespace here…).
>>>>>>> 
>>>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>>>>>> 
>>>>>> This is also a good idea that will solve the issue. Yes.
>>>>>> 
>>>>>>> 
>>>>>>>> Does this seem reasonable?
>>>>>>>> 
>>>>>>>> -Liran
>>>>>>> 
>>>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>>>>>> missed a trick or two.
>>>>>> 
>>>>>> BTW, from a practical point of view, I think that even before we figure 
>>>>>> out a solution on how to implement this,
>>>>>> it would be better to create a kernel auto-generated name (e.g. 
>>>>>> "kernel_net_failover_slaves")
>>>>>> that will break only userspace workloads that by a very rare chance have 
>>>>>> a netns that collides with this name, rather than
>>>>>> the breakage we have today for the various userspace components.
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> It seems quite easy to supply that as a module parameter. Do we need two
>>>>> namespaces though? Won't some userspace still be confused by the two
>>>>> slaves sharing the MAC address?
>>>> 
>>>> That’s one reasonable option.
>&g

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 14:57, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>>>> 2) It brings a non-intuitive customer experience. For example, a 
>>>>>>>> customer may attempt to analyse a connectivity issue by checking 
>>>>>>>> connectivity
>>>>>>>> on a net-failover slave (e.g. the VF) and see no connectivity, 
>>>>>>>> when in fact checking connectivity on the net-failover master 
>>>>>>>> netdev shows correct connectivity.
>>>>>>>> 
>>>>>>>> The set of changes I envision to fix our issues is:
>>>>>>>> 1) Hide net-failover slaves in a different netns created and managed 
>>>>>>>> by the kernel, but one that the user can enter and manage the netdevs 
>>>>>>>> in if they wish to do so explicitly.
>>>>>>>> (E.g. configure the net-failover VF slave in some special way).
>>>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of 
>>>>>>>> MAC (similar to what is done in NetVSC), e.g. provide a virtio-net 
>>>>>>>> interface to get the PCI slot where the matching VF will be hot-plugged 
>>>>>>>> by the hypervisor.
>>>>>>>> 3) Have an explicit virtio-net control message to command the hypervisor 
>>>>>>>> to switch the data-path from virtio-net to the VF and vice-versa, instead 
>>>>>>>> of relying on intercepting the PCI master enable-bit
>>>>>>>> as an indicator of when the VF is about to be set up (similar to what is 
>>>>>>>> done in NetVSC).
>>>>>>>> 
>>>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>>>> 
>>>>>>>> -Liran
>>>>>>> 
>>>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>>>> created by users?
>>>>>> 
>>>>>> This is kinda controversial, but maybe separate netns names into 2 
>>>>>> groups: hidden and normal.
>>>>>> To reference a hidden netns, you need to do it explicitly. 
>>>>>> Hidden and normal netns names can collide as they will be maintained in 
>>>>>> different namespaces (Yes I’m overloading the term namespace here…).
>>>>> 
>>>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>>>> 
>>>> This is also a good idea that will solve the issue. Yes.
>>>> 
>>>>> 
>>>>>> Does this seem reasonable?
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>>>> missed a trick or two.
>>>> 
>>>> BTW, from a practical point of view, I think that even before we figure out 
>>>> a solution on how to implement this,
>>>> it would be better to create a kernel auto-generated name (e.g. 
>>>> "kernel_net_failover_slaves")
>>>> that will break only userspace workloads that by a very rare chance have a 
>>>> netns that collides with this name, rather than
>>>> the breakage we have today for the various userspace components.
>>>> 
>>>> -Liran
>>> 
>>> It seems quite easy to supply that as a module parameter. Do we need two
>>> namespaces though? Won't some userspace still be confused by the two
>>> slaves sharing the MAC address?
>> 
>> That’s one reasonable option.
>> Another one is that we will indeed change the mechanism by which we 
>> determine which VF should be bonded with a virtio-net device,
>> i.e. expose a new virtio-net property that specifies the PCI slot of the VF to 
>> be bonded with.
>> 
>> The second seems cleaner but I don’t have a strong opinion on this. Both 
>> seem reasonable to me, and your suggestion is faster to implement from 
>> the current state of things.
>> 
>> -Liran
> 
> OK. Now what happens if master is moved to another namespace? Do we need
> to move the slaves too?

No. Why would we move the slaves? The whole point is to make most customers 
ignore the net-failover slaves and keep them “hidden” in their dedicated 
netns.
We won’t prevent a customer from explicitly moving the net-failover slaves out of 
this netns, but we will not move them out of there automatically.

> 
> Also Si-Wei's patch is then kind of extraneous, right?
> Attempts to rename a slave will now fail as it's in a namespace…

I’m not sure actually. Isn’t udev/systemd netns-aware?
I would expect it to be able to provide names also to netdevs in a netns 
different from the default netns.
If that’s the case, Si-Wei’s patch to allow renaming a net-failover slave while 
it is already open is still required, as the race condition still exists.

-Liran

> 
>>> 
>>> -- 
>>> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>>>>>> 2) It brings a non-intuitive customer experience. For example, a customer 
>>>>>> may attempt to analyse a connectivity issue by checking connectivity
>>>>>> on a net-failover slave (e.g. the VF) and see no connectivity, when 
>>>>>> in fact checking connectivity on the net-failover master netdev 
>>>>>> shows correct connectivity.
>>>>>> 
>>>>>> The set of changes I envision to fix our issues is:
>>>>>> 1) Hide net-failover slaves in a different netns created and managed by 
>>>>>> the kernel, but one that the user can enter and manage the netdevs in 
>>>>>> if they wish to do so explicitly.
>>>>>> (E.g. configure the net-failover VF slave in some special way).
>>>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of 
>>>>>> MAC (similar to what is done in NetVSC), e.g. provide a virtio-net 
>>>>>> interface to get the PCI slot where the matching VF will be hot-plugged 
>>>>>> by the hypervisor.
>>>>>> 3) Have an explicit virtio-net control message to command the hypervisor 
>>>>>> to switch the data-path from virtio-net to the VF and vice-versa, instead 
>>>>>> of relying on intercepting the PCI master enable-bit
>>>>>> as an indicator of when the VF is about to be set up (similar to what is 
>>>>>> done in NetVSC).
>>>>>> 
>>>>>> Is there any clear issue we see regarding the above suggestion?
>>>>>> 
>>>>>> -Liran
>>>>> 
>>>>> The issue would be this: how do we avoid conflicting with namespaces
>>>>> created by users?
>>>> 
>>>> This is kinda controversial, but maybe separate netns names into 2 groups: 
>>>> hidden and normal.
>>>> To reference a hidden netns, you need to do it explicitly. 
>>>> Hidden and normal netns names can collide as they will be maintained in 
>>>> different namespaces (Yes I’m overloading the term namespace here…).
>>> 
>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>> 
>> This is also a good idea that will solve the issue. Yes.
>> 
>>> 
>>>> Does this seem reasonable?
>>>> 
>>>> -Liran
>>> 
>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>> missed a trick or two.
>> 
>> BTW, from a practical point of view, I think that even before we figure out 
>> a solution on how to implement this,
>> it would be better to create a kernel auto-generated name (e.g. 
>> "kernel_net_failover_slaves")
>> that will break only userspace workloads that by a very rare chance have a 
>> netns that collides with this name, rather than
>> the breakage we have today for the various userspace components.
>> 
>> -Liran
> 
> It seems quite easy to supply that as a module parameter. Do we need two
> namespaces though? Won't some userspace still be confused by the two
> slaves sharing the MAC address?

That’s one reasonable option.
Another one is that we will indeed change the mechanism by which we determine 
which VF should be bonded with a virtio-net device,
i.e. expose a new virtio-net property that specifies the PCI slot of the VF to be 
bonded with.

The second seems cleaner but I don’t have a strong opinion on this. Both seem 
reasonable to me, and your suggestion is faster to implement from the current 
state of things.

-Liran
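The two matching policies being compared can be sketched as follows (a hedged illustration only: the device records, field names such as `standby_vf_slot`, and values are hypothetical, not a real virtio-net or failover interface). Today the failover module pairs a hot-plugged VF with a virtio-net device by MAC address; the proposal pairs them by a hypervisor-provided PCI slot attribute instead.

```python
# Hypothetical device records for illustration.
virtio_net = {
    "name": "ens3",
    "mac": "02:00:00:00:00:01",
    "standby_vf_slot": "0000:00:07.0",   # proposed PV attribute (made up name)
}

def match_by_mac(vf, pv_dev):
    # Current policy: the VF and virtio-net must share a MAC address.
    return vf["mac"] == pv_dev["mac"]

def match_by_pci_slot(vf, pv_dev):
    # Proposed policy: match on the hypervisor-advertised PCI slot.
    return vf["pci_slot"] == pv_dev["standby_vf_slot"]

vf = {"name": "eth1", "mac": "02:00:00:00:00:01", "pci_slot": "0000:00:07.0"}
assert match_by_mac(vf, virtio_net)
assert match_by_pci_slot(vf, virtio_net)

# With slot-based matching the two netdevs no longer need to share a MAC,
# which avoids confusing userspace components that key on MAC addresses.
other_vf = {"name": "eth2", "mac": "02:00:00:00:00:02", "pci_slot": "0000:00:07.0"}
assert match_by_pci_slot(other_vf, virtio_net)
assert not match_by_mac(other_vf, virtio_net)
```

This is why the slot-based option is called "cleaner" above: it decouples pairing from the duplicate-MAC requirement that trips up userspace.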

> 
> -- 
> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 0:10, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin  wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
>>>>> 
>>>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>>>>>> 
>>>>>> 
>>>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
>>>>>>> 
>>>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>>>>>> Liran Alon  wrote:
>>>>>>>> 
>>>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it 
>>>>>>>>> attempts to configure all available netdevs. It should, however, 
>>>>>>>>> avoid doing so on net-failover slaves.
>>>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to 
>>>>>>>>> blacklist a netdev from being configured in case it is owned by a 
>>>>>>>>> specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
>>>>>>>>> However, this technique doesn’t work for the net-failover mechanism 
>>>>>>>>> because both the net-failover netdev and the virtio-net netdev are 
>>>>>>>>> owned by the virtio-net PCI driver).
>>>>>>>> 
>>>>>>>> Cloud-init should really just ignore all devices that have a master 
>>>>>>>> device.
>>>>>>>> That would have been more general, and safer for other use cases.
>>>>>>> 
>>>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>>>>>> safer to just somehow pretend to userspace that the slave links are
>>>>>>> down? And add a special attribute for the actual link state.
>>>>>> 
>>>>>> I think this may be problematic as it would also break the legit use case
>>>>>> of userspace attempting to set various config on the VF slave.
>>>>>> In general, lying to userspace usually leads to problems.
>>>>> 
>>>>> I hear you on this. So how about instead of lying,
>>>>> we basically just fail some accesses to slaves
>>>>> unless a flag is set e.g. in ethtool.
>>>>> 
>>>>> Some userspace will need to change to set it but in a minor way.
>>>>> Arguably/hopefully failure to set config would generally be a safer
>>>>> failure.
>>>> 
>>>> Once userspace sets this new flag via ethtool, all operations done by 
>>>> other userspace components will still work.
>>> 
>>> Sorry about being unclear, the idea would be to require the flag on each 
>>> ethtool operation.
>> 
>> Oh. I have indeed misunderstood your previous email then. :)
>> Thanks for clarifying.
>> 
>>> 
>>>> E.g. Running dhclient without parameters, after this flag was set, will 
>>>> still attempt to perform DHCP on it and will now succeed.
>>> 
>>> I think sending/receiving should probably just fail unconditionally.
>> 
>> You mean that you wish that somehow the kernel will prevent Tx on the 
>> net-failover slave netdev
>> unless the skb is marked with some flag to indicate it has been sent via the 
>> net-failover master?
> 
> We can maybe avoid binding a protocol socket to the device?

That is indeed another possibility that would work to avoid the DHCP issues,
and it will still allow checking connectivity. So it is better.
However, I still think it provides a non-intuitive customer experience.
In addition, I also want to take into account that most customers expect a 
1:1 mapping between a vNIC and a netdev,
i.e. a cloud instance should show one netdev if it has one vNIC attached to it.
Customers usually don’t care how they get accelerated networking. They just 
care that they do.
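The socket-binding idea above can be sketched minimally (an illustration of the concept only: the error code, dict-based devices, and function name are made up, not an actual kernel interface): binding a protocol socket directly to a failover slave is refused, forcing tools like a DHCP client onto the master, while the slave itself stays visible for link-level inspection.

```python
# Sketch of "avoid binding a protocol socket to the device" for
# net-failover slaves. Structure and error code are illustrative.
EPERM = 1

def bind_to_device(sock, dev):
    if dev.get("failover_slave"):
        return -EPERM          # force userspace to bind to the failover master
    sock["bound_dev"] = dev["name"]
    return 0

master = {"name": "ens3", "failover_slave": False}
slave = {"name": "ens3nsby", "failover_slave": True}

sock = {}
assert bind_to_device(sock, master) == 0 and sock["bound_dev"] == "ens3"
assert bind_to_device({}, slave) == -EPERM   # e.g. dhclient on the VF fails
```

This captures why the approach helps with DHCP specifically (dhclient cannot bind to the slave) while leaving the non-intuitive "two netdevs per vNIC" experience unaddressed.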

> 
>> This indeed resolves the group of userspace issues around performing DHCP on 
>> net-failover slaves directly (by dracut/initramfs, dhclient, etc.).
>> 
>> However, I see a couple of downsides to it:
>> 1) It doesn’t resolve all userspace issues listed in this email thread. For 
>> example, c

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 10:58, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 12:19:22AM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 0:10, Michael S. Tsirkin  wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin  wrote:
>>>>> 
>>>>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>>>>>> 
>>>>>> 
>>>>>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
>>>>>>> 
>>>>>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
>>>>>>>>> 
>>>>>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>>>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>>>>>>>> Liran Alon  wrote:
>>>>>>>>>> 
>>>>>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it 
>>>>>>>>>>> attempts to configure all available netdevs. It should, 
>>>>>>>>>>> however, avoid doing so on net-failover slaves.
>>>>>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to 
>>>>>>>>>>> blacklist a netdev from being configured in case it is owned by a 
>>>>>>>>>>> specific PCI driver. Specifically, they blacklist Mellanox VF 
>>>>>>>>>>> driver. However, this technique doesn’t work for the net-failover 
>>>>>>>>>>> mechanism because both the net-failover netdev and the virtio-net 
>>>>>>>>>>> netdev are owned by the virtio-net PCI driver).
>>>>>>>>>> 
>>>>>>>>>> Cloud-init should really just ignore all devices that have a master 
>>>>>>>>>> device.
>>>>>>>>>> That would have been more general, and safer for other use cases.
>>>>>>>>> 
>>>>>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>>>>>>>> safer to just somehow pretend to userspace that the slave links are
>>>>>>>>> down? And add a special attribute for the actual link state.
>>>>>>>> 
>>>>>>>> I think this may be problematic as it would also break the legit use case
>>>>>>>> of userspace attempting to set various config on the VF slave.
>>>>>>>> In general, lying to userspace usually leads to problems.
>>>>>>> 
>>>>>>> I hear you on this. So how about instead of lying,
>>>>>>> we basically just fail some accesses to slaves
>>>>>>> unless a flag is set e.g. in ethtool.
>>>>>>> 
>>>>>>> Some userspace will need to change to set it but in a minor way.
>>>>>>> Arguably/hopefully failure to set config would generally be a safer
>>>>>>> failure.
>>>>>> 
>>>>>> Once userspace sets this new flag via ethtool, all operations done by 
>>>>>> other userspace components will still work.
>>>>> 
>>>>> Sorry about being unclear, the idea would be to require the flag on each 
>>>>> ethtool operation.
>>>> 
>>>> Oh. I have indeed misunderstood your previous email then. :)
>>>> Thanks for clarifying.
>>>> 
>>>>> 
>>>>>> E.g. Running dhclient without parameters, after this flag was set, will 
>>>>>> still attempt to perform DHCP on it and will now succeed.
>>>>> 
>>>>> I think sending/receiving should probably just fail unconditionally.
>>>> 
>>>> You mean that you wish that somehow the kernel will prevent Tx on the 
>>>> net-failover slave netdev
>>>> unless the skb is marked with some flag to indicate it has been sent via 
>>>> the net-failover master?
>>> 
>>> We can maybe avoid binding a protocol socket to the device?
>> 
>> That is indeed another possibility that would work to avoid the DHCP issues.
>> And will still allow c

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>> 
>> 
>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
>>> 
>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>> Liran Alon  wrote:
>>>> 
>>>>> b.3) cloud-init: If configured to perform network-configuration, it 
>>>>> attempts to configure all available netdevs. It should, however, 
>>>>> avoid doing so on net-failover slaves.
>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to 
>>>>> blacklist a netdev from being configured in case it is owned by a 
>>>>> specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
>>>>> However, this technique doesn’t work for the net-failover mechanism 
>>>>> because both the net-failover netdev and the virtio-net netdev are owned 
>>>>> by the virtio-net PCI driver).
>>>> 
>>>> Cloud-init should really just ignore all devices that have a master device.
>>>> That would have been more general, and safer for other use cases.
>>> 
>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>> safer to just somehow pretend to userspace that the slave links are
>>> down? And add a special attribute for the actual link state.
>> 
>> I think this may be problematic as it would also break the legit use case
>> of userspace attempting to set various config on the VF slave.
>> In general, lying to userspace usually leads to problems.
> 
> I hear you on this. So how about instead of lying,
> we basically just fail some accesses to slaves
> unless a flag is set e.g. in ethtool.
> 
> Some userspace will need to change to set it but in a minor way.
> Arguably/hopefully failure to set config would generally be a safer
> failure.

Once userspace sets this new flag via ethtool, all operations done by other 
userspace components will still work.
E.g. running dhclient without parameters after the flag was set will still 
attempt DHCP on the slave, and will now succeed.

Therefore, this proposal effectively just delays when the net-failover slave 
can be operated on by userspace.
But what we actually want is to never allow a net-failover slave to be operated 
on by userspace unless userspace explicitly states
that it wishes to perform a set of actions on the net-failover 
slave.

This would be achieved if, for example, the net-failover slaves were placed in 
a netns different from the default netns.
It also aligns with the expected customer experience: most customers just 
want to see a 1:1 mapping between a vNIC and a visible netdev.
But of course maybe there are other ideas that can achieve similar behaviour.
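To make the "hidden slaves" idea concrete, here is a toy userspace Python model (not kernel code; all interface and namespace names are illustrative): slaves live in a dedicated netns, so the default view shows only the net-failover master, and an explicit opt-in is needed to see or manage them.

```python
# Toy model of net-failover slaves hidden in a kernel-managed netns.
# Names ("failover0", "ens4", "eth0", "failover-slaves") are illustrative.
namespaces = {
    "default": ["lo", "failover0"],        # what most customers see
    "failover-slaves": ["ens4", "eth0"],   # VF primary + virtio standby
}

def visible_netdevs(ns="default"):
    """List the netdevs visible from the given namespace."""
    return namespaces[ns]

# Customers get the expected 1:1 vNIC-to-netdev view by default...
assert visible_netdevs() == ["lo", "failover0"]
# ...and must explicitly enter the slaves' netns to manipulate them.
assert visible_netdevs("failover-slaves") == ["ens4", "eth0"]
```

The point of the model is that unmodified tools (dhclient, cloud-init) only ever enumerate the default view, so they never touch the slaves by accident.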

-Liran

> 
> Which things to fail? Probably sending/receiving packets?  Getting MAC?
> More?
> 
>> If we reach
>> to a scenario where we try to avoid userspace issues generically and
>> not on a userspace component basis, I believe the right path should be
>> to hide the net-failover slaves such that explicit action is required
>> to actually manipulate them (As described in blog-post). E.g.
>> Automatically move net-failover slaves by kernel to a different netns.
>> 
>> -Liran
>> 
>>> 
>>> -- 
>>> MST

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
> 
> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>> On Tue, 19 Mar 2019 14:38:06 +0200
>> Liran Alon  wrote:
>> 
>>> b.3) cloud-init: If configured to perform network-configuration, it 
>>> attempts to configure all available netdevs. It should avoid however doing 
>>> so on net-failover slaves.
>>> (Microsoft has handled this by adding a mechanism in cloud-init to 
>>> blacklist a netdev from being configured in case it is owned by a specific 
>>> PCI driver. Specifically, they blacklist Mellanox VF driver. However, this 
>>> technique doesn’t work for the net-failover mechanism because both the 
>>> net-failover netdev and the virtio-net netdev are owned by the virtio-net 
>>> PCI driver).
>> 
>> Cloud-init should really just ignore all devices that have a master device.
>> That would have been more general, and safer for other use cases.
> 
> Given lots of userspace doesn't do this, I wonder whether it would be
> safer to just somehow pretend to userspace that the slave links are
> down? And add a special attribute for the actual link state.

I think this may be problematic, as it would also break the legitimate use case 
of userspace setting various configuration on the VF slave.
In general, lying to userspace usually leads to problems. If we reach a 
scenario where we try to avoid userspace issues generically, and not
on a per-userspace-component basis, I believe the right path is to hide the 
net-failover slaves such that explicit action is required
to actually manipulate them (as described in the blog-post), e.g. have the 
kernel automatically move net-failover slaves to a different netns.

-Liran

> 
> -- 
> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 20 Mar 2019, at 16:09, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>>>> 
>>>> 
>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
>>>>> 
>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>>>> Liran Alon  wrote:
>>>>>> 
>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it 
>>>>>>> attempts to configure all available netdevs. It should avoid however 
>>>>>>> doing so on net-failover slaves.
>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to 
>>>>>>> blacklist a netdev from being configured in case it is owned by a 
>>>>>>> specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
>>>>>>> However, this technique doesn’t work for the net-failover mechanism 
>>>>>>> because both the net-failover netdev and the virtio-net netdev are 
>>>>>>> owned by the virtio-net PCI driver).
>>>>>> 
>>>>>> Cloud-init should really just ignore all devices that have a master 
>>>>>> device.
>>>>>> That would have been more general, and safer for other use cases.
>>>>> 
>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>>>> safer to just somehow pretend to userspace that the slave links are
>>>>> down? And add a special attribute for the actual link state.
>>>> 
>>>> I think this may be problematic as it would also break legit use case
>>>> of userspace attempt to set various config on VF slave.
>>>> In general, lying to userspace usually leads to problems.
>>> 
>>> I hear you on this. So how about instead of lying,
>>> we basically just fail some accesses to slaves
>>> unless a flag is set e.g. in ethtool.
>>> 
>>> Some userspace will need to change to set it but in a minor way.
>>> Arguably/hopefully failure to set config would generally be a safer
>>> failure.
>> 
>> Once userspace will set this new flag by ethtool, all operations done by 
>> other userspace components will still work.
> 
> Sorry about being unclear, the idea would be to require the flag on each 
> ethtool operation.

Oh. I have indeed misunderstood your previous email then. :)
Thanks for clarifying.
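The per-operation gating being discussed could be sketched as a toy userspace model (purely illustrative; the override flag name is hypothetical, not an existing ethtool option): each configuration access to a failover slave fails unless the caller explicitly acknowledges it is touching a slave.

```python
# Toy sketch of failing slave config accesses unless an explicit
# per-call flag is set (analogous to the proposed ethtool flag).
class FailoverSlave:
    def __init__(self, name):
        self.name = name
        self.mtu = 1500

    def set_mtu(self, mtu, i_know_this_is_a_slave=False):
        # Hypothetical override flag: unaware userspace fails safely.
        if not i_know_this_is_a_slave:
            raise PermissionError(
                f"{self.name} is a net-failover slave; refusing config")
        self.mtu = mtu

vf = FailoverSlave("ens4")
try:
    vf.set_mtu(9000)                  # legacy tool: blocked, safely
except PermissionError:
    pass
vf.set_mtu(9000, i_know_this_is_a_slave=True)  # explicit opt-in works
assert vf.mtu == 9000
```

Note the flag is checked on every call, matching the clarification above that one-time opt-in would not be enough.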

> 
>> E.g. Running dhclient without parameters, after this flag was set, will 
>> still attempt to perform DHCP on it and will now succeed.
> 
> I think sending/receiving should probably just fail unconditionally.

You mean that you wish the kernel would somehow prevent Tx on the net-failover 
slave netdev
unless the skb is marked with some flag indicating it was sent via the 
net-failover master?

This indeed resolves the group of userspace issues around performing DHCP 
directly on net-failover slaves (by dracut/initramfs, dhclient, etc.).

However, I see a couple of downsides to it:
1) It doesn't resolve all userspace issues listed in this email thread. For 
example, cloud-init will still attempt to perform network config on 
net-failover slaves.
It also doesn't help with Ubuntu's netplan issue, which creates udev 
rules that match only by MAC.
2) It brings a non-intuitive customer experience. For example, a customer may 
try to analyse a connectivity issue by checking a net-failover slave (e.g. the 
VF) and see no connectivity there, when in fact checking the net-failover 
master netdev shows correct 
connectivity.

The set of changes I envision to fix our issues is:
1) Hide net-failover slaves in a different netns created and managed by the 
kernel, but one the user can enter to manage the netdevs if they explicitly 
wish to do so.
(E.g. configure the net-failover VF slave in some special way.)
2) Match the virtio-net and the VF based on a PV attribute instead of MAC 
(similar to what is done in NetVSC). E.g. provide a virtio-net interface to get 
the PCI slot where the matching VF will be hot-plugged by the hypervisor.
3) Have an explicit virtio-net control message to command the hypervisor to 
switch the data-path from virtio-net to the VF and vice-versa, instead of 
relying on intercepting the PCI master enable-bit
as an indicator of when the VF is about to be set up (similar to what is done 
in NetVSC).
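Item (2) above can be illustrated with a small sketch (hypothetical field names, not an actual virtio interface): matching by a hypervisor-advertised PCI slot is unambiguous even when MAC matching would be, e.g. when two devices carry the same MAC.

```python
# Illustrative sketch: match the standby netdev to its primary VF by a
# PV attribute (PCI slot) rather than by MAC. Field names are made up.
def match_primary(standby, candidates):
    """Return the VF whose PCI slot matches the slot the hypervisor
    advertised for this standby device, or None if absent."""
    return next((vf for vf in candidates
                 if vf["pci_slot"] == standby["expected_vf_slot"]), None)

standby = {"name": "eth0", "mac": "52:54:00:aa:bb:cc",
           "expected_vf_slot": "0000:00:04.0"}
vfs = [
    # Same MAC as the standby but the wrong device: MAC matching
    # would incorrectly pick this one.
    {"name": "ens3", "pci_slot": "0000:00:03.0", "mac": "52:54:00:aa:bb:cc"},
    {"name": "ens4", "pci_slot": "0000:00:04.0", "mac": "52:54:00:dd:ee:ff"},
]
assert match_primary(standby, vfs)["name"] == "ens4"
```

This mirrors the NetVSC approach of using a PV channel attribute for pairing instead of relying on duplicated MACs.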

Is there any clear issue

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon
Hi Michael,

Great blog-post which summarise everything very well!

Some comments I have:

1) I think that when we use the term “1-netdev model” in community 
discussion, we tend to refer to what you have defined in the blog-post as the 
"3-device model with hidden slaves”.
Therefore, I would suggest just removing the “1-netdev model” section and 
renaming the "3-device model with hidden slaves” section to “1-netdev model”.

2) The userspace issues result from using both the “2-netdev model” and the 
“3-netdev model”. However, the blog-post describes them as if they exist only 
in the “3-netdev model”.
The reason these issues are not seen in the Azure environment is that they 
were partially handled by Microsoft for their specific 2-netdev model.
Which leads me to the next comment.

3) I suggest the blog-post also elaborate on exactly which userspace issues 
resulted in models other than the “1-netdev model”.
The issues that I’m aware of are (please tell me if you are aware of others!):
(a) udev rename race-condition: When the net-failover device is opened, it also 
opens its slaves. However, the order of KOBJ_ADD events to udev is first 
for the net-failover netdev and only then for the virtio-net netdev. This means 
that if userspace responds to the first event by opening the net-failover 
netdev, any attempt to rename the virtio-net netdev in response to the 
second event will fail, because the virtio-net netdev is already open. Also 
note that this udev rename rule is useful because we would like to add rules 
that rename the virtio-net netdev to clearly signal that it is used as the 
standby interface of another net-failover netdev.
Microsoft worked around this problem in NetVSC by delaying the open done on the 
slave VF relative to the open of the NetVSC netdev. However, this is still a 
race and thus a hacky solution. It was accepted by the community only because 
it is internal to the NetVSC driver; a similar solution was rejected by the 
community for the net-failover driver.
The solution we currently proposed to address this (a patch by Si-Wei) was 
to change the kernel rename handling to allow a net-failover slave to be 
renamed even if it is already open. The patch is still not accepted.
(b) Issues caused by various userspace components performing DHCP on the 
net-failover slaves: DHCP should of course be done only on the net-failover 
netdev. Attempting DHCP on the net-failover slaves as well will cause 
networking issues. Therefore, userspace components should be taught to avoid 
doing DHCP on the net-failover slaves. The various userspace components include:
b.1) dhclient: If run without parameters, it by default just enumerates all 
netdevs and attempts to DHCP them all.
(I don’t think Microsoft has handled this)
b.2) initramfs / dracut: In order to mount the root file-system from iSCSI, 
these components need networking and therefore DHCP on all netdevs.
(Microsoft hasn’t handled (b.2) because they don’t have images which perform 
iSCSI boot in their Azure setup. Still an open issue)
b.3) cloud-init: If configured to perform network configuration, it attempts to 
configure all available netdevs. It should, however, avoid doing so on 
net-failover slaves.
(Microsoft has handled this by adding a mechanism in cloud-init to blacklist a 
netdev from being configured in case it is owned by a specific PCI driver. 
Specifically, they blacklist Mellanox VF driver. However, this technique 
doesn’t work for the net-failover mechanism because both the net-failover 
netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
b.4) Various distros’ network managers may need to be updated to avoid DHCP on 
net-failover slaves? (Not sure. Asking...)
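The rename race in (a) can be reduced to a toy timeline (illustrative userspace Python, not the actual patch): once the failover master has opened its slave, a rename attempt hits EBUSY unless the kernel exempts failover slaves, which is what the proposed patch does.

```python
# Toy model of issue (a): rename of an opened (IFF_UP) netdev fails,
# unless failover slaves are exempted as in the proposed kernel change.
class Netdev:
    def __init__(self, name, failover_slave=False):
        self.name = name
        self.up = False
        self.failover_slave = failover_slave

def rename(dev, new_name, allow_slave_rename=False):
    """Mimic dev_change_name(): refuse while the device is up, except
    (hypothetically) for failover slaves when the exemption is enabled."""
    if dev.up and not (dev.failover_slave and allow_slave_rename):
        raise OSError("EBUSY: device is up")
    dev.name = new_name

slave = Netdev("eth0", failover_slave=True)
slave.up = True                 # opened by the failover master before udev ran
try:
    rename(slave, "ens4")       # stock behavior: udev loses the race
except OSError:
    pass
rename(slave, "ens4", allow_slave_rename=True)  # proposed behavior
assert slave.name == "ens4"
```

In the real patch the exemption is a property of the device, not a per-call flag; the flag here just makes the before/after contrast visible in one script.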

4) Another interesting use-case where the net-failover mechanism is useful is 
handling NIC firmware failures or NIC firmware live-upgrade.
In both cases, there is a need to perform a full PCIe reset of the NIC, which 
loses all the NIC eSwitch configuration of the various VFs.
To handle these cases gracefully, one could just hot-unplug all VFs from guests 
running on the host (which will make all guests use the virtio-net netdev, 
which is backed by a netdev that is eventually on top of the PF). Therefore, 
networking will be restored to guests once the PCIe reset is completed and the 
PF is functional again. To re-accelerate the guests’ network, the hypervisor 
can just hot-plug new VFs to the guests.
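The datapath behavior that makes use-case (4) work can be sketched as a minimal model (illustrative only): traffic goes via the VF when present, and transparently falls back to the virtio-net standby while the VF is unplugged for the PCIe reset.

```python
# Toy model of net-failover datapath selection during a NIC firmware
# upgrade: the guest keeps connectivity via virtio-net while the VF
# is hot-unplugged, and re-accelerates once a new VF is hot-plugged.
class FailoverMaster:
    def __init__(self):
        self.primary = "vf"       # accelerated slave (may disappear)
        self.standby = "virtio"   # always-present paravirt slave

    def active_path(self):
        """Prefer the primary VF when present, else the standby."""
        return self.primary or self.standby

fo = FailoverMaster()
assert fo.active_path() == "vf"       # normal operation
fo.primary = None                     # hypervisor hot-unplugs VF for reset
assert fo.active_path() == "virtio"   # connectivity preserved
fo.primary = "vf"                     # new VF hot-plugged after upgrade
assert fo.active_path() == "vf"       # acceleration restored
```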

P.S:
I would very much appreciate this forum’s help in closing the pending items 
written in (3), which currently prevent using this net-failover mechanism in 
real production use-cases.

Regards,
-Liran

> On 17 Mar 2019, at 15:55, Michael S. Tsirkin  wrote:
> 
> Hi all,
> I've put up a blog post with a summary of where network
> device failover stands and some open issues.
> Not sure where best to host it, I just put it up on blogspot:
> 

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 19 Mar 2019, at 23:06, Michael S. Tsirkin  wrote:
> 
> On Tue, Mar 19, 2019 at 02:38:06PM +0200, Liran Alon wrote:
>> Hi Michael,
>> 
>> Great blog-post which summarise everything very well!
>> 
>> Some comments I have:
> 
> Thanks!
> I'll try to update everything in the post when I'm not so jet-lagged.
> 
>> 1) I think that when we are using the term “1-netdev model” on community 
>> discussion, we tend to refer to what you have defined in blog-post as 
>> "3-device model with hidden slaves”.
>> Therefore, I would suggest to just remove the “1-netdev model” section and 
>> rename the "3-device model with hidden slaves” section to “1-netdev model”.
>> 
>> 2) The userspace issues result both from using “2-netdev model” and 
>> “3-netdev model”. However, they are described in blog-post as they only 
>> exist on “3-netdev model”.
>> The reason these issues are not seen in Azure environment is because these 
>> issues were partially handled by Microsoft for their specific 2-netdev model.
>> Which leads me to the next comment.
>> 
>> 3) I suggest that blog-post will also elaborate on what exactly are the 
>> userspace issues which results in models different than “1-netdev model”.
>> The issues that I’m aware of are (Please tell me if you are aware of 
>> others!):
>> (a) udev rename race-condition: When net-failover device is opened, it also 
>> opens it's slaves. However, the order of events to udev on KOBJ_ADD is first 
>> for the net-failover netdev and only then for the virtio-net netdev. This 
>> means that if userspace will respond to first event by open the 
>> net-failover, then any attempt of userspace to rename virtio-net netdev as a 
>> response to the second event will fail because the virtio-net netdev is 
>> already opened. Also note that this udev rename rule is useful because we 
>> would like to add rules that renames virtio-net netdev to clearly signal 
>> that it’s used as the standby interface of another net-failover netdev.
>> The way this problem was workaround by Microsoft in NetVSC is to delay the 
>> open done on slave-VF from the open of the NetVSC netdev. However, this is 
>> still a race and thus a hacky solution. It was accepted by community only 
>> because it’s internal to the NetVSC driver. However, similar solution was 
>> rejected by community for the net-failover driver.
>> The solution that we currently proposed to address this (Patch by Si-Wei) 
>> was to change the rename kernel handling to allow a net-failover slave to be 
>> renamed even if it is already opened. Patch is still not accepted.
>> (b) Issues caused because of various userspace components DHCP the 
>> net-failover slaves: DHCP of course should only be done on the net-failover 
>> netdev. Attempting to DHCP on net-failover slaves as-well will cause 
>> networking issues. Therefore, userspace components should be taught to avoid 
>> doing DHCP on the net-failover slaves. The various userspace components 
>> include:
>> b.1) dhclient: If run without parameters, it by default just enum all 
>> netdevs and attempt to DHCP them all.
>> (I don’t think Microsoft has handled this)
>> b.2) initramfs / dracut: In order to mount the root file-system from iSCSI, 
>> these components needs networking and therefore DHCP on all netdevs.
>> (Microsoft haven’t handled (b.2) because they don’t have images which 
>> perform iSCSI boot in their Azure setup. Still an open issue)
>> b.3) cloud-init: If configured to perform network-configuration, it attempts 
>> to configure all available netdevs. It should avoid however doing so on 
>> net-failover slaves.
>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist 
>> a netdev from being configured in case it is owned by a specific PCI driver. 
>> Specifically, they blacklist Mellanox VF driver. However, this technique 
>> doesn’t work for the net-failover mechanism because both the net-failover 
>> netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>> b.4) Various distros network-manager need to be updated to avoid DHCP on 
>> net-failover slaves? (Not sure. Asking...)
>> 
>> 4) Another interesting use-case where the net-failover mechanism is useful 
>> is for handling NIC firmware failures or NIC firmware Live-Upgrade.
>> In both cases, there is a need to perform a full PCIe reset of the NIC. 
>> Which lose all the NIC eSwitch configuration of the various VFs.
> 
> In this setup, how does VF keep going? If it doesn't keep going, why is
> it helpful?

Let me attempt to clarify.

First,

Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-04-19 Thread Liran Alon



> On 6 Mar 2019, at 23:42, si-wei liu  wrote:
> 
> 
> 
> On 3/6/2019 1:36 PM, Samudrala, Sridhar wrote:
>> 
>> On 3/6/2019 1:26 PM, si-wei liu wrote:
>>> 
>>> 
>>> On 3/6/2019 4:04 AM, Jiri Pirko wrote:
> --- a/net/core/failover.c
> +++ b/net/core/failover.c
> @@ -16,6 +16,11 @@
> 
> static LIST_HEAD(failover_list);
> static DEFINE_SPINLOCK(failover_lock);
> +static bool slave_rename_ok = true;
> +
> +module_param(slave_rename_ok, bool, (S_IRUGO | S_IWUSR));
> +MODULE_PARM_DESC(slave_rename_ok,
> +  "If set allow renaming the slave when failover master is up");
> 
 No module parameters please. If you need to set something do it using
 rtnl_link_ops. Thanks.
 
 
>>> I understand what you ask for, but without module parameters userspace 
>>> don't work. During boot (dracut) the virtio netdev gets enslaved earlier 
>>> than when userspace comes up, so failover has to determine the setting 
>>> during initialization/creation. This config is not dynamic, at least for 
>>> the life cycle of a particular failover link it shouldn't be changed. 
>>> Without module parameter, how does the userspace specify this value during 
>>> kernel initialization? 
>>> 
>> Can we enable this by default and not make it configurable via module 
>> parameter?
>> Is there any  usecase where someone expects rename to fail with failover 
>> slaves?
> Probably just cater for those application that assumes fixed name on UP 
> interface?
> 
> It's already the default for the configurable. I myself don't think that's a 
> big problem for failover users. So far there's not even QEMU support I think 
> everything can be changed. I don't feel strong to just fix it without 
> introducing configurable. But maybe Michael or others think it differently...
> 
> If no one objects, I don't feel strong to make it fixed behavior.
> 
> -Siwei
> 

I agree we should just remove the module parameter.

-Liran




Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-04-19 Thread Liran Alon


> On 28 Feb 2019, at 1:50, Michael S. Tsirkin  wrote:
> 
> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
>> 
>> 
>> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
>>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
 
 On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
 On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> On 2/21/2019 7:33 PM, si-wei liu wrote:
>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
 Sorry for replying to this ancient thread. There was some remaining
 issue that I don't think the initial net_failover patch got 
 addressed
 cleanly, see:
 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
 
 The renaming of 'eth0' to 'ens4' fails because the udev userspace 
 was
 not specifically writtten for such kernel automatic enslavement.
 Specifically, if it is a bond or team, the slave would typically 
 get
 renamed *before* virtual device gets created, that's what udev can
 control (without getting netdev opened early by the other part of
 kernel) and other userspace components for e.g. initramfs,
 init-scripts can coordinate well in between. The in-kernel
 auto-enslavement of net_failover breaks this userspace convention,
 which don't provides a solution if user care about consistent 
 naming
 on the slave netdevs specifically.
 
 Previously this issue had been specifically called out when 
 IFF_HIDDEN
 and the 1-netdev was proposed, but no one gives out a solution to 
 this
 problem ever since. Please share your mind how to proceed and solve
 this userspace issue if netdev does not welcome a 1-netdev model.
>>> Above says:
>>> 
>>>   there's no motivation in the systemd/udevd community at
>>>   this point to refactor the rename logic and make it work well 
>>> with
>>>   3-netdev.
>>> 
>>> What would the fix be? Skip slave devices?
>>> 
>> There's nothing user can get if just skipping slave devices - the
>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>> next reboot, while the rest may conform to the naming scheme (ens3
>> and such). There's no way one can fix this in userspace alone - when
>> the failover is created the enslaved netdev was opened by the kernel
>> earlier than the userspace is made aware of, and there's no
>> negotiation protocol for kernel to know when userspace has done
>> initial renaming of the interface. I would expect netdev list should
>> at least provide the direction in general for how this can be
>> solved...
>>> I was just wondering what did you mean when you said
>>> "refactor the rename logic and make it work well with 3-netdev" -
>>> was there a proposal udev rejected?
>> No. I never believed this particular issue can be fixed in userspace 
>> alone.
>> Previously someone had said it could be, but I never see any work or
>> relevant discussion ever happened in various userspace communities (for 
>> e.g.
>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the 
>> root
>> of the issue derives from the kernel, it makes more sense to start from
>> netdev, work out and decide on a solution: see what can be done in the
>> kernel in order to fix it, then after that engage userspace community for
>> the feasibility...
>> 
>>> Anyway, can we write a time diagram for what happens in which order that
>>> leads to failure?  That would help look for triggers that we can tie
>>> into, or add new ones.
>>> 
>> See attached diagram.
>> 
>>> 
>>> 
> Is there an issue if slave device names are not predictable? The 
> user/admin scripts are expected
> to only work with the master failover device.
 Where does this expectation come from?
 
 Admin users may have ethtool or tc configurations that need to deal 
 with
 predictable interface name. Third-party app which was built upon 
 specifying
 certain interface name can't be modified to chase