Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 19:12, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 06:31:35PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 17:50, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
 On Thu, 21 Mar 2019 15:04:37 +0200
 Liran Alon  wrote:
 
>> 
>> OK. Now what happens if master is moved to another namespace? Do we need
>> to move the slaves too?  
> 
> No. Why would we move the slaves? The whole point is to make most 
> customer ignore the net-failover slaves and remain them “hidden” in their 
> dedicated netns.
> We won’t prevent customer from explicitly moving the net-failover slaves 
> out of this netns, but we will not move them out of there automatically.
 
 
 The 2-device netvsc already handles case where master changes namespace.
>>> 
>>> Is it by moving slave with it?
>> 
>> See c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device”).
>> It seems that when NetVSC master netdev changes netns, the VF is moved to 
>> the same netns by the NetVSC driver.
>> Kinda the opposite than what we are suggesting here to make sure that the 
>> net-failover master netdev is on a separate
>> netns than it’s slaves...
>> 
>> -Liran
>> 
>>> 
>>> -- 
>>> MST
> 
> Not exactly opposite I'd say.
> 
> If failover is in host ns, slaves in /primary and /standby, then moving
> failover to /container should move slaves to /container/primary and
> /container/standby.

Yes, I agree.
I meant that they tried to keep the VF in the same netns as the NetVSC netdev.
But of course, what you just described is exactly the functionality I would have
wanted in our net-failover mechanism.
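
For illustration, a minimal kernel-side sketch of that behaviour (move both
slaves when the failover master changes netns, similar in spirit to what
c0a41b887ce6 does for the NetVSC VF). The failover_slaves structure and the
helper name are hypothetical; only dev_change_net_namespace() is the real
kernel API, and the caller is assumed to hold RTNL:

#include <linux/netdevice.h>
#include <linux/rtnetlink.h>
#include <net/net_namespace.h>

/* Hypothetical container for the two net-failover slaves. */
struct failover_slaves {
	struct net_device *primary;	/* the VF */
	struct net_device *standby;	/* the virtio-net netdev */
};

/* Move both slaves into the netns the failover master was just moved to.
 * dev_change_net_namespace() is the helper the kernel already uses to move
 * a netdev between namespaces; "eth%d" is the rename pattern applied if the
 * current name collides in the target netns.
 */
static int failover_slaves_follow_master(struct failover_slaves *slaves,
					 struct net *new_net)
{
	int err;

	ASSERT_RTNL();

	err = dev_change_net_namespace(slaves->primary, new_net, "eth%d");
	if (err)
		return err;

	return dev_change_net_namespace(slaves->standby, new_net, "eth%d");
}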

-Liran

> 
> 
> -- 
> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 15:51, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 03:24:39PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 15:12, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
 
 
> On 21 Mar 2019, at 14:57, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
 2) It brings non-intuitive customer experience. For example, a 
 customer may attempt to analyse connectivity issue by checking the 
 connectivity
 on a net-failover slave (e.g. the VF) but will see no connectivity 
 when in-fact checking the connectivity on the net-failover master 
 netdev shows correct connectivity.
 
 The set of changes I vision to fix our issues are:
 1) Hide net-failover slaves in a different netns created and 
 managed by the kernel. But that user can enter to it and manage 
 the netdevs there if wishes to do so explicitly.
 (E.g. Configure the net-failover VF slave in some special way).
 2) Match the virtio-net and the VF based on a PV attribute instead 
 of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net 
 interface to get PCI slot where the matching VF will be 
 hot-plugged by hypervisor.
 3) Have an explicit virtio-net control message to command 
 hypervisor to switch data-path from virtio-net to VF and 
 vice-versa. Instead of relying on intercepting the PCI master 
 enable-bit
 as an indicator on when VF is about to be set up. (Similar to as 
 done in NetVSC).
 
 Is there any clear issue we see regarding the above suggestion?
 
 -Liran
>>> 
>>> The issue would be this: how do we avoid conflicting with namespaces
>>> created by users?
>> 
>> This is kinda controversial, but maybe separate netns names into 2 
>> groups: hidden and normal.
>> To reference a hidden netns, you need to do it explicitly. 
>> Hidden and normal netns names can collide as they will be maintained 
>> in different namespaces (Yes I’m overloading the term namespace 
>> here…).
> 
> Maybe it's an unnamed namespace. Hidden until userspace gives it a 
> name?
 
 This is also a good idea that will solve the issue. Yes.
 
> 
>> Does this seems reasonable?
>> 
>> -Liran
> 
> Reasonable I'd say yes, easy to implement probably no. But maybe I
> missed a trick or two.
 
 BTW, from a practical point of view, I think that even until we figure 
 out a solution on how to implement this,
 it was better to create an kernel auto-generated name (e.g. 
 “kernel_net_failover_slaves")
 that will break only userspace workloads that by a very rare-chance 
 have a netns that collides with this then
 the breakage we have today for the various userspace components.
 
 -Liran
>>> 
>>> It seems quite easy to supply that as a module parameter. Do we need two
>>> namespaces though? Won't some userspace still be confused by the two
>>> slaves sharing the MAC address?
>> 
>> That’s one reasonable option.
>> Another one is that we will indeed change the mechanism by which we 
>> determine a VF should be bonded with a virtio-net device.
>> i.e. Expose a new virtio-net property that specify the PCI slot of the 
>> VF to be bonded with.
>> 
>> The second seems cleaner but I don’t have a strong opinion on this. Both 
>> seem reasonable to me and your suggestion is faster to implement from 
>> current state of things.
>> 
>> -Liran
> 
> OK. Now what happens if master is moved to another namespace? Do we need
> to move the slaves too?
 
 No. Why would we move the slaves?
>>> 
>>> 
>>> The reason we have 3 device model at all is so users can fine tune the
>>> slaves.
>> 
>> I Agree.
>> 
>>> I don't see why this applies to the root namespace but not
>>> a container. If it has access to failover it should have access
>>> to slaves.
>> 
>> Oh now I see your point. I haven’t thought about the containers usage.
>> My thinking was that customer can always just enter to the “hidden” netns 
>> and configure there whatever he wants.
>> 
>> Do you have a suggestion how to handle this?
>> 
>> One option can be that every "visible" netns on system will have a “hidden” 
>> unnamed netns where the net-failover slaves reside in.
>> If customer wishes to 

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 17:50, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
>> On Thu, 21 Mar 2019 15:04:37 +0200
>> Liran Alon  wrote:
>> 
 
 OK. Now what happens if master is moved to another namespace? Do we need
 to move the slaves too?  
>>> 
>>> No. Why would we move the slaves? The whole point is to make most customer 
>>> ignore the net-failover slaves and remain them “hidden” in their dedicated 
>>> netns.
>>> We won’t prevent customer from explicitly moving the net-failover slaves 
>>> out of this netns, but we will not move them out of there automatically.
>> 
>> 
>> The 2-device netvsc already handles case where master changes namespace.
> 
> Is it by moving slave with it?

See c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device").
It seems that when the NetVSC master netdev changes netns, the VF is moved to the
same netns by the NetVSC driver.
Kinda the opposite of what we are suggesting here, which is to make sure that the
net-failover master netdev is in a separate netns from its slaves...

-Liran

> 
> -- 
> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 15:12, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
 
 
> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>> 2) It brings non-intuitive customer experience. For example, a 
>> customer may attempt to analyse connectivity issue by checking the 
>> connectivity
>> on a net-failover slave (e.g. the VF) but will see no connectivity 
>> when in-fact checking the connectivity on the net-failover master 
>> netdev shows correct connectivity.
>> 
>> The set of changes I vision to fix our issues are:
>> 1) Hide net-failover slaves in a different netns created and managed 
>> by the kernel. But that user can enter to it and manage the netdevs 
>> there if wishes to do so explicitly.
>> (E.g. Configure the net-failover VF slave in some special way).
>> 2) Match the virtio-net and the VF based on a PV attribute instead 
>> of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net 
>> interface to get PCI slot where the matching VF will be hot-plugged 
>> by hypervisor.
>> 3) Have an explicit virtio-net control message to command hypervisor 
>> to switch data-path from virtio-net to VF and vice-versa. Instead of 
>> relying on intercepting the PCI master enable-bit
>> as an indicator on when VF is about to be set up. (Similar to as 
>> done in NetVSC).
>> 
>> Is there any clear issue we see regarding the above suggestion?
>> 
>> -Liran
> 
> The issue would be this: how do we avoid conflicting with namespaces
> created by users?
 
 This is kinda controversial, but maybe separate netns names into 2 
 groups: hidden and normal.
 To reference a hidden netns, you need to do it explicitly. 
 Hidden and normal netns names can collide as they will be maintained 
 in different namespaces (Yes I’m overloading the term namespace here…).
>>> 
>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>> 
>> This is also a good idea that will solve the issue. Yes.
>> 
>>> 
 Does this seems reasonable?
 
 -Liran
>>> 
>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>> missed a trick or two.
>> 
>> BTW, from a practical point of view, I think that even until we figure 
>> out a solution on how to implement this,
>> it was better to create an kernel auto-generated name (e.g. 
>> “kernel_net_failover_slaves")
>> that will break only userspace workloads that by a very rare-chance have 
>> a netns that collides with this then
>> the breakage we have today for the various userspace components.
>> 
>> -Liran
> 
> It seems quite easy to supply that as a module parameter. Do we need two
> namespaces though? Won't some userspace still be confused by the two
> slaves sharing the MAC address?
 
 That’s one reasonable option.
 Another one is that we will indeed change the mechanism by which we 
 determine a VF should be bonded with a virtio-net device.
 i.e. Expose a new virtio-net property that specify the PCI slot of the VF 
 to be bonded with.
 
 The second seems cleaner but I don’t have a strong opinion on this. Both 
 seem reasonable to me and your suggestion is faster to implement from 
 current state of things.
 
 -Liran
>>> 
>>> OK. Now what happens if master is moved to another namespace? Do we need
>>> to move the slaves too?
>> 
>> No. Why would we move the slaves?
> 
> 
> The reason we have 3 device model at all is so users can fine tune the
> slaves.

I agree.

> I don't see why this applies to the root namespace but not
> a container. If it has access to failover it should have access
> to slaves.

Oh, now I see your point. I hadn't thought about the containers use-case.
My thinking was that a customer can always just enter the "hidden" netns and
configure whatever they want there.

Do you have a suggestion for how to handle this?

One option could be that every "visible" netns on the system has a "hidden"
unnamed netns where the net-failover slaves reside.
If a customer wishes to enter that netns and manage the net-failover slaves
explicitly, they will need an updated iproute2 that knows how to enter that
hidden netns (a rough sketch of what that would involve follows below).
Most customers will never need to enter that netns, so it is fine if they don't
have this updated iproute2.
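
To make the "updated iproute2" part concrete, here is a rough userspace sketch,
assuming the kernel exposes some handle to the hidden netns. The /proc path
below is purely hypothetical (how to expose an unnamed netns to userspace is
exactly the open question); the setns()+exec pattern itself is what
`ip netns exec` already does for named namespaces under /var/run/netns:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical handle to the hidden per-netns slaves namespace. */
	int fd = open("/proc/self/net_failover_ns", O_RDONLY);

	if (fd < 0) {
		perror("open hidden netns handle");
		return 1;
	}

	/* setns(2) with CLONE_NEWNET switches this process into the target
	 * network namespace, just like `ip netns exec` does today. */
	if (setns(fd, CLONE_NEWNET) < 0) {
		perror("setns");
		close(fd);
		return 1;
	}
	close(fd);

	/* From here on, the netdevs visible to this process are the hidden
	 * net-failover slaves; exec iproute2 (or configure them directly). */
	execlp("ip", "ip", "link", "show", (char *)NULL);
	perror("execlp");
	return 1;
}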

> 
>> The whole point is to make most customer ignore the net-failover slaves and 
>> remain them “hidden” in 

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 14:57, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
>>> 
>>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
 2) It brings non-intuitive customer experience. For example, a 
 customer may attempt to analyse connectivity issue by checking the 
 connectivity
 on a net-failover slave (e.g. the VF) but will see no connectivity 
 when in-fact checking the connectivity on the net-failover master 
 netdev shows correct connectivity.
 
 The set of changes I vision to fix our issues are:
 1) Hide net-failover slaves in a different netns created and managed 
 by the kernel. But that user can enter to it and manage the netdevs 
 there if wishes to do so explicitly.
 (E.g. Configure the net-failover VF slave in some special way).
 2) Match the virtio-net and the VF based on a PV attribute instead of 
 MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net 
 interface to get PCI slot where the matching VF will be hot-plugged by 
 hypervisor.
 3) Have an explicit virtio-net control message to command hypervisor 
 to switch data-path from virtio-net to VF and vice-versa. Instead of 
 relying on intercepting the PCI master enable-bit
 as an indicator on when VF is about to be set up. (Similar to as done 
 in NetVSC).
 
 Is there any clear issue we see regarding the above suggestion?
 
 -Liran
>>> 
>>> The issue would be this: how do we avoid conflicting with namespaces
>>> created by users?
>> 
>> This is kinda controversial, but maybe separate netns names into 2 
>> groups: hidden and normal.
>> To reference a hidden netns, you need to do it explicitly. 
>> Hidden and normal netns names can collide as they will be maintained in 
>> different namespaces (Yes I’m overloading the term namespace here…).
> 
> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
 
 This is also a good idea that will solve the issue. Yes.
 
> 
>> Does this seems reasonable?
>> 
>> -Liran
> 
> Reasonable I'd say yes, easy to implement probably no. But maybe I
> missed a trick or two.
 
 BTW, from a practical point of view, I think that even until we figure out 
 a solution on how to implement this,
 it was better to create an kernel auto-generated name (e.g. 
 “kernel_net_failover_slaves")
 that will break only userspace workloads that by a very rare-chance have a 
 netns that collides with this then
 the breakage we have today for the various userspace components.
 
 -Liran
>>> 
>>> It seems quite easy to supply that as a module parameter. Do we need two
>>> namespaces though? Won't some userspace still be confused by the two
>>> slaves sharing the MAC address?
>> 
>> That’s one reasonable option.
>> Another one is that we will indeed change the mechanism by which we 
>> determine a VF should be bonded with a virtio-net device.
>> i.e. Expose a new virtio-net property that specify the PCI slot of the VF to 
>> be bonded with.
>> 
>> The second seems cleaner but I don’t have a strong opinion on this. Both 
>> seem reasonable to me and your suggestion is faster to implement from 
>> current state of things.
>> 
>> -Liran
> 
> OK. Now what happens if master is moved to another namespace? Do we need
> to move the slaves too?

No. Why would we move the slaves? The whole point is to have most customers
ignore the net-failover slaves and keep them "hidden" in their dedicated netns.
We won't prevent a customer from explicitly moving the net-failover slaves out
of this netns, but we will not move them out of there automatically.

> 
> Also siwei's patch is then kind of extraneous right?
> Attempts to rename a slave will now fail as it's in a namespace…

I'm not sure, actually. Isn't udev/systemd netns-aware?
I would expect it to be able to assign names to netdevs in a netns other than
the default one as well.
If that's the case, Si-Wei's patch that allows renaming a net-failover slave
while it is already open is still required, as the race condition still exists.

-Liran

> 
>>> 
>>> -- 
>>> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>> 2) It brings non-intuitive customer experience. For example, a customer 
>> may attempt to analyse connectivity issue by checking the connectivity
>> on a net-failover slave (e.g. the VF) but will see no connectivity when 
>> in-fact checking the connectivity on the net-failover master netdev 
>> shows correct connectivity.
>> 
>> The set of changes I vision to fix our issues are:
>> 1) Hide net-failover slaves in a different netns created and managed by 
>> the kernel. But that user can enter to it and manage the netdevs there 
>> if wishes to do so explicitly.
>> (E.g. Configure the net-failover VF slave in some special way).
>> 2) Match the virtio-net and the VF based on a PV attribute instead of 
>> MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net interface 
>> to get PCI slot where the matching VF will be hot-plugged by hypervisor.
>> 3) Have an explicit virtio-net control message to command hypervisor to 
>> switch data-path from virtio-net to VF and vice-versa. Instead of 
>> relying on intercepting the PCI master enable-bit
>> as an indicator on when VF is about to be set up. (Similar to as done in 
>> NetVSC).
>> 
>> Is there any clear issue we see regarding the above suggestion?
>> 
>> -Liran
> 
> The issue would be this: how do we avoid conflicting with namespaces
> created by users?
 
 This is kinda controversial, but maybe separate netns names into 2 groups: 
 hidden and normal.
 To reference a hidden netns, you need to do it explicitly. 
 Hidden and normal netns names can collide as they will be maintained in 
 different namespaces (Yes I’m overloading the term namespace here…).
>>> 
>>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>> 
>> This is also a good idea that will solve the issue. Yes.
>> 
>>> 
 Does this seems reasonable?
 
 -Liran
>>> 
>>> Reasonable I'd say yes, easy to implement probably no. But maybe I
>>> missed a trick or two.
>> 
>> BTW, from a practical point of view, I think that even until we figure out a 
>> solution on how to implement this,
>> it was better to create an kernel auto-generated name (e.g. 
>> “kernel_net_failover_slaves")
>> that will break only userspace workloads that by a very rare-chance have a 
>> netns that collides with this then
>> the breakage we have today for the various userspace components.
>> 
>> -Liran
> 
> It seems quite easy to supply that as a module parameter. Do we need two
> namespaces though? Won't some userspace still be confused by the two
> slaves sharing the MAC address?

That's one reasonable option.
Another is to indeed change the mechanism by which we determine that a VF
should be bonded with a virtio-net device, i.e. expose a new virtio-net
property that specifies the PCI slot of the VF to be bonded with.

The second seems cleaner, but I don't have a strong opinion on this. Both seem
reasonable to me, and your suggestion is faster to implement from the current
state of things.

-Liran

> 
> -- 
> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 0:10, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin  wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
 
 
> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>> 
>> 
>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
>>> 
>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
 On Tue, 19 Mar 2019 14:38:06 +0200
 Liran Alon  wrote:
 
> b.3) cloud-init: If configured to perform network-configuration, it 
> attempts to configure all available netdevs. It should avoid however 
> doing so on net-failover slaves.
> (Microsoft has handled this by adding a mechanism in cloud-init to 
> blacklist a netdev from being configured in case it is owned by a 
> specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
> However, this technique doesn’t work for the net-failover mechanism 
> because both the net-failover netdev and the virtio-net netdev are 
> owned by the virtio-net PCI driver).
 
 Cloud-init should really just ignore all devices that have a master 
 device.
 That would have been more general, and safer for other use cases.
>>> 
>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>> safer to just somehow pretend to userspace that the slave links are
>>> down? And add a special attribute for the actual link state.
>> 
>> I think this may be problematic as it would also break legit use case
>> of userspace attempt to set various config on VF slave.
>> In general, lying to userspace usually leads to problems.
> 
> I hear you on this. So how about instead of lying,
> we basically just fail some accesses to slaves
> unless a flag is set e.g. in ethtool.
> 
> Some userspace will need to change to set it but in a minor way.
> Arguably/hopefully failure to set config would generally be a safer
> failure.
 
 Once userspace will set this new flag by ethtool, all operations done by 
 other userspace components will still work.
>>> 
>>> Sorry about being unclear, the idea would be to require the flag on each 
>>> ethtool operation.
>> 
>> Oh. I have indeed misunderstood your previous email then. :)
>> Thanks for clarifying.
>> 
>>> 
 E.g. Running dhclient without parameters, after this flag was set, will 
 still attempt to perform DHCP on it and will now succeed.
>>> 
>>> I think sending/receiving should probably just fail unconditionally.
>> 
>> You mean that you wish that somehow kernel will prevent Tx on net-failover 
>> slave netdev
>> unless skb is marked with some flag to indicate it has been sent via the 
>> net-failover master?
> 
> We can maybe avoid binding a protocol socket to the device?

That is indeed another possibility that would work to avoid the DHCP issues,
and it would still allow checking connectivity, so it is better.
However, I still think it provides a non-intuitive customer experience.
In addition, I also want to take into account that most customers expect a 1:1
mapping between a vNIC and a netdev, i.e. a cloud instance should show one
netdev if it has one vNIC attached to it.
Customers usually don't care how they get accelerated networking; they just
care that they do.
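
Just to spell out the socket-binding idea in code terms, here is a sketch of
the condition that would refuse binding protocol sockets (e.g. the AF_PACKET
sockets dhclient uses for DHCP) to a net-failover slave. IFF_FAILOVER_SLAVE is
the real priv flag set on the slaves; where exactly such a check would be
hooked (packet bind, SO_BINDTODEVICE, or both) is left open:

#include <linux/netdevice.h>

/* Sketch only: allow protocol-socket binding to a netdev unless it is a
 * net-failover slave.  The callers that would consult this helper are not
 * shown and would need to be decided case by case.
 */
static bool netdev_allows_socket_bind(const struct net_device *dev)
{
	return !(dev->priv_flags & IFF_FAILOVER_SLAVE);
}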

> 
>> This indeed resolves the group of userspace issues around performing DHCP on 
>> net-failover slaves directly (By dracut/initramfs, dhclient and etc.).
>> 
>> However, I see a couple of down-sides to it:
>> 1) It doesn’t resolve all userspace issues listed in this email thread. For 
>> example, cloud-init will still attempt to perform network config on 
>> net-failover slaves.
>> It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev 
>> rules that match only by MAC.
> 
> 
> How about we fail to retrieve mac from the slave?

That would work, but I think it is cleaner to simply not pair the PV and VF
devices based on them having the same MAC.

> 
>> 2) It brings non-intuitive customer experience. For example, a customer may 
>> attempt to analyse connectivity issue by checking the connectivity
>> on a net-failover slave (e.g. the VF) but will see no connectivity when 
>> in-fact checking the connectivity on the net-failover master netdev shows 
>> correct connectivity.
>> 
>> The set of changes I vision to fix our issues are:
>> 1) Hide net-failover slaves in a different netns created and managed by the 
>> kernel. But that user can enter to it and manage the netdevs there if wishes 
>> to do so explicitly.
>> (E.g. Configure the net-failover VF slave in some special way).
>> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC. 
>> (Similar to as done in NetVSC). E.g. 

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 21 Mar 2019, at 10:58, Michael S. Tsirkin  wrote:
> 
> On Thu, Mar 21, 2019 at 12:19:22AM +0200, Liran Alon wrote:
>> 
>> 
>>> On 21 Mar 2019, at 0:10, Michael S. Tsirkin  wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
 
 
> On 20 Mar 2019, at 16:09, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
 
 
> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
> 
> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>> On Tue, 19 Mar 2019 14:38:06 +0200
>> Liran Alon  wrote:
>> 
>>> b.3) cloud-init: If configured to perform network-configuration, it 
>>> attempts to configure all available netdevs. It should avoid 
>>> however doing so on net-failover slaves.
>>> (Microsoft has handled this by adding a mechanism in cloud-init to 
>>> blacklist a netdev from being configured in case it is owned by a 
>>> specific PCI driver. Specifically, they blacklist Mellanox VF 
>>> driver. However, this technique doesn’t work for the net-failover 
>>> mechanism because both the net-failover netdev and the virtio-net 
>>> netdev are owned by the virtio-net PCI driver).
>> 
>> Cloud-init should really just ignore all devices that have a master 
>> device.
>> That would have been more general, and safer for other use cases.
> 
> Given lots of userspace doesn't do this, I wonder whether it would be
> safer to just somehow pretend to userspace that the slave links are
> down? And add a special attribute for the actual link state.
 
 I think this may be problematic as it would also break legit use case
 of userspace attempt to set various config on VF slave.
 In general, lying to userspace usually leads to problems.
>>> 
>>> I hear you on this. So how about instead of lying,
>>> we basically just fail some accesses to slaves
>>> unless a flag is set e.g. in ethtool.
>>> 
>>> Some userspace will need to change to set it but in a minor way.
>>> Arguably/hopefully failure to set config would generally be a safer
>>> failure.
>> 
>> Once userspace will set this new flag by ethtool, all operations done by 
>> other userspace components will still work.
> 
> Sorry about being unclear, the idea would be to require the flag on each 
> ethtool operation.
 
 Oh. I have indeed misunderstood your previous email then. :)
 Thanks for clarifying.
 
> 
>> E.g. Running dhclient without parameters, after this flag was set, will 
>> still attempt to perform DHCP on it and will now succeed.
> 
> I think sending/receiving should probably just fail unconditionally.
 
 You mean that you wish that somehow kernel will prevent Tx on net-failover 
 slave netdev
 unless skb is marked with some flag to indicate it has been sent via the 
 net-failover master?
>>> 
>>> We can maybe avoid binding a protocol socket to the device?
>> 
>> That is indeed another possibility that would work to avoid the DHCP issues.
>> And will still allow checking connectivity. So it is better.
>> However, I still think it provides an non-intuitive customer experience.
>> In addition, I also want to take into account that most customers are 
>> expected a 1:1 mapping between a vNIC and a netdev.
>> i.e. A cloud instance should show 1-netdev if it has one vNIC attached to it 
>> defined.
>> Customers usually don’t care how they get accelerated networking. They just 
>> care they do.
>> 
>>> 
 This indeed resolves the group of userspace issues around performing DHCP 
 on net-failover slaves directly (By dracut/initramfs, dhclient and etc.).
 
 However, I see a couple of down-sides to it:
 1) It doesn’t resolve all userspace issues listed in this email thread. 
 For example, cloud-init will still attempt to perform network config on 
 net-failover slaves.
 It also doesn’t help with regard to Ubuntu’s netplan issue that creates 
 udev rules that match only by MAC.
>>> 
>>> 
>>> How about we fail to retrieve mac from the slave?
>> 
>> That would work but I think it is cleaner to just not bind PV and VF based 
>> on having the same MAC.
> 
> There's a reference to that under "Non-MAC based pairing".
> 
> I'll look into making it more explicit.

Yes I know. I was referring to what you described in that section.

> 
>>> 
 2) It brings non-intuitive customer experience. For example, a customer 
 may attempt to analyse connectivity issue by checking the connectivity
 on a net-failover slave (e.g. the VF) but will see no connectivity when 

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>> 
>> 
>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
>>> 
>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
 On Tue, 19 Mar 2019 14:38:06 +0200
 Liran Alon  wrote:
 
> b.3) cloud-init: If configured to perform network-configuration, it 
> attempts to configure all available netdevs. It should avoid however 
> doing so on net-failover slaves.
> (Microsoft has handled this by adding a mechanism in cloud-init to 
> blacklist a netdev from being configured in case it is owned by a 
> specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
> However, this technique doesn’t work for the net-failover mechanism 
> because both the net-failover netdev and the virtio-net netdev are owned 
> by the virtio-net PCI driver).
 
 Cloud-init should really just ignore all devices that have a master device.
 That would have been more general, and safer for other use cases.
>>> 
>>> Given lots of userspace doesn't do this, I wonder whether it would be
>>> safer to just somehow pretend to userspace that the slave links are
>>> down? And add a special attribute for the actual link state.
>> 
>> I think this may be problematic as it would also break legit use case
>> of userspace attempt to set various config on VF slave.
>> In general, lying to userspace usually leads to problems.
> 
> I hear you on this. So how about instead of lying,
> we basically just fail some accesses to slaves
> unless a flag is set e.g. in ethtool.
> 
> Some userspace will need to change to set it but in a minor way.
> Arguably/hopefully failure to set config would generally be a safer
> failure.

Once userspace sets this new flag via ethtool, all operations done by other
userspace components will work again.
E.g. running dhclient without parameters after this flag was set will still
attempt to perform DHCP on the slave, and will now succeed.

Therefore, this proposal effectively just delays the point at which the
net-failover slave can be operated on by userspace.
But what we actually want is to never allow a net-failover slave to be operated
on by userspace unless userspace explicitly states that it wishes to perform a
set of actions on the net-failover slave.

That would be achieved if, for example, the net-failover slaves were in a
different netns than the default netns.
This also aligns with the expected customer experience: most customers just
want to see a 1:1 mapping between a vNIC and a visible netdev.
But of course, maybe there are other ideas that can achieve similar behaviour.
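
For reference, a sketch of how the flag-gated slave access you describe could
look, assuming it is enforced through the ethtool_ops->begin() hook (which
really does run before every ethtool operation and can veto it). The
allow_slave_config bit and how userspace would set it are assumptions, not an
existing interface:

#include <linux/ethtool.h>
#include <linux/netdevice.h>

struct failover_slave_priv {
	bool allow_slave_config;	/* hypothetical opt-in bit set by userspace */
};

static int failover_slave_ethtool_begin(struct net_device *dev)
{
	struct failover_slave_priv *priv = netdev_priv(dev);

	/* Refuse every ethtool operation on the slave until userspace has
	 * explicitly opted in; the master netdev stays fully usable. */
	if (!priv->allow_slave_config)
		return -EPERM;
	return 0;
}

static const struct ethtool_ops failover_slave_ethtool_ops = {
	.begin	= failover_slave_ethtool_begin,
	/* remaining ops would be forwarded to the slave's own driver */
};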

-Liran

> 
> Which things to fail? Probably sending/receiving packets?  Getting MAC?
> More?
> 
>> If we reach
>> to a scenario where we try to avoid userspace issues generically and
>> not on a userspace component basis, I believe the right path should be
>> to hide the net-failover slaves such that explicit action is required
>> to actually manipulate them (As described in blog-post). E.g.
>> Automatically move net-failover slaves by kernel to a different netns.
>> 
>> -Liran
>> 
>>> 
>>> -- 
>>> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
> 
> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>> On Tue, 19 Mar 2019 14:38:06 +0200
>> Liran Alon  wrote:
>> 
>>> b.3) cloud-init: If configured to perform network-configuration, it 
>>> attempts to configure all available netdevs. It should avoid however doing 
>>> so on net-failover slaves.
>>> (Microsoft has handled this by adding a mechanism in cloud-init to 
>>> blacklist a netdev from being configured in case it is owned by a specific 
>>> PCI driver. Specifically, they blacklist Mellanox VF driver. However, this 
>>> technique doesn’t work for the net-failover mechanism because both the 
>>> net-failover netdev and the virtio-net netdev are owned by the virtio-net 
>>> PCI driver).
>> 
>> Cloud-init should really just ignore all devices that have a master device.
>> That would have been more general, and safer for other use cases.
> 
> Given lots of userspace doesn't do this, I wonder whether it would be
> safer to just somehow pretend to userspace that the slave links are
> down? And add a special attribute for the actual link state.

I think this may be problematic, as it would also break the legitimate use case
of userspace attempting to set various config on the VF slave.
In general, lying to userspace usually leads to problems. If we reach a point
where we try to avoid userspace issues generically, rather than on a
per-userspace-component basis, I believe the right path is to hide the
net-failover slaves such that explicit action is required to actually
manipulate them (as described in the blog post), e.g. have the kernel
automatically move the net-failover slaves to a different netns.

-Liran

> 
> -- 
> MST


Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon


> On 20 Mar 2019, at 16:09, Michael S. Tsirkin  wrote:
> 
> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>> 
>> 
>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
>>> 
>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
 
 
> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
> 
> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>> On Tue, 19 Mar 2019 14:38:06 +0200
>> Liran Alon  wrote:
>> 
>>> b.3) cloud-init: If configured to perform network-configuration, it 
>>> attempts to configure all available netdevs. It should avoid however 
>>> doing so on net-failover slaves.
>>> (Microsoft has handled this by adding a mechanism in cloud-init to 
>>> blacklist a netdev from being configured in case it is owned by a 
>>> specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
>>> However, this technique doesn’t work for the net-failover mechanism 
>>> because both the net-failover netdev and the virtio-net netdev are 
>>> owned by the virtio-net PCI driver).
>> 
>> Cloud-init should really just ignore all devices that have a master 
>> device.
>> That would have been more general, and safer for other use cases.
> 
> Given lots of userspace doesn't do this, I wonder whether it would be
> safer to just somehow pretend to userspace that the slave links are
> down? And add a special attribute for the actual link state.
 
 I think this may be problematic as it would also break legit use case
 of userspace attempt to set various config on VF slave.
 In general, lying to userspace usually leads to problems.
>>> 
>>> I hear you on this. So how about instead of lying,
>>> we basically just fail some accesses to slaves
>>> unless a flag is set e.g. in ethtool.
>>> 
>>> Some userspace will need to change to set it but in a minor way.
>>> Arguably/hopefully failure to set config would generally be a safer
>>> failure.
>> 
>> Once userspace will set this new flag by ethtool, all operations done by 
>> other userspace components will still work.
> 
> Sorry about being unclear, the idea would be to require the flag on each 
> ethtool operation.

Oh. I have indeed misunderstood your previous email then. :)
Thanks for clarifying.

> 
>> E.g. Running dhclient without parameters, after this flag was set, will 
>> still attempt to perform DHCP on it and will now succeed.
> 
> I think sending/receiving should probably just fail unconditionally.

You mean that you wish the kernel to somehow prevent Tx on a net-failover slave
netdev unless the skb is marked with some flag indicating it was sent via the
net-failover master?

This indeed resolves the group of userspace issues around performing DHCP on
net-failover slaves directly (by dracut/initramfs, dhclient, etc.).

However, I see a couple of downsides to it:
1) It doesn't resolve all the userspace issues listed in this email thread. For
example, cloud-init will still attempt to perform network configuration on
net-failover slaves.
It also doesn't help with Ubuntu's netplan issue, which creates udev rules that
match only by MAC.
2) It brings a non-intuitive customer experience. For example, a customer may
attempt to analyse a connectivity issue by checking the connectivity on a
net-failover slave (e.g. the VF) and see no connectivity, when in fact checking
the connectivity on the net-failover master netdev shows correct connectivity.

The set of changes I envision to fix our issues is:
1) Hide the net-failover slaves in a different netns created and managed by the
kernel, such that a user can still enter it and manage the netdevs there if they
explicitly wish to do so
(e.g. to configure the net-failover VF slave in some special way).
2) Match the virtio-net device and the VF based on a PV attribute instead of the
MAC (similar to what is done in NetVSC), e.g. provide a virtio-net interface to
get the PCI slot where the matching VF will be hot-plugged by the hypervisor
(a rough sketch of such a pairing attribute follows below).
3) Have an explicit virtio-net control message to command the hypervisor to
switch the data path from virtio-net to the VF and vice versa, instead of
relying on intercepting the PCI bus-master enable bit as an indicator of when
the VF is about to be set up (similar to what is done in NetVSC).
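
To make item (2) a bit more concrete, here is an illustrative-only sketch of
reading a PCI-slot pairing hint from virtio-net config space. Neither the
VIRTIO_NET_F_STANDBY_SLOT feature bit nor the standby_slot field exists in the
virtio spec; they are stand-ins for whatever attribute would eventually be
standardized. Only virtio_has_feature() and the virtio_cread() config-read
helper are real APIs:

#include <linux/virtio.h>
#include <linux/virtio_config.h>

#define VIRTIO_NET_F_STANDBY_SLOT	56	/* hypothetical feature bit (number arbitrary) */

struct virtio_net_config_slot_ext {		/* hypothetical config-space layout */
	__u32 standby_slot;			/* PCI slot where the matching VF will be hot-plugged */
};

static int virtnet_read_standby_slot(struct virtio_device *vdev, u32 *slot)
{
	if (!virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY_SLOT))
		return -EOPNOTSUPP;

	/* virtio_cread() is the standard helper for reading device config
	 * space; the failover core would then pair a hot-plugged VF whose
	 * PCI slot equals *slot, instead of comparing MAC addresses. */
	virtio_cread(vdev, struct virtio_net_config_slot_ext, standby_slot, slot);
	return 0;
}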

Do we see any clear issue with the above suggestion?

-Liran

> 
>> Therefore, this proposal just effectively delays when the net-failover slave 
>> can be operated on by userspace.
>> But what we actually want is to never allow a net-failover slave to be 
>> operated by userspace unless it is explicitly stated
>> by userspace that it wishes to perform a set of actions on the net-failover 
>> slave.
>> 
>> Something that was achieved if, for example, the net-failover slaves were in 
>> a different netns than default netns.
>> This also aligns with expected customer experience that most customers just 
>> want to see a 1:1 mapping between a vNIC and a visible netdev.
>> But of 

Re: [summary] virtio network device failover writeup

2019-04-19 Thread Liran Alon
Hi Michael,

Great blog post, which summarises everything very well!

Some comments I have:

1) I think that when we use the term "1-netdev model" in community discussion,
we tend to refer to what you have defined in the blog post as the "3-device
model with hidden slaves".
Therefore, I would suggest just removing the "1-netdev model" section and
renaming the "3-device model with hidden slaves" section to "1-netdev model".

2) The userspace issues arise with both the "2-netdev model" and the "3-netdev
model". However, the blog post describes them as if they only exist with the
"3-netdev model".
The reason these issues are not seen in the Azure environment is that they were
partially handled by Microsoft for their specific 2-netdev model.
Which leads me to the next comment.

3) I suggest that the blog post also elaborate on exactly which userspace
issues push us towards models other than the "1-netdev model".
The issues that I'm aware of are (please tell me if you are aware of others!):
(a) udev rename race condition: When the net-failover device is opened, it also
opens its slaves. However, the order of the KOBJ_ADD events delivered to udev is
first the net-failover netdev and only then the virtio-net netdev. This means
that if userspace responds to the first event by opening the net-failover
netdev, then any attempt by userspace to rename the virtio-net netdev in
response to the second event will fail, because the virtio-net netdev is already
open. Also note that this udev rename rule is useful, because we would like to
add rules that rename the virtio-net netdev to clearly signal that it is used as
the standby interface of another net-failover netdev.
The way Microsoft worked around this problem in NetVSC is to delay the open of
the slave VF relative to the open of the NetVSC netdev. However, this is still a
race and thus a hacky solution. It was accepted by the community only because it
is internal to the NetVSC driver; a similar solution was rejected by the
community for the net-failover driver.
The solution we currently propose to address this (a patch by Si-Wei) is to
change the kernel's rename handling to allow a net-failover slave to be renamed
even if it is already open (see the rename sketch after this list). The patch is
still not accepted.
(b) Issues caused by various userspace components running DHCP on the
net-failover slaves: DHCP should of course only be done on the net-failover
netdev. Attempting DHCP on the net-failover slaves as well will cause networking
issues. Therefore, userspace components should be taught to avoid doing DHCP on
the net-failover slaves. The various userspace components include:
b.1) dhclient: If run without parameters, by default it just enumerates all
netdevs and attempts to DHCP them all.
(I don't think Microsoft has handled this.)
b.2) initramfs / dracut: In order to mount the root file-system from iSCSI,
these components need networking and therefore DHCP on all netdevs.
(Microsoft hasn't handled (b.2) because they don't have images which perform
iSCSI boot in their Azure setup. Still an open issue.)
b.3) cloud-init: If configured to perform network configuration, it attempts to
configure all available netdevs. It should, however, avoid doing so on
net-failover slaves.
(Microsoft has handled this by adding a mechanism in cloud-init to blacklist a
netdev from being configured in case it is owned by a specific PCI driver.
Specifically, they blacklist the Mellanox VF driver. However, this technique
doesn't work for the net-failover mechanism, because both the net-failover
netdev and the virtio-net netdev are owned by the virtio-net PCI driver.)
b.4) The network managers of various distros may need to be updated to avoid
DHCP on net-failover slaves? (Not sure. Asking...)

4) Another interesting use-case where the net-failover mechanism is useful is
handling NIC firmware failures or NIC firmware live-upgrade.
In both cases, there is a need to perform a full PCIe reset of the NIC, which
loses all of the NIC's eSwitch configuration for the various VFs.
To handle these cases gracefully, one could simply hot-unplug all VFs from the
guests running on the host (which makes all guests use the virtio-net netdev,
which is backed by a netdev that is eventually on top of the PF). Networking is
therefore restored to the guests once the PCIe reset completes and the PF is
functional again. To re-accelerate the guests' networking, the hypervisor can
just hot-plug new VFs into the guests.
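
Going back to the pending item in (3)(a), the rename relaxation would look
roughly like the sketch below. The kernel currently refuses to rename a netdev
that is up, which is what makes udev lose the race once the failover master has
opened its slave; the idea is to exempt auto-enslaved devices from that check.
IFF_FAILOVER_SLAVE is a real priv flag, but the helper below is only an
illustration and may not match Si-Wei's actual patch:

#include <linux/netdevice.h>

/* Illustrative predicate for the rename path: renaming a running device is
 * normally refused with -EBUSY, but a net-failover slave was opened by the
 * kernel (not by the administrator), so letting udev rename it while it is
 * up should be safe.
 */
static bool netdev_rename_allowed(const struct net_device *dev)
{
	if (!(dev->flags & IFF_UP))
		return true;

	return dev->priv_flags & IFF_FAILOVER_SLAVE;
}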

P.S.:
I would very much appreciate this forum's help in closing on the pending items
listed in (3), which currently prevent using this net-failover mechanism in real
production use-cases.

Regards,
-Liran

> On 17 Mar 2019, at 15:55, Michael S. Tsirkin  wrote:
> 
> Hi all,
> I've put up a blog post with a summary of where network
> device failover stands and some open issues.
> Not sure where best to host it, I just put it up on blogspot:
> 

Re: [summary] virtio network device failover writeup

2019-03-21 Thread Michael S. Tsirkin
On Thu, Mar 21, 2019 at 06:31:35PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 17:50, Michael S. Tsirkin  wrote:
> > 
> > On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
> >> On Thu, 21 Mar 2019 15:04:37 +0200
> >> Liran Alon  wrote:
> >> 
>  
>  OK. Now what happens if master is moved to another namespace? Do we need
>  to move the slaves too?  
> >>> 
> >>> No. Why would we move the slaves? The whole point is to make most 
> >>> customer ignore the net-failover slaves and remain them “hidden” in their 
> >>> dedicated netns.
> >>> We won’t prevent customer from explicitly moving the net-failover slaves 
> >>> out of this netns, but we will not move them out of there automatically.
> >> 
> >> 
> >> The 2-device netvsc already handles case where master changes namespace.
> > 
> > Is it by moving slave with it?
> 
> See c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device”).
> It seems that when NetVSC master netdev changes netns, the VF is moved to the 
> same netns by the NetVSC driver.
> Kinda the opposite than what we are suggesting here to make sure that the 
> net-failover master netdev is on a separate
> netns than it’s slaves...
> 
> -Liran
> 
> > 
> > -- 
> > MST

Not exactly the opposite, I'd say.

If failover is in host ns, slaves in /primary and /standby, then moving
failover to /container should move slaves to /container/primary and
/container/standby.


-- 
MST

Re: [summary] virtio network device failover writeup

2019-03-21 Thread Michael S. Tsirkin
On Thu, Mar 21, 2019 at 08:45:17AM -0700, Stephen Hemminger wrote:
> On Thu, 21 Mar 2019 15:04:37 +0200
> Liran Alon  wrote:
> 
> > > 
> > > OK. Now what happens if master is moved to another namespace? Do we need
> > > to move the slaves too?  
> > 
> > No. Why would we move the slaves? The whole point is to make most customer 
> > ignore the net-failover slaves and remain them “hidden” in their dedicated 
> > netns.
> > We won’t prevent customer from explicitly moving the net-failover slaves 
> > out of this netns, but we will not move them out of there automatically.
> 
> 
> The 2-device netvsc already handles case where master changes namespace.

Is it by moving slave with it?

-- 
MST

Re: [summary] virtio network device failover writeup

2019-03-21 Thread Stephen Hemminger
On Thu, 21 Mar 2019 15:04:37 +0200
Liran Alon  wrote:

> > 
> > OK. Now what happens if master is moved to another namespace? Do we need
> > to move the slaves too?  
> 
> No. Why would we move the slaves? The whole point is to make most customer 
> ignore the net-failover slaves and remain them “hidden” in their dedicated 
> netns.
> We won’t prevent customer from explicitly moving the net-failover slaves out 
> of this netns, but we will not move them out of there automatically.


The 2-device netvsc already handles the case where the master changes namespace.

Re: [summary] virtio network device failover writeup

2019-03-21 Thread Stephen Hemminger
On Thu, 21 Mar 2019 08:57:03 -0400
"Michael S. Tsirkin"  wrote:

> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> > 
> >   
> > > On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
> > > 
> > > On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:  
> > >> 2) It brings non-intuitive customer experience. For example, a 
> > >> customer may attempt to analyse connectivity issue by checking the 
> > >> connectivity
> > >> on a net-failover slave (e.g. the VF) but will see no connectivity 
> > >> when in-fact checking the connectivity on the net-failover master 
> > >> netdev shows correct connectivity.
> > >> 
> > >> The set of changes I vision to fix our issues are:
> > >> 1) Hide net-failover slaves in a different netns created and managed 
> > >> by the kernel. But that user can enter to it and manage the netdevs 
> > >> there if wishes to do so explicitly.
> > >> (E.g. Configure the net-failover VF slave in some special way).
> > >> 2) Match the virtio-net and the VF based on a PV attribute instead 
> > >> of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net 
> > >> interface to get PCI slot where the matching VF will be hot-plugged 
> > >> by hypervisor.
> > >> 3) Have an explicit virtio-net control message to command hypervisor 
> > >> to switch data-path from virtio-net to VF and vice-versa. Instead of 
> > >> relying on intercepting the PCI master enable-bit
> > >> as an indicator on when VF is about to be set up. (Similar to as 
> > >> done in NetVSC).
> > >> 
> > >> Is there any clear issue we see regarding the above suggestion?
> > >> 
> > >> -Liran  
> > > 
> > > The issue would be this: how do we avoid conflicting with namespaces
> > > created by users?  
> >  
> >  This is kinda controversial, but maybe separate netns names into 2 
> >  groups: hidden and normal.
> >  To reference a hidden netns, you need to do it explicitly. 
> >  Hidden and normal netns names can collide as they will be maintained 
> >  in different namespaces (Yes I’m overloading the term namespace 
> >  here…).  
> > >>> 
> > >>> Maybe it's an unnamed namespace. Hidden until userspace gives it a 
> > >>> name?  
> > >> 
> > >> This is also a good idea that will solve the issue. Yes.
> > >>   
> > >>>   
> >  Does this seems reasonable?
> >  
> >  -Liran  
> > >>> 
> > >>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> > >>> missed a trick or two.  
> > >> 
> > >> BTW, from a practical point of view, I think that even until we figure 
> > >> out a solution on how to implement this,
> > >> it was better to create an kernel auto-generated name (e.g. 
> > >> “kernel_net_failover_slaves")
> > >> that will break only userspace workloads that by a very rare-chance have 
> > >> a netns that collides with this then
> > >> the breakage we have today for the various userspace components.
> > >> 
> > >> -Liran  
> > > 
> > > It seems quite easy to supply that as a module parameter. Do we need two
> > > namespaces though? Won't some userspace still be confused by the two
> > > slaves sharing the MAC address?  
> > 
> > That’s one reasonable option.
> > Another one is that we will indeed change the mechanism by which we 
> > determine a VF should be bonded with a virtio-net device.
> > i.e. Expose a new virtio-net property that specify the PCI slot of the VF 
> > to be bonded with.
> > 
> > The second seems cleaner but I don’t have a strong opinion on this. Both 
> > seem reasonable to me and your suggestion is faster to implement from 
> > current state of things.
> > 
> > -Liran  
> 
> OK. Now what happens if master is moved to another namespace? Do we need
> to move the slaves too?
> 
> Also siwei's patch is then kind of extraneous right?
> Attempts to rename a slave will now fail as it's in a namespace...

I did try moving the slave device into a namespace at one point.
The problem is that it introduces all sorts of locking problems in the code,
because you can't do it directly in the context of the callback that fires when
a new slave device is discovered.

Since you can't safely change a device's namespace in the notifier, it requires
a work queue. Then you add more complexity and error cases, because the slave is
exposed for a short period, and you have to handle all the state-race unwinds...

Good idea but hard to implement.
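
A sketch of the work-queue approach described above, to show where the
complexity comes from: the netdevice notifier cannot change the device's netns
directly, so it queues a work item, and between the notifier and the work item
the slave is briefly visible in its original netns (the race mentioned above).
The notifier/workqueue plumbing uses the normal kernel APIs; which devices
qualify and what the hidden target netns is are assumptions:

#include <linux/netdevice.h>
#include <linux/rtnetlink.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <net/net_namespace.h>

struct slave_move_work {
	struct work_struct work;
	struct net_device *dev;		/* held via dev_hold() until the move is done */
	struct net *target_net;		/* the hidden slaves netns */
};

static void slave_move_workfn(struct work_struct *work)
{
	struct slave_move_work *mw = container_of(work, struct slave_move_work, work);

	rtnl_lock();
	/* Between the notifier and this work item the slave was briefly
	 * visible in its original netns -- that window is the race. */
	dev_change_net_namespace(mw->dev, mw->target_net, "eth%d");
	rtnl_unlock();

	dev_put(mw->dev);
	kfree(mw);
}

static int slave_netdev_event(struct notifier_block *nb,
			      unsigned long event, void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
	struct slave_move_work *mw;

	if (event != NETDEV_REGISTER || !(dev->priv_flags & IFF_FAILOVER_SLAVE))
		return NOTIFY_DONE;

	mw = kzalloc(sizeof(*mw), GFP_ATOMIC);
	if (!mw)
		return NOTIFY_DONE;

	dev_hold(dev);
	mw->dev = dev;
	mw->target_net = &init_net;	/* placeholder: would be the kernel-managed hidden netns */
	INIT_WORK(&mw->work, slave_move_workfn);
	schedule_work(&mw->work);

	return NOTIFY_OK;
}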

Re: [summary] virtio network device failover writeup

2019-03-21 Thread Michael S. Tsirkin
On Thu, Mar 21, 2019 at 04:16:14PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 15:51, Michael S. Tsirkin  wrote:
> > 
> > On Thu, Mar 21, 2019 at 03:24:39PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 21 Mar 2019, at 15:12, Michael S. Tsirkin  wrote:
> >>> 
> >>> On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
>  
>  
> > On 21 Mar 2019, at 14:57, Michael S. Tsirkin  wrote:
> > 
> > On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
> >>> 
> >>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>  2) It brings non-intuitive customer experience. For example, a 
>  customer may attempt to analyse connectivity issue by checking 
>  the connectivity
>  on a net-failover slave (e.g. the VF) but will see no 
>  connectivity when in-fact checking the connectivity on the 
>  net-failover master netdev shows correct connectivity.
>  
>  The set of changes I vision to fix our issues are:
>  1) Hide net-failover slaves in a different netns created and 
>  managed by the kernel. But that user can enter to it and manage 
>  the netdevs there if wishes to do so explicitly.
>  (E.g. Configure the net-failover VF slave in some special way).
>  2) Match the virtio-net and the VF based on a PV attribute 
>  instead of MAC. (Similar to as done in NetVSC). E.g. Provide a 
>  virtio-net interface to get PCI slot where the matching VF will 
>  be hot-plugged by hypervisor.
>  3) Have an explicit virtio-net control message to command 
>  hypervisor to switch data-path from virtio-net to VF and 
>  vice-versa. Instead of relying on intercepting the PCI master 
>  enable-bit
>  as an indicator on when VF is about to be set up. (Similar to as 
>  done in NetVSC).
>  
>  Is there any clear issue we see regarding the above suggestion?
>  
>  -Liran
> >>> 
> >>> The issue would be this: how do we avoid conflicting with 
> >>> namespaces
> >>> created by users?
> >> 
> >> This is kinda controversial, but maybe separate netns names into 2 
> >> groups: hidden and normal.
> >> To reference a hidden netns, you need to do it explicitly. 
> >> Hidden and normal netns names can collide as they will be 
> >> maintained in different namespaces (Yes I’m overloading the term 
> >> namespace here…).
> > 
> > Maybe it's an unnamed namespace. Hidden until userspace gives it a 
> > name?
>  
>  This is also a good idea that will solve the issue. Yes.
>  
> > 
> >> Does this seems reasonable?
> >> 
> >> -Liran
> > 
> > Reasonable I'd say yes, easy to implement probably no. But maybe I
> > missed a trick or two.
>  
>  BTW, from a practical point of view, I think that even until we 
>  figure out a solution on how to implement this,
>  it was better to create an kernel auto-generated name (e.g. 
>  “kernel_net_failover_slaves")
>  that will break only userspace workloads that by a very rare-chance 
>  have a netns that collides with this then
>  the breakage we have today for the various userspace components.
>  
>  -Liran
> >>> 
> >>> It seems quite easy to supply that as a module parameter. Do we need 
> >>> two
> >>> namespaces though? Won't some userspace still be confused by the two
> >>> slaves sharing the MAC address?
> >> 
> >> That’s one reasonable option.
> >> Another one is that we will indeed change the mechanism by which we 
> >> determine a VF should be bonded with a virtio-net device.
> >> i.e. Expose a new virtio-net property that specify the PCI slot of the 
> >> VF to be bonded with.
> >> 
> >> The second seems cleaner but I don’t have a strong opinion on this. 
> >> Both seem reasonable to me and your suggestion is faster to implement 
> >> from current state of things.
> >> 
> >> -Liran
> > 
> > OK. Now what happens if master is moved to another namespace? Do we need
> > to move the slaves too?
>  
>  No. Why would we move the slaves?
> >>> 
> >>> 
> >>> The reason we have 3 device model at all is so users can fine tune the
> >>> slaves.
> >> 
> >> I Agree.
> >> 
> >>> I don't see why this applies to the root namespace but not
> >>> a container. If it has access to failover it should have access
> >>> to slaves.
> >> 
> >> Oh now I see your point. I haven’t thought about the containers usage.
> >> My thinking was that 

Re: [summary] virtio network device failover writeup

2019-03-21 Thread Michael S. Tsirkin
On Thu, Mar 21, 2019 at 03:24:39PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 15:12, Michael S. Tsirkin  wrote:
> > 
> > On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 21 Mar 2019, at 14:57, Michael S. Tsirkin  wrote:
> >>> 
> >>> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
>  
>  
> > On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
> > 
> > On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
> >> 2) It brings non-intuitive customer experience. For example, a 
> >> customer may attempt to analyse connectivity issue by checking the 
> >> connectivity
> >> on a net-failover slave (e.g. the VF) but will see no connectivity 
> >> when in-fact checking the connectivity on the net-failover master 
> >> netdev shows correct connectivity.
> >> 
> >> The set of changes I vision to fix our issues are:
> >> 1) Hide net-failover slaves in a different netns created and 
> >> managed by the kernel. But that user can enter to it and manage 
> >> the netdevs there if wishes to do so explicitly.
> >> (E.g. Configure the net-failover VF slave in some special way).
> >> 2) Match the virtio-net and the VF based on a PV attribute instead 
> >> of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net 
> >> interface to get PCI slot where the matching VF will be 
> >> hot-plugged by hypervisor.
> >> 3) Have an explicit virtio-net control message to command 
> >> hypervisor to switch data-path from virtio-net to VF and 
> >> vice-versa. Instead of relying on intercepting the PCI master 
> >> enable-bit
> >> as an indicator on when VF is about to be set up. (Similar to as 
> >> done in NetVSC).
> >> 
> >> Is there any clear issue we see regarding the above suggestion?
> >> 
> >> -Liran
> > 
> > The issue would be this: how do we avoid conflicting with namespaces
> > created by users?
>  
>  This is kinda controversial, but maybe separate netns names into 2 
>  groups: hidden and normal.
>  To reference a hidden netns, you need to do it explicitly. 
>  Hidden and normal netns names can collide as they will be maintained 
>  in different namespaces (Yes I’m overloading the term namespace 
>  here…).
> >>> 
> >>> Maybe it's an unnamed namespace. Hidden until userspace gives it a 
> >>> name?
> >> 
> >> This is also a good idea that will solve the issue. Yes.
> >> 
> >>> 
>  Does this seems reasonable?
>  
>  -Liran
> >>> 
> >>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> >>> missed a trick or two.
> >> 
> >> BTW, from a practical point of view, I think that even until we figure 
> >> out a solution on how to implement this,
> >> it was better to create an kernel auto-generated name (e.g. 
> >> “kernel_net_failover_slaves")
> >> that will break only userspace workloads that by a very rare-chance 
> >> have a netns that collides with this then
> >> the breakage we have today for the various userspace components.
> >> 
> >> -Liran
> > 
> > It seems quite easy to supply that as a module parameter. Do we need two
> > namespaces though? Won't some userspace still be confused by the two
> > slaves sharing the MAC address?
>  
>  That’s one reasonable option.
>  Another one is that we will indeed change the mechanism by which we 
>  determine a VF should be bonded with a virtio-net device.
>  i.e. Expose a new virtio-net property that specify the PCI slot of the 
>  VF to be bonded with.
>  
>  The second seems cleaner but I don’t have a strong opinion on this. Both 
>  seem reasonable to me and your suggestion is faster to implement from 
>  current state of things.
>  
>  -Liran
> >>> 
> >>> OK. Now what happens if master is moved to another namespace? Do we need
> >>> to move the slaves too?
> >> 
> >> No. Why would we move the slaves?
> > 
> > 
> > The reason we have 3 device model at all is so users can fine tune the
> > slaves.
> 
> I Agree.
> 
> > I don't see why this applies to the root namespace but not
> > a container. If it has access to failover it should have access
> > to slaves.
> 
> Oh now I see your point. I haven’t thought about the containers usage.
> My thinking was that the customer can always just enter the “hidden” netns and
> configure whatever he wants there.
> 
> Do you have a suggestion for how to handle this?
> 
> One option could be that every "visible" netns on the system will have a “hidden”
> unnamed netns where the net-failover slaves reside.
> If the customer wishes to be able to enter that netns and manage the
> net-failover slaves 
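For concreteness, "hiding" a slave in a dedicated netns ultimately boils down to the same primitive that "ip link set dev SLAVE netns NS" uses today: an RTM_NEWLINK request carrying an IFLA_NET_NS_FD attribute. A rough userspace sketch of that step follows; the interface name and netns path are made-up placeholders, netlink ACK parsing is omitted, and the kernel-side automatic hiding discussed above does not exist today.

/*
 * Rough userspace sketch only: the primitive that moving a net-failover
 * slave into a dedicated netns boils down to. The interface name and netns
 * path are placeholders; error handling and ACK parsing are minimal.
 */
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <net/if.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if_link.h>

static int move_iface_to_netns(const char *ifname, const char *netns_path)
{
    int ns_fd = open(netns_path, O_RDONLY);
    int nl = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    struct {
        struct nlmsghdr  nh;
        struct ifinfomsg ifi;
        char             attrs[64];
    } req;
    struct rtattr *rta;

    if (ns_fd < 0 || nl < 0)
        return -1;

    memset(&req, 0, sizeof(req));
    req.nh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg));
    req.nh.nlmsg_type  = RTM_NEWLINK;
    req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
    req.ifi.ifi_family = AF_UNSPEC;
    req.ifi.ifi_index  = if_nametoindex(ifname);     /* the slave to hide */

    /* IFLA_NET_NS_FD: "move this device into the netns behind this fd". */
    rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.nh.nlmsg_len));
    rta->rta_type = IFLA_NET_NS_FD;
    rta->rta_len  = RTA_LENGTH(sizeof(int));
    memcpy(RTA_DATA(rta), &ns_fd, sizeof(int));
    req.nh.nlmsg_len = NLMSG_ALIGN(req.nh.nlmsg_len) + RTA_LENGTH(sizeof(int));

    if (send(nl, &req, req.nh.nlmsg_len, 0) < 0)
        return -1;
    close(ns_fd);
    close(nl);
    return 0;
}

int main(void)
{
    /* Placeholder names: a VF slave and a pre-created named netns. */
    return move_iface_to_netns("ens3f0v0", "/var/run/netns/failover_slaves");
}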

Re: [summary] virtio network device failover writeup

2019-03-21 Thread Michael S. Tsirkin
On Thu, Mar 21, 2019 at 03:04:37PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 14:57, Michael S. Tsirkin  wrote:
> > 
> > On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
> >>> 
> >>> On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>  2) It brings non-intuitive customer experience. For example, a 
>  customer may attempt to analyse connectivity issue by checking the 
>  connectivity
>  on a net-failover slave (e.g. the VF) but will see no connectivity 
>  when in-fact checking the connectivity on the net-failover master 
>  netdev shows correct connectivity.
>  
>  The set of changes I vision to fix our issues are:
>  1) Hide net-failover slaves in a different netns created and managed 
>  by the kernel. But that user can enter to it and manage the netdevs 
>  there if wishes to do so explicitly.
>  (E.g. Configure the net-failover VF slave in some special way).
>  2) Match the virtio-net and the VF based on a PV attribute instead 
>  of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net 
>  interface to get PCI slot where the matching VF will be hot-plugged 
>  by hypervisor.
>  3) Have an explicit virtio-net control message to command hypervisor 
>  to switch data-path from virtio-net to VF and vice-versa. Instead of 
>  relying on intercepting the PCI master enable-bit
>  as an indicator on when VF is about to be set up. (Similar to as 
>  done in NetVSC).
>  
>  Is there any clear issue we see regarding the above suggestion?
>  
>  -Liran
> >>> 
> >>> The issue would be this: how do we avoid conflicting with namespaces
> >>> created by users?
> >> 
> >> This is kinda controversial, but maybe separate netns names into 2 
> >> groups: hidden and normal.
> >> To reference a hidden netns, you need to do it explicitly. 
> >> Hidden and normal netns names can collide as they will be maintained 
> >> in different namespaces (Yes I’m overloading the term namespace here…).
> > 
> > Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
>  
>  This is also a good idea that will solve the issue. Yes.
>  
> > 
> >> Does this seems reasonable?
> >> 
> >> -Liran
> > 
> > Reasonable I'd say yes, easy to implement probably no. But maybe I
> > missed a trick or two.
>  
>  BTW, from a practical point of view, I think that even until we figure 
>  out a solution on how to implement this,
>  it was better to create an kernel auto-generated name (e.g. 
>  “kernel_net_failover_slaves")
>  that will break only userspace workloads that by a very rare-chance have 
>  a netns that collides with this then
>  the breakage we have today for the various userspace components.
>  
>  -Liran
> >>> 
> >>> It seems quite easy to supply that as a module parameter. Do we need two
> >>> namespaces though? Won't some userspace still be confused by the two
> >>> slaves sharing the MAC address?
> >> 
> >> That’s one reasonable option.
> >> Another one is that we will indeed change the mechanism by which we 
> >> determine a VF should be bonded with a virtio-net device.
> >> i.e. Expose a new virtio-net property that specify the PCI slot of the VF 
> >> to be bonded with.
> >> 
> >> The second seems cleaner but I don’t have a strong opinion on this. Both 
> >> seem reasonable to me and your suggestion is faster to implement from 
> >> current state of things.
> >> 
> >> -Liran
> > 
> > OK. Now what happens if master is moved to another namespace? Do we need
> > to move the slaves too?
> 
> No. Why would we move the slaves?


The reason we have the 3-device model at all is so users can fine-tune the
slaves. I don't see why this applies to the root namespace but not to
a container. If it has access to the failover device it should have access
to the slaves.

> The whole point is to make most customers ignore the net-failover slaves and
> keep them “hidden” in their dedicated netns.

So that makes the common case easy. That is good. My worry is it might
make some uncommon cases impossible.

> We won’t prevent a customer from explicitly moving the net-failover slaves out
> of this netns, but we will not move them out of there automatically.
> 
> > 
> > Also siwei's patch is then kind of extraneous right?
> > Attempts to rename a slave will now fail as it's in a namespace…
> 
> I’m not sure actually. Isn't udev/systemd netns-aware?
> I would expect it to be able to provide names also to netdevs in a netns
> different from the default netns.

I think most people move devices after they are renamed.

> If that’s the case, Si-Wei's patch to allow renaming a net-failover slave
> when it is already open is still required. As 
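For reference, the reason the rename fails today is, roughly, that the kernel refuses to rename an interface while it is up; Si-Wei's proposal relaxes that for net-failover slaves. A simplified sketch of the policy being discussed (illustrative plain C, not the kernel code and not the actual patch):

/*
 * Illustrative plain C only -- not kernel code and not Si-Wei's actual
 * patch. It just shows the shape of the policy under discussion: renaming
 * an interface that is up is normally refused, and the proposal is to
 * allow it when the interface is a net-failover slave.
 */
#include <stdbool.h>
#include <errno.h>
#include <stdio.h>

struct fake_netdev {
    bool is_up;               /* IFF_UP in the real kernel          */
    bool is_failover_slave;   /* registered as a net-failover slave */
};

static int can_rename(const struct fake_netdev *dev)
{
    if (dev->is_up && !dev->is_failover_slave)
        return -EBUSY;        /* renaming a running netdev is refused today */
    return 0;                 /* proposed: slaves may be renamed while open */
}

int main(void)
{
    struct fake_netdev slave = { .is_up = true, .is_failover_slave = true };

    printf("rename of an open slave allowed: %s\n",
           can_rename(&slave) == 0 ? "yes" : "no");
    return 0;
}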

Re: [summary] virtio network device failover writeup

2019-03-21 Thread Michael S. Tsirkin
On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
> > 
> > On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
> >> 2) It brings non-intuitive customer experience. For example, a 
> >> customer may attempt to analyse connectivity issue by checking the 
> >> connectivity
> >> on a net-failover slave (e.g. the VF) but will see no connectivity 
> >> when in-fact checking the connectivity on the net-failover master 
> >> netdev shows correct connectivity.
> >> 
> >> The set of changes I vision to fix our issues are:
> >> 1) Hide net-failover slaves in a different netns created and managed 
> >> by the kernel. But that user can enter to it and manage the netdevs 
> >> there if wishes to do so explicitly.
> >> (E.g. Configure the net-failover VF slave in some special way).
> >> 2) Match the virtio-net and the VF based on a PV attribute instead of 
> >> MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net 
> >> interface to get PCI slot where the matching VF will be hot-plugged by 
> >> hypervisor.
> >> 3) Have an explicit virtio-net control message to command hypervisor 
> >> to switch data-path from virtio-net to VF and vice-versa. Instead of 
> >> relying on intercepting the PCI master enable-bit
> >> as an indicator on when VF is about to be set up. (Similar to as done 
> >> in NetVSC).
> >> 
> >> Is there any clear issue we see regarding the above suggestion?
> >> 
> >> -Liran
> > 
> > The issue would be this: how do we avoid conflicting with namespaces
> > created by users?
>  
>  This is kinda controversial, but maybe separate netns names into 2 
>  groups: hidden and normal.
>  To reference a hidden netns, you need to do it explicitly. 
>  Hidden and normal netns names can collide as they will be maintained in 
>  different namespaces (Yes I’m overloading the term namespace here…).
> >>> 
> >>> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
> >> 
> >> This is also a good idea that will solve the issue. Yes.
> >> 
> >>> 
>  Does this seems reasonable?
>  
>  -Liran
> >>> 
> >>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> >>> missed a trick or two.
> >> 
> >> BTW, from a practical point of view, I think that even until we figure out 
> >> a solution on how to implement this,
> >> it was better to create an kernel auto-generated name (e.g. 
> >> “kernel_net_failover_slaves")
> >> that will break only userspace workloads that by a very rare-chance have a 
> >> netns that collides with this then
> >> the breakage we have today for the various userspace components.
> >> 
> >> -Liran
> > 
> > It seems quite easy to supply that as a module parameter. Do we need two
> > namespaces though? Won't some userspace still be confused by the two
> > slaves sharing the MAC address?
> 
> That’s one reasonable option.
> Another one is that we will indeed change the mechanism by which we determine
> that a VF should be bonded with a virtio-net device.
> I.e. expose a new virtio-net property that specifies the PCI slot of the VF to
> be bonded with.
> 
> The second seems cleaner but I don’t have a strong opinion on this. Both seem
> reasonable to me and your suggestion is faster to implement from the current
> state of things.
> 
> -Liran

OK. Now what happens if the master is moved to another namespace? Do we need
to move the slaves too?

Also Si-Wei's patch is then kind of extraneous, right?
Attempts to rename a slave will now fail as it's in a namespace...

> > 
> > -- 
> > MST

Re: [summary] virtio network device failover writeup

2019-03-21 Thread Michael S. Tsirkin
On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:
>  2) It brings a non-intuitive customer experience. For example, a customer
>  may attempt to analyse a connectivity issue by checking the connectivity
>  on a net-failover slave (e.g. the VF) but will see no connectivity, when
>  in fact checking the connectivity on the net-failover master netdev
>  shows correct connectivity.
>  
>  The set of changes I envision to fix our issues is:
>  1) Hide net-failover slaves in a different netns created and managed by
>  the kernel. But the user can enter it and manage the netdevs there
>  if they wish to do so explicitly.
>  (E.g. configure the net-failover VF slave in some special way).
>  2) Match the virtio-net and the VF based on a PV attribute instead of
>  MAC (similar to what is done in NetVSC). E.g. provide a virtio-net interface
>  to get the PCI slot where the matching VF will be hot-plugged by the hypervisor.
>  3) Have an explicit virtio-net control message to command the hypervisor to
>  switch the data-path from virtio-net to VF and vice-versa, instead of
>  relying on intercepting the PCI master enable-bit
>  as an indicator of when the VF is about to be set up (similar to what is
>  done in NetVSC).
>  
>  Is there any clear issue we see regarding the above suggestion?
>  
>  -Liran
> >>> 
> >>> The issue would be this: how do we avoid conflicting with namespaces
> >>> created by users?
> >> 
> >> This is kinda controversial, but maybe separate netns names into 2 groups: 
> >> hidden and normal.
> >> To reference a hidden netns, you need to do it explicitly. 
> >> Hidden and normal netns names can collide as they will be maintained in 
> >> different namespaces (Yes I’m overloading the term namespace here…).
> > 
> > Maybe it's an unnamed namespace. Hidden until userspace gives it a name?
> 
> This is also a good idea that will solve the issue. Yes.
> 
> > 
> >> Does this seems reasonable?
> >> 
> >> -Liran
> > 
> > Reasonable I'd say yes, easy to implement probably no. But maybe I
> > missed a trick or two.
> 
> BTW, from a practical point of view, I think that even until we figure out
> how to implement this,
> it would be better to create a kernel auto-generated name (e.g.
> “kernel_net_failover_slaves")
> that will break only userspace workloads that by a very rare chance have a
> netns that collides with this, rather than
> the breakage we have today for the various userspace components.
> 
> -Liran

It seems quite easy to supply that as a module parameter. Do we need two
namespaces though? Won't some userspace still be confused by the two
slaves sharing the MAC address?
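The module-parameter part is indeed tiny; a minimal sketch of what such a knob could look like follows (hypothetical: the in-tree net_failover module has no such parameter today).

/*
 * Minimal sketch of such a module parameter. Hypothetical: there is no
 * "slave_netns_name" parameter in the real driver; this only shows how
 * little plumbing the knob itself needs.
 */
#include <linux/module.h>
#include <linux/moduleparam.h>

static char *slave_netns_name = "kernel_net_failover_slaves";
module_param(slave_netns_name, charp, 0444);
MODULE_PARM_DESC(slave_netns_name,
		 "Name of the netns into which net-failover slaves are hidden");

static int __init slave_netns_param_init(void)
{
	pr_info("net_failover: slave netns name: %s\n", slave_netns_name);
	return 0;
}

static void __exit slave_netns_param_exit(void)
{
}

module_init(slave_netns_param_init);
module_exit(slave_netns_param_exit);
MODULE_LICENSE("GPL");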

-- 
MST

Re: [summary] virtio network device failover writeup

2019-03-21 Thread Michael S. Tsirkin
On Thu, Mar 21, 2019 at 12:19:22AM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 0:10, Michael S. Tsirkin  wrote:
> > 
> > On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin  wrote:
> >>> 
> >>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>  
>  
> > On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
> > 
> > On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
> >>> 
> >>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>  On Tue, 19 Mar 2019 14:38:06 +0200
>  Liran Alon  wrote:
>  
> > b.3) cloud-init: If configured to perform network-configuration, it 
> > attempts to configure all available netdevs. It should avoid 
> > however doing so on net-failover slaves.
> > (Microsoft has handled this by adding a mechanism in cloud-init to 
> > blacklist a netdev from being configured in case it is owned by a 
> > specific PCI driver. Specifically, they blacklist Mellanox VF 
> > driver. However, this technique doesn’t work for the net-failover 
> > mechanism because both the net-failover netdev and the virtio-net 
> > netdev are owned by the virtio-net PCI driver).
>  
>  Cloud-init should really just ignore all devices that have a master 
>  device.
>  That would have been more general, and safer for other use cases.
> >>> 
> >>> Given lots of userspace doesn't do this, I wonder whether it would be
> >>> safer to just somehow pretend to userspace that the slave links are
> >>> down? And add a special attribute for the actual link state.
> >> 
> >> I think this may be problematic as it would also break legit use case
> >> of userspace attempt to set various config on VF slave.
> >> In general, lying to userspace usually leads to problems.
> > 
> > I hear you on this. So how about instead of lying,
> > we basically just fail some accesses to slaves
> > unless a flag is set e.g. in ethtool.
> > 
> > Some userspace will need to change to set it but in a minor way.
> > Arguably/hopefully failure to set config would generally be a safer
> > failure.
>  
>  Once userspace will set this new flag by ethtool, all operations done by 
>  other userspace components will still work.
> >>> 
> >>> Sorry about being unclear, the idea would be to require the flag on each 
> >>> ethtool operation.
> >> 
> >> Oh. I have indeed misunderstood your previous email then. :)
> >> Thanks for clarifying.
> >> 
> >>> 
>  E.g. Running dhclient without parameters, after this flag was set, will 
>  still attempt to perform DHCP on it and will now succeed.
> >>> 
> >>> I think sending/receiving should probably just fail unconditionally.
> >> 
> >> You mean that you wish that somehow kernel will prevent Tx on net-failover 
> >> slave netdev
> >> unless skb is marked with some flag to indicate it has been sent via the 
> >> net-failover master?
> > 
> > We can maybe avoid binding a protocol socket to the device?
> 
> That is indeed another possibility that would work to avoid the DHCP issues.
> And it will still allow checking connectivity. So it is better.
> However, I still think it provides a non-intuitive customer experience.
> In addition, I also want to take into account that most customers
> expect a 1:1 mapping between a vNIC and a netdev.
> I.e. a cloud instance should show one netdev if it has one vNIC
> defined and attached to it.
> Customers usually don’t care how they get accelerated networking. They just
> care that they do.
> 
> > 
> >> This indeed resolves the group of userspace issues around performing DHCP 
> >> on net-failover slaves directly (By dracut/initramfs, dhclient and etc.).
> >> 
> >> However, I see a couple of down-sides to it:
> >> 1) It doesn’t resolve all userspace issues listed in this email thread. 
> >> For example, cloud-init will still attempt to perform network config on 
> >> net-failover slaves.
> >> It also doesn’t help with regard to Ubuntu’s netplan issue that creates 
> >> udev rules that match only by MAC.
> > 
> > 
> > How about we fail to retrieve mac from the slave?
> 
> That would work but I think it is cleaner to just not bind PV and VF based on 
> having the same MAC.

There's a reference to that under "Non-MAC based pairing".

I'll look into making it more explicit.

> > 
> >> 2) It brings non-intuitive customer experience. For example, a customer 
> >> may attempt to analyse connectivity issue by checking the connectivity
> >> on a net-failover slave (e.g. the VF) but will see no connectivity when 
> >> in-fact checking the connectivity on the net-failover master netdev shows 
> >> correct connectivity.
> >> 
> >> The set of changes I vision to fix our 

Re: [summary] virtio network device failover writeup

2019-03-20 Thread Michael S. Tsirkin
On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
> 
> 
> > On 20 Mar 2019, at 16:09, Michael S. Tsirkin  wrote:
> > 
> > On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
> >>> 
> >>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>  
>  
> > On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
> > 
> > On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >> On Tue, 19 Mar 2019 14:38:06 +0200
> >> Liran Alon  wrote:
> >> 
> >>> b.3) cloud-init: If configured to perform network-configuration, it 
> >>> attempts to configure all available netdevs. It should avoid however 
> >>> doing so on net-failover slaves.
> >>> (Microsoft has handled this by adding a mechanism in cloud-init to 
> >>> blacklist a netdev from being configured in case it is owned by a 
> >>> specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
> >>> However, this technique doesn’t work for the net-failover mechanism 
> >>> because both the net-failover netdev and the virtio-net netdev are 
> >>> owned by the virtio-net PCI driver).
> >> 
> >> Cloud-init should really just ignore all devices that have a master 
> >> device.
> >> That would have been more general, and safer for other use cases.
> > 
> > Given lots of userspace doesn't do this, I wonder whether it would be
> > safer to just somehow pretend to userspace that the slave links are
> > down? And add a special attribute for the actual link state.
>  
>  I think this may be problematic as it would also break legit use case
>  of userspace attempt to set various config on VF slave.
>  In general, lying to userspace usually leads to problems.
> >>> 
> >>> I hear you on this. So how about instead of lying,
> >>> we basically just fail some accesses to slaves
> >>> unless a flag is set e.g. in ethtool.
> >>> 
> >>> Some userspace will need to change to set it but in a minor way.
> >>> Arguably/hopefully failure to set config would generally be a safer
> >>> failure.
> >> 
> >> Once userspace will set this new flag by ethtool, all operations done by 
> >> other userspace components will still work.
> > 
> > Sorry about being unclear, the idea would be to require the flag on each 
> > ethtool operation.
> 
> Oh. I have indeed misunderstood your previous email then. :)
> Thanks for clarifying.
> 
> > 
> >> E.g. Running dhclient without parameters, after this flag was set, will 
> >> still attempt to perform DHCP on it and will now succeed.
> > 
> > I think sending/receiving should probably just fail unconditionally.
> 
> You mean that you wish that somehow the kernel will prevent Tx on the net-failover
> slave netdev
> unless the skb is marked with some flag to indicate it has been sent via the
> net-failover master?

We can maybe avoid binding a protocol socket to the device?
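From userspace's point of view, "avoid binding a protocol socket to the device" would surface roughly as below: dhclient-style tools attach their socket to one interface with SO_BINDTODEVICE, and under this idea that call would fail on a slave. The interface name and the failing errno are guesses; no such kernel behaviour exists today.

/*
 * Userspace-side illustration of the "refuse protocol socket binding on a
 * net-failover slave" idea. The interface name and the failing errno are
 * assumptions; the kernel does not implement any of this today.
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void)
{
    const char *slave = "ens3f0v0";              /* hypothetical VF slave */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return 1;

    if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                   slave, strlen(slave) + 1) < 0) {
        /* Under the proposal a slave would land here, e.g. with EPERM. */
        fprintf(stderr, "bind to %s refused: %s\n", slave, strerror(errno));
        close(fd);
        return 1;
    }
    printf("bound to %s (today this succeeds on a slave)\n", slave);
    close(fd);
    return 0;
}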

> This indeed resolves the group of userspace issues around performing DHCP on
> net-failover slaves directly (by dracut/initramfs, dhclient, etc.).
> 
> However, I see a couple of down-sides to it:
> 1) It doesn’t resolve all userspace issues listed in this email thread. For 
> example, cloud-init will still attempt to perform network config on 
> net-failover slaves.
> It also doesn’t help with regard to Ubuntu’s netplan issue that creates udev 
> rules that match only by MAC.


How about we fail to retrieve mac from the slave?

> 2) It brings a non-intuitive customer experience. For example, a customer may
> attempt to analyse a connectivity issue by checking the connectivity
> on a net-failover slave (e.g. the VF) but will see no connectivity, when
> in fact checking the connectivity on the net-failover master netdev shows
> correct connectivity.
> 
> The set of changes I envision to fix our issues is:
> 1) Hide net-failover slaves in a different netns created and managed by the
> kernel. But the user can enter it and manage the netdevs there if they wish
> to do so explicitly.
> (E.g. configure the net-failover VF slave in some special way).
> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC
> (similar to what is done in NetVSC). E.g. provide a virtio-net interface to get
> the PCI slot where the matching VF will be hot-plugged by the hypervisor.
> 3) Have an explicit virtio-net control message to command the hypervisor to
> switch the data-path from virtio-net to VF and vice-versa, instead of relying on
> intercepting the PCI master enable-bit
> as an indicator of when the VF is about to be set up (similar to what is done in
> NetVSC).
> 
> Is there any clear issue we see regarding the above suggestion?
> 
> -Liran

The issue would be this: how do we avoid conflicting with namespaces
created by users?

> > 
> >> Therefore, this proposal just effectively delays when the net-failover 
> >> slave can be operated on by userspace.
> >> But what we actually want is to never allow a net-failover slave to be
> >> operated by userspace unless it is explicitly stated by userspace that it
> >> wishes to perform a set of actions on the net-failover slave.
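To make the "PV attribute instead of MAC" point above concrete, the pairing decision would change roughly as sketched below. The standby_vf_slot field stands in for a hypothetical attribute exposed by the hypervisor; nothing like it is defined in the virtio spec today, and the logic is only illustrative.

/*
 * Illustrative sketch of MAC-based vs PCI-slot-based pairing of a VF with
 * its virtio-net standby device. "standby_vf_slot" stands in for a
 * hypothetical PV attribute; this logic lives nowhere in the kernel and
 * only illustrates the difference between the two policies.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct pv_dev {
    unsigned char mac[6];
    char standby_vf_slot[16];           /* e.g. "0000:00:07.0" (hypothetical) */
};

struct vf_dev {
    unsigned char mac[6];
    char pci_slot[16];
};

/* Today: pair purely by MAC, which is what confuses MAC-keyed udev rules. */
static bool match_by_mac(const struct pv_dev *pv, const struct vf_dev *vf)
{
    return memcmp(pv->mac, vf->mac, sizeof(pv->mac)) == 0;
}

/* Proposed: pair by the PCI slot the hypervisor says the VF will occupy. */
static bool match_by_slot(const struct pv_dev *pv, const struct vf_dev *vf)
{
    return strcmp(pv->standby_vf_slot, vf->pci_slot) == 0;
}

int main(void)
{
    struct pv_dev pv = { { 0x52, 0x54, 0x00, 0x12, 0x34, 0x56 }, "0000:00:07.0" };
    struct vf_dev vf = { { 0x52, 0x54, 0x00, 0x12, 0x34, 0x56 }, "0000:00:07.0" };

    printf("MAC match: %d, slot match: %d\n",
           match_by_mac(&pv, &vf), match_by_slot(&pv, &vf));
    return 0;
}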

Re: [summary] virtio network device failover writeup

2019-03-20 Thread Michael S. Tsirkin
On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
> 
> 
> > On 20 Mar 2019, at 12:25, Michael S. Tsirkin  wrote:
> > 
> > On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
> >>> 
> >>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>  On Tue, 19 Mar 2019 14:38:06 +0200
>  Liran Alon  wrote:
>  
> > b.3) cloud-init: If configured to perform network-configuration, it 
> > attempts to configure all available netdevs. It should avoid however 
> > doing so on net-failover slaves.
> > (Microsoft has handled this by adding a mechanism in cloud-init to 
> > blacklist a netdev from being configured in case it is owned by a 
> > specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
> > However, this technique doesn’t work for the net-failover mechanism 
> > because both the net-failover netdev and the virtio-net netdev are 
> > owned by the virtio-net PCI driver).
>  
>  Cloud-init should really just ignore all devices that have a master 
>  device.
>  That would have been more general, and safer for other use cases.
> >>> 
> >>> Given lots of userspace doesn't do this, I wonder whether it would be
> >>> safer to just somehow pretend to userspace that the slave links are
> >>> down? And add a special attribute for the actual link state.
> >> 
> >> I think this may be problematic as it would also break legit use case
> >> of userspace attempt to set various config on VF slave.
> >> In general, lying to userspace usually leads to problems.
> > 
> > I hear you on this. So how about instead of lying,
> > we basically just fail some accesses to slaves
> > unless a flag is set e.g. in ethtool.
> > 
> > Some userspace will need to change to set it but in a minor way.
> > Arguably/hopefully failure to set config would generally be a safer
> > failure.
> 
> Once userspace sets this new flag via ethtool, all operations done by
> other userspace components will still work.

Sorry about being unclear, the idea would be to require the flag on each 
ethtool operation.

> E.g. Running dhclient without parameters, after this flag was set, will still 
> attempt to perform DHCP on it and will now succeed.

I think sending/receiving should probably just fail unconditionally.

> Therefore, this proposal just effectively delays when the net-failover slave 
> can be operated on by userspace.
> But what we actually want is to never allow a net-failover slave to be 
> operated by userspace unless it is explicitly stated
> by userspace that it wishes to perform a set of actions on the net-failover 
> slave.
> 
> Something that would be achieved if, for example, the net-failover slaves were in
> a different netns than the default netns.
> This also aligns with expected customer experience that most customers just 
> want to see a 1:1 mapping between a vNIC and a visible netdev.
> But of course maybe there are other ideas that can achieve similar behaviour.
> 
> -Liran
> 
> > 
> > Which things to fail? Probably sending/receiving packets?  Getting MAC?
> > More?
> > 
> >> If we reach
> >> to a scenario where we try to avoid userspace issues generically and
> >> not on a userspace component basis, I believe the right path should be
> >> to hide the net-failover slaves such that explicit action is required
> >> to actually manipulate them (As described in blog-post). E.g.
> >> Automatically move net-failover slaves by kernel to a different netns.
> >> 
> >> -Liran
> >> 
> >>> 
> >>> -- 
> >>> MST

Re: [summary] virtio network device failover writeup

2019-03-20 Thread Michael S. Tsirkin
On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> 
> 
> > On 19 Mar 2019, at 23:19, Michael S. Tsirkin  wrote:
> > 
> > On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >> On Tue, 19 Mar 2019 14:38:06 +0200
> >> Liran Alon  wrote:
> >> 
> >>> b.3) cloud-init: If configured to perform network-configuration, it 
> >>> attempts to configure all available netdevs. It should avoid however 
> >>> doing so on net-failover slaves.
> >>> (Microsoft has handled this by adding a mechanism in cloud-init to 
> >>> blacklist a netdev from being configured in case it is owned by a 
> >>> specific PCI driver. Specifically, they blacklist Mellanox VF driver. 
> >>> However, this technique doesn’t work for the net-failover mechanism 
> >>> because both the net-failover netdev and the virtio-net netdev are owned 
> >>> by the virtio-net PCI driver).
> >> 
> >> Cloud-init should really just ignore all devices that have a master device.
> >> That would have been more general, and safer for other use cases.
> > 
> > Given lots of userspace doesn't do this, I wonder whether it would be
> > safer to just somehow pretend to userspace that the slave links are
> > down? And add a special attribute for the actual link state.
> 
> I think this may be problematic as it would also break the legit use case
> of userspace attempting to set various config on the VF slave.
> In general, lying to userspace usually leads to problems.

I hear you on this. So how about instead of lying,
we basically just fail some accesses to slaves
unless a flag is set e.g. in ethtool.

Some userspace will need to change to set it but in a minor way.
Arguably/hopefully failure to set config would generally be a safer
failure.

Which things to fail? Probably sending/receiving packets?  Getting MAC?
More?
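As a strawman for the "which things to fail" question, the gating could be a single per-slave opt-in bit consulted by the operations we want to restrict. Purely illustrative decision logic, not kernel code and not an existing interface:

/*
 * Strawman only: illustrative decision logic for gating slave access behind
 * an explicit opt-in flag. Which operations belong in the gated set is
 * exactly the open question above.
 */
#include <stdbool.h>
#include <errno.h>
#include <stdio.h>

enum slave_op {
    SLAVE_OP_XMIT,        /* sending packets               */
    SLAVE_OP_RECV,        /* delivering packets to sockets */
    SLAVE_OP_GET_MAC,     /* reporting the MAC address     */
    SLAVE_OP_SET_CONFIG,  /* ethtool/address/route changes */
};

struct slave_state {
    bool explicit_access;     /* set once via the proposed ethtool flag */
};

static int slave_op_allowed(const struct slave_state *s, enum slave_op op)
{
    if (s->explicit_access)
        return 0;             /* opted in: behave like any other netdev */

    switch (op) {
    case SLAVE_OP_XMIT:
    case SLAVE_OP_RECV:
    case SLAVE_OP_GET_MAC:
    case SLAVE_OP_SET_CONFIG:
        return -EPERM;        /* candidate set from the list above */
    }
    return 0;
}

int main(void)
{
    struct slave_state s = { .explicit_access = false };

    printf("xmit on a slave without opt-in: %d\n",
           slave_op_allowed(&s, SLAVE_OP_XMIT));
    return 0;
}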

> If we reach
> a scenario where we try to avoid userspace issues generically and
> not on a per-userspace-component basis, I believe the right path should be
> to hide the net-failover slaves such that explicit action is required
> to actually manipulate them (as described in the blog-post), e.g.
> have the kernel automatically move net-failover slaves to a different netns.
> 
> -Liran
> 
> > 
> > -- 
> > MST

Re: [summary] virtio network device failover writeup

2019-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> On Tue, 19 Mar 2019 14:38:06 +0200
> Liran Alon  wrote:
> 
> > b.3) cloud-init: If configured to perform network-configuration, it 
> > attempts to configure all available netdevs. It should avoid however doing 
> > so on net-failover slaves.
> > (Microsoft has handled this by adding a mechanism in cloud-init to 
> > blacklist a netdev from being configured in case it is owned by a specific 
> > PCI driver. Specifically, they blacklist Mellanox VF driver. However, this 
> > technique doesn’t work for the net-failover mechanism because both the 
> > net-failover netdev and the virtio-net netdev are owned by the virtio-net 
> > PCI driver).
> 
> Cloud-init should really just ignore all devices that have a master device.
> That would have been more general, and safer for other use cases.

Given lots of userspace doesn't do this, I wonder whether it would be
safer to just somehow pretend to userspace that the slave links are
down? And add a special attribute for the actual link state.

-- 
MST

Re: [summary] virtio network device failover writeup

2019-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2019 at 02:38:06PM +0200, Liran Alon wrote:
> Hi Michael,
> 
> Great blog-post which summarises everything very well!
> 
> Some comments I have:

Thanks!
I'll try to update everything in the post when I'm not so jet-lagged.

> 1) I think that when we use the term “1-netdev model” in community
> discussion, we tend to refer to what you have defined in the blog-post as
> the "3-device model with hidden slaves”.
> Therefore, I would suggest just removing the “1-netdev model” section and
> renaming the "3-device model with hidden slaves” section to “1-netdev model”.
> 
> 2) The userspace issues result from using both the “2-netdev model” and the “3-netdev
> model”. However, they are described in the blog-post as if they only exist in the
> “3-netdev model”.
> The reason these issues are not seen in the Azure environment is that these
> issues were partially handled by Microsoft for their specific 2-netdev model.
> Which leads me to the next comment.
> 
> 3) I suggest that the blog-post also elaborate on what exactly the
> userspace issues are which result in models different from the “1-netdev model”.
> The issues that I’m aware of are (please tell me if you are aware of others!):
> (a) udev rename race condition: When the net-failover device is opened, it also
> opens its slaves. However, the order of events to udev on KOBJ_ADD is first
> for the net-failover netdev and only then for the virtio-net netdev. This
> means that if userspace responds to the first event by opening the net-failover,
> then any attempt by userspace to rename the virtio-net netdev in response to
> the second event will fail because the virtio-net netdev is already open.
> Also note that this udev rename rule is useful because we would like to add
> rules that rename the virtio-net netdev to clearly signal that it’s used as the
> standby interface of another net-failover netdev.
> The way this problem was worked around by Microsoft in NetVSC is to delay the
> open done on the slave VF relative to the open of the NetVSC netdev. However,
> this is still a race and thus a hacky solution. It was accepted by the community
> only because it’s internal to the NetVSC driver. However, a similar solution was
> rejected by the community for the net-failover driver.
> The solution that we currently proposed to address this (patch by Si-Wei) was
> to change the kernel rename handling to allow a net-failover slave to be
> renamed even if it is already open. The patch is still not accepted.
> (b) Issues caused by various userspace components running DHCP on the
> net-failover slaves: DHCP of course should only be done on the net-failover
> netdev. Attempting DHCP on net-failover slaves as well will cause
> networking issues. Therefore, userspace components should be taught to avoid
> doing DHCP on the net-failover slaves. The various userspace components
> include:
> b.1) dhclient: If run without parameters, it by default just enumerates all
> netdevs and attempts to DHCP them all.
> (I don’t think Microsoft has handled this)
> b.2) initramfs / dracut: In order to mount the root file-system from iSCSI,
> these components need networking and therefore DHCP on all netdevs.
> (Microsoft hasn’t handled (b.2) because they don’t have images which perform
> iSCSI boot in their Azure setup. Still an open issue)
> b.3) cloud-init: If configured to perform network configuration, it attempts
> to configure all available netdevs. It should, however, avoid doing so on
> net-failover slaves.
> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist
> a netdev from being configured in case it is owned by a specific PCI driver.
> Specifically, they blacklist the Mellanox VF driver. However, this technique
> doesn’t work for the net-failover mechanism because both the net-failover
> netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
> b.4) Various distros' network managers may need to be updated to avoid DHCP on
> net-failover slaves? (Not sure. Asking...)
> 
> 4) Another interesting use-case where the net-failover mechanism is useful is
> for handling NIC firmware failures or NIC firmware live-upgrade.
> In both cases, there is a need to perform a full PCIe reset of the NIC, which
> loses all the NIC eSwitch configuration of the various VFs.

In this setup, how does the VF keep going? If it doesn't keep going, why is
it helpful?

> To handle these cases gracefully, one could just hot-unplug all VFs from the
> guests running on the host (which will make all guests now use the virtio-net
> netdev, which is backed by a netdev that is eventually on top of the PF).
> Therefore, networking will be restored to the guests once the PCIe reset is
> completed and the PF is functional again. To re-accelerate the guests'
> network, the hypervisor can just hot-plug new VFs to the guests.
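The host-side flow described above can be driven with existing hot-plug machinery; a rough sketch using the libvirt C API follows. The guest name and the VF PCI address are placeholders, and a real tool would loop over every guest holding a VF from the affected PF.

/*
 * Rough host-side sketch of the "hot-unplug the VF before a PF reset,
 * hot-plug a new one afterwards" flow, using the libvirt C API.
 * Guest name and VF PCI address are placeholders. Build with -lvirt.
 */
#include <stdio.h>
#include <libvirt/libvirt.h>

static const char *vf_hostdev_xml =
    "<hostdev mode='subsystem' type='pci' managed='yes'>"
    "  <source>"
    "    <address domain='0x0000' bus='0x3b' slot='0x02' function='0x0'/>"
    "  </source>"
    "</hostdev>";

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    virDomainPtr dom = NULL;
    int ret = 1;

    if (!conn)
        return 1;
    dom = virDomainLookupByName(conn, "guest1");   /* placeholder name */
    if (!dom)
        goto out;

    /* Guest falls back to the virtio-net (PF-backed) datapath here. */
    if (virDomainDetachDeviceFlags(dom, vf_hostdev_xml,
                                   VIR_DOMAIN_AFFECT_LIVE) < 0)
        goto out;

    /* ... full PCIe reset / firmware upgrade of the PF happens here ... */

    /* Re-accelerate the guest by giving it a (possibly new) VF again. */
    if (virDomainAttachDeviceFlags(dom, vf_hostdev_xml,
                                   VIR_DOMAIN_AFFECT_LIVE) < 0)
        goto out;
    ret = 0;

out:
    if (dom)
        virDomainFree(dom);
    virConnectClose(conn);
    return ret;
}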
> 
> P.S:
> I would very much appreciate this forum's help in closing on the pending items
> listed in (3), which currently prevent using this net-failover mechanism in
> real production use-cases.
> 

Re: [summary] virtio network device failover writeup

2019-03-19 Thread Stephen Hemminger
On Tue, 19 Mar 2019 14:38:06 +0200
Liran Alon  wrote:

> b.3) cloud-init: If configured to perform network-configuration, it attempts 
> to configure all available netdevs. It should avoid however doing so on 
> net-failover slaves.
> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist 
> a netdev from being configured in case it is owned by a specific PCI driver. 
> Specifically, they blacklist Mellanox VF driver. However, this technique 
> doesn’t work for the net-failover mechanism because both the net-failover 
> netdev and the virtio-net netdev are owned by the virtio-net PCI driver).

Cloud-init should really just ignore all devices that have a master device.
That would have been more general, and safer for other use cases.
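The "has a master device" test is cheap to do from userspace; one way is to look for the "master" symlink that slave devices (bonding, team) expose under /sys/class/net. Whether every net-failover slave exposes exactly that link is an assumption here; the sketch only shows how simple the check itself is.

/*
 * Minimal sketch of the "skip any netdev that has a master" check, based on
 * the "master" symlink under /sys/class/net. That the net-failover slaves
 * expose this exact link is an assumption, not something verified here.
 */
#include <stdio.h>
#include <unistd.h>

static int has_master(const char *ifname)
{
    char path[128];
    char target[256];

    snprintf(path, sizeof(path), "/sys/class/net/%s/master", ifname);
    /* readlink() succeeds only if the "master" symlink exists. */
    return readlink(path, target, sizeof(target) - 1) >= 0;
}

int main(int argc, char **argv)
{
    const char *ifname = argc > 1 ? argv[1] : "eth0";
    int slave = has_master(ifname);

    printf("%s %s a master device%s\n", ifname,
           slave ? "has" : "has no",
           slave ? " -- skip it for DHCP/configuration" : "");
    return 0;
}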

[summary] virtio network device failover writeup

2019-03-17 Thread Michael S. Tsirkin
Hi all,
I've put up a blog post with a summary of where network
device failover stands and some open issues.
Not sure where best to host it, I just put it up on blogspot:
https://mstsirkin.blogspot.com/2019/03/virtio-network-device-failover-support.html

Comments, corrections are welcome!

-- 
MST