Re: [summary] virtio network device failover writeup

Michael S. Tsirkin Thu, 21 Mar 2019 01:58:40 -0700

On Thu, Mar 21, 2019 at 12:19:22AM +0200, Liran Alon wrote:
> 
> 
> > On 21 Mar 2019, at 0:10, Michael S. Tsirkin <[email protected]> wrote:
> > 
> > On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
> >> 
> >> 
> >>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin <[email protected]> wrote:
> >>> 
> >>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
> >>>> 
> >>>> 
> >>>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <[email protected]> wrote:
> >>>>> 
> >>>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
> >>>>>> 
> >>>>>> 
> >>>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <[email protected]> wrote:
> >>>>>>> 
> >>>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
> >>>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
> >>>>>>>> Liran Alon <[email protected]> wrote:
> >>>>>>>> 
> >>>>>>>>> b.3) cloud-init: If configured to perform network-configuration, it 
> >>>>>>>>> attempts to configure all available netdevs. However, it should 
> >>>>>>>>> avoid doing so on net-failover slaves.
> >>>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to 
> >>>>>>>>> blacklist a netdev from being configured in case it is owned by a 
> >>>>>>>>> specific PCI driver. Specifically, they blacklist Mellanox VF 
> >>>>>>>>> driver. However, this technique doesn’t work for the net-failover 
> >>>>>>>>> mechanism because both the net-failover netdev and the virtio-net 
> >>>>>>>>> netdev are owned by the virtio-net PCI driver).
> >>>>>>>> 
> >>>>>>>> Cloud-init should really just ignore all devices that have a master 
> >>>>>>>> device.
> >>>>>>>> That would have been more general, and safer for other use cases.
> >>>>>>> 
> >>>>>>> Given lots of userspace doesn't do this, I wonder whether it would be
> >>>>>>> safer to just somehow pretend to userspace that the slave links are
> >>>>>>> down? And add a special attribute for the actual link state.
> >>>>>> 
> >>>>>> I think this may be problematic, as it would also break the legitimate
> >>>>>> use case of userspace attempting to set various config on the VF slave.
> >>>>>> In general, lying to userspace usually leads to problems.
> >>>>> 
> >>>>> I hear you on this. So how about instead of lying,
> >>>>> we basically just fail some accesses to slaves
> >>>>> unless a flag is set e.g. in ethtool.
> >>>>> 
> >>>>> Some userspace will need to change to set it but in a minor way.
> >>>>> Arguably/hopefully failure to set config would generally be a safer
> >>>>> failure.
> >>>> 
> >>>> Once userspace sets this new flag via ethtool, all operations done by 
> >>>> other userspace components will still work.
> >>> 
> >>> Sorry about being unclear, the idea would be to require the flag on each 
> >>> ethtool operation.
> >> 
> >> Oh. I have indeed misunderstood your previous email then. :)
> >> Thanks for clarifying.
> >> 
> >>> 
> >>>> E.g. Running dhclient without parameters, after this flag was set, will 
> >>>> still attempt to perform DHCP on it and will now succeed.
> >>> 
> >>> I think sending/receiving should probably just fail unconditionally.
> >> 
> >> You mean that you want the kernel to somehow prevent Tx on a net-failover 
> >> slave netdev
> >> unless the skb is marked with some flag to indicate it has been sent via 
> >> the net-failover master?
> > 
> > We can maybe avoid binding a protocol socket to the device?
> 
> That is indeed another possibility that would work to avoid the DHCP issues.
> And will still allow checking connectivity. So it is better.
> However, I still think it provides a non-intuitive customer experience.
> In addition, I also want to take into account that most customers 
> expect a 1:1 mapping between a vNIC and a netdev.
> i.e. a cloud instance should show one netdev if it has one vNIC attached 
> to it.
> Customers usually don’t care how they get accelerated networking. They just 
> care they do.
> 
> > 
> >> This indeed resolves the group of userspace issues around performing DHCP 
> >> on net-failover slaves directly (by dracut/initramfs, dhclient, etc.).
> >> 
> >> However, I see a couple of down-sides to it:
> >> 1) It doesn’t resolve all userspace issues listed in this email thread. 
> >> For example, cloud-init will still attempt to perform network config on 
> >> net-failover slaves.
> >> It also doesn’t help with Ubuntu’s netplan issue, where netplan creates 
> >> udev rules that match only by MAC.
> > 
> > 
> > How about we fail to retrieve mac from the slave?
> 
> That would work but I think it is cleaner to just not bind PV and VF based on 
> having the same MAC.


There's a reference to that under "Non-MAC based pairing".

I'll look into making it more explicit.
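
To illustrate why MAC-only matching is ambiguous with the current pairing, and
what the "ignore all devices that have a master device" rule suggested earlier
in the thread could look like in userspace, here is a minimal Python sketch.
It assumes the standard sysfs layout, where /sys/class/net/<dev>/address holds
the MAC and /sys/class/net/<dev>/master is a symlink present only on enslaved
netdevs. The function names are hypothetical and are not taken from cloud-init,
netplan, or any other tool mentioned here.

```python
import os


def list_netdevs(sysfs_net="/sys/class/net"):
    """Return {name: {"mac": str, "has_master": bool}} for each netdev."""
    devs = {}
    for name in sorted(os.listdir(sysfs_net)):
        dev_dir = os.path.join(sysfs_net, name)
        if not os.path.isdir(dev_dir):
            continue  # skip non-device entries such as bonding_masters
        try:
            with open(os.path.join(dev_dir, "address")) as f:
                mac = f.read().strip().lower()
        except OSError:
            mac = ""  # address not readable; treat as unknown
        # 'master' is a symlink to the upper device when the netdev is a slave
        has_master = os.path.lexists(os.path.join(dev_dir, "master"))
        devs[name] = {"mac": mac, "has_master": has_master}
    return devs


def match_by_mac(mac, sysfs_net="/sys/class/net"):
    """Every netdev carrying this MAC, which is what a MAC-only rule matches."""
    mac = mac.lower()
    return [n for n, d in list_netdevs(sysfs_net).items() if d["mac"] == mac]


def configurable_netdevs(sysfs_net="/sys/class/net"):
    """Netdevs without a master, i.e. the ones userspace should configure."""
    return [n for n, d in list_netdevs(sysfs_net).items() if not d["has_master"]]
```

With a failover master and its two slaves all sharing one MAC, match_by_mac()
returns all three netdevs, while configurable_netdevs() returns only the
master. This only helps userspace that opts in to the check, which is the
limitation discussed above.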

> > 
> >> 2) It brings a non-intuitive customer experience. For example, a customer 
> >> may attempt to analyse a connectivity issue by checking the connectivity
> >> on a net-failover slave (e.g. the VF) but will see no connectivity when 
> >> in fact checking the connectivity on the net-failover master netdev shows 
> >> correct connectivity.
> >> 
> >> The set of changes I envision to fix our issues is:
> >> 1) Hide net-failover slaves in a different netns created and managed by 
> >> the kernel. The user can still enter it and manage the netdevs there if 
> >> they explicitly wish to do so.
> >> (E.g. Configure the net-failover VF slave in some special way).
> >> 2) Match the virtio-net and the VF based on a PV attribute instead of MAC 
> >> (similar to what NetVSC does). E.g. Provide a virtio-net interface to get 
> >> the PCI slot where the matching VF will be hot-plugged by the hypervisor.
> >> 3) Have an explicit virtio-net control message to command the hypervisor 
> >> to switch the data-path from virtio-net to VF and vice-versa, instead of 
> >> relying on intercepting the PCI master enable-bit
> >> as an indicator of when the VF is about to be set up (similar to what 
> >> NetVSC does).
> >> 
> >> Is there any clear issue we see regarding the above suggestion?
> >> 
> >> -Liran
> > 
> > The issue would be this: how do we avoid conflicting with namespaces
> > created by users?
> 
> This is kinda controversial, but maybe separate netns names into 2 groups: 
> hidden and normal.
> To reference a hidden netns, you need to do it explicitly. 
> Hidden and normal netns names can collide as they will be maintained in 
> different namespaces (Yes I’m overloading the term namespace here…).

Maybe it's an unnamed namespace. Hidden until userspace gives it a name?

> Does this seem reasonable?
> 
> -Liran

Reasonable I'd say yes, easy to implement probably no. But maybe I
missed a trick or two.

> > 
> >>> 
> >>>> Therefore, this proposal just effectively delays when the net-failover 
> >>>> slave can be operated on by userspace.
> >>>> But what we actually want is to never allow a net-failover slave to be 
> >>>> operated by userspace unless it is explicitly stated
> >>>> by userspace that it wishes to perform a set of actions on the 
> >>>> net-failover slave.
> >>>> 
> >>>> Something that would be achieved if, for example, the net-failover 
> >>>> slaves were in a different netns than the default netns.
> >>>> This also aligns with expected customer experience that most customers 
> >>>> just want to see a 1:1 mapping between a vNIC and a visible netdev.
> >>>> But of course maybe there are other ideas that can achieve similar 
> >>>> behaviour.
> >>>> 
> >>>> -Liran
> >>>> 
> >>>>> 
> >>>>> Which things to fail? Probably sending/receiving packets?  Getting MAC?
> >>>>> More?
> >>>>> 
> >>>>>> If we reach
> >>>>>> a scenario where we try to avoid userspace issues generically and
> >>>>>> not on a userspace component basis, I believe the right path should be
> >>>>>> to hide the net-failover slaves such that explicit action is required
> >>>>>> to actually manipulate them (As described in blog-post). E.g.
> >>>>>> Automatically move net-failover slaves by kernel to a different netns.
> >>>>>> 
> >>>>>> -Liran
> >>>>>> 
> >>>>>>> 
> >>>>>>> -- 
> >>>>>>> MST
_______________________________________________
Virtualization mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
