Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-03-01 Thread Siwei Liu
On Thu, Feb 28, 2019 at 5:05 PM Jakub Kicinski  wrote:
>
> On Thu, 28 Feb 2019 16:20:28 -0800, Siwei Liu wrote:
> > On Thu, Feb 28, 2019 at 11:56 AM Jakub Kicinski wrote:
> > > On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > > > > It is a bit of a chicken-or-egg situation ;)  But users can
> > > > > just blacklist, too.  Anyway, I think this is far better than module
> > > > > parameters
> > > >
> > > > Sorry I'm a bit confused. What is better than what?
> > >
> > > I mean that blacklist net_failover or module param to disable
> > > net_failover and handle in user space are better than trying to solve
> > > the renaming at kernel level (either by adding module params that make
> > > the kernel rename devices or letting user space change names of running
> > > devices if they are slaves).
> >
> > Before I was asked to revive this old mail thread, I knew the
> > discussion could end up somewhere like this. Yes, theoretically
> > there's a point - basically you don't believe the kernel should take
> > the risk of fixing the issue, so you pin your hopes on a hypothetical
> > userspace solution that was never built and would be hard to build in
> > reality. It's not too different from saying "hey, what you're asking
> > for is simply wrong, don't do it! Go back and modify userspace to
> > create a bond or team instead!" FWIW I want to emphasize that the
> > debate over the right place to implement this failover facility,
> > userspace versus kernel, has been around for almost a decade, and no
> > real work ever happened in userspace to "standardize" this in the
> > Linux world.
>
> Let me offer you my very subjective opinion of why "no real work ever
> happened in user space".  The actors who have primary interest to get
> the auto-bonding working are HW vendors trying to either convince
> customers to use SR-IOV, or being pressured by customers to make SR-IOV
> easier to consume.  HW vendors hire driver developers, not user space
> developers.  So the solution we arrive at is in the kernel for a
> non-technical reason (Conway's law, sort of).
>
> $ cd NetworkManager/
> $ git log --pretty=format:"%ae" | \
> grep '\(mellanox\|intel\|broadcom\|netronome\)' | sort | uniq -c
>  81 andrew.zaborow...@intel.com
>   2 david.woodho...@intel.com
>   2 ismo.puusti...@intel.com
>   1 michael.i.dohe...@intel.com
>
> Andrew works on WiFi.
>

I'm sorry, but we don't use NetworkManager in our cloud images at all.
We suffered from lots of problems booting from a remote iSCSI disk
with NetworkManager enabled, and it looks like those issues are still
there - my subjective impression is that a network config tool mainly
targeting desktop and WiFi users never cared about that use case. At
the very least it's a sign of insufficient testing.

From a cloud service provider's perspective, we always prefer a single
central solution over speaking to various distro vendors with their
own network daemons/config tools and thus different solutions. It's
hard to coordinate all the efforts in one place. From my personal
perspective, the in-kernel auto-slave solution is in no way
technically inferior to any userspace implementation, and every major
OS/cloud provider chose to implement this in-kernel model for the same
reason. I don't want to keep arguing over whether there's value in
having net_failover in the Linux kernel; given that it's already
there, I think it's better to move on.

We have done extensive work reporting issues (actually, fixing them
internally before posting) to the dracut, udev, initramfs-tools, and
cloud-init communities. Although the 3-netdev model is claimed to be
transparent to userspace in general, the reality is the opposite: the
effort is no different from bringing up a new type of virtual bond
that no existing userspace tool would otherwise expect in place of a
regular physical netdev. To anyone concerned about breaking userspace:
I bet no one has actually tried to use it, and those who did know what
I am saying. The duplicate MAC address setting and the plugging order
are so new to userspace that none of the existing userspace tools know
how to plumb the failover interface properly without being fixed one
way or another.

-Siwei

> I have asked the NetworkManager folks to implement this feature last
> year when net_failover got dangerously close to getting merged, and
> they said they were never approached with this request before, much less
> offered code that solve it.  Unfortunately before they got around to it
> net_failover was merged already, and they didn't proceed.
>
> So to my knowledge nobody ever tried to solve this in user space.
> I don't think net

Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-28 Thread Siwei Liu
On Thu, Feb 28, 2019 at 11:56 AM Jakub Kicinski  wrote:
>
> On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > > It is a bit of a chicken-or-egg situation ;)  But users can
> > > just blacklist, too.  Anyway, I think this is far better than module
> > > parameters
> >
> > Sorry I'm a bit confused. What is better than what?
>
> I mean that blacklist net_failover or module param to disable
> net_failover and handle in user space are better than trying to solve
> the renaming at kernel level (either by adding module params that make
> the kernel rename devices or letting user space change names of running
> devices if they are slaves).

Before I was asked to revive this old mail thread, I knew the
discussion could end up somewhere like this. Yes, theoretically
there's a point - basically you don't believe the kernel should take
the risk of fixing the issue, so you pin your hopes on a hypothetical
userspace solution that was never built and would be hard to build in
reality. It's not too different from saying "hey, what you're asking
for is simply wrong, don't do it! Go back and modify userspace to
create a bond or team instead!" FWIW I want to emphasize that the
debate over the right place to implement this failover facility,
userspace versus kernel, has been around for almost a decade, and no
real work ever happened in userspace to "standardize" this in the
Linux world.  The truth is that it takes a fair amount of complex
work to implement this correctly in userspace: what Michael mentions
about making dracut auto-bonding aware is just the tip of the
iceberg. Basically one would need to modify all the existing network
config tools to cope with this new auto-bonding concept: handle
duplicate MACs, differentiate it from a regular bond/team, fix the
boot-time dependencies of network boot, and so on. Moreover, from a
cloud provider's perspective it's not a single distro's effort; it's
not as simple as saying "just move it to a daemon, systemd or NM, and
the work is done". We (Oracle) have done extensive work over the past
year to help align the various userspace components and to work with
distro vendors to patch shipped packages so they work with the
failover 3-netdev model. The work needed for userspace auto-bonding
would be even more involved than that, with such trivial value in
return (just naming?) that I doubt any userspace developer could be
motivated to do it.

So, simply put, no, we have zero interest in this direction. If
upstream believes this is the final conclusion, I think we can stop
discussing.

Thanks,
-Siwei
>
> > > for twiddling kernel-based interface naming policy.. :S
> >
> > I see your point. But my point is slave names don't really matter, only
> > master name matters.  So I am not sure there's any policy worth talking
> > about here.
>
> Oh yes, I don't disagree with you, but others seem to want to rename
> the auto-bonded lower devices.  Which can be done trivially if it was
> a daemon in user space instantiating the auto-bond.  We are just
> providing a basic version of auto-bonding in the kernel.  If there are
> extra requirements on policy, or naming - the whole thing is better
> solved in user space.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-21 Thread Siwei Liu
Sorry for replying to this ancient thread. There is a remaining issue
that I don't think the initial net_failover patch addressed cleanly,
see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268

The renaming of 'eth0' to 'ens4' fails because the udev userspace was
never written for this kind of in-kernel automatic enslavement.
Specifically, with a bond or team the slave would typically get
renamed *before* the virtual device is created; that is what udev can
control (provided the netdev is not opened early by another part of
the kernel), and what the other userspace components, e.g. initramfs
and init scripts, can coordinate around. The in-kernel
auto-enslavement of net_failover breaks this userspace convention and
offers no solution for users who care about consistent naming of the
slave netdevs specifically.

Previously this issue was specifically called out when IFF_HIDDEN and
the 1-netdev model were proposed, but no one has offered a solution
since. Please share your thoughts on how to proceed and solve this
userspace issue if netdev does not welcome a 1-netdev model.
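For context, the udev convention being broken here is that a rename
policy (for example a systemd .link file) is applied when the device
first appears, before anything opens or enslaves it. A minimal
illustration, with a made-up PCI path and interface name:

```ini
# /etc/systemd/network/70-vf-slave.link -- illustrative only;
# Path and Name below are placeholders, not from the bug report.
[Match]
Path=pci-0000:00:04.0

[Link]
Name=ens4
```

With a regular bond or team, this rename completes before the master
is created. Because net_failover opens and enslaves the primary
immediately on probe, udev can no longer rename the already-running
slave, which is the failure described above.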

On Wed, Apr 11, 2018 at 12:53 AM Jiri Pirko  wrote:
>
> Tue, Apr 10, 2018 at 11:26:08PM CEST, step...@networkplumber.org wrote:
> >On Tue, 10 Apr 2018 11:59:50 -0700
> >Sridhar Samudrala  wrote:
> >
> >> Use the registration/notification framework supported by the generic
> >> bypass infrastructure.
> >>
> >> Signed-off-by: Sridhar Samudrala 
> >> ---
> >
> >Thanks for doing this.  Your current version has couple show stopper
> >issues.
> >
>First, the slave device is instantly taken over.
>This doesn't allow udev/systemd to do its device rename of the slave
>device. Netvsc uses a delayed work to work around this.
>
> Wait. Why the fact a device is enslaved has to affect the udev in any
> way? If it does, smells like a bug in udev.

See above for clarifications.

Thanks,


>
>
> >
> >Secondly, the select queue needs to call queue selection in VF.
> >The bonding/teaming logic doesn't work well for UDP flows.
> >Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> >fixed this performance problem.
> >
> >Lastly, more indirection is bad in current climate.
> >
>I am not completely averse to this but it needs to be fast, simple
> >and completely transparent.


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-27 Thread Siwei Liu
On Tue, Jun 26, 2018 at 11:49 PM, Samudrala, Sridhar
 wrote:
> On 6/26/2018 11:21 PM, Siwei Liu wrote:
>>
>> On Tue, Jun 26, 2018 at 5:29 PM, Michael S. Tsirkin 
>> wrote:
>>>
>>> On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:
>>>>
>>>> On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin 
>>>> wrote:
>>>>>
>>>>> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>>>>>>>>>>
>>>>>>>>>> Might not neccessarily be something wrong, but it's very limited
>>>>>>>>>> to
>>>>>>>>>> prohibit the MAC of VF from changing when enslaved by failover.
>>>>>>>>>
>>>>>>>>> You mean guest changing MAC? I'm not sure why we prohibit that.
>>>>>>>>
>>>>>>>> I think Sridhar and Jiri might be better person to answer it. My
>>>>>>>> impression was that sync'ing the MAC address change between all 3
>>>>>>>> devices is challenging, as the failover driver uses MAC address to
>>>>>>>> match net_device internally.
>>>>>>
>>>>>> Yes. The MAC address is assigned by the hypervisor and it needs to
>>>>>> manage the movement
>>>>>> of the MAC between the PF and VF.  Allowing the guest to change the
>>>>>> MAC will require
>>>>>> synchronization between the hypervisor and the PF/VF drivers. Most of
>>>>>> the VF drivers
>>>>>> don't allow changing guest MAC unless it is a trusted VF.
>>>>>
>>>>> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
>>>>> For example I can see host just
>>>>> failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
>>>>> I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
>>>>
>>>> That's why I think pairing using MAC is fragile IMHO. When VF's MAC
>>>> got changed before virtio attempts to match and pair the device, it
>>>> ends up with no pairing found out at all.
>>>
>>> Guest seems to match on the hardware mac and ignore whatever
>>> is set by user. Makes sense to me and should not be fragile.
>>
>> Host can change the hardware mac for VF any time.
>
>
> Live migration is initiated and controlled by the Host,  So the Source Host
> will
> reset the MAC during live migration after unplugging the VF. This is to
> redirect the
> VMs frames towards PF so that they can be received via virtio-net standby
> interface.
> The destination host will set the VF MAC and plug the VF after live
> migration is
> completed.
>
> Allowing the guest to change the MAC will require the qemu/libvirt/mgmt
> layers to
> track the MAC changes and replay that change after live migration.
>
If the failover device's MAC is kept in sync with the VF's MAC address
changes, the VF on the destination host can be paired using its
permanent address after being plugged in, while the failover interface
resyncs its MAC to the one currently in use when enslaving the VF. I
think something similar is already done for the multicast and unicast
address lists on VF registration, right? No need for QEMU or the mgmt
software to keep track of MAC changes.

-Siwei

>
>
>>
>> -Siwei
>>>
>>>
>>>> UUID is better.
>>>>
>>>> -Siwei
>>>>
>>>>> --
>>>>> MST
>
>


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-27 Thread Siwei Liu
On Tue, Jun 26, 2018 at 5:29 PM, Michael S. Tsirkin  wrote:
> On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:
>> On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin  wrote:
>> > On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>> >> > > > > Might not neccessarily be something wrong, but it's very limited 
>> >> > > > > to
>> >> > > > > prohibit the MAC of VF from changing when enslaved by failover.
>> >> > > > You mean guest changing MAC? I'm not sure why we prohibit that.
>> >> > > I think Sridhar and Jiri might be better person to answer it. My
>> >> > > impression was that sync'ing the MAC address change between all 3
>> >> > > devices is challenging, as the failover driver uses MAC address to
>> >> > > match net_device internally.
>> >>
>> >> Yes. The MAC address is assigned by the hypervisor and it needs to manage 
>> >> the movement
>> >> of the MAC between the PF and VF.  Allowing the guest to change the MAC 
>> >> will require
>> >> synchronization between the hypervisor and the PF/VF drivers. Most of the 
>> >> VF drivers
>> >> don't allow changing guest MAC unless it is a trusted VF.
>> >
>> > OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
>> > For example I can see host just
>> > failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
>> > I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
>>
>> That's why I think pairing using MAC is fragile IMHO. When VF's MAC
>> got changed before virtio attempts to match and pair the device, it
>> ends up with no pairing found out at all.
>
> Guest seems to match on the hardware mac and ignore whatever
> is set by user. Makes sense to me and should not be fragile.
The host can change the hardware MAC of the VF at any time.

-Siwei
>
>
>> UUID is better.
>>
>> -Siwei
>>
>> >
>> > --
>> > MST


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-26 Thread Siwei Liu
On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin  wrote:
> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>> > > > > Might not neccessarily be something wrong, but it's very limited to
>> > > > > prohibit the MAC of VF from changing when enslaved by failover.
>> > > > You mean guest changing MAC? I'm not sure why we prohibit that.
>> > > I think Sridhar and Jiri might be better person to answer it. My
>> > > impression was that sync'ing the MAC address change between all 3
>> > > devices is challenging, as the failover driver uses MAC address to
>> > > match net_device internally.
>>
>> Yes. The MAC address is assigned by the hypervisor and it needs to manage 
>> the movement
>> of the MAC between the PF and VF.  Allowing the guest to change the MAC will 
>> require
>> synchronization between the hypervisor and the PF/VF drivers. Most of the VF 
>> drivers
>> don't allow changing guest MAC unless it is a trusted VF.
>
> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> For example I can see host just
> failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
> I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.

That's why I think pairing by MAC is fragile, IMHO. When the VF's MAC
gets changed before virtio attempts to match and pair the device, no
pairing is found at all. A UUID is better.

-Siwei

>
> --
> MST


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-22 Thread Siwei Liu
On Fri, Jun 22, 2018 at 4:40 PM, Siwei Liu  wrote:
> On Fri, Jun 22, 2018 at 3:25 PM, Michael S. Tsirkin  wrote:
>> On Fri, Jun 22, 2018 at 02:51:11PM -0700, Siwei Liu wrote:
>>> On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin  wrote:
>>> > On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
>>> >> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu  wrote:
>>> >> > On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin  
>>> >> > wrote:
>>> >> >> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>>> >> >>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck  
>>> >> >>> wrote:
>>> >> >>> > On Wed, 20 Jun 2018 22:48:58 +0300
>>> >> >>> > "Michael S. Tsirkin"  wrote:
>>> >> >>> >
>>> >> >>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>>> >> >>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
>>> >> >>> >>
>>> >> >>> >> It's mostly so we can have e.g. multiple devices with same MAC
>>> >> >>> >> (which some people seem to want in order to then use
>>> >> >>> >> then with different containers).
>>> >> >>> >>
>>> >> >>> >> But it is also handy for when you assign a PF, since then you
>>> >> >>> >> can't set the MAC.
>>> >> >>> >>
>>> >> >>> >
>>> >> >>> > OK, so what about the following:
>>> >> >>> >
>>> >> >>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that 
>>> >> >>> > indicates
>>> >> >>> >   that we have a new uuid field in the virtio-net config space
>>> >> >>> > - in QEMU, add a property for virtio-net that allows to specify a 
>>> >> >>> > uuid,
>>> >> >>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
>>> >> >>> > - when configuring, set the property to the group UUID of the 
>>> >> >>> > vfio-pci
>>> >> >>> >   device
>>> >> >>>
>>> >> >>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>>> >> >>> to still expose UUID in the config space on virtio-pci?
>>> >> >>
>>> >> >>
>>> >> >> Yes but guest is not supposed to read it.
>>> >> >>
>>> >> >>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>>> >> >>> where the corresponding vfio-pci device attached to for a guest which
>>> >> >>> doesn't support the feature (legacy).
>>> >> >>>
>>> >> >>> -Siwei
>>> >> >>
>>> >> >> Yes but you won't add the primary behind such a bridge.
>>> >> >
>>> >> > I assume the UUID feature is a new one besides the exiting
>>> >> > VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
>>> >> > if UUID feature is present and supported by guest, we'll do pairing
>>> >> > based on UUID; if UUID feature present
>>> >>  ^^^  is NOT present
>>> >
>>> > Let's go back for a bit.
>>> >
>>> > What is wrong with matching by mac?
>>> >
>>> > 1. Does not allow multiple NICs with same mac
>>> > 2. Requires MAC to be programmed by host in the PT device
>>> >(which is often possible with VFs but isn't possible with all PT
>>> >devices)
>>>
>>> Might not neccessarily be something wrong, but it's very limited to
>>> prohibit the MAC of VF from changing when enslaved by failover.
>>
>> You mean guest changing MAC? I'm not sure why we prohibit that.
>
> I think Sridhar and Jiri might be better person to answer it. My
> impression was that sync'ing the MAC address change between all 3
> devices is challenging, as the failover driver uses MAC address to
> match net_device internally.
>
>>
>>
>>> The
>>> same as you indicated on the PT device. I suspect the same is
>>> prohibited on even virtio-net and failover is because

Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-22 Thread Siwei Liu
On Fri, Jun 22, 2018 at 3:33 PM, Michael S. Tsirkin  wrote:
> On Fri, Jun 22, 2018 at 02:57:39PM -0700, Siwei Liu wrote:
>> On Fri, Jun 22, 2018 at 2:32 PM, Michael S. Tsirkin  wrote:
>> > On Fri, Jun 22, 2018 at 01:21:45PM -0700, Siwei Liu wrote:
>> >> On Fri, Jun 22, 2018 at 12:05 PM, Michael S. Tsirkin  
>> >> wrote:
>> >> > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
>> >> >> On Thu, 21 Jun 2018 21:20:13 +0300
>> >> >> "Michael S. Tsirkin"  wrote:
>> >> >>
>> >> >> > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
>> >> >> > > OK, so what about the following:
>> >> >> > >
>> >> >> > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that 
>> >> >> > > indicates
>> >> >> > >   that we have a new uuid field in the virtio-net config space
>> >> >> > > - in QEMU, add a property for virtio-net that allows to specify a 
>> >> >> > > uuid,
>> >> >> > >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> >> >> > > - when configuring, set the property to the group UUID of the 
>> >> >> > > vfio-pci
>> >> >> > >   device
>> >> >> > > - in the guest, use the uuid from the virtio-net device's config 
>> >> >> > > space
>> >> >> > >   if applicable; else, fall back to matching by MAC as done today
>> >> >> > >
>> >> >> > > That should work for all virtio transports.
>> >> >> >
>> >> >> > True. I'm a bit unhappy that it's virtio net specific though
>> >> >> > since down the road I expect we'll have a very similar feature
>> >> >> > for scsi (and maybe others).
>> >> >> >
>> >> >> > But we do not have a way to have fields that are portable
>> >> >> > both across devices and transports, and I think it would
>> >> >> > be a useful addition. How would this work though? Any idea?
>> >> >>
>> >> >> Can we introduce some kind of device-independent config space area?
>> >> >> Pushing back the device-specific config space by a certain value if the
>> >> >> appropriate feature is negotiated and use that for things like the 
>> >> >> uuid?
>> >> >
>> >> > So config moves back and forth?
>> >> > Reminds me of the msi vector mess we had with pci.
>> >> > I'd rather have every transport add a new config.
>> >> >
>> >> >> But regardless of that, I'm not sure whether extending this approach to
>> >> >> other device types is the way to go. Tying together two different
>> >> >> devices is creating complicated situations at least in the hypervisor
>> >> >> (even if it's fairly straightforward in the guest). [I have not come
>> >> >> around again to look at the "how to handle visibility in QEMU"
>> >> >> questions due to lack of cycles, sorry about that.]
>> >> >>
>> >> >> So, what's the goal of this approach? Only to allow migration with
>> >> >> vfio-pci, or also to plug in a faster device and use it instead of an
>> >> >> already attached paravirtualized device?
>> >> >
>> >> > These are two sides of the same coin, I think the second approach
>> >> > is closer to what we are doing here.
>> >>
>> >> I'm leaning towards being conservative to keep the potential of doing
>> >> both. It's the vfio migration itself that has to be addessed, not to
>> >> make every virtio device to have an accelerated datapath, right?
>> >>
>> >> -Siwei
>> >
>> > I think that the approach we took (signal configuration
>> > through standby) assumes standby always present and
>> > primary appearing and disappearing.
>>
>> I can imagine that's still true even for 1-netdev model. As long as we
>> can hide the lower devices.
>>
>> That's what I said "to keep the potential to support both" models. But
>> we should not go further along to assume the other aspect ie. to get
>> PV accelerated whenever possible, but not to get VF migrated whenever
>> possible. That's essentially a big divergence

Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-22 Thread Siwei Liu
On Fri, Jun 22, 2018 at 3:25 PM, Michael S. Tsirkin  wrote:
> On Fri, Jun 22, 2018 at 02:51:11PM -0700, Siwei Liu wrote:
>> On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin  wrote:
>> > On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
>> >> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu  wrote:
>> >> > On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin  
>> >> > wrote:
>> >> >> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>> >> >>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck  
>> >> >>> wrote:
>> >> >>> > On Wed, 20 Jun 2018 22:48:58 +0300
>> >> >>> > "Michael S. Tsirkin"  wrote:
>> >> >>> >
>> >> >>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>> >> >>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
>> >> >>> >>
>> >> >>> >> It's mostly so we can have e.g. multiple devices with same MAC
>> >> >>> >> (which some people seem to want in order to then use
>> >> >>> >> then with different containers).
>> >> >>> >>
>> >> >>> >> But it is also handy for when you assign a PF, since then you
>> >> >>> >> can't set the MAC.
>> >> >>> >>
>> >> >>> >
>> >> >>> > OK, so what about the following:
>> >> >>> >
>> >> >>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that 
>> >> >>> > indicates
>> >> >>> >   that we have a new uuid field in the virtio-net config space
>> >> >>> > - in QEMU, add a property for virtio-net that allows to specify a 
>> >> >>> > uuid,
>> >> >>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> >> >>> > - when configuring, set the property to the group UUID of the 
>> >> >>> > vfio-pci
>> >> >>> >   device
>> >> >>>
>> >> >>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>> >> >>> to still expose UUID in the config space on virtio-pci?
>> >> >>
>> >> >>
>> >> >> Yes but guest is not supposed to read it.
>> >> >>
>> >> >>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>> >> >>> where the corresponding vfio-pci device attached to for a guest which
>> >> >>> doesn't support the feature (legacy).
>> >> >>>
>> >> >>> -Siwei
>> >> >>
>> >> >> Yes but you won't add the primary behind such a bridge.
>> >> >
>> >> > I assume the UUID feature is a new one besides the exiting
>> >> > VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
>> >> > if UUID feature is present and supported by guest, we'll do pairing
>> >> > based on UUID; if UUID feature present
>> >>  ^^^  is NOT present
>> >
>> > Let's go back for a bit.
>> >
>> > What is wrong with matching by mac?
>> >
>> > 1. Does not allow multiple NICs with same mac
>> > 2. Requires MAC to be programmed by host in the PT device
>> >(which is often possible with VFs but isn't possible with all PT
>> >devices)
>>
>> Might not neccessarily be something wrong, but it's very limited to
>> prohibit the MAC of VF from changing when enslaved by failover.
>
> You mean guest changing MAC? I'm not sure why we prohibit that.

I think Sridhar and Jiri might be better persons to answer this. My
impression was that syncing the MAC address change between all 3
devices is challenging, as the failover driver uses the MAC address
to match net_devices internally.

>
>
>> The
>> same as you indicated on the PT device. I suspect the same is
>> prohibited on even virtio-net and failover is because of the same
>> limitation.
>>
>> >
>> > Both issues are up to host: if host needs them it needs the UUID
>> > feature, if not MAC matching is sufficient.
>>
>> I know. What I'm saying is that we rely on STANDBY feature to control
>> the visibility of the primary, not the UUID feature.
>>
>> -Siwei

Re: [virtio-dev] Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-22 Thread Siwei Liu
On Fri, Jun 22, 2018 at 2:47 PM, Michael S. Tsirkin  wrote:
> On Fri, Jun 22, 2018 at 12:43:26PM -0700, Siwei Liu wrote:
>> > The semantics are that the primary is always used if present in
>> > preference to standby.
>> OK. If this is the only semantics of what "standby" refers to in
>> general, that is fine.
>>
>> I just don't want to limit the failover/standby semantics to the
>> device model specifics, the "accelerated datapath" thing or whatever.
>> I really don't know where the requirements of the "accelerated
>> datapath" came from,
>
> It's a way to put it that matches what failover actually provides.
>
>> as the originial problem is migrating vfio
>> devices which is in match of QEMU's live migration model.
>
> If you put it this way then it's natural to require that we have a
> config with a working vfio device, and that we somehow add virtio for
> duration of migration.

OK. That's the STANDBY feature semantics I expect. Not something like
"we have a working virtio, but we need an accelerated datapath via
VFIO when the VM is not migrating". VFIO is what needs to be the
concern here, not virtio. The way I view it, the STANDBY feature,
although present on the virtio device, serves as a communication
channel for the paired VFIO device, and the main purpose of its
feature negotiation process has to be around migrating the
corresponding VFIO device. It remains possible to re-use it as a side
channel to achieve an "accelerated datapath".

There could be other alternatives for vfio migration though, which do
not need the virtio helper, so don't need to get a STANDBY virtio for
that VFIO device.

>
>> Hyper-V has
>> it's limitation to do 1-netdev should not impact how KVM/QEMU should
>> be doing it.
>
> That's a linux thing and pretty orthogonal to host/guest interface.

I agree. So don't assume any device model specifics in the host/guest interface.

>
>> > Jason said virtio net is sometimes preferable.
>> > If that's the case don't make it a standby.
>> >
>> > More advanced use-cases do exist and e.g. Alexander Duyck
>> > suggested using a switch-dev.
>>
>> The switchdev case would need a new feature bit, right?
>>
>> -Siwei
>
> I think it would need to be a completely new device.

So how does it relate to virtio or failover?

Regards,
-Siwei
>
>> > failover isn't it though.
>> >
>> > --
>> > MST


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-22 Thread Siwei Liu
On Fri, Jun 22, 2018 at 2:32 PM, Michael S. Tsirkin  wrote:
> On Fri, Jun 22, 2018 at 01:21:45PM -0700, Siwei Liu wrote:
>> On Fri, Jun 22, 2018 at 12:05 PM, Michael S. Tsirkin  wrote:
>> > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
>> >> On Thu, 21 Jun 2018 21:20:13 +0300
>> >> "Michael S. Tsirkin"  wrote:
>> >>
>> >> > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
>> >> > > OK, so what about the following:
>> >> > >
>> >> > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that 
>> >> > > indicates
>> >> > >   that we have a new uuid field in the virtio-net config space
>> >> > > - in QEMU, add a property for virtio-net that allows to specify a 
>> >> > > uuid,
>> >> > >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> >> > > - when configuring, set the property to the group UUID of the vfio-pci
>> >> > >   device
>> >> > > - in the guest, use the uuid from the virtio-net device's config space
>> >> > >   if applicable; else, fall back to matching by MAC as done today
>> >> > >
>> >> > > That should work for all virtio transports.
>> >> >
>> >> > True. I'm a bit unhappy that it's virtio net specific though
>> >> > since down the road I expect we'll have a very similar feature
>> >> > for scsi (and maybe others).
>> >> >
>> >> > But we do not have a way to have fields that are portable
>> >> > both across devices and transports, and I think it would
>> >> > be a useful addition. How would this work though? Any idea?
>> >>
>> >> Can we introduce some kind of device-independent config space area?
>> >> Pushing back the device-specific config space by a certain value if the
>> >> appropriate feature is negotiated and use that for things like the uuid?
>> >
>> > So config moves back and forth?
>> > Reminds me of the msi vector mess we had with pci.
>> > I'd rather have every transport add a new config.
>> >
>> >> But regardless of that, I'm not sure whether extending this approach to
>> >> other device types is the way to go. Tying together two different
>> >> devices is creating complicated situations at least in the hypervisor
>> >> (even if it's fairly straightforward in the guest). [I have not come
>> >> around again to look at the "how to handle visibility in QEMU"
>> >> questions due to lack of cycles, sorry about that.]
>> >>
>> >> So, what's the goal of this approach? Only to allow migration with
>> >> vfio-pci, or also to plug in a faster device and use it instead of an
>> >> already attached paravirtualized device?
>> >
>> > These are two sides of the same coin, I think the second approach
>> > is closer to what we are doing here.
>>
>> I'm leaning towards being conservative to keep the potential of doing
>> both. It's the vfio migration itself that has to be addressed, not
>> making every virtio device have an accelerated datapath, right?
>>
>> -Siwei
>
> I think that the approach we took (signal configuration
> through standby) assumes standby always present and
> primary appearing and disappearing.

I can imagine that's still true even for the 1-netdev model, as long
as we can hide the lower devices.

That's what I meant by "keeping the potential to support both" models.
But we should not go further and assume the other priority, i.e.
getting PV acceleration whenever possible rather than getting the VF
migrated whenever possible. That's essentially a big divergence in the
goal we'd want to pursue.

-Siwei

>
> Anything else isn't well supported and I'm not sure we
> should complicate code too much to support it.
>
>>
>> >
>> >> What about migration of vfio devices that are not easily replaced by a
>> >> paravirtualized device? I'm thinking of vfio-ccw, where our main (and
>> >> currently only) supported device is dasd (disks) -- which can do a lot
>> >> of specialized things that virtio-blk does not support (and should not
>> >> or even cannot support).
>> >
>> > But maybe virtio-scsi can?
>> >
>> >> Would it be more helpful to focus on generic
>> >> migration support for vfio instead of going about it device by device?
>> >>
>> >> This network device approach already seems far along, so it makes sense
>> >> to continue with it. But I'm not sure whether we want to spend time and
>> >> energy on that for other device types rather than working on a general
>> >> solution for vfio migration.
>> >
>> > I'm inclined to say finalizing this feature would be a good first step
>> > and will teach us how we can move forward.
>> >
>> > --
>> > MST


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-22 Thread Siwei Liu
On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin  wrote:
> On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
>> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu  wrote:
>> > On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin  
>> > wrote:
>> >> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>> >>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck  wrote:
>> >>> > On Wed, 20 Jun 2018 22:48:58 +0300
>> >>> > "Michael S. Tsirkin"  wrote:
>> >>> >
>> >>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>> >>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
>> >>> >>
>> >>> >> It's mostly so we can have e.g. multiple devices with same MAC
>> >>> >> (which some people seem to want in order to then use
>> >> >> them with different containers).
>> >>> >>
>> >>> >> But it is also handy for when you assign a PF, since then you
>> >>> >> can't set the MAC.
>> >>> >>
>> >>> >
>> >>> > OK, so what about the following:
>> >>> >
>> >>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>> >>> >   that we have a new uuid field in the virtio-net config space
>> >>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>> >>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> >>> > - when configuring, set the property to the group UUID of the vfio-pci
>> >>> >   device
>> >>>
>> >>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>> >>> to still expose UUID in the config space on virtio-pci?
>> >>
>> >>
>> >> Yes but guest is not supposed to read it.
>> >>
>> >>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>> >>> where the corresponding vfio-pci device is attached, for a guest which
>> >>> doesn't support the feature (legacy).
>> >>>
>> >>> -Siwei
>> >>
>> >> Yes but you won't add the primary behind such a bridge.
>> >
>> > I assume the UUID feature is a new one besides the existing
>> > VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
>> > if UUID feature is present and supported by guest, we'll do pairing
>> > based on UUID; if UUID feature present
>>  ^^^  is NOT present
>
> Let's go back for a bit.
>
> What is wrong with matching by mac?
>
> 1. Does not allow multiple NICs with same mac
> 2. Requires MAC to be programmed by host in the PT device
>(which is often possible with VFs but isn't possible with all PT
>devices)

There might not necessarily be something wrong, but it's quite
limiting to prohibit the VF's MAC from changing while it is enslaved
by failover. The same applies to the PT device, as you indicated. I
suspect the MAC change is prohibited even on virtio-net under failover
because of the same limitation.
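As a rough illustration of the constraint above, while a slave is enslaved by failover its MAC stays pinned to the standby's MAC. This is a hypothetical helper, not the kernel's failover code:

```python
def set_slave_mac(slave, new_mac, standby_mac):
    # While enslaved by failover, the slave's MAC must stay identical to
    # the standby's MAC; any other change is rejected.
    if slave.get("enslaved") and new_mac.lower() != standby_mac.lower():
        raise PermissionError("MAC change prohibited while enslaved by failover")
    slave["mac"] = new_mac
    return slave
```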

>
> Both issues are up to host: if host needs them it needs the UUID
> feature, if not MAC matching is sufficient.

I know. What I'm saying is that we rely on the STANDBY feature, not
the UUID feature, to control the visibility of the primary.

-Siwei

>
>
>> > or not supported by guest,
>> > we'll still plug in the VF (if guest supports the _F_STANDBY feature)
>> > but the pairing will fall back to using the MAC address. Are you saying
>> > you don't even want to plug in the primary when the UUID feature is
>> > not supported by guest? Does the feature negotiation UUID have to
>> > depend on STANDBY being set, or the other way around? I thought that
>> > just the absence of STANDBY will prevent the primary from being exposed to the
>> > guest.
>> >
>> > -Siwei
>> >
>> >>
>> >>>
>> >>> > - in the guest, use the uuid from the virtio-net device's config space
>> >>> >   if applicable; else, fall back to matching by MAC as done today
>


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-22 Thread Siwei Liu
On Fri, Jun 22, 2018 at 12:05 PM, Michael S. Tsirkin  wrote:
> On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
>> On Thu, 21 Jun 2018 21:20:13 +0300
>> "Michael S. Tsirkin"  wrote:
>>
>> > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
>> > > OK, so what about the following:
>> > >
>> > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>> > >   that we have a new uuid field in the virtio-net config space
>> > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>> > >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> > > - when configuring, set the property to the group UUID of the vfio-pci
>> > >   device
>> > > - in the guest, use the uuid from the virtio-net device's config space
>> > >   if applicable; else, fall back to matching by MAC as done today
>> > >
>> > > That should work for all virtio transports.
>> >
>> > True. I'm a bit unhappy that it's virtio net specific though
>> > since down the road I expect we'll have a very similar feature
>> > for scsi (and maybe others).
>> >
>> > But we do not have a way to have fields that are portable
>> > both across devices and transports, and I think it would
>> > be a useful addition. How would this work though? Any idea?
>>
>> Can we introduce some kind of device-independent config space area?
>> Pushing back the device-specific config space by a certain value if the
>> appropriate feature is negotiated and use that for things like the uuid?
>
> So config moves back and forth?
> Reminds me of the msi vector mess we had with pci.
> I'd rather have every transport add a new config.
>
>> But regardless of that, I'm not sure whether extending this approach to
>> other device types is the way to go. Tying together two different
>> devices is creating complicated situations at least in the hypervisor
>> (even if it's fairly straightforward in the guest). [I have not come
>> around again to look at the "how to handle visibility in QEMU"
>> questions due to lack of cycles, sorry about that.]
>>
>> So, what's the goal of this approach? Only to allow migration with
>> vfio-pci, or also to plug in a faster device and use it instead of an
>> already attached paravirtualized device?
>
> These are two sides of the same coin, I think the second approach
> is closer to what we are doing here.

I'm leaning towards being conservative to keep the potential of doing
both. It's the vfio migration itself that has to be addressed, not
making every virtio device have an accelerated datapath, right?

-Siwei


>
>> What about migration of vfio devices that are not easily replaced by a
>> paravirtualized device? I'm thinking of vfio-ccw, where our main (and
>> currently only) supported device is dasd (disks) -- which can do a lot
>> of specialized things that virtio-blk does not support (and should not
>> or even cannot support).
>
> But maybe virtio-scsi can?
>
>> Would it be more helpful to focus on generic
>> migration support for vfio instead of going about it device by device?
>>
>> This network device approach already seems far along, so it makes sense
>> to continue with it. But I'm not sure whether we want to spend time and
>> energy on that for other device types rather than working on a general
>> solution for vfio migration.
>
> I'm inclined to say finalizing this feature would be a good first step
> and will teach us how we can move forward.
>
> --
> MST


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-22 Thread Siwei Liu
On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu  wrote:
> On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin  wrote:
>> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck  wrote:
>>> > On Wed, 20 Jun 2018 22:48:58 +0300
>>> > "Michael S. Tsirkin"  wrote:
>>> >
>>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
>>> >>
>>> >> It's mostly so we can have e.g. multiple devices with same MAC
>>> >> (which some people seem to want in order to then use
>>> >> them with different containers).
>>> >>
>>> >> But it is also handy for when you assign a PF, since then you
>>> >> can't set the MAC.
>>> >>
>>> >
>>> > OK, so what about the following:
>>> >
>>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>>> >   that we have a new uuid field in the virtio-net config space
>>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
>>> > - when configuring, set the property to the group UUID of the vfio-pci
>>> >   device
>>>
>>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>>> to still expose UUID in the config space on virtio-pci?
>>
>>
>> Yes but guest is not supposed to read it.
>>
>>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>>> where the corresponding vfio-pci device is attached, for a guest which
>>> doesn't support the feature (legacy).
>>>
>>> -Siwei
>>
>> Yes but you won't add the primary behind such a bridge.
>
> I assume the UUID feature is a new one besides the existing
> VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
> if UUID feature is present and supported by guest, we'll do pairing
> based on UUID; if UUID feature present
 ^^^  is NOT present

> or not supported by guest,
> we'll still plug in the VF (if guest supports the _F_STANDBY feature)
> but the pairing will fall back to using the MAC address. Are you saying
> you don't even want to plug in the primary when the UUID feature is
> not supported by guest? Does the feature negotiation UUID have to
> depend on STANDBY being set, or the other way around? I thought that
> just the absence of STANDBY will prevent the primary from being exposed to the
> guest.
>
> -Siwei
>
>>
>>>
>>> > - in the guest, use the uuid from the virtio-net device's config space
>>> >   if applicable; else, fall back to matching by MAC as done today
>>> >
>>> > That should work for all virtio transports.


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-22 Thread Siwei Liu
On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin  wrote:
> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck  wrote:
>> > On Wed, 20 Jun 2018 22:48:58 +0300
>> > "Michael S. Tsirkin"  wrote:
>> >
>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
>> >>
>> >> It's mostly so we can have e.g. multiple devices with same MAC
>> >> (which some people seem to want in order to then use
>> >> them with different containers).
>> >>
>> >> But it is also handy for when you assign a PF, since then you
>> >> can't set the MAC.
>> >>
>> >
>> > OK, so what about the following:
>> >
>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>> >   that we have a new uuid field in the virtio-net config space
>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> > - when configuring, set the property to the group UUID of the vfio-pci
>> >   device
>>
>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>> to still expose UUID in the config space on virtio-pci?
>
>
> Yes but guest is not supposed to read it.
>
>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>> where the corresponding vfio-pci device is attached, for a guest which
>> doesn't support the feature (legacy).
>>
>> -Siwei
>
> Yes but you won't add the primary behind such a bridge.

I assume the UUID feature is a new one besides the existing
VIRTIO_NET_F_STANDBY feature, where I think the current proposal is:
if the UUID feature is present and supported by the guest, we'll do
pairing based on UUID; if the UUID feature is NOT present or not
supported by the guest, we'll still plug in the VF (if the guest
supports the _F_STANDBY feature) but the pairing will fall back to
using the MAC address. Are you saying you don't even want to plug in
the primary when the UUID feature is not supported by the guest? Does
the feature negotiation of UUID have to depend on STANDBY being set,
or the other way around? I thought that just the absence of STANDBY
will prevent the primary from being exposed to the guest.

-Siwei
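The pairing and fallback behavior being debated here can be summarized as a small decision procedure. This is a sketch of the proposal as I read it; the feature names follow the thread, but the function itself is hypothetical:

```python
F_STANDBY = "VIRTIO_NET_F_STANDBY"
F_STANDBY_UUID = "VIRTIO_NET_F_STANDBY_UUID"

def plan_pairing(guest_features):
    # Without STANDBY, the primary is never exposed to the guest.
    if F_STANDBY not in guest_features:
        return {"plug_primary": False, "match": None}
    # With the UUID feature negotiated, pair by group UUID...
    if F_STANDBY_UUID in guest_features:
        return {"plug_primary": True, "match": "uuid"}
    # ...otherwise fall back to pairing by MAC address.
    return {"plug_primary": True, "match": "mac"}
```

The open question in the message above is whether the third branch (plug the VF, match by MAC) should exist at all when the UUID feature is unsupported.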

>
>>
>> > - in the guest, use the uuid from the virtio-net device's config space
>> >   if applicable; else, fall back to matching by MAC as done today
>> >
>> > That should work for all virtio transports.


Re: [virtio-dev] Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-22 Thread Siwei Liu
On Thu, Jun 21, 2018 at 7:30 PM, Michael S. Tsirkin  wrote:
> On Thu, Jun 21, 2018 at 06:07:18PM -0700, Siwei Liu wrote:
>> On Thu, Jun 21, 2018 at 11:14 AM, Michael S. Tsirkin  wrote:
>> > On Wed, Jun 13, 2018 at 01:40:59PM +0800, Jason Wang wrote:
>> >>
>> >>
>> >> On Jun 13, 2018 12:24, Samudrala, Sridhar wrote:
>> >> > On 6/12/2018 7:38 PM, Jason Wang wrote:
>> >> > >
>> >> > >
>> >> > > On Jun 12, 2018 19:54, Michael S. Tsirkin wrote:
>> >> > > > On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>> >> > > > >
>> >> > > > > On Jun 5, 2018 20:33, Michael S. Tsirkin wrote:
>> >> > > > > > I don't think this is sufficient.
>> >> > > > > >
>> >> > > > > > If both primary and standby devices are present, a
>> >> > > > > > legacy guest without
>> >> > > > > > support for the feature might see two devices with same mac and 
>> >> > > > > > get
>> >> > > > > > confused.
>> >> > > > > >
>> >> > > > > > I think that we should only make primary visible after
>> >> > > > > > guest acked the
>> >> > > > > > backup feature bit.
>> >> > > > > I think we want exactly the reverse? E.g fail the
>> >> > > > > negotiation when guest
>> >> > > > > does not ack backup feature.
>> >> > > > >
>> >> > > > > Otherwise legacy guest won't even have the chance to see
>> >> > > > > primary device in
>> >> > > > > the guest.
>> >> > > > That's by design.
>> >> > >
>> >> > > So management needs to know the capability of guest to set the
>> >> > > backup feature. This looks a chicken or egg problem to me.
>> >> >
>> >> > I don't think so. If the tenant requests 'accelerated datapath feature',
>> >> > the management
>> >> > will set 'standby' feature bit on virtio-net interface and if the guest
>> >> > virtio-net driver
>> >> > supports this feature, then the tenant VM will get that capability via a
>> >> > hot-plugged
>> >> > primary device.
>> >>
>> >> Ok, I thought exactly the reverse because of the commit title is "enable
>> >> virtio_net to act as a standby for a passthru device". But re-read the
>> >> commit log content, I understand the case a little bit. Btw, VF is not
>> >> necessarily faster than virtio-net, especially consider virtio-net may 
>> >> have
>> >> a lot of queues.
>> >
>> > Don't do that then, right?
>>
>> I don't understand. Where did the standby feature come to imply the
>> "accelerated datapath" thing?
>> Isn't failover/standby a generic high
>> availability term, rather than marrying it to the concept of device model
>> specifics? Do we expect scsi to work exactly the same way with
>> "accelerated datapath"?
>
> That's not what I said.
> The semantics are that the primary is always used if present in
> preference to standby.
OK. If this is the only semantics of what "standby" refers to in
general, that is fine.

I just don't want to limit the failover/standby semantics to the
device model specifics, the "accelerated datapath" thing or whatever.
I really don't know where the requirements of the "accelerated
datapath" came from, as the original problem is migrating vfio
devices, which matches QEMU's live migration model. That Hyper-V has
its limitation of doing 1-netdev should not impact how KVM/QEMU should
be doing it.

> Jason said virtio net is sometimes preferable.
> If that's the case don't make it a standby.
>
> More advanced use-cases do exist and e.g. Alexander Duyck
> suggested using a switch-dev.

The switchdev case would need a new feature bit, right?

-Siwei

> failover isn't it though.
>
> --
> MST

Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-21 Thread Siwei Liu
On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck  wrote:
> On Wed, 20 Jun 2018 22:48:58 +0300
> "Michael S. Tsirkin"  wrote:
>
>> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>> > In any case, I'm not sure anymore why we'd want the extra uuid.
>>
>> It's mostly so we can have e.g. multiple devices with same MAC
>> (which some people seem to want in order to then use
>> them with different containers).
>>
>> But it is also handy for when you assign a PF, since then you
>> can't set the MAC.
>>
>
> OK, so what about the following:
>
> - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>   that we have a new uuid field in the virtio-net config space
> - in QEMU, add a property for virtio-net that allows to specify a uuid,
>   offer VIRTIO_NET_F_STANDBY_UUID if set
> - when configuring, set the property to the group UUID of the vfio-pci
>   device

If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
to still expose UUID in the config space on virtio-pci?

I'm not even sure if it's sane to expose the group UUID on the PCI
bridge where the corresponding vfio-pci device is attached, for a
guest which doesn't support the feature (legacy).

-Siwei
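The answer given elsewhere in the thread ("yes, but the guest is not supposed to read it") implies the guest only consults the uuid field when the feature was negotiated. Sketched in Python for illustration only (a real guest driver would do this in C against the virtio config space):

```python
def read_group_uuid(negotiated_features, config_space):
    # Only consult the uuid field when VIRTIO_NET_F_STANDBY_UUID was
    # negotiated; otherwise the guest falls back to MAC-based matching.
    if "VIRTIO_NET_F_STANDBY_UUID" in negotiated_features:
        return config_space.get("uuid")
    return None
```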


> - in the guest, use the uuid from the virtio-net device's config space
>   if applicable; else, fall back to matching by MAC as done today
>
> That should work for all virtio transports.


Re: [virtio-dev] Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-21 Thread Siwei Liu
On Thu, Jun 21, 2018 at 11:14 AM, Michael S. Tsirkin  wrote:
> On Wed, Jun 13, 2018 at 01:40:59PM +0800, Jason Wang wrote:
>>
>>
>> On Jun 13, 2018 12:24, Samudrala, Sridhar wrote:
>> > On 6/12/2018 7:38 PM, Jason Wang wrote:
>> > >
>> > >
>> > > On Jun 12, 2018 19:54, Michael S. Tsirkin wrote:
>> > > > On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>> > > > >
>> > > > > On Jun 5, 2018 20:33, Michael S. Tsirkin wrote:
>> > > > > > I don't think this is sufficient.
>> > > > > >
>> > > > > > If both primary and standby devices are present, a
>> > > > > > legacy guest without
>> > > > > > support for the feature might see two devices with same mac and get
>> > > > > > confused.
>> > > > > >
>> > > > > > I think that we should only make primary visible after
>> > > > > > guest acked the
>> > > > > > backup feature bit.
>> > > > > I think we want exactly the reverse? E.g fail the
>> > > > > negotiation when guest
>> > > > > does not ack backup feature.
>> > > > >
>> > > > > Otherwise legacy guest won't even have the chance to see
>> > > > > primary device in
>> > > > > the guest.
>> > > > That's by design.
>> > >
>> > > So management needs to know the capability of guest to set the
>> > > backup feature. This looks a chicken or egg problem to me.
>> >
>> > I don't think so. If the tenant requests 'accelerated datapath feature',
>> > the management
>> > will set 'standby' feature bit on virtio-net interface and if the guest
>> > virtio-net driver
>> > supports this feature, then the tenant VM will get that capability via a
>> > hot-plugged
>> > primary device.
>>
>> Ok, I thought exactly the reverse because of the commit title is "enable
>> virtio_net to act as a standby for a passthru device". But re-read the
>> commit log content, I understand the case a little bit. Btw, VF is not
>> necessarily faster than virtio-net, especially consider virtio-net may have
>> a lot of queues.
>
> Don't do that then, right?

I don't understand. Where did the standby feature come to imply the
"accelerated datapath" thing? Isn't failover/standby a generic high
availability term, rather than marrying it to the concept of device
model specifics? Do we expect scsi to work exactly the same way with
"accelerated datapath"?

-Siwei


>
>> >
>> >
>> > >
>> > > >
>> > > > > > And on reset or when backup is cleared in some other way, unplug 
>> > > > > > the
>> > > > > > primary.
>> > > > > What if guest does not support hotplug?
>> > > > It shouldn't ack the backup feature then and it's a good point.
>> > > > We should both document this and check kernel config has
>> > > > hotplug support. Sridhar could you take a look pls?
>> > > >
>> > > > > > Something like the below will do the job:
>> > > > > >
>> > > > > > Primary device is added with a special "primary-failover" flag.
>> > > > > > A virtual machine is then initialized with just a standby virtio
>> > > > > > device. Primary is not yet added.
>> > > > > A question is how to do the matching? Qemu knows nothing about e.g 
>> > > > > mac
>> > > > > address of a pass-through device I believe?
>> > > > Supply a flag to the VFIO when it's added, this way QEMU will know.
>> > > >
>> > > > > > Later QEMU detects that guest driver device set DRIVER_OK.
>> > > > > > It then exposes the primary device to the guest, and triggers
>> > > > > > a device addition event (hot-plug event) for it.
>> > > > > Do you mean we won't have primary device in the initial qemu cli?
>> > > > No, that's not what I mean.
>> > > >
>> > > > I mean management will supply a flag to VFIO and then
>> > > >
>> > > >
>> > > > - VFIO defers exposing
>> > > > primary to guest until guest acks the feature bit.
>> > > > - When we see guest ack, initiate hotplug.
>> > > > - On reboot, hide it again.
>> > > > - On reset without reboot, request hot-unplug and on eject hide
>> > > > it again.
>> > >
>> > > This sounds much like a kind of bonding in qemu.
>> > >
>> > > >
>> > > > > > If QEMU detects guest driver removal, it initiates a
>> > > > > > hot-unplug sequence
>> > > > > > to remove the primary driver.  In particular, if QEMU detects guest
>> > > > > > re-initialization (e.g. by detecting guest reset) it
>> > > > > > immediately removes
>> > > > > > the primary device.
>> > > > > I believe guest failover module should handle this gracefully?
>> > > > It can't control enumeration order, if primary is enumerated before
>> > > > standby then guest will load its driver and it's too late
>> > > > when failover driver is loaded.
>> > >
>> > > Well, even if we can delay the visibility of primary until
>> > > DRIVER_OK, there still be a race I think? And it looks to me it's
>> > > still a bug of guest:
>> > >
>> > > E.g primary could be probed before failover_register() in guest.
>> > > Then we will miss the enslaving of primary forever.
>> >
>> > That is not an issue. Even if the primary is probed before failover
>> > driver, it will
>> > enslave the primary via the call to failover_existing_slave_register()
>> > as 

Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-20 Thread Siwei Liu
On Wed, Jun 20, 2018 at 7:34 AM, Cornelia Huck  wrote:
> On Tue, 19 Jun 2018 13:09:14 -0700
> Siwei Liu  wrote:
>
>> On Tue, Jun 19, 2018 at 3:54 AM, Cornelia Huck  wrote:
>> > On Fri, 15 Jun 2018 10:06:07 -0700
>> > Siwei Liu  wrote:
>> >
>> >> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck  wrote:
>> >> > On Thu, 14 Jun 2018 18:57:11 -0700
>> >> > Siwei Liu  wrote:
>
>> >> > I'm a bit confused here. What, exactly, ties the two devices together?
>> >>
>> >> The group UUID. Since the QEMU VFIO device does not have insight into
>> >> the MAC address (which it doesn't have to), the association between the
>> >> VFIO passthrough and the standby must be specified for QEMU to
>> >> understand the relationship in this model. Note, the standby feature is
>> >> no longer required to be exposed under this model.
>> >
>> > Isn't that a bit limiting, though?
>> >
>> > With this model, you can probably tie a vfio-pci device and a
>> > virtio-net-pci device together. But this will fail if you have
>> > different transports: Consider tying together a vfio-pci device and a
>> > virtio-net-ccw device on s390, for example. The standby feature bit is
>> > on the virtio-net level and should not have any dependency on the
>> > transport used.
>>
>> Probably we'd limit the support for grouping to virtio-net-pci device
>> and vfio-pci device only. For virtio-net-pci, as you might see with
>> Venu's patch, we store the group UUID on the config space of
>> virtio-pci, which is only applicable to PCI transport.
>>
>> If virtio-net-ccw needs to support the same, I think similar grouping
>> interface should be defined on the VirtIO CCW transport. I think the
>> current implementation of the Linux failover driver assumes that it's
>> SR-IOV VF with same MAC address which the virtio-net-pci needs to pair
>> with, and that the PV path is on same PF without needing to update
>> network of the port-MAC association change. If we need to extend the
>> grouping mechanism to virtio-net-ccw, it has to pass such failover
>> mode to virtio driver specifically through some other option I guess.
>
> Hm, I've just spent some time reading the Linux failover code and I did
> not really find much pci-related magic in there (other than checking
> for a pci device in net_failover_slave_pre_register). We also seem to
> look for a matching device by MAC only. What magic am I missing?

The existing assumption of an SR-IOV VF, and thus PCI, is implicit. A
lot of simplifications are built on the fact that the passthrough
device is specifically an SR-IOV Virtual Function rather than
something else: the MAC addresses of the coupled devices must be the
same, changing the MAC address is prohibited, programming the VLAN
filter is problematic, and the datapath of virtio-net has to share the
same physical function the VF belongs to. There's no handshake during
datapath switching at all to support a normal passthrough device at
this point. I'd imagine some work ahead around that, which might be
more involved than just supporting a simplified model for VF
migration.
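For contrast, the MAC-based lookup the current driver effectively performs when a primary hot-plugs can be sketched like this. It is illustrative only, not the kernel implementation; with group UUIDs the lookup would key on the UUID instead:

```python
def find_standby_for(slave_mac, registered_failover_devs):
    # A hot-plugged primary is enslaved by the virtio-net instance whose
    # permanent MAC matches its own; each entry here is a hypothetical
    # record for a registered failover master device.
    for dev in registered_failover_devs:
        if dev["mac"].lower() == slave_mac.lower():
            return dev
    return None
```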

>
> Is the look-for-uuid handling supposed to happen in the host only?

The look-for-MAC matching scheme is not ideal in many aspects. I don't
want to repeat those again, but once the group UUID is added to QEMU,
the failover driver is supposed to switch to the UUID-based matching
scheme in the guest.

>
>> >> > If libvirt already has the knowledge that it should manage the two as a
>> >> > couple, why do we need the group id (or something else for other
>> >> > architectures)? (Maybe I'm simply missing something because I'm not
>> >> > that familiar with pci.)
>> >>
>> >> The idea is to have QEMU control the visibility and enumeration order
>> >> of the passthrough VFIO for the failover scenario. Hotplug can be one
>> >> way to achieve it, and perhaps there's other way around also. The
>> >> group ID is not just for QEMU to couple devices, it's also helpful to
>> >> guest too as grouping using MAC address is just not safe.
>> >
>> > Sorry about dragging mainframes into this, but this will only work for
>> > homogenous device coupling, not for heterogenous. Consider my vfio-pci
>> > + virtio-net-ccw example again: The guest cannot find out that the two
>> > belong together by checking some group ID, it has to either use the MAC
>> > or some needs-to-be-architectured property.
>> >
>> > Alternatively, we could propose that mechanism as pci-only, which means
>> > we can rely on mechanisms 

Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-19 Thread Siwei Liu
On Tue, Jun 19, 2018 at 3:54 AM, Cornelia Huck  wrote:
> On Fri, 15 Jun 2018 10:06:07 -0700
> Siwei Liu  wrote:
>
>> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck  wrote:
>> > On Thu, 14 Jun 2018 18:57:11 -0700
>> > Siwei Liu  wrote:
>> >
>> >> Thank you for sharing your thoughts, Cornelia. With questions below, I
>> >> think you raised really good points, some of which I don't have answer
>> >> yet and would also like to explore here.
>> >>
>> >> First off, I don't want to push the discussion to the extreme at this
>> >> point, or sell anything about having QEMU manage everything
>> >> automatically. Don't get me wrong, it's not there yet. Let's don't
>> >> assume we are tied to a specific or concerte solution. I think the key
>> >> for our discussion might be to define or refine the boundary between
>> >> VM and guest,  e.g. what each layer is expected to control and manage
>> >> exactly.
>> >>
>> >> In my view, there might be possibly 3 different options to represent
>> >> the failover device conceipt to QEMU and libvirt (or any upper layer
>> >> software):
>> >>
>> >> a. Seperate device: in this model, virtio and passthough remains
>> >> separate devices just as today. QEMU exposes the standby feature bit
>> >> for virtio, and publish status/event around the negotiation process of
>> >> this feature bit for libvirt to react upon. Since Libvirt has the
>> >> pairing relationship itself, maybe through MAC address or something
>> >> else, it can control the presence of primary by hot plugging or
>> >> unplugging the passthrough device, although it has to work tightly
>> >> with virtio's feature negotation process. Not just for migration but
>> >> also various corner scenarios (driver/feature ok, device reset,
>> >> reboot, legacy guest etc) along virtio's feature negotiation.
>> >
>> > Yes, that one has obvious tie-ins to virtio's modus operandi.
>> >
>> >>
>> >> b. Coupled device: in this model, virtio and passthough devices are
>> >> weakly coupled using some group ID, i.e. QEMU match the passthough
>> >> device for a standby virtio instance by comparing the group ID value
>> >> present behind each device's bridge. Libvirt provides QEMU the group
>> >> ID for both type of devices, and only deals with hot plug for
>> >> migration, by checking some migration status exposed (e.g. the feature
>> >> negotiation status on the virtio device) by QEMU. QEMU manages the
>> >> visibility of the primary in guest along virtio's feature negotiation
>> >> process.
>> >
>> > I'm a bit confused here. What, exactly, ties the two devices together?
>>
>> The group UUID. Since QEMU VFIO dvice does not have insight of MAC
>> address (which it doesn't have to), the association between VFIO
>> passthrough and standby must be specificed for QEMU to understand the
>> relationship with this model. Note, standby feature is no longer
>> required to be exposed under this model.
>
> Isn't that a bit limiting, though?
>
> With this model, you can probably tie a vfio-pci device and a
> virtio-net-pci device together. But this will fail if you have
> different transports: Consider tying together a vfio-pci device and a
> virtio-net-ccw device on s390, for example. The standby feature bit is
> on the virtio-net level and should not have any dependency on the
> transport used.

Probably we'd limit the support for grouping to the virtio-net-pci
and vfio-pci devices only. For virtio-net-pci, as you might see in
Venu's patch, we store the group UUID in the config space of
virtio-pci, which is only applicable to the PCI transport.

If virtio-net-ccw needs to support the same, I think a similar
grouping interface should be defined on the VirtIO CCW transport. I
think the current implementation of the Linux failover driver assumes
that it's an SR-IOV VF with the same MAC address that virtio-net-pci
needs to pair with, and that the PV path is on the same PF, with no
need to notify the network of the port-MAC association change. If we
need to extend the grouping mechanism to virtio-net-ccw, it has to
pass such a failover mode to the virtio driver specifically through
some other option, I guess.
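For illustration only, a group UUID carried in a PCI vendor-specific
capability might look roughly like the sketch below. The actual layout
in Venu's patch may differ; treat the structure, offsets, and the
parsing helper as assumptions, not the ratified interface:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

#define PCI_CAP_ID_VNDR 0x09 /* standard vendor-specific capability ID */

/* Hypothetical on-the-wire layout of a group-ID capability. All fields
 * are bytes, so the struct has no padding and sizeof() is 19. */
struct group_id_cap {
    uint8_t cap_id;    /* must be PCI_CAP_ID_VNDR */
    uint8_t cap_next;  /* offset of next capability, 0 if last */
    uint8_t cap_len;   /* total length of this capability */
    uint8_t uuid[16];  /* opaque group UUID shared by the device pair */
};

/* Validate one capability blob and copy out the UUID.
 * Returns 0 on success, -1 if the blob is not a group-ID capability. */
static int parse_group_uuid(const uint8_t *cap, size_t len, uint8_t out[16])
{
    const struct group_id_cap *c = (const void *)cap;

    if (len < sizeof(*c) || c->cap_id != PCI_CAP_ID_VNDR ||
        c->cap_len < sizeof(*c))
        return -1;
    memcpy(out, c->uuid, 16);
    return 0;
}
```

A CCW equivalent would need its own channel-command interface to carry
the same 16 bytes, which is why the grouping mechanism is transport
specific today.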

>
>>
>> > If libvirt already has the knowledge that it should manage the two as a
>> > couple, why do we need the group id (or something else for other
>> > architectures)? (Maybe I'm simply missing something because I'm not
>> > that familiar with pci

Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-15 Thread Siwei Liu
On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck  wrote:
> On Thu, 14 Jun 2018 18:57:11 -0700
> Siwei Liu  wrote:
>
>> Thank you for sharing your thoughts, Cornelia. With questions below, I
>> think you raised really good points, some of which I don't have answer
>> yet and would also like to explore here.
>>
>> First off, I don't want to push the discussion to the extreme at this
>> point, or sell anything about having QEMU manage everything
>> automatically. Don't get me wrong, it's not there yet. Let's don't
>> assume we are tied to a specific or concerte solution. I think the key
>> for our discussion might be to define or refine the boundary between
>> VM and guest,  e.g. what each layer is expected to control and manage
>> exactly.
>>
>> In my view, there might be possibly 3 different options to represent
>> the failover device conceipt to QEMU and libvirt (or any upper layer
>> software):
>>
>> a. Seperate device: in this model, virtio and passthough remains
>> separate devices just as today. QEMU exposes the standby feature bit
>> for virtio, and publish status/event around the negotiation process of
>> this feature bit for libvirt to react upon. Since Libvirt has the
>> pairing relationship itself, maybe through MAC address or something
>> else, it can control the presence of primary by hot plugging or
>> unplugging the passthrough device, although it has to work tightly
>> with virtio's feature negotation process. Not just for migration but
>> also various corner scenarios (driver/feature ok, device reset,
>> reboot, legacy guest etc) along virtio's feature negotiation.
>
> Yes, that one has obvious tie-ins to virtio's modus operandi.
>
>>
>> b. Coupled device: in this model, virtio and passthough devices are
>> weakly coupled using some group ID, i.e. QEMU match the passthough
>> device for a standby virtio instance by comparing the group ID value
>> present behind each device's bridge. Libvirt provides QEMU the group
>> ID for both type of devices, and only deals with hot plug for
>> migration, by checking some migration status exposed (e.g. the feature
>> negotiation status on the virtio device) by QEMU. QEMU manages the
>> visibility of the primary in guest along virtio's feature negotiation
>> process.
>
> I'm a bit confused here. What, exactly, ties the two devices together?

The group UUID. Since the QEMU VFIO device does not have insight into
the MAC address (which it doesn't have to), the association between
the VFIO passthrough and the standby must be specified for QEMU to
understand the relationship in this model. Note that the standby
feature is no longer required to be exposed under this model.

> If libvirt already has the knowledge that it should manage the two as a
> couple, why do we need the group id (or something else for other
> architectures)? (Maybe I'm simply missing something because I'm not
> that familiar with pci.)

The idea is to have QEMU control the visibility and enumeration order
of the passthrough VFIO device for the failover scenario. Hotplug can
be one way to achieve it, and perhaps there are other ways as well.
The group ID is not just for QEMU to couple devices; it's also
helpful to the guest, as grouping by MAC address is just not safe.

>
>>
>> c. Fully combined device: in this model, virtio and passthough devices
>> are viewed as a single VM interface altogther. QEMU not just controls
>> the visibility of the primary in guest, but can also manage the
>> exposure of the passthrough for migratability. It can be like that
>> libvirt supplies the group ID to QEMU. Or libvirt does not even have
>> to provide group ID for grouping the two devices, if just one single
>> combined device is exposed by QEMU. In either case, QEMU manages all
>> aspect of such internal construct, including virtio feature
>> negotiation, presence of the primary, and live migration.
>
> Same question as above.
>
>>
>> It looks like to me that, in your opinion, you seem to prefer go with
>> (a). While I'm actually okay with either (b) or (c). Do I understand
>> your point correctly?
>
> I'm not yet preferring anything, as I'm still trying to understand how
> this works :) I hope we can arrive at a model that covers the use case
> and that is also flexible enough to be extended to other platforms.
>
>>
>> The reason that I feel that (a) might not be ideal, just as Michael
>> alluded to (quoting below), is that as management stack, it really
>> doesn't need to care about the detailed process of feature negotiation
>> (if we view the guest presence of the primary as part of feature
>> negotiation at an extended level not 

Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-14 Thread Siwei Liu
h via vfio. The
> vfio-handled device is there for performance, the virtio device for
> migratability. We have a new virtio feature bit for that which needs to
> be negotiated for that 'combined' device to be available. We have to
> consider two cases:
>
> - Older guests that do not support the new feature bit. We presume that
>   those guests will be confused if they get two network devices with
>   the same MAC, so the idea is to not show them the vfio-handled device
>   at all.
> - Guests that negotiate the feature bit. We only know positively that
>   they (a) know the feature bit and (b) are prepared to handle the
>   consequences of negotiating it after they set the FEATURES_OK bit.
>   This is therefore the earliest point in time that the vfio-handled
>   device should be visible or usable for the guest.
>
> On Wed, 13 Jun 2018 18:02:01 -0700
> Siwei Liu  wrote:
>
>> On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar
>>  wrote:
>> > On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
>> >>
>> >> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
>> >>>
>> >>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
>> >>>>
>> >>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>> >>>>>
>> >>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>> >>>>>>
>> >>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>> >>>>>>>
>> >>>>>>> This feature bit can be used by hypervisor to indicate virtio_net
>> >>>>>>> device to
>> >>>>>>> act as a standby for another device with the same MAC address.
>> >>>>>>>
>> >>>>>>> I tested this with a small change to the patch to mark the STANDBY
>> >>>>>>> feature 'true'
>> >>>>>>> by default as i am using libvirt to start the VMs.
>> >>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
>> >>>>>>> via libvirt
>> >>>>>>> XML file?
>> >>>>>>>
>> >>>>>>> Signed-off-by: Sridhar Samudrala 
>> >>>>>>
>> >>>>>> So I do not think we can commit to this interface: we
>> >>>>>> really need to control visibility of the primary device.
>> >>>>>
>> >>>>> The problem is legacy guest won't use primary device at all if we do
>> >>>>> this.
>> >>>>
>> >>>> And that's by design - I think it's the only way to ensure the
>> >>>> legacy guest isn't confused.
>> >>>
>> >>> Yes. I think so. But i am not sure if Qemu is the right place to control
>> >>> the visibility
>> >>> of the primary device. The primary device may not be specified as an
>> >>> argument to Qemu. It
>> >>> may be plugged in later.
>> >>> The cloud service provider is providing a feature that enables low
>> >>> latency datapath and live
>> >>> migration capability.
>> >>> A tenant can use this feature only if he is running a VM that has
>> >>> virtio-net with failover support.
>
> So, do you know from the outset that there will be such a coupled
> device? I.e., is it a property of the VM definition?
>
> Can there be a 'prepared' virtio-net device that presents the STANDBY
> feature even if there currently is no vfio-handled device available --
> but making it possible to simply hotplug that device later?
>
> Should it be possible to add a virtio/vfio pair later on?
>
>> >>
>> >> Well live migration is there already. The new feature is low latency
>> >> data path.
>> >
>> >
>> > we get live migration with just virtio.  But I meant live migration with VF
>> > as
>> > primary device.
>> >
>> >>
>> >> And it's the guest that needs failover support not the VM.
>> >
>> >
>> > Isn't guest and VM synonymous?
>
> I think we need to be really careful to not mix up the two: The VM
> contains the definitions, but it is up to the guest how it uses them.
>
>> >
>> >
>> >>
>> >>
>> >>> I think Qemu should check if guest virtio-net supports this feature and
>> >>> provide a mechanism for
>> 

Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-13 Thread Siwei Liu
On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar
 wrote:
> On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
>>
>> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:

 On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>
> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>>
>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>>>
>>> This feature bit can be used by hypervisor to indicate virtio_net
>>> device to
>>> act as a standby for another device with the same MAC address.
>>>
>>> I tested this with a small change to the patch to mark the STANDBY
>>> feature 'true'
>>> by default as i am using libvirt to start the VMs.
>>> Is there a way to pass the newly added feature bit 'standby' to qemu
>>> via libvirt
>>> XML file?
>>>
>>> Signed-off-by: Sridhar Samudrala 
>>
>> So I do not think we can commit to this interface: we
>> really need to control visibility of the primary device.
>
> The problem is legacy guest won't use primary device at all if we do
> this.

 And that's by design - I think it's the only way to ensure the
 legacy guest isn't confused.
>>>
>>> Yes. I think so. But i am not sure if Qemu is the right place to control
>>> the visibility
>>> of the primary device. The primary device may not be specified as an
>>> argument to Qemu. It
>>> may be plugged in later.
>>> The cloud service provider is providing a feature that enables low
>>> latency datapath and live
>>> migration capability.
>>> A tenant can use this feature only if he is running a VM that has
>>> virtio-net with failover support.
>>
>> Well live migration is there already. The new feature is low latency
>> data path.
>
>
> we get live migration with just virtio.  But I meant live migration with VF
> as
> primary device.
>
>>
>> And it's the guest that needs failover support not the VM.
>
>
> Isn't guest and VM synonymous?
>
>
>>
>>
>>> I think Qemu should check if guest virtio-net supports this feature and
>>> provide a mechanism for
>>> an upper layer indicating if the STANDBY feature is successfully
>>> negotiated or not.
>>> The upper layer can then decide if it should hot plug a VF with the same
>>> MAC and manage the 2 links.
>>> If VF is successfully hot plugged, virtio-net link should be disabled.
>>
>> Did you even talk to upper layer management about it?
>> Just list the steps they need to do and you will see
>> that's a lot of machinery to manage by the upper layer.
>>
>> What do we gain in flexibility? As far as I can see the
>> only gain is some resources saved for legacy VMs.
>>
>> That's not a lot as tenant of the upper layer probably already has
>> at least a hunch that it's a new guest otherwise
>> why bother specifying the feature at all - you
>> save even more resources without it.
>>
>
> I am not all that familiar with how Qemu manages network devices. If we can
> do all the
> required management of the primary/standby devices within Qemu, that is
> definitely a better
> approach without upper layer involvement.

Right. I would imagine in the extreme case the upper layer doesn't
have to be involved at all if QEMU manages all the hot plug/unplug
logic. The management tool can supply the passthrough and virtio
devices with the same group UUID; QEMU auto-manages the presence of
the primary and hot plugs the device as needed before or after the
migration.
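A rough sketch of the gating logic QEMU could apply under this model:
expose the primary only once the guest has both acked
VIRTIO_NET_F_STANDBY and set DRIVER_OK, and withdraw it on reset or
when migration starts. The struct and function names are hypothetical,
and the plug/unplug action is a stand-in for real hotplug machinery:

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

#define VIRTIO_CONFIG_S_DRIVER_OK 4   /* virtio status bit */
#define VIRTIO_NET_F_STANDBY      62  /* feature bit from the patch */

struct failover_pair {
    uint64_t guest_features;  /* features the guest acked */
    uint8_t  device_status;   /* virtio status register */
    bool     migrating;       /* live migration in progress */
    bool     primary_plugged; /* is the VFIO primary visible? */
};

static bool primary_should_be_visible(const struct failover_pair *p)
{
    return !p->migrating &&
           (p->device_status & VIRTIO_CONFIG_S_DRIVER_OK) &&
           (p->guest_features & (1ULL << VIRTIO_NET_F_STANDBY));
}

/* Called on any status, feature, or migration change.
 * Returns true when a plug or unplug action was taken. */
static bool failover_update(struct failover_pair *p)
{
    bool want = primary_should_be_visible(p);

    if (want == p->primary_plugged)
        return false;
    p->primary_plugged = want; /* stand-in for hotplug/hot-unplug */
    return true;
}
```

A guest reset clears DRIVER_OK, so the same check naturally unplugs the
primary for legacy guests that re-initialize without the feature.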

-Siwei

>
>
>>
>>
> How about control the visibility of standby device?
>
> Thanks

 standby is always there to guarantee no downtime.

>> However just for testing purposes, we could add a non-stable
>> interface "x-standby" with the understanding that as any
>> x- prefix it's unstable and will be changed down the road,
>> likely in the next release.
>>
>>
>>> ---
>>> hw/net/virtio-net.c | 2 ++
>>> include/standard-headers/linux/virtio_net.h | 3 +++
>>> 2 files changed, 5 insertions(+)
>>>
>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>> index 90502fca7c..38b3140670 100644
>>> --- a/hw/net/virtio-net.c
>>> +++ b/hw/net/virtio-net.c
>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>  true),
>>> DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed,
>>> SPEED_UNKNOWN),
>>> DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>> +DEFINE_PROP_BIT64("standby", VirtIONet, host_features,
>>> VIRTIO_NET_F_STANDBY,
>>> +  false),
>>> DEFINE_PROP_END_OF_LIST(),
>>> };
>>> diff --git a/include/standard-headers/linux/virtio_net.h
>>> b/include/standard-headers/linux/virtio_net.h
>>> index e9f255ea3f..01ec09684c 100644
>>> --- a/include/standard-headers/linux/virtio_net.h
>>> 

Re: [virtio-dev] Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-13 Thread Siwei Liu
On Tue, Jun 12, 2018 at 4:47 AM, Michael S. Tsirkin  wrote:
> On Tue, Jun 05, 2018 at 03:09:26PM -0700, Siwei Liu wrote:
>> The thing is cloud service provider might prefer sticking to the same
>> level of service agreement (SLA) of keeping VF over migration,
>
> That requirement is trivially satisfied by just a single VF :) If your
> SLA does not require live migration, you should do just that.

The requirement is to enable live migration by default without
demanding too much attention from the user. Migration should be
attempted if the guest is live migratable; if not, present a single
VF to the legacy guest.

You seem to think it's not a valid use case? Or impossible to do?
What if I have something in mind that can support my theory as well?

-Siwei

>
> --
> MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-05 Thread Siwei Liu
On Tue, Jun 5, 2018 at 2:32 PM, Michael S. Tsirkin  wrote:
> On Tue, Jun 05, 2018 at 02:16:44PM -0700, Siwei Liu wrote:
>> Good to see this discussion going. I share the same feeling that the
>> decision of plugging the primary (passthrough) should only be made
>> until guest driver acknowledges DRIVER_OK and _F_STANDBY.
>> Architecturally this intelligence should be baken to QEMU itself
>> rather than moving up to management stack, such as libvirt.
>>
>> A few questions in line below.
>>
>> On Tue, Jun 5, 2018 at 5:33 AM, Michael S. Tsirkin  wrote:
>> > I don't think this is sufficient.
>> >
>> > If both primary and standby devices are present, a legacy guest without
>> > support for the feature might see two devices with same mac and get
>> > confused.
>> >
>> > I think that we should only make primary visible after guest acked the
>> > backup feature bit.
>> >
>> > And on reset or when backup is cleared in some other way, unplug the
>> > primary.
>> >
>> > Something like the below will do the job:
>> >
>> > Primary device is added with a special "primary-failover" flag.
>>
>> I wonder if you envision this flag as a user interface or some
>> internal attribute/property to QEMU device?
>>
>> Since QEMU needs to associate this particular primary-failover device
>> with a virtio standby instance for checking DRIVER_OK as you indicate
>> below, I wonder if the group ID thing will be helpful to set
>> "primary-failover" flag internally without having to add another
>> option in CLI?
>
> I haven't thought about it but it's an option.
>
>> > A virtual machine is then initialized with just a standby virtio
>> > device. Primary is not yet added.
>> >
>> > Later QEMU detects that guest driver device set DRIVER_OK.
>> > It then exposes the primary device to the guest, and triggers
>> > a device addition event (hot-plug event) for it.
>>
>> Sounds OK. While there might be a small window the guest already
>> starts to use virtio interface before the passthrough gets plugged in,
>> I think that should be fine.
>>
>> Another option is to wait until the primary appears and gets
>> registered in the guest, so it can bring up the upper failover
>> interface.
>
> We can't be sure this will ever happen, can we? We might be asked to
> migrate at any time.

Right. For that to work, the virtio driver needs complicated state
tracking hand-in-hand with the host for the VF's hot plugging
activity. If hot plugging is deferred due to migration activity, the
failover interface can be brought up immediately.

As said, this is just an option. I don't have a strong preference here.

>
>> >
>> > If QEMU detects guest driver removal, it initiates a hot-unplug sequence
>> > to remove the primary driver.  In particular, if QEMU detects guest
>> > re-initialization (e.g. by detecting guest reset) it immediately removes
>> > the primary device.
>>
>> Agreed.
>>
>> For legacy guest, user might prefer seeing a single passthrough device
>> rather than a virtio device. I would hope there's an option to allow
>> it to happen, instead of removing all passthrough devices.
>>
>> Regards,
>> -Siwei
>
> I don't see a way to make it work since then you can't migrate, can you?

The thing is that a cloud service provider might prefer sticking to
the same service level agreement (SLA) of keeping the VF across
migration, particularly when the guest OS or kernel does not have the
support. Guest OS support might NOT be prevalent in the early days.
There's no way for the cloud vendor to switch back and forth, as the
hypervisor cannot detect which kernel version, or even which OS, the
guest will be running ahead of time, until the guest is fully loaded
and has booted up, right?

-Siwei

>
> The way I see it, if you don't need migration, just use PT without
> failover.  If you do, then you either use virtio or failover if guest
> supports it.
>
>> >
>> > We can move some of this code to management as well, architecturally it
>> > does not make too much sense but it might be easier implementation-wise.
>> >
>> > HTH
>> >
>> > On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>> >> Ping on this patch now that the kernel patches are accepted into davem's 
>> >> net-next tree.
>> >> https://patchwork.ozlabs.org/cover/920005/
>> >>
>> >>
>> >> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>> >> > Thi

Re: [virtio-dev] Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-05 Thread Siwei Liu
Good to see this discussion going. I share the same feeling that the
decision to plug in the primary (passthrough) should only be made
after the guest driver acknowledges DRIVER_OK and _F_STANDBY.
Architecturally this intelligence should be baked into QEMU itself
rather than moved up to the management stack, such as libvirt.

A few questions in line below.

On Tue, Jun 5, 2018 at 5:33 AM, Michael S. Tsirkin  wrote:
> I don't think this is sufficient.
>
> If both primary and standby devices are present, a legacy guest without
> support for the feature might see two devices with same mac and get
> confused.
>
> I think that we should only make primary visible after guest acked the
> backup feature bit.
>
> And on reset or when backup is cleared in some other way, unplug the
> primary.
>
> Something like the below will do the job:
>
> Primary device is added with a special "primary-failover" flag.

I wonder if you envision this flag as a user interface or as some
internal attribute/property of the QEMU device?

Since QEMU needs to associate this particular primary-failover device
with a virtio standby instance for checking DRIVER_OK as you indicate
below, I wonder if the group ID thing would be helpful for setting the
"primary-failover" flag internally, without having to add another
option on the CLI?

> A virtual machine is then initialized with just a standby virtio
> device. Primary is not yet added.
>
> Later QEMU detects that guest driver device set DRIVER_OK.
> It then exposes the primary device to the guest, and triggers
> a device addition event (hot-plug event) for it.

Sounds OK. While there might be a small window where the guest already
starts to use the virtio interface before the passthrough gets plugged
in, I think that should be fine.

Another option is to wait until the primary appears and gets
registered in the guest, so it can bring up the upper failover
interface.

>
> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> to remove the primary driver.  In particular, if QEMU detects guest
> re-initialization (e.g. by detecting guest reset) it immediately removes
> the primary device.

Agreed.

For a legacy guest, the user might prefer seeing a single passthrough
device rather than a virtio device. I would hope there's an option to
allow that to happen, instead of removing all passthrough devices.

Regards,
-Siwei

>
> We can move some of this code to management as well, architecturally it
> does not make too much sense but it might be easier implementation-wise.
>
> HTH
>
> On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>> Ping on this patch now that the kernel patches are accepted into davem's 
>> net-next tree.
>> https://patchwork.ozlabs.org/cover/920005/
>>
>>
>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>> > This feature bit can be used by hypervisor to indicate virtio_net device to
>> > act as a standby for another device with the same MAC address.
>> >
>> > I tested this with a small change to the patch to mark the STANDBY feature 
>> > 'true'
>> > by default as i am using libvirt to start the VMs.
>> > Is there a way to pass the newly added feature bit 'standby' to qemu via 
>> > libvirt
>> > XML file?
>> >
>> > Signed-off-by: Sridhar Samudrala 
>> > ---
>> >   hw/net/virtio-net.c | 2 ++
>> >   include/standard-headers/linux/virtio_net.h | 3 +++
>> >   2 files changed, 5 insertions(+)
>> >
>> > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> > index 90502fca7c..38b3140670 100644
>> > --- a/hw/net/virtio-net.c
>> > +++ b/hw/net/virtio-net.c
>> > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>> >true),
>> >   DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>> >   DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>> > +DEFINE_PROP_BIT64("standby", VirtIONet, host_features, 
>> > VIRTIO_NET_F_STANDBY,
>> > +  false),
>> >   DEFINE_PROP_END_OF_LIST(),
>> >   };
>> > diff --git a/include/standard-headers/linux/virtio_net.h 
>> > b/include/standard-headers/linux/virtio_net.h
>> > index e9f255ea3f..01ec09684c 100644
>> > --- a/include/standard-headers/linux/virtio_net.h
>> > +++ b/include/standard-headers/linux/virtio_net.h
>> > @@ -57,6 +57,9 @@
>> >  * Steering */
>> >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
>> > +#define VIRTIO_NET_F_STANDBY  62/* Act as standby for another 
>> > device
>> > + * with the same MAC.
>> > + */
>> >   #define VIRTIO_NET_F_SPEED_DUPLEX 63  /* Device set linkspeed and 
>> > duplex */
>> >   #ifndef VIRTIO_NET_NO_LEGACY
>
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
>

Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-31 Thread Siwei Liu
On Thu, May 31, 2018 at 11:35 AM, Michael S. Tsirkin  wrote:
> On Wed, May 30, 2018 at 10:06:35PM -0400, Stephen Hemminger wrote:
>> On Thu, 24 May 2018 09:55:14 -0700
>> Sridhar Samudrala  wrote:
>>
>> > Use the registration/notification framework supported by the generic
>> > failover infrastructure.
>> >
>> > Signed-off-by: Sridhar Samudrala 
>>
>> Why was this merged? It was never signed off by any of the netvsc 
>> maintainers,
>> and there were still issues unresolved.
>>
>> There are also namespaces issues I am fixing and this breaks them.
>> Will start my patch set with a revert for this. Sorry
>
> As long as you finish the patch set with re-integrating with failover,
> that's fine IMHO.
>
> I suspect it's easier to add the code to failover though - namespace
> things likely affect virtio as well. Lookup by ID would be an optional
> feature for virtio, but probably a useful one - I won't ask you
I would think for production use this is a required feature and
should be enabled by default.

Venu (cc'ed) is working on the group ID stuff currently for pairing
virtio and passthrough devices. Would appreciate feedback on the group
ID proposal on virtio-dev.

-Siwei

> to add it to virtio but it could be a mode in failover
> that virtio will activate down the road. And reducing the number of
> times we look cards up based on ID can only be a good thing.
>
> --
> MST


Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-29 Thread Siwei Liu
On Sat, Apr 28, 2018 at 2:42 AM, Jiri Pirko  wrote:
> Fri, Apr 27, 2018 at 07:06:59PM CEST, sridhar.samudr...@intel.com wrote:
>>This patch enables virtio_net to switch over to a VF datapath when a VF
>>netdev is present with the same MAC address. It allows live migration
>>of a VM with a direct attached VF without the need to setup a bond/team
>>between a VF and virtio net device in the guest.
>>
>>The hypervisor needs to enable only one datapath at any time so that
>>packets don't get looped back to the VM over the other datapath. When a VF
>>is plugged, the virtio datapath link state can be marked as down. The
>>hypervisor needs to unplug the VF device from the guest on the source host
>>and reset the MAC filter of the VF to initiate failover of datapath to
>>virtio before starting the migration. After the migration is completed,
>>the destination hypervisor sets the MAC filter on the VF and plugs it back
>>to the guest to switch over to VF datapath.
>>
>>It uses the generic failover framework that provides 2 functions to create
>>and destroy a master failover netdev. When STANDBY feature is enabled, an
>>additional netdev(failover netdev) is created that acts as a master device
>>and tracks the state of the 2 lower netdevs. The original virtio_net netdev
>>is marked as 'standby' netdev and a passthru device with the same MAC is
>>registered as 'primary' netdev.
>>
>>This patch is based on the discussion initiated by Jesse on this thread.
>>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>
> When I enabled the standby feature (hardcoded), I have 2 netdevices now:
> 4: ens3:  mtu 1500 qdisc noqueue state UP 
> group default qlen 1000
> link/ether 52:54:00:b2:a7:f1 brd ff:ff:ff:ff:ff:ff
> inet6 fe80::5054:ff:feb2:a7f1/64 scope link
>valid_lft forever preferred_lft forever
> 5: ens3n_sby:  mtu 1500 qdisc fq_codel state 
> UP group default qlen 1000
> link/ether 52:54:00:b2:a7:f1 brd ff:ff:ff:ff:ff:ff
> inet6 fe80::5054:ff:feb2:a7f1/64 scope link
>valid_lft forever preferred_lft forever
>
> However, it seems to confuse my initscripts on Fedora:
> [root@test1 ~]# ifup ens3
> ./network-functions: line 78: [: /etc/dhcp/dhclient-ens3: binary operator 
> expected
> ./network-functions: line 80: [: /etc/dhclient-ens3: binary operator expected
> ./network-functions: line 69: [: /var/lib/dhclient/dhclient-ens3: binary 
> operator expected
>
You should teach Fedora and all cloud vendors to upgrade their
initscripts and other userspace tools, no?

> Determining IP information for ens3
> ens3n_sby...Cannot find device "ens3n_sby.pid"
> Cannot find device "ens3n_sby.lease"
>  failed.
>
> I tried to change the standby device mac:
> ip link set ens3n_sby addr 52:54:00:b2:a7:f2
> [root@test1 ~]# ifup ens3
>
> Determining IP information for ens3... done.
> [root@test1 ~]#
>
> But now the network does not work. I think that the mac change on
> standby device should be probably refused, no?
>
> When I change the mac back, all works fine.
>
>
> Now I try to change mac of the failover master:
> [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
> RTNETLINK answers: Operation not supported
>
> That I did expect to work. I would expect this would change the mac of
> the master and both standby and primary slaves.
>
>
> Now I tried to add a primary pci device. I don't have any fancy VF on my
> test setup, but I expected the good old 8139cp to work:
> [root@test1 ~]# ethtool -i ens9
> driver: 8139cp
> 
> [root@test1 ~]# ip link set ens9 addr 52:54:00:b2:a7:f1
>
> I see no message in dmesg, so I guess the failover module did not
> enslave this netdev. The mac change is not monitored. I would expect
> that it is and whenever a device changes mac to the failover one, it
> should be enslaved and whenever it changes mac back to something else,
> it should be released - the primary one ofcourse.
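The enslave/release policy described above can be sketched roughly as follows. This is an illustrative model only, not the actual failover module's code; the names (FailoverMaster, on_mac_change) are hypothetical stand-ins for a NETDEV_CHANGEADDR-style notifier handler:

```python
class FailoverMaster:
    def __init__(self, mac):
        self.mac = mac          # MAC shared by the master and standby netdevs
        self.primary = None     # currently enslaved primary device, if any

    def on_mac_change(self, dev_name, new_mac):
        """React to a NETDEV_CHANGEADDR-style event on a candidate netdev."""
        if new_mac == self.mac and self.primary is None:
            self.primary = dev_name       # MAC now matches: enslave as primary
            return "enslaved"
        if new_mac != self.mac and self.primary == dev_name:
            self.primary = None           # MAC changed away: release primary
            return "released"
        return "ignored"
```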
>
>
>
> [...]
>
>>+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>>+size_t len)
>>+{
>>+  struct virtnet_info *vi = netdev_priv(dev);
>>+  int ret;
>>+
>>+  if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_STANDBY))
>>+  return -EOPNOTSUPP;
>>+
>>+  ret = snprintf(buf, len, "_sby");
>
> please avoid the "_".
>
> [...]


Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-27 Thread Siwei Liu
On Thu, Apr 26, 2018 at 4:42 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Thu, Apr 26, 2018 at 03:14:46PM -0700, Siwei Liu wrote:
>> On Wed, Apr 25, 2018 at 7:28 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Wed, Apr 25, 2018 at 03:57:57PM -0700, Siwei Liu wrote:
>> >> On Wed, Apr 25, 2018 at 3:22 PM, Michael S. Tsirkin <m...@redhat.com> 
>> >> wrote:
>> >> > On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
>> >> >> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <m...@redhat.com> 
>> >> >> wrote:
>> >> >> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> >> >> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin 
>> >> >> >> <m...@redhat.com> wrote:
>> >> >> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> >> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> >> >> >> "Michael S. Tsirkin" <m...@redhat.com> wrote:
>> >> >> >> >>
>> >> >> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger 
>> >> >> >> >> > wrote:
>> >> >> >> >> > > > >
>> >> >> >> >> > > > >I will NAK patches to change to common code for netvsc 
>> >> >> >> >> > > > >especially the
>> >> >> >> >> > > > >three device model.  MS worked hard with distro vendors 
>> >> >> >> >> > > > >to support transparent
>> >> >> >> >> > > > >mode, ans we really can't have a new model; or do 
>> >> >> >> >> > > > >backport.
>> >> >> >> >> > > > >
>> >> >> >> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> >> >> >> > > >
>> >> >> >> >> > > > Sorry, but nobody here cares about dpdk or other similar 
>> >> >> >> >> > > > oddities.
>> >> >> >> >> > >
>> >> >> >> >> > > The network device model is a userspace API, and DPDK is a 
>> >> >> >> >> > > userspace application.
>> >> >> >> >> >
>> >> >> >> >> > It is userspace but are you sure dpdk is actually poking at 
>> >> >> >> >> > netdevs?
>> >> >> >> >> > AFAIK it's normally banging device registers directly.
>> >> >> >> >> >
>> >> >> >> >> > > You can't go breaking userspace even if you don't like the 
>> >> >> >> >> > > application.
>> >> >> >> >> >
>> >> >> >> >> > Could you please explain how is the proposed patchset breaking
>> >> >> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the 
>> >> >> >> >> > userspace
>> >> >> >> >> > API at all.
>> >> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux 
>> >> >> >> >> network devices
>> >> >> >> >> to look for Linux netvsc device and the paired VF device and 
>> >> >> >> >> setup the
>> >> >> >> >> DPDK environment.  This setup creates a DPDK failsafe 
>> >> >> >> >> (bondingish) instance
>> >> >> >> >> and sets up TAP support over the Linux netvsc device as well as 
>> >> >> >> >> the Mellanox
>> >> >> >> >> VF device.
>> >> >> >> >>
>> >> >> >> >> So it depends on existing 2 device model. You can't go to a 3 
>> >> >> >> >> device model
>> >> >> >> >> or start hiding devices from userspace.
>> >> >> >> >
>> >> >> >> > Okay so how does the existing patch break that? IIUC does not go 
>> >> >> >>

Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-26 Thread Siwei Liu
On Wed, Apr 25, 2018 at 7:28 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Wed, Apr 25, 2018 at 03:57:57PM -0700, Siwei Liu wrote:
>> On Wed, Apr 25, 2018 at 3:22 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
>> >> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <m...@redhat.com> 
>> >> wrote:
>> >> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> >> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <m...@redhat.com> 
>> >> >> wrote:
>> >> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> >> >> "Michael S. Tsirkin" <m...@redhat.com> wrote:
>> >> >> >>
>> >> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> >> >> >> > > > >
>> >> >> >> > > > >I will NAK patches to change to common code for netvsc 
>> >> >> >> > > > >especially the
>> >> >> >> > > > >three device model.  MS worked hard with distro vendors to 
>> >> >> >> > > > >support transparent
>> >> >> >> > > > >mode, ans we really can't have a new model; or do backport.
>> >> >> >> > > > >
>> >> >> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> >> >> > > >
>> >> >> >> > > > Sorry, but nobody here cares about dpdk or other similar 
>> >> >> >> > > > oddities.
>> >> >> >> > >
>> >> >> >> > > The network device model is a userspace API, and DPDK is a 
>> >> >> >> > > userspace application.
>> >> >> >> >
>> >> >> >> > It is userspace but are you sure dpdk is actually poking at 
>> >> >> >> > netdevs?
>> >> >> >> > AFAIK it's normally banging device registers directly.
>> >> >> >> >
>> >> >> >> > > You can't go breaking userspace even if you don't like the 
>> >> >> >> > > application.
>> >> >> >> >
>> >> >> >> > Could you please explain how is the proposed patchset breaking
>> >> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the 
>> >> >> >> > userspace
>> >> >> >> > API at all.
>> >> >> >> >
>> >> >> >>
>> >> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux 
>> >> >> >> network devices
>> >> >> >> to look for Linux netvsc device and the paired VF device and setup 
>> >> >> >> the
>> >> >> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) 
>> >> >> >> instance
>> >> >> >> and sets up TAP support over the Linux netvsc device as well as the 
>> >> >> >> Mellanox
>> >> >> >> VF device.
>> >> >> >>
>> >> >> >> So it depends on existing 2 device model. You can't go to a 3 
>> >> >> >> device model
>> >> >> >> or start hiding devices from userspace.
>> >> >> >
>> >> >> > Okay so how does the existing patch break that? IIUC does not go to
>> >> >> > a 3 device model since netvsc calls failover_register directly.
>> >> >> >
>> >> >> >> Also, I am working on associating netvsc and VF device based on 
>> >> >> >> serial number
>> >> >> >> rather than MAC address. The serial number is how Windows works 
>> >> >> >> now, and it makes
>> >> >> >> sense for Linux and Windows to use the same mechanism if possible.
>> >> >> >
>> >> >> > Maybe we should support same for virtio ...
>> >> >> > Which serial do you mean? From vpd?
>> >> >> >
>> >> >> > I guess you will want to keep supporting MAC for 

Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-25 Thread Siwei Liu
On Wed, Apr 25, 2018 at 3:22 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
>> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <m...@redhat.com> 
>> >> wrote:
>> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> >> "Michael S. Tsirkin" <m...@redhat.com> wrote:
>> >> >>
>> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> >> >> > > > >
>> >> >> > > > >I will NAK patches to change to common code for netvsc 
>> >> >> > > > >especially the
>> >> >> > > > >three device model.  MS worked hard with distro vendors to 
>> >> >> > > > >support transparent
>> >> >> > > > >mode, ans we really can't have a new model; or do backport.
>> >> >> > > > >
>> >> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> >> > > >
>> >> >> > > > Sorry, but nobody here cares about dpdk or other similar 
>> >> >> > > > oddities.
>> >> >> > >
>> >> >> > > The network device model is a userspace API, and DPDK is a 
>> >> >> > > userspace application.
>> >> >> >
>> >> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> >> >> > AFAIK it's normally banging device registers directly.
>> >> >> >
>> >> >> > > You can't go breaking userspace even if you don't like the 
>> >> >> > > application.
>> >> >> >
>> >> >> > Could you please explain how is the proposed patchset breaking
>> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the 
>> >> >> > userspace
>> >> >> > API at all.
>> >> >> >
>> >> >>
>> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux network 
>> >> >> devices
>> >> >> to look for Linux netvsc device and the paired VF device and setup the
>> >> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) 
>> >> >> instance
>> >> >> and sets up TAP support over the Linux netvsc device as well as the 
>> >> >> Mellanox
>> >> >> VF device.
>> >> >>
>> >> >> So it depends on existing 2 device model. You can't go to a 3 device 
>> >> >> model
>> >> >> or start hiding devices from userspace.
>> >> >
>> >> > Okay so how does the existing patch break that? IIUC does not go to
>> >> > a 3 device model since netvsc calls failover_register directly.
>> >> >
>> >> >> Also, I am working on associating netvsc and VF device based on serial 
>> >> >> number
>> >> >> rather than MAC address. The serial number is how Windows works now, 
>> >> >> and it makes
>> >> >> sense for Linux and Windows to use the same mechanism if possible.
>> >> >
>> >> > Maybe we should support same for virtio ...
>> >> > Which serial do you mean? From vpd?
>> >> >
>> >> > I guess you will want to keep supporting MAC for old hypervisors?
>> >> >
>> >> > It all seems like a reasonable thing to support in the generic core.
>> >>
>> >> That's the reason why I chose explicit identifier rather than rely on
>> >> MAC address to bind/pair a device. MAC address can change. Even if it
>> >> can't, malicious guest user can fake MAC address to skip binding.
>> >>
>> >> -Siwei
>> >
>> > Address should be sampled at device creation to prevent this
>> > kind of hack. Not that it buys the malicious user much:
>> > if you can poke at MAC addresses you probably already can
>> > break networking.
>>
>> I don't understand why poking at MAC address may potentially break
>> networking.
>

Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-25 Thread Siwei Liu
On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> "Michael S. Tsirkin" <m...@redhat.com> wrote:
>> >>
>> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> >> > > > >
>> >> > > > >I will NAK patches to change to common code for netvsc especially 
>> >> > > > >the
>> >> > > > >three device model.  MS worked hard with distro vendors to support 
>> >> > > > >transparent
>> >> > > > >mode, ans we really can't have a new model; or do backport.
>> >> > > > >
>> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> > > >
>> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
>> >> > >
>> >> > > The network device model is a userspace API, and DPDK is a userspace 
>> >> > > application.
>> >> >
>> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> >> > AFAIK it's normally banging device registers directly.
>> >> >
>> >> > > You can't go breaking userspace even if you don't like the 
>> >> > > application.
>> >> >
>> >> > Could you please explain how is the proposed patchset breaking
>> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
>> >> > API at all.
>> >> >
>> >>
>> >> The DPDK has a device driver vdev_netvsc which scans the Linux network 
>> >> devices
>> >> to look for Linux netvsc device and the paired VF device and setup the
>> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) 
>> >> instance
>> >> and sets up TAP support over the Linux netvsc device as well as the 
>> >> Mellanox
>> >> VF device.
>> >>
>> >> So it depends on existing 2 device model. You can't go to a 3 device model
>> >> or start hiding devices from userspace.
>> >
>> > Okay so how does the existing patch break that? IIUC does not go to
>> > a 3 device model since netvsc calls failover_register directly.
>> >
>> >> Also, I am working on associating netvsc and VF device based on serial 
>> >> number
>> >> rather than MAC address. The serial number is how Windows works now, and 
>> >> it makes
>> >> sense for Linux and Windows to use the same mechanism if possible.
>> >
>> > Maybe we should support same for virtio ...
>> > Which serial do you mean? From vpd?
>> >
>> > I guess you will want to keep supporting MAC for old hypervisors?
>> >
>> > It all seems like a reasonable thing to support in the generic core.
>>
>> That's the reason why I chose explicit identifier rather than rely on
>> MAC address to bind/pair a device. MAC address can change. Even if it
>> can't, malicious guest user can fake MAC address to skip binding.
>>
>> -Siwei
>
> Address should be sampled at device creation to prevent this
> kind of hack. Not that it buys the malicious user much:
> if you can poke at MAC addresses you probably already can
> break networking.

I don't understand why poking at the MAC address may potentially break
networking. Unlike a VF, a passthrough PCI endpoint device is free
to change its MAC address. Even in a VF setup it's not necessarily
always safe to assume the VF's MAC address cannot or shouldn't be
changed. That depends on whether the host admin
wants to restrict the guest from changing the MAC address, although in
most cases it's true.

I understand we can use the perm_addr to distinguish. But as said,
this will limit flexible configurations where one can
assign VFs with identical MAC addresses while each VF belongs to a
different PF and/or a different subnet, e.g. for load balancing. And
furthermore, the QEMU device model never interprets the MAC address
as an identifier, which would then be required to be unique per VM
instance. Why are we introducing this inconsistency?
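To illustrate the two pairing policies being debated, here is a purely conceptual sketch contrasting matching a primary (VF) to its standby (virtio) device by MAC address versus by an explicit, hypervisor-assigned group identifier. All names and the dict-based device model are hypothetical, not the kernel or QEMU implementation:

```python
def pair_by_mac(standby, candidates):
    # Ambiguous when several devices legitimately share a MAC address
    # (e.g. VFs on different PFs used for load balancing).
    return [d for d in candidates if d["mac"] == standby["mac"]]

def pair_by_group_id(standby, candidates):
    # Unambiguous: the hypervisor assigns one group ID per standby/primary
    # pair, independent of any guest-visible MAC configuration.
    return [d for d in candidates if d.get("group_id") == standby["group_id"]]
```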

-Siwei

>
>
>
>
>>
>> >
>> > --
>> > MST


Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-23 Thread Siwei Liu
On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin  wrote:
> On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> On Mon, 23 Apr 2018 20:24:56 +0300
>> "Michael S. Tsirkin"  wrote:
>>
>> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> > > > >
>> > > > >I will NAK patches to change to common code for netvsc especially the
>> > > > >three device model.  MS worked hard with distro vendors to support 
>> > > > >transparent
>> > > > >mode, ans we really can't have a new model; or do backport.
>> > > > >
>> > > > >Plus, DPDK is now dependent on existing model.
>> > > >
>> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
>> > >
>> > > The network device model is a userspace API, and DPDK is a userspace 
>> > > application.
>> >
>> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> > AFAIK it's normally banging device registers directly.
>> >
>> > > You can't go breaking userspace even if you don't like the application.
>> >
>> > Could you please explain how is the proposed patchset breaking
>> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
>> > API at all.
>> >
>>
>> The DPDK has a device driver vdev_netvsc which scans the Linux network 
>> devices
>> to look for Linux netvsc device and the paired VF device and setup the
>> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
>> and sets up TAP support over the Linux netvsc device as well as the Mellanox
>> VF device.
>>
>> So it depends on existing 2 device model. You can't go to a 3 device model
>> or start hiding devices from userspace.
>
> Okay so how does the existing patch break that? IIUC does not go to
> a 3 device model since netvsc calls failover_register directly.
>
>> Also, I am working on associating netvsc and VF device based on serial number
>> rather than MAC address. The serial number is how Windows works now, and it 
>> makes
>> sense for Linux and Windows to use the same mechanism if possible.
>
> Maybe we should support same for virtio ...
> Which serial do you mean? From vpd?
>
> I guess you will want to keep supporting MAC for old hypervisors?
>
> It all seems like a reasonable thing to support in the generic core.

That's the reason why I chose an explicit identifier rather than relying on
the MAC address to bind/pair a device. A MAC address can change. Even if it
can't, a malicious guest user can fake the MAC address to skip binding.

-Siwei


>
> --
> MST


Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-19 Thread Siwei Liu
On Wed, Apr 18, 2018 at 11:10 PM, Samudrala, Sridhar
<sridhar.samudr...@intel.com> wrote:
>
> On 4/18/2018 10:07 PM, Michael S. Tsirkin wrote:
>>
>> On Wed, Apr 18, 2018 at 10:00:51PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>>>
>>>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>>>
>>>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>>>
>>>>>> I ran this with a few folks offline and gathered some good feedbacks
>>>>>> that I'd like to share thus revive the discussion.
>>>>>>
>>>>>> First of all, as illustrated in the reply below, cloud service
>>>>>> providers require transparent live migration. Specifically, the main
>>>>>> target of our case is to support SR-IOV live migration via kernel
>>>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>>>> because this use case is not appealing enough for the mainline to
>>>>>> adopt, I will shut up and not continue discussing, although
>>>>>> technically it's entirely possible (and there's precedent in other
>>>>>> implementation) to do so to benefit any cloud service providers.
>>>>>>
>>>>>> If it's just the implementation of hiding netdev itself needs to be
>>>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>>>> API, that's completely fine and we can look into that. However, the
>>>>>> specific issue needs to be undestood beforehand is to make transparent
>>>>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>>>>> from the lower netdev, which needs some games with uevents and name
>>>>>> space reservation. So far I don't think it's been well discussed.
>>>>>>
>>>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>>>> model currently missed to address the core problem of live migration:
>>>>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>>>>> and hardware offloading states. Only general network state (IP
>>>>>> address, gateway, for eg.) associated with the bypass interface can be
>>>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>>>> save and apply those hardware specific configs before or after
>>>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>>>> part of this patch series will be able to solve that problem naturally
>>>>>> by making all hardware specific configurations go through the central
>>>>>> bypass driver, such that hardware configurations can be replayed when
>>>>>> new VF or passthrough gets plugged back in. Although that
>>>>>> corresponding function hasn't been implemented today, I'd like to
>>>>>> refresh everyone's mind that is the core problem any live migration
>>>>>> proposal should have addressed.
>>>>>>
>>>>>> If it would make things more clear to defer netdev hiding until all
>>>>>> functionalities regarding centralizing and replay are implemented,
>>>>>> we'd take advices like that and move on to implementing those features
>>>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>>>> the work for hiding lower netdev at that point. Think it would be the
>>>>>> best to make everyone understand the big picture in advance before
>>>>>> going too far.
>>>>>
>>>>> I think we should get the 3-netdev model integrated and add any
>>>>> additional
>>>>> ndo_ops/ethool ops that we would like to support/migrate before looking
>>>>> into
>>>>> hiding the lower netdevs.
>>>>
>>>> Once they are exposed, I don't think we'll be able to hide them -
>>>> they will be a kernel ABI.
>>>>
>>>> Do you think everyone needs to hide the SRIOV device?
>>>> Or that only some users need this?
>>>
>>> Hyper-V is currently supporting live migration without hiding the SR-IOV
>>> device. So i don't
>>> think it is a hard requirement.
>>
>> OK, fine.
>>
>>> And also,  as we don't yet have a consensus on 

Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-19 Thread Siwei Liu
On Wed, Apr 18, 2018 at 10:00 PM, Samudrala, Sridhar
<sridhar.samudr...@intel.com> wrote:
> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>
>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>
>>>> I ran this with a few folks offline and gathered some good feedbacks
>>>> that I'd like to share thus revive the discussion.
>>>>
>>>> First of all, as illustrated in the reply below, cloud service
>>>> providers require transparent live migration. Specifically, the main
>>>> target of our case is to support SR-IOV live migration via kernel
>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>> because this use case is not appealing enough for the mainline to
>>>> adopt, I will shut up and not continue discussing, although
>>>> technically it's entirely possible (and there's precedent in other
>>>> implementation) to do so to benefit any cloud service providers.
>>>>
>>>> If it's just the implementation of hiding netdev itself needs to be
>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>> API, that's completely fine and we can look into that. However, the
>>>> specific issue needs to be undestood beforehand is to make transparent
>>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>>> from the lower netdev, which needs some games with uevents and name
>>>> space reservation. So far I don't think it's been well discussed.
>>>>
>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>> model currently missed to address the core problem of live migration:
>>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>>> and hardware offloading states. Only general network state (IP
>>>> address, gateway, for eg.) associated with the bypass interface can be
>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>> save and apply those hardware specific configs before or after
>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>> part of this patch series will be able to solve that problem naturally
>>>> by making all hardware specific configurations go through the central
>>>> bypass driver, such that hardware configurations can be replayed when
>>>> new VF or passthrough gets plugged back in. Although that
>>>> corresponding function hasn't been implemented today, I'd like to
>>>> refresh everyone's mind that is the core problem any live migration
>>>> proposal should have addressed.
>>>>
>>>> If it would make things more clear to defer netdev hiding until all
>>>> functionalities regarding centralizing and replay are implemented,
>>>> we'd take advices like that and move on to implementing those features
>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>> the work for hiding lower netdev at that point. Think it would be the
>>>> best to make everyone understand the big picture in advance before
>>>> going too far.
>>>
>>> I think we should get the 3-netdev model integrated and add any
>>> additional
>>> ndo_ops/ethool ops that we would like to support/migrate before looking
>>> into
>>> hiding the lower netdevs.
>>
>> Once they are exposed, I don't think we'll be able to hide them -
>> they will be a kernel ABI.
>>
>> Do you think everyone needs to hide the SRIOV device?
>> Or that only some users need this?
>
>
> Hyper-V is currently supporting live migration without hiding the SR-IOV
> device. So i don't
> think it is a hard requirement. And also,  as we don't yet have a consensus
This is a vague point, as Hyper-V is mostly Windows oriented: its
target users rarely change adapter settings in Device Manager, as
those are already hidden quite deep. It does not address the general
case for SR-IOV live migration but just a subset, so why are we making
such a comparison?

Note it's always a hard requirement for live migration that *all
states* be migrated no matter what the implementation is going to
be. The current 3-netdev model is far from useful for
real-world scenarios and has no advantage over using
userspace generic bonding.

-Siwei

> on how to hide
> the lower netdevs, we could make it as another feature bit to hide lower
> netdevs once
> we have an acceptable solution.
>


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-17 Thread Siwei Liu
I ran this by a few folks offline and gathered some good feedback
that I'd like to share to revive the discussion.

First of all, as illustrated in the reply below, cloud service
providers require transparent live migration. Specifically, the main
target of our case is to support SR-IOV live migration via kernel
upgrade while keeping the userspace of old distros unmodified. If it's
because this use case is not appealing enough for the mainline to
adopt, I will shut up and not continue discussing, although
technically it's entirely possible (and there's precedent in other
implementation) to do so to benefit any cloud service providers.

If it's just the implementation of hiding the netdev itself that needs to
be improved, such as implementing it as an attribute flag or adding a
linkdump API, that's completely fine and we can look into that. However, the
specific issue that needs to be understood beforehand is making transparent
SR-IOV able to take over the name (and so inherit all the configs)
from the lower netdev, which needs some games with uevents and name
space reservation. So far I don't think it's been well discussed.

One thing in particular I'd like to point out is that the 3-netdev
model currently fails to address the core problem of live migration:
migration of hardware-specific feature/state, e.g. ethtool configs
and hardware offloading states. Only general network state (IP
address, gateway, etc.) associated with the bypass interface can be
migrated. As follow-up work, the bypass driver can/should be enhanced to
save and apply those hardware-specific configs before or after
migration as needed. The transparent 1-netdev model being proposed as
part of this patch series will be able to solve that problem naturally
by making all hardware-specific configurations go through the central
bypass driver, such that hardware configurations can be replayed when
a new VF or passthrough device gets plugged back in. Although the
corresponding function hasn't been implemented today, I'd like to
remind everyone that this is the core problem any live migration
proposal should address.
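The record-and-replay idea sketched above could look roughly like this. This is a conceptual model under assumed names (CentralDev, apply, plug_primary), not the proposed patch's code: the central 1-netdev driver records every hardware-specific setting applied through it and replays the set when a new primary (VF) is hot-plugged after migration.

```python
class CentralDev:
    def __init__(self):
        self.saved = {}      # last-known hardware config, keyed by option
        self.primary = None  # current primary (VF) device, modeled as a dict

    def apply(self, option, value):
        self.saved[option] = value          # always record centrally
        if self.primary is not None:
            self.primary[option] = value    # and push to the current primary

    def plug_primary(self, primary):
        self.primary = primary
        for opt, val in self.saved.items(): # replay saved config on hot-plug
            primary[opt] = val
```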

If it would make things more clear to defer netdev hiding until all
functionalities regarding centralizing and replay are implemented,
we'd take advices like that and move on to implementing those features
as follow-up patches. Once all needed features get done, we'd resume
the work for hiding lower netdev at that point. Think it would be the
best to make everyone understand the big picture in advance before
going too far.

Thanks, comments welcome.

-Siwei


On Mon, Apr 9, 2018 at 11:48 PM, Siwei Liu <losewe...@gmail.com> wrote:
> On Sun, Apr 8, 2018 at 9:32 AM, David Miller <da...@davemloft.net> wrote:
>> From: Siwei Liu <losewe...@gmail.com>
>> Date: Fri, 6 Apr 2018 19:32:05 -0700
>>
>>> And I assume everyone here understands the use case for live
>>> migration (in the context of providing cloud service) is very
>>> different, and we have to hide the netdevs. If not, I'm more than
>>> happy to clarify.
>>
>> I think you still need to clarify.
>
> OK. The short answer is cloud users really want *transparent* live migration.
>
> By being transparent it means they don't and shouldn't care about the
> existence and the occurence of live migration, but they do if
> userspace toolstack and libraries have to be updated or modified,
> which means potential dependency brokeness of their applications. They
> don't like any change to the userspace envinroment (existing apps
> lift-and-shift, no recompilation, no re-packaging, no re-certification
> needed), while no one barely cares about ABI or API compatibility in
> the kernel level, as long as their applications don't break.
>
> I agree the current bypass solution for SR-IOV live migration requires
> guest cooperation. Though it doesn't mean guest *userspace*
> cooperation. As a matter of fact, techinically it shouldn't invovle
> userspace at all to get SR-IOV migration working. It's the kernel that
> does the real work. If I understand the goal of this in-kernel
> approach correctly, it was meant to save userspace from modification
> or corresponding toolstack support, as those additional 2 interfaces
> is more a side product of this approach, rather than being neccessary
> for users to be aware of. All what the user needs to deal with is one
> single interface, and that's what they care about. It's more a trouble
> than help when they see 2 extra interfaces are present. Management
> tools in the old distros don't recoginze them and try to bring up
> those extra interfaces for its own. Various odd warnings start to spew
> out, and there's a lot of caveats for the users to get around...
>
> On the other hand, if we "teach" those cloud users to update the
> userspace toolstack just for trading a feature th

Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-10 Thread Siwei Liu
On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin  wrote:
> On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala  wrote:
>>
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> >
>> > Signed-off-by: Sridhar Samudrala 
>> > ---
>>
>> Thanks for doing this.  Your current version has couple show stopper
>> issues.
>>
>> First, the slave device is instantly taken over by the master.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to workaround this.
>
> Interesting. Does this mean udev must act within a specific time window
> then?

Sigh, lots of hacks. Why propagate this from the driver to a common
module? We really need a clean solution.

-Siwei
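The rename race being discussed can be reduced to event ordering:
udev's rename must land before the failover module takes the slave
over, which is what netvsc's delayed work buys. A toy model (not the
actual netvsc code; `hotplug` and the event names are made up):

```python
class Netdev:
    def __init__(self, name):
        self.name = name
        self.enslaved = False

    def rename(self, new_name):
        # Mirrors udev's rename failing once the device has already
        # been taken over as a slave.
        if self.enslaved:
            return False
        self.name = new_name
        return True

def hotplug(vf, udev_runs_first):
    """Toy event ordering: udev rename vs. failover enslavement."""
    events = (["udev_rename", "enslave"] if udev_runs_first
              else ["enslave", "udev_rename"])
    renamed = False
    for ev in events:
        if ev == "enslave":
            vf.enslaved = True
        else:
            renamed = vf.rename("ens3")
    return renamed

# Immediate takeover: udev loses the race and the rename fails.
# A deferred takeover is equivalent to letting udev run first.
```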


>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>>
>> Lastly, more indirection is bad in current climate.
>>
>> I am not completely adverse to this but it needs to be fast, simple
>> and completely transparent.
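Stephen's second point — deferring queue selection to the VF, as
commit b3bf5666a510 does — amounts to the master's select-queue
callback delegating to the active slave when one is present. A
simplified sketch (the real callback operates on skbs and ndo hooks,
not bare hashes):

```python
def select_queue(pkt_hash, num_queues, vf_select=None):
    """Pick a TX queue for the master netdev.

    When a VF slave is active, defer to its own queue-selection
    callback (it may steer flows based on hardware state, which
    matters for UDP flows); otherwise fall back to simple hashing
    over the master's queues.
    """
    if vf_select is not None:
        return vf_select(pkt_hash)
    return pkt_hash % num_queues
```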
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Siwei Liu
On Tue, Apr 10, 2018 at 8:43 AM, Jiri Pirko  wrote:
> Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudr...@intel.com wrote:
>>On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com wrote:
>>> > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>>> > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com wrote:
>>> > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>>> > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com 
>>> > > > > wrote:
>>> > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>> > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, 
>>> > > > > > > sridhar.samudr...@intel.com wrote:
>>> > > > > [...]
>>> > > > >
>>> > > > > > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>> > > > > > > > +                                     struct net_device *child_netdev)
>>> > > > > > > > +{
>>> > > > > > > > +  struct virtnet_bypass_info *vbi;
>>> > > > > > > > +  bool backup;
>>> > > > > > > > +
>>> > > > > > > > +  vbi = netdev_priv(bypass_netdev);
>>> > > > > > > > +  backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>> > > > > > > > +  if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>> > > > > > > > +      rtnl_dereference(vbi->active_netdev)) {
>>> > > > > > > > +          netdev_info(bypass_netdev,
>>> > > > > > > > +                      "%s attempting to join bypass dev when %s already present\n",
>>> > > > > > > > +                      child_netdev->name, backup ? "backup" : "active");
>>> > > > > > > Bypass module should check if there is already some other netdev
>>> > > > > > > enslaved and refuse right there.
>>> > > > > > This will work for virtio-net with 3 netdev model, but this check 
>>> > > > > > has to be done by netvsc
>>> > > > > > as its bypass_netdev is same as the backup_netdev.
>>> > > > > > Will add a flag while registering with the bypass module to 
>>> > > > > > indicate if the driver is doing
>>> > > > > > a 2 netdev or 3 netdev model and based on that flag this check 
>>> > > > > > can be done in bypass module
>>> > > > > > for 3 netdev scenario.
>>> > > > > Just let me undestand it clearly. What I expect the difference 
>>> > > > > would be
>>> > > > > between 2netdev and3 netdev model is this:
>>> > > > > 2netdev:
>>> > > > >  bypass_master
>>> > > > > /
>>> > > > >/
>>> > > > > VF_slave
>>> > > > >
>>> > > > > 3netdev:
>>> > > > >  bypass_master
>>> > > > > / \
>>> > > > >/   \
>>> > > > > VF_slave   backup_slave
>>> > > > >
>>> > > > > Is that correct? If not, how does it look like?
>>> > > > >
>>> > > > >
>>> > > > Looks correct.
>>> > > > VF_slave and backup_slave are the original netdevs and are present in 
>>> > > > both the models.
>>> > > > In the 3 netdev model, bypass_master netdev is created and VF_slave 
>>> > > > and backup_slave are
>>> > > > marked as the 2 slaves of this new netdev.
>>> > > You say it looks correct and in another sentence you provide completely
>>> > > different description. Could you please look again?
>>> > >
>>> > To be exact, 2 netdev model with netvsc looks like this.
>>> >
>>> > netvsc_netdev
>>> >   /
>>> >  /
>>> > VF_slave
>>> >
>>> > With virtio_net, 3 netdev model
>>> >
>>> >   bypass_netdev
>>> >   / \
>>> >  /   \
>>> > VF_slave   virtio_net netdev
>>> Could you also mark the original netdev which is there now? is it
>>> bypass_netdev or virtio_net_netdev ?
>>
>> bypass_netdev
>> / \
>>/   \
>>VF_slave   virtio_net netdev (original)
>
> That does not make sense.
> 1) You diverge from the behaviour of the netvsc, where the original
>netdev is a master of the VF
> 2) If the original netdev is a slave, you cannot have any IP address
>configured on it (well you could, but the rx_handler would eat every
>incoming packet). So you will break the user bacause he would have to
>move the configuration to the new master device.

That's exactly the point, and why I need to hide the lower netdev
slaves and try to align the naming of the bypass device with where IP
was configured on the original netdev. The current 3-netdev model is
not "transparent" by any means.

-Siwei
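Jiri's point 2 can be demonstrated with a minimal rx_handler model:
once a device is enslaved, its rx_handler steers every incoming
packet to the master, so an IP address configured on the slave would
never see traffic. This is a simplification — the kernel's rx_handler
contract is richer — but the delivery consequence is the same:

```python
class Netdev:
    def __init__(self, name):
        self.name = name
        self.rx_handler = None     # set when enslaved
        self.delivered = []        # packets seen by this netdev's stack

    def receive(self, pkt):
        # Mirrors receive processing: an enslaved device's rx_handler
        # can steal the packet before local delivery.
        if self.rx_handler is not None:
            target = self.rx_handler(pkt)
            target.delivered.append(pkt)
        else:
            self.delivered.append(pkt)

master = Netdev("bypass0")
slave = Netdev("virtio0")
slave.rx_handler = lambda pkt: master   # enslave: redirect everything

slave.receive("ping")                   # arrives at the master only
```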

> This only makes sense that the original netdev becomes the master for both
> netvsc and virtio_net.


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-10 Thread Siwei Liu
On Sun, Apr 8, 2018 at 9:32 AM, David Miller <da...@davemloft.net> wrote:
> From: Siwei Liu <losewe...@gmail.com>
> Date: Fri, 6 Apr 2018 19:32:05 -0700
>
>> And I assume everyone here understands the use case for live
>> migration (in the context of providing cloud service) is very
>> different, and we have to hide the netdevs. If not, I'm more than
>> happy to clarify.
>
> I think you still need to clarify.

OK. The short answer is cloud users really want *transparent* live migration.

Being transparent means they don't and shouldn't care about the
existence and the occurrence of live migration, but they do care if
userspace toolstacks and libraries have to be updated or modified,
which means potential dependency breakage for their applications. They
don't like any change to the userspace environment (existing apps
lift-and-shift, no recompilation, no re-packaging, no re-certification
needed), while hardly anyone cares about ABI or API compatibility at
the kernel level, as long as their applications don't break.

I agree the current bypass solution for SR-IOV live migration requires
guest cooperation. Though it doesn't mean guest *userspace*
cooperation. As a matter of fact, technically it shouldn't involve
userspace at all to get SR-IOV migration working. It's the kernel that
does the real work. If I understand the goal of this in-kernel
approach correctly, it was meant to save userspace from modification
or corresponding toolstack support, as the additional two interfaces
are more a side product of this approach than something necessary
for users to be aware of. All the user needs to deal with is one
single interface, and that's what they care about. It's more trouble
than help when they see two extra interfaces present. Management
tools in old distros don't recognize them and try to bring up
those extra interfaces on their own. Various odd warnings start to
spew out, and there are a lot of caveats for users to get around...

On the other hand, if we "teach" those cloud users to update the
userspace toolstack just in exchange for a feature they don't need, no
one is likely to embrace the change. As such there's just no real
value in adopting this in-kernel bypass facility for any cloud service
provider. It does not look more appealing than just configuring
generic bonding with its own set of daemons or scripts. But again,
cloud users don't welcome that facility. And it would basically run
into nearly the same set of problems as leaving it to userspace.

IMHO we're not hiding the devices; think of it as adding a feature
that is transparent to the user. Those auto-managed slaves are ones
users don't care much about. And the user is still able to see and
configure the lower netdevs if they really desire to do so. But
generally the target user for this feature won't need to know that.
Why should they care how many interfaces a VM virtually has, rather
than how many interfaces are actually _usable_ to them?

Thanks,
-Siwei


>
> netdevs are netdevs.  If they have special attributes, mark them as
> such and the tools base their actions upon that.
>
> "Hiding", or changing classes, doesn't make any sense to me still.


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Siwei Liu
On Mon, Apr 9, 2018 at 4:03 PM, Stephen Hemminger
<step...@networkplumber.org> wrote:
> On Mon, 9 Apr 2018 15:30:42 -0700
> Siwei Liu <losewe...@gmail.com> wrote:
>
>> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn <and...@lunn.ch> wrote:
>> >> No, implementation wise I'd avoid changing the class on the fly. What
>> >> I'm looking to is a means to add a secondary class or class aliasing
>> >> mechanism for netdevs that allows mapping for a kernel device
>> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> >> creating symlinks between these two namespaces as an analogy. All
>> >> userspace visible netdevs today will have both a kernel name and a
>> >> userspace visible name, having one (/class/net) referecing the other
>> >> (/class/net-kernel) in its own namespace. The newly introduced
>> >> IFF_AUTO_MANAGED device will have a kernel name only
>> >> (/class/net-kernel). As a result, the existing applications using
>> >> /class/net don't break, while we're adding the kernel namespace that
>> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> >> at all.
>> >
>> > My gut feeling is this whole scheme will not fly. You really should be
>> > talking to GregKH.
>>
>> Will do. Before spreading it out loudly I'd run it within netdev to
>> clarify the need for why not exposing the lower netdevs is critical
>> for cloud service providers in the face of introducing a new feature,
>> and we are not hiding anything but exposing it in a way that don't
>> break existing userspace applications while introducing feature is
>> possible with the limitation of keeping old userspace still.
>>
>> >
>> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
>> > A device can start out as a normal device, and will change to being
>> > automatic later, when the user on top of it probes.
>>
>> Sure. In whatever form it's still a netdev, and changing the namespace
>> should be more dynamic than changing the class.
>>
>> Thanks a lot,
>> -Siwei
>>
>> >
>> > Andrew
>
> Also, remember for netdev's /sys is really a third class API.
> The primary API's are netlink and ioctl. Also why not use existing
> network namespaces rather than inventing a new abstraction?

Because we want to leave old userspace unmodified while making SR-IOV
live migration transparent to users. Specifically, we'd want old udevd
to skip the uevents for the lower netdevs, while also making new
udevd able to name the bypass_master interface by referencing the PCI
slot information, which is only present in the sysfs entry for the
IFF_AUTO_MANAGED net device.

The problem with using a network namespace is that no sysfs entry
will be around for an IFF_AUTO_MANAGED netdev if we isolate it out to
a separate netns, unless we generalize the scope of what netns is
designed for (isolation, I mean). For auto-managed netdevs we don't
necessarily want strict isolation, but rather a way of sticking to the
1-netdev model for the strict backward compatibility requirement of
old userspace, while exposing the information in a way new userspace
understands.

Thanks,
-Siwei


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Siwei Liu
On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn  wrote:
>> No, implementation wise I'd avoid changing the class on the fly. What
>> I'm looking to is a means to add a secondary class or class aliasing
>> mechanism for netdevs that allows mapping for a kernel device
>> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> creating symlinks between these two namespaces as an analogy. All
>> userspace visible netdevs today will have both a kernel name and a
>> userspace visible name, having one (/class/net) referecing the other
>> (/class/net-kernel) in its own namespace. The newly introduced
>> IFF_AUTO_MANAGED device will have a kernel name only
>> (/class/net-kernel). As a result, the existing applications using
>> /class/net don't break, while we're adding the kernel namespace that
>> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> at all.
>
> My gut feeling is this whole scheme will not fly. You really should be
> talking to GregKH.

Will do. Before spreading it more widely I'd like to run it by netdev
to clarify why not exposing the lower netdevs is critical for cloud
service providers when introducing a new feature: we are not hiding
anything, but exposing it in a way that doesn't break existing
userspace applications, making the new feature possible while keeping
old userspace working.

>
> Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
> A device can start out as a normal device, and will change to being
> automatic later, when the user on top of it probes.

Sure. In whatever form it's still a netdev, and changing the namespace
should be more dynamic than changing the class.

Thanks a lot,
-Siwei

>
> Andrew


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Siwei Liu
On Fri, Apr 6, 2018 at 8:19 PM, Andrew Lunn  wrote:
> Hi Siwei
>
>> I think everyone seems to agree not to fiddle with the ":" prefix, but
>> rather have a new class of network subsystem under /sys/class thus a
>> separate device namespace e.g. /sys/class/net-kernel for those
>> auto-managed lower netdevs is needed.
>
> How do you get a device into this new class? I don't know the Linux
> driver model too well, but to get a device out of one class and into
> another, i think you need to device_del(dev). modify dev->class and
> then device_add(dev). However, device_add() says you are not allowed
> to do this.

No, implementation-wise I'd avoid changing the class on the fly. What
I'm looking for is a means to add a secondary class or class-aliasing
mechanism for netdevs that allows mapping from a kernel device
namespace (/class/net-kernel) to userspace (/class/net). Imagine
creating symlinks between these two namespaces as an analogy. All
userspace-visible netdevs today will have both a kernel name and a
userspace-visible name, with one (/class/net) referencing the other
(/class/net-kernel) in its own namespace. The newly introduced
IFF_AUTO_MANAGED device will have a kernel name only
(/class/net-kernel). As a result, the existing applications using
/class/net don't break, while we're adding the kernel namespace that
allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
at all.

As this requires changing the internals of the driver model core,
it's a rather big-hammer approach I'd think. If there exists a better
implementation than this to allow adding a separate layer of in-kernel
device namespace, I'd be more than happy to hear about it.

>
> And i don't even see how this helps. Are you also not going to call
> list_netdevice()? Are you going to add some other list for these
> devices in a different class?

list_netdevice() is still called. With the current RFC patch, I've
added two lists for netdevs under the kernel namespace:
dev_cmpl_list and name_cmpl_hlist. As a result, every userspace netdev
that gets registered is added to two kinds of lists: the userspace
list, e.g. dev_list, and also the kernelspace list, e.g. dev_cmpl_list
(I can rename it to something more accurate). An IFF_AUTO_MANAGED
device is only added to the kernelspace list, e.g. dev_cmpl_list.
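The two-list scheme can be sketched as follows (the list and flag
names follow the RFC; everything else is illustrative):

```python
class Registry:
    """Toy model of dual registration: every netdev lands on the
    complete kernel-side list, but IFF_AUTO_MANAGED devices are kept
    off the userspace-visible one."""
    def __init__(self):
        self.dev_list = []        # userspace-visible (/class/net)
        self.dev_cmpl_list = []   # complete kernel view (/class/net-kernel)

    def register(self, name, auto_managed=False):
        self.dev_cmpl_list.append(name)
        if not auto_managed:
            self.dev_list.append(name)

reg = Registry()
reg.register("ens3")                       # normal netdev: on both lists
reg.register("vf0", auto_managed=True)     # IFF_AUTO_MANAGED: kernel-only
```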

Hope all your questions are answered.

Thanks,
-Siwei


>
>Andrew


Re: [virtio-dev] Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device

2018-04-06 Thread Siwei Liu
(clicked the wrong reply button again, sorry)


On Thu, Apr 5, 2018 at 8:31 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 04/04/2018 10:02, Siwei Liu wrote:
>>> pci_bus_num is almost always a bug if not done within
>>> a context of a PCI host, bridge, etc.
>>>
>>> In particular this will not DTRT if run before guest assigns bus
>>> numbers.
>>>
>> I was seeking means to reserve a specific pci bus slot from drivers,
>> and update the driver when guest assigns the bus number but it seems
>> there's no low-hanging fruits. Because of that reason the bus_num is
>> only obtained until it's really needed (during get_config) and I
>> assume at that point the pci bus assignment is already done. I know
>> the current one is not perfect, but we need that information (PCI
>> bus:slot.func number) to name the guest device correctly.
>
> Can you use the -device "id", and look it up as
>
> devices = container_get(qdev_get_machine(), "/peripheral");
> return object_resolve_path_component(devices, id);


No. The problem with using the device id is that the vfio device may
come and go at any time; this is particularly true when live migration
is happening. There's no guarantee we can get the bus:device.func info
if that device is gone. Currently the binding between vfio and
virtio-net is weakly coupled through the backup property; there's no
better way than specifying the bus id and addr property directly.

Regards,
-Siwei

>
> ?
>
> Thanks,
>
> Paolo


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-06 Thread Siwei Liu
On Wed, Apr 4, 2018 at 10:37 AM, David Miller  wrote:
> From: David Ahern 
> Date: Wed, 4 Apr 2018 11:21:54 -0600
>
>> It is a netdev so there is no reason to have a separate ip command to
>> inspect it. 'ip link' is the right place.
>
> I agree on this.

I'm completely fine with having an API for inspection purposes. The
thing is that we'd perhaps need to go for the namespace approach, for
which I think everyone seems to agree not to fiddle with the ":"
prefix, but rather have a new class of network subsystem under
/sys/class, and thus a separate device namespace, e.g.
/sys/class/net-kernel, for those auto-managed lower netdevs.

And I assume everyone here understands that the use case for live
migration (in the context of providing cloud service) is very
different, and we have to hide the netdevs. If not, I'm more than
happy to clarify.

With that in mind, if we have a new class of net-kernel namespace, we
can name the kernel device elaborately, and it does not necessarily
have to equal the device name exposed to userspace. For example, we
can use the driver name as the prefix, as opposed to "eth" or ":eth".
And we don't need to have auto-managed netdevs locked into the ":"
prefix at all (I intentionally left it out in this RFC patch to ask
for comments on the namespace solution, which is much cleaner). That
said, a userspace-named device through udev may be called something
like ens3 or switch1-port2, but in the kernel-net namespace it may
look like ixgbevf0 or mlxsw1p2.

So if we all agree that introducing a new namespace is the right thing
to do, `ip link' will no longer serve the purpose of displaying the
information for kernel-net devnames, for the sake of avoiding
ambiguity and namespace collision: it's entirely possible an ip link
name could collide with a kernel-net devname, and it becomes unclear
which name of a netdev object the command is expected to operate on.
That's why I thought showing the kernel-only netdevs via a separate
subcommand makes more sense.

Thoughts and comments? Please let me know.

Thanks,
-Siwei

>
> What I really don't understand still is the use case... really.
>
> So there are control netdevs, what exactly is the problem with that?
>
> Are we not exporting enough information for applications to handle
> these devices sanely?  If so, then's let add that information.
>
> We can set netdev->type to ETH_P_LINUXCONTROL or something like that.
>
> Another alternative is to add an interface flag like IFF_CONTROL or
> similar, and that probably is much nicer.
>
> Hiding the devices means that we acknowledge that applications are
> currently broken with control netdevs... and we want them to stay
> broken!
>
> That doesn't sound like a good plan to me.
>
> So let's fix handling of control netdevs instead of hiding them.
>
> Thanks.


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-06 Thread Siwei Liu
(putting the discussion back on the list, from which it got accidentally dropped)

On Tue, Apr 3, 2018 at 4:08 PM, Stephen Hemminger
<step...@networkplumber.org> wrote:
> On Tue, 3 Apr 2018 12:04:38 -0700
> Siwei Liu <losewe...@gmail.com> wrote:
>
>> On Tue, Apr 3, 2018 at 10:35 AM, Stephen Hemminger
>> <step...@networkplumber.org> wrote:
>> > On Sun,  1 Apr 2018 05:13:09 -0400
>> > Si-Wei Liu <si-wei@oracle.com> wrote:
>> >
>> >> Hidden netdevice is not visible to userspace such that
>> >> typical network utilites e.g. ip, ifconfig and et al,
>> >> cannot sense its existence or configure it. Internally
>> >> hidden netdev may associate with an upper level netdev
>> >> that userspace has access to. Although userspace cannot
>> >> manipulate the lower netdev directly, user may control
>> >> or configure the underlying hidden device through the
>> >> upper-level netdev. For identification purpose, the
>> >> kobject for hidden netdev still presents in the sysfs
>> >> hierarchy, however, no uevent message will be generated
>> >> when the sysfs entry is created, modified or destroyed.
>> >>
>> >> For that end, a separate namescope needs to be carved
>> >> out for IFF_HIDDEN netdevs. As of now netdev name that
>> >> starts with colon i.e. ':' is invalid in userspace,
>> >> since socket ioctls such as SIOCGIFCONF use ':' as the
>> >> separator for ifname. The absence of namescope started
>> >> with ':' can rightly be used as the namescope for
>> >> the kernel-only IFF_HIDDEN netdevs.
>> >>
>> >> Signed-off-by: Si-Wei Liu <si-wei@oracle.com>
>> >> ---
>> >
>> > I understand the use case. I proposed using . as a prefix before
>> > but that ran into resistance. Using colon seems worse.
>>
>> Using dot (.) can't be good because it would cause namespace collision
>> and thus breaking apps when you hide the device. Imagine a user really
>> wants to add a link with the same name as the one hidden and it starts
>> with a dot. It would fail, and users don't know its just because the
>> name starts with dot. IMHO users should be agnostic of (the namespace
>> of) hidden device at all if what they pick is a valid name.
>>
>> ":" is an invalid prefix to userspace, there's no such problem if
>> being used to construct the namescope for hidden devices.
>>
>> However, technically, just as what I alluded to in the reply earlier,
>> it might really be consistent to put this under a separeate namespace
>> instead than fiddling with name prefix. But I am just not sure if that
>> is a big hammer and would like to earn enough feedback and attention
>> before going that way too quickly.
>>
>>
>> >
>> > Rather than playing with names and all the issues that can cause,
>> > why not make it an attribute flag of the device in netlink.
>>
>> Atrribute flag doesn't help. It's a matter of namespace.
>>
>> Regards,
>> -Siwei
>
> In Vyatta, we used names like ".isatap" for devices that would clutter up
> the user experience. They are naturally not visible by simple scans of
> /sys/class/net, and there was a patch to ignore them in iproute2.
> It was a hack which worked but not really worth upstreaming.
>
> The question is if this a security feature then it needs to be more

I don't expect the namespace to be a security feature, but rather a
way to make old userspace work unmodified with a new feature. And
we're going to add an API to expose the netdev info for the invisible
IFF_AUTO_MANAGED links anyway. We don't need to make it secure and
keep it all hidden in the dark, to be honest.

> robust than just name prefix. Plus it took years to handle network
> namespaces everywhere; this kind of flag would start same problems.
>
> Network namespaces work but have the problem namespaces only weakly
> support hierarchy and nesting. I prefer the namespace approach
> because it fits better and has less impact.

Great, thanks!

-Siwei
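The argument made earlier in this thread for ":" over "." as the
kernel-only prefix can be stated as a simple invariant: userspace
socket ioctls like SIOCGIFCONF already treat ":" as a separator, so a
":"-prefixed name can never collide with a name userspace is allowed
to create, whereas a "."-prefixed name can. A simplified validity
check (the real ifname rules also bound length, etc.):

```python
def userspace_name_ok(name):
    # Userspace ifname rules (simplified): non-empty and no ':',
    # since socket ioctls such as SIOCGIFCONF use ':' as a separator.
    return bool(name) and ":" not in name

def hidden_name(name):
    # Kernel-only namescope: prefix with ':' so it can never collide
    # with any name userspace could legitimately pick.
    return ":" + name
```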


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-04 Thread Siwei Liu
On Wed, Apr 4, 2018 at 10:21 AM, David Ahern <dsah...@gmail.com> wrote:
> On 4/4/18 1:36 AM, Siwei Liu wrote:
>> On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsah...@gmail.com> wrote:
>>> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>>
>>>>> There are other use cases that want to hide a device from userspace. I
>>>>
>>>> What usecases do you have in mind?
>>>
>>> As mentioned in a previous response some kernel drivers create control
>>> netdevs. Just as in this case users should not be mucking with it, and
>>> S/W like lldpd should ignore it.
>>>
>>>>
>>>>> would prefer a better solution than playing games with name prefixes and
>>>>> one that includes an API for users to list all devices -- even ones
>>>>> hidden by default.
>>>>
>>>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>>>> for userspace issues. Why can't the netdevice be visible always and
>>>> userspace would know what is it and what should it do with it?
>>>>
>>>> Once we start with hiding, there are other things related to that which
>>>> appear. Like who can see what, levels of visibility etc...
>>>>
>>>
>>> I would not advocate for any API that does not allow users to have full
>>> introspection. The intent is to hide the netdev by default but have an
>>> option to see it.
>>
>> I'm fine with having a link dump API to inspect the hidden netdev. As
>> said, the name for hidden netdevs should be in a separate device
>> namespace, and we did not even get closer to what it should look like
>> as I don't want to make it just an option for ip link. Perhaps a new
>> set of sub-commands of, say, 'ip device'.
>
> It is a netdev so there is no reason to have a separate ip command to
> inspect it. 'ip link' is the right place.

If you're still thinking of visibility as a link attribute rather
than a separate namespace, I'd say we are trying to solve essentially
different problems, and I really don't understand your proposal for
solving that problem, to be honest.

So let's step back to examine whether your case is the right thing to
solve, and talk about concrete examples.

-Siwei


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-04 Thread Siwei Liu
On Tue, Apr 3, 2018 at 6:04 PM, David Ahern  wrote:
> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>
>>> There are other use cases that want to hide a device from userspace. I
>>
>> What usecases do you have in mind?
>
> As mentioned in a previous response some kernel drivers create control
> netdevs. Just as in this case users should not be mucking with it, and
> S/W like lldpd should ignore it.

I'm still not sure I understand your case: why do you want to hide
the control netdev, given that those devices could choose either to
silently ignore requests or to fail loudly on user operations? Is it
already creating issues, or what problem do you want to solve other
than making the netdev invisible? Why couldn't lldpd check some
specific flag and ignore the control netdevice? (Can you please give
an example of a concrete driver for a control netdevice *in tree*?)

And I'm completely lost as to why you would want an API to make a
hidden netdev visible again for these control devices.

Thanks,
-Siwei


>
>>
>>> would prefer a better solution than playing games with name prefixes and
>>> one that includes an API for users to list all devices -- even ones
>>> hidden by default.
>>
>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>> for userspace issues. Why can't the netdevice be visible always and
>> userspace would know what is it and what should it do with it?
>>
>> Once we start with hiding, there are other things related to that which
>> appear. Like who can see what, levels of visibility etc...
>>
>
> I would not advocate for any API that does not allow users to have full
> introspection. The intent is to hide the netdev by default but have an
> option to see it.


Re: [virtio-dev] Re: [RFC PATCH 3/3] virtio_net: make lower netdevs for virtio_bypass hidden

2018-04-04 Thread Siwei Liu
On Tue, Apr 3, 2018 at 5:20 AM, Michael S. Tsirkin  wrote:
> On Sun, Apr 01, 2018 at 05:13:10AM -0400, Si-Wei Liu wrote:
>> diff --git a/include/uapi/linux/virtio_net.h 
>> b/include/uapi/linux/virtio_net.h
>> index aa40664..0827b7e 100644
>> --- a/include/uapi/linux/virtio_net.h
>> +++ b/include/uapi/linux/virtio_net.h
>> @@ -80,6 +80,8 @@ struct virtio_net_config {
>>   __u16 max_virtqueue_pairs;
>>   /* Default maximum transmit unit advice */
>>   __u16 mtu;
>> + /* Device at bus:slot.function backed up by virtio_net */
>> + __u16 bsf2backup;
>>  } __attribute__((packed));
>
> I'm not sure this is a good interface.  This isn't unique even on some
> PCI systems, not to speak of non-PCI ones.

Are you suggesting adding the PCI address domain as well, to make it
universally unique? And what non-PCI devices do you envision that the
main target, essentially live migration, can/should cover? Or do you
already have a better option in mind?

Thanks,
-Siwei
>
>>  /*
>> --
>> 1.8.3.1
>
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
>


Re: [virtio-dev] Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device

2018-04-04 Thread Siwei Liu
On Tue, Apr 3, 2018 at 5:25 AM, Michael S. Tsirkin  wrote:
> On Sun, Apr 01, 2018 at 05:13:08AM -0400, Si-Wei Liu wrote:
>> @@ -896,6 +898,68 @@ void qmp_device_del(const char *id, Error **errp)
>>  }
>>  }
>>
>> +int pci_get_busdevfn_by_id(const char *id, uint16_t *busnr,
>> +   uint16_t *devfn, Error **errp)
>> +{
>> +uint16_t busnum = 0, slot = 0, func = 0;
>> +const char *pc, *pd, *pe;
>> +Error *local_err = NULL;
>> +ObjectClass *class;
>> +char value[1024];
>> +BusState *bus;
>> +uint64_t u64;
>> +
>> +if (!(pc = strchr(id, ':'))) {
>> +error_setg(errp, "Invalid id: backup=%s, "
>> +   "correct format should be backup="
>> +   "'[bus]:<slot>[.<function>]'", id);
>> +return -1;
>> +}
>> +get_opt_name(value, sizeof(value), id, ':');
>> +if (pc != id + 1) {
>> +bus = qbus_find(value, errp);
>> +if (!bus)
>> +return -1;
>> +
>> +class = object_get_class(OBJECT(bus));
>> +if (class != object_class_by_name(TYPE_PCI_BUS) &&
>> +class != object_class_by_name(TYPE_PCIE_BUS)) {
>> +error_setg(errp, "%s is not a device on pci bus", id);
>> +return -1;
>> +}
>> +busnum = (uint16_t)pci_bus_num(PCI_BUS(bus));
>> +}
>
> pci_bus_num is almost always a bug if not done within
> a context of a PCI host, bridge, etc.
>
> In particular this will not DTRT if run before guest assigns bus
> numbers.
>
I was looking for a means to reserve a specific PCI bus slot from
drivers and update the driver when the guest assigns the bus number,
but it seems there is no low-hanging fruit. For that reason the bus
number is not obtained until it's really needed (during get_config),
and I assume at that point the PCI bus assignment is already done. I
know the current approach is not perfect, but we need that
information (the PCI bus:slot.func number) to name the guest device
correctly.

Regards,
-Siwei
>
>> +
>> +if (!devfn)
>> +goto out;
>> +
>> +pd = strchr(pc, '.');
>> +pe = get_opt_name(value, sizeof(value), pc + 1, '.');
>> +if (pe != pc + 1) {
>> +parse_option_number("slot", value, &u64, &local_err);
>> +if (local_err) {
>> +error_propagate(errp, local_err);
>> +return -1;
>> +}
>> +slot = (uint16_t)u64;
>> +}
>> +if (pd && *(pd + 1) != '\0') {
>> +parse_option_number("function", pd, &u64, &local_err);
>> +if (local_err) {
>> +error_propagate(errp, local_err);
>> +return -1;
>> +}
>> +func = (uint16_t)u64;
>> +}
>> +
>> +out:
>> +if (busnr)
>> +*busnr = busnum;
>> +if (devfn)
>> +*devfn = ((slot & 0x1F) << 3) | (func & 0x7);
>> +return 0;
>> +}
>> +
>>  BlockBackend *blk_by_qdev_id(const char *id, Error **errp)
>>  {
>>  DeviceState *dev;
>> --
>> 1.8.3.1
>


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-04 Thread Siwei Liu
On Tue, Apr 3, 2018 at 11:19 PM, Jiri Pirko  wrote:
> Wed, Apr 04, 2018 at 03:04:26AM CEST, dsah...@gmail.com wrote:
>>On 4/3/18 9:42 AM, Jiri Pirko wrote:

 There are other use cases that want to hide a device from userspace. I
>>>
>>> What usecases do you have in mind?
>>
>>As mentioned in a previous response some kernel drivers create control
>>netdevs. Just as in this case users should not be mucking with it, and
>
> virtio_net. Any other drivers?

netvsc, if virtio_bypass is factored out into a common driver.

>
>
>>S/W like lldpd should ignore it.
>
> It's just a matter of identification of the netdevs, so the user knows
> what to do.
>
>
>>
>>>
 would prefer a better solution than playing games with name prefixes and
 one that includes an API for users to list all devices -- even ones
 hidden by default.
>>>
>>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>>> for userspace issues. Why can't the netdevice be visible always and
>>> userspace would know what is it and what should it do with it?
>>>
>>> Once we start with hiding, there are other things related to that which
>>> appear. Like who can see what, levels of visibility etc...
>>>
>>
>>I would not advocate for any API that does not allow users to have full
>>introspection. The intent is to hide the netdev by default but have an
>>option to see it.
>
> As an administrator, I want to see all by default. I think it is
> reasonable requirements. Again, this awfully smells like a workaround...

If the requirement is just dumping the link info, i.e. performing
read-only operations on the hidden netdev, that's completely fine.
However, I am not a big fan of creating a weird mechanism that allows
the user to deliberately manipulate the visibility (hide/unhide) of a
netdev in any case at any time. That is liable to become a slippery
slope for working around any software issue that should get fixed in
the right place.

Let's treat IFF_HIDDEN as a means to hide auto-managed netdevices. If
it's just that the name is misleading, I can get it renamed to
something like IFF_AUTO_MANAGED, which might reflect its nature more
properly.

Thanks,
-Siwei


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-04 Thread Siwei Liu
On Tue, Apr 3, 2018 at 6:04 PM, David Ahern  wrote:
> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>
>>> There are other use cases that want to hide a device from userspace. I
>>
>> What usecases do you have in mind?
>
> As mentioned in a previous response some kernel drivers create control
> netdevs. Just as in this case users should not be mucking with it, and
> S/W like lldpd should ignore it.
>
>>
>>> would prefer a better solution than playing games with name prefixes and
>>> one that includes an API for users to list all devices -- even ones
>>> hidden by default.
>>
>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>> for userspace issues. Why can't the netdevice be visible always and
>> userspace would know what is it and what should it do with it?
>>
>> Once we start with hiding, there are other things related to that which
>> appear. Like who can see what, levels of visibility etc...
>>
>
> I would not advocate for any API that does not allow users to have full
> introspection. The intent is to hide the netdev by default but have an
> option to see it.

I'm fine with having a link dump API to inspect hidden netdevs. As
said, the names of hidden netdevs should live in a separate device
namespace, and we haven't even gotten close to what that should look
like, as I don't want to make it just an option for ip link. Perhaps
a new set of sub-commands of, say, 'ip device'.

-Siwei


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-03 Thread Siwei Liu
On Tue, Apr 3, 2018 at 8:42 AM, Jiri Pirko  wrote:
> Sun, Apr 01, 2018 at 06:11:29PM CEST, dsah...@gmail.com wrote:
>>On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>>> Hidden netdevice is not visible to userspace such that
>>> typical network utilites e.g. ip, ifconfig and et al,
>>> cannot sense its existence or configure it. Internally
>>> hidden netdev may associate with an upper level netdev
>>> that userspace has access to. Although userspace cannot
>>> manipulate the lower netdev directly, user may control
>>> or configure the underlying hidden device through the
>>> upper-level netdev. For identification purpose, the
>>> kobject for hidden netdev still presents in the sysfs
>>> hierarchy, however, no uevent message will be generated
>>> when the sysfs entry is created, modified or destroyed.
>>>
>>> For that end, a separate namescope needs to be carved
>>> out for IFF_HIDDEN netdevs. As of now netdev name that
>>> starts with colon i.e. ':' is invalid in userspace,
>>> since socket ioctls such as SIOCGIFCONF use ':' as the
>>> separator for ifname. The absence of namescope started
>>> with ':' can rightly be used as the namescope for
>>> the kernel-only IFF_HIDDEN netdevs.
>>>
>>> Signed-off-by: Si-Wei Liu 
>>> ---
>>>  include/linux/netdevice.h   |  12 ++
>>>  include/net/net_namespace.h |   2 +
>>>  net/core/dev.c  | 281 
>>> ++--
>>>  net/core/net_namespace.c|   1 +
>>>  4 files changed, 263 insertions(+), 33 deletions(-)
>>>
>>
>>There are other use cases that want to hide a device from userspace. I
>
> What usecases do you have in mind?

Hope you're not staring at me and shouting. :)

I think we have discussed this a lot, and if the common goal is to
merge the two drivers rather than diverge, there's no better way than
to hide the lower devices from all existing userspace management
utilities (NetworkManager, cloud-init). This does not mean a loss of
visibility, as we can add a new API or CLI later on to get the
missing pieces exposed as needed, in a way where existing userspace
apps don't break while new apps aware of the feature know where to
get it. This requirement is critical to cloud providers, and I can't
repeat often enough how crazy it drove me not to see it resolved.

Thanks,
-Siwei

>
>>would prefer a better solution than playing games with name prefixes and
>>one that includes an API for users to list all devices -- even ones
>>hidden by default.
>
> Netdevice hiding feels a bit scarry for me. This smells like a workaround
> for userspace issues. Why can't the netdevice be visible always and
> userspace would know what is it and what should it do with it?
>
> Once we start with hiding, there are other things related to that which
> appear. Like who can see what, levels of visibility etc...
>
>
>>
>>https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>>
>>https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>>
>>Also, why are you suggesting that the device should still be visible via
>>/sysfs? That leads to inconsistent views of networking state - /sys
>>shows a device but a link dump does not.


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-03 Thread Siwei Liu
On Sun, Apr 1, 2018 at 9:11 AM, David Ahern  wrote:
> On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>> Hidden netdevice is not visible to userspace such that
>> typical network utilities e.g. ip, ifconfig and et al,
>> cannot sense its existence or configure it. Internally
>> hidden netdev may associate with an upper level netdev
>> that userspace has access to. Although userspace cannot
>> manipulate the lower netdev directly, user may control
>> or configure the underlying hidden device through the
>> upper-level netdev. For identification purpose, the
>> kobject for hidden netdev still presents in the sysfs
>> hierarchy, however, no uevent message will be generated
>> when the sysfs entry is created, modified or destroyed.
>>
>> For that end, a separate namescope needs to be carved
>> out for IFF_HIDDEN netdevs. As of now netdev name that
>> starts with colon i.e. ':' is invalid in userspace,
>> since socket ioctls such as SIOCGIFCONF use ':' as the
>> separator for ifname. The absence of namescope started
>> with ':' can rightly be used as the namescope for
>> the kernel-only IFF_HIDDEN netdevs.
>>
>> Signed-off-by: Si-Wei Liu 
>> ---
>>  include/linux/netdevice.h   |  12 ++
>>  include/net/net_namespace.h |   2 +
>>  net/core/dev.c  | 281 
>> ++--
>>  net/core/net_namespace.c|   1 +
>>  4 files changed, 263 insertions(+), 33 deletions(-)
>>
>
> There are other use cases that want to hide a device from userspace.

Can you elaborate on your case in more detail? Looking at the links
below, I realize that the purpose of hiding devices in your case is
quite different from our migration case. In particular, I don't like
the part that deliberately allows the user to manipulate the link's
visibility; things fall apart easily while a live migration is
ongoing. And why do an additional check for invisible links in every
for_each_netdev() and its friends? That effectively changes the
semantics of internal APIs that have existed for decades.

> I would prefer a better solution than playing games with name prefixes and

This part is intentionally left that way, and I would anticipate
feedback before going too far. The idea in my mind was that I need a
completely new device namespace underneath (or namescope, which is !=
netns) for all netdevs: both the kernel-only IFF_HIDDEN network
devices and those that are not. The current namespace for devname is
already exposed to userspace and visible in the sysfs hierarchy, but
for backwards compatibility reasons it's necessary to keep the old
udevd still able to reference the entry of an IFF_HIDDEN netdev under
the /sys/net directory. Using the ':' prefix has the benefit of
minimal changes, without introducing another namespace or the
accompanying complexity of managing two separate namespaces.

Technically, I can create a separate sysfs directory for the new
namescope, say /sys/net-kernel, which includes all netdev entries like
':eth0' and 'ens3', and which can be referenced from /sys/net. It
would make the /sys/net consistent with the view of userspace
utilities, but I am not even sure if that's an overkill, and would
like to gather more feedback before moving to that model.

> one that includes an API for users to list all devices -- even ones

What kind of API you would like to query for hidden devices?
rtnetlink? a private socket API? or something else?

For our case, the sysfs interface is what we need and is sufficient,
since udev is the main target we'd like to support to make the naming
of virtio_bypass consistent and compatible.

> hidden by default.
>
> https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>
> https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>
> Also, why are you suggesting that the device should still be visible via
> /sysfs? That leads to inconsistent views of networking state - /sys
> shows a device but a link dump does not.

See my clarifications above. I don't mind kernel-only netdevs being
visible via sysfs, as that way we get a good trade-off between
backwards compatibility and visibility. There's still a kobject
created there, right? The bottom line is that all kernel-only devices
and their life-cycle uevents are made invisible to userspace network
utilities, and I think that simply achieves the goal of not breaking
existing apps while being able to add new features.

-Siwei


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Siwei Liu
On Fri, Feb 23, 2018 at 2:38 PM, Jiri Pirko  wrote:
> Fri, Feb 23, 2018 at 11:22:36PM CET, losewe...@gmail.com wrote:
>
> [...]
>

 No, that's not what I was talking about of course. I thought you
 mentioned the upgrade scenario this patch would like to address is to
 use the bypass interface "to take the place of the original virtio,
 and get udev to rename the bypass to what the original virtio_net
 was". That is one of the possible upgrade paths for sure. However the
 upgrade path I was seeking is to use the bypass interface to take the
 place of original VF interface while retaining the name and network
 configs, which generally can be done simply with kernel upgrade. It
 would become limiting as this patch makes the bypass interface share
 the same virtio pci device with virito backup. Can this bypass
 interface be made general to take place of any pci device other than
 virtio-net? This will be more helpful as the cloud users who has
 existing setup on VF interface don't have to recreate it on virtio-net
 and VF separately again.
>
> How that could work? If you have the VF netdev with all configuration
> including IPs and routes and whatever - now you want to do migration
> so you add virtio_net and do some weird in-driver bonding with it. But
> then, VF disappears and the VF netdev with that and also all
> configuration it had.
> I don't think this scenario is valid.

We are talking about making udev aware of the new virtio-bypass so
that it rebinds the name of the old VF interface to the virtio-bypass
*after the kernel upgrade*. Of course, this needs the virtio-net
backend to supply the [bdf] info of where the VF/PT device was located.

-Siwei


>
>
>>>
>>>
>>> Yes. This sounds interesting. Looks like you want an existing VM image with
>>> VF only configuration to get transparent live migration support by adding
>>> virtio_net with BACKUP feature.  We may need another feature bit to switch
>>> between these 2 options.
>>
>>Yes, that's what I was thinking about. I have been building something
>>like this before, and would like to get back after merging with your
>>patch.


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Siwei Liu
On Wed, Feb 21, 2018 at 6:35 PM, Samudrala, Sridhar
<sridhar.samudr...@intel.com> wrote:
> On 2/21/2018 5:59 PM, Siwei Liu wrote:
>>
>> On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
>> <alexander.du...@gmail.com> wrote:
>>>
>>> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu <losewe...@gmail.com> wrote:
>>>>
>>>> I haven't checked emails for days and did not realize the new revision
>>>> had already came out. And thank you for the effort, this revision
>>>> really looks to be a step forward towards our use case and is close to
>>>> what we wanted to do. A few questions in line.
>>>>
>>>> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>>>> <alexander.du...@gmail.com> wrote:
>>>>>
>>>>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubak...@wp.pl> wrote:
>>>>>>
>>>>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>>>>>
>>>>>>> Ppatch 2 is in response to the community request for a 3 netdev
>>>>>>> solution.  However, it creates some issues we'll get into in a
>>>>>>> moment.
>>>>>>> It extends virtio_net to use alternate datapath when available and
>>>>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>>>> an additional 'bypass' netdev that acts as a master device and
>>>>>>> controls
>>>>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>>>> associated with the same 'pci' device.  The user accesses the network
>>>>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active'
>>>>>>> netdev
>>>>>>> as default for transmits when it is available with link up and
>>>>>>> running.
>>>>>>
>>>>>> Thank you do doing this.
>>>>>>
>>>>>>> We noticed a couple of issues with this approach during testing.
>>>>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>>>>virtio pci device, udev tries to rename both of them with the same
>>>>>>> name
>>>>>>>and the 2nd rename will fail. This would be OK as long as the
>>>>>>> first netdev
>>>>>>>to be renamed is the 'bypass' netdev, but the order in which udev
>>>>>>> gets
>>>>>>>to rename the 2 netdevs is not reliable.
>>>>>>
>>>>>> Out of curiosity - why do you link the master netdev to the virtio
>>>>>> struct device?
>>>>>
>>>>> The basic idea of all this is that we wanted this to work with an
>>>>> existing VM image that was using virtio. As such we were trying to
>>>>> make it so that the bypass interface takes the place of the original
>>>>> virtio and get udev to rename the bypass to what the original
>>>>> virtio_net was.
>>>>
>>>> Could it made it also possible to take over the config from VF instead
>>>> of virtio on an existing VM image? And get udev rename the bypass
>>>> netdev to what the original VF was. I don't say tightly binding the
>>>> bypass master to only virtio or VF, but I think we should provide both
>>>> options to support different upgrade paths. Possibly we could tweak
>>>> the device tree layout to reuse the same PCI slot for the master
>>>> bypass netdev, such that udev would not get confused when renaming the
>>>> device. The VF needs to use a different function slot afterwards.
>>>> Perhaps we might need to a special multiseat like QEMU device for that
>>>> purpose?
>>>>
>>>> Our case we'll upgrade the config from VF to virtio-bypass directly.
>>>
>>> So if I am understanding what you are saying you are wanting to flip
>>> the backup interface from the virtio to a VF. The problem is that
>>> becomes a bit of a vendor lock-in solution since it would rely on a
>>> specific VF driver. I would agree with Jiri that we don't want to go
>>> down that path. We don't want every VF out there firing up its own
>>> separate bond. Ideally you

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Siwei Liu
On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
<alexander.du...@gmail.com> wrote:
> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu <losewe...@gmail.com> wrote:
>> I haven't checked emails for days and did not realize the new revision
>> had already came out. And thank you for the effort, this revision
>> really looks to be a step forward towards our use case and is close to
>> what we wanted to do. A few questions in line.
>>
>> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>> <alexander.du...@gmail.com> wrote:
>>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubak...@wp.pl> wrote:
>>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>>> Ppatch 2 is in response to the community request for a 3 netdev
>>>>> solution.  However, it creates some issues we'll get into in a moment.
>>>>> It extends virtio_net to use alternate datapath when available and
>>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>> an additional 'bypass' netdev that acts as a master device and controls
>>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>> associated with the same 'pci' device.  The user accesses the network
>>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>>> as default for transmits when it is available with link up and running.
>>>>
>>>> Thank you do doing this.
>>>>
>>>>> We noticed a couple of issues with this approach during testing.
>>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>>   virtio pci device, udev tries to rename both of them with the same name
>>>>>   and the 2nd rename will fail. This would be OK as long as the first 
>>>>> netdev
>>>>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>>>   to rename the 2 netdevs is not reliable.
>>>>
>>>> Out of curiosity - why do you link the master netdev to the virtio
>>>> struct device?
>>>
>>> The basic idea of all this is that we wanted this to work with an
>>> existing VM image that was using virtio. As such we were trying to
>>> make it so that the bypass interface takes the place of the original
>>> virtio and get udev to rename the bypass to what the original
>>> virtio_net was.
>>
>> Could it made it also possible to take over the config from VF instead
>> of virtio on an existing VM image? And get udev rename the bypass
>> netdev to what the original VF was. I don't say tightly binding the
>> bypass master to only virtio or VF, but I think we should provide both
>> options to support different upgrade paths. Possibly we could tweak
>> the device tree layout to reuse the same PCI slot for the master
>> bypass netdev, such that udev would not get confused when renaming the
>> device. The VF needs to use a different function slot afterwards.
>> Perhaps we might need to a special multiseat like QEMU device for that
>> purpose?
>>
>> Our case we'll upgrade the config from VF to virtio-bypass directly.
>
> So if I am understanding what you are saying you are wanting to flip
> the backup interface from the virtio to a VF. The problem is that
> becomes a bit of a vendor lock-in solution since it would rely on a
> specific VF driver. I would agree with Jiri that we don't want to go
> down that path. We don't want every VF out there firing up its own
> separate bond. Ideally you want the hypervisor to be able to manage
> all of this which is why it makes sense to have virtio manage this and
> why this is associated with the virtio_net interface.

No, that's not what I was talking about of course. I thought you
mentioned the upgrade scenario this patch would like to address is to
use the bypass interface "to take the place of the original virtio,
and get udev to rename the bypass to what the original virtio_net
was". That is one of the possible upgrade paths for sure. However the
upgrade path I was seeking is to use the bypass interface to take the
place of original VF interface while retaining the name and network
configs, which generally can be done simply with kernel upgrade. It
would become limiting as this patch makes the bypass interface share
the same virtio pci device with virtio backup. Can this bypass
interface be made general to take the place of any pci device other

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Siwei Liu
I haven't checked emails for days and did not realize the new revision
had already come out. And thank you for the effort; this revision
really looks to be a step forward towards our use case and is close to
what we wanted to do. A few questions inline.

On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
 wrote:
> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  wrote:
>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>> Ppatch 2 is in response to the community request for a 3 netdev
>>> solution.  However, it creates some issues we'll get into in a moment.
>>> It extends virtio_net to use alternate datapath when available and
>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>> an additional 'bypass' netdev that acts as a master device and controls
>>> 2 slave devices.  The original virtio_net netdev is registered as
>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>> associated with the same 'pci' device.  The user accesses the network
>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>> as default for transmits when it is available with link up and running.
>>
>> Thank you do doing this.
>>
>>> We noticed a couple of issues with this approach during testing.
>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>   virtio pci device, udev tries to rename both of them with the same name
>>>   and the 2nd rename will fail. This would be OK as long as the first netdev
>>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>   to rename the 2 netdevs is not reliable.
>>
>> Out of curiosity - why do you link the master netdev to the virtio
>> struct device?
>
> The basic idea of all this is that we wanted this to work with an
> existing VM image that was using virtio. As such we were trying to
> make it so that the bypass interface takes the place of the original
> virtio and get udev to rename the bypass to what the original
> virtio_net was.

Could it also be made possible to take over the config from the VF
instead of virtio on an existing VM image, and get udev to rename the
bypass netdev to what the original VF was? I'm not saying to tightly
bind the bypass master to only virtio or the VF, but I think we should
provide both options to support different upgrade paths. Possibly we
could tweak the device tree layout to reuse the same PCI slot for the
master bypass netdev, such that udev would not get confused when
renaming the device. The VF needs to use a different function slot
afterwards. Perhaps we might need a special multiseat-like QEMU device
for that purpose?

In our case we'll upgrade the config from VF to virtio-bypass directly.

>
>> FWIW two solutions that immediately come to mind is to export "backup"
>> as phys_port_name of the backup virtio link and/or assign a name to the
>> master like you are doing already.  I think team uses team%d and bond
>> uses bond%d, soft naming of master devices seems quite natural in this
>> case.
>
> I figured I had overlooked something like that.. Thanks for pointing
> this out. Okay so I think the phys_port_name approach might resolve
> the original issue. If I am reading things correctly what we end up
> with is the master showing up as "ens1" for example and the backup
> showing up as "ens1nbackup". Am I understanding that right?
>
> The problem with the team/bond%d approach is that it creates a new
> netdevice and so it would require guest configuration changes.
>
>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>> link is quite neat.
>
> I agree. For non-"backup" virio_net devices would it be okay for us to
> just return -EOPNOTSUPP? I assume it would be and that way the legacy
> behavior could be maintained although the function still exists.
>
>>> - When the 'active' netdev is unplugged OR not present on a destination
>>>   system after live migration, the user will see 2 virtio_net netdevs.
>>
>> That's necessary and expected, all configuration applies to the master
>> so master must exist.
>
> With the naming issue resolved this is the only item left outstanding.
> This becomes a matter of form vs function.
>
> The main complaint about the "3 netdev" solution is a bit confusing to
> have the 2 netdevs present if the VF isn't there. The idea is that
> having the extra "master" netdev there if there isn't really a bond is
> a bit ugly.

Is it really uglier in terms of user experience rather than
functionality? I don't want it dynamically changing between 2-netdev
and 3-netdev depending on the presence of the VF. That gets back to my
original question and suggestion earlier: why not just hide the lower
netdevs from udev renaming and the like? What important observability
benefits would users get from exposing the lower netdevs?
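
As an aside, the phys_port_name naming quoted above can be sketched in a
few lines. This is only an illustrative model of how a predictable-name
generator might derive the slave name; the "n" + phys_port_name suffix
convention is an assumption here, not a quote from udev's actual code:

```python
from typing import Optional

def slave_name(master: str, phys_port_name: Optional[str]) -> str:
    """Derive a slave netdev name from the master name plus the
    kernel-exported phys_port_name, udev-style ("n" + port name).
    Purely illustrative; the real policy lives in systemd-udevd."""
    if phys_port_name is None:
        return master  # no port name exported, nothing to append
    return f"{master}n{phys_port_name}"

# With phys_port_name == "backup" on the standby virtio link:
print(slave_name("ens1", "backup"))  # ens1nbackup
```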

Thanks,
-Siwei

>
> The downside of the "2 netdev" solution is that you have to deal 

Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-26 Thread Siwei Liu
On Fri, Jan 26, 2018 at 8:51 AM, Samudrala, Sridhar
<sridhar.samudr...@intel.com> wrote:
>
>
> On 1/26/2018 12:14 AM, Siwei Liu wrote:
>>
>> On Tue, Jan 23, 2018 at 2:58 PM, Michael S. Tsirkin <m...@redhat.com>
>> wrote:
>>>
>>> On Tue, Jan 23, 2018 at 12:24:47PM -0800, Siwei Liu wrote:
>>>>
>>>> On Mon, Jan 22, 2018 at 1:41 PM, Michael S. Tsirkin <m...@redhat.com>
>>>> wrote:
>>>>>
>>>>> On Mon, Jan 22, 2018 at 12:27:14PM -0800, Siwei Liu wrote:
>>>>>>
>>>>>> First off, as mentioned in another thread, the model of stacking up
>>>>>> virt-bond functionality over virtio seems a wrong direction to me.
>>>>>> Essentially the migration process would need to carry over all guest
>>>>>> side configurations previously done on the VF/PT and get them moved to
>>>>>> the new device being it virtio or VF/PT.
>>>>>
>>>>> I might be wrong but I don't see why we should worry about this
>>>>> usecase.
>>>>> Whoever has a bond configured already has working config for migration.
>>>>> We are trying to help people who don't, not convert existig users.
>>>>
>>>> That has been placed in the view of cloud providers that the imported
>>>> images from the store must be able to run unmodified thus no
>>>> additional setup script is allowed (just as Stephen mentioned in
>>>> another mail). Cloud users don't care about live migration themselves
>>>> but the providers are required to implement such automation mechanism
>>>> to make this process transparent if at all possible. The user does not
>>>> care about the device underneath being VF or not, but they do care
>>>> about consistency all across and the resulting performance
>>>> acceleration in making VF the prefered datapath. It is not quite
>>>> peculiar user cases but IMHO *any* approach proposed for live
>>>> migration should be able to persist the state including network config
>>>> e.g. as simple as MTU. Actually this requirement has nothing to do
>>>> with virtio but our target users are live migration agnostic, being it
>>>> tracking DMA through dirty pages, using virtio as the helper, or
>>>> whatsoever, the goal of persisting configs across remains same.
>>>
>>> So the patching being discussed here will mostly do exactly that if your
>>> original config was simply a single virtio net device.
>>>
>> True, but I don't see the patch being discussed starts with good
>> foundation of supporting the same for VF/PT device. That is the core
>> of the issue.
>
>
>>> What kind of configs do your users have right now?
>>
>> Any configs be it generic or driver specific that the VF/PT device
>> supports and have been enabled/configured. General network configs
>> (MAC, IP address, VLAN, MTU, iptables rules), ethtool settings
>> (hardware offload, # of queues and ring entris, RSC options, rss
>> rxfh-indir table, rx-flow-hash, et al) , bpf/XDP program being run, tc
>> flower offload, just to name a few. As cloud providers we don't limit
>> users from applying driver specific tuning to the NIC/VF, and
>> sometimes this is essential to achieving best performance for their
>> workload. We've seen cases like tuning coalescing parameters for
>> getting low latency, changing rx-flow-hash function for better VXLAN
>> throughput, or even adopting quite advanced NIC features such as flow
>> director or cloud filter. We don't expect users to compromise even a
>> little bit on these. That is once we turn on live migration for the VF
>> or pass through devices in the VM, it all takes place under the hood,
>> users (guest admins, applications) don't have to react upon it or even
>> notice the change. I should note that the majority of live migrations
>> take place between machines with completely identical hardware, it's
>> more critical than necessary to keep the config as-is across the move,
>> stealth while quiet.
>
>
> This usecase is much more complicated and different than what this patch is
> trying
> to address.

Yep, it is technically difficult, but as cloud providers we would like
to take action to address our own use case if no one else is willing
to do so. However, we're not seeking a complicated design or messing
up other use cases such as yours. As this is the first time a real
patch of the PV failover approach, although having been discussed for
years, has been posted to the mailing lis

Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-26 Thread Siwei Liu
On Tue, Jan 23, 2018 at 2:58 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Jan 23, 2018 at 12:24:47PM -0800, Siwei Liu wrote:
>> On Mon, Jan 22, 2018 at 1:41 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Mon, Jan 22, 2018 at 12:27:14PM -0800, Siwei Liu wrote:
>> >> First off, as mentioned in another thread, the model of stacking up
>> >> virt-bond functionality over virtio seems a wrong direction to me.
>> >> Essentially the migration process would need to carry over all guest
>> >> side configurations previously done on the VF/PT and get them moved to
>> >> the new device being it virtio or VF/PT.
>> >
>> > I might be wrong but I don't see why we should worry about this usecase.
>> > Whoever has a bond configured already has working config for migration.
>> > We are trying to help people who don't, not convert existig users.
>>
>> That has been placed in the view of cloud providers that the imported
>> images from the store must be able to run unmodified thus no
>> additional setup script is allowed (just as Stephen mentioned in
>> another mail). Cloud users don't care about live migration themselves
>> but the providers are required to implement such automation mechanism
>> to make this process transparent if at all possible. The user does not
>> care about the device underneath being VF or not, but they do care
>> about consistency all across and the resulting performance
>> acceleration in making VF the prefered datapath. It is not quite
>> peculiar user cases but IMHO *any* approach proposed for live
>> migration should be able to persist the state including network config
>> e.g. as simple as MTU. Actually this requirement has nothing to do
>> with virtio but our target users are live migration agnostic, being it
>> tracking DMA through dirty pages, using virtio as the helper, or
>> whatsoever, the goal of persisting configs across remains same.
>
> So the patching being discussed here will mostly do exactly that if your
> original config was simply a single virtio net device.
>

True, but I don't see that the patch being discussed starts with a good
foundation for supporting the same for VF/PT devices. That is the core
of the issue.

>
> What kind of configs do your users have right now?

Any configs, be it generic or driver specific, that the VF/PT device
supports and that have been enabled/configured. General network configs
(MAC, IP address, VLAN, MTU, iptables rules), ethtool settings
(hardware offloads, # of queues and ring entries, RSC options, rss
rxfh-indir table, rx-flow-hash, et al.), bpf/XDP programs being run, tc
flower offloads, just to name a few. As cloud providers we don't limit
users from applying driver-specific tuning to the NIC/VF, and
sometimes this is essential to achieving the best performance for their
workload. We've seen cases like tuning coalescing parameters for
getting low latency, changing the rx-flow-hash function for better VXLAN
throughput, or even adopting quite advanced NIC features such as flow
director or cloud filter. We don't expect users to compromise even a
little bit on these. That is, once we turn on live migration for the VF
or pass-through devices in the VM, it should all take place under the
hood; users (guest admins, applications) don't have to react to it or
even notice the change. I should note that the majority of live
migrations take place between machines with completely identical
hardware, so it's all the more critical to keep the config as-is across
the move, stealthily and quietly.
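
To make the requirement concrete, here is a minimal sketch of what
"persisting the config across the datapath switch" means: snapshot the
guest-visible state of the active VF leg and replay it onto the standby
leg. All names here are hypothetical, and this covers only a tiny subset
of the state listed above:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class NetdevState:
    """Tiny hypothetical subset of the guest-side state listed above;
    a real virt-bond would have to track far more (XDP, tc offloads,
    coalescing parameters, ...)."""
    mac: str = "00:00:00:00:00:00"
    mtu: int = 1500
    vlans: tuple = ()
    ethtool: dict = field(default_factory=dict)

def replay_config(active: NetdevState) -> NetdevState:
    # On failover, clone the active (VF) config onto the standby
    # (virtio) leg so the composite device looks unchanged to the guest.
    return NetdevState(**asdict(active))

vf = NetdevState(mac="52:54:00:12:34:56", mtu=9000, vlans=(10, 20),
                 ethtool={"rx-flow-hash": "udp4 sdfn"})
virtio = replay_config(vf)
assert asdict(virtio) == asdict(vf)
```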

As you can see, a generic bond or bridge cannot satisfy the need. That's
why we need a new customized virt-bond driver, tailored for VM live
migration specifically. Leveraging a para-virtual device, e.g. a virtio
net device, as the backup path is one thing; tracking driver config
changes in order to re-configure as necessary is another. I would think
that without considering the latter, the proposal being discussed is
rather incomplete, and remote from being useful in production.

>
>
>> >
>> >> Without the help of a new
>> >> upper layer bond driver that enslaves virtio and VF/PT devices
>> >> underneath, virtio will be overloaded with too much specifics being a
>> >> VF/PT backup in the future.
>> >
>> > So this paragraph already includes at least two conflicting
>> > proposals. On the one hand you want a separate device for
>> > the virtual bond, on the other you are saying a separate
>> > driver.
>>
>> Just to be crystal clear: separate virtual bond device (netdev ops,
>> not necessarily bus device) for VM migration specifically with a
>> separate driver.
>
> Okay, but note that any config someone had on a virtio 

Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-23 Thread Siwei Liu
On Mon, Jan 22, 2018 at 1:41 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Mon, Jan 22, 2018 at 12:27:14PM -0800, Siwei Liu wrote:
>> First off, as mentioned in another thread, the model of stacking up
>> virt-bond functionality over virtio seems a wrong direction to me.
>> Essentially the migration process would need to carry over all guest
>> side configurations previously done on the VF/PT and get them moved to
>> the new device being it virtio or VF/PT.
>
> I might be wrong but I don't see why we should worry about this usecase.
> Whoever has a bond configured already has working config for migration.
> We are trying to help people who don't, not convert existig users.

That comes from the view of cloud providers: imported images from the
store must be able to run unmodified, thus no additional setup script
is allowed (just as Stephen mentioned in another mail). Cloud users
don't care about live migration themselves, but the providers are
required to implement such an automation mechanism to make this process
transparent wherever possible. The user does not care about the device
underneath being a VF or not, but they do care about consistency all
across and the resulting performance acceleration from making the VF
the preferred datapath. These are not particularly peculiar use cases;
IMHO *any* approach proposed for live migration should be able to
persist state including network config, e.g. something as simple as the
MTU. Actually this requirement has nothing to do with virtio: our
target users are live-migration agnostic, be it tracking DMA through
dirty pages, using virtio as the helper, or whatsoever; the goal of
persisting configs across the move remains the same.

>
>> Without the help of a new
>> upper layer bond driver that enslaves virtio and VF/PT devices
>> underneath, virtio will be overloaded with too much specifics being a
>> VF/PT backup in the future.
>
> So this paragraph already includes at least two conflicting
> proposals. On the one hand you want a separate device for
> the virtual bond, on the other you are saying a separate
> driver.

Just to be crystal clear: separate virtual bond device (netdev ops,
not necessarily bus device) for VM migration specifically with a
separate driver.

>
> Further, the reason to have a separate *driver* was that
> some people wanted to share code with netvsc - and that
> one does not create a separate device, which you can't
> change without breaking existing configs.

I'm not sure I understand this statement. netvsc already creates
another netdev besides the enslaved VF netdev, so why does that matter?
In the Azure case, the stock image to be imported does not bind to a
specific driver but only to a MAC address. And people just deal with the
new virt-bond netdev rather than the underlying virtio and VF. And
both of these underlying netdevs should be made invisible, to prevent
userspace scripts from getting them misconfigured IMHO.

A separate driver was for code sharing, for sure: not just with netvsc
but with other para-virtual devices floating around, since any PV
device can serve as the side channel and the backup path for a VF/PT.
Once we get the new driver working atop virtio we may define the ops
and/or protocol needed to talk to various other PV frontends that may
implement a side channel of their own for datapath switching (e.g.
virtio is one of them; a Xen PV frontend can be another). I just don't
want to limit the function to virtio only, where we would have to
duplicate code that then starts to scatter all over the place.

I understand that right now we want to start simple, so it may be fine
that the initial development activities center around virtio. However,
from a cloud provider/vendor perspective I don't see the proposed
scheme as limited to virtio only. Any other PV driver which plans to
support the same scheme can benefit. The point is that we shouldn't tie
the scheme to virtio specifics so early that it becomes hard to promote
it to a common driver once we get there.
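
The "define ops and/or protocol" idea could look roughly like the
following. Every name here is hypothetical, just to show how a common
virt-bond core could stay PV-agnostic:

```python
from abc import ABC, abstractmethod

class FailoverSideChannel(ABC):
    """Hypothetical interface a common virt-bond core might require
    from any PV backend (virtio-net, netvsc, Xen netfront) that offers
    a side channel for datapath switching. Names are illustrative."""

    @abstractmethod
    def backup_flag(self) -> bool:
        """Whether the PV device advertises itself as a standby/backup."""

    @abstractmethod
    def datapath_switched(self, to_vf: bool) -> None:
        """Notify the backend that traffic moved to/from the VF leg."""

class VirtioChannel(FailoverSideChannel):
    """Toy virtio flavor; the feature-bit name mirrors the standby
    feature discussed in this thread but the check is made up."""
    def __init__(self, features: set):
        self.features = features

    def backup_flag(self) -> bool:
        return "VIRTIO_NET_F_STANDBY" in self.features

    def datapath_switched(self, to_vf: bool) -> None:
        pass  # a real backend would e.g. reprogram RX filters here

ch = VirtioChannel({"VIRTIO_NET_F_STANDBY"})
print(ch.backup_flag())  # True
```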

>
> So some people want a fully userspace-configurable switchdev, and that
> already exists at some level, and maybe it makes sense to add more
> features for performance.
>
> But the point was that some host configurations are very simple,
> and it probably makes sense to pass this information to the guest
> and have guest act on it directly. Let's not conflate the two.

It may be fine to push some of the configuration from the host, but
that perhaps doesn't cover all the cases: how would it be possible for
the host to save all network state and configs done by the guest before
migration? Some of the configs might come from a future guest which is
unknown to the host. Anyhow, the bottom line is that the guest must be
able to act on those configuration changes automatically, without
involving user intervention.

Regards,
-Siwei

>
> --
> MST
___

Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-22 Thread Siwei Liu
First off, as mentioned in another thread, the model of stacking
virt-bond functionality on top of virtio seems like the wrong direction
to me. Essentially the migration process would need to carry over all
guest-side configurations previously done on the VF/PT and get them
moved to the new device, be it virtio or VF/PT. Without the help of a
new upper-layer bond driver that enslaves the virtio and VF/PT devices
underneath, virtio will be overloaded with too many specifics of being
a VF/PT backup in the future. I hope you're already aware of this issue
in the longer term and move to that model as soon as possible. See more
inline.

On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
 wrote:
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. The VF datapath is only used
> for unicast traffic. Broadcasts/multicasts go via virtio datapath so that
> east-west broadcasts don't use the PCI bandwidth.

Why not make this an option/optimization rather than the only means?
The problem of east-west broadcast eating PCI bandwidth depends on the
specifics of the (virtual) network setup, while some users won't want
to lose the VF's merits such as latency. Why restrict
broadcast/multicast xmit to virtio only, which potentially regresses
performance against the raw VF?
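
For reference, the split being questioned here amounts to a very small
policy. A sketch of it (illustrative only, not the patch's code):

```python
def is_multicast(mac: str) -> bool:
    # The group bit is the least-significant bit of the first octet;
    # broadcast ff:ff:ff:ff:ff:ff is just the all-ones multicast case.
    return int(mac.split(":")[0], 16) & 1 == 1

def pick_datapath(dst_mac: str, vf_present: bool) -> str:
    """The policy as described in the patch: unicast takes the VF when
    one is enslaved; broadcast/multicast always goes via virtio so
    east-west floods stay off the PCI bus."""
    return "vf" if vf_present and not is_multicast(dst_mac) else "virtio"

print(pick_datapath("52:54:00:aa:bb:cc", vf_present=True))   # vf
print(pick_datapath("ff:ff:ff:ff:ff:ff", vf_present=True))   # virtio
```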

> It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to unplug the VF device from the guest on the source
> host and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed, the
> destination hypervisor sets the MAC filter on the VF and plugs it back to
> the guest to switch over to VF datapath.

Is there a host-side patch (planned) for this MAC filter switching
process? As said in another thread, that simple script won't work for a
macvtap backend.

Thanks,
-Siwei

>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> Signed-off-by: Sridhar Samudrala 
> Reviewed-by: Jesse Brandeburg 
> ---
>  drivers/net/virtio_net.c | 307 
> ++-
>  1 file changed, 305 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f149a160a8c5..0e58d364fde9 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -30,6 +30,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>
> @@ -120,6 +122,15 @@ struct receive_queue {
> struct xdp_rxq_info xdp_rxq;
>  };
>
> +struct virtnet_vf_pcpu_stats {
> +   u64 rx_packets;
> +   u64 rx_bytes;
> +   u64 tx_packets;
> +   u64 tx_bytes;
> +   struct u64_stats_sync   syncp;
> +   u32 tx_dropped;
> +};
> +
>  struct virtnet_info {
> struct virtio_device *vdev;
> struct virtqueue *cvq;
> @@ -182,6 +193,10 @@ struct virtnet_info {
> u32 speed;
>
> unsigned long guest_offloads;
> +
> +   /* State to manage the associated VF interface. */
> +   struct net_device __rcu *vf_netdev;
> +   struct virtnet_vf_pcpu_stats __percpu *vf_stats;
>  };
>
>  struct padded_vnet_hdr {
> @@ -1314,16 +1329,53 @@ static int xmit_skb(struct send_queue *sq, struct 
> sk_buff *skb)
> return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
>  }
>
> +/* Send skb on the slave VF device. */
> +static int virtnet_vf_xmit(struct net_device *dev, struct net_device 
> *vf_netdev,
> +  struct sk_buff *skb)
> +{
> +   struct virtnet_info *vi = netdev_priv(dev);
> +   unsigned int len = skb->len;
> +   int rc;
> +
> +   skb->dev = vf_netdev;
> +   skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
> +
> +   rc = dev_queue_xmit(skb);
> +   if (likely(rc == NET_XMIT_SUCCESS || rc == NET_XMIT_CN)) {
> +   struct virtnet_vf_pcpu_stats *pcpu_stats
> +   = this_cpu_ptr(vi->vf_stats);
> +
> +   u64_stats_update_begin(&pcpu_stats->syncp);
> +   pcpu_stats->tx_packets++;
> +   pcpu_stats->tx_bytes += len;
> +   u64_stats_update_end(&pcpu_stats->syncp);
> +   } else {
> +   this_cpu_inc(vi->vf_stats->tx_dropped);
> +   }
> +
> +   return rc;
> +}
> +
>  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
> struct virtnet_info *vi = netdev_priv(dev);
> int qnum = skb_get_queue_mapping(skb);
> struct send_queue *sq = &vi->sq[qnum];
> +   struct net_device *vf_netdev;
> int err;
> struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> bool kick = 

Re: [PATCH net-next 0/2] Enable virtio to act as a master for a passthru device

2018-01-22 Thread Siwei Liu
Apologies I didn't notice that the discussion was mistakenly taken
offline. Post it back.

-Siwei

On Sat, Jan 13, 2018 at 7:25 AM, Siwei Liu <losewe...@gmail.com> wrote:
> On Thu, Jan 11, 2018 at 12:32 PM, Samudrala, Sridhar
> <sridhar.samudr...@intel.com> wrote:
>> On 1/8/2018 9:22 AM, Siwei Liu wrote:
>>>
>>> On Sat, Jan 6, 2018 at 2:33 AM, Samudrala, Sridhar
>>> <sridhar.samudr...@intel.com> wrote:
>>>>
>>>> On 1/5/2018 9:07 AM, Siwei Liu wrote:
>>>>>
>>>>> On Thu, Jan 4, 2018 at 8:22 AM, Samudrala, Sridhar
>>>>> <sridhar.samudr...@intel.com> wrote:
>>>>>>
>>>>>> On 1/3/2018 10:28 AM, Alexander Duyck wrote:
>>>>>>>
>>>>>>> On Wed, Jan 3, 2018 at 10:14 AM, Samudrala, Sridhar
>>>>>>> <sridhar.samudr...@intel.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/3/2018 8:59 AM, Alexander Duyck wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Jan 2, 2018 at 6:16 PM, Jakub Kicinski <kubak...@wp.pl>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> On Tue,  2 Jan 2018 16:35:36 -0800, Sridhar Samudrala wrote:
>>>>>>>>>>>
>>>>>>>>>>> This patch series enables virtio to switch over to a VF datapath
>>>>>>>>>>> when
>>>>>>>>>>> a
>>>>>>>>>>> VF
>>>>>>>>>>> netdev is present with the same MAC address. It allows live
>>>>>>>>>>> migration
>>>>>>>>>>> of
>>>>>>>>>>> a VM
>>>>>>>>>>> with a direct attached VF without the need to setup a bond/team
>>>>>>>>>>> between
>>>>>>>>>>> a
>>>>>>>>>>> VF and virtio net device in the guest.
>>>>>>>>>>>
>>>>>>>>>>> The hypervisor needs to unplug the VF device from the guest on the
>>>>>>>>>>> source
>>>>>>>>>>> host and reset the MAC filter of the VF to initiate failover of
>>>>>>>>>>> datapath
>>>>>>>>>>> to
>>>>>>>>>>> virtio before starting the migration. After the migration is
>>>>>>>>>>> completed,
>>>>>>>>>>> the
>>>>>>>>>>> destination hypervisor sets the MAC filter on the VF and plugs it
>>>>>>>>>>> back
>>>>>>>>>>> to
>>>>>>>>>>> the guest to switch over to VF datapath.
>>>>>>>>>>>
>>>>>>>>>>> It is based on netvsc implementation and it may be possible to
>>>>>>>>>>> make
>>>>>>>>>>> this
>>>>>>>>>>> code
>>>>>>>>>>> generic and move it to a common location that can be shared by
>>>>>>>>>>> netvsc
>>>>>>>>>>> and virtio.
>>>>>>>>>>>
>>>>>>>>>>> This patch series is based on the discussion initiated by Jesse on
>>>>>>>>>>> this
>>>>>>>>>>> thread.
>>>>>>>>>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>>>>>>>>
>>>>>>>>>> How does the notion of a device which is both a bond and a leg of a
>>>>>>>>>> bond fit with Alex's recent discussions about feature propagation?
>>>>>>>>>> Which propagation rules will apply to VirtIO master?  Meaning of
>>>>>>>>>> the
>>>>>>>>>> flags on a software upper device may be different.  Why muddy the
>>>>>>>>>> architecture like this and not introduce a synthetic bond device?
>>>>>>>>>
>>>>>>>>> It doesn't really fit with the notion I had. I think there may have
>>>>>>>>> been a bit of a disconnect as I have been out for the last week or
>>>>>

Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2018-01-10 Thread Siwei Liu
On Wed, Dec 20, 2017 at 8:52 PM, Jakub Kicinski <kubak...@wp.pl> wrote:
> On Wed, 20 Dec 2017 18:16:30 -0800, Siwei Liu wrote:
>> > The plan is to remove the delay and do the naming in the kernel.
>> > This was suggested by Lennart since udev is only doing naming policy
>> > because kernel names were not repeatable.
>> >
>> > This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
>> >
>> > Patch is pending.
>>
>> While it's good to show VF with specific naming to indicate
>> enslavement, I wonder wouldn't it be better to hide this netdev at all
>> from the user space? IMHO this extra device is useless when being
>> enslaved and we may delegate controls (e.g. ethtool) over to the
>> para-virtual device instead? That way it's possible to eliminate the
>> possibility of additional udev setup or modification?
>>
>> I'm not sure if this  is consistent with Windows guest or not, but I
>> don't find it _Linux_ user friendly that ethtool doesn't work on the
>> composite interface any more, and I have to end up with finding out
>> the correct enslaved VF I must operate on.
>
> Hiding "low level" netdevs comes up from time to time, and is more
> widely applicable than just to VF bonds.  We should find a generic
> solution to that problem.

Wholeheartedly agreed.

Be it a generic virtual bond or a virtio-net-specific one, there should
be some common code between netvsc and virtio for this type of work,
for avoiding duplicated bugs, a consistent (Linux) user experience,
future code refactoring/maintenance, and whatever else.

One thing worth noting is that, unlike the Hyper-V netvsc backend,
there's currently no equivalent (fine-grained) Linux ndo_* driver
interface able to move MAC address/VLAN filters around at run time; I
mean the OID_RECEIVE_FILTER_MOVE_FILTER request. That translates to one
substantial difference in the VF plumbing/unplumbing sequence: you
cannot move the MAC address over to the paravirt device until the VF is
fully unplugged from the guest OS. I don't know what backend changes
will be proposed for virtio-net as the helper, but the common code
needs to work with both flavors of datapath-switching backend and do
its job correctly.
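
The sequencing difference can be made explicit with a toy model (the
step names are made up; only the ordering constraint is the point):

```python
def switch_to_pv_sequence(has_filter_move: bool) -> list:
    """Toy model of the VF unplumbing order. With a netvsc-style
    OID_RECEIVE_FILTER_MOVE_FILTER the host can move the RX filter
    while the VF is still plugged in; without such a hook (the virtio
    case today) the VF must be fully unplugged first."""
    if has_filter_move:
        return ["move_rx_filter_to_pv", "unplug_vf", "start_migration"]
    return ["unplug_vf", "move_mac_to_pv", "start_migration"]

# virtio-style backend: MAC move must come strictly after VF unplug.
seq = switch_to_pv_sequence(has_filter_move=False)
print(seq)
```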

Regards,
-Siwei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2018-01-10 Thread Siwei Liu
On Tue, Dec 19, 2017 at 10:41 AM, Stephen Hemminger
 wrote:
> On Tue, 19 Dec 2017 13:21:17 -0500 (EST)
> David Miller  wrote:
>
>> From: Stephen Hemminger 
>> Date: Tue, 19 Dec 2017 09:55:48 -0800
>>
>> > could be 10ms, just enough to let udev do its renaming
>>
>> Please, move to some kind of notification or event based handling of
>> this problem.
>>
>> No delay is safe, what if userspace gets swapped out or whatever
>> else might make userspace stall unexpectedly?
>>
>
> The plan is to remove the delay and do the naming in the kernel.
> This was suggested by Lennart since udev is only doing naming policy
> because kernel names were not repeatable.
>
> This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
>
> Patch is pending.

While it's good to show the VF with specific naming to indicate
enslavement, I wonder whether it wouldn't be better to hide this netdev
from user space entirely. IMHO this extra device is useless while
enslaved, and we could delegate controls (e.g. ethtool) to the
para-virtual device instead. That way it's possible to eliminate the
need for additional udev setup or modification.

I'm not sure if this is consistent with the Windows guest or not, but I
don't find it _Linux_ user friendly that ethtool no longer works on the
composite interface, and I have to end up finding the correct enslaved
VF I must operate on.

Regards,
-Siwei

