Re: [PATCH RFC v7 net-next] netdev: pass the stuck queue to the timeout handler

2019-12-03 Thread Jakub Kicinski
On Tue, 3 Dec 2019 02:12:48 -0500, Michael S. Tsirkin wrote:
> This allows incrementing the correct timeout statistic without any mess.
> Down the road, devices can learn to reset just the specific queue.
> 
> The patch was generated with the following script:

Still:

Acked-by: Jakub Kicinski 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v2] drivers: net: virtio_net: Implement a dev_watchdog handler

2019-11-24 Thread Jakub Kicinski
On Sun, 24 Nov 2019 18:29:49 -0500, Michael S. Tsirkin wrote:
> netdev: pass the stuck queue to the timeout handler
> 
> This allows incrementing the correct timeout statistic without any mess.
> Down the road, devices can learn to reset just the specific queue.

FWIW

Acked-by: Jakub Kicinski 


Re: [PATCH net-next v2] drivers: net: virtio_net: Implement a dev_watchdog handler

2019-11-24 Thread Jakub Kicinski
On Sun, 24 Nov 2019 16:48:35 -0500, Michael S. Tsirkin wrote:
> diff --git a/arch/m68k/emu/nfeth.c b/arch/m68k/emu/nfeth.c
> index a4ebd2445eda..8e06e7407854 100644
> --- a/arch/m68k/emu/nfeth.c
> +++ b/arch/m68k/emu/nfeth.c
> @@ -167,7 +167,7 @@ static int nfeth_xmit(struct sk_buff *skb, struct net_device *dev)
>   return 0;
>  }
>  
> -static void nfeth_tx_timeout(struct net_device *dev)
> +static void nfeth_tx_timeout(struct net_device *dev, int txqueue)

Given the recent vf ndo problems, I wonder if it's worth making the
queue id unsigned from the start? Since it's coming from the stack
there should be no range checking required, but also signed doesn't
help anything so why not?

>  {
>   dev->stats.tx_errors++;
>   netif_wake_queue(dev);



Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-28 Thread Jakub Kicinski
On Thu, 28 Feb 2019 16:20:28 -0800, Siwei Liu wrote:
> On Thu, Feb 28, 2019 at 11:56 AM Jakub Kicinski wrote:
> > On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:  
> > > > It is a bit of a the chicken or the egg situation ;)  But users can
> > > > just blacklist, too.  Anyway, I think this is far better than module
> > > > parameters  
> > >
> > > Sorry I'm a bit confused. What is better than what?  
> >
> > I mean that blacklist net_failover or module param to disable
> > net_failover and handle in user space are better than trying to solve
> > the renaming at kernel level (either by adding module params that make
> > the kernel rename devices or letting user space change names of running
> > devices if they are slaves).  
> 
> Before I was asked to revive this old mail thread, I knew the
> discussion could end up with something like this. Yes, theoretically
> there's a point - basically you don't believe the kernel should take
> the risk of fixing the issue, so you pin your hopes on something
> hypothetical that was never actually done and is hard to get done in
> reality. It's not too different from saying "hey, what you're asking
> for is simply wrong, don't do it! Go back to modifying userspace to
> create a bond or team instead!" FWIW I want to emphasize that the
> debate over the right place to implement this failover facility,
> userspace versus kernel, has been around for almost a decade, and no
> real work ever happened in userspace to "standardize" this in the
> Linux world.

Let me offer you my very subjective opinion of why "no real work ever
happened in user space".  The actors with a primary interest in getting
auto-bonding working are HW vendors, either trying to convince
customers to use SR-IOV or being pressured by customers to make SR-IOV
easier to consume.  HW vendors hire driver developers, not user space
developers.  So the solution we arrive at is in the kernel for a
non-technical reason (Conway's law, sort of).

$ cd NetworkManager/
$ git log --pretty=format:"%ae" | \
grep '\(mellanox\|intel\|broadcom\|netronome\)' | sort | uniq -c
 81 andrew.zaborow...@intel.com
  2 david.woodho...@intel.com
  2 ismo.puusti...@intel.com
  1 michael.i.dohe...@intel.com

Andrew works on WiFi.

I asked the NetworkManager folks to implement this feature last year,
when net_failover got dangerously close to getting merged, and they
said they had never been approached with this request before, much less
offered code that solves it.  Unfortunately, net_failover was merged
before they got around to it, and they didn't proceed.

So to my knowledge nobody ever tried to solve this in user space.
I don't think net_failover is particularly terrible, or that renaming
the primary in the kernel is the end of the world, but I'd appreciate
it if you could point me to efforts to solve it upstream in user space
components, or acknowledge that nobody actually tried that.


Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-28 Thread Jakub Kicinski
On Thu, 28 Feb 2019 15:14:55 -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 28, 2019 at 11:56:41AM -0800, Jakub Kicinski wrote:
> > On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:  
> > > > It is a bit of a the chicken or the egg situation ;)  But users can
> > > > just blacklist, too.  Anyway, I think this is far better than module
> > > > parameters
> > > 
> > > Sorry I'm a bit confused. What is better than what?  
> > 
> > I mean that blacklist net_failover or module param to disable
> > net_failover and handle in user space are better than trying to solve
> > the renaming at kernel level (either by adding module params that make
> > the kernel rename devices or letting user space change names of running
> > devices if they are slaves).
> >   
> > > > for twiddling kernel-based interface naming policy.. :S
> > > 
> > > I see your point. But my point is slave names don't really matter, only
> > > master name matters.  So I am not sure there's any policy worth talking
> > > about here.  
> > 
> > Oh yes, I don't disagree with you, but others seems to want to rename
> > the auto-bonded lower devices.  Which can be done trivially if it was 
> > a daemon in user space instantiating the auto-bond.  We are just
> > providing a basic version of auto-bonding in the kernel.  If there are
> > extra requirements on policy, or naming - the whole thing is better
> > solved in user space.  
> 
> OK so it seems that you would be happy with a combination of the module
> parameter disabling failover completely and renaming primary in kernel?
> Did I get it right?

Not 100%; I'm personally not convinced that renaming the primary in
the kernel is okay.


Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-28 Thread Jakub Kicinski
On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > It is a bit of a the chicken or the egg situation ;)  But users can
> > just blacklist, too.  Anyway, I think this is far better than module
> > parameters  
> 
> Sorry I'm a bit confused. What is better than what?

I mean that blacklisting net_failover, or a module param to disable
net_failover and handle things in user space, is better than trying to
solve the renaming at the kernel level (either by adding module params
that make the kernel rename devices, or by letting user space change
the names of running devices if they are slaves).

> > for twiddling kernel-based interface naming policy.. :S  
> 
> I see your point. But my point is slave names don't really matter, only
> master name matters.  So I am not sure there's any policy worth talking
> about here.

Oh yes, I don't disagree with you, but others seem to want to rename
the auto-bonded lower devices.  Which could be done trivially if it
were a daemon in user space instantiating the auto-bond.  We are just
providing a basic version of auto-bonding in the kernel.  If there are
extra requirements on policy or naming, the whole thing is better
solved in user space.


Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-28 Thread Jakub Kicinski
On Wed, 27 Feb 2019 23:47:33 -0500, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 05:52:18PM -0800, Jakub Kicinski wrote:
> > > > Can the users who care about the naming put net_failover into
> > > > "user space will do the bond enslavement" mode, and do the bond
> > > > creation/management themselves from user space (in systemd/ 
> > > > Network Manager) based on the failover flag?
> > > 
> > > Putting issues of compatibility aside (userspace tends to be confused if
> > > you give it two devices with same MAC), how would you have it work in
> > > practice? Timer based hacks like netvsc where if userspace didn't
> > > respond within X seconds we assume it won't and do everything ourselves?  
> > 
> > Well, what I'm saying is basically if user space knows how to deal with
> > the auto-bonding, we can put aside net_failover for the most part.  It
> > can either be blacklisted or it can have some knob which will
> > effectively disable the auto-enslavement.  
> 
> OK I guess we could add a module parameter to skip this.
> Is this what you mean?

Yup.

> > Auto-bonding capable user space can do the renames, spawn the bond,
> > etc. all by itself.  I'm basically going back to my initial proposal
> > here :)  There is a RedHat bugzilla for the NetworkManager team to do
> > this, but we merged net_failover before those folks got around to
> > implementing it.  
> 
> In particular because there's no policy involved whatsoever
> here so it's just mechanism being pushed up to userspace.
> 
> > IOW if NM/systemd is capable of doing the auto-bonding itself it can
> > disable the kernel mechanism and take care of it all.  If kernel is
> > booted with an old user space which doesn't have capable NM/systemd -
> > net_failover will kick in and do its best.  
> 
> Sure - it's just 2 lines of code, see below.
> 
> Signed-off-by: Michael S. Tsirkin 
> 
> But I don't intend to bother until there's actual interest from
> userspace developers to bother. In particular it is not just NM/systemd
> even on Fedora - e.g. you will need to teach dracut to somehow detect
> and handle this - right now it gets confused if there are two devices
> with same MAC addresses.

It is a bit of a chicken-and-egg situation ;)  But users can
just blacklist, too.  Anyway, I think this is far better than module
parameters for twiddling kernel-based interface naming policy.. :S

> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 955b3e76eb8d..dd2b2c370003 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -43,6 +43,7 @@ static bool csum = true, gso = true, napi_tx;
>  module_param(csum, bool, 0444);
>  module_param(gso, bool, 0444);
>  module_param(napi_tx, bool, 0644);
> +module_param(disable_failover, bool, 0644);
>  
>  /* FIXME: MTU in config. */
>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> @@ -3163,6 +3164,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>   virtnet_init_settings(dev);
>  
> - if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY) &&
> + !disable_failover) {
>   vi->failover = net_failover_create(vi->dev);
>   if (IS_ERR(vi->failover)) {
>   err = PTR_ERR(vi->failover);
> 


Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-27 Thread Jakub Kicinski
On Wed, 27 Feb 2019 20:26:02 -0500, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 04:52:05PM -0800, Jakub Kicinski wrote:
> > On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:  
> > > > As this scheme adds much complexity to the kernel naming convention
> > > > (currently it's just ethX names) that no userspace can understand.
> > > 
> > > Anything that pokes at slaves needs to be specially designed anyway.
> > > Naming seems like a minor issue.  
> > 
> > Can the users who care about the naming put net_failover into
> > "user space will do the bond enslavement" mode, and do the bond
> > creation/management themselves from user space (in systemd/ 
> > Network Manager) based on the failover flag?  
> 
> Putting issues of compatibility aside (userspace tends to be confused if
> you give it two devices with same MAC), how would you have it work in
> practice? Timer based hacks like netvsc where if userspace didn't
> respond within X seconds we assume it won't and do everything ourselves?

Well, what I'm saying is basically if user space knows how to deal with
the auto-bonding, we can put aside net_failover for the most part.  It
can either be blacklisted or it can have some knob which will
effectively disable the auto-enslavement.

Auto-bonding capable user space can do the renames, spawn the bond,
etc. all by itself.  I'm basically going back to my initial proposal
here :)  There is a RedHat bugzilla for the NetworkManager team to do
this, but we merged net_failover before those folks got around to
implementing it.

IOW if NM/systemd is capable of doing the auto-bonding itself it can
disable the kernel mechanism and take care of it all.  If kernel is
booted with an old user space which doesn't have capable NM/systemd -
net_failover will kick in and do its best.


Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-27 Thread Jakub Kicinski
On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:
> > As this scheme adds much complexity to the kernel naming convention
> > (currently it's just ethX names) that no userspace can understand.  
> 
> Anything that pokes at slaves needs to be specially designed anyway.
> Naming seems like a minor issue.

Can the users who care about the naming put net_failover into
"user space will do the bond enslavement" mode, and do the bond
creation/management themselves from user space (in systemd/ 
Network Manager) based on the failover flag?


Re: Resource management for ndo_xdp_xmit (Was: [PATCH net] virtio_net: Account for tx bytes and packets on sending xdp_frames)

2019-02-09 Thread Jakub Kicinski
On Sat, 9 Feb 2019 00:18:31 +, Saeed Mahameed wrote:
> On Fri, 2019-02-08 at 15:17 -0800, Saeed Mahameed wrote:
> > On Thu, 2019-02-07 at 19:08 +, Saeed Mahameed wrote:  
> > > 
> > > So 
> > > 1) on dev_map_update_elem() we will call
> > > dev->dev->ndo_bpf() to notify the device on the intention to
> > > start/stop
> > > redirect, and wait for it to create/destroy the HW resources
> > > before/after actually updating the map
> > >   
> > 
> > silly me, dev_map_update_elem must be atomic, we can't hook driver
> > resource allocation to it, it must come as a separate request
> > (syscall)
> > from user space to request to create XDP redirect resources.
> >   
> 
> Well, it is possible to render dev_map_update_elem non-atomic and fail
> BPF programs who try to update it in the verifier
> check_map_func_compatibility.
> 
> if you know of any case where devmap needs to be updated from the BPF
> program please let me know.

Did we find a solution to non-map redirect?

Sorry if I missed the discussion, I couldn't make the iovisor call this
week due to travel.


Re: [PATCH net] virtio_net: Account for tx bytes and packets on sending xdp_frames

2019-02-06 Thread Jakub Kicinski
On Wed, 6 Feb 2019 14:48:14 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 6 Feb 2019 00:06:33 +
> Saeed Mahameed  wrote:
> 
> > 3) Unrelated, In non XDP case, if skb allocation fails or driver fails
> > to pass the skb up to the stack for somereason, should the driver
> > increase rx packets ? IMHO the answer should be yes if we want to have
> > similar behavior between XDP and non XDP cases.  
> 
> I don't think "skb allocation fails" should increase rx packets
> counter.  The difference is that these events are outside sysadm/users
> control, and is an error detected inside the driver.  The XDP program
> takes a policy choice to XDP_DROP a packet, which can be accounted
> inside the XDP prog (as the samples show) or as we also discuss via a
> more generic XDP-action counters.

FWIW that's my understanding as well.  My understanding of Linux stats
is that they increment one counter per packet: in the RX direction,
success packets are those given to the stack, and for TX those given
to the hardware.  Standards (IETF/IEEE) usually count stats at the
same layer boundary, but I think software generally counts when it's
done with the packet.

I haven't seen it documented anywhere, yet.  I have tried to document
it in the docs of the recent RFC:
https://patchwork.ozlabs.org/patch/1032332/

Incidentally XDP_DROP may have been better named XDP_DISCARD from stats
perspective ;)


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-27 Thread Jakub Kicinski
On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
> Basically we need some sort of PCI or PCIe topology mapping for the
> devices that can be translated into something we can communicate over
> the communication channel. 

Hm.  This is probably a completely stupid idea, but if we need to
start marshalling configuration requests/hints, maybe the entire
problem could be solved by opening a netlink socket from the
hypervisor?  Even make teamd run on the hypervisor side...


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Jakub Kicinski
On Wed, 21 Feb 2018 12:57:09 -0800, Alexander Duyck wrote:
> > I don't see why the team cannot be there always.  
> 
> It is more the logistical nightmare. Part of the goal here was to work
> with the cloud base images that are out there such as
> https://alt.fedoraproject.org/cloud/. With just the kernel changes the
> overhead for this stays fairly small and would be pulled in as just a
> standard part of the kernel update process. The virtio bypass only
> pops up if the backup bit is present. With the team solution it
> requires that the base image use the team driver on virtio_net when it
> sees one. I doubt the OSVs would want to do that just because SR-IOV
> isn't that popular of a case.

IIUC we need to monitor for a "backup hint", spawn the master, rename it
to maintain backwards compatibility with no-VF setups and enslave the VF
if it appears.

All those sound possible from user space, the advantage of the kernel
solution right now is that it has more complete code.

Am I misunderstanding?


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-20 Thread Jakub Kicinski
On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
> Yeah, I can see it now :( I guess that the ship has sailed and we are
> stuck with this ugly thing forever...
> 
> Could you at least make some common code that is shared in between
> netvsc and virtio_net so this is handled in exacly the same way in both?

IMHO netvsc is a vendor-specific driver which made a mistake in what
behaviour it provides (or tried to align itself with Windows SR-IOV).
Let's not make a far, far more commonly deployed and important driver
(virtio) bug-compatible with netvsc.

To Jiri's initial comments, I feel the same way, in fact I've talked to
the NetworkManager guys to get auto-bonding based on MACs handled in
user space.  I think it may very well get done in next versions of NM,
but isn't done yet.  Stephen also raised the point that not everybody is
using NM.


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-18 Thread Jakub Kicinski
On Sat, 17 Feb 2018 09:12:01 -0800, Alexander Duyck wrote:
> >> We noticed a couple of issues with this approach during testing.
> >> - As both 'bypass' and 'backup' netdevs are associated with the same
> >>   virtio pci device, udev tries to rename both of them with the same name
> >>   and the 2nd rename will fail. This would be OK as long as the first 
> >> netdev
> >>   to be renamed is the 'bypass' netdev, but the order in which udev gets
> >>   to rename the 2 netdevs is not reliable.  
> >
> > Out of curiosity - why do you link the master netdev to the virtio
> > struct device?  
> 
> The basic idea of all this is that we wanted this to work with an
> existing VM image that was using virtio. As such we were trying to
> make it so that the bypass interface takes the place of the original
> virtio and get udev to rename the bypass to what the original
> virtio_net was.

That makes sense.  Is it udev/naming that you're most concerned about
here?  I.e. what's the user space app that expects the netdev to be
linked?  This is just out of curiosity; the linking of netdevs to
devices is a bit of a PITA in the switchdev eswitch mode world, with
libvirt expecting only certain devices to be there.  Right now we're
not linking VF representors, which breaks naming.  I wanted to revisit
that.

> > FWIW two solutions that immediately come to mind is to export "backup"
> > as phys_port_name of the backup virtio link and/or assign a name to the
> > master like you are doing already.  I think team uses team%d and bond
> > uses bond%d, soft naming of master devices seems quite natural in this
> > case.  
> 
> I figured I had overlooked something like that.. Thanks for pointing
> this out. Okay so I think the phys_port_name approach might resolve
> the original issue. If I am reading things correctly what we end up
> with is the master showing up as "ens1" for example and the backup
> showing up as "ens1nbackup". Am I understanding that right?

Yes, provided systemd is new enough.

> The problem with the team/bond%d approach is that it creates a new
> netdevice and so it would require guest configuration changes.
>
> > IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
> > link is quite neat.  
> 
> I agree. For non-"backup" virio_net devices would it be okay for us to
> just return -EOPNOTSUPP? I assume it would be and that way the legacy
> behavior could be maintained although the function still exists.

That's my understanding too.


Re: [RFC PATCH v3 2/3] virtio_net: Extend virtio to use VF datapath when available

2018-02-16 Thread Jakub Kicinski
On Fri, 16 Feb 2018 10:11:21 -0800, Sridhar Samudrala wrote:
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
> 
> The hypervisor needs to enable only one datapath at any time so that
> packets don't get looped back to the VM over the other datapath. When a VF
> is plugged, the virtio datapath link state can be marked as down. The
> hypervisor needs to unplug the VF device from the guest on the source host
> and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed,
> the destination hypervisor sets the MAC filter on the VF and plugs it back
> to the guest to switch over to VF datapath.
> 
> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> created that acts as a master device and tracks the state of the 2 lower
> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> passthru device with the same MAC is registered as 'active' netdev.
> 
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
> 
> Signed-off-by: Sridhar Samudrala 
> Signed-off-by: Alexander Duyck  

> +static void
> +virtnet_bypass_get_stats(struct net_device *dev,
> +  struct rtnl_link_stats64 *stats)
> +{
> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
> + const struct rtnl_link_stats64 *new;
> + struct rtnl_link_stats64 temp;
> + struct net_device *child_netdev;
> +
> + spin_lock(&vbi->stats_lock);
> + memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
> +
> + rcu_read_lock();
> +
> + child_netdev = rcu_dereference(vbi->active_netdev);
> + if (child_netdev) {
> + new = dev_get_stats(child_netdev, &temp);
> + virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
> + memcpy(&vbi->active_stats, new, sizeof(*new));
> + }
> +
> + child_netdev = rcu_dereference(vbi->backup_netdev);
> + if (child_netdev) {
> + new = dev_get_stats(child_netdev, &temp);
> + virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
> + memcpy(&vbi->backup_stats, new, sizeof(*new));
> + }
> +
> + rcu_read_unlock();
> +
> + memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
> + spin_unlock(&vbi->stats_lock);
> +}
> +
> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
> +{
> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
> + struct net_device *child_netdev;
> + int ret = 0;
> +
> + child_netdev = rcu_dereference(vbi->active_netdev);
> + if (child_netdev) {
> + ret = dev_set_mtu(child_netdev, new_mtu);
> + if (ret)
> + return ret;
> + }
> +
> + child_netdev = rcu_dereference(vbi->backup_netdev);
> + if (child_netdev) {
> + ret = dev_set_mtu(child_netdev, new_mtu);
> + if (ret)
> + netdev_err(child_netdev,
> +"Unexpected failure to set mtu to %d\n",
> +new_mtu);

You should probably unwind if set fails on one of the legs.

> + }
> +
> + dev->mtu = new_mtu;
> + return 0;
> +}

nit: stats, mtu, all those mundane things are implemented in team
 already.  If we had this as kernel-internal team mode we wouldn't
 have to reimplement them...  You probably did investigate that
 option, for my edification, would you mind saying what the
 challenges/downsides were?

> +static struct net_device *
> +get_virtnet_bypass_bymac(struct net *net, const u8 *mac)
> +{
> + struct net_device *dev;
> +
> + ASSERT_RTNL();
> +
> + for_each_netdev(net, dev) {
> + if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> + continue;   /* not a virtnet_bypass device */

Is there anything inherently wrong with enslaving another virtio dev
now?  I was expecting something like a hash map to map MAC addr ->
master and then one can check if dev is already enslaved to that master.
Just a random thought, I'm probably missing something...

> + if (ether_addr_equal(mac, dev->perm_addr))
> + return dev;
> + }
> +
> + return NULL;
> +}
> +
> +static struct net_device *
> +get_virtnet_bypass_byref(struct net_device *child_netdev)
> +{
> + struct net *net = dev_net(child_netdev);
> + struct net_device *dev;
> +
> + ASSERT_RTNL();
> +
> + for_each_netdev(net, dev) {
> + struct virtnet_bypass_info *vbi;
> +
> + if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
> + continue;   /* not a virtnet_bypass device */
> +
> + vbi = 

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-16 Thread Jakub Kicinski
On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
> Patch 2 is in response to the community request for a 3 netdev
> solution.  However, it creates some issues we'll get into in a moment.
> It extends virtio_net to use alternate datapath when available and
> registered. When BACKUP feature is enabled, virtio_net driver creates
> an additional 'bypass' netdev that acts as a master device and controls
> 2 slave devices.  The original virtio_net netdev is registered as
> 'backup' netdev and a passthru/vf device with the same MAC gets
> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
> associated with the same 'pci' device.  The user accesses the network
> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
> as default for transmits when it is available with link up and running.

Thank you for doing this.

> We noticed a couple of issues with this approach during testing.
> - As both 'bypass' and 'backup' netdevs are associated with the same
>   virtio pci device, udev tries to rename both of them with the same name
>   and the 2nd rename will fail. This would be OK as long as the first netdev
>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>   to rename the 2 netdevs is not reliable. 

Out of curiosity - why do you link the master netdev to the virtio
struct device?

FWIW two solutions that immediately come to mind is to export "backup"
as phys_port_name of the backup virtio link and/or assign a name to the
master like you are doing already.  I think team uses team%d and bond
uses bond%d, soft naming of master devices seems quite natural in this
case.

IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
link is quite neat.

> - When the 'active' netdev is unplugged OR not present on a destination
>   system after live migration, the user will see 2 virtio_net netdevs.

That's necessary and expected, all configuration applies to the master
so master must exist.


Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-26 Thread Jakub Kicinski
On Fri, 26 Jan 2018 21:33:01 -0800, Samudrala, Sridhar wrote:
> >> 3 netdev model breaks this configuration starting with the creation
> >> and naming of the 2 devices to udev needing to be aware of master and
> >> slave virtio-net devices.  
> > I don't understand this comment.  There is one virtio-net device and
> > one "virtio-bond" netdev.  And user space has to be aware of the special
> > automatic arrangement anyway, because it can't touch the VF.  It
> > doesn't make any difference whether it ignores the VF or PV and VF.
> > It simply can't touch the slaves, no matter how many there are.  
> 
> If the userspace is not expected to touch the slaves, then why do we need to
> take extra effort to expose a netdev that is just not really useful.

You said:
"[user space] needs to be aware of master and slave virtio-net devices."

I'm saying:
It has to be aware of the special arrangement whether there is an
explicit bond netdev or not.

> >> Also, from a user experience point of view, loading a virtio-net with
> >> BACKUP feature enabled will now show 2 virtio-net netdevs.  
> > One virtio-net and one virtio-bond, which represents what's happening.  
> This again assumes that we want to represent a bond setup. Can't we 
> treat this
> as virtio-net providing an alternate low-latency datapath by taking over 
> VF datapath?

Bond is just a familiar name, we can call it something else if you
prefer.  The point is there are two data paths which can have
independent low-level settings and a higher level entity with
global settings which represents any path to the outside world.

Hiding low-level netdevs from a lay user requires a generic solution.


Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-26 Thread Jakub Kicinski
On Fri, 26 Jan 2018 15:30:35 -0800, Samudrala, Sridhar wrote:
> On 1/26/2018 2:47 PM, Jakub Kicinski wrote:
> > On Sat, 27 Jan 2018 00:14:20 +0200, Michael S. Tsirkin wrote:  
> >> On Fri, Jan 26, 2018 at 01:46:42PM -0800, Siwei Liu wrote:  
> >>>> and the VM is not expected to do any tuning/optimizations on the VF 
> >>>> driver
> >>>> directly,
> >>>> i think the current patch that follows the netvsc model of 2 
> >>>> netdevs(virtio
> >>>> and vf) should
> >>>> work fine.  
> >>> OK. For your use case that's fine. But that's too specific scenario
> >>> with lots of restrictions IMHO, perhaps very few users will benefit
> >>> from it, I'm not sure. If you're unwilling to move towards it, we'd
> >>> take this one and come back with a generic solution that is able to
> >>> address general use cases for VF/PT live migration .  
> >> I think that's a fine approach. Scratch your own itch!  I imagine a very
> >> generic virtio-switchdev providing host routing info to guests could
> >> address lots of usecases. A driver could bind to that one and enslave
> >> arbitrary other devices.  Sounds reasonable.
> >>
> >> But given the fundamental idea of a failover was floated at least as
> >> early as 2013, and made 0 progress since precisely because it kept
> >> trying to address more and more features, and given netvsc is already
> >> using the basic solution with some success, I'm not inclined to block
> >> this specific effort waiting for the generic one.  
> > I think there is an agreement that the extra netdev will be useful for
> > more advanced use cases, and is generally preferable.  What is the
> > argument for not doing that from the start?  If it was made I must have
> > missed it.  Is it just unwillingness to write the extra 300 lines of
> > code?  Sounds like a pretty weak argument when adding kernel ABI is at
> > stake...  
> 
> I am still not clear on the need for the extra netdev created by 
> virtio_net. The only advantage I can see is that the stats can be
> broken between VF and virtio datapaths compared to the aggregated
> stats on virtio netdev as seen with the 2 netdev approach.

Maybe you're not convinced but multiple arguments were made.

> With 2 netdev model, any VM image that has a working network 
> configuration will transparently get VF based acceleration without
> any changes. 

Nothing happens transparently.  Things may happen automatically.  The
VF netdev doesn't disappear with netvsc.  The PV netdev transforms into
something it did not use to be.  And configures and reports some
information from the PV (e.g. speed) but PV doesn't pass traffic any
longer.
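
The datapath handoff being described — the VF leg carries traffic while it is usable, the PV leg takes over otherwise — can be sketched as a toy active-backup rule (illustrative C only; names like model_select_path are invented, not netvsc or virtio_net internals):

```c
#include <assert.h>
#include <stdbool.h>

enum model_path { MODEL_PATH_PV, MODEL_PATH_VF };

/* Toy state for a PV netdev that has taken over a VF leg. */
struct model_failover {
	bool vf_present;	/* VF leg currently plugged in */
	bool vf_link_up;	/* VF leg has carrier */
};

/*
 * Active-backup rule: prefer the VF fast path whenever it is usable,
 * fall back to the paravirtual path otherwise, e.g. during live
 * migration when the VF link is downed or the VF is hot-unplugged.
 */
static enum model_path model_select_path(const struct model_failover *f)
{
	if (f->vf_present && f->vf_link_up)
		return MODEL_PATH_VF;
	return MODEL_PATH_PV;
}
```

The point in the paragraph above is that this selection happens inside the PV netdev, so from the outside the PV device "transforms" while the VF netdev remains visible.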

> 3 netdev model breaks this configuration starting with the creation
> and naming of the 2 devices to udev needing to be aware of master and
> slave virtio-net devices.

I don't understand this comment.  There is one virtio-net device and
one "virtio-bond" netdev.  And user space has to be aware of the special
automatic arrangement anyway, because it can't touch the VF.  It
doesn't make any difference whether it ignores the VF or PV and VF.
It simply can't touch the slaves, no matter how many there are.

> Also, from a user experience point of view, loading a virtio-net with
> BACKUP feature enabled will now show 2 virtio-net netdevs.

One virtio-net and one virtio-bond, which represents what's happening.

> For live migration with advanced usecases that Siwei is suggesting, i 
> think we need a new driver with a new device type that can track the
> VF specific feature settings even when the VF driver is unloaded.

Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-26 Thread Jakub Kicinski
On Sat, 27 Jan 2018 00:14:20 +0200, Michael S. Tsirkin wrote:
> On Fri, Jan 26, 2018 at 01:46:42PM -0800, Siwei Liu wrote:
> > > and the VM is not expected to do any tuning/optimizations on the VF driver
> > > directly,
> > > i think the current patch that follows the netvsc model of 2 
> > > netdevs(virtio
> > > and vf) should
> > > work fine.  
> > 
> > OK. For your use case that's fine. But that's too specific scenario
> > with lots of restrictions IMHO, perhaps very few users will benefit
> > from it, I'm not sure. If you're unwilling to move towards it, we'd
> > take this one and come back with a generic solution that is able to
> > address general use cases for VF/PT live migration .  
> 
> I think that's a fine approach. Scratch your own itch!  I imagine a very
> generic virtio-switchdev providing host routing info to guests could
> address lots of usecases. A driver could bind to that one and enslave
> arbitrary other devices.  Sounds reasonable.
> 
> But given the fundamental idea of a failover was floated at least as
> early as 2013, and made 0 progress since precisely because it kept
> trying to address more and more features, and given netvsc is already
> using the basic solution with some success, I'm not inclined to block
> this specific effort waiting for the generic one.

I think there is an agreement that the extra netdev will be useful for
more advanced use cases, and is generally preferable.  What is the
argument for not doing that from the start?  If it was made I must have
missed it.  Is it just unwillingness to write the extra 300 lines of
code?  Sounds like a pretty weak argument when adding kernel ABI is at
stake...


Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit

2018-01-23 Thread Jakub Kicinski
On Tue, 23 Jan 2018 03:23:57 +0200, Michael S. Tsirkin wrote:
> > > > b. next-gen silicon may be able to disguise as virtio device, and the
> > > >    loop check in virtio driver will prevent the legitimate bond in such
> > > >case.  AFAIU that's one of the goals of next-gen virtio spec as 
> > > > well.
> > > 
> > > In fact we have a virtio feature bit for the fallback.
> > > So this part does not depend on how software in guest works
> > > and does not need software solutions.  
> > 
> > You mean in the new spec?  Nice.  Still I think people will try to
> > implement the old one too given sufficiently capable HW.  
> 
> Existing HW won't have the BACKUP feature so the new functionality
> won't be activated. So no problem I think.

I meant code that compares of netdev_ops, e.g.:

+   /* Skip our own events */
+   if (event_dev->netdev_ops == &virtnet_netdev)
+   return NOTIFY_DONE;

Would be an obstacle to bonding virtio_nets.  But that's just one of
the thoughts, perhaps of disputable value.  Anyway, separate netdev and
netdev_ops will solve this, and I think we agree that's preferable :)
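
The behavior of that check can be modeled in a few lines of self-contained C (all names are invented stand-ins, not the RFC patch's symbols): a notifier that skips any device whose ops pointer equals its own will also skip a second virtio_net instance, while letting devices from other paravirt drivers through.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for struct net_device_ops and struct net_device. */
struct model_ops { int unused; };
struct model_dev { const struct model_ops *ops; };

static const struct model_ops virtio_like_ops;
static const struct model_ops other_pv_ops;

/*
 * Model of the "skip our own events" check quoted above: the notifier
 * ignores any device whose ops table is the driver's own.
 */
static bool event_skipped(const struct model_dev *event_dev)
{
	return event_dev->ops == &virtio_like_ops;
}
```

Under this rule a second virtio_net is invisible to the notifier — the obstacle to bonding virtio_nets — whereas a separate netdev with its own ops avoids the ambiguity entirely.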


Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit

2018-01-22 Thread Jakub Kicinski
On Tue, 23 Jan 2018 02:47:57 +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 22, 2018 at 04:16:23PM -0800, Jakub Kicinski wrote:
> > On Tue, 23 Jan 2018 02:05:48 +0200, Michael S. Tsirkin wrote:  
> > > > As we are using virtio_net to control and manage the VF data path, it 
> > > > is not
> > > > clear to me
> > > > what is the advantage of creating a new device rather than extending
> > > > virtio_net to manage
> > > > the VF datapath via transparent bond mechanism.  
> > > 
> > > So that XDP redirect actions can differentiate between virtio, PT
> > > device and the bond. Without it there's no way to redirect
> > > to virtio specifically.  
> > 
> > Let's make a list :P
> > 
> > separate netdev:
> > 1. virtio device being a bond master is confusing/unexpected.
> > 2. virtio device being both a master and a slave is confusing.  
> 
> vlans are like this too, aren't they?

Perhaps a bad wording.  Both master and member would be better.

> > 3. configuration of a master may have different semantics than
> >configuration of a slave.
> > 4. two para-virt devices may create a loop (or rather will be bound 
> >to each other indeterministically, depending on which spawns first).  
> 
> For 2 virtio devices, we can disable the bond to make it deterministic.

Do you mean the hypervisor can or is there a knob in virtio_net to mask
off features?  Would that require re-probe of the virtio device?

> > 5. there is no user configuration AFAIR in existing patch, VM admin
> >won't be able to prevent the bond.  Separate netdev we can make 
> >removable even if it's spawned automatically.  
> 
> That's more or less a feature. If you want configurability, just use
> any of the existing generic solutions (team,bond,bridge,...).

The use case in mind is that VM admin wants to troubleshoot a problem
and temporarily disable the auto-bond without touching the hypervisor 
(and either member preferably).

> > 6. XDP redirect use-case (or any explicit use of the virtio slave)
> >(from MST)
> > 
> > independent driver:
> > 7. code reuse.  
> 
> With netvsc? That precludes a separate device though because of
> compatibility.

Hopefully with one of the established bonding drivers (fingers
crossed).  We may see proliferation of special bonds (see Achiad's
presentation from last netdev about NIC-NUMA-node-bonds).

> > separate device:  
> 
> I'm not sure I understand how "separate device" is different from
> "separate netdev". Do you advocate for a special device who's job is
> just to tell the guest "bind these two devices together"?
> 
> Yea, sure, that works. However for sure it's more work to
> implement and manage at all levels. Further
> 
> - it doesn't answer the question
> - a feature bit in a virtio device is cheap enough that
>   I wouldn't worry too much about this feature
>   going unused later.

Right, I think we are referring to different things as device.  I mean
a bus device/struct device, but no strong preference on that one.  I'll
be happy as long as there is a separate netdev, really :)

> > 8. bond any netdev with any netdev.
> > 9. reuse well-known device driver model.
> > a. natural anchor for hypervisor configuration (switchdev etc.)  
> 
> separate netdev has the same property
>
> > b. next-gen silicon may be able to disguise as virtio device, and the
> >loop check in virtio driver will prevent the legitimate bond it such
> >case.  AFAIU that's one of the goals of next-gen virtio spec as well.  
> 
> In fact we have a virtio feature bit for the fallback.
> So this part does not depend on how software in guest works
> and does not need software solutions.

You mean in the new spec?  Nice.  Still I think people will try to
implement the old one too given sufficiently capable HW.


Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit

2018-01-22 Thread Jakub Kicinski
On Tue, 23 Jan 2018 02:05:48 +0200, Michael S. Tsirkin wrote:
> > As we are using virtio_net to control and manage the VF data path, it is not
> > clear to me
> > what is the advantage of creating a new device rather than extending
> > virtio_net to manage
> > the VF datapath via transparent bond mechanism.
> 
> So that XDP redirect actions can differentiate between virtio, PT
> device and the bond. Without it there's no way to redirect
> to virtio specifically.

Let's make a list :P

separate netdev:
1. virtio device being a bond master is confusing/unexpected.
2. virtio device being both a master and a slave is confusing.
3. configuration of a master may have different semantics than
   configuration of a slave.
4. two para-virt devices may create a loop (or rather will be bound 
   to each other indeterministically, depending on which spawns first).
5. there is no user configuration AFAIR in existing patch, VM admin
   won't be able to prevent the bond.  Separate netdev we can make 
   removable even if it's spawned automatically.
6. XDP redirect use-case (or any explicit use of the virtio slave)
   (from MST)

independent driver:
7. code reuse.

separate device:
8. bond any netdev with any netdev.
9. reuse well-known device driver model.
a. natural anchor for hypervisor configuration (switchdev etc.)
b. next-gen silicon may be able to disguise as virtio device, and the
   loop check in virtio driver will prevent the legitimate bond in such
   case.  AFAIU that's one of the goals of next-gen virtio spec as well.


Re: [PATCH net-next 0/2] Enable virtio to act as a master for a passthru device

2018-01-02 Thread Jakub Kicinski
On Tue,  2 Jan 2018 16:35:36 -0800, Sridhar Samudrala wrote:
> This patch series enables virtio to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration of a VM
> with a direct attached VF without the need to setup a bond/team between a
> VF and virtio net device in the guest.
> 
> The hypervisor needs to unplug the VF device from the guest on the source
> host and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed, the
> destination hypervisor sets the MAC filter on the VF and plugs it back to
> the guest to switch over to VF datapath.
> 
> It is based on netvsc implementation and it may be possible to make this code 
> generic and move it to a common location that can be shared by netvsc and 
> virtio.
> 
> This patch series is based on the discussion initiated by Jesse on this 
> thread.
> https://marc.info/?l=linux-virtualization=151189725224231=2

How does the notion of a device which is both a bond and a leg of a
bond fit with Alex's recent discussions about feature propagation?
Which propagation rules will apply to VirtIO master?  Meaning of the
flags on a software upper device may be different.  Why muddy the
architecture like this and not introduce a synthetic bond device?


Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-20 Thread Jakub Kicinski
On Wed, 20 Dec 2017 18:16:30 -0800, Siwei Liu wrote:
> > The plan is to remove the delay and do the naming in the kernel.
> > This was suggested by Lennart since udev is only doing naming policy
> > because kernel names were not repeatable.
> >
> > This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
> >
> > Patch is pending.  
> 
> While it's good to show VF with specific naming to indicate
> enslavement, I wonder wouldn't it be better to hide this netdev at all
> from the user space? IMHO this extra device is useless when being
> enslaved and we may delegate controls (e.g. ethtool) over to the
> para-virtual device instead? That way it's possible to eliminate the
> possibility of additional udev setup or modification?
> 
> I'm not sure if this  is consistent with Windows guest or not, but I
> don't find it _Linux_ user friendly that ethtool doesn't work on the
> composite interface any more, and I have to end up with finding out
> the correct enslaved VF I must operate on.

Hiding "low level" netdevs comes up from time to time, and is more
widely applicable than just to VF bonds.  We should find a generic
solution to that problem.


Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-20 Thread Jakub Kicinski
On Thu, 21 Dec 2017 02:15:31 +0200, Michael S. Tsirkin wrote:
> On Wed, Dec 20, 2017 at 02:33:34PM -0800, Jakub Kicinski wrote:
> > On Mon, 18 Dec 2017 16:40:36 -0800, Sridhar Samudrala wrote:  
> > > +static int virtio_netdev_event(struct notifier_block *this,
> > > +unsigned long event, void *ptr)
> > > +{
> > > + struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> > > +
> > > + /* Skip our own events */
> > > + if (event_dev->netdev_ops == &virtnet_netdev)
> > > + return NOTIFY_DONE;  
> > 
> > I wonder how does this work WRT loop prevention.  What if I have two
> > virtio devices with the same MAC, what is preventing them from claiming
> > each other?  Is it only the check above (not sure what is "own" in
> > comment referring to)?  
> 
> I expect we will add a feature bit (VIRTIO_NET_F_MASTER) and it will
> be host's responsibility not to add more than 1 such device.

Do you mean a virtio_net feature bit?  That won't stop the loop with
a hyperv device.  Unless we want every paravirt device to have such
bit and enforce there can be only one magic bond enslavement active in
the system.

Making the bonding information separate from the slaves seems so much
cleaner...  No limitation on device types, or how many bonds user
chooses to create.  No MAC matching...

> > I'm worried the check above will not stop virtio from enslaving hyperv
> > interfaces and vice versa, potentially leading to a loop, no?  There is
> > also the fact that it would be preferable to share the code between
> > paravirt drivers, to avoid duplicated bugs.
> > 
> > My suggestion during the previous discussion was to create a paravirt
> > bond device, which will explicitly tell the OS which interfaces to
> > bond, regardless of which driver they're using.  Could be some form of
> > ACPI/FW driver too, I don't know enough about x86 FW to suggest
> > something fully fleshed :(  
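
The loop concern can be made concrete with a toy model (illustrative C only, no real driver code): two paravirt devices carrying the same MAC, each applying a naive "enslave any other same-MAC netdev" rule, end up claiming each other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct model_dev {
	unsigned char mac[6];
	bool auto_enslaves;		/* applies the MAC-matching rule */
	const struct model_dev *slave;
};

static bool mac_match(const struct model_dev *a, const struct model_dev *b)
{
	return memcmp(a->mac, b->mac, sizeof(a->mac)) == 0;
}

/*
 * Naive rule: each device claims any other device with the same MAC.
 * Returns true when the two devices end up enslaving each other.
 */
static bool forms_loop(struct model_dev *a, struct model_dev *b)
{
	if (a->auto_enslaves && mac_match(a, b))
		a->slave = b;
	if (b->auto_enslaves && mac_match(b, a))
		b->slave = a;
	return a->slave == b && b->slave == a;
}
```

A per-driver feature bit cannot break this cycle when the two legs come from different paravirt drivers, which is why an explicit, external statement of the bonding intent is the cleaner design.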



Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-13 Thread Jakub Kicinski
On Tue, 5 Dec 2017 11:59:17 +0200, achiad shochat wrote:
>  I second Jacob - having a netdev of one device driver enslave a netdev
>  of another device driver is an awkward a-symmetric model.
>  Regardless of whether they share the same backend device.
>  Only I am not sure the Linux Bond is the right choice.
>  e.g one may well want to use the virtio device also when the
>  pass-through device is available, e.g for multicasts, east-west
>  traffic, etc.
>  I'm not sure the Linux Bond fits that functionality.
>  And, as I hear in this thread, it is hard to make it work out of the box.
>  So I think the right thing would be to write a new dedicated module
>  for this purpose.  
> >
> > This part I can sort of agree with. What if we were to look at
> > providing a way to somehow advertise that the two devices were meant
> > to be boded for virtualization purposes? For now lets call it a
> > "virt-bond". Basically we could look at providing a means for virtio
> > and VF drivers to advertise that they want this sort of bond. Then it
> > would just be a matter of providing some sort of side channel to
> > indicate where you want things like multicast/broadcast/east-west
> > traffic to go.
> 
> I like this approach.

+1 on a separate driver, just enslaving devices to virtio may break
existing setups.  If people are bonding from user space today, if they
update their kernel it may surprise them how things get auto-mangled.

Is what Alex is suggesting a separate PV device that says "I would
like to be a bond of those two interfaces"?  That would make the HV
intent explicit and kernel decisions more understandable.


Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-01 Thread Jakub Kicinski
On Thu, 30 Nov 2017 15:54:40 +0200, Michael S. Tsirkin wrote:
> On Wed, Nov 29, 2017 at 07:51:38PM -0800, Jakub Kicinski wrote:
> > On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote:  
> > > On 29 Nov 2017 03:27, Jesse Brandeburg wrote:
> > > > Hi, I'd like to get some feedback on a proposal to enhance virtio-net
> > > > to ease configuration of a VM and that would enable live migration of
> > > > passthrough network SR-IOV devices.
> > > >
> > > > Today we have SR-IOV network devices (VFs) that can be passed into a VM
> > > > in order to enable high performance networking direct within the VM.
> > > > The problem I am trying to address is that this configuration is
> > > > generally difficult to live-migrate.  There is documentation [1]
> > > > indicating that some OS/Hypervisor vendors will support live migration
> > > > of a system with a direct assigned networking device.  The problem I
> > > > see with these implementations is that the network configuration
> > > > requirements that are passed on to the owner of the VM are quite
> > > > complicated.  You have to set up bonding, you have to configure it to
> > > > enslave two interfaces, those interfaces (one is virtio-net, the other
> > > > is SR-IOV device/driver like ixgbevf) must support MAC address changes
> > > > requested in the VM, and on and on...
> > > >
> > > > So, on to the proposal:
> > > > Modify virtio-net driver to be a single VM network device that
> > > > enslaves an SR-IOV network device (inside the VM) with the same MAC
> > > > address. This would cause the virtio-net driver to appear and work like
> > > > a simplified bonding/team driver.  The live migration problem would be
> > > > solved just like today's bonding solution, but the VM user's networking
> > > > config would be greatly simplified.
> > > >
> > > > At it's simplest, it would appear something like this in the VM.
> > > >
> > > > ==
> > > > = vnet0  =
> > > >   =
> > > > (virtio- =   |
> > > >   net)=   |
> > > >   =  ==
> > > >   =  = ixgbef =
> > > > ==  ==
> > > >
> > > > (forgive the ASCII art)
> > > >
> > > > The fast path traffic would prefer the ixgbevf or other SR-IOV device
> > > > path, and fall back to virtio's transmit/receive when migrating.
> > > >
> > > > Compared to today's options this proposal would
> > > > 1) make virtio-net more sticky, allow fast path traffic at SR-IOV
> > > > speeds
> > > > 2) simplify end user configuration in the VM (most if not all of the
> > > > set up to enable migration would be done in the hypervisor)
> > > > 3) allow live migration via a simple link down and maybe a PCI
> > > > hot-unplug of the SR-IOV device, with failover to the virtio-net
> > > > driver core
> > > > 4) allow vendor agnostic hardware acceleration, and live migration
> > > > between vendors if the VM os has driver support for all the required
> > > > SR-IOV devices.
> > > >
> > > > Runtime operation proposed:
> > > > -  virtio-net driver loads, SR-IOV driver loads
> > > > - virtio-net finds other NICs that match its MAC address by
> > > >both examining existing interfaces, and sets up a new device notifier
> > > > - virtio-net enslaves the first NIC with the same MAC address
> > > > - virtio-net brings up the slave, and makes it the "preferred" path
> > > > - virtio-net follows the behavior of an active backup bond/team
> > > > - virtio-net acts as the interface to the VM
> > > > - live migration initiates
> > > > - link goes down on SR-IOV, or SR-IOV device is removed
> > > > - failover to virtio-net as primary path
> > > > - migration continues to new host
> > > > - new host is started with virtio-net as primary
> > > > - if no SR-IOV, virtio-net stays primary
> > > > - hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio
> > > > - virtio-net notices new NIC and starts over at enslave step above
> > > >
> > > > Future ideas (brainstorming):
> > > > - Optimize Fast east-west by having special rules to direct east-west
> > > >    traffic through virtio-net traffic path

Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-01 Thread Jakub Kicinski
On Wed, 29 Nov 2017 20:10:09 -0800, Stephen Hemminger wrote:
> On Wed, 29 Nov 2017 19:51:38 -0800 Jakub Kicinski wrote:
> > On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote:  
> > > On 29 Nov 2017 03:27, Jesse Brandeburg wrote:
> > > commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
> > > Author: stephen hemminger <step...@networkplumber.org>
> > > Date:   Tue Aug 1 19:58:53 2017 -0700
> > > 
> > >      netvsc: transparent VF management
> > > 
> > >      This patch implements transparent fail over from synthetic NIC
> > > to SR-IOV virtual function NIC in Hyper-V environment. It is a
> > > better alternative to using bonding as is done now. Instead, the
> > > receive and transmit fail over is done internally inside the driver.
> > > 
> > >      Using bonding driver has lots of issues because it depends on
> > > the script being run early enough in the boot process and with
> > > sufficient information to make the association. This patch moves
> > > all that functionality into the kernel.
> > > 
> > >      Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
> > >      Signed-off-by: David S. Miller <da...@davemloft.net>
> > > 
> > > If my understanding is correct there's no need to for any extension
> > > of virtio spec. If this is true, maybe you can start to prepare the
> > > patch?
> > 
> > IMHO this is as close to policy in the kernel as one can get.  User
> > land has all the information it needs to instantiate that bond/team
> > automatically.  In fact I'm trying to discuss this with NetworkManager
> > folks and Red Hat right now:
> > 
> > https://mail.gnome.org/archives/networkmanager-list/2017-November/msg00038.html
> > 
> > Can we flip the argument and ask why is the kernel supposed to be
> > responsible for this?  It's not like we run DHCP out of the kernel
> > on new interfaces...   
> 
> Although "policy should not be in the kernel" is a a great mantra,
> it is not practical in the real world.
> 
> If you think it can be solved in userspace, then you haven't had to
> deal with four different network initialization
> systems, multiple orchestration systems and customers on ancient
> Enterprise distributions.

I would accept that argument if anyone ever tried to get those
Enterprise distros to handle this use case.  From conversations I 
had it seemed like no one ever did, and SR-IOV+virtio bonding is 
what has been done to solve this since day 1 of SR-IOV networking.

For practical reasons it's easier to push this into the kernel, 
because vendors rarely employ developers of the user space
orchestrations systems.  Is that not the real problem here,
potentially? :)

Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-01 Thread Jakub Kicinski
On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote:
> On 29 Nov 2017 03:27, Jesse Brandeburg wrote:
> > Hi, I'd like to get some feedback on a proposal to enhance virtio-net
> > to ease configuration of a VM and that would enable live migration of
> > passthrough network SR-IOV devices.
> >
> > Today we have SR-IOV network devices (VFs) that can be passed into a VM
> > in order to enable high performance networking direct within the VM.
> > The problem I am trying to address is that this configuration is
> > generally difficult to live-migrate.  There is documentation [1]
> > indicating that some OS/Hypervisor vendors will support live migration
> > of a system with a direct assigned networking device.  The problem I
> > see with these implementations is that the network configuration
> > requirements that are passed on to the owner of the VM are quite
> > complicated.  You have to set up bonding, you have to configure it to
> > enslave two interfaces, those interfaces (one is virtio-net, the other
> > is SR-IOV device/driver like ixgbevf) must support MAC address changes
> > requested in the VM, and on and on...
> >
> > So, on to the proposal:
> > Modify virtio-net driver to be a single VM network device that
> > enslaves an SR-IOV network device (inside the VM) with the same MAC
> > address. This would cause the virtio-net driver to appear and work like
> > a simplified bonding/team driver.  The live migration problem would be
> > solved just like today's bonding solution, but the VM user's networking
> > config would be greatly simplified.
> >
> > At it's simplest, it would appear something like this in the VM.
> >
> > ==
> > = vnet0  =
> >   =
> > (virtio- =   |
> >   net)=   |
> >   =  ==
> >   =  = ixgbevf =
> > ==  ==
> >
> > (forgive the ASCII art)
> >
> > The fast path traffic would prefer the ixgbevf or other SR-IOV device
> > path, and fall back to virtio's transmit/receive when migrating.
> >
> > Compared to today's options this proposal would
> > 1) make virtio-net more sticky, allow fast path traffic at SR-IOV
> > speeds
> > 2) simplify end user configuration in the VM (most if not all of the
> > set up to enable migration would be done in the hypervisor)
> > 3) allow live migration via a simple link down and maybe a PCI
> > hot-unplug of the SR-IOV device, with failover to the virtio-net
> > driver core
> > 4) allow vendor agnostic hardware acceleration, and live migration
> > between vendors if the VM os has driver support for all the required
> > SR-IOV devices.
> >
> > Runtime operation proposed:
> > -  virtio-net driver loads, SR-IOV driver loads
> > - virtio-net finds other NICs that match its MAC address by
> >both examining existing interfaces, and sets up a new device notifier
> > - virtio-net enslaves the first NIC with the same MAC address
> > - virtio-net brings up the slave, and makes it the "preferred" path
> > - virtio-net follows the behavior of an active backup bond/team
> > - virtio-net acts as the interface to the VM
> > - live migration initiates
> > - link goes down on SR-IOV, or SR-IOV device is removed
> > - failover to virtio-net as primary path
> > - migration continues to new host
> > - new host is started with virtio-net as primary
> > - if no SR-IOV, virtio-net stays primary
> > - hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio
> > - virtio-net notices new NIC and starts over at enslave step above
> >
> > Future ideas (brainstorming):
> > - Optimize Fast east-west by having special rules to direct east-west
> >traffic through virtio-net traffic path
> >
> > Thanks for reading!
> > Jesse  
> 
> Cc netdev.
> 
> Interesting, and this method is actually used by netvsc now:
> 
> commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
> Author: stephen hemminger 
> Date:   Tue Aug 1 19:58:53 2017 -0700
> 
>      netvsc: transparent VF management
> 
>      This patch implements transparent fail over from synthetic NIC to
>      SR-IOV virtual function NIC in Hyper-V environment. It is a better
>      alternative to using bonding as is done now. Instead, the receive and
>      transmit fail over is done internally inside the driver.
> 
>      Using bonding driver has lots of issues because it depends on the
>      script being run early enough in the boot process and with sufficient
>      information to make the association. This patch moves all that
>      functionality into the kernel.
> 
>      Signed-off-by: Stephen Hemminger 
>      Signed-off-by: David S. Miller 
> 
> If my understanding is correct there's no need to for any extension of 
> virtio spec. If this is true, maybe you can start to prepare the patch?

IMHO this is as close to policy in the kernel as one can get.  User
land has all the information it needs to instantiate that bond/team
automatically.  In fact I'm