from:"Stephen Hemminger"

Re: [RFC 1/2] vhost: IFC VF hardware operation layer

2019-10-15 Thread Stephen Hemminger

On Wed, 16 Oct 2019 09:03:17 +0800
Zhu Lingshan  wrote:

> + IFC_INFO(&dev->dev, "PCI capability mapping:\n"
> + "common cfg: %p\n"
> + "notify base: %p\n"
> + "isr cfg: %p\n"
> + "device cfg: %p\n"
> + "multiplier: %u\n",
> + hw->common_cfg,
> + hw->notify_base,
> + hw->isr,
> + hw->dev_cfg,
> + hw->notify_off_multiplier);

Kernel messages go to syslog, which does not handle multi-line
messages very well. This should be a single line.

Also, this is the kind of message that should be at the debug
level: useful to driver developers, but not something that needs
to fill up every end user's log.
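
As a rough illustration of the single-line suggestion, the sketch below
formats the same five values into one line in plain userspace C. The helper
name format_cap_mapping and the struct cap_map stand-in are hypothetical and
not part of the patch; in the driver itself the string would presumably go
through a debug-level helper such as dev_dbg() rather than an info-level one.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the fields the patch prints from struct ifcvf_hw. */
struct cap_map {
	void *common_cfg;
	void *notify_base;
	void *isr;
	void *dev_cfg;
	unsigned int notify_off_multiplier;
};

/*
 * Format the whole capability mapping as one line with no embedded
 * newlines, so syslog sees a single record. Returns the snprintf()
 * result: the length the full message needs, excluding the NUL.
 */
static int format_cap_mapping(char *buf, size_t len, const struct cap_map *m)
{
	return snprintf(buf, len,
			"PCI capability mapping: common cfg: %p notify base: %p isr cfg: %p device cfg: %p multiplier: %u",
			m->common_cfg, m->notify_base, m->isr, m->dev_cfg,
			m->notify_off_multiplier);
}
```

The caller decides where the line goes and at which level, which keeps the
formatting separate from the choice of log level.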
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC 1/2] vhost: IFC VF hardware operation layer

2019-10-15 Thread Stephen Hemminger

On Wed, 16 Oct 2019 09:03:17 +0800
Zhu Lingshan  wrote:

> +int ifcvf_init_hw(struct ifcvf_hw *hw, struct pci_dev *dev)
> +{
> + int ret;
> + u8 pos;
> + struct virtio_pci_cap cap;
> + u32 i;
> + u16 notify_off;

For network code, the preferred declaration style is
reverse Christmas tree (longest declaration line first, shortest last).
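
For illustration, here are the five locals from the quoted hunk reordered
that way. This is a userspace sketch: the typedefs, the reduced struct and
the function name reverse_xmas_tree_demo are stand-ins invented for the
example, and the function body is filler rather than the driver's logic.

```c
#include <stdint.h>

/* Stand-ins for the kernel typedefs so the sketch builds in userspace. */
typedef uint8_t u8;
typedef uint16_t u16;
typedef uint32_t u32;

/* Hypothetical, reduced stand-in for the real struct virtio_pci_cap. */
struct virtio_pci_cap {
	u8 cap_len;
};

/*
 * Same five locals as the quoted hunk, ordered from the longest
 * declaration line to the shortest. The body is filler so that
 * every variable is used; it is not the driver's logic.
 */
static int reverse_xmas_tree_demo(void)
{
	struct virtio_pci_cap cap = { .cap_len = 0 };
	u16 notify_off = 0;
	int ret = 0;
	u8 pos = 0;
	u32 i = 0;

	ret = cap.cap_len + notify_off + pos + (int)i;
	return ret;
}
```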

Re: [PATCH net v6] failover: allow name change on IFF_UP slave interfaces

2019-04-07 Thread Stephen Hemminger

On Fri, 5 Apr 2019 18:01:43 -0400
"Michael S. Tsirkin"  wrote:

> > 
> > This notifier is not really necessary, there already is a CHANGENAME
> > that gets sent.
> > NETDEV_CHANGE is used in other cases to mean that the state (flags)
> > have changed.  
> 
> The point is some existing scripts might not expect name
> change to happen without a status change afterwards (since it was
> impossible for so long). So this reports a change
> to make sure scripts do not miss it.

I don't think it matters, because if the device name is changed while it
is down (!IFF_UP) then only CHANGENAME is sent. The NETDEV_CHANGE is
just noise to an application.

Re: [PATCH net v6] failover: allow name change on IFF_UP slave interfaces

2019-04-05 Thread Stephen Hemminger

On Fri, 5 Apr 2019 17:28:55 -0400
"Michael S. Tsirkin"  wrote:

> On Wed, Apr 03, 2019 at 12:52:47AM -0400, Si-Wei Liu wrote:
> > When a netdev appears through hot plug then gets enslaved by a failover
> > master that is already up and running, the slave will be opened
> > right away after getting enslaved. Today there's a race that userspace
> > (udev) may fail to rename the slave if the kernel (net_failover)
> > opens the slave earlier than when the userspace rename happens.
> > Unlike bond or team, the primary slave of failover can't be renamed by
> > userspace ahead of time, since the kernel initiated auto-enslavement is
> > unable to, or rather, is never meant to be synchronized with the rename
> > request from userspace.
> > 
> > As the failover slave interfaces are not designed to be operated
> > directly by userspace apps: IP configuration, filter rules with
> > regard to network traffic passing and etc., should all be done on master
> > interface. In general, userspace apps only care about the
> > name of master interface, while slave names are less important as long
> > as admin users can see reliable names that may carry
> > other information describing the netdev. For e.g., they can infer that
> > "ens3nsby" is a standby slave of "ens3", while for a
> > name like "eth0" they can't tell which master it belongs to.
> > 
> > Historically the name of IFF_UP interface can't be changed because
> > there might be admin script or management software that is already
> > relying on such behavior and assumes that the slave name can't be
> > changed once UP. But failover is special: with the in-kernel
> > auto-enslavement mechanism, the userspace expectation for device
> > enumeration and bring-up order is already broken. Previously initramfs
> > and various userspace config tools were modified to bypass failover
> > slaves because of auto-enslavement and duplicate MAC address. Similarly,
> > in case that users care about seeing reliable slave name, the new type
> > of failover slaves needs to be taken care of specifically in userspace
> > anyway.
> > 
> > It's less risky to lift up the rename restriction on failover slave
> > which is already UP. Although it's possible this change may potentially
> > break userspace component (most likely configuration scripts or
> > management software) that assumes slave name can't be changed while
> > UP, it's relatively a limited and controllable set among all userspace
> > components, which can be fixed specifically to listen for the rename
> > and/or link down/up events on failover slaves. Userspace component
> > interacting with slaves is expected to be changed to operate on failover
> > master interface instead, as the failover slave is dynamic in nature
> > which may come and go at any point.  The goal is to make the role of
> > failover slaves less relevant, and userspace components should only
> > deal with failover master in the long run.
> > 
> > Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
> > Signed-off-by: Si-Wei Liu 
> > Reviewed-by: Liran Alon   
> 
> Acked-by: Michael S. Tsirkin 
> 
> Stephen are you happy with this approach?

I think it is the best solution for what you want to do. 

Did you test with some things like Free Range Routing, VPP or other userspace
control planes that consume netlink?

Re: [PATCH net v6] failover: allow name change on IFF_UP slave interfaces

2019-04-05 Thread Stephen Hemminger

On Wed,  3 Apr 2019 00:52:47 -0400
Si-Wei Liu  wrote:

>  
> + if (unlikely(dev->flags & IFF_UP)) {
> + struct netdev_notifier_change_info change_info = {
> + .info.dev = dev,
> + };
> +
> + call_netdevice_notifiers_info(NETDEV_CHANGE,
> +   &change_info.info);
> + }

This notifier is not really necessary; a CHANGENAME notification
is already sent.

NETDEV_CHANGE is used in other cases to mean that the state (flags)
has changed.


Re: [PATCH net v5] failover: allow name change on IFF_UP slave interfaces

2019-04-03 Thread Stephen Hemminger

On Tue, 2 Apr 2019 22:22:18 -0700
"Samudrala, Sridhar"  wrote:

> On 4/2/2019 8:14 PM, Stephen Hemminger wrote:
> > On Tue, 2 Apr 2019 15:23:29 -0700
> > si-wei liu  wrote:
> >   
> >> On 4/2/2019 2:53 PM, Stephen Hemminger wrote:  
> >>> On Mon,  1 Apr 2019 19:04:53 -0400
> >>> Si-Wei Liu  wrote:
> >>> 
> >>>> +if (dev->flags & IFF_UP &&
> >>>> +likely(!(dev->priv_flags & IFF_FAILOVER_SLAVE)))  
> >>> Why is property limited to failover slave, it would make sense for netvsc
> >>> as well. Why not make it a flag like live address change?  
> >> Well, netvsc today is still taking the delayed approach meaning that it
> >> is incompatible yet with this live name change flag if need be. ;-)
> >>
> >> I thought Sridhar did not like to introduce an additional
> >> IFF_SLAVE_RENAME_OK flag given that failover slave is the only consumer
> >> for the time being. Even though I can get it back, patch is needed for
> >> netvsc to remove the VF takeover delay IMHO.
> >>
> >> Sridhar, what do you think we revive the IFF_SLAVE_RENAME_OK flag which
> >> allows netvsc to be used later on? Or maybe, IFF_LIVE_RENAME_OK for a
> >> better name?
> >>
> >> -Siwei  
> > 
> > I would name it IFF_LIVE_NAME_CHANGE to match IFF_LIVE_ADDR_CHANGE
> > there is no reason its use should be restricted to SLAVE devices.
> >  
> Stephen,
> May be you should consider moving netvsc to use the net_failover driver now?
> 

NO

Why would I waste time doing that when there is a working, cleaner solution
that already works across four OSes and three versions of five major distributions?

Re: [PATCH net v5] failover: allow name change on IFF_UP slave interfaces

2019-04-02 Thread Stephen Hemminger

On Tue, 2 Apr 2019 15:23:29 -0700
si-wei liu  wrote:

> On 4/2/2019 2:53 PM, Stephen Hemminger wrote:
> > On Mon,  1 Apr 2019 19:04:53 -0400
> > Si-Wei Liu  wrote:
> >  
> >> +  if (dev->flags & IFF_UP &&
> >> +  likely(!(dev->priv_flags & IFF_FAILOVER_SLAVE)))  
> > Why is property limited to failover slave, it would make sense for netvsc
> > as well. Why not make it a flag like live address change?  
> Well, netvsc today is still taking the delayed approach meaning that it 
> is incompatible yet with this live name change flag if need be. ;-)
> 
> I thought Sridhar did not like to introduce an additional 
> IFF_SLAVE_RENAME_OK flag given that failover slave is the only consumer 
> for the time being. Even though I can get it back, patch is needed for 
> netvsc to remove the VF takeover delay IMHO.
> 
> Sridhar, what do you think we revive the IFF_SLAVE_RENAME_OK flag which 
> allows netvsc to be used later on? Or maybe, IFF_LIVE_RENAME_OK for a 
> better name?
> 
> -Siwei

I would name it IFF_LIVE_NAME_CHANGE to match IFF_LIVE_ADDR_CHANGE;
there is no reason its use should be restricted to SLAVE devices.
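
A minimal userspace sketch of what such an opt-in check could look like is
below. Everything in it is an assumption made for illustration: the reduced
struct net_device, the flag bit values, and the helper name rename_allowed
are not taken from the patch or from the kernel headers.

```c
#include <errno.h>

/* Stand-in flag values for the sketch; the kernel's real values differ. */
#define IFF_UP               (1u << 0)
#define IFF_LIVE_NAME_CHANGE (1u << 1)	/* hypothetical priv_flags bit */

/* Reduced stand-in for struct net_device. */
struct net_device {
	unsigned int flags;
	unsigned int priv_flags;
};

/*
 * Return 0 if a rename may proceed, -EBUSY otherwise. An interface
 * that is up may only be renamed when it has opted in via the flag,
 * regardless of whether it is a slave device.
 */
static int rename_allowed(const struct net_device *dev)
{
	if ((dev->flags & IFF_UP) &&
	    !(dev->priv_flags & IFF_LIVE_NAME_CHANGE))
		return -EBUSY;
	return 0;
}
```

With a generic flag the check no longer mentions failover at all, so any
driver that can tolerate a live rename could set the bit.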


Re: [PATCH net v5] failover: allow name change on IFF_UP slave interfaces

2019-04-02 Thread Stephen Hemminger

On Mon,  1 Apr 2019 19:04:53 -0400
Si-Wei Liu  wrote:

> + if (dev->flags & IFF_UP &&
> + likely(!(dev->priv_flags & IFF_FAILOVER_SLAVE)))

Why is this property limited to failover slaves? It would make sense for
netvsc as well. Why not make it a flag like live address change?

Re: [PATCH net v4] failover: allow name change on IFF_UP slave interfaces

2019-03-29 Thread Stephen Hemminger

On Thu, 28 Mar 2019 19:47:27 -0400
Si-Wei Liu  wrote:

> + if (unlikely(dev->flags & IFF_UP)) {
> + struct netdev_notifier_change_info change_info;
> +
> + change_info.flags_changed = 0;

Simpler to use structure initialization, which also avoids any chance
of uninitialized fields.

	struct netdev_notifier_change_info change_info = {
		.flags_changed = 0,
	};
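
The reason this avoids uninitialized fields is a C language rule: members
not named in an initializer are zero-initialized. The userspace sketch below
shows that with reduced stand-in structures; the layouts and the helper name
make_change_info are assumptions for the example, not the kernel definitions.

```c
/* Reduced stand-ins for the kernel structures, for illustration only. */
struct netdev_notifier_info {
	void *dev;
};

struct netdev_notifier_change_info {
	struct netdev_notifier_info info;
	unsigned int flags_changed;
};

/*
 * Naming one member in the initializer makes C zero-initialize
 * every member that is not named, so info.dev starts out NULL.
 */
static struct netdev_notifier_change_info make_change_info(void)
{
	struct netdev_notifier_change_info change_info = {
		.flags_changed = 0,
	};

	return change_info;
}
```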

Re: [PATCH net v3] failover: allow name change on IFF_UP slave interfaces

2019-03-28 Thread Stephen Hemminger

On Wed, 27 Mar 2019 16:44:19 -0700
si-wei liu  wrote:

> On 3/27/2019 4:11 AM, Jiri Pirko wrote:
> > Wed, Mar 27, 2019 at 12:48:13AM CET, si-wei@oracle.com wrote:  
> >> When a netdev appears through hot plug then gets enslaved by a failover
> >> master that is already up and running, the slave will be opened
> >> right away after getting enslaved. Today there's a race that userspace
> >> (udev) may fail to rename the slave if the kernel (net_failover)
> >> opens the slave earlier than when the userspace rename happens.
> >> Unlike bond or team, the primary slave of failover can't be renamed by
> >> userspace ahead of time, since the kernel initiated auto-enslavement is
> >> unable to, or rather, is never meant to be synchronized with the rename
> >> request from userspace.
> >>
> >> As the failover slave interfaces are not designed to be operated
> >> directly by userspace apps: IP configuration, filter rules with
> >> regard to network traffic passing and etc., should all be done on master
> >> interface. In general, userspace apps only care about the
> >> name of master interface, while slave names are less important as long
> >> as admin users can see reliable names that may carry
> >> other information describing the netdev. For e.g., they can infer that
> >> "ens3nsby" is a standby slave of "ens3", while for a
> >> name like "eth0" they can't tell which master it belongs to.
> >>
> >> Historically the name of IFF_UP interface can't be changed because
> >> there might be admin script or management software that is already
> >> relying on such behavior and assumes that the slave name can't be
> >> changed once UP. But failover is special: with the in-kernel
> >> auto-enslavement mechanism, the userspace expectation for device
> >> enumeration and bring-up order is already broken. Previously initramfs
> >> and various userspace config tools were modified to bypass failover
> >> slaves because of auto-enslavement and duplicate MAC address. Similarly,
> >> in case that users care about seeing reliable slave name, the new type
> >> of failover slaves needs to be taken care of specifically in userspace
> >> anyway.
> >>
> >> It's less risky to lift up the rename restriction on failover slave
> >> which is already UP. Although it's possible this change may potentially
> >> break userspace component (most likely configuration scripts or
> >> management software) that assumes slave name can't be changed while
> >> UP, it's relatively a limited and controllable set among all userspace
> >> components, which can be fixed specifically to listen for the rename
> >> and/or link down/up events on failover slaves. Userspace component
> >> interacting with slaves is expected to be changed to operate on failover
> >> master interface instead, as the failover slave is dynamic in nature
> >> which may come and go at any point.  The goal is to make the role of
> >> failover slaves less relevant, and userspace components should only
> >> deal with failover master in the long run.
> >>
> >> Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
> >> Signed-off-by: Si-Wei Liu 
> >> Reviewed-by: Liran Alon 
> >>
> >> --
> >> v1 -> v2:
> >> - Drop configurable module parameter (Sridhar)
> >>
> >> v2 -> v3:
> >> - Drop additional IFF_SLAVE_RENAME_OK flag (Sridhar)
> >> - Send down and up events around rename (Michael S. Tsirkin)
> >> ---
> >> net/core/dev.c | 37 ++---
> >> 1 file changed, 34 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index 722d50d..3e0cd80 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -1171,6 +1171,7 @@ int dev_get_valid_name(struct net *net, struct 
> >> net_device *dev,
> >> int dev_change_name(struct net_device *dev, const char *newname)
> >> {
> >>unsigned char old_assign_type;
> >> +  bool reopen_needed = false;
> >>char oldname[IFNAMSIZ];
> >>int err = 0;
> >>int ret;
> >> @@ -1180,8 +1181,24 @@ int dev_change_name(struct net_device *dev, const 
> >> char *newname)
> >>BUG_ON(!dev_net(dev));
> >>
> >>net = dev_net(dev);
> >> -  if (dev->flags & IFF_UP)
> >> -  return -EBUSY;
> >> +
> >> +  /* Allow failover slave to rename even when
> >> +   * it is up and running.
> >> +   *
> >> +   * Failover slaves are special, since userspace
> >> +   * might rename the slave after the interface
> >> +   * has been brought up and running due to
> >> +   * auto-enslavement.
> >> +   *
> >> +   * Failover users don't actually care about slave
> >> +   * name change, as they are only expected to operate
> >> +   * on master interface directly.
> >> +   */
> >> +  if (dev->flags & IFF_UP) {
> >> +  if (likely(!(dev->priv_flags & IFF_FAILOVER_SLAVE)))
> >> +  return -EBUSY;
> >> +  reopen_needed = true;
> >> +  }
> >>
> >>write_seqcount_begin(&devnet_rename_seq);
> >>
> >> @@ -1198,6 +1215,9 @@ int dev_change_name(struct net_device *dev, const

Re: [PATCH net v3] failover: allow name change on IFF_UP slave interfaces

2019-03-26 Thread Stephen Hemminger

On Tue, 26 Mar 2019 19:48:13 -0400
Si-Wei Liu  wrote:

> When a netdev appears through hot plug then gets enslaved by a failover
> master that is already up and running, the slave will be opened
> right away after getting enslaved. Today there's a race that userspace
> (udev) may fail to rename the slave if the kernel (net_failover)
> opens the slave earlier than when the userspace rename happens.
> Unlike bond or team, the primary slave of failover can't be renamed by
> userspace ahead of time, since the kernel initiated auto-enslavement is
> unable to, or rather, is never meant to be synchronized with the rename
> request from userspace.
> 
> As the failover slave interfaces are not designed to be operated
> directly by userspace apps: IP configuration, filter rules with
> regard to network traffic passing and etc., should all be done on master
> interface. In general, userspace apps only care about the
> name of master interface, while slave names are less important as long
> as admin users can see reliable names that may carry
> other information describing the netdev. For e.g., they can infer that
> "ens3nsby" is a standby slave of "ens3", while for a
> name like "eth0" they can't tell which master it belongs to.
> 
> Historically the name of IFF_UP interface can't be changed because
> there might be admin script or management software that is already
> relying on such behavior and assumes that the slave name can't be
> changed once UP. But failover is special: with the in-kernel
> auto-enslavement mechanism, the userspace expectation for device
> enumeration and bring-up order is already broken. Previously initramfs
> and various userspace config tools were modified to bypass failover
> slaves because of auto-enslavement and duplicate MAC address. Similarly,
> in case that users care about seeing reliable slave name, the new type
> of failover slaves needs to be taken care of specifically in userspace
> anyway.
> 
> It's less risky to lift up the rename restriction on failover slave
> which is already UP. Although it's possible this change may potentially
> break userspace component (most likely configuration scripts or
> management software) that assumes slave name can't be changed while
> UP, it's relatively a limited and controllable set among all userspace
> components, which can be fixed specifically to listen for the rename
> and/or link down/up events on failover slaves. Userspace component
> interacting with slaves is expected to be changed to operate on failover
> master interface instead, as the failover slave is dynamic in nature
> which may come and go at any point.  The goal is to make the role of
> failover slaves less relevant, and userspace components should only
> deal with failover master in the long run.
> 
> Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
> Signed-off-by: Si-Wei Liu 
> Reviewed-by: Liran Alon 


Why do you need to do dev_close/dev_open, which will bounce
the link?


Re: [summary] virtio network device failover writeup

2019-03-21 Thread Stephen Hemminger

On Thu, 21 Mar 2019 15:04:37 +0200
Liran Alon  wrote:

> > 
> > OK. Now what happens if master is moved to another namespace? Do we need
> > to move the slaves too?  
> 
> No. Why would we move the slaves? The whole point is to make most customer 
> ignore the net-failover slaves and remain them “hidden” in their dedicated 
> netns.
> We won’t prevent customer from explicitly moving the net-failover slaves out 
> of this netns, but we will not move them out of there automatically.

The 2-device netvsc already handles the case where the master changes namespace.

Re: [summary] virtio network device failover writeup

2019-03-21 Thread Stephen Hemminger

On Thu, 21 Mar 2019 08:57:03 -0400
"Michael S. Tsirkin"  wrote:

> On Thu, Mar 21, 2019 at 02:47:50PM +0200, Liran Alon wrote:
> > 
> >   
> > > On 21 Mar 2019, at 14:37, Michael S. Tsirkin  wrote:
> > > 
> > > On Thu, Mar 21, 2019 at 12:07:57PM +0200, Liran Alon wrote:  
> > >> 2) It brings non-intuitive customer experience. For example, a 
> > >> customer may attempt to analyse connectivity issue by checking the 
> > >> connectivity
> > >> on a net-failover slave (e.g. the VF) but will see no connectivity 
> > >> when in-fact checking the connectivity on the net-failover master 
> > >> netdev shows correct connectivity.
> > >> 
> > >> The set of changes I vision to fix our issues are:
> > >> 1) Hide net-failover slaves in a different netns created and managed 
> > >> by the kernel. But that user can enter to it and manage the netdevs 
> > >> there if wishes to do so explicitly.
> > >> (E.g. Configure the net-failover VF slave in some special way).
> > >> 2) Match the virtio-net and the VF based on a PV attribute instead 
> > >> of MAC. (Similar to as done in NetVSC). E.g. Provide a virtio-net 
> > >> interface to get PCI slot where the matching VF will be hot-plugged 
> > >> by hypervisor.
> > >> 3) Have an explicit virtio-net control message to command hypervisor 
> > >> to switch data-path from virtio-net to VF and vice-versa. Instead of 
> > >> relying on intercepting the PCI master enable-bit
> > >> as an indicator on when VF is about to be set up. (Similar to as 
> > >> done in NetVSC).
> > >> 
> > >> Is there any clear issue we see regarding the above suggestion?
> > >> 
> > >> -Liran  
> > > 
> > > The issue would be this: how do we avoid conflicting with namespaces
> > > created by users?  
> >  
> >  This is kinda controversial, but maybe separate netns names into 2 
> >  groups: hidden and normal.
> >  To reference a hidden netns, you need to do it explicitly. 
> >  Hidden and normal netns names can collide as they will be maintained 
> >  in different namespaces (Yes I’m overloading the term namespace 
> >  here…).  
> > >>> 
> > >>> Maybe it's an unnamed namespace. Hidden until userspace gives it a 
> > >>> name?  
> > >> 
> > >> This is also a good idea that will solve the issue. Yes.
> > >>   
> > >>>   
> >  Does this seems reasonable?
> >  
> >  -Liran  
> > >>> 
> > >>> Reasonable I'd say yes, easy to implement probably no. But maybe I
> > >>> missed a trick or two.  
> > >> 
> > >> BTW, from a practical point of view, I think that even until we figure 
> > >> out a solution on how to implement this,
> > >> it was better to create an kernel auto-generated name (e.g. 
> > >> “kernel_net_failover_slaves")
> > >> that will break only userspace workloads that by a very rare-chance have 
> > >> a netns that collides with this then
> > >> the breakage we have today for the various userspace components.
> > >> 
> > >> -Liran  
> > > 
> > > It seems quite easy to supply that as a module parameter. Do we need two
> > > namespaces though? Won't some userspace still be confused by the two
> > > slaves sharing the MAC address?  
> > 
> > That’s one reasonable option.
> > Another one is that we will indeed change the mechanism by which we 
> > determine a VF should be bonded with a virtio-net device.
> > i.e. Expose a new virtio-net property that specify the PCI slot of the VF 
> > to be bonded with.
> > 
> > The second seems cleaner but I don’t have a strong opinion on this. Both 
> > seem reasonable to me and your suggestion is faster to implement from 
> > current state of things.
> > 
> > -Liran  
> 
> OK. Now what happens if master is moved to another namespace? Do we need
> to move the slaves too?
> 
> Also siwei's patch is then kind of extraneous right?
> Attempts to rename a slave will now fail as it's in a namespace...

I did try moving the slave device into a namespace at one point.
The problem is that it introduces all sorts of locking problems in the code,
because you can't do it directly in the context of the callback that fires
when a new slave device is discovered.

Since you can't safely change the device namespace in the notifier,
it requires a work queue. That adds more complexity and error cases,
because the slave is exposed for a short period and all the state race
unwinds have to be handled.

Good idea, but hard to implement.

Re: [summary] virtio network device failover writeup

2019-03-19 Thread Stephen Hemminger

On Tue, 19 Mar 2019 14:38:06 +0200
Liran Alon  wrote:

> b.3) cloud-init: If configured to perform network-configuration, it attempts 
> to configure all available netdevs. It should avoid however doing so on 
> net-failover slaves.
> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist 
> a netdev from being configured in case it is owned by a specific PCI driver. 
> Specifically, they blacklist Mellanox VF driver. However, this technique 
> doesn’t work for the net-failover mechanism because both the net-failover 
> netdev and the virtio-net netdev are owned by the virtio-net PCI driver).

Cloud-init should really just ignore all devices that have a master device.
That would have been more general, and safer for other use cases.
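
A userspace tool could make that check without knowing anything about the
driver. The sketch below assumes that an enslaved interface exposes a
"master" entry under its /sys/class/net/<ifname>/ directory; the helper name
has_master is hypothetical, and the sysfs root is a parameter only so the
function can be exercised outside a real system.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Return 1 if <sysfs_net>/<ifname>/master exists, 0 if it does not or
 * if the path does not fit the buffer. On Linux, sysfs_net would
 * normally be "/sys/class/net".
 */
static int has_master(const char *sysfs_net, const char *ifname)
{
	char path[512];
	int n;

	n = snprintf(path, sizeof(path), "%s/%s/master", sysfs_net, ifname);
	if (n < 0 || (size_t)n >= sizeof(path))
		return 0;
	return access(path, F_OK) == 0;
}
```

A configuration tool would then skip any interface for which
has_master("/sys/class/net", ifname) returns 1, whichever driver owns it.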

[RFC] vhost: select TAP if VHOST is configured

2019-03-13 Thread Stephen Hemminger

If VHOST_NET is configured but TUN and TAP are not, then the
kernel will build but vhost will not work correctly, since it can't
set up the necessary tap device.

A solution is to select TAP.

Fixes: 9a393b5d5988 ("tap: tap as an independent module")
Signed-off-by: Stephen Hemminger 
---
 drivers/vhost/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index b580885243f7..a24c69598241 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -1,7 +1,8 @@
 config VHOST_NET
 	tristate "Host kernel accelerator for virtio net"
-	depends on NET && EVENTFD && (TUN || !TUN) && (TAP || !TAP)
+	depends on NET && EVENTFD
 	select VHOST
+	select TAP
 	---help---
 	  This kernel module can be loaded in host kernel to accelerate
 	  guest networking with virtio_net. Not to be confused with virtio_net
-- 
2.17.1


Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-05 Thread Stephen Hemminger

On Tue, 5 Mar 2019 11:19:32 -0800
si-wei liu  wrote:

> > I have a vague idea: would it work to *not* set
> > IFF_UP on slave devices at all?  
> Hmm, I ever thought about this option, and it appears this solution is 
> more invasive than required to convert existing scripts, despite the 
> controversy of introducing internal netdev state to differentiate user 
> visible state. Either we disallow slave to be brought up by user, or to 
> not set IFF_UP flag but instead use the internal one, could end up with 
> substantial behavioral change that breaks scripts. Consider any admin 
> script that does `ip link set dev ... up' successfully just assumes the 
> link is up and subsequent operation can be done as usual. While it *may* 
> work for dracut (yet to be verified), I'm a bit concerned that there are 
> more scripts to be converted than those that don't follow volatile 
> failover slave names. It's technically doable, but may not worth the 
> effort (in terms of porting existing scripts/apps).
> 
> Thanks
> -Siwei

Won't work for most devices. Many devices turn off the PHY and link layer
when not IFF_UP.

Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-27 Thread Stephen Hemminger

On Wed, 27 Feb 2019 18:50:44 -0500
"Michael S. Tsirkin"  wrote:

> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
> > 
> > 
> > On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:  
> > > On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:  
> > > > 
> > > > On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:  
> > > > > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:  
> > > > > > On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:  
> > > > > > > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:  
> > > > > > > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:  
> > > > > > > > > On 2/21/2019 7:33 PM, si-wei liu wrote:  
> > > > > > > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:  
> > > > > > > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu 
> > > > > > > > > > > wrote:  
> > > > > > > > > > > > Sorry for replying to this ancient thread. There was 
> > > > > > > > > > > > some remaining
> > > > > > > > > > > > issue that I don't think the initial net_failover patch 
> > > > > > > > > > > > got addressed
> > > > > > > > > > > > cleanly, see:
> > > > > > > > > > > > 
> > > > > > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > > > > > > 
> > > > > > > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev 
> > > > > > > > > > > > userspace was
> > > > > > > > > > > > > not specifically written for such kernel automatic 
> > > > > > > > > > > > enslavement.
> > > > > > > > > > > > Specifically, if it is a bond or team, the slave would 
> > > > > > > > > > > > typically get
> > > > > > > > > > > > renamed *before* virtual device gets created, that's 
> > > > > > > > > > > > what udev can
> > > > > > > > > > > > control (without getting netdev opened early by the 
> > > > > > > > > > > > other part of
> > > > > > > > > > > > kernel) and other userspace components for e.g. 
> > > > > > > > > > > > initramfs,
> > > > > > > > > > > > init-scripts can coordinate well in between. The 
> > > > > > > > > > > > in-kernel
> > > > > > > > > > > > auto-enslavement of net_failover breaks this userspace 
> > > > > > > > > > > > convention,
> > > > > > > > > > > > > which doesn't provide a solution if users care about 
> > > > > > > > > > > > consistent naming
> > > > > > > > > > > > on the slave netdevs specifically.
> > > > > > > > > > > > 
> > > > > > > > > > > > Previously this issue had been specifically called out 
> > > > > > > > > > > > when IFF_HIDDEN
> > > > > > > > > > > > and the 1-netdev was proposed, but no one gives out a 
> > > > > > > > > > > > solution to this
> > > > > > > > > > > > problem ever since. Please share your mind how to 
> > > > > > > > > > > > proceed and solve
> > > > > > > > > > > > this userspace issue if netdev does not welcome a 
> > > > > > > > > > > > 1-netdev model.  
> > > > > > > > > > > Above says:
> > > > > > > > > > > 
> > > > > > > > > > >there's no motivation in the systemd/udevd 
> > > > > > > > > > > community at
> > > > > > > > > > >this point to refactor the rename logic and make 
> > > > > > > > > > > it work well with
> > > > > > > > > > >3-netdev.
> > > > > > > > > > > 
> > > > > > > > > > > What would the fix be? Skip slave devices?
> > > > > > > > > > >   
> > > > > > > > > > There's nothing user can get if just skipping slave devices 
> > > > > > > > > > - the
> > > > > > > > > > name is still unchanged and unpredictable e.g. eth0, or 
> > > > > > > > > > eth1 the
> > > > > > > > > > next reboot, while the rest may conform to the naming 
> > > > > > > > > > scheme (ens3
> > > > > > > > > > and such). There's no way one can fix this in userspace 
> > > > > > > > > > alone - when
> > > > > > > > > > the failover is created the enslaved netdev was opened by 
> > > > > > > > > > the kernel
> > > > > > > > > > earlier than the userspace is made aware of, and there's no
> > > > > > > > > > negotiation protocol for kernel to know when userspace has 
> > > > > > > > > > done
> > > > > > > > > > initial renaming of the interface. I would expect netdev 
> > > > > > > > > > list should
> > > > > > > > > > at least provide the direction in general for how this can 
> > > > > > > > > > be
> > > > > > > > > > solved...  
> > > > > > > I was just wondering what did you mean when you said
> > > > > > > "refactor the rename logic and make it work well with 3-netdev" -
> > > > > > > was there a proposal udev rejected?  
> > > > > > No. I never believed this particular issue can be fixed in 
> > > > > > userspace alone.
> > > > > > Previously someone had said it could be, but I never see any work or
> > > > > > relevant discussion ever happened in various userspace communities 
> > > > > > (for e.g.
> > > > > > dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO 
> > > > > > the root
> > > > > > of the issue derives from the kernel, it makes more sense to start 
> > > > > > from
> > > > > > netdev, work out and decide on a solution: see what can be don

Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-27 Thread Stephen Hemminger

On Tue, 26 Feb 2019 16:17:21 -0800
si-wei liu  wrote:

> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:  
> >>
> >> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:  
> >>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:  
>  On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:  
> > On 2/21/2019 7:33 PM, si-wei liu wrote:  
> >> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:  
> >>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:  
>  Sorry for replying to this ancient thread. There was some remaining
>  issue that I don't think the initial net_failover patch got addressed
>  cleanly, see:
> 
>  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> 
>  The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>  not specifically written for such kernel automatic enslavement.
>  Specifically, if it is a bond or team, the slave would typically get
>  renamed *before* virtual device gets created, that's what udev can
>  control (without getting netdev opened early by the other part of
>  kernel) and other userspace components for e.g. initramfs,
>  init-scripts can coordinate well in between. The in-kernel
>  auto-enslavement of net_failover breaks this userspace convention,
>  which don't provides a solution if user care about consistent naming
>  on the slave netdevs specifically.
> 
>  Previously this issue had been specifically called out when IFF_HIDDEN
>  and the 1-netdev was proposed, but no one gives out a solution to this
>  problem ever since. Please share your mind how to proceed and solve
>  this userspace issue if netdev does not welcome a 1-netdev model.
> >>> Above says:
> >>>
> >>>   there's no motivation in the systemd/udevd community at
> > >>>   this point to refactor the rename logic and make it work well with
> > >>>   3-netdev.
> >>>
> >>> What would the fix be? Skip slave devices?
> >>>  
> >> There's nothing user can get if just skipping slave devices - the
> >> name is still unchanged and unpredictable e.g. eth0, or eth1 the
> >> next reboot, while the rest may conform to the naming scheme (ens3
> >> and such). There's no way one can fix this in userspace alone - when
> >> the failover is created the enslaved netdev was opened by the kernel
> >> earlier than the userspace is made aware of, and there's no
> >> negotiation protocol for kernel to know when userspace has done
> >> initial renaming of the interface. I would expect netdev list should
> >> at least provide the direction in general for how this can be
> >> solved...  
> >>> I was just wondering what did you mean when you said
> >>> "refactor the rename logic and make it work well with 3-netdev" -
> >>> was there a proposal udev rejected?  
> >> No. I never believed this particular issue can be fixed in userspace alone.
> >> Previously someone had said it could be, but I never see any work or
> > >> relevant discussion ever happened in various userspace communities (for e.g.
> > >> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> >> of the issue derives from the kernel, it makes more sense to start from
> >> netdev, work out and decide on a solution: see what can be done in the
> >> kernel in order to fix it, then after that engage userspace community for
> >> the feasibility...
> >>  
> >>> Anyway, can we write a time diagram for what happens in which order that
> >>> leads to failure?  That would help look for triggers that we can tie
> >>> into, or add new ones.
> >>>  
> >> See attached diagram.
> >>  
> >>>
> >>>
> >>>  
> > > Is there an issue if slave device names are not predictable? The
> > > user/admin scripts are expected to only work with the master failover
> > > device.
>  Where does this expectation come from?
> 
>  Admin users may have ethtool or tc configurations that need to deal with
>  predictable interface name. Third-party app which was built upon specifying
>  certain interface name can't be modified to chase dynamic names.
> 
>  Specifically, we have pre-canned image that uses ethtool to fine tune VF
>  offload settings post boot for specific workload. Those images won't work
>  well if the name is constantly changing just after couple rounds of live
>  migration.  
> >>> It should be possible to specify the ethtool configuration on the
> >>> master and have it automatically propagated to the slave.
> >>>
> >>> BTW this is something we should look at IMHO.  
> >> I was elaborating a few examples that the expectation and assumption that
> >> user/admin scripts only deal with master failover device is incorrect. It
> >>

Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-25 Thread Stephen Hemminger

On Mon, 25 Feb 2019 16:58:07 -0800
si-wei liu  wrote:

> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:  
> >>
> >> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:  
> >>>
> >>> On 2/21/2019 7:33 PM, si-wei liu wrote:  
> 
>  On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:  
> > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:  
> >> Sorry for replying to this ancient thread. There was some remaining
> >> issue that I don't think the initial net_failover patch got addressed
> >> cleanly, see:
> >>
> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> >>
> >> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > >> not specifically written for such kernel automatic enslavement.
> >> Specifically, if it is a bond or team, the slave would typically get
> >> renamed *before* virtual device gets created, that's what udev can
> >> control (without getting netdev opened early by the other part of
> >> kernel) and other userspace components for e.g. initramfs,
> >> init-scripts can coordinate well in between. The in-kernel
> >> auto-enslavement of net_failover breaks this userspace convention,
> >> which don't provides a solution if user care about consistent naming
> >> on the slave netdevs specifically.
> >>
> >> Previously this issue had been specifically called out when IFF_HIDDEN
> >> and the 1-netdev was proposed, but no one gives out a solution to this
> >> problem ever since. Please share your mind how to proceed and solve
> >> this userspace issue if netdev does not welcome a 1-netdev model.  
> > Above says:
> >
> >  there's no motivation in the systemd/udevd community at
> >  this point to refactor the rename logic and make it work well with
> >  3-netdev.
> >
> > What would the fix be? Skip slave devices?
> >  
>  There's nothing user can get if just skipping slave devices - the
>  name is still unchanged and unpredictable e.g. eth0, or eth1 the
>  next reboot, while the rest may conform to the naming scheme (ens3
>  and such). There's no way one can fix this in userspace alone - when
>  the failover is created the enslaved netdev was opened by the kernel
>  earlier than the userspace is made aware of, and there's no
>  negotiation protocol for kernel to know when userspace has done
>  initial renaming of the interface. I would expect netdev list should
>  at least provide the direction in general for how this can be
>  solved...  
> >
> > I was just wondering what did you mean when you said
> > "refactor the rename logic and make it work well with 3-netdev" -
> > was there a proposal udev rejected?  
> No. I never believed this particular issue can be fixed in userspace 
> alone. Previously someone had said it could be, but I never see any work 
> or relevant discussion ever happened in various userspace communities 
> (for e.g. dracut, initramfs-tools, systemd, udev, and NetworkManager). 
> IMHO the root of the issue derives from the kernel, it makes more sense 
> to start from netdev, work out and decide on a solution: see what can be 
> done in the kernel in order to fix it, then after that engage userspace 
> community for the feasibility...
> 
> > Anyway, can we write a time diagram for what happens in which order that
> > leads to failure?  That would help look for triggers that we can tie
> > into, or add new ones.
> >  
> 
> See attached diagram.
> 
> >
> >
> >
> >  
> > >>> Is there an issue if slave device names are not predictable? The
> > >>> user/admin scripts are expected to only work with the master failover
> > >>> device.
> >> Where does this expectation come from?
> >>
> >> Admin users may have ethtool or tc configurations that need to deal with
> >> predictable interface name. Third-party app which was built upon specifying
> >> certain interface name can't be modified to chase dynamic names.
> >>
> >> Specifically, we have pre-canned image that uses ethtool to fine tune VF
> >> offload settings post boot for specific workload. Those images won't work
> >> well if the name is constantly changing just after couple rounds of live
> >> migration.  
> > It should be possible to specify the ethtool configuration on the
> > master and have it automatically propagated to the slave.
> >
> > BTW this is something we should look at IMHO.  
> I was elaborating a few examples that the expectation and assumption 
> that user/admin scripts only deal with master failover device is 
> incorrect. It had never been taken good care of, although I did try to 
> emphasize it from the very beginning.
> 
> Basically what you said about propagating the ethtool configuration down 
> to the slave is the key pursuance of 1-netdev model. However, what I am 
> seeking now is any alternative that can also fix the specific udev 
> r

Re: [PATCH net] vhost: correctly check the return value of translate_desc() in log_used()

2019-02-15 Thread Stephen Hemminger

On Fri, 15 Feb 2019 15:53:24 +0800
Jason Wang  wrote:

> When fail, translate_desc() returns negative value, otherwise the
> number of iovs. So we should fail when the return value is negative
> instead of a blindly check against zero.
> 
> Reported-by: Stephen Hemminger 
> Fixes: cc5e71075947 ("vhost: log dirty page correctly")
> Signed-off-by: Jason Wang 

Looks good. It is best to put the Addresses-Coverity-Id tag on this kind
of bug fix so that the automated tools see it.

Re: [RFC] virtio_net: add local_bh_disable() around u64_stats_update_begin

2018-10-16 Thread Stephen Hemminger

On Tue, 16 Oct 2018 20:42:07 +0200
Sebastian Andrzej Siewior  wrote:

> On 2018-10-16 11:01:14 [-0700], Stephen Hemminger wrote:
> > On Tue, 16 Oct 2018 18:55:45 +0200
> > Sebastian Andrzej Siewior  wrote:
> >   
> > > Also, ptr->var++ is not an atomic operation even on 64bit CPUs. Which
> > > means if try_fill_recv() runs on CPU0 (via virtnet_receive()) then the
> > > worker might run on CPU1.  
> > 
> > On modern CPUs increment of native types is atomic but not locked.
> > u64_stats_update_begin is a no-op on UP and also if BITS_PER_LONG != 32
> 
> On ARM64 you have load, inc, store. So if two CPUs increment the counter
> simultaneously we might lose one increment. That is why I asked if we
> care or not.
> 
> Sebastian

The point is that kicks is just a counter, not important as part of the
device operation. The point of the u64_stats is to avoid problems with
high/low 32 bit wrap on increment. So this is ok on ARM64.
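
A minimal userspace sketch of the point above, not kernel code: on 32-bit builds the sequence counter behind u64_stats guards readers against seeing a half-updated high/low pair; it does not serialize concurrent writers, so a lost increment on a pure counter is a separate (and here tolerated) matter. The `stats64` type and its helpers are hypothetical names, and C11 atomics stand in for the kernel's seqcount primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a u64_stats-style counter. */
struct stats64 {
    _Atomic uint32_t seq;   /* odd while a writer is mid-update */
    _Atomic uint32_t lo;    /* low 32 bits of the counter */
    _Atomic uint32_t hi;    /* high 32 bits of the counter */
};

/* Writer side: assumes a single writer at a time (e.g. one per queue). */
static void stats64_inc(struct stats64 *s)
{
    uint32_t lo;

    atomic_fetch_add(&s->seq, 1);       /* begin: seq becomes odd */
    lo = atomic_load(&s->lo) + 1;
    atomic_store(&s->lo, lo);
    if (lo == 0)                        /* low half wrapped: carry into high */
        atomic_fetch_add(&s->hi, 1);
    atomic_fetch_add(&s->seq, 1);       /* end: seq becomes even again */
}

/* Reader side: retry until an even, unchanged seq brackets both loads. */
static uint64_t stats64_read(struct stats64 *s)
{
    uint32_t start, lo, hi;

    do {
        start = atomic_load(&s->seq);
        lo = atomic_load(&s->lo);
        hi = atomic_load(&s->hi);
    } while ((start & 1) || start != atomic_load(&s->seq));

    return ((uint64_t)hi << 32) | lo;
}
```

A reader that races with the carry from the low to the high half simply retries, which is the "high/low 32 bit wrap" problem the begin/end pair exists for.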

Re: [RFC] virtio_net: add local_bh_disable() around u64_stats_update_begin

2018-10-16 Thread Stephen Hemminger

On Tue, 16 Oct 2018 18:55:45 +0200
Sebastian Andrzej Siewior  wrote:

> Also, ptr->var++ is not an atomic operation even on 64bit CPUs. Which
> means if try_fill_recv() runs on CPU0 (via virtnet_receive()) then the
> worker might run on CPU1.

On modern CPUs increment of native types is atomic but not locked.
u64_stats_update_begin is a no-op on UP and also if BITS_PER_LONG != 32

Re: [RFC] virtio_net: add local_bh_disable() around u64_stats_update_begin

2018-10-16 Thread Stephen Hemminger

On Tue, 16 Oct 2018 18:55:45 +0200
Sebastian Andrzej Siewior  wrote:

> on 32bit, lockdep notices:
> | 
> | WARNING: inconsistent lock state
> | 4.19.0-rc8+ #9 Tainted: GW
> | 
> | inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
> | ip/1106 [HC0[0]:SC1[1]:HE1:SE0] takes:
> | (ptrval) (&syncp->seq#2){+.?.}, at: net_rx_action+0xc8/0x380
> | {SOFTIRQ-ON-W} state was registered at:
> |   lock_acquire+0x7e/0x170
> |   try_fill_recv+0x5fa/0x700
> |   virtnet_open+0xe0/0x180
> |   __dev_open+0xae/0x130
> |   __dev_change_flags+0x17f/0x200
> |   dev_change_flags+0x23/0x60
> |   do_setlink+0x2bb/0xa20
> |   rtnl_newlink+0x523/0x830
> |   rtnetlink_rcv_msg+0x14b/0x470
> |   netlink_rcv_skb+0x6e/0xf0
> |   rtnetlink_rcv+0xd/0x10
> |   netlink_unicast+0x16e/0x1f0
> |   netlink_sendmsg+0x1af/0x3a0
> |   ___sys_sendmsg+0x20f/0x240
> |   __sys_sendmsg+0x39/0x80
> |   sys_socketcall+0x13a/0x2a0
> |   do_int80_syscall_32+0x50/0x180
> |   restore_all+0x0/0xb2
> | irq event stamp: 3326
> | hardirqs last  enabled at (3326): [] net_rx_action+0x80/0x380
> | hardirqs last disabled at (3325): [] net_rx_action+0x5a/0x380
> | softirqs last  enabled at (3322): [] virtnet_napi_enable+0xd/0x60
> | softirqs last disabled at (3323): [] call_on_stack+0xd/0x50
> |
> | other info that might help us debug this:
> |  Possible unsafe locking scenario:
> |
> |CPU0
> |
> |   lock(&syncp->seq#2);
> |   
> | lock(&syncp->seq#2);
> |
> |  *** DEADLOCK ***
> 
> This is the "up" path which is not a hotpath. There is also
> refill_work().
> It might be unwise to add the local_bh_disable() to try_fill_recv()
> because if it is used mostly in BH so that local_bh_en+dis might be a
> waste of cycles.
> 
> Adding local_bh_disable() around try_fill_recv() for the non-BH call
> sites would render GFP_KERNEL pointless.
> 
> Also, ptr->var++ is not an atomic operation even on 64bit CPUs. Which
> means if try_fill_recv() runs on CPU0 (via virtnet_receive()) then the
> worker might run on CPU1.
> 
> Do we care or is this just stupid stats?  Any suggestions?
> 
> This warning appears since commit 461f03dc99cf6 ("virtio_net: Add kick stats").
> 
> Signed-off-by: Sebastian Andrzej Siewior 
> ---
>  drivers/net/virtio_net.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index dab504ec5e502..d782160cfa882 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1206,9 +1206,11 @@ static bool try_fill_recv(struct virtnet_info *vi, struct receive_queue *rq,
>   break;
>   } while (rq->vq->num_free);
>   if (virtqueue_kick_prepare(rq->vq) && virtqueue_notify(rq->vq)) {
> + local_bh_disable();
>   u64_stats_update_begin(&rq->stats.syncp);
>   rq->stats.kicks++;
>   u64_stats_update_end(&rq->stats.syncp);
> + local_bh_enable();
>   }
>  
>   return !oom;

Since there already is u64_stats_update_begin_irqsave inline, why not introduce
u64_stats_update_begin_bh which encapsulates the local_bh_disable?

Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-31 Thread Stephen Hemminger

On Wed, 30 May 2018 20:03:11 -0700
"Samudrala, Sridhar"  wrote:

> On 5/30/2018 7:06 PM, Stephen Hemminger wrote:
> > On Thu, 24 May 2018 09:55:14 -0700
> > Sridhar Samudrala  wrote:
> >  
> >> Use the registration/notification framework supported by the generic
> >> failover infrastructure.
> >>
> >> Signed-off-by: Sridhar Samudrala   
> > Why was this merged? It was never signed off by any of the netvsc maintainers,
> > and there were still issues unresolved.
> >
> > There are also namespaces issues I am fixing and this breaks them.
> > Will start my patch set with a revert for this. Sorry  
> 
> I would appreciate if you can make the fixes on top of this patch series. I tried hard
> to make sure that netvsc functionality and behavior doesn't change.
> 
> It is possible that there could be some bugs introduced, but they can be fixed.
> Looks like Wei already found a bug and submitted a fix for that.
> 

Ok, but several of these may clash with what you want for virtio.
Like:
- VF should be moved to namespace of virt device
- VF should be associated based on message from host with serial # not
  registration notifier and MAC address.
- control operations should use master device reference rather than
  searching based on MAC.

As you can see these are structural changes.

Re: [PATCH net-next v12 1/5] net: Introduce generic failover module

2018-05-30 Thread Stephen Hemminger

On Fri, 25 May 2018 16:06:58 -0700
"Samudrala, Sridhar"  wrote:

> On 5/25/2018 3:38 PM, Stephen Hemminger wrote:
> > On Thu, 24 May 2018 09:55:13 -0700
> > Sridhar Samudrala  wrote:
> >  
> >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> index 03ed492c4e14..0f4ba52b641d 100644
> >> --- a/include/linux/netdevice.h
> >> +++ b/include/linux/netdevice.h
> >> @@ -1421,6 +1421,8 @@ struct net_device_ops {
> >>*   entity (i.e. the master device for bridged veth)
> >>* @IFF_MACSEC: device is a MACsec device
> >>* @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
> >> + * @IFF_FAILOVER: device is a failover master device
> >> + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
> >>*/
> >>   enum netdev_priv_flags {
> >>IFF_802_1Q_VLAN = 1<<0,
> >> @@ -1450,6 +1452,8 @@ enum netdev_priv_flags {
> >>IFF_PHONY_HEADROOM  = 1<<24,
> >>IFF_MACSEC  = 1<<25,
> >>IFF_NO_RX_HANDLER   = 1<<26,
> >> +  IFF_FAILOVER= 1<<27,
> >> +  IFF_FAILOVER_SLAVE  = 1<<28,
> >>   };  
> > Why is FAILOVER any different than other master/slave relationships.
> > I don't think you need to take up precious netdev flag bits for this.  
> 
> These are netdev priv flags.
> Jiri says that IFF_MASTER/IFF_SLAVE are bonding specific flags and cannot be used
> with other failover mechanisms. Team also doesn't use these flags and it has its own
> priv_flags.
> 

This change breaks userspace.
We have already worked with partners to ignore devices marked as IFF_SLAVE,
and IFF_SLAVE is visible to user space APIs.

NAK

Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-30 Thread Stephen Hemminger

On Thu, 24 May 2018 09:55:14 -0700
Sridhar Samudrala  wrote:

> Use the registration/notification framework supported by the generic
> failover infrastructure.
> 
> Signed-off-by: Sridhar Samudrala 

Why was this merged? It was never signed off by any of the netvsc maintainers,
and there were still issues unresolved.

There are also namespaces issues I am fixing and this breaks them.
Will start my patch set with a revert for this. Sorry


Re: [PATCH net-next v12 1/5] net: Introduce generic failover module

2018-05-28 Thread Stephen Hemminger

On Fri, 25 May 2018 16:06:58 -0700
"Samudrala, Sridhar"  wrote:

> On 5/25/2018 3:38 PM, Stephen Hemminger wrote:
> > On Thu, 24 May 2018 09:55:13 -0700
> > Sridhar Samudrala  wrote:
> >  
> >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> index 03ed492c4e14..0f4ba52b641d 100644
> >> --- a/include/linux/netdevice.h
> >> +++ b/include/linux/netdevice.h
> >> @@ -1421,6 +1421,8 @@ struct net_device_ops {
> >>*   entity (i.e. the master device for bridged veth)
> >>* @IFF_MACSEC: device is a MACsec device
> >>* @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
> >> + * @IFF_FAILOVER: device is a failover master device
> >> + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
> >>*/
> >>   enum netdev_priv_flags {
> >>IFF_802_1Q_VLAN = 1<<0,
> >> @@ -1450,6 +1452,8 @@ enum netdev_priv_flags {
> >>IFF_PHONY_HEADROOM  = 1<<24,
> >>IFF_MACSEC  = 1<<25,
> >>IFF_NO_RX_HANDLER   = 1<<26,
> >> +  IFF_FAILOVER= 1<<27,
> >> +  IFF_FAILOVER_SLAVE  = 1<<28,
> >>   };  
> > Why is FAILOVER any different than other master/slave relationships.
> > I don't think you need to take up precious netdev flag bits for this.  
> 
> These are netdev priv flags.
> Jiri says that IFF_MASTER/IFF_SLAVE are bonding specific flags and cannot be used
> with other failover mechanisms. Team also doesn't use these flags and it has its own
> priv_flags.
> 

They are already used by bonding and team.
I don't see why this can't reuse them.

Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-25 Thread Stephen Hemminger

On Fri, 25 May 2018 16:11:47 -0700
"Samudrala, Sridhar"  wrote:

> On 5/25/2018 3:34 PM, Stephen Hemminger wrote:
> > On Thu, 24 May 2018 09:55:14 -0700
> > Sridhar Samudrala  wrote:
> >  
> >> --- a/drivers/net/hyperv/Kconfig
> >> +++ b/drivers/net/hyperv/Kconfig
> >> @@ -2,5 +2,6 @@ config HYPERV_NET
> >>tristate "Microsoft Hyper-V virtual network driver"
> >>depends on HYPERV
> >>select UCS2_STRING
> >> +  select FAILOVER  
> > When I take a working kernel config, add the patches then do
> > make oldconfig
> >
> > It is not autoselecting FAILOVER, it prompts me for it. This means
> > if user says no then a non-working netvsc device is made.  
> 
> I see
> Generic failover module (FAILOVER) [M/y/?] (NEW)
> 
> So the user is given an option to either build as a Module or part of the
> kernel. 'n' is not an option.

With most libraries there is no prompt at all.

Re: [PATCH net-next v12 1/5] net: Introduce generic failover module

2018-05-25 Thread Stephen Hemminger

On Thu, 24 May 2018 09:55:13 -0700
Sridhar Samudrala  wrote:

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 03ed492c4e14..0f4ba52b641d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1421,6 +1421,8 @@ struct net_device_ops {
>   *   entity (i.e. the master device for bridged veth)
>   * @IFF_MACSEC: device is a MACsec device
>   * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
> + * @IFF_FAILOVER: device is a failover master device
> + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
>   */
>  enum netdev_priv_flags {
>   IFF_802_1Q_VLAN = 1<<0,
> @@ -1450,6 +1452,8 @@ enum netdev_priv_flags {
>   IFF_PHONY_HEADROOM  = 1<<24,
>   IFF_MACSEC  = 1<<25,
>   IFF_NO_RX_HANDLER   = 1<<26,
> + IFF_FAILOVER= 1<<27,
> + IFF_FAILOVER_SLAVE  = 1<<28,
>  };

Why is FAILOVER any different than other master/slave relationships?
I don't think you need to take up precious netdev flag bits for this.

Re: [PATCH net-next v12 1/5] net: Introduce generic failover module

2018-05-25 Thread Stephen Hemminger

On Thu, 24 May 2018 09:55:13 -0700
Sridhar Samudrala  wrote:


> + spin_lock(&failover_lock);

Since register is not in fast path, this should be a mutex?


> +int failover_slave_unregister(struct net_device *slave_dev)
> +{
> + struct net_device *failover_dev;
> + struct failover_ops *fops;
> +
> + if (!netif_is_failover_slave(slave_dev))
> + goto done;
> +
> + ASSERT_RTNL();
> +
> + failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
> + if (!failover_dev)
> + goto done;

Since the slave device must have a master device set already, why not use
that instead of searching by MAC address on unregister or link change?
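
A rough userspace sketch of that suggestion, assuming the slave keeps a reference to its master from the moment it is enslaved. The `sketch_dev` struct and the three helpers are hypothetical; in the kernel the analogous lookup would presumably be `netdev_master_upper_dev_get()` on the slave rather than a walk over a list keyed by MAC address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct net_device. */
struct sketch_dev {
    const char *name;
    struct sketch_dev *master;  /* NULL until the device is enslaved */
};

/* Record the relationship once, when the slave is registered. */
static void sketch_enslave(struct sketch_dev *slave, struct sketch_dev *master)
{
    slave->master = master;
}

/* Unregister and link-change paths follow the pointer; no MAC search. */
static struct sketch_dev *sketch_master_get(const struct sketch_dev *slave)
{
    return slave->master;
}

/* Drop the relationship when the slave goes away. */
static void sketch_release(struct sketch_dev *slave)
{
    slave->master = NULL;
}
```

The lookup is then constant-time and does not depend on the MAC address still matching by the time the slave is unregistered.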


Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-25 Thread Stephen Hemminger

On Thu, 24 May 2018 09:55:14 -0700
Sridhar Samudrala  wrote:

> --- a/drivers/net/hyperv/Kconfig
> +++ b/drivers/net/hyperv/Kconfig
> @@ -2,5 +2,6 @@ config HYPERV_NET
>   tristate "Microsoft Hyper-V virtual network driver"
>   depends on HYPERV
>   select UCS2_STRING
> + select FAILOVER

When I take a working kernel config, add the patches then do
make oldconfig

It is not autoselecting FAILOVER, it prompts me for it. This means
if user says no then a non-working netvsc device is made.

Re: [PATCH net-next v10 2/4] net: Introduce generic failover module

2018-05-07 Thread Stephen Hemminger

On Mon,  7 May 2018 15:10:44 -0700
Sridhar Samudrala  wrote:

> + if (netif_running(failover_dev)) {
> + err = dev_open(slave_dev);
> + if (err && (err != -EBUSY)) {
> + netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
> +slave_dev->name, err);
> + goto err_dev_open;
> + }
> + }
> +
> + netif_addr_lock_bh(failover_dev);
> + dev_uc_sync_multiple(slave_dev, failover_dev);
> + dev_uc_sync_multiple(slave_dev, failover_dev);
> + netif_addr_unlock_bh(failover_dev);
> +

The order of these is backwards; you want to sync addresses before bringing the
device up. Also, doing it this way does not allow udev/systemd the chance to
rename VF devices.

The complexity of this whole failover mechanism does not make life easier,
more reliable, or safer for netvsc. I thought that was the whole reason for
having common code.

Re: [PATCH net-next v10 2/4] net: Introduce generic failover module

2018-05-07 Thread Stephen Hemminger

On Mon,  7 May 2018 15:10:44 -0700
Sridhar Samudrala  wrote:

> +static struct net_device *net_failover_get_bymac(u8 *mac,
> +  struct net_failover_ops **ops)
> +{
> + struct net_device *failover_dev;
> + struct net_failover *failover;
> +
> + spin_lock(&net_failover_lock);
> + list_for_each_entry(failover, &net_failover_list, list) {
> + failover_dev = rtnl_dereference(failover->failover_dev);
> + if (ether_addr_equal(failover_dev->perm_addr, mac)) {
> + *ops = rtnl_dereference(failover->ops);
> + spin_unlock(&net_failover_lock);
> + return failover_dev;
> + }
> + }
> + spin_unlock(&net_failover_lock);
> + return NULL;
> +}

This is broken if non-ethernet devices such as Infiniband are present.
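
One plausible reading of this objection, sketched in userspace C: `ether_addr_equal()` compares exactly six bytes, while an IPoIB hardware address is 20 bytes long, so a lookup keyed on the permanent address needs to take the address length into account. The `sketch_hwdev` struct and `hw_addr_match()` helper are hypothetical; only the 6-byte, 20-byte, and 32-byte sizes mirror the kernel's `ETH_ALEN`, `INFINIBAND_ALEN`, and `MAX_ADDR_LEN`.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_ADDR_LEN 32  /* assumption: mirrors the kernel's MAX_ADDR_LEN */

/* Hypothetical, trimmed-down stand-in for struct net_device. */
struct sketch_hwdev {
    unsigned char perm_addr[SKETCH_MAX_ADDR_LEN];
    unsigned char addr_len;     /* 6 for Ethernet, 20 for IPoIB */
};

/* Match only when both the length and every address byte agree. */
static bool hw_addr_match(const struct sketch_hwdev *a,
                          const struct sketch_hwdev *b)
{
    if (a->addr_len != b->addr_len)
        return false;
    return memcmp(a->perm_addr, b->perm_addr, a->addr_len) == 0;
}
```

With a check like this, a device whose first six address bytes happen to equal an Ethernet MAC is not mistaken for a match.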

Re: [PATCH net-next v10 2/4] net: Introduce generic failover module

2018-05-07 Thread Stephen Hemminger

On Mon,  7 May 2018 15:10:44 -0700
Sridhar Samudrala  wrote:

> This provides a generic interface for paravirtual drivers to listen
> for netdev register/unregister/link change events from pci ethernet
> devices with the same MAC and takeover their datapath. The notifier and
> event handling code is based on the existing netvsc implementation.
> 
> It exposes 2 sets of interfaces to the paravirtual drivers.
> 1. For paravirtual drivers like virtio_net that use 3 netdev model, the
>the failover module provides interfaces to create/destroy additional
>master netdev and all the slave events are managed internally.
>   net_failover_create()
>   net_failover_destroy()
>A failover netdev is created that acts a master device and controls 2
>slave devices. The original virtio_net netdev is registered as 'standby'
>netdev and a passthru/vf device with the same MAC gets registered as
>'primary' netdev. Both 'standby' and 'failover' netdevs are associated
>with the same 'pci' device.  The user accesses the network interface via
>'failover' netdev. The 'failover' netdev chooses 'primary' netdev as
>default for transmits when it is available with link up and running.
> 2. For existing netvsc driver that uses 2 netdev model, no master netdev
>is created. The paravirtual driver registers each instance of netvsc
>as a 'failover' netdev  along with a set of ops to manage the slave
>events. There is no 'standby' netdev in this model. A passthru/vf device
>with the same MAC gets registered as 'primary' netdev.
>   net_failover_register()
>   net_failover_unregister()
> 
> Signed-off-by: Sridhar Samudrala 

You are conflating the net_failover device (3 device model) with
the generic network failover infrastructure into one file. There should be two
separate files, net/core/failover.c and drivers/net/failover.c, which split
the work into two parts (and act as a check for the API).



Re: [PATCH net-next v8 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-26 Thread Stephen Hemminger

On Thu, 26 Apr 2018 05:30:05 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Apr 25, 2018 at 05:08:37PM -0700, Stephen Hemminger wrote:
> > On Wed, 25 Apr 2018 16:59:28 -0700
> > Sridhar Samudrala  wrote:
> >   
> > > Use the registration/notification framework supported by the generic
> > > failover infrastructure.
> > > 
> > > Signed-off-by: Sridhar Samudrala   
> > 
> > NAK unless you prove this works on legacy distributions and with DPDK 18.05
> > without modification.  
> 
> It looks like it should work. What kind of proof are you looking for?
> 

I tried this with working Ubuntu 17 on WS2016.
It boots if the failover driver is configured in (as module).
But if the configuration has:

$ grep FAILOVER .config
# CONFIG_NET_FAILOVER is not set
CONFIG_MAY_USE_NET_FAILOVER=y

The netvsc driver fails on boot with:

[0.826447] hv_vmbus: registering driver hv_netvsc
[0.829616] scsi 0:0:0:0: Direct-Access Msft Virtual Disk 1.0  PQ: 0 ANSI: 5
[0.836291] input: Microsoft Vmbus HID-compliant Mouse as /devices/0006:045E:0621.0001/input/input1
[0.839139] hid-generic 0006:045E:0621.0001: input:  HID v0.01 Mouse [Microsoft Vmbus HID-compliant Mouse] on
[0.964897] hv_vmbus: probe failed for device 849a776e-8120-4e4a-9a36-7e3d95ac75b3 (-95)
[0.968039] hv_netvsc: probe of 849a776e-8120-4e4a-9a36-7e3d95ac75b3 failed with error -95
[1.112877] hv_vmbus: probe failed for device 53557f8e-057d-425b-9265-01c0fd7e273e (-95)
[1.116064] hv_netvsc: probe of 53557f8e-057d-425b-9265-01c0fd7e273e failed with error -95

The system has two virtual networks. eth0 is on vswitch for management.
eth1 is on vswitch with SR-IOV for performance tests.

You probably need to just put the failover part in net/core and select it.

It is trivial to get an evaluation version of Windows Server 2016 and set up
a Linux VM.
Please try it.
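
For reference, -95 in the log above is -EOPNOTSUPP on Linux. One plausible
way to end up there, assuming the patch provides a header stub for the
compiled-out case, is sketched below. HAVE_FAILOVER,
failover_register_sketch() and netvsc_probe_sketch() are hypothetical names
used only for illustration; they are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a CONFIG_* symbol. Leaving it undefined
 * models "# CONFIG_NET_FAILOVER is not set". */
/* #define HAVE_FAILOVER 1 */

struct failover_handle {
	int registered;
};

#ifdef HAVE_FAILOVER
static int failover_register_sketch(struct failover_handle *h)
{
	h->registered = 1;
	return 0;
}
#else
/* Stub used when the feature is compiled out (assumed behaviour). */
static int failover_register_sketch(struct failover_handle *h)
{
	(void)h;
	return -EOPNOTSUPP;
}
#endif

/* A probe routine that treats a registration failure as fatal, so the
 * stub's error code propagates and the device never comes up. */
static int netvsc_probe_sketch(struct failover_handle *h)
{
	int ret;

	assert(h != NULL);
	ret = failover_register_sketch(h);
	if (ret)
		return ret;
	return 0;
}
```

If that is what happens, building the generic part unconditionally (or
selecting it from the driver) removes the stub path altogether.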

Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-25 Thread Stephen Hemminger

On Wed, 25 Apr 2018 15:57:57 -0700
Siwei Liu  wrote:

> >
> > I think ideally the infrastructure should support flexible matching of
> > NICs - netvsc is already reported to be moving to some kind of serial
> > address.
> >  
> As Stephen said, Hyper-V supports the serial UUID thing from day-one.
> It's just the Linux netvsc guest driver itself does not leverage that
> ID from the very beginning.
> 
> Regards,
> -Siwei

I am working on that.  The problem is that it requires some messy work
to go from VF netdevice back to PCI device and from there to the PCI hyperv
host infrastructure to find the serial number.

I was hoping that the serial number would also match the concept of PCI Express
device serial number. But that is a completely different ID :-( 
The PCI-E serial number is a hardware serial number more like MAC address.
The Hyper-V serial number is more like PCI slot value.

Re: [PATCH net-next v8 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-25 Thread Stephen Hemminger

On Wed, 25 Apr 2018 16:59:28 -0700
Sridhar Samudrala  wrote:

> Use the registration/notification framework supported by the generic
> failover infrastructure.
> 
> Signed-off-by: Sridhar Samudrala 

NAK unless you prove this works on legacy distributions and with DPDK 18.05
without modification.
 


Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-23 Thread Stephen Hemminger

On Tue, 24 Apr 2018 04:42:22 +0300
"Michael S. Tsirkin"  wrote:

> On Mon, Apr 23, 2018 at 06:25:03PM -0700, Stephen Hemminger wrote:
> > On Mon, 23 Apr 2018 12:44:39 -0700
> > Siwei Liu  wrote:
> >   
> > > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin  
> > > wrote:  
> > > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
> > > >> On Mon, 23 Apr 2018 20:24:56 +0300
> > > >> "Michael S. Tsirkin"  wrote:
> > > >>
> > > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:   
> > > >> >  
> > > >> > > > >
> > > >> > > > >I will NAK patches to change to common code for netvsc 
> > > >> > > > >especially the
> > > >> > > > >three device model.  MS worked hard with distro vendors to 
> > > >> > > > >support transparent
> > > >> > > > >mode, and we really can't have a new model; or do backport.
> > > >> > > > >
> > > >> > > > >Plus, DPDK is now dependent on existing model.
> > > >> > > >
> > > >> > > > Sorry, but nobody here cares about dpdk or other similar 
> > > >> > > > oddities.
> > > >> > >
> > > >> > > The network device model is a userspace API, and DPDK is a 
> > > >> > > userspace application.
> > > >> >
> > > >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> > > >> > AFAIK it's normally banging device registers directly.
> > > >> >
> > > >> > > You can't go breaking userspace even if you don't like the 
> > > >> > > application.
> > > >> >
> > > >> > Could you please explain how is the proposed patchset breaking
> > > >> > userspace? Ignoring DPDK for now, I don't think it changes the 
> > > >> > userspace
> > > >> > API at all.
> > > >> >
> > > >>
> > > >> The DPDK has a device driver vdev_netvsc which scans the Linux network 
> > > >> devices
> > > >> to look for Linux netvsc device and the paired VF device and setup the
> > > >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) 
> > > >> instance
> > > >> and sets up TAP support over the Linux netvsc device as well as the 
> > > >> Mellanox
> > > >> VF device.
> > > >>
> > > >> So it depends on existing 2 device model. You can't go to a 3 device 
> > > >> model
> > > >> or start hiding devices from userspace.
> > > >
> > > > Okay so how does the existing patch break that? IIUC does not go to
> > > > a 3 device model since netvsc calls failover_register directly.
> > > >
> > > >> Also, I am working on associating netvsc and VF device based on serial 
> > > >> number
> > > >> rather than MAC address. The serial number is how Windows works now, 
> > > >> and it makes
> > > >> sense for Linux and Windows to use the same mechanism if possible.
> > > >
> > > > Maybe we should support same for virtio ...
> > > > Which serial do you mean? From vpd?
> > > >
> > > > I guess you will want to keep supporting MAC for old hypervisors?  
> > 
> > The serial number has always been in the hypervisor since original support 
> > of SR-IOV
> > in WS2008.  So no backward compatibility special cases would be needed.  
> 
> Is that a serial from real hardware or a hypervisor thing?
> 
> 

It is a hypervisor thing in the PCI hyperv code and the hyperv Netvsc
interface. It might also be in the PCI spec, but the value in Hyper-V is
being generated by the host.

Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-23 Thread Stephen Hemminger

On Mon, 23 Apr 2018 23:06:55 +0300
"Michael S. Tsirkin"  wrote:

> On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
> > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin  
> > wrote:  
> > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:  
> > >> On Mon, 23 Apr 2018 20:24:56 +0300
> > >> "Michael S. Tsirkin"  wrote:
> > >>  
> > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:  
> > >> > > > >
> > >> > > > >I will NAK patches to change to common code for netvsc especially 
> > >> > > > >the
> > >> > > > >three device model.  MS worked hard with distro vendors to 
> > >> > > > >support transparent
> > >> > > > >mode, and we really can't have a new model; or do backport.
> > >> > > > >
> > >> > > > >Plus, DPDK is now dependent on existing model.  
> > >> > > >
> > >> > > > Sorry, but nobody here cares about dpdk or other similar oddities. 
> > >> > > >  
> > >> > >
> > >> > > The network device model is a userspace API, and DPDK is a userspace 
> > >> > > application.  
> > >> >
> > >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> > >> > AFAIK it's normally banging device registers directly.
> > >> >  
> > >> > > You can't go breaking userspace even if you don't like the 
> > >> > > application.  
> > >> >
> > >> > Could you please explain how is the proposed patchset breaking
> > >> > userspace? Ignoring DPDK for now, I don't think it changes the 
> > >> > userspace
> > >> > API at all.
> > >> >  
> > >>
> > >> The DPDK has a device driver vdev_netvsc which scans the Linux network 
> > >> devices
> > >> to look for Linux netvsc device and the paired VF device and setup the
> > >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) 
> > >> instance
> > >> and sets up TAP support over the Linux netvsc device as well as the 
> > >> Mellanox
> > >> VF device.
> > >>
> > >> So it depends on existing 2 device model. You can't go to a 3 device 
> > >> model
> > >> or start hiding devices from userspace.  
> > >
> > > Okay so how does the existing patch break that? IIUC does not go to
> > > a 3 device model since netvsc calls failover_register directly.
> > >  
> > >> Also, I am working on associating netvsc and VF device based on serial 
> > >> number
> > >> rather than MAC address. The serial number is how Windows works now, and 
> > >> it makes
> > >> sense for Linux and Windows to use the same mechanism if possible.  
> > >
> > > Maybe we should support same for virtio ...
> > > Which serial do you mean? From vpd?
> > >
> > > I guess you will want to keep supporting MAC for old hypervisors?
> > >
> > > It all seems like a reasonable thing to support in the generic core.  
> > 
> > That's the reason why I chose explicit identifier rather than rely on
> > MAC address to bind/pair a device. MAC address can change. Even if it
> > can't, malicious guest user can fake MAC address to skip binding.
> > 
> > -Siwei  
> 
> Address should be sampled at device creation to prevent this
> kind of hack. Not that it buys the malicious user much:
> if you can poke at MAC addresses you probably already can
> break networking.

On Hyper-V, the guest can't really change the MAC address if SR-IOV is enabled.
Also, Linux has a permanent Ethernet address in the netdev, which is what
should be used to avoid the user messing with the device.
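
A minimal sketch of that idea, assuming a simplified model of a netdev with a
user-changeable address and a permanent one (the kernel's struct net_device
carries both as dev_addr and perm_addr). struct netdev_sketch and
vf_matches() are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Simplified model: dev_addr can be changed by the user at runtime,
 * perm_addr is fixed when the device is created. */
struct netdev_sketch {
	unsigned char dev_addr[ETH_ALEN];
	unsigned char perm_addr[ETH_ALEN];
};

static bool same_mac(const unsigned char *a, const unsigned char *b)
{
	return memcmp(a, b, ETH_ALEN) == 0;
}

/* Pair a VF with its paravirtual device by permanent address only, so a
 * later change of dev_addr cannot break or dodge the pairing. */
static bool vf_matches(const struct netdev_sketch *pv,
		       const struct netdev_sketch *vf)
{
	assert(pv != NULL && vf != NULL);
	return same_mac(pv->perm_addr, vf->perm_addr);
}
```

With this, a guest that rewrites dev_addr on the VF still pairs with the same
paravirtual device, because only perm_addr is compared.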

Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-23 Thread Stephen Hemminger

On Mon, 23 Apr 2018 12:44:39 -0700
Siwei Liu  wrote:

> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin  wrote:
> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:  
> >> On Mon, 23 Apr 2018 20:24:56 +0300
> >> "Michael S. Tsirkin"  wrote:
> >>  
> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:  
> >> > > > >
> >> > > > >I will NAK patches to change to common code for netvsc especially 
> >> > > > >the
> >> > > > >three device model.  MS worked hard with distro vendors to support 
> >> > > > >transparent
> >> > > > >mode, and we really can't have a new model; or do backport.
> >> > > > >
> >> > > > >Plus, DPDK is now dependent on existing model.  
> >> > > >
> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.  
> >> > >
> >> > > The network device model is a userspace API, and DPDK is a userspace 
> >> > > application.  
> >> >
> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> >> > AFAIK it's normally banging device registers directly.
> >> >  
> >> > > You can't go breaking userspace even if you don't like the 
> >> > > application.  
> >> >
> >> > Could you please explain how is the proposed patchset breaking
> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> >> > API at all.
> >> >  
> >>
> >> The DPDK has a device driver vdev_netvsc which scans the Linux network 
> >> devices
> >> to look for Linux netvsc device and the paired VF device and setup the
> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
> >> and sets up TAP support over the Linux netvsc device as well as the 
> >> Mellanox
> >> VF device.
> >>
> >> So it depends on existing 2 device model. You can't go to a 3 device model
> >> or start hiding devices from userspace.  
> >
> > Okay so how does the existing patch break that? IIUC does not go to
> > a 3 device model since netvsc calls failover_register directly.
> >  
> >> Also, I am working on associating netvsc and VF device based on serial 
> >> number
> >> rather than MAC address. The serial number is how Windows works now, and 
> >> it makes
> >> sense for Linux and Windows to use the same mechanism if possible.  
> >
> > Maybe we should support same for virtio ...
> > Which serial do you mean? From vpd?
> >
> > I guess you will want to keep supporting MAC for old hypervisors?

The serial number has always been in the hypervisor since the original support
of SR-IOV in WS2008.  So no backward compatibility special cases would be
needed.



Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-23 Thread Stephen Hemminger

On Mon, 23 Apr 2018 20:24:56 +0300
"Michael S. Tsirkin"  wrote:

> On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
> > > >
> > > >I will NAK patches to change to common code for netvsc especially the
> > > >three device model.  MS worked hard with distro vendors to support 
> > > >transparent
> > > >mode, and we really can't have a new model; or do backport.
> > > >
> > > >Plus, DPDK is now dependent on existing model.
> > > 
> > > Sorry, but nobody here cares about dpdk or other similar oddities.  
> > 
> > The network device model is a userspace API, and DPDK is a userspace 
> > application.  
> 
> It is userspace but are you sure dpdk is actually poking at netdevs?
> AFAIK it's normally banging device registers directly.
> 
> > You can't go breaking userspace even if you don't like the application.  
> 
> Could you please explain how is the proposed patchset breaking
> userspace? Ignoring DPDK for now, I don't think it changes the userspace
> API at all.
> 

DPDK has a device driver, vdev_netvsc, which scans the Linux network devices
to look for the Linux netvsc device and the paired VF device and sets up the
DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
and sets up TAP support over the Linux netvsc device as well as the Mellanox
VF device.

So it depends on the existing 2 device model. You can't go to a 3 device model
or start hiding devices from userspace.

Also, I am working on associating netvsc and VF device based on serial number
rather than MAC address. The serial number is how Windows works now, and it
makes sense for Linux and Windows to use the same mechanism if possible.


Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-23 Thread Stephen Hemminger

On Fri, 20 Apr 2018 18:00:58 +0200
Jiri Pirko  wrote:

> Fri, Apr 20, 2018 at 05:28:02PM CEST, step...@networkplumber.org wrote:
> >On Thu, 19 Apr 2018 18:42:04 -0700
> >Sridhar Samudrala  wrote:
> >  
> >> Use the registration/notification framework supported by the generic
> >> failover infrastructure.
> >> 
> >> Signed-off-by: Sridhar Samudrala   
> >
> >Do what you want to other devices but leave netvsc alone.
> >Adding these failover ops does not reduce the code size, and really is
> >no benefit.  The netvsc device driver needs to be backported to several
> >other distributions and doing this makes that harder.  
> 
> We should not care about the backport burden when we are trying to make
> things right. And things are not right. The current netvsc approach is
> just plain wrong shortcut. It should have been done in a generic way
> from the very beginning. We are just trying to fix this situation.
> 
> Moreover, I believe that part of the fix is to convert netvsc to 3
> netdev solution too. 2 netdev model is wrong.
> 
> 
> >
> >I will NAK patches to change to common code for netvsc especially the
> >three device model.  MS worked hard with distro vendors to support 
> >transparent
> >mode, and we really can't have a new model; or do backport.
> >
> >Plus, DPDK is now dependent on existing model.  
> 
> Sorry, but nobody here cares about dpdk or other similar oddities.

The network device model is a userspace API, and DPDK is a userspace
application.  You can't go breaking userspace even if you don't like the
application.



Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-20 Thread Stephen Hemminger

On Thu, 19 Apr 2018 18:42:04 -0700
Sridhar Samudrala  wrote:

> Use the registration/notification framework supported by the generic
> failover infrastructure.
> 
> Signed-off-by: Sridhar Samudrala 

Do what you want to other devices but leave netvsc alone.
Adding these failover ops does not reduce the code size, and really is
no benefit.  The netvsc device driver needs to be backported to several
other distributions and doing this makes that harder.

I will NAK patches to change to common code for netvsc especially the
three device model.  MS worked hard with distro vendors to support transparent
mode, and we really can't have a new model; or do backport.

Plus, DPDK is now dependent on existing model.

Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-10 Thread Stephen Hemminger

On Tue, 10 Apr 2018 16:44:47 -0700
Siwei Liu  wrote:

> On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin  wrote:
> > On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:  
> >> On Tue, 10 Apr 2018 11:59:50 -0700
> >> Sridhar Samudrala  wrote:
> >>  
> >> > Use the registration/notification framework supported by the generic
> >> > bypass infrastructure.
> >> >
> >> > Signed-off-by: Sridhar Samudrala 
> >> > ---  
> >>
> >> Thanks for doing this.  Your current version has couple show stopper
> >> issues.
> >>
> >> First, the slave device is instantly taking over the slave.
> >> This doesn't allow udev/systemd to do its device rename of the slave
> >> device. Netvsc uses a delayed work to workaround this.  
> >
> > Interesting. Does this mean udev must act within a specific time window
> > then?  
> 
> Sighs, lots of hacks. Why propagating this from driver to a common
> module. We really need a clean solution.
> 

I had a patch to wait for udev to do the rename and go from there
but davem rejected it.

Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-10 Thread Stephen Hemminger

On Tue, 10 Apr 2018 11:59:50 -0700
Sridhar Samudrala  wrote:

> Use the registration/notification framework supported by the generic
> bypass infrastructure.
> 
> Signed-off-by: Sridhar Samudrala 
> ---

Thanks for doing this.  Your current version has a couple of show-stopper
issues.

First, the slave device is instantly taking over the slave.
This doesn't allow udev/systemd to do its device rename of the slave
device. Netvsc uses a delayed work to work around this.

Secondly, the select queue needs to call queue selection in VF.
The bonding/teaming logic doesn't work well for UDP flows.
Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
fixed this performance problem.

Lastly, more indirection is bad in current climate.

I am not completely averse to this but it needs to be fast, simple
and completely transparent.
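
To illustrate the second point, here is a rough sketch of what "defer queue
selection to VF" can look like: the upper device asks the VF for a transmit
queue when a VF is present and only falls back to its own hash otherwise.
All names (struct dev_sketch, vf_policy_sketch(), upper_select_queue()) are
hypothetical and the logic is a simplified model, not the code from commit
b3bf5666a510.

```c
#include <assert.h>
#include <stddef.h>

struct dev_sketch;
typedef unsigned int (*select_queue_fn)(const struct dev_sketch *dev,
					unsigned int flow_hash);

struct dev_sketch {
	unsigned int num_tx_queues;
	select_queue_fn select_queue;	/* driver-specific policy, may be NULL */
	const struct dev_sketch *vf;	/* paired VF, NULL when absent */
};

/* Generic fallback: spread flows over the device's queues by hash. */
static unsigned int hash_select(const struct dev_sketch *dev,
				unsigned int flow_hash)
{
	return flow_hash % dev->num_tx_queues;
}

/* Example of a VF-specific policy that differs from the generic hash. */
static unsigned int vf_policy_sketch(const struct dev_sketch *dev,
				     unsigned int flow_hash)
{
	return (flow_hash >> 8) % dev->num_tx_queues;
}

/* Upper device: defer to the VF's own selection when a VF is present. */
static unsigned int upper_select_queue(const struct dev_sketch *upper,
				       unsigned int flow_hash)
{
	const struct dev_sketch *vf = upper->vf;

	assert(upper->num_tx_queues > 0);
	if (vf && vf->select_queue)
		return vf->select_queue(vf, flow_hash);
	if (vf)
		return hash_select(vf, flow_hash);
	return hash_select(upper, flow_hash);
}
```

The point of the design is that the device doing the actual transmit keeps
its own flow-to-queue mapping instead of inheriting one computed for the
upper device's queue count.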

Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Stephen Hemminger

On Mon, 9 Apr 2018 15:30:42 -0700
Siwei Liu  wrote:

> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn  wrote:
> >> No, implementation wise I'd avoid changing the class on the fly. What
> >> I'm looking to is a means to add a secondary class or class aliasing
> >> mechanism for netdevs that allows mapping for a kernel device
> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine
> >> creating symlinks between these two namespaces as an analogy. All
> >> userspace visible netdevs today will have both a kernel name and a
> >> userspace visible name, having one (/class/net) referencing the other
> >> (/class/net-kernel) in its own namespace. The newly introduced
> >> IFF_AUTO_MANAGED device will have a kernel name only
> >> (/class/net-kernel). As a result, the existing applications using
> >> /class/net don't break, while we're adding the kernel namespace that
> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
> >> at all.  
> >
> > My gut feeling is this whole scheme will not fly. You really should be
> > talking to GregKH.  
> 
> Will do. Before spreading it out loudly I'd run it within netdev to
> clarify the need for why not exposing the lower netdevs is critical
> for cloud service providers in the face of introducing a new feature,
> and we are not hiding anything but exposing it in a way that don't
> break existing userspace applications while introducing feature is
> possible with the limitation of keeping old userspace still.
> 
> >
> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
> > A device can start out as a normal device, and will change to being
> > automatic later, when the user on top of it probes.  
> 
> Sure. In whatever form it's still a netdev, and changing the namespace
> should be more dynamic than changing the class.
> 
> Thanks a lot,
> -Siwei
> 
> >
> > Andrew  

Also, remember for netdevs /sys is really a third-class API.
The primary APIs are netlink and ioctl. Also why not use existing
network namespaces rather than inventing a new abstraction?

Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-04 Thread Stephen Hemminger

On Wed, 4 Apr 2018 11:37:52 -0600
David Ahern  wrote:

> Networking vendors have out of tree kernel modules. Those modules use a
> netdev (call it a master netdev, a control netdev, cpu port, whatever)
> to pull packets from the ASIC and deliver to virtual netdevices
> representing physical ports. The master netdev should not be mucked with
> by a user. It should be ignored by certain s/w with lldpd as just an
> *example*.

Sorry, the Linux kernel maintainers have a clear, well-defined attitude
about out-of-tree kernel modules...

Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-03 Thread Stephen Hemminger

On Sun,  1 Apr 2018 05:13:09 -0400
Si-Wei Liu  wrote:

> Hidden netdevice is not visible to userspace such that
> typical network utilities e.g. ip, ifconfig et al.,
> cannot sense its existence or configure it. Internally
> hidden netdev may associate with an upper level netdev
> that userspace has access to. Although userspace cannot
> manipulate the lower netdev directly, user may control
> or configure the underlying hidden device through the
> upper-level netdev. For identification purpose, the
> kobject for hidden netdev still presents in the sysfs
> hierarchy, however, no uevent message will be generated
> when the sysfs entry is created, modified or destroyed.
> 
> For that end, a separate namescope needs to be carved
> out for IFF_HIDDEN netdevs. As of now netdev name that
> starts with colon i.e. ':' is invalid in userspace,
> since socket ioctls such as SIOCGIFCONF use ':' as the
> separator for ifname. The absence of namescope started
> with ':' can rightly be used as the namescope for
> the kernel-only IFF_HIDDEN netdevs.
> 
> Signed-off-by: Si-Wei Liu 
> ---

I understand the use case. I proposed using . as a prefix before
but that ran into resistance. Using colon seems worse.

Rather than playing with names and all the issues that can cause,
why not make it an attribute flag of the device in netlink?

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-26 Thread Stephen Hemminger

On Mon, 26 Feb 2018 08:19:24 +0100
Jiri Pirko  wrote:

> Sat, Feb 24, 2018 at 12:59:04AM CET, step...@networkplumber.org wrote:
> >On Thu, 22 Feb 2018 13:30:12 -0800
> >Alexander Duyck  wrote:
> >  
> >> > Again, I understand your motivation. Yet I don't like your solution.
> >> > But if the decision is made to do this in-driver bonding. I would like
> >> > to see it being done some generic way:
> >> > 1) share the same "in-driver bonding core" code with netvsc
> >> >put to net/core.
> >> > 2) the "in-driver bonding core" will strictly limit the functionality,
> >> >like active-backup mode only, one vf, one backup, vf netdev type
> >> >check (so noone could enslave a tap or anything else)
> >> > If user would need something more, he should employ team/bond.
> >
> >Sharing would be good, but netvsc world would really like to only have
> >one visible network device.  
> 
> Why do you mind? All would be the same, there would be just another
> netdevice unused by the vm user (same as the vf netdev).
> 

I mind because our requirement is no changes to userspace.
No special udev rules, no bonding script, no setup.

Things like cloudinit running on current distros expect to see a single
eth0.  The VF device showing up can also be an issue because distros have
stupid rules like Network Manager trying to start DHCP on every interface.
We deal with that now by doing stuff like udev rules to get it to stop
but that is still causing user errors.

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Stephen Hemminger

(pruned to reduce thread)

On Wed, 21 Feb 2018 16:17:19 -0800
Alexander Duyck  wrote:

> >>> FWIW two solutions that immediately come to mind is to export "backup"
> >>> as phys_port_name of the backup virtio link and/or assign a name to the
> >>> master like you are doing already.  I think team uses team%d and bond
> >>> uses bond%d, soft naming of master devices seems quite natural in this
> >>> case.  
> >>
> >> I figured I had overlooked something like that.. Thanks for pointing
> >> this out. Okay so I think the phys_port_name approach might resolve
> >> the original issue. If I am reading things correctly what we end up
> >> with is the master showing up as "ens1" for example and the backup
> >> showing up as "ens1nbackup". Am I understanding that right?
> >>
> >> The problem with the team/bond%d approach is that it creates a new
> >> netdevice and so it would require guest configuration changes.
> >>  
> >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
> >>> link is quite neat.  
> >>
> >> I agree. For non-"backup" virtio_net devices would it be okay for us to
> >> just return -EOPNOTSUPP? I assume it would be and that way the legacy
> >> behavior could be maintained although the function still exists.
> >>  
>  - When the 'active' netdev is unplugged OR not present on a destination
>    system after live migration, the user will see 2 virtio_net netdevs.  
> >>>
> >>> That's necessary and expected, all configuration applies to the master
> >>> so master must exist.  
> >>
> >> With the naming issue resolved this is the only item left outstanding.
> >> This becomes a matter of form vs function.
> >>
> >> The main complaint about the "3 netdev" solution is a bit confusing to
> >> have the 2 netdevs present if the VF isn't there. The idea is that
> >> having the extra "master" netdev there if there isn't really a bond is
> >> a bit ugly.  
> >
> > Is it this uglier in terms of user experience rather than
> > functionality? I don't want it dynamically changed between 2-netdev
> > and 3-netdev depending on the presence of VF. That gets back to my
> > original question and suggestion earlier: why not just hide the lower
> > netdevs from udev renaming and such? Which important observability
> > benefits users may get if exposing the lower netdevs?
> >
> > Thanks,
> > -Siwei  
> 
> The only real advantage to a 2 netdev solution is that it looks like
> the netvsc solution, however it doesn't behave like it since there are
> some features like XDP that may not function correctly if they are
> left enabled in the virtio_net interface.
> 
> As far as functionality the advantage of not hiding the lower devices
> is that they are free to be managed. The problem with pushing all of
> the configuration into the upper device is that you are limited to the
> intersection of the features of the lower devices. This can be
> limiting for some setups as some VFs support things like more queues,
> or better interrupt moderation options than others so trying to make
> everything work with one config would be ugly.
> 


Let's not make XDP the blocker for doing the best solution
from the end user's point of view. XDP is just yet another offload
thing which needs to be handled.  The current backup device solution
used in netvsc doesn't handle the full range of offload options
(things like flow direction, DCB, etc); no one but the HW vendors
seems to care.

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Stephen Hemminger

On Thu, 22 Feb 2018 13:30:12 -0800
Alexander Duyck  wrote:

> > Again, I understand your motivation. Yet I don't like your solution.
> > But if the decision is made to do this in-driver bonding. I would like
> > to see it being done some generic way:
> > 1) share the same "in-driver bonding core" code with netvsc
> >put to net/core.
> > 2) the "in-driver bonding core" will strictly limit the functionality,
> >like active-backup mode only, one vf, one backup, vf netdev type
> >check (so noone could enslave a tap or anything else)
> > If user would need something more, he should employ team/bond.  

Sharing would be good, but netvsc world would really like to only have
one visible network device.

Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-28 Thread Stephen Hemminger

On Fri, 26 Jan 2018 18:30:03 -0800
Jakub Kicinski  wrote:

> On Fri, 26 Jan 2018 15:30:35 -0800, Samudrala, Sridhar wrote:
> > On 1/26/2018 2:47 PM, Jakub Kicinski wrote:  
> > > On Sat, 27 Jan 2018 00:14:20 +0200, Michael S. Tsirkin wrote:
> > >> On Fri, Jan 26, 2018 at 01:46:42PM -0800, Siwei Liu wrote:
> >  and the VM is not expected to do any tuning/optimizations on the VF 
> >  driver
> >  directly,
> >  i think the current patch that follows the netvsc model of 2 
> >  netdevs(virtio
> >  and vf) should
> >  work fine.
> > >>> OK. For your use case that's fine. But that's too specific scenario
> > >>> with lots of restrictions IMHO, perhaps very few users will benefit
> > >>> from it, I'm not sure. If you're unwilling to move towards it, we'd
> > >>> take this one and come back with a generic solution that is able to
> > >>> address general use cases for VF/PT live migration .
> > >> I think that's a fine approach. Scratch your own itch!  I imagine a very
> > >> generic virtio-switchdev providing host routing info to guests could
> > >> address lots of usecases. A driver could bind to that one and enslave
> > >> arbitrary other devices.  Sounds reasonable.
> > >>
> > >> But given the fundamental idea of a failover was floated at least as
> > >> early as 2013, and made 0 progress since precisely because it kept
> > >> trying to address more and more features, and given netvsc is already
> > >> using the basic solution with some success, I'm not inclined to block
> > >> this specific effort waiting for the generic one.
> > > I think there is an agreement that the extra netdev will be useful for
> > > more advanced use cases, and is generally preferable.  What is the
> > > argument for not doing that from the start?  If it was made I must have
> > > missed it.  Is it just unwillingness to write the extra 300 lines of
> > > code?  Sounds like a pretty weak argument when adding kernel ABI is at
> > > stake...
> > 
> > I am still not clear on the need for the extra netdev created by 
> > virtio_net. The only advantage i can see is that the stats can be
> > broken between VF and virtio datapaths compared to the aggregated
> > stats on virtio netdev as seen with the 2 netdev approach.  
> 
> Maybe you're not convinced but multiple arguments were made.
> 
> > With 2 netdev model, any VM image that has a working network 
> > configuration will transparently get VF based acceleration without
> > any changes.   
> 
> Nothing happens transparently.  Things may happen automatically.  The
> VF netdev doesn't disappear with netvsc.  The PV netdev transforms into
> something it did not use to be.  And configures and reports some
> information from the PV (e.g. speed) but PV doesn't pass traffic any
> longer.
> 
> > 3 netdev model breaks this configuration starting with the creation
> > and naming of the 2 devices to udev needing to be aware of master and
> > slave virtio-net devices.  
> 
> I don't understand this comment.  There is one virtio-net device and
> one "virtio-bond" netdev.  And user space has to be aware of the special
> automatic arrangement anyway, because it can't touch the VF.  It
> doesn't make any difference whether it ignores the VF or PV and VF.
> It simply can't touch the slaves, no matter how many there are.
> 
> > Also, from a user experience point of view, loading a virtio-net with
> > BACKUP feature enabled will now show 2 virtio-net netdevs.  
> 
> One virtio-net and one virtio-bond, which represents what's happening.
> 
> > For live migration with advanced usecases that Siwei is suggesting, i 
> > think we need a new driver with a new device type that can track the
> > VF specific feature settings even when the VF driver is unloaded.  

I see no added value in the 3 netdev model; there is no need for a bond
device.

Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit

2018-01-22 Thread Stephen Hemminger

On Mon, 22 Jan 2018 15:27:40 -0800
"Samudrala, Sridhar"  wrote:

> On 1/22/2018 1:31 PM, Michael S. Tsirkin wrote:
> > On Wed, Jan 17, 2018 at 01:49:58PM -0800, Alexander Duyck wrote:  
> >> On Wed, Jan 17, 2018 at 11:57 AM, Michael S. Tsirkin  
> >> wrote:  
> >>> On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote:  
> >>>>
> >>>> On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:
> >>>>> On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:
> >>>>>> On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala wrote:
> >>>>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
> >>>>>>> act as a backup for another device with the same MAC address.
> >>>>>>>
> >>>>>>> Signed-off-by: Sridhar Samudrala
> >>>>>>> ---
> >>>>>>>   drivers/net/virtio_net.c        | 2 +-
> >>>>>>>   include/uapi/linux/virtio_net.h | 3 +++
> >>>>>>>   2 files changed, 4 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>>>>>> index 12dfc5fee58e..f149a160a8c5 100644
> >>>>>>> --- a/drivers/net/virtio_net.c
> >>>>>>> +++ b/drivers/net/virtio_net.c
> >>>>>>> @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
> >>>>>>>   VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> >>>>>>>   VIRTIO_NET_F_CTRL_MAC_ADDR, \
> >>>>>>>   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> >>>>>>> -   VIRTIO_NET_F_SPEED_DUPLEX
> >>>>>>> +   VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
> >>>>>>>
> >>>>>>>   static unsigned int features[] = {
> >>>>>>>   VIRTNET_FEATURES,
> >>>>>>> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> >>>>>>> index 5de6ed37695b..c7c35fd1a5ed 100644
> >>>>>>> --- a/include/uapi/linux/virtio_net.h
> >>>>>>> +++ b/include/uapi/linux/virtio_net.h
> >>>>>>> @@ -57,6 +57,9 @@
> >>>>>>>   * Steering */
> >>>>>>>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
> >>>>>>>
> >>>>>>> +#define VIRTIO_NET_F_BACKUP  62 /* Act as backup for another device
> >>>>>>> +                                * with the same MAC.
> >>>>>>> +                                */
> >>>>>>>   #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
> >>>>>>>
> >>>>>>>   #ifndef VIRTIO_NET_NO_LEGACY
> >>>>>> I'm not a huge fan of the name "backup" since that implies that the
> >>>>>> Virtio interface is only used if the VF is not present, and there are
> >>>>>> multiple instances such as dealing with east/west or
> >>>>>> broadcast/multicast traffic where it may be desirable to use the
> >>>>>> para-virtual interface rather than deal with PCI overhead/bottleneck
> >>>>>> to send the packet.
> >>>>> Right now hypervisors mostly expect that yes, only one at a time is
> >>>>> used.  E.g. if you try to do multicast sending packets on both VF and
> >>>>> virtio then you will end up with two copies of each packet.
> >>>> I think we want to use only 1 interface to send out any packet. In case of
> >>>> broadcast/multicasts it would be an optimization to send them via virtio and
> >>>> this patch series adds that optimization.
> >>> Right that's what I think we should rather avoid for now.
> >>>
> >>> It's *not* an optimization if there's a single VM on this host,
> >>> or if a specific multicast group does not have any VMs on same
> >>> host.  
> >> Agreed. In my mind this is something that is controlled by the
> >> pass-thru interface once it is enslaved.  
> > It would be pretty tricky to control through the PT
> > interface since a PT interface pretends to be a physical
> > device, which has no concept of VMs.
> >  
> >>> I'd rather we just sent everything out on the PT if that's
> >>> there. The reason we have virtio in the picture is just so
> >>> we can migrate without downtime.  
> >> I wasn't saying we do that in all cases. That would be something that
> >> would have to be decided by the pass-thru interface. Ideally the
> >> virtio would provide just enough information to get itself into the
> >> bond and I see this being the mechanism for it to do so. From there
> >> the complexity mostly lies in the pass-thru interface to configure the
> >> correct transmit modes if for example you have multiple pass-thru
> >> interfaces or a more complex traffic setup due to things like
> >> SwitchDev.
> >>
> >> In my mind we go the bonding route and there are few use cases for all
> >> of this. First is the backup case that is being addressed here. That
> >> becomes your basic "copy netvsc" approach for this which would be
> >> default. It is how we would handle basic pass-thru back-up paths. If
> >> the host decides to send multicast/broadcast traffic from the host up
> >> through it that is a host side decision. I am okay with our default
> >> transmit behavior from the guest being

Re: [PATCH net-next] virtio_net: Add ethtool stats

2017-12-24 Thread Stephen Hemminger

On Wed, 20 Dec 2017 13:40:37 +0900
Toshiaki Makita  wrote:

> +
> +static const struct virtnet_gstats virtnet_gstrings_stats[] = {
> + { "rx_packets", VIRTNET_NETDEV_STAT(rx_packets) },
> + { "tx_packets", VIRTNET_NETDEV_STAT(tx_packets) },
> + { "rx_bytes",   VIRTNET_NETDEV_STAT(rx_bytes) },
> + { "tx_bytes",   VIRTNET_NETDEV_STAT(tx_bytes) },
> + { "rx_dropped", VIRTNET_NETDEV_STAT(rx_dropped) },
> + { "rx_length_errors",   VIRTNET_NETDEV_STAT(rx_length_errors) },
> + { "rx_frame_errors",VIRTNET_NETDEV_STAT(rx_frame_errors) },
> + { "tx_dropped", VIRTNET_NETDEV_STAT(tx_dropped) },
> + { "tx_fifo_errors", VIRTNET_NETDEV_STAT(tx_fifo_errors) },
> +};
> +

Please do not merge pre-existing global stats into ethtool.
It just duplicates existing functionality.

Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-19 Thread Stephen Hemminger

On Tue, 19 Dec 2017 14:37:50 -0800
"Samudrala, Sridhar"  wrote:

> On 12/19/2017 11:46 AM, Stephen Hemminger wrote:
> > On Tue, 19 Dec 2017 11:42:33 -0800
> > "Samudrala, Sridhar"  wrote:
> >  
> >> On 12/19/2017 10:41 AM, Stephen Hemminger wrote:  
> >>> On Tue, 19 Dec 2017 13:21:17 -0500 (EST)
> >>> David Miller  wrote:
> >>> 
> >>>> From: Stephen Hemminger 
> >>>> Date: Tue, 19 Dec 2017 09:55:48 -0800
> >>>> 
> >>>>> could be 10ms, just enough to let udev do its renaming  
> >>>> Please, move to some kind of notification or event based handling of
> >>>> this problem.
> >>>>
> >>>> No delay is safe, what if userspace gets swapped out or whatever
> >>>> else might make userspace stall unexpectedly?
> >>>> 
> >>> The plan is to remove the delay and do the naming in the kernel.
> >>> This was suggested by Lennart since udev is only doing naming policy
> >>> because kernel names were not repeatable.
> >>>
> >>> This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
> >>>
> >>> Patch is pending.  
> >> Do we really need to delay the setup until the name is changed?
> >> Can't we call dev_set_mtu() and dev_open() until dev_change_name() is done?
> >>
> >> Thanks
> >> Sridhar  
> > You can call dev_set_mtu, but when dev_open is done the device name
> > can not be changed by userspace.  
> I did a quick test to remove the delay and also the dev_open() call and 
> i don't see
> any issues with virtio taking over the VF datapath.
> Only the netdev_info() messages may show old device name.
> 
> Any specific scenario where we need to explicitly call  VF's dev_open() 
> in the VF setup process?
> I tried i40evf driver loaded after virtio_net  AND  virtio_net loading 
> after i40evf.
> 
> Thanks
> Sridhar

It happens with hotplug. It is possible on Hyper-V to hotplug SR-IOV on
and off while the guest is running. If SR-IOV is disabled in the host then
the VF device is removed (hot-unplugged), and vice versa. If the master
device is up then the VF device should be brought up by the master device.

Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-19 Thread Stephen Hemminger

On Tue, 19 Dec 2017 11:42:33 -0800
"Samudrala, Sridhar"  wrote:

> On 12/19/2017 10:41 AM, Stephen Hemminger wrote:
> > On Tue, 19 Dec 2017 13:21:17 -0500 (EST)
> > David Miller  wrote:
> >  
> >> From: Stephen Hemminger 
> >> Date: Tue, 19 Dec 2017 09:55:48 -0800
> >>  
> >>> could be 10ms, just enough to let udev do its renaming  
> >> Please, move to some kind of notification or event based handling of
> >> this problem.
> >>
> >> No delay is safe, what if userspace gets swapped out or whatever
> >> else might make userspace stall unexpectedly?
> >>  
> > The plan is to remove the delay and do the naming in the kernel.
> > This was suggested by Lennart since udev is only doing naming policy
> > because kernel names were not repeatable.
> >
> > This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
> >
> > Patch is pending.  
> Do we really need to delay the setup until the name is changed?
> Can't we call dev_set_mtu() and dev_open() until dev_change_name() is done?
> 
> Thanks
> Sridhar

You can call dev_set_mtu, but once dev_open has been called the device name
cannot be changed by userspace.

Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-19 Thread Stephen Hemminger

On Tue, 19 Dec 2017 13:21:17 -0500 (EST)
David Miller  wrote:

> From: Stephen Hemminger 
> Date: Tue, 19 Dec 2017 09:55:48 -0800
> 
> > could be 10ms, just enough to let udev do its renaming  
> 
> Please, move to some kind of notification or event based handling of
> this problem.
> 
> No delay is safe, what if userspace gets swapped out or whatever
> else might make userspace stall unexpectedly?
> 

The plan is to remove the delay and do the naming in the kernel.
This was suggested by Lennart since udev is only doing naming policy
because kernel names were not repeatable.

This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.

Patch is pending.
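As an illustration of the "ethN_vf" naming idea, here is a minimal
user-space sketch. The helper model_vf_name() and the MODEL_IFNAMSIZ
constant are hypothetical; how the pending patch actually builds the name
is not shown in this thread. The only assumption taken from the kernel is
that interface names are limited to 16 bytes including the terminator.

```c
#include <stdio.h>

#define MODEL_IFNAMSIZ 16	/* assumed to match the kernel's IFNAMSIZ */

/*
 * Build "<master>_vf" into buf.
 * Returns 0 on success, -1 if the result would not fit in
 * MODEL_IFNAMSIZ bytes including the terminating NUL.
 */
static int model_vf_name(char buf[MODEL_IFNAMSIZ], const char *master)
{
	int n = snprintf(buf, MODEL_IFNAMSIZ, "%s_vf", master);

	return (n < 0 || n >= MODEL_IFNAMSIZ) ? -1 : 0;
}
```

Deriving the name from the master keeps it repeatable across boots, which
is the property udev renaming was compensating for.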

Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-19 Thread Stephen Hemminger

On Tue, 19 Dec 2017 20:07:01 +0200
"Michael S. Tsirkin"  wrote:

> On Tue, Dec 19, 2017 at 09:55:48AM -0800, Stephen Hemminger wrote:
> > On Tue, 19 Dec 2017 09:41:39 -0800
> > "Samudrala, Sridhar"  wrote:
> >   
> > > On 12/19/2017 7:47 AM, Michael S. Tsirkin wrote:  
> > > > I'll need to look at this more, in particular the feature
> > > > bit is missing here. For now one question:
> > > >
> > > > On Mon, Dec 18, 2017 at 04:40:36PM -0800, Sridhar Samudrala wrote:
> > > >> @@ -56,6 +58,8 @@ module_param(napi_tx, bool, 0644);
> > > >>*/
> > > >>   DECLARE_EWMA(pkt_len, 0, 64)
> > > >>   
> > > >> +#define VF_TAKEOVER_INT   (HZ / 10)
> > > >> +
> > > >>   #define VIRTNET_DRIVER_VERSION "1.0.0"
> > > >>   
> > > >>   static const unsigned long guest_offloads[] = {
> > > > Why is this delay necessary? And why by 100ms?
> > > 
> > > This is based on netvsc implementation and here is the commit that
> > > added this delay.  Not sure if this needs to be 100ms.
> > > 
> > > commit 6123c66854c174e4982f98195100c1d990f9e5e6
> > > Author: stephen hemminger 
> > > Date:   Wed Aug 9 17:46:03 2017 -0700
> > > 
> > >      netvsc: delay setup of VF device
> > > 
> > >      When VF device is discovered, delay bring it automatically up in
> > >      order to allow userspace to some simple changes (like renaming).
> > > 
> > > 
> > >   
> > 
> > could be 10ms, just enough to let udev do its renaming  
> 
> Isn't there a way not to depend on udev completing its thing within a given 
> timeframe?

Not that I know of. The path is quite indirect.

Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-19 Thread Stephen Hemminger

On Tue, 19 Dec 2017 09:41:39 -0800
"Samudrala, Sridhar"  wrote:

> On 12/19/2017 7:47 AM, Michael S. Tsirkin wrote:
> > I'll need to look at this more, in particular the feature
> > bit is missing here. For now one question:
> >
> > On Mon, Dec 18, 2017 at 04:40:36PM -0800, Sridhar Samudrala wrote:  
> >> @@ -56,6 +58,8 @@ module_param(napi_tx, bool, 0644);
> >>*/
> >>   DECLARE_EWMA(pkt_len, 0, 64)
> >>   
> >> +#define VF_TAKEOVER_INT   (HZ / 10)
> >> +
> >>   #define VIRTNET_DRIVER_VERSION "1.0.0"
> >>   
> >>   static const unsigned long guest_offloads[] = {  
> > Why is this delay necessary? And why by 100ms?  
> 
> This is based on netvsc implementation and here is the commit that
> added this delay.  Not sure if this needs to be 100ms.
> 
> commit 6123c66854c174e4982f98195100c1d990f9e5e6
> Author: stephen hemminger 
> Date:   Wed Aug 9 17:46:03 2017 -0700
> 
>      netvsc: delay setup of VF device
> 
>      When VF device is discovered, delay bring it automatically up in
>      order to allow userspace to some simple changes (like renaming).
> 
> 
> 

could be 10ms, just enough to let udev do its renaming
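For reference, the arithmetic behind the two delays can be sketched in
user space. The helpers below are hypothetical (they are not the kernel's
jiffies API); they only show that (HZ / 10) jiffies is 100 ms for the
usual HZ values, and that a 10 ms delay written as (HZ / 100) truncates
when HZ is not a multiple of 100.

```c
/* Duration in milliseconds represented by `ticks` jiffies at tick rate `hz`. */
static unsigned int model_jiffies_to_ms(unsigned int ticks, unsigned int hz)
{
	return ticks * 1000u / hz;
}

/* Jiffies needed to cover at least `ms` milliseconds (rounds up). */
static unsigned int model_ms_to_jiffies(unsigned int ms, unsigned int hz)
{
	return (ms * hz + 999u) / 1000u;
}
```

With hz = 250, (hz / 10) is 25 jiffies, i.e. 100 ms, while (hz / 100) is
2 jiffies, i.e. only 8 ms; rounding up gives 3 jiffies for a 10 ms request.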

Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-05 Thread Stephen Hemminger

On Tue, 5 Dec 2017 14:29:28 -0800
Jakub Kicinski  wrote:

> On Tue, 5 Dec 2017 11:59:17 +0200, achiad shochat wrote:
> > >>>> I second Jacob - having a netdev of one device driver enslave a netdev
> > >>>> of another device driver is an awkward asymmetric model.
> > >>>> Regardless of whether they share the same backend device.
> > >>>> Only I am not sure the Linux Bond is the right choice.
> > >>>> e.g one may well want to use the virtio device also when the
> > >>>> pass-through device is available, e.g for multicasts, east-west
> > >>>> traffic, etc.
> > >>>> I'm not sure the Linux Bond fits that functionality.
> > >>>> And, as I hear in this thread, it is hard to make it work out of the box.
> > >>>> So I think the right thing would be to write a new dedicated module
> > >>>> for this purpose.
> > >
> > > This part I can sort of agree with. What if we were to look at
> > > providing a way to somehow advertise that the two devices were meant
> > > to be bonded for virtualization purposes? For now let's call it a
> > > "virt-bond". Basically we could look at providing a means for virtio
> > > and VF drivers to advertise that they want this sort of bond. Then it
> > > would just be a matter of providing some sort of side channel to
> > > indicate where you want things like multicast/broadcast/east-west
> > > traffic to go.  
> > 
> > I like this approach.  
> 
> +1 on a separate driver, just enslaving devices to virtio may break
> existing setups.  If people are bonding from user space today, if they
> update their kernel it may surprise them how things get auto-mangled.
> 
> Is what Alex is suggesting a separate PV device that says "I would
> like to be a bond of those two interfaces"?  That would make the HV
> intent explicit and kernel decisions more understandable.

So far, in my experience it still works.
As long as the kernel slaving happens first, it will work.
The attempt to bond an already slaved device will fail and no scripts seem
to check the error return.
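The ordering argument above can be modelled in a few lines. This is a
hypothetical user-space sketch (struct model_dev and model_enslave() are
invented for illustration, and -EBUSY is only an assumed error code), not
the bonding driver's code: whoever enslaves the device first wins, and a
later attempt fails without disturbing the existing master.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical device with at most one master. */
struct model_dev {
	const char *name;
	const struct model_dev *master;	/* NULL when not enslaved */
};

/* Returns 0 on success, -EBUSY if dev already has a master. */
static int model_enslave(struct model_dev *dev, const struct model_dev *master)
{
	if (dev->master)
		return -EBUSY;
	dev->master = master;
	return 0;
}
```

A script that ignores the second call's return value keeps running, which
matches the observation that existing setups still appear to work.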

Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-03 Thread Stephen Hemminger

On Sun, 3 Dec 2017 11:14:37 +0200
achiad shochat  wrote:

> On 3 December 2017 at 07:05, Michael S. Tsirkin  wrote:
> > On Fri, Dec 01, 2017 at 12:08:59PM -0800, Shannon Nelson wrote:  
> >> On 11/30/2017 6:11 AM, Michael S. Tsirkin wrote:  
> >> > On Thu, Nov 30, 2017 at 10:08:45AM +0200, achiad shochat wrote:  
> >> > > Re. problem #2:
> >> > > Indeed the best way to address it seems to be to enslave the VF driver
> >> > > netdev under a persistent anchor netdev.
> >> > > And it's indeed desired to allow (but not enforce) PV netdev and VF
> >> > > netdev to work in conjunction.
> >> > > And it's indeed desired that this enslavement logic work out of the box.
> >> > > But in case of PV+VF some configurable policies must be in place (and
> >> > > they'd better be generic rather than differ per PV technology).
> >> > > For example - based on which characteristics should the PV+VF coupling
> >> > > be done? netvsc uses MAC address, but that might not always be the
> >> > > desire.  
> >> >
> >> > It's a policy but not guest userspace policy.
> >> >
> >> > The hypervisor certainly knows.
> >> >
> >> > Are you concerned that someone might want to create two devices with the
> >> > same MAC for an unrelated reason?  If so, hypervisor could easily set a
> >> > flag in the virtio device to say "this is a backup, use MAC to find
> >> > another device".  
> >>
> >> This is something I was going to suggest: a flag or other configuration on
> >> the virtio device to help control how this new feature is used.  I can
> >> imagine this might be useful to control from either the hypervisor side or
> >> the VM side.
> >>
> >> The hypervisor might want to (1) disable it (force it off), (2) enable it
> >> for VM choice, or (3) force it on for the VM.  In case (2), the VM might be
> >> able to choose whether it wants to make use of the feature, or stick with
> >> the bonding solution.
> >>
> >> Either way, the kernel is making a feature available, and the user (VM or
> >> hypervisor) is able to control it by selecting the feature based on the
> >> policy desired.
> >>
> >> sln  
> >
> > I'm not sure what's the feature that is available here.
> >
> > I saw this as a flag that says "this device shares backend with another
> > network device which can be found using MAC, and that backend should be
> > preferred".  kernel then forces configuration which uses that other
> > backend - as long as it exists.
> >
> > However, please Cc virtio-dev mailing list if we are doing this since
> > this is a spec extension.
> >
> > --
> > MST  
> 
> 
> Can someone please explain why assume a virtio device is there at all??
> I specified a case where there isn't any.
> 
> I second Jacob - having a netdev of one device driver enslave a netdev
> of another device driver is an awkward asymmetric model.
> Regardless of whether they share the same backend device.
> Only I am not sure the Linux Bond is the right choice.
> e.g one may well want to use the virtio device also when the
> pass-through device is available, e.g for multicasts, east-west
> traffic, etc.
> I'm not sure the Linux Bond fits that functionality.
> And, as I hear in this thread, it is hard to make it work out of the box.
> So I think the right thing would be to write a new dedicated module
> for this purpose.
> 
> Re policy -
> Indeed the HV can request a policy from the guest but that's not a
> claim for the virtio device enslaving the pass-through device.
> Any policy can be queried by the upper enslaving device.
> 
> Bottom line - I do not see a single reason to have the virtio netdev
> (nor netvsc or any other PV netdev) enslave another netdev by itself.
> If we'd do it right with netvsc from the beginning we wouldn't need
> this discussion at all...

There are several issues with transparent migration.
The first is that the SR-IOV device needs to be shut off earlier
in the migration process.
Next, the SR-IOV device in the migrated guest environment may be different.
It might not exist at all, it might be at a different PCI address, or it
could even be a different vendor/speed/model.
Keeping a virtual network device around allows connectivity to persist
during the process.
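The role of the persistent virtual device can be sketched as a simple
datapath choice. The types and the helper below are hypothetical and only
model the idea: use the VF when one is present and has link, otherwise
fall back to the paravirtual device that is always there.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one possible datapath. */
struct model_path {
	const char *name;
	bool present;	/* device currently plugged in */
	bool link_up;	/* device has carrier */
};

/*
 * Pick the VF when it is plugged in and has link; otherwise use the
 * persistent paravirtual (PV) device. vf may be NULL when the guest
 * has no VF at all, e.g. on a destination host without SR-IOV.
 */
static const struct model_path *
model_select_path(const struct model_path *vf, const struct model_path *pv)
{
	if (vf && vf->present && vf->link_up)
		return vf;
	return pv;
}
```

Because the fallback does not depend on anything about the VF, it does not
matter if the VF on the destination has a different PCI address or vendor.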

Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-11-29 Thread Stephen Hemminger

On Wed, 29 Nov 2017 19:51:38 -0800
Jakub Kicinski  wrote:

> On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote:
> > On 2017年11月29日 03:27, Jesse Brandeburg wrote:  
> > > Hi, I'd like to get some feedback on a proposal to enhance
> > > virtio-net to ease configuration of a VM and that would enable
> > > live migration of passthrough network SR-IOV devices.
> > >
> > > Today we have SR-IOV network devices (VFs) that can be passed
> > > into a VM in order to enable high performance networking direct
> > > within the VM. The problem I am trying to address is that this
> > > configuration is generally difficult to live-migrate.  There is
> > > documentation [1] indicating that some OS/Hypervisor vendors will
> > > support live migration of a system with a direct assigned
> > > networking device.  The problem I see with these implementations
> > > is that the network configuration requirements that are passed on
> > > to the owner of the VM are quite complicated.  You have to set up
> > > bonding, you have to configure it to enslave two interfaces,
> > > those interfaces (one is virtio-net, the other is SR-IOV
> > > device/driver like ixgbevf) must support MAC address changes
> > > requested in the VM, and on and on...
> > >
> > > So, on to the proposal:
> > > Modify virtio-net driver to be a single VM network device that
> > > enslaves an SR-IOV network device (inside the VM) with the same
> > > MAC address. This would cause the virtio-net driver to appear and
> > > work like a simplified bonding/team driver.  The live migration
> > > problem would be solved just like today's bonding solution, but
> > > the VM user's networking config would be greatly simplified.
> > >
> > > At its simplest, it would appear something like this in the VM.
> > >
> > > +--------------+
> > > | vnet0        |
> > > | (virtio-net) |----+
> > > +--------------+    |
> > >                +---------+
> > >                | ixgbevf |
> > >                +---------+
> > >
> > > (forgive the ASCII art)
> > >
> > > The fast path traffic would prefer the ixgbevf or other SR-IOV
> > > device path, and fall back to virtio's transmit/receive when
> > > migrating.
> > >
> > > Compared to today's options this proposal would
> > > 1) make virtio-net more sticky, allow fast path traffic at SR-IOV
> > > speeds
> > > 2) simplify end user configuration in the VM (most if not all of
> > > the set up to enable migration would be done in the hypervisor)
> > > 3) allow live migration via a simple link down and maybe a PCI
> > > hot-unplug of the SR-IOV device, with failover to the
> > > virtio-net driver core
> > > 4) allow vendor agnostic hardware acceleration, and live migration
> > > between vendors if the VM os has driver support for all the
> > > required SR-IOV devices.
> > >
> > > Runtime operation proposed:
> > > -  virtio-net driver loads, SR-IOV driver loads
> > > - virtio-net finds other NICs that match its MAC address, both by
> > >   examining existing interfaces and by setting up a new device
> > >   notifier
> > > - virtio-net enslaves the first NIC with the same MAC address
> > > - virtio-net brings up the slave, and makes it the "preferred"
> > > path
> > > - virtio-net follows the behavior of an active backup bond/team
> > > - virtio-net acts as the interface to the VM
> > > - live migration initiates
> > > - link goes down on SR-IOV, or SR-IOV device is removed
> > > - failover to virtio-net as primary path
> > > - migration continues to new host
> > > - new host is started with virtio-net as primary
> > > - if no SR-IOV, virtio-net stays primary
> > > - hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio
> > > - virtio-net notices new NIC and starts over at enslave step above
> > >
> > > Future ideas (brainstorming):
> > > - Optimize Fast east-west by having special rules to direct
> > > east-west traffic through virtio-net traffic path
> > >
> > > Thanks for reading!
> > > Jesse
> > 
> > Cc netdev.
> > 
> > Interesting, and this method is actually used by netvsc now:
> > 
> > commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
> > Author: stephen hemminger 
> > Date:   Tue Aug 1 19:58:53 2017 -0700

[PATCH] uapi: add SPDX identifier to vm_sockets_diag.h

2017-11-24 Thread Stephen Hemminger

New file seems to have missed the SPDX license scan and update.

Signed-off-by: Stephen Hemminger 
---
 include/uapi/linux/vm_sockets_diag.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/vm_sockets_diag.h 
b/include/uapi/linux/vm_sockets_diag.h
index 14cd7dc5a187..0b4dd54f3d1e 100644
--- a/include/uapi/linux/vm_sockets_diag.h
+++ b/include/uapi/linux/vm_sockets_diag.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
 /* AF_VSOCK sock_diag(7) interface for querying open sockets */
 
 #ifndef _UAPI__VM_SOCKETS_DIAG_H__
-- 
2.11.0


RE: [PATCH net-next 4/4] mlx4: sizeof style usage

2017-08-20 Thread Stephen Hemminger via Virtualization

Yes, good catch.

-----Original Message-----
From: Tariq Toukan [mailto:tar...@mellanox.com] 
Sent: Sunday, August 20, 2017 3:27 AM
To: Stephen Hemminger ; mlind...@marvell.com; 
m...@redhat.com; jasow...@redhat.com
Cc: net...@vger.kernel.org; linux-r...@vger.kernel.org; 
virtualization@lists.linux-foundation.org; Stephen Hemminger 

Subject: Re: [PATCH net-next 4/4] mlx4: sizeof style usage

Thanks Stephen.
Sorry for the late reply, I was on vacation.
I know this is already accepted, but still I have one comment.

On 15/08/2017 8:29 PM, Stephen Hemminger wrote:
> The kernel coding style is to treat sizeof as a function
> (ie. with parenthesis) not as an operator.
>
> Also use kcalloc and kmalloc_array
>
> Signed-off-by: Stephen Hemminger 
> ---
> @@ -726,7 +726,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct 
> mlx4_eq *eq)
>   }
>   memcpy(&priv->mfunc.master.comm_arm_bit_vector,
>  eqe->event.comm_channel_arm.bit_vec,
> -sizeof eqe->event.comm_channel_arm.bit_vec);
> +sizeof(eqe)->event.comm_channel_arm.bit_vec);

I think the brackets here are misplaced.
Shouldn't they be as follows?

sizeof(eqe->event.comm_channel_arm.bit_vec));

>   queue_work(priv->mfunc.master.comm_wq,
>  &priv->mfunc.master.comm_work);
>   break;

Thanks,
Tariq
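A note on the bracket placement discussed above, with a small sketch.
As far as the C grammar goes, `sizeof(eqe)->field` is parsed as sizeof
applied to the expression `(eqe)->field`, so the posted line should yield
the same size as the suggested one; it just reads as if it took the size
of the pointer. The struct below is a hypothetical stand-in, not the real
struct mlx4_eqe.

```c
#include <stddef.h>

/* Hypothetical stand-in for the event queue entry. */
struct model_eqe {
	struct {
		unsigned int bit_vec[4];
	} arm;
};

/* Form as posted: sizeof applied to the expression (e)->arm.bit_vec. */
static size_t size_as_posted(const struct model_eqe *e)
{
	return sizeof(e)->arm.bit_vec;
}

/* Form as suggested in the review: brackets around the whole operand. */
static size_t size_as_suggested(const struct model_eqe *e)
{
	return sizeof(e->arm.bit_vec);
}
```

Both helpers return the size of the array member, so the suggested form is
the one to prefer purely because it is the one a reader will not misparse.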

[PATCH net-next 4/4] mlx4: sizeof style usage

2017-08-15 Thread Stephen Hemminger

The kernel coding style is to treat sizeof as a function
(i.e. with parentheses), not as an operator.

Also use kcalloc and kmalloc_array

Signed-off-by: Stephen Hemminger 
---
 drivers/net/ethernet/mellanox/mlx4/alloc.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/cmd.c   |  4 ++--
 drivers/net/ethernet/mellanox/mlx4/en_resources.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/eq.c| 20 +-
 drivers/net/ethernet/mellanox/mlx4/fw.c|  2 +-
 drivers/net/ethernet/mellanox/mlx4/icm.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/icm.h   |  4 ++--
 drivers/net/ethernet/mellanox/mlx4/intf.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/main.c  | 12 +--
 drivers/net/ethernet/mellanox/mlx4/mcg.c   | 12 +--
 drivers/net/ethernet/mellanox/mlx4/mr.c| 10 -
 drivers/net/ethernet/mellanox/mlx4/qp.c| 12 +--
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  | 24 +++---
 15 files changed, 56 insertions(+), 56 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/alloc.c 
b/drivers/net/ethernet/mellanox/mlx4/alloc.c
index b651c1210555..6dabd983e7e0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/alloc.c
+++ b/drivers/net/ethernet/mellanox/mlx4/alloc.c
@@ -186,7 +186,7 @@ int mlx4_bitmap_init(struct mlx4_bitmap *bitmap, u32 num, 
u32 mask,
bitmap->effective_len = bitmap->avail;
spin_lock_init(&bitmap->lock);
bitmap->table = kzalloc(BITS_TO_LONGS(bitmap->max) *
-   sizeof (long), GFP_KERNEL);
+   sizeof(long), GFP_KERNEL);
if (!bitmap->table)
return -ENOMEM;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 674773b28b2e..97aed30ead21 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -2637,7 +2637,7 @@ int mlx4_cmd_use_events(struct mlx4_dev *dev)
int err = 0;
 
priv->cmd.context = kmalloc(priv->cmd.max_cmds *
-  sizeof (struct mlx4_cmd_context),
+  sizeof(struct mlx4_cmd_context),
   GFP_KERNEL);
if (!priv->cmd.context)
return -ENOMEM;
@@ -2695,7 +2695,7 @@ struct mlx4_cmd_mailbox *mlx4_alloc_cmd_mailbox(struct 
mlx4_dev *dev)
 {
struct mlx4_cmd_mailbox *mailbox;
 
-   mailbox = kmalloc(sizeof *mailbox, GFP_KERNEL);
+   mailbox = kmalloc(sizeof(*mailbox), GFP_KERNEL);
if (!mailbox)
return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_resources.c b/drivers/net/ethernet/mellanox/mlx4/en_resources.c
index 86d2d42d658d..5a47f9669621 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_resources.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_resources.c
@@ -44,7 +44,7 @@ void mlx4_en_fill_qp_context(struct mlx4_en_priv *priv, int size, int stride,
struct mlx4_en_dev *mdev = priv->mdev;
struct net_device *dev = priv->dev;
 
-   memset(context, 0, sizeof *context);
+   memset(context, 0, sizeof(*context));
context->flags = cpu_to_be32(7 << 16 | rss << MLX4_RSS_QPC_FLAG_OFFSET);
context->pd = cpu_to_be32(mdev->priv_pdn);
context->mtu_msgmax = 0xff;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index bf1638044a7a..dcb8f8f84a97 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1056,7 +1056,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv *priv, int qpn,
}
qp->event = mlx4_en_sqp_event;
 
-   memset(context, 0, sizeof *context);
+   memset(context, 0, sizeof(*context));
mlx4_en_fill_qp_context(priv, ring->actual_size, ring->stride, 0, 0,
qpn, ring->cqn, -1, context);
context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 73faa3d77921..bcf422efd3b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -643,7 +643,7 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
 void *fragptr)
 {
struct mlx4_wqe_inline_seg *inl = &tx_desc->inl;
-   int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof *inl;
+   int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof(*inl);
unsigned int hlen = skb_headlen(skb);
 
if (skb->len <= spc) {
diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c 
b/drivers/net/et

[PATCH net-next 3/4] skge: add paren around sizeof arg

2017-08-15 Thread Stephen Hemminger

Signed-off-by: Stephen Hemminger 
---
 drivers/net/ethernet/marvell/skge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/skge.c b/drivers/net/ethernet/marvell/skge.c
index 5d7d94de4e00..8a835e82256a 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -3516,7 +3516,7 @@ static const char *skge_board_name(const struct skge_hw *hw)
if (skge_chips[i].id == hw->chip_id)
return skge_chips[i].name;
 
-   snprintf(buf, sizeof buf, "chipid 0x%x", hw->chip_id);
+   snprintf(buf, sizeof(buf), "chipid 0x%x", hw->chip_id);
return buf;
 }
 
-- 
2.11.0


[PATCH net-next 1/4] tun/tap: use paren's with sizeof

2017-08-15 Thread Stephen Hemminger

Although sizeof is an operator in C, the kernel coding style convention
is to always use it like a function and add parentheses.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/tap.c | 2 +-
 drivers/net/tun.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 0d039411e64c..21b71ae947fd 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1215,7 +1215,7 @@ int tap_queue_resize(struct tap_dev *tap)
int n = tap->numqueues;
int ret, i = 0;
 
-   arrays = kmalloc(sizeof *arrays * n, GFP_KERNEL);
+   arrays = kmalloc_array(n, sizeof(*arrays), GFP_KERNEL);
if (!arrays)
return -ENOMEM;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 5892284eb8d0..f5017121cd57 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2737,7 +2737,7 @@ static int tun_queue_resize(struct tun_struct *tun)
int n = tun->numqueues + tun->numdisabled;
int ret, i;
 
-   arrays = kmalloc(sizeof *arrays * n, GFP_KERNEL);
+   arrays = kmalloc_array(n, sizeof(*arrays), GFP_KERNEL);
if (!arrays)
return -ENOMEM;
 
-- 
2.11.0


[PATCH net-next 2/4] virtio: put paren around sizeof

2017-08-15 Thread Stephen Hemminger

Kernel coding style is to put paren around operand of sizeof.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index a3f3c66b4530..4302f313d9a7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -319,7 +319,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 
hdr_len = vi->hdr_len;
if (vi->mergeable_rx_bufs)
-   hdr_padded_len = sizeof *hdr;
+   hdr_padded_len = sizeof(*hdr);
else
hdr_padded_len = sizeof(struct padded_vnet_hdr);
 
-- 
2.11.0


[PATCH net-next 0/4] various sizeof cleanups

2017-08-15 Thread Stephen Hemminger

Noticed some places that were using sizeof as an operator.
This is legal C but is not the convention used in the kernel.

Stephen Hemminger (4):
  tun/tap: use paren's with sizeof
  virtio: put paren around sizeof
  skge: add paren around sizeof arg
  mlx4: sizeof style usage

 drivers/net/ethernet/marvell/skge.c|  2 +-
 drivers/net/ethernet/mellanox/mlx4/alloc.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/cmd.c   |  4 ++--
 drivers/net/ethernet/mellanox/mlx4/en_resources.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/eq.c| 20 +-
 drivers/net/ethernet/mellanox/mlx4/fw.c|  2 +-
 drivers/net/ethernet/mellanox/mlx4/icm.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/icm.h   |  4 ++--
 drivers/net/ethernet/mellanox/mlx4/intf.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/main.c  | 12 +--
 drivers/net/ethernet/mellanox/mlx4/mcg.c   | 12 +--
 drivers/net/ethernet/mellanox/mlx4/mr.c| 10 -
 drivers/net/ethernet/mellanox/mlx4/qp.c| 12 +--
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  | 24 +++---
 drivers/net/tap.c  |  2 +-
 drivers/net/tun.c  |  2 +-
 drivers/net/virtio_net.c   |  2 +-
 19 files changed, 60 insertions(+), 60 deletions(-)

-- 
2.11.0
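
As an illustration of the two kinds of change in this series (not part of the patches themselves), here is a minimal userspace sketch. It assumes plain C with malloc(); alloc_array() is a hypothetical stand-in for the kernel's kmalloc_array(), and struct item is made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Made-up type, only here so sizeof has something to measure. */
struct item {
	int id;
	char name[12];
};

/*
 * Hypothetical userspace stand-in for kmalloc_array(): refuse the
 * allocation when n * size would overflow, instead of silently
 * returning a buffer that is too short.
 */
static void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return malloc(n * size);
}

/* Both spellings are the same operator; only the style differs. */
static int sizeof_styles_agree(void)
{
	struct item *p = NULL;	/* never dereferenced: sizeof does not evaluate *p */

	return sizeof *p == sizeof(*p) && sizeof(*p) == sizeof(struct item);
}
```

The tun/tap patch goes one step further than the pure style change: moving from an open-coded multiplication to an array allocator is what adds the overflow check sketched in alloc_array().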


Re: [PATCH v3 2/3] x86/hyperv: move TSC reading method to asm/mshyperv.h

2017-03-03 Thread Stephen Hemminger


Minor coding comments

> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index d324dce..4ff25436 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -178,6 +178,56 @@ void hyperv_cleanup(void);
>  #endif
>  #ifdef CONFIG_HYPERV_TSCPAGE
>  struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
> +static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
> +{
> + u64 scale, offset, current_tick, cur_tsc;
> + u32 sequence;
> +
> + /*
> +  * The protocol for reading Hyper-V TSC page is specified in Hypervisor
> +  * Top-Level Functional Specification ver. 3.0 and above. To get the
> +  * reference time we must do the following:
> +  * - READ ReferenceTscSequence
> +  *   A special '0' value indicates the time source is unreliable and we
> +  *   need to use something else. The currently published specification
> +  *   versions (up to 4.0b) contain a mistake and wrongly claim '-1'
> +  *   instead of '0' as the special value, see commit c35b82ef0294.
> +  * - ReferenceTime =
> +  *((RDTSC() * ReferenceTscScale) >> 64) + ReferenceTscOffset
> +  * - READ ReferenceTscSequence again. In case its value has changed
> +  *   since our first reading we need to discard ReferenceTime and repeat
> +  *   the whole sequence as the hypervisor was updating the page in
> +  *   between.
> +  */
> + while (1) {
> + sequence = READ_ONCE(tsc_pg->tsc_sequence);
> + if (!sequence)
> + break;

It would be clearer to just return U64_MAX here (and not fall out of the loop)
since this is the only failure case. Also, since this failure only occurs if the
host clock is not available, the test should probably be marked unlikely().

> + /*
> +  * Make sure we read sequence before we read other values from
> +  * TSC page.
> +  */
> + smp_rmb();
> +
> + scale = READ_ONCE(tsc_pg->tsc_scale);
> + offset = READ_ONCE(tsc_pg->tsc_offset);
> + cur_tsc = rdtsc_ordered();

Since you already have smp_ barriers and rdtsc_ordered is a barrier,
the compiler barriers (READ_ONCE()) shouldn't be necessary.

> +
> + current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
> +
> + /*
> +  * Make sure we read sequence after we read all other values
> +  * from TSC page.
> +  */
> + smp_rmb();
> +
> + if (READ_ONCE(tsc_pg->tsc_sequence) == sequence)
> + return current_tick;
> + }

Why not make a do { } while loop out of this?

do {
...
} while (unlikely(READ_ONCE(tsc_pg->tsc_sequence) != sequence));
return current_tick;

Also, there is no need to calculate the tick value until you have good data. As in:

static inline u32 hv_clock_sequence(const struct ms_hyperv_tsc_page *tsc_pg)
{
u32 sequence =
return sequence;
}

static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
{
u64 scale, offset, cur_tsc;
u32 start;

/*
 * The protocol for reading Hyper-V TSC page is specified in Hypervisor
 * Top-Level Functional Specification ver. 3.0 and above. To get the
 * reference time we must do the following:
 * - READ ReferenceTscSequence
 *   A special '0' value indicates the time source is unreliable and we
 *   need to use something else. The currently published specification
 *   versions (up to 4.0b) contain a mistake and wrongly claim '-1'
 *   instead of '0' as the special value, see commit c35b82ef0294.
 * - ReferenceTime =
 *((RDTSC() * ReferenceTscScale) >> 64) + ReferenceTscOffset
 * - READ ReferenceTscSequence again. In case its value has changed
 *   since our first reading we need to discard ReferenceTime and repeat
 *   the whole sequence as the hypervisor was updating the page in
 *   between.
 */
do {
start = READ_ONCE(tsc_pg->tsc_sequence);
smp_rmb();

if (unlikely(!start))
return U64_MAX;

scale = tsc_pg->tsc_scale;
offset = tsc_pg->tsc_offset;

/*
 * Make sure we read sequence after we read all other values
 * from TSC page.
 */
smp_rmb();
} while (unlikely(READ_ONCE(tsc_pg->tsc_sequence) != start));

cur_tsc = rdtsc_ordered();
return mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
}
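
The shape of that loop can be tried outside the kernel. The following is a rough userspace sketch only, under several assumptions: struct tsc_page and read_tsc_page() are hypothetical names, C11 atomics stand in for READ_ONCE()/smp_rmb(), the TSC value is passed in by the caller instead of being read with rdtsc_ordered(), and mul_shr64() relies on the GCC/Clang unsigned __int128 extension.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the three fields used in this thread. */
struct tsc_page {
	_Atomic uint32_t tsc_sequence;
	_Atomic uint64_t tsc_scale;
	_Atomic uint64_t tsc_offset;
};

/* (a * b) >> 64, using the GCC/Clang 128-bit integer extension. */
static uint64_t mul_shr64(uint64_t a, uint64_t b)
{
	return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/*
 * Snapshot scale and offset under the sequence check, then convert the
 * caller-supplied TSC value. Returns UINT64_MAX when the sequence is 0,
 * meaning the page is not a usable time source.
 */
static uint64_t read_tsc_page(struct tsc_page *pg, uint64_t tsc)
{
	uint64_t scale, offset;
	uint32_t start;

	do {
		start = atomic_load_explicit(&pg->tsc_sequence,
					     memory_order_acquire);
		if (start == 0)
			return UINT64_MAX;

		scale = atomic_load_explicit(&pg->tsc_scale,
					     memory_order_relaxed);
		offset = atomic_load_explicit(&pg->tsc_offset,
					      memory_order_relaxed);

		/* Order the data loads before the sequence re-read. */
		atomic_thread_fence(memory_order_acquire);
	} while (atomic_load_explicit(&pg->tsc_sequence,
				      memory_order_relaxed) != start);

	/* The multiply happens once, after a consistent snapshot. */
	return mul_shr64(tsc, scale) + offset;
}
```

With a scale of 2^63, an offset of 100 and a TSC value of 1000, the conversion gives (1000 * 2^63) >> 64 + 100 = 600.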


RE: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

2017-02-10 Thread Stephen Hemminger via Virtualization

Since the sequence count algorithm is implemented by the hypervisor, it is
better not to reuse seqcount.
I am still concerned that the code is racy.

-Original Message-
From: Thomas Gleixner [mailto:t...@linutronix.de] 
Sent: Friday, February 10, 2017 4:28 AM
To: Vitaly Kuznetsov 
Cc: Stephen Hemminger ; x...@kernel.org; Andy 
Lutomirski ; Ingo Molnar ; H. Peter 
Anvin ; KY Srinivasan ; Haiyang Zhang 
; Dexuan Cui ; 
linux-ker...@vger.kernel.org; de...@linuxdriverproject.org; 
virtualization@lists.linux-foundation.org
Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

On Fri, 10 Feb 2017, Vitaly Kuznetsov wrote:

> Stephen Hemminger  writes:
> 
> > Why not use existing seqlock's?
> >
> 
> To be honest I don't quite understand how we could use it -- the 
> sequence locking here is done against the page updated by the 
> hypervisor, we're not creating new structures (so I don't understand 
> how we could use struct seqcount which we don't have) but I may be 
> misunderstanding something.

You can't use seqlock, but you might be able to use seqcount. Though I doubt it 
given the 0 check 

Thanks,

tglx

Re: [PATCH 1/2] hyperv: implement hv_get_tsc_page()

2017-02-09 Thread Stephen Hemminger

On Thu, 9 Feb 2017 21:14:25 +0100 (CET)
Thomas Gleixner  wrote:

> On Thu, 9 Feb 2017, Stephen Hemminger wrote:
> 
> > The actual code looks fine, but the style police will not like you.
> > { should be at start of line on functions.
> > And #else should be at start of line,
> > 
> > But maybe this was just more of exchange mangling the mail.  
> 
> Looks like.
> 
> > +struct ms_hyperv_tsc_page *hv_get_tsc_page(void) {
> > +   return tsc_pg;
> > +}
> > +  
> 
> That's how it reads in a proper mail client connected to a proper mail
> server:
> 
> > +struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
> > +{
> > +   return tsc_pg;
> > +}  
> 
> :)


Yup, it looks like the mail server is trying to be "helpful" by eliminating
extra white space.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

2017-02-09 Thread Stephen Hemminger

On Thu, 9 Feb 2017 14:55:50 -0800
Andy Lutomirski  wrote:

> On Thu, Feb 9, 2017 at 12:45 PM, KY Srinivasan  wrote:
> >
> >  
> >> -Original Message-
> >> From: Thomas Gleixner [mailto:t...@linutronix.de]
> >> Sent: Thursday, February 9, 2017 9:08 AM
> >> To: Vitaly Kuznetsov 
> >> Cc: x...@kernel.org; Andy Lutomirski ; Ingo Molnar
> >> ; H. Peter Anvin ; KY Srinivasan
> >> ; Haiyang Zhang ; Stephen
> >> Hemminger ; Dexuan Cui
> >> ; linux-ker...@vger.kernel.org;
> >> de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org
> >> Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read
> >> method
> >>
> >> On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote:  
> >> > +#ifdef CONFIG_HYPERV_TSCPAGE
> >> > +static notrace u64 vread_hvclock(int *mode)
> >> > +{
> >> > +   const struct ms_hyperv_tsc_page *tsc_pg =
> >> > +   (const struct ms_hyperv_tsc_page *)&hvclock_page;
> >> > +   u64 sequence, scale, offset, current_tick, cur_tsc;
> >> > +
> >> > +   while (1) {
> >> > +   sequence = READ_ONCE(tsc_pg->tsc_sequence);
> >> > +   if (!sequence)
> >> > +   break;
> >> > +
> >> > +   scale = READ_ONCE(tsc_pg->tsc_scale);
> >> > +   offset = READ_ONCE(tsc_pg->tsc_offset);
> >> > +   rdtscll(cur_tsc);
> >> > +
> >> > +   current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
> >> > +
> >> > +   if (READ_ONCE(tsc_pg->tsc_sequence) == sequence)
> >> > +   return current_tick;  
> >>
> >> That sequence stuff lacks still a sensible explanation. It's fundamentally
> >> different from the sequence counting we do in the kernel, so documentation
> >> for it is really required.  
> >
> > The host is updating multiple fields in this shared TSC page and the
> > sequence number is used to ensure that the guest sees a consistent set
> > of values published. If I remember correctly, Xen has a similar mechanism.
> 
> So what's the actual protocol?  When the hypervisor updates the page,
> does it freeze all guest cpus?  If not, how does it maintain
> atomicity?

The protocol looks a lot like the Linux seqlock, but seqlock has an extra
protection which is missing here.

The host needs to update the sequence number twice in order to guarantee ordering.
Otherwise it is possible for host and guest to race.

Host                            Guest
Write offset
Write scale
Set tsc_sequence = N
                                Read sequence = N
                                Read scale
Write scale
Write offset
                                Read offset
                                Check sequence == N
Set tsc_sequence = N + 1

Looks like the current host side protocol is wrong.

The solution that Andi Kleen invented, and that I used in seqlock, was for the
writer to update the sequence at the start and end of a transaction. If the
sequence number is odd, then the reader knows it is looking at stale data.
Host                                        Guest
Write offset
Write scale
Set tsc_sequence = N (end of transaction)
                                            Read sequence = N
                                            Spin until sequence is even (N is even)
                                            Read scale
Set tsc_sequence += 1
Write scale
Write offset
                                            Read offset
                                            Check sequence == N? (fails, it is N + 1)
Set tsc_sequence += 1 (end of transaction)
                                            Read sequence = N + 2
                                            Spin until sequence is even (i.e. N + 2)
                                            Read scale
                                            Read offset
                                            Check sequence == N + 2? (yes, ok)

Also it is faster to just read scale and offset within this loop, and defer
reading the TSC and doing the multiply until after scale/offset have been
acquired.
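
The odd/even scheme described above can be sketched in userspace C11. This is only an illustration under stated assumptions: struct shared_clock and the writer_*/reader_try names are hypothetical, C11 atomics and fences stand in for the kernel's barriers, and nothing here claims to describe what the Hyper-V host actually does.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared area: seq is odd while an update is in progress. */
struct shared_clock {
	_Atomic uint32_t seq;
	_Atomic uint64_t scale;
	_Atomic uint64_t offset;
};

/* Start of transaction: make the sequence odd before touching data. */
static void writer_begin(struct shared_clock *c)
{
	uint32_t s = atomic_load_explicit(&c->seq, memory_order_relaxed);

	atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

/* End of transaction: make the sequence even again, after the data. */
static void writer_end(struct shared_clock *c)
{
	uint32_t s = atomic_load_explicit(&c->seq, memory_order_relaxed);

	atomic_store_explicit(&c->seq, s + 1, memory_order_release);
}

static void writer_update(struct shared_clock *c, uint64_t scale,
			  uint64_t offset)
{
	writer_begin(c);
	atomic_store_explicit(&c->scale, scale, memory_order_relaxed);
	atomic_store_explicit(&c->offset, offset, memory_order_relaxed);
	writer_end(c);
}

/*
 * One read attempt. Returns false when a writer was active (odd
 * sequence) or the sequence moved while reading; the caller retries.
 */
static bool reader_try(struct shared_clock *c, uint64_t *scale,
		       uint64_t *offset)
{
	uint32_t start = atomic_load_explicit(&c->seq, memory_order_acquire);

	if (start & 1)
		return false;

	*scale = atomic_load_explicit(&c->scale, memory_order_relaxed);
	*offset = atomic_load_explicit(&c->offset, memory_order_relaxed);

	/* Order the data loads before the sequence re-read. */
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&c->seq, memory_order_relaxed) == start;
}
```

A real reader would call reader_try() in a loop until it succeeds; it is split out as a single attempt here so the odd-sequence case can be exercised without a second thread.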


RE: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

2017-02-09 Thread Stephen Hemminger via Virtualization

Why not use existing seqlock's?

-Original Message-
From: Thomas Gleixner [mailto:t...@linutronix.de] 
Sent: Thursday, February 9, 2017 9:08 AM
To: Vitaly Kuznetsov 
Cc: x...@kernel.org; Andy Lutomirski ; Ingo Molnar 
; H. Peter Anvin ; KY Srinivasan 
; Haiyang Zhang ; Stephen Hemminger 
; Dexuan Cui ; 
linux-ker...@vger.kernel.org; de...@linuxdriverproject.org; 
virtualization@lists.linux-foundation.org
Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote:
> +#ifdef CONFIG_HYPERV_TSCPAGE
> +static notrace u64 vread_hvclock(int *mode) {
> + const struct ms_hyperv_tsc_page *tsc_pg =
> + (const struct ms_hyperv_tsc_page *)&hvclock_page;
> + u64 sequence, scale, offset, current_tick, cur_tsc;
> +
> + while (1) {
> + sequence = READ_ONCE(tsc_pg->tsc_sequence);
> + if (!sequence)
> + break;
> +
> + scale = READ_ONCE(tsc_pg->tsc_scale);
> + offset = READ_ONCE(tsc_pg->tsc_offset);
> + rdtscll(cur_tsc);
> +
> + current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
> +
> + if (READ_ONCE(tsc_pg->tsc_sequence) == sequence)
> + return current_tick;

That sequence stuff lacks still a sensible explanation. It's fundamentally 
different from the sequence counting we do in the kernel, so documentation for 
it is really required.

Thanks,

tglx

RE: [PATCH 1/2] hyperv: implement hv_get_tsc_page()

2017-02-09 Thread Stephen Hemminger via Virtualization

The actual code looks fine, but the style police will not like you.
{ should be at start of line on functions.
And #else should be at start of line,

But maybe this was just more of exchange mangling the mail.

-Original Message-
From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com] 
Sent: Thursday, February 9, 2017 6:11 AM
To: x...@kernel.org; Andy Lutomirski 
Cc: Thomas Gleixner ; Ingo Molnar ; H. 
Peter Anvin ; KY Srinivasan ; Haiyang Zhang 
; Stephen Hemminger ; Dexuan 
Cui ; linux-ker...@vger.kernel.org; 
de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org
Subject: [PATCH 1/2] hyperv: implement hv_get_tsc_page()

To use Hyper-V TSC page clocksource from vDSO we need to make tsc_pg available. 
Implement hv_get_tsc_page() and add CONFIG_HYPERV_TSCPAGE to make #ifdef-s 
simple.

Signed-off-by: Vitaly Kuznetsov 
---
 arch/x86/hyperv/hv_init.c   | 9 +++--
 arch/x86/include/asm/mshyperv.h | 8 
 drivers/hv/Kconfig  | 3 +++
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index 
b371d0e..0ce8485 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -27,10 +27,15 @@
 #include 
 
 
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_HYPERV_TSCPAGE
 
 static struct ms_hyperv_tsc_page *tsc_pg;
 
+struct ms_hyperv_tsc_page *hv_get_tsc_page(void) {
+   return tsc_pg;
+}
+
 static u64 read_hv_clock_tsc(struct clocksource *arg)  {
u64 current_tick;
@@ -136,7 +141,7 @@ void hyperv_init(void)
/*
 * Register Hyper-V specific clocksource.
 */
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_HYPERV_TSCPAGE
if (ms_hyperv.features & HV_X64_MSR_REFERENCE_TSC_AVAILABLE) {
union hv_x64_msr_hypercall_contents tsc_msr;
 
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h 
index f8dc370..14dd92c 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -173,4 +173,12 @@ void hyperv_report_panic(struct pt_regs *regs);  bool 
hv_is_hypercall_page_setup(void);  void hyperv_cleanup(void);  #endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+struct ms_hyperv_tsc_page *hv_get_tsc_page(void); #else static inline 
+struct ms_hyperv_tsc_page *hv_get_tsc_page(void) {
+   return NULL;
+}
+#endif
 #endif
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig index 0403b51..c29cd53 
100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -7,6 +7,9 @@ config HYPERV
  Select this option to run Linux as a Hyper-V client operating
  system.
 
+config HYPERV_TSCPAGE
+   def_bool HYPERV && X86_64
+
 config HYPERV_UTILS
tristate "Microsoft Hyper-V Utilities driver"
depends on HYPERV && CONNECTOR && NLS
--
2.9.3


RE: [PATCH 05/14] netvsc: remove no longer needed receive staging buffers

2017-02-06 Thread Stephen Hemminger via Virtualization

The netvsc part is already in net-next.  This patch is not needed.
The part that removes the per-channel state can be in another patch.


[PATCH 12/14] vmbus: expose hv_begin/end_read

2017-02-01 Thread Stephen Hemminger

In order to implement NAPI in netvsc, the driver needs access to
control the host interrupt mask.

Signed-off-by: Stephen Hemminger 
---
 drivers/hv/hyperv_vmbus.h |  4 
 drivers/hv/ring_buffer.c  | 20 
 include/linux/hyperv.h| 30 ++
 3 files changed, 30 insertions(+), 24 deletions(-)

diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 6a9b54677218..e15a130de3c9 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -292,10 +292,6 @@ int hv_ringbuffer_read(struct vmbus_channel *channel,
 void hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info,
struct hv_ring_buffer_debug_info *debug_info);
 
-void hv_begin_read(struct hv_ring_buffer_info *rbi);
-
-u32 hv_end_read(struct hv_ring_buffer_info *rbi);
-
 /*
  * Maximum channels is determined by the size of the interrupt page
  * which is PAGE_SIZE. 1/2 of PAGE_SIZE is for send endpoint interrupt
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 146fd8ab2a2a..47ab69089115 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -32,26 +32,6 @@
 
 #include "hyperv_vmbus.h"
 
-void hv_begin_read(struct hv_ring_buffer_info *rbi)
-{
-   rbi->ring_buffer->interrupt_mask = 1;
-   virt_mb();
-}
-
-u32 hv_end_read(struct hv_ring_buffer_info *rbi)
-{
-
-   rbi->ring_buffer->interrupt_mask = 0;
-   virt_mb();
-
-   /*
-* Now check to see if the ring buffer is still empty.
-* If it is not, we raced and we need to process new
-* incoming messages.
-*/
-   return hv_get_bytes_to_read(rbi);
-}
-
 /*
  * When we write to the ring buffer, check if the host needs to
  * be signaled. Here is the details of this protocol:
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 9b0165b11c5c..dc50997a3fba 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1473,6 +1473,36 @@ static inline  void hv_signal_on_read(struct vmbus_channel *channel)
 }
 
 /*
+ * Mask off host interrupt callback notifications
+ */
+static inline void hv_begin_read(struct hv_ring_buffer_info *rbi)
+{
+   rbi->ring_buffer->interrupt_mask = 1;
+
+   /* make sure mask update is not reordered */
+   virt_mb();
+}
+
+/*
+ * Re-enable host callback and return number of outstanding bytes
+ */
+static inline u32 hv_end_read(struct hv_ring_buffer_info *rbi)
+{
+
+   rbi->ring_buffer->interrupt_mask = 0;
+
+   /* make sure mask update is not reordered */
+   virt_mb();
+
+   /*
+* Now check to see if the ring buffer is still empty.
+* If it is not, we raced and we need to process new
+* incoming messages.
+*/
+   return hv_get_bytes_to_read(rbi);
+}
+
+/*
  * An API to support in-place processing of incoming VMBUS packets.
  */
 #define VMBUS_PKT_TRAILER  8
-- 
2.11.0
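
For context, a poll routine might pair these two helpers roughly as follows. This is a userspace sketch with made-up names, not the netvsc code: struct ring, its pending counter and poll_ring() are assumptions for illustration, and a C11 sequentially consistent fence stands in for virt_mb().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical ring state: a mask flag and a count of unread bytes. */
struct ring {
	_Atomic uint32_t interrupt_mask;
	_Atomic uint32_t pending;
};

/* Mask off host notifications while the ring is being drained. */
static void begin_read(struct ring *r)
{
	atomic_store_explicit(&r->interrupt_mask, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* stand-in for virt_mb() */
}

/* Re-enable notifications and report bytes that arrived meanwhile. */
static uint32_t end_read(struct ring *r)
{
	atomic_store_explicit(&r->interrupt_mask, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* stand-in for virt_mb() */
	return atomic_load_explicit(&r->pending, memory_order_relaxed);
}

/*
 * Drain the ring. If data arrived between the last read and the
 * unmask, end_read() returns non-zero and the loop goes around again
 * instead of waiting for a notification that may never come.
 */
static uint32_t poll_ring(struct ring *r)
{
	uint32_t done = 0;

	do {
		begin_read(r);
		done += atomic_exchange_explicit(&r->pending, 0,
						 memory_order_acquire);
	} while (end_read(r) != 0);

	return done;
}
```

The recheck in end_read() is the point of the return value: it closes the window between draining the ring and unmasking the interrupt.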


[PATCH 13/14] vmbus: constify parameters where possible

2017-02-01 Thread Stephen Hemminger

Functions that just query state of ring buffer can have parameters
marked const.

Signed-off-by: Stephen Hemminger 
---
 drivers/hv/hyperv_vmbus.h |  6 +++---
 drivers/hv/ring_buffer.c  | 22 ++
 include/linux/hyperv.h| 12 ++--
 3 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index e15a130de3c9..884f83bba1ab 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -283,14 +283,14 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
 void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info);
 
 int hv_ringbuffer_write(struct vmbus_channel *channel,
-   struct kvec *kv_list, u32 kv_count);
+   const struct kvec *kv_list, u32 kv_count);
 
 int hv_ringbuffer_read(struct vmbus_channel *channel,
   void *buffer, u32 buflen, u32 *buffer_actual_len,
   u64 *requestid, bool raw);
 
-void hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info,
-   struct hv_ring_buffer_debug_info *debug_info);
+void hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info *ring_info,
+struct hv_ring_buffer_debug_info *debug_info);
 
 /*
  * Maximum channels is determined by the size of the interrupt page
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 47ab69089115..ee3e488d9dee 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -96,11 +96,9 @@ hv_set_next_write_location(struct hv_ring_buffer_info *ring_info,
 
 /* Get the next read location for the specified ring buffer. */
 static inline u32
-hv_get_next_read_location(struct hv_ring_buffer_info *ring_info)
+hv_get_next_read_location(const struct hv_ring_buffer_info *ring_info)
 {
-   u32 next = ring_info->ring_buffer->read_index;
-
-   return next;
+   return ring_info->ring_buffer->read_index;
 }
 
 /*
@@ -108,8 +106,8 @@ hv_get_next_read_location(struct hv_ring_buffer_info *ring_info)
  * This allows the caller to skip.
  */
 static inline u32
-hv_get_next_readlocation_withoffset(struct hv_ring_buffer_info *ring_info,
-u32 offset)
+hv_get_next_readlocation_withoffset(const struct hv_ring_buffer_info *ring_info,
+   u32 offset)
 {
u32 next = ring_info->ring_buffer->read_index;
 
@@ -130,7 +128,7 @@ hv_set_next_read_location(struct hv_ring_buffer_info *ring_info,
 
 /* Get the size of the ring buffer. */
 static inline u32
-hv_get_ring_buffersize(struct hv_ring_buffer_info *ring_info)
+hv_get_ring_buffersize(const struct hv_ring_buffer_info *ring_info)
 {
return ring_info->ring_datasize;
 }
@@ -147,7 +145,7 @@ hv_get_ring_bufferindices(struct hv_ring_buffer_info *ring_info)
  * Assume there is enough room. Handles wrap-around in src case only!!
  */
 static u32 hv_copyfrom_ringbuffer(
-   struct hv_ring_buffer_info  *ring_info,
+   const struct hv_ring_buffer_info *ring_info,
void*dest,
u32 destlen,
u32 start_read_offset)
@@ -171,7 +169,7 @@ static u32 hv_copyfrom_ringbuffer(
 static u32 hv_copyto_ringbuffer(
struct hv_ring_buffer_info  *ring_info,
u32 start_write_offset,
-   void*src,
+   const void  *src,
u32 srclen)
 {
void *ring_buffer = hv_get_ring_buffer(ring_info);
@@ -186,8 +184,8 @@ static u32 hv_copyto_ringbuffer(
 }
 
 /* Get various debug metrics for the specified ring buffer. */
-void hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info,
-   struct hv_ring_buffer_debug_info *debug_info)
+void hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info *ring_info,
+struct hv_ring_buffer_debug_info *debug_info)
 {
u32 bytes_avail_towrite;
u32 bytes_avail_toread;
@@ -264,7 +262,7 @@ void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info)
 
 /* Write to the ring buffer. */
 int hv_ringbuffer_write(struct vmbus_channel *channel,
-   struct kvec *kv_list, u32 kv_count)
+   const struct kvec *kv_list, u32 kv_count)
 {
int i = 0;
u32 bytes_avail_towrite;
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index dc50997a3fba..32a9cbc66b65 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -137,8 +137,8 @@ struct hv_ring_buffer_info {
  * for the specified ring buffer
  */
 static inline void
-hv_get_ringbuffer_availbytes(struct hv_ring_buffer_info *rbi,
- u32 *read, u32 *write)
+hv_get_ringbuffer_availbytes(const struct hv_ring_buffer_info *rbi,
+

[PATCH 14/14] vmbus: replace modulus operation with subtraction

2017-02-01 Thread Stephen Hemminger

It takes fewer clock cycles to check for ring wrap and subtract than to
do a modulus instruction.

Signed-off-by: Stephen Hemminger 
---
 drivers/hv/ring_buffer.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index ee3e488d9dee..8ab6298fd5ae 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -112,7 +112,8 @@ hv_get_next_readlocation_withoffset(const struct hv_ring_buffer_info *ring_info,
u32 next = ring_info->ring_buffer->read_index;
 
next += offset;
-   next %= ring_info->ring_datasize;
+   if (next >= ring_info->ring_datasize)
+   next -= ring_info->ring_datasize;
 
return next;
 }
@@ -156,7 +157,8 @@ static u32 hv_copyfrom_ringbuffer(
memcpy(dest, ring_buffer + start_read_offset, destlen);
 
start_read_offset += destlen;
-   start_read_offset %= ring_buffer_size;
+   if (start_read_offset >= ring_buffer_size)
+   start_read_offset -= ring_buffer_size;
 
return start_read_offset;
 }
@@ -178,7 +180,8 @@ static u32 hv_copyto_ringbuffer(
memcpy(ring_buffer + start_write_offset, src, srclen);
 
start_write_offset += srclen;
-   start_write_offset %= ring_buffer_size;
+   if (start_write_offset >= ring_buffer_size)
+   start_write_offset -= ring_buffer_size;
 
return start_write_offset;
 }
-- 
2.11.0
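
The substitution is only equivalent to the modulus while a single subtraction is enough to wrap, which the sketch below makes explicit. It is a userspace illustration, not the driver code: ring_advance() is a hypothetical helper, and the preconditions in the assert are assumptions about how the ring offsets are used.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Advance a ring index by len and wrap with a compare and subtract.
 * Equivalent to (index + len) % size only while index < size and
 * len <= size; size is also kept small enough that index + len
 * cannot overflow a u32.
 */
static uint32_t ring_advance(uint32_t index, uint32_t len, uint32_t size)
{
	assert(size <= UINT32_MAX / 2 && index < size && len <= size);

	index += len;
	if (index >= size)
		index -= size;
	return index;
}
```

If a caller could ever advance by more than the ring size in one step, the result would stay out of range, whereas the modulus would still wrap it.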


[PATCH 11/14] vmbus: remove conditional locking of vmbus_write

2017-02-01 Thread Stephen Hemminger

All current usage of vmbus write uses the acquire_lock flag, therefore
having it be optional is unnecessary. This also fixes a sparse warning
since sparse doesn't like when a function has conditional locking.

Signed-off-by: Stephen Hemminger 
---
 drivers/hv/channel.c  | 13 -
 drivers/hv/channel_mgmt.c |  1 -
 drivers/hv/hyperv_vmbus.h |  3 +--
 drivers/hv/ring_buffer.c  | 11 ---
 include/linux/hyperv.h| 15 ---
 5 files changed, 9 insertions(+), 34 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 18cc1c78260d..81a80c82f1bd 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -651,7 +651,6 @@ int vmbus_sendpacket_ctl(struct vmbus_channel *channel, void *buffer,
u32 packetlen_aligned = ALIGN(packetlen, sizeof(u64));
struct kvec bufferlist[3];
u64 aligned_data = 0;
-   bool lock = channel->acquire_ring_lock;
int num_vecs = ((bufferlen != 0) ? 3 : 1);
 
 
@@ -670,7 +669,7 @@ int vmbus_sendpacket_ctl(struct vmbus_channel *channel, void *buffer,
bufferlist[2].iov_base = &aligned_data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, num_vecs, lock);
+   return hv_ringbuffer_write(channel, bufferlist, num_vecs);
 }
 EXPORT_SYMBOL(vmbus_sendpacket_ctl);
 
@@ -716,12 +715,10 @@ int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel *channel,
u32 packetlen_aligned;
struct kvec bufferlist[3];
u64 aligned_data = 0;
-   bool lock = channel->acquire_ring_lock;
 
if (pagecount > MAX_PAGE_BUFFER_COUNT)
return -EINVAL;
 
-
/*
 * Adjust the size down since vmbus_channel_packet_page_buffer is the
 * largest size we support
@@ -753,7 +750,7 @@ int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel *channel,
bufferlist[2].iov_base = &aligned_data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, 3, lock);
+   return hv_ringbuffer_write(channel, bufferlist, 3);
 }
 EXPORT_SYMBOL_GPL(vmbus_sendpacket_pagebuffer_ctl);
 
@@ -789,7 +786,6 @@ int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel,
u32 packetlen_aligned;
struct kvec bufferlist[3];
u64 aligned_data = 0;
-   bool lock = channel->acquire_ring_lock;
 
packetlen = desc_size + bufferlen;
packetlen_aligned = ALIGN(packetlen, sizeof(u64));
@@ -809,7 +805,7 @@ int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel,
bufferlist[2].iov_base = &aligned_data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, 3, lock);
+   return hv_ringbuffer_write(channel, bufferlist, 3);
 }
 EXPORT_SYMBOL_GPL(vmbus_sendpacket_mpb_desc);
 
@@ -827,7 +823,6 @@ int vmbus_sendpacket_multipagebuffer(struct vmbus_channel *channel,
u32 packetlen_aligned;
struct kvec bufferlist[3];
u64 aligned_data = 0;
-   bool lock = channel->acquire_ring_lock;
u32 pfncount = NUM_PAGES_SPANNED(multi_pagebuffer->offset,
 multi_pagebuffer->len);
 
@@ -866,7 +861,7 @@ int vmbus_sendpacket_multipagebuffer(struct vmbus_channel *channel,
bufferlist[2].iov_base = &aligned_data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, 3, lock);
+   return hv_ringbuffer_write(channel, bufferlist, 3);
 }
 EXPORT_SYMBOL_GPL(vmbus_sendpacket_multipagebuffer);
 
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index b2bb5aafaa2f..f33465d78a02 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -332,7 +332,6 @@ static struct vmbus_channel *alloc_channel(void)
if (!channel)
return NULL;
 
-   channel->acquire_ring_lock = true;
spin_lock_init(&channel->inbound_lock);
spin_lock_init(&channel->lock);
 
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 558a798c407c..6a9b54677218 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -283,8 +283,7 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
 void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info);
 
 int hv_ringbuffer_write(struct vmbus_channel *channel,
-   struct kvec *kv_list,
-   u32 kv_count, bool lock);
+   struct kvec *kv_list, u32 kv_count);
 
 int hv_ringbuffer_read(struct vmbus_channel *channel,
   void *buffer, u32 buflen, u32 *buffer_actual_len,
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 30ca55aefd24..146fd8ab2a2a 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -284,7 +284

[PATCH 10/14] vmbus: add direct isr callback mode

2017-02-01 Thread Stephen Hemminger

Change the simple boolean batched_reading into a tri-value callback mode.
For future NAPI support in the netvsc driver, the callback needs to
occur directly in the interrupt handler.

Batched mode is also changed to disable host interrupts immediately
in the interrupt routine (to avoid unnecessary host signals), and the
tasklet is rescheduled if more data is detected.

Signed-off-by: Stephen Hemminger 
---
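
A minimal userspace sketch, not part of the patch, of how a three-way callback mode could be dispatched from the interrupt path. The interrupt-side hunk is cut off in this archive, so the dispatch below is an assumption based on the commit message only. All toy_ names are hypothetical; irq_masked stands in for hv_begin_read() and tasklet_pending for tasklet_schedule().

```c
#include <stddef.h>

enum toy_call_mode {
	TOY_CALL_BATCHED,	/* mask host interrupts, then defer to tasklet */
	TOY_CALL_DIRECT,	/* defer to tasklet without masking */
	TOY_CALL_ISR,		/* run the callback in the interrupt handler */
};

struct toy_channel {
	enum toy_call_mode mode;
	void (*callback)(void *ctx);
	void *ctx;
	int irq_masked;		/* stand-in for hv_begin_read() */
	int tasklet_pending;	/* stand-in for tasklet_schedule() */
};

/* Called once per channel that has an event pending. */
static void toy_interrupt(struct toy_channel *ch)
{
	switch (ch->mode) {
	case TOY_CALL_ISR:
		if (ch->callback != NULL)
			ch->callback(ch->ctx);
		break;
	case TOY_CALL_BATCHED:
		ch->irq_masked = 1;
		/* fall through */
	case TOY_CALL_DIRECT:
		ch->tasklet_pending = 1;
		break;
	}
}

/* Example callback: count how often it ran. */
static void toy_count(void *ctx)
{
	++*(int *)ctx;
}
```

Only the ISR mode runs the callback immediately. The other two modes leave it to the deferred handler, and batched mode additionally masks further host signals until the handler finds the ring empty.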
 drivers/hv/channel_mgmt.c|  7 ---
 drivers/hv/connection.c  | 27 ---
 drivers/hv/hv_util.c |  3 +--
 drivers/hv/vmbus_drv.c   | 26 --
 drivers/uio/uio_hv_generic.c |  2 +-
 include/linux/hyperv.h   | 31 +--
 6 files changed, 55 insertions(+), 41 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 2f6270d76b79..b2bb5aafaa2f 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -820,13 +820,6 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
}
 
/*
-* By default we setup state to enable batched
-* reading. A specific service can choose to
-* disable this prior to opening the channel.
-*/
-   newchannel->batched_reading = true;
-
-   /*
 * Setup state for signalling the host.
 */
newchannel->sig_event = (struct hv_input_signal_event *)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 27e72dc07e12..a8366fec1458 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -300,9 +300,7 @@ struct vmbus_channel *relid2channel(u32 relid)
 void vmbus_on_event(unsigned long data)
 {
struct vmbus_channel *channel = (void *) data;
-   void *arg;
-   bool read_state;
-   u32 bytes_to_read;
+   void (*callback_fn)(void *);
 
/*
 * A channel once created is persistent even when there
@@ -312,9 +310,13 @@ void vmbus_on_event(unsigned long data)
 * Thus, checking and invoking the driver specific callback takes
 * care of orderly unloading of the driver.
 */
-   if (channel->onchannel_callback != NULL) {
-   arg = channel->channel_callback_context;
-   read_state = channel->batched_reading;
+   callback_fn = READ_ONCE(channel->onchannel_callback);
+   if (unlikely(callback_fn == NULL))
+   return;
+
+   (*callback_fn)(channel->channel_callback_context);
+
+   if (channel->callback_mode == HV_CALL_BATCHED) {
/*
 * This callback reads the messages sent by the host.
 * We can optimize host to guest signaling by ensuring:
@@ -326,16 +328,11 @@ void vmbus_on_event(unsigned long data)
 *state is set we check to see if additional packets are
 *available to read. In this case we repeat the process.
 */
+   if (hv_end_read(&channel->inbound) != 0) {
+   hv_begin_read(&channel->inbound);
 
-   do {
-   if (read_state)
-   hv_begin_read(&channel->inbound);
-   channel->onchannel_callback(arg);
-   if (read_state)
-   bytes_to_read = hv_end_read(&channel->inbound);
-   else
-   bytes_to_read = 0;
-   } while (read_state && (bytes_to_read != 0));
+   tasklet_schedule(&channel->callback_event);
+   }
}
 }
 
diff --git a/drivers/hv/hv_util.c b/drivers/hv/hv_util.c
index d42ede78a9dd..8410191b4992 100644
--- a/drivers/hv/hv_util.c
+++ b/drivers/hv/hv_util.c
@@ -409,8 +409,7 @@ static int util_probe(struct hv_device *dev,
 * Turn off batched reading for all util drivers before we open the
 * channel.
 */
-
-   set_channel_read_state(dev->channel, false);
+   set_channel_read_mode(dev->channel, HV_CALL_DIRECT);
 
hv_set_drvdata(dev, srv);
 
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index eaf1a10b0245..f7f6b9144b07 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -887,6 +887,18 @@ void vmbus_on_msg_dpc(unsigned long data)
 
 
 /*
+ * Direct callback for channels using other deferred processing
+ */
+static void vmbus_channel_isr(struct vmbus_channel *channel)
+{
+   void (*callback_fn)(void *);
+
+   callback_fn = READ_ONCE(channel->onchannel_callback);
+   if (likely(callback_fn != NULL))
+   (*callback_fn)(channel->channel_callback_context);
+}
+
+/*
  * Schedule all channels with events pending
  */
 static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
@@ -927,9 +939,19 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
 
/* Find channel based on rel

[PATCH 09/14] vmbus: change to per channel tasklet

2017-02-01 Thread Stephen Hemminger

Make the event handling tasklet per channel rather than per-cpu.
This allows for better fairness when getting lots of data on the same
cpu.

Signed-off-by: Stephen Hemminger 
---
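
A minimal userspace sketch, not part of the patch, of the per-channel arrangement: each channel embeds its own deferred-work item, and the handler recovers the channel from the opaque data word passed at init time. The toy_ types are hypothetical stand-ins for tasklet_struct and vmbus_channel, and uintptr_t is used here where the kernel tasklet API takes unsigned long.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_tasklet {
	void (*func)(uintptr_t data);
	uintptr_t data;
	int pending;
};

struct toy_chan {
	unsigned int events_handled;
	struct toy_tasklet callback_event;	/* one per channel */
};

static void toy_tasklet_init(struct toy_tasklet *t,
			     void (*func)(uintptr_t), uintptr_t data)
{
	t->func = func;
	t->data = data;
	t->pending = 0;
}

static void toy_tasklet_schedule(struct toy_tasklet *t)
{
	t->pending = 1;
}

/* Run the handler once if it was scheduled. */
static void toy_tasklet_run(struct toy_tasklet *t)
{
	if (t->pending) {
		t->pending = 0;
		t->func(t->data);
	}
}

/* Handler: the data word is the owning channel, so no lookup is needed. */
static void toy_on_event(uintptr_t data)
{
	struct toy_chan *chan = (struct toy_chan *)data;

	chan->events_handled++;
}
```

Because the handler is handed its channel directly, scheduling one channel does not touch any other channel that shares the CPU.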
 drivers/hv/channel.c  |  2 +-
 drivers/hv/channel_mgmt.c | 16 +-
 drivers/hv/connection.c   | 78 ++-
 drivers/hv/hv.c   |  2 --
 drivers/hv/hyperv_vmbus.h |  1 -
 drivers/hv/vmbus_drv.c| 58 ++-
 include/linux/hyperv.h|  3 +-
 7 files changed, 64 insertions(+), 96 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 789c75f6df26..18cc1c78260d 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -530,7 +530,7 @@ static int vmbus_close_internal(struct vmbus_channel *channel)
int ret;
 
/*
-* process_chn_event(), running in the tasklet, can race
+* vmbus_on_event(), running in the tasklet, can race
 * with vmbus_close_internal() in the case of SMP guest, e.g., when
 * the former is accessing channel->inbound.ring_buffer, the latter
 * could be freeing the ring_buffer pages.
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 579ad2560a39..2f6270d76b79 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -339,6 +339,9 @@ static struct vmbus_channel *alloc_channel(void)
INIT_LIST_HEAD(&channel->sc_list);
INIT_LIST_HEAD(&channel->percpu_list);
 
+   tasklet_init(&channel->callback_event,
+vmbus_on_event, (unsigned long)channel);
+
return channel;
 }
 
@@ -347,6 +350,7 @@ static struct vmbus_channel *alloc_channel(void)
  */
 static void free_channel(struct vmbus_channel *channel)
 {
+   tasklet_kill(&channel->callback_event);
kfree(channel);
 }
 
@@ -380,21 +384,15 @@ static void vmbus_release_relid(u32 relid)
 
 void hv_event_tasklet_disable(struct vmbus_channel *channel)
 {
-   struct hv_per_cpu_context *hv_cpu;
-
-   hv_cpu = per_cpu_ptr(hv_context.cpu_context, channel->target_cpu);
-   tasklet_disable(&hv_cpu->event_dpc);
+   tasklet_disable(&channel->callback_event);
 }
 
 void hv_event_tasklet_enable(struct vmbus_channel *channel)
 {
-   struct hv_per_cpu_context *hv_cpu;
-
-   hv_cpu = per_cpu_ptr(hv_context.cpu_context, channel->target_cpu);
-   tasklet_enable(&hv_cpu->event_dpc);
+   tasklet_enable(&channel->callback_event);
 
/* In case there is any pending event */
-   tasklet_schedule(&hv_cpu->event_dpc);
+   tasklet_schedule(&channel->callback_event);
 }
 
 void hv_process_channel_removal(struct vmbus_channel *channel, u32 relid)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 158f12823baf..27e72dc07e12 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -260,29 +260,6 @@ void vmbus_disconnect(void)
 }
 
 /*
- * Map the given relid to the corresponding channel based on the
- * per-cpu list of channels that have been affinitized to this CPU.
- * This will be used in the channel callback path as we can do this
- * mapping in a lock-free fashion.
- */
-static struct vmbus_channel *pcpu_relid2channel(u32 relid)
-{
-   struct hv_per_cpu_context *hv_cpu
-   = this_cpu_ptr(hv_context.cpu_context);
-   struct vmbus_channel *found_channel = NULL;
-   struct vmbus_channel *channel;
-
-   list_for_each_entry(channel, &hv_cpu->chan_list, percpu_list) {
-   if (channel->offermsg.child_relid == relid) {
-   found_channel = channel;
-   break;
-   }
-   }
-
-   return found_channel;
-}
-
-/*
  * relid2channel - Get the channel object given its
  * child relative id (ie channel id)
  */
@@ -318,25 +295,16 @@ struct vmbus_channel *relid2channel(u32 relid)
 }
 
 /*
- * process_chn_event - Process a channel event notification
+ * vmbus_on_event - Process a channel event notification
  */
-static void process_chn_event(u32 relid)
+void vmbus_on_event(unsigned long data)
 {
-   struct vmbus_channel *channel;
+   struct vmbus_channel *channel = (void *) data;
void *arg;
bool read_state;
u32 bytes_to_read;
 
/*
-* Find the channel based on this relid and invokes the
-* channel callback to process the event
-*/
-   channel = pcpu_relid2channel(relid);
-
-   if (!channel)
-   return;
-
-   /*
 * A channel once created is persistent even when there
 * is no driver handling the device. An unloading driver
 * sets the onchannel_callback to NULL on the same CPU
@@ -344,7 +312,6 @@ static void process_chn_event(u32 relid)
 * Thus, checking and invoking the driver specific callback takes
 * care of orderly unloading of the driver.
 */
-
if (chann

[PATCH 08/14] vmbus: put related per-cpu variable together

2017-02-01 Thread Stephen Hemminger

The hv_context structure had several arrays that were indexed per-cpu,
and it allocated small structures (tasklet_struct) individually.
Instead, use a single per-cpu array.

Signed-off-by: Stephen Hemminger 
---
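
A minimal userspace sketch, not part of the patch, of the layout change. It assumes a fixed CPU count and a plain array, whereas the diff itself uses per_cpu_ptr() and this_cpu_ptr() on a per-cpu allocation. The toy_ names and the chan_count field are hypothetical.

```c
#include <stddef.h>

#define TOY_NR_CPUS 4

/* Before: one array per field, each indexed by CPU number. */
struct toy_context_old {
	void *synic_message_page[TOY_NR_CPUS];
	void *event_dpc[TOY_NR_CPUS];
	int chan_count[TOY_NR_CPUS];
};

/* After: everything that belongs to one CPU sits in one structure. */
struct toy_per_cpu {
	void *synic_message_page;
	void *event_dpc;
	int chan_count;
};

struct toy_context {
	struct toy_per_cpu cpu_context[TOY_NR_CPUS];
};

/* Return the state block for one CPU, or NULL if cpu is out of range. */
static struct toy_per_cpu *toy_cpu_ptr(struct toy_context *ctx,
				       unsigned int cpu)
{
	return cpu < TOY_NR_CPUS ? &ctx->cpu_context[cpu] : NULL;
}
```

A caller fetches the block once and then reads fields from it, instead of indexing several separate arrays with the same CPU number.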
 drivers/hv/channel_mgmt.c |  35 -
 drivers/hv/connection.c   |  20 ---
 drivers/hv/hv.c   | 130 --
 drivers/hv/hyperv_vmbus.h |  53 +++
 drivers/hv/vmbus_drv.c|  39 --
 5 files changed, 143 insertions(+), 134 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index de90a9900fee..579ad2560a39 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -353,9 +353,10 @@ static void free_channel(struct vmbus_channel *channel)
 static void percpu_channel_enq(void *arg)
 {
struct vmbus_channel *channel = arg;
-   int cpu = smp_processor_id();
+   struct hv_per_cpu_context *hv_cpu
+   = this_cpu_ptr(hv_context.cpu_context);
 
-   list_add_tail(&channel->percpu_list, &hv_context.percpu_list[cpu]);
+   list_add_tail(&channel->percpu_list, &hv_cpu->chan_list);
 }
 
 static void percpu_channel_deq(void *arg)
@@ -379,19 +380,21 @@ static void vmbus_release_relid(u32 relid)
 
 void hv_event_tasklet_disable(struct vmbus_channel *channel)
 {
-   struct tasklet_struct *tasklet;
-   tasklet = hv_context.event_dpc[channel->target_cpu];
-   tasklet_disable(tasklet);
+   struct hv_per_cpu_context *hv_cpu;
+
+   hv_cpu = per_cpu_ptr(hv_context.cpu_context, channel->target_cpu);
+   tasklet_disable(&hv_cpu->event_dpc);
 }
 
 void hv_event_tasklet_enable(struct vmbus_channel *channel)
 {
-   struct tasklet_struct *tasklet;
-   tasklet = hv_context.event_dpc[channel->target_cpu];
-   tasklet_enable(tasklet);
+   struct hv_per_cpu_context *hv_cpu;
+
+   hv_cpu = per_cpu_ptr(hv_context.cpu_context, channel->target_cpu);
+   tasklet_enable(&hv_cpu->event_dpc);
 
/* In case there is any pending event */
-   tasklet_schedule(tasklet);
+   tasklet_schedule(&hv_cpu->event_dpc);
 }
 
 void hv_process_channel_removal(struct vmbus_channel *channel, u32 relid)
@@ -726,9 +729,12 @@ static void vmbus_wait_for_unload(void)
break;
 
for_each_online_cpu(cpu) {
-   page_addr = hv_context.synic_message_page[cpu];
-   msg = (struct hv_message *)page_addr +
-   VMBUS_MESSAGE_SINT;
+   struct hv_per_cpu_context *hv_cpu
+   = per_cpu_ptr(hv_context.cpu_context, cpu);
+
+   page_addr = hv_cpu->synic_message_page;
+   msg = (struct hv_message *)page_addr
+   + VMBUS_MESSAGE_SINT;
 
message_type = READ_ONCE(msg->header.message_type);
if (message_type == HVMSG_NONE)
@@ -752,7 +758,10 @@ static void vmbus_wait_for_unload(void)
 * messages after we reconnect.
 */
for_each_online_cpu(cpu) {
-   page_addr = hv_context.synic_message_page[cpu];
+   struct hv_per_cpu_context *hv_cpu
+   = per_cpu_ptr(hv_context.cpu_context, cpu);
+
+   page_addr = hv_cpu->synic_message_page;
msg = (struct hv_message *)page_addr + VMBUS_MESSAGE_SINT;
msg->header.message_type = HVMSG_NONE;
}
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 1766ef03e78d..158f12823baf 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -93,12 +93,10 @@ static int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo,
 * all the CPUs. This is needed for kexec to work correctly where
 * the CPU attempting to connect may not be CPU 0.
 */
-   if (version >= VERSION_WIN8_1) {
-   msg->target_vcpu = hv_context.vp_index[get_cpu()];
-   put_cpu();
-   } else {
+   if (version >= VERSION_WIN8_1)
+   msg->target_vcpu = hv_context.vp_index[smp_processor_id()];
+   else
msg->target_vcpu = 0;
-   }
 
/*
 * Add to list before we send the request since we may
@@ -269,12 +267,12 @@ void vmbus_disconnect(void)
  */
 static struct vmbus_channel *pcpu_relid2channel(u32 relid)
 {
+   struct hv_per_cpu_context *hv_cpu
+   = this_cpu_ptr(hv_context.cpu_context);
+   struct vmbus_channel *found_channel = NULL;
struct vmbus_channel *channel;
-   struct vmbus_channel *found_channel  = NULL;
-   int cpu = smp_processor_id();
-   struct list_head *pcpu_head = &hv_context.percpu_list[cpu];
 
-   list_for_each_entry(channel, pcpu_head, percpu_list) {
+   list_for_each_entry(channel, &

[PATCH 07/14] vmbus: callback is in softirq not workqueue

2017-02-01 Thread Stephen Hemminger

The callback is invoked from a tasklet, not a workqueue.

Signed-off-by: Stephen Hemminger 
---
 include/linux/hyperv.h | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 39d493ce550d..b30808f740f9 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -32,7 +32,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -729,9 +728,7 @@ struct vmbus_channel {
 
struct vmbus_close_msg close_msg;
 
-   /* Channel callback are invoked in this workqueue context */
-   /* HANDLE dataWorkQueue; */
-
+   /* Channel callback's invoked in softirq context */
void (*onchannel_callback)(void *context);
void *channel_callback_context;
 
-- 
2.11.0

[PATCH 06/14] vmbus: remove per channel state

2017-02-01 Thread Stephen Hemminger

The netvsc driver no longer needs the per-channel state hook to track
its receive buffer.

Signed-off-by: Stephen Hemminger 
---
 include/linux/hyperv.h | 14 --
 1 file changed, 14 deletions(-)

diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 8c6a1505b876..39d493ce550d 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -823,10 +823,6 @@ struct vmbus_channel {
 */
struct vmbus_channel *primary_channel;
/*
-* Support per-channel state for use by vmbus drivers.
-*/
-   void *per_channel_state;
-   /*
 * To support per-cpu lookup mapping of relid to channel,
 * link up channels based on their CPU affinity.
 */
@@ -903,16 +899,6 @@ static inline void set_channel_read_state(struct vmbus_channel *c, bool state)
c->batched_reading = state;
 }
 
-static inline void set_per_channel_state(struct vmbus_channel *c, void *s)
-{
-   c->per_channel_state = s;
-}
-
-static inline void *get_per_channel_state(struct vmbus_channel *c)
-{
-   return c->per_channel_state;
-}
-
 static inline void set_channel_pending_send_size(struct vmbus_channel *c,
 u32 size)
 {
-- 
2.11.0

[PATCH 05/14] netvsc: remove no longer needed receive staging buffers

2017-02-01 Thread Stephen Hemminger

Since commit aed8c164ca5199 ("Drivers: hv: ring_buffer: count on wrap
around mappings") it is no longer necessary to handle ring wrapping
by having a special receive buffer.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/hyperv_net.h   |  5 ---
 drivers/net/hyperv/netvsc.c   | 83 ++-
 drivers/net/hyperv/rndis_filter.c | 11 --
 3 files changed, 11 insertions(+), 88 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 3958adade7eb..cce70ceba6d5 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -748,11 +748,6 @@ struct netvsc_device {
 
int ring_size;
 
-   /* The primary channel callback buffer */
-   unsigned char *cb_buffer;
-   /* The sub channel callback buffer */
-   unsigned char *sub_cb_buf;
-
struct multi_send_data msd[VRSS_CHANNEL_MAX];
u32 max_pkt; /* max number of pkt in one send, e.g. 8 */
u32 pkt_align; /* alignment bytes, e.g. 8 */
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index e326e68f9f6d..7487498b663c 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -67,12 +67,6 @@ static struct netvsc_device *alloc_net_device(void)
if (!net_device)
return NULL;
 
-   net_device->cb_buffer = kzalloc(NETVSC_PACKET_SIZE, GFP_KERNEL);
-   if (!net_device->cb_buffer) {
-   kfree(net_device);
-   return NULL;
-   }
-
net_device->mrc[0].buf = vzalloc(NETVSC_RECVSLOT_MAX *
 sizeof(struct recv_comp_data));
 
@@ -93,7 +87,6 @@ static void free_netvsc_device(struct netvsc_device *nvdev)
for (i = 0; i < VRSS_CHANNEL_MAX; i++)
vfree(nvdev->mrc[i].buf);
 
-   kfree(nvdev->cb_buffer);
kfree(nvdev);
 }
 
@@ -584,7 +577,6 @@ void netvsc_device_remove(struct hv_device *device)
vmbus_close(device->channel);
 
/* Release all resources */
-   vfree(net_device->sub_cb_buf);
free_netvsc_device(net_device);
 }
 
@@ -1256,16 +1248,11 @@ static void netvsc_process_raw_pkt(struct hv_device *device,
 
 void netvsc_channel_cb(void *context)
 {
-   int ret;
-   struct vmbus_channel *channel = (struct vmbus_channel *)context;
+   struct vmbus_channel *channel = context;
u16 q_idx = channel->offermsg.offer.sub_channel_index;
struct hv_device *device;
struct netvsc_device *net_device;
-   u32 bytes_recvd;
-   u64 request_id;
struct vmpacket_descriptor *desc;
-   unsigned char *buffer;
-   int bufferlen = NETVSC_PACKET_SIZE;
struct net_device *ndev;
bool need_to_commit = false;
 
@@ -1277,65 +1264,19 @@ void netvsc_channel_cb(void *context)
net_device = get_inbound_net_device(device);
if (!net_device)
return;
+
ndev = hv_get_drvdata(device);
-   buffer = get_per_channel_state(channel);
-
-   do {
-   desc = get_next_pkt_raw(channel);
-   if (desc != NULL) {
-   netvsc_process_raw_pkt(device,
-  channel,
-  net_device,
-  ndev,
-  desc->trans_id,
-  desc);
-
-   put_pkt_raw(channel, desc);
-   need_to_commit = true;
-   continue;
-   }
-   if (need_to_commit) {
-   need_to_commit = false;
-   commit_rd_index(channel);
-   }
 
-   ret = vmbus_recvpacket_raw(channel, buffer, bufferlen,
-  &bytes_recvd, &request_id);
-   if (ret == 0) {
-   if (bytes_recvd > 0) {
-   desc = (struct vmpacket_descriptor *)buffer;
-   netvsc_process_raw_pkt(device,
-  channel,
-  net_device,
-  ndev,
-  request_id,
-  desc);
-   } else {
-   /*
-* We are done for this pass.
-*/
-   break;
-   }
-
-   } else if (ret == -ENOBUFS) {
-   if (bufferlen > NETVSC_PACKET_SIZE)
-   kfree(buffer);
-   /* Handle large packet */
-   buffer = kmallo

[PATCH 04/14] vmbus: remove unused kickq argument to sendpacket

2017-02-01 Thread Stephen Hemminger

Since sendpacket no longer uses the kick_q argument, remove it.
Also remove the no longer used xmit_more handling around sendpacket
in netvsc.

Signed-off-by: Stephen Hemminger 
---
 drivers/hv/channel.c| 19 +--
 drivers/net/hyperv/netvsc.c | 21 +++--
 include/linux/hyperv.h  |  6 ++
 3 files changed, 14 insertions(+), 32 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index e26285cde8e0..789c75f6df26 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -643,8 +643,8 @@ void vmbus_close(struct vmbus_channel *channel)
 EXPORT_SYMBOL_GPL(vmbus_close);
 
 int vmbus_sendpacket_ctl(struct vmbus_channel *channel, void *buffer,
-  u32 bufferlen, u64 requestid,
-  enum vmbus_packet_type type, u32 flags, bool kick_q)
+u32 bufferlen, u64 requestid,
+enum vmbus_packet_type type, u32 flags)
 {
struct vmpacket_descriptor desc;
u32 packetlen = sizeof(struct vmpacket_descriptor) + bufferlen;
@@ -693,7 +693,7 @@ int vmbus_sendpacket(struct vmbus_channel *channel, void *buffer,
   enum vmbus_packet_type type, u32 flags)
 {
return vmbus_sendpacket_ctl(channel, buffer, bufferlen, requestid,
-   type, flags, true);
+   type, flags);
 }
 EXPORT_SYMBOL(vmbus_sendpacket);
 
@@ -705,11 +705,9 @@ EXPORT_SYMBOL(vmbus_sendpacket);
  * explicitly.
  */
 int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel *channel,
-struct hv_page_buffer pagebuffers[],
-u32 pagecount, void *buffer, u32 bufferlen,
-u64 requestid,
-u32 flags,
-bool kick_q)
+   struct hv_page_buffer pagebuffers[],
+   u32 pagecount, void *buffer, u32 bufferlen,
+   u64 requestid, u32 flags)
 {
int i;
struct vmbus_channel_packet_page_buffer desc;
@@ -769,9 +767,10 @@ int vmbus_sendpacket_pagebuffer(struct vmbus_channel *channel,
 u64 requestid)
 {
u32 flags = VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED;
+
return vmbus_sendpacket_pagebuffer_ctl(channel, pagebuffers, pagecount,
-  buffer, bufferlen, requestid,
-  flags, true);
+  buffer, bufferlen,
+  requestid, flags);
 
 }
 EXPORT_SYMBOL_GPL(vmbus_sendpacket_pagebuffer);
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 5a1cc089acb7..e326e68f9f6d 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -723,8 +723,6 @@ static u32 netvsc_copy_to_send_buf(struct netvsc_device *net_device,
char *dest = start + (section_index * net_device->send_section_size)
 + pend_size;
int i;
-   bool is_data_pkt = (skb != NULL) ? true : false;
-   bool xmit_more = (skb != NULL) ? skb->xmit_more : false;
u32 msg_size = 0;
u32 padding = 0;
u32 remain = packet->total_data_buflen % net_device->pkt_align;
@@ -732,7 +730,7 @@ static u32 netvsc_copy_to_send_buf(struct netvsc_device *net_device,
packet->page_buf_cnt;
 
/* Add padding */
-   if (is_data_pkt && xmit_more && remain &&
+   if (skb && skb->xmit_more && remain &&
!packet->cp_partial) {
padding = net_device->pkt_align - remain;
rndis_msg->msg_len += padding;
@@ -772,7 +770,6 @@ static inline int netvsc_send_pkt(
int ret;
struct hv_page_buffer *pgbuf;
u32 ring_avail = hv_ringbuf_avail_percent(&out_channel->outbound);
-   bool xmit_more = (skb != NULL) ? skb->xmit_more : false;
 
nvmsg.hdr.msg_type = NVSP_MSG1_TYPE_SEND_RNDIS_PKT;
if (skb != NULL) {
@@ -796,16 +793,6 @@ static inline int netvsc_send_pkt(
if (out_channel->rescind)
return -ENODEV;
 
-   /*
-* It is possible that once we successfully place this packet
-* on the ringbuffer, we may stop the queue. In that case, we want
-* to notify the host independent of the xmit_more flag. We don't
-* need to be precise here; in the worst case we may signal the host
-* unnecessarily.
-*/
-   if (ring_avail < (RING_AVAIL_PERCENT_LOWATER + 1))
-   xmit_more = false;
-
if (packet->page_buf_cnt) {
pgbuf = packet->cp_partial ? (*pb) +
packet->rmsg_pgcnt : (*pb);
@@ -815,15 +

[PATCH 03/14] vmbus: remove no longer used signal_policy

2017-02-01 Thread Stephen Hemminger

The explicit signal policy is no longer used. A different mechanism
will be added later when xmit_more is supported.

Signed-off-by: Stephen Hemminger 
---
 include/linux/hyperv.h | 18 --
 1 file changed, 18 deletions(-)

diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 85b26f06e172..423fc96cc26a 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -670,11 +670,6 @@ struct hv_input_signal_event_buffer {
struct hv_input_signal_event event;
 };
 
-enum hv_signal_policy {
-   HV_SIGNAL_POLICY_DEFAULT = 0,
-   HV_SIGNAL_POLICY_EXPLICIT,
-};
-
 enum hv_numa_policy {
HV_BALANCED = 0,
HV_LOCALIZED,
@@ -837,13 +832,6 @@ struct vmbus_channel {
 */
struct list_head percpu_list;
/*
-* Host signaling policy: The default policy will be
-* based on the ring buffer state. We will also support
-* a policy where the client driver can have explicit
-* signaling control.
-*/
-   enum hv_signal_policy  signal_policy;
-   /*
 * On the channel send side, many of the VMBUS
 * device drivers explicity serialize access to the
 * outgoing ring buffer. Give more control to the
@@ -904,12 +892,6 @@ static inline bool is_hvsock_channel(const struct vmbus_channel *c)
  VMBUS_CHANNEL_TLNPI_PROVIDER_OFFER);
 }
 
-static inline void set_channel_signal_state(struct vmbus_channel *c,
-   enum hv_signal_policy policy)
-{
-   c->signal_policy = policy;
-}
-
 static inline void set_channel_affinity_state(struct vmbus_channel *c,
  enum hv_numa_policy policy)
 {
-- 
2.11.0

[PATCH 01/14] vmbus: use kernel bitops for traversing interrupt mask

2017-02-01 Thread Stephen Hemminger

Use the standard kernel find-first-set-bit operations to traverse
the channel bit array. This has the added benefit of speeding up
lookup on 64-bit systems, since it uses the find first set instruction.

Signed-off-by: Stephen Hemminger 
---
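
A minimal userspace sketch, not part of the patch, of traversing a pending-channel bitmap by jumping from one set bit to the next rather than testing all 32 bits of every dword. It assumes GCC or Clang for __builtin_ctzl and clears bits non-atomically, whereas the patch uses for_each_set_bit() with sync_test_and_clear_bit(). The toy_ names are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Index of the first set bit at or after start, or nbits if there is none. */
static size_t toy_find_next_bit(const unsigned long *map, size_t nbits,
				size_t start)
{
	while (start < nbits) {
		size_t word = start / TOY_BITS_PER_LONG;
		unsigned long w = map[word] >> (start % TOY_BITS_PER_LONG);

		if (w != 0) {
			/* count trailing zeros: distance to the next set bit */
			size_t bit = start + (size_t)__builtin_ctzl(w);

			return bit < nbits ? bit : nbits;
		}
		start = (word + 1) * TOY_BITS_PER_LONG;	/* skip empty word */
	}
	return nbits;
}

/*
 * Clear every pending bit and store the channel ids in out[].
 * Bit 0 is cleared but not reported, mirroring the relid 0 special case.
 */
static size_t toy_collect_relids(unsigned long *map, size_t nbits,
				 size_t *out, size_t max)
{
	size_t n = 0;
	size_t relid;

	for (relid = toy_find_next_bit(map, nbits, 0); relid < nbits;
	     relid = toy_find_next_bit(map, nbits, relid + 1)) {
		map[relid / TOY_BITS_PER_LONG] &=
			~(1UL << (relid % TOY_BITS_PER_LONG));
		if (relid != 0 && n < max)
			out[n++] = relid;
	}
	return n;
}
```

Words that are entirely zero are skipped in one step, and within a non-zero word the next set bit is found without a per-bit loop.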
 drivers/hv/channel.c  |  8 ++-
 drivers/hv/connection.c   | 55 +++
 drivers/hv/hyperv_vmbus.h | 16 --
 drivers/hv/vmbus_drv.c|  4 +---
 4 files changed, 29 insertions(+), 54 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index be34547cdb68..a016c5c0e472 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -47,12 +47,8 @@ void vmbus_setevent(struct vmbus_channel *channel)
 * For channels marked as in "low latency" mode
 * bypass the monitor page mechanism.
 */
-   if ((channel->offermsg.monitor_allocated) &&
-   (!channel->low_latency)) {
-   /* Each u32 represents 32 channels */
-   sync_set_bit(channel->offermsg.child_relid & 31,
-   (unsigned long *) vmbus_connection.send_int_page +
-   (channel->offermsg.child_relid >> 5));
+   if (channel->offermsg.monitor_allocated && !channel->low_latency) {
+   vmbus_send_interrupt(channel->offermsg.child_relid);
 
/* Get the child to parent monitor page */
monitorpage = vmbus_connection.monitor_pages[1];
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 307a5a8937f6..1766ef03e78d 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -379,17 +379,11 @@ static void process_chn_event(u32 relid)
  */
 void vmbus_on_event(unsigned long data)
 {
-   u32 dword;
-   u32 maxdword;
-   int bit;
-   u32 relid;
-   u32 *recv_int_page = NULL;
-   void *page_addr;
-   int cpu = smp_processor_id();
-   union hv_synic_event_flags *event;
+   unsigned long *recv_int_page;
+   u32 maxbits, relid;
 
if (vmbus_proto_version < VERSION_WIN8) {
-   maxdword = MAX_NUM_CHANNELS_SUPPORTED >> 5;
+   maxbits = MAX_NUM_CHANNELS_SUPPORTED;
recv_int_page = vmbus_connection.recv_int_page;
} else {
/*
@@ -397,35 +391,24 @@ void vmbus_on_event(unsigned long data)
 * can be directly checked to get the id of the channel
 * that has the interrupt pending.
 */
-   maxdword = HV_EVENT_FLAGS_DWORD_COUNT;
-   page_addr = hv_context.synic_event_page[cpu];
-   event = (union hv_synic_event_flags *)page_addr +
+   int cpu = smp_processor_id();
+   void *page_addr = hv_context.synic_event_page[cpu];
+   union hv_synic_event_flags *event
+   = (union hv_synic_event_flags *)page_addr +
 VMBUS_MESSAGE_SINT;
-   recv_int_page = event->flags32;
-   }
-
 
+   maxbits = HV_EVENT_FLAGS_COUNT;
+   recv_int_page = event->flags;
+   }
 
-   /* Check events */
-   if (!recv_int_page)
+   if (unlikely(!recv_int_page))
return;
-   for (dword = 0; dword < maxdword; dword++) {
-   if (!recv_int_page[dword])
-   continue;
-   for (bit = 0; bit < 32; bit++) {
-   if (sync_test_and_clear_bit(bit,
-   (unsigned long *)&recv_int_page[dword])) {
-   relid = (dword << 5) + bit;
-
-   if (relid == 0)
-   /*
-* Special case - vmbus
-* channel protocol msg
-*/
-   continue;
 
+   for_each_set_bit(relid, recv_int_page, maxbits) {
+   if (sync_test_and_clear_bit(relid, recv_int_page)) {
+   /* Special case - vmbus channel protocol msg */
+   if (relid != 0)
process_chn_event(relid);
-   }
}
}
 }
@@ -491,12 +474,8 @@ void vmbus_set_event(struct vmbus_channel *channel)
 {
u32 child_relid = channel->offermsg.child_relid;
 
-   if (!channel->is_dedicated_interrupt) {
-   /* Each u32 represents 32 channels */
-   sync_set_bit(child_relid & 31,
-   (unsigned long *)vmbus_connection.send_int_page +
-   (child_relid >> 5));
-   }
+   if (!channel->is_dedicated_interrupt)
+   vmbus_send_interrupt(child_relid);
 
hv_do_hypercall(HVCALL_SIGNAL_EVENT, channel->sig_event, NULL);
 }
diff --git a/

[PATCH 02/14] vmbus: drop no longer used kick_q argument

2017-02-01 Thread Stephen Hemminger

The flag to cause notification of host is unused after
commit a01a291a282f7c2e ("Drivers: hv: vmbus: Base host signaling
strictly on the ring state"). Therefore remove it from the ring
buffer internal API.

Signed-off-by: Stephen Hemminger 
---
 drivers/hv/channel.c  | 13 -
 drivers/hv/hyperv_vmbus.h |  5 ++---
 drivers/hv/ring_buffer.c  |  8 +++-
 3 files changed, 9 insertions(+), 17 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index a016c5c0e472..e26285cde8e0 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -670,9 +670,7 @@ int vmbus_sendpacket_ctl(struct vmbus_channel *channel, void *buffer,
bufferlist[2].iov_base = &aligned_data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, num_vecs,
-  lock, kick_q);
-
+   return hv_ringbuffer_write(channel, bufferlist, num_vecs, lock);
 }
 EXPORT_SYMBOL(vmbus_sendpacket_ctl);
 
@@ -757,8 +755,7 @@ int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel *channel,
bufferlist[2].iov_base = &aligned_data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, 3,
-  lock, kick_q);
+   return hv_ringbuffer_write(channel, bufferlist, 3, lock);
 }
 EXPORT_SYMBOL_GPL(vmbus_sendpacket_pagebuffer_ctl);
 
@@ -813,8 +810,7 @@ int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel,
bufferlist[2].iov_base = &aligned_data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, 3,
-  lock, true);
+   return hv_ringbuffer_write(channel, bufferlist, 3, lock);
 }
 EXPORT_SYMBOL_GPL(vmbus_sendpacket_mpb_desc);
 
@@ -871,8 +867,7 @@ int vmbus_sendpacket_multipagebuffer(struct vmbus_channel *channel,
bufferlist[2].iov_base = &aligned_data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, 3,
-  lock, true);
+   return hv_ringbuffer_write(channel, bufferlist, 3, lock);
 }
 EXPORT_SYMBOL_GPL(vmbus_sendpacket_multipagebuffer);
 
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 2749a4142889..c375ec89db6f 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -275,9 +275,8 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
 void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info);
 
 int hv_ringbuffer_write(struct vmbus_channel *channel,
-   struct kvec *kv_list,
-   u32 kv_count, bool lock,
-   bool kick_q);
+   struct kvec *kv_list,
+   u32 kv_count, bool lock);
 
 int hv_ringbuffer_read(struct vmbus_channel *channel,
   void *buffer, u32 buflen, u32 *buffer_actual_len,
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 2cd402986858..30ca55aefd24 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -77,8 +77,7 @@ u32 hv_end_read(struct hv_ring_buffer_info *rbi)
  * host logic is fixed.
  */
 
-static void hv_signal_on_write(u32 old_write, struct vmbus_channel *channel,
-  bool kick_q)
+static void hv_signal_on_write(u32 old_write, struct vmbus_channel *channel)
 {
struct hv_ring_buffer_info *rbi = &channel->outbound;
 
@@ -285,8 +284,7 @@ void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info)
 
 /* Write to the ring buffer. */
 int hv_ringbuffer_write(struct vmbus_channel *channel,
-   struct kvec *kv_list, u32 kv_count, bool lock,
-   bool kick_q)
+   struct kvec *kv_list, u32 kv_count, bool lock)
 {
int i = 0;
u32 bytes_avail_towrite;
@@ -352,7 +350,7 @@ int hv_ringbuffer_write(struct vmbus_channel *channel,
if (lock)
spin_unlock_irqrestore(&outring_info->ring_lock, flags);
 
-   hv_signal_on_write(old_write, channel, kick_q);
+   hv_signal_on_write(old_write, channel);
 
if (channel->rescind)
return -ENODEV;
-- 
2.11.0


[PATCH 00/14] hyperv: vmbus related patches

2017-02-01 Thread Stephen Hemminger

This is a rebase/resend of earlier patches. I skipped the pure
cosmetic patches for now.  Mostly this is consolidation of earlier
changes, removing dead code, etc.  The important part is the
change for allowing a vmbus channel to get callback directly
in interrupt mode; this is necessary for NAPI support.

Stephen Hemminger (14):
  vmbus: use kernel bitops for traversing interrupt mask
  vmbus: drop no longer used kick_q argument
  vmbus: remove no longer used signal_policy
  vmbus: remove unused kickq argument to sendpacket
  netvsc: remove no longer needed receive staging buffers
  vmbus: remove per channel state
  vmbus: callback is in softirq not workqueue
  vmbus: put related per-cpu variable together
  vmbus: change to per channel tasklet
  vmbus: add direct isr callback mode
  vmbus: remove conditional locking of vmbus_write
  vmbus: expose hv_begin/end_read
  vmbus: constify parameters where possible
  vmbus: replace modulus operation with subtraction

Starting point was top of current char-misc-next branch.

 drivers/hv/channel.c  |  47 +
 drivers/hv/channel_mgmt.c |  41 ++--
 drivers/hv/connection.c   | 134 +-
 drivers/hv/hv.c   | 124 +++
 drivers/hv/hv_util.c  |   3 +-
 drivers/hv/hyperv_vmbus.h |  80 ---
 drivers/hv/ring_buffer.c  |  66 ++-
 drivers/hv/vmbus_drv.c| 115 ++--
 drivers/net/hyperv/hyperv_net.h   |   5 --
 drivers/net/hyperv/netvsc.c   | 104 -
 drivers/net/hyperv/rndis_filter.c |  11 
 drivers/uio/uio_hv_generic.c  |   2 +-
 include/linux/hyperv.h| 134 +-
 13 files changed, 338 insertions(+), 528 deletions(-)

-- 
2.11.0


[PATCH net-next] net: make ndo_get_stats64 a void function

2017-01-05 Thread Stephen Hemminger

The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.

Fix all drivers with ndo_get_stats64 to have a void function.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/bonding/bond_main.c  | 10 --
 drivers/net/dummy.c  |  5 ++---
 drivers/net/ethernet/alacritech/slicoss.c|  6 ++
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 10 --
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c |  6 ++
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c |  4 +---
 drivers/net/ethernet/atheros/alx/main.c  |  6 ++
 drivers/net/ethernet/broadcom/b44.c  |  5 ++---
 drivers/net/ethernet/broadcom/bnx2.c |  3 +--
 drivers/net/ethernet/broadcom/bnxt/bnxt.c|  6 ++
 drivers/net/ethernet/broadcom/tg3.c  |  8 +++-
 drivers/net/ethernet/brocade/bna/bnad.c  |  6 ++
 drivers/net/ethernet/calxeda/xgmac.c |  5 ++---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c |  5 ++---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  |  7 +++
 drivers/net/ethernet/cisco/enic/enic_main.c  |  8 +++-
 drivers/net/ethernet/ec_bhf.c|  4 +---
 drivers/net/ethernet/emulex/benet/be_main.c  |  5 ++---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c   |  6 ++
 drivers/net/ethernet/hisilicon/hns/hns_enet.c|  6 ++
 drivers/net/ethernet/ibm/ehea/ehea_main.c|  5 ++---
 drivers/net/ethernet/intel/e1000e/e1000.h|  4 ++--
 drivers/net/ethernet/intel/e1000e/netdev.c   |  5 ++---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c  |  6 ++
 drivers/net/ethernet/intel/i40e/i40e.h   |  5 ++---
 drivers/net/ethernet/intel/i40e/i40e_main.c  | 18 ++
 drivers/net/ethernet/intel/igb/igb_main.c| 10 --
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c|  7 ---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c|  6 ++
 drivers/net/ethernet/marvell/mvneta.c|  4 +---
 drivers/net/ethernet/marvell/mvpp2.c |  4 +---
 drivers/net/ethernet/marvell/sky2.c  |  6 ++
 drivers/net/ethernet/mediatek/mtk_eth_soc.c  |  6 ++
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c   |  4 +---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c|  3 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c |  3 +--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c   |  4 +---
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c   |  3 +--
 drivers/net/ethernet/myricom/myri10ge/myri10ge.c |  9 -
 drivers/net/ethernet/neterion/vxge/vxge-main.c   |  4 +---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c  |  6 ++
 drivers/net/ethernet/nvidia/forcedeth.c  |  4 +---
 drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c | 10 --
 drivers/net/ethernet/qlogic/qede/qede_main.c |  7 ++-
 drivers/net/ethernet/qualcomm/emac/emac.c|  6 ++
 drivers/net/ethernet/realtek/8139too.c   |  9 +++--
 drivers/net/ethernet/realtek/r8169.c |  4 +---
 drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c  |  8 ++--
 drivers/net/ethernet/sfc/efx.c   |  6 ++
 drivers/net/ethernet/sfc/falcon/efx.c|  6 ++
 drivers/net/ethernet/sun/niu.c   |  6 ++
 drivers/net/ethernet/synopsys/dwc_eth_qos.c  |  4 +---
 drivers/net/ethernet/tile/tilepro.c  |  4 ++--
 drivers/net/ethernet/via/via-rhine.c |  8 +++-
 drivers/net/fjes/fjes_main.c |  7 ++-
 drivers/net/hyperv/netvsc_drv.c  |  6 ++
 drivers/net/ifb.c|  6 ++
 drivers/net/ipvlan/ipvlan_main.c |  5 ++---
 drivers/net/loopback.c   |  5 ++---
 drivers/net/macsec.c |  6 ++
 drivers/net/macvlan.c|  5 ++---
 drivers/net/nlmon.c  |  4 +---
 drivers/net/ppp/ppp_generic.c|  4 +---
 drivers/net/slip/slip.c  |  3 +--
 drivers/net/team/team.c  |  3 +--
 drivers/net/tun.c|  3 +--
 drivers/net/veth.c   |  6 ++
 drivers/net/virtio_net.c |  6 ++
 drivers/net/vmxnet3/vmxnet3_ethtool.c|  4 +---
 drivers/net/vmxnet3/vmxnet3_int.h|  4 ++--
 drivers/net/vrf.c

Re: [PATCH net-next V3 3/3] tun: rx batching

2016-12-31 Thread Stephen Hemminger

On Fri, 30 Dec 2016 13:20:51 +0800
Jason Wang  wrote:

> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index cd8e02c..a268ed9 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -75,6 +75,10 @@
>  
>  #include 
>  
> +static int rx_batched;
> +module_param(rx_batched, int, 0444);
> +MODULE_PARM_DESC(rx_batched, "Number of packets batched in rx");
> +
>  /* Uncomment to enable debugging */

I like the concept of rx batching. But controlling it via a module parameter
is one of the worst API choices.  Ethtool would be better to use because that is
how other network devices control batching.

If you do ethtool, you could even extend it to have a number of packets
and max latency value.

Re: [RFC v2 -next 1/2] virtio: Start feature MTU support

2016-03-19 Thread Stephen Hemminger

On Thu, 17 Mar 2016 17:10:55 -0400
Aaron Conole  wrote:

> Stephen Hemminger  writes:
> 
> > On Tue, 15 Mar 2016 17:04:12 -0400
> > Aaron Conole  wrote:
> >
> >> --- a/include/uapi/linux/virtio_net.h
> >> +++ b/include/uapi/linux/virtio_net.h
> >> @@ -55,6 +55,7 @@
> >>  #define VIRTIO_NET_F_MQ   22  /* Device supports Receive Flow
> >> * Steering */
> >>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
> >> +#define VIRTIO_NET_F_MTU 25   /* Device supports Default MTU Negotiation */
> >>  
> >>  #ifndef VIRTIO_NET_NO_LEGACY
> >>  #define VIRTIO_NET_F_GSO  6   /* Host handles pkts w/ any GSO type */
> >> @@ -73,6 +74,8 @@ struct virtio_net_config {
> >> * Legal values are between 1 and 0x8000
> >> */
> >>__u16 max_virtqueue_pairs;
> >> +  /* Default maximum transmit unit advice */
> >> +  __u16 mtu;
> >>  } __attribute__((packed));
> >>  
> >>  /*
> >
> > You can't change user visible headers without breaking ABI.
> > This structure might be used by other user code. Also how can this
> > work if host is using old size of structure.
> 
> How else can this field be added and remain compliant with the spec? The
> spec requires that mtu be passed in the virtio_net_config field.
> 
> As for old sizeof, I think the absence of the VIRTIO_NET_F_MTU bit being
> asserted is confirmation that mtu is not valid (at least, it is implied
> in the spec).

Michael is right, as long as the code checks for the MTU flag before
referencing the mtu field, everything is fine.  Actually, the structure
is never used directly; fields are only fetched with offsetof.
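
A minimal sketch of that pattern, in user-space C: check the feature bit first, then fetch the field by its offsetof() position, so a peer that still uses the old, shorter structure is never read past its end. The struct net_config_sketch layout and the config_mtu helper are illustrative stand-ins, not the driver's code, and the little-endian decoding is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_F_MTU 25	/* feature bit number from the quoted patch */

/* Illustrative stand-in for struct virtio_net_config. */
struct net_config_sketch {
	uint8_t  mac[6];
	uint16_t status;
	uint16_t max_virtqueue_pairs;
	uint16_t mtu;		/* new trailing field */
} __attribute__((packed));

/*
 * Return the advertised MTU, or fallback when the feature bit is not
 * negotiated or the config space is too short to hold the field.
 */
static uint16_t config_mtu(uint64_t features, const uint8_t *cfg,
			   size_t cfg_len, uint16_t fallback)
{
	size_t off = offsetof(struct net_config_sketch, mtu);

	assert(cfg != NULL);

	if (!(features & (1ULL << VIRTIO_NET_F_MTU)))
		return fallback;	/* field is not valid, do not touch it */
	if (cfg_len < off + sizeof(uint16_t))
		return fallback;	/* old-size structure */

	/* Decode the two bytes as little-endian (assumed for this sketch). */
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}
```

With the bit clear, a 10-byte (old layout) config space is never indexed at the mtu offset, which is the property the reply relies on.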


Re: [RFC v2 -next 1/2] virtio: Start feature MTU support

2016-03-18 Thread Stephen Hemminger

On Tue, 15 Mar 2016 17:04:12 -0400
Aaron Conole  wrote:

> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -55,6 +55,7 @@
>  #define VIRTIO_NET_F_MQ  22  /* Device supports Receive Flow
>* Steering */
>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23/* Set MAC address */
> +#define VIRTIO_NET_F_MTU 25  /* Device supports Default MTU Negotiation */
>  
>  #ifndef VIRTIO_NET_NO_LEGACY
>  #define VIRTIO_NET_F_GSO 6   /* Host handles pkts w/ any GSO type */
> @@ -73,6 +74,8 @@ struct virtio_net_config {
>* Legal values are between 1 and 0x8000
>*/
>   __u16 max_virtqueue_pairs;
> + /* Default maximum transmit unit advice */
> + __u16 mtu;
>  } __attribute__((packed));
>  
>  /*

You can't change user visible headers without breaking ABI.
This structure might be used by other user code. Also how can this
work if host is using old size of structure.

Re: [PATCH net-next] virtio_net: add ethtool support for set and get of settings

2016-02-03 Thread Stephen Hemminger

On Tue,  2 Feb 2016 13:51:20 +0100
Nikolay Aleksandrov  wrote:

> +static bool virtnet_validate_speed(u32 speed)
> +{
> + switch (speed) {
> + case SPEED_10:
> + case SPEED_100:
> + case SPEED_1000:
> + case SPEED_2500:
> + case SPEED_5000:
> + case SPEED_10000:
> + case SPEED_20000:
> + case SPEED_25000:
> + case SPEED_40000:
> + case SPEED_50000:
> + case SPEED_56000:
> + case SPEED_100000:
> + case SPEED_UNKNOWN:
> + return true;
> + }
> +
> + return false;
> +}

Why limit to only known values? This switch() will get out of
date when some vendor introduces 64G or some other weird value.
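
One way to avoid the enumeration, sketched here in user-space C, is a range check that accepts any speed that fits in a non-negative int plus the "unknown" sentinel. The names speed_is_acceptable and SKETCH_SPEED_UNKNOWN are hypothetical; the sentinel mirrors SPEED_UNKNOWN, which is defined as -1 in the ethtool header.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for SPEED_UNKNOWN (-1) seen as an unsigned 32-bit value. */
#define SKETCH_SPEED_UNKNOWN ((uint32_t)-1)

/* Accept any speed in Mb/s up to INT_MAX, or the unknown sentinel. */
static bool speed_is_acceptable(uint32_t speed)
{
	return speed <= (uint32_t)INT_MAX || speed == SKETCH_SPEED_UNKNOWN;
}

/* A few sanity checks showing the intended behavior. */
void speed_examples(void)
{
	assert(speed_is_acceptable(1000));	/* a listed speed */
	assert(speed_is_acceptable(64000));	/* a speed no switch() lists */
	assert(speed_is_acceptable(SKETCH_SPEED_UNKNOWN));
	assert(!speed_is_acceptable(0x80000000u));	/* would be negative as int */
}
```

A check like this does not need updating when a new link speed appears.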

Re: [RFC V7 PATCH 7/7] vhost_net: add interrupt coalescing support

2015-05-26 Thread Stephen Hemminger

On Mon, 25 May 2015 01:24:04 -0400
Jason Wang  wrote:

> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/net.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 7d137a4..5ee28b7 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -320,6 +320,9 @@ static void handle_tx(struct vhost_net *net)
>   hdr_size = nvq->vhost_hlen;
>   zcopy = nvq->ubufs;
>  
> + /* Finish pending interrupts first */
> + vhost_check_coalesce_and_signal(vq->dev, vq, false);
> +
>   for (;;) {
>   /* Release DMAs done buffers first */
>   if (zcopy)
> @@ -415,6 +418,7 @@ static void handle_tx(struct vhost_net *net)
>   }
>   }
>  out:
> + vhost_check_coalesce_and_signal(vq->dev, vq, true);
>   mutex_unlock(&vq->mutex);
>  }
>  
> @@ -554,6 +558,9 @@ static void handle_rx(struct vhost_net *net)
>   vq->log : NULL;
>   mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
>  
> + /* Finish pending interrupts first */
> + vhost_check_coalesce_and_signal(vq->dev, vq, false);
> +
>   while ((sock_len = peek_head_len(sock->sk))) {
>   sock_len += sock_hlen;
>   vhost_len = sock_len + vhost_hlen;
> @@ -638,6 +645,7 @@ static void handle_rx(struct vhost_net *net)
>   }
>   }
>  out:
> + vhost_check_coalesce_and_signal(vq->dev, vq, true);
>   mutex_unlock(&vq->mutex);
>  }
>  

Could you implement ethtool control of these coalescing parameters?

[PATCH net-next] virtio: change comment in transmit

2015-03-24 Thread Stephen Hemminger

The original comment was neither informative nor funny, and it was
sexist. Replace it with a better explanation of why the driver
stops the queue and what the impacts are.

Signed-off-by: Stephen Hemminger 

--- a/drivers/net/virtio_net.c  2015-03-24 15:20:25.174671000 -0700
+++ b/drivers/net/virtio_net.c  2015-03-24 16:17:28.478525333 -0700
@@ -939,8 +939,12 @@ static netdev_tx_t start_xmit(struct sk_
skb_orphan(skb);
nf_reset(skb);
 
-   /* Apparently nice girls don't return TX_BUSY; stop the queue
-* before it gets out of hand.  Naturally, this wastes entries. */
+   /* It is better to stop queue if running out of space
+* instead of forcing queuing layer to requeue the skb
+* by returning TX_BUSY (and cause a BUG message).
+* Since most packets only take 1 or 2 ring slots
+* this means 16 slots are typically wasted.
+*/
if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
netif_stop_subqueue(dev, qnum);
if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {

Re: what is the driver of vm's virtual ethernet?

2014-04-02 Thread Stephen Hemminger

What happens is that real (or virtual) network devices call the common
network receive function, netif_receive_skb.
This function has a hook, rx_handler, which is called if the device is part
of a bridge.
The bridge code then processes the packets and either floods, redirects,
or drops them based on the learned forwarding table.
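
A toy user-space model of that dispatch, to make the flow concrete. Every type and function here (netdev_sketch, receive_frame, bridge_handle_frame) is a hypothetical stand-in; in the kernel the hook is installed with netdev_rx_handler_register() and the bridge's handler is br_handle_frame.

```c
#include <assert.h>
#include <stddef.h>

enum rx_verdict { RX_CONSUMED, RX_PASS };

struct frame { size_t len; };
struct netdev_sketch;

typedef enum rx_verdict (*rx_handler_t)(struct netdev_sketch *dev,
					struct frame *f);

struct netdev_sketch {
	const char *name;
	rx_handler_t rx_handler;	/* NULL unless the device is a bridge port */
	unsigned int to_bridge;		/* frames taken by the bridge */
	unsigned int to_stack;		/* frames delivered to the local stack */
};

/* Stand-in for the bridge hook: it takes ownership of the frame. */
static enum rx_verdict bridge_handle_frame(struct netdev_sketch *dev,
					   struct frame *f)
{
	(void)f;
	dev->to_bridge++;	/* a real bridge would flood, forward, or drop */
	return RX_CONSUMED;
}

/* Stand-in for the common receive path shared by all drivers. */
static void receive_frame(struct netdev_sketch *dev, struct frame *f)
{
	assert(dev != NULL && f != NULL);

	if (dev->rx_handler && dev->rx_handler(dev, f) == RX_CONSUMED)
		return;		/* the hook consumed the frame */
	dev->to_stack++;	/* normal protocol delivery */
}
```

The driver, physical or virtual, only has to call the common receive function; whether the bridge sees the frame depends solely on whether a handler is attached to the device.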

On Sat, Mar 29, 2014 at 7:10 PM, longguang.yue  wrote:

> hi,all
> each port of bridge, has its packets process function called
> br_handle_frame.
> i want to know before this function called   who and how it get the
> packets?
> if it is a real physical ethernet, it  must be its driver, but for virtual
> ethernet , what is the driver?
>
>
> thanks
>
>
>
> ___
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization
>

[PATCH net-next 1/3] virtio_net: set multicast filter list to host

2013-12-09 Thread Stephen Hemminger

The virtio_net driver never sends the multicast address list to
the host. This is because the send command takes a pointer to a scatter
list to send but only inserts that one entry into the outgoing scatter list.

This bug has been there since:
commit f565a7c259d71cc186753653d978c646d2354b36
Author: Alex Williamson 
Date:   Wed Feb 4 09:02:45 2009 +0000

virtio_net: Add a MAC filter table

Signed-off-by: Stephen Hemminger 

--- a/drivers/net/virtio_net.c  2013-12-09 16:12:03.897891975 -0800
+++ b/drivers/net/virtio_net.c  2013-12-09 16:12:36.353164803 -0800
@@ -893,7 +893,7 @@ static bool virtnet_send_command(struct
sg_init_one(&hdr, &ctrl, sizeof(ctrl));
sgs[out_num++] = &hdr;
 
-   if (out)
+   for (; out; out = sg_next(out))
sgs[out_num++] = out;
if (in)
sgs[out_num + in_num++] = in;
