Re: [PATCH net-next] net: bcmgenet: Do not return from void function

2017-08-29 Thread David Miller
From: Florian Fainelli 
Date: Tue, 29 Aug 2017 21:48:51 -0700

> A stray return was added in the macro bcmgenet_##name##_writel where it
> should not, drop it.
> 
> Reported-by: kbuild test robot 
> Fixes: 69d2ea9c7989 ("net: bcmgenet: Use correct I/O accessors")
> Signed-off-by: Florian Fainelli 

Applied, thanks.


Re: [PATCH net-next v3 0/3] NCSI VLAN Filtering Support

2017-08-29 Thread David Miller
From: Samuel Mendoza-Jonas 
Date: Wed, 30 Aug 2017 14:37:21 +1000

> On Mon, 2017-08-28 at 16:50 -0700, David Miller wrote:
>> From: Samuel Mendoza-Jonas 
>> Date: Mon, 28 Aug 2017 16:18:40 +1000
>> 
>> > This series (mainly patch 2) adds VLAN filtering to the NCSI 
>> > implementation.
>> > A fair amount of code already exists in the NCSI stack for VLAN filtering 
>> > but
>> > none of it is actually hooked up. This goes the final mile and fixes a few
>> > bugs in the existing code found along the way (patch 1).
>> > 
>> > Patch 3 adds the appropriate flag and callbacks to the ftgmac100 driver to
>> > enable filtering as it's a large consumer of NCSI (and what I've been
>> > testing on).
>> > 
>> > v3:   - Add comment describing change to ncsi_find_filter()
>> >   - Catch NULL in clear_one_vid() from ncsi_get_filter()
>> >   - Simplify state changes when kicking updated channel
>> 
>> Series applied.
> 
> Thanks David,
> 
> The kbuild bot caught a build error where the add/kill callbacks aren't
> defined without CONFIG_NET_NCSI:
> 
>>> ERROR: "ncsi_vlan_rx_kill_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] 
>>> undefined!
>>> ERROR: "ncsi_vlan_rx_add_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] 
>>> undefined!
> 
> It's a quick fixup to patch 3 as below, would you like me to send it as a v4?

You must submit a formal fixup patch to fix bugs if I've said that I've
already applied your patch.


Re: Fwd: DA850-evm MAC Address is random

2017-08-29 Thread Sekhar Nori
On Wednesday 30 August 2017 06:19 AM, Adam Ford wrote:
> On Tue, Aug 29, 2017 at 10:20 AM, Adam Ford  wrote:
>> On Tue, Aug 29, 2017 at 10:16 AM, Sekhar Nori  wrote:
>>> On Tuesday 29 August 2017 05:32 PM, Adam Ford wrote:
 On Tue, Aug 29, 2017 at 6:42 AM, Sekhar Nori  wrote:
> On Tuesday 29 August 2017 03:53 PM, Adam Ford wrote:
>> On Tue, Aug 29, 2017 at 3:23 AM, Sekhar Nori  wrote:
>>> On Tuesday 29 August 2017 02:42 AM, Tony Lindgren wrote:
 * Adam Ford  [170828 13:33]:
> On Mon, Aug 28, 2017 at 1:54 PM, Grygorii Strashko
>  wrote:
>> Cc: Sekhar
>>
>> On 08/28/2017 10:32 AM, Adam Ford wrote:
>>>
>>> The davinvi_emac MAC address seems to attempt a call to
>>> ti_cm_get_macid in cpsw-common.c but it returns the message
>>> 'davinci_emac davinci_emac.1: incompatible machine/device type for
>>> reading mac address ' and then generates a random MAC address.
>>>
>>> The function appears to lookup varions boards using
>>> 'of_machine_is_compaible' and supports dm8148, am33xx, am3517, 
>>> dm816,
>>> am4372 and dra7.  I don't see the ti,davinci-dm6467-emac which is
>>> what's shown in the da850 device tree.
>>>
>>> Is there a patch somewhere for supporting the da850-evm?
>>
>>
>> Not sure if MAC address can be read from Control module.
>> May be Sekhar can say more?
>
> My understanding is that the MAC address is programmed by Logic PD
> into the SPI flash.  The Bootloader reads this from either SPI or its
> env variables.  Looking at the partition info listed in the
> da850-evm.dts file, it appears as if they've reserved space for it.
> Unfortunately, I don't see any code that reads it out.  I was hoping
>>>
>>> This code is present in U-Boot sources at
>>> board/davinci/da8xxevm/da850evm.c. See the function get_mac_addr() and
>>> its usage in misc_init_r().
>>>
> there might be a way to just pass cmdline parameter from the
> bootloader to the kernel to accept the MAC address.
>
>>
>>>
>>> If not, is there a way to pass the MAC address from U-Boot to the
>>> driver so it doesn't generate a random MAC?
>>
>>
>> "local-mac-address" dt porp
>
> The downside here, is that we'd have to have the Bootloader modify the
> device tree.

 That piece of code exists somewhere in u-boot already. Note how
>>>
>>> Yes, it is fdt_fixup_ethernet() and its usage is in common/image-fdt.c.
>>>
 we are populating the mac address for USB Ethernet drivers in
 u-boot and then the Ethernet driver code parses it. See commit
 055d31de7158 ("ARM: omap3: beagleboard-xm: dt: Add ethernet to
 the device tree") for some more information.

 I think u-boot needs the ethernet alias for finding the interface.
>>>
>>> That's exactly what was missing. I have sent a patch for fixing that and
>>> copied you there.
>>
>> Thanks for doing that.
>>
>>>
>>> Adam, if I can get your Tested-by, I will make an attempt to send it for
>>> v4.13 itself.
>>
>> I will test it.  Do need to run some instruction or do something
>> special in U-Boot to pass this in the proper place for the kernel to
>> pull it?  Tony's patch reference showed
>> command for fdt set, but I am not sure I fully understand the
>> parameters that went along with that.
>
> Nope, just applying the patch and booting the with the new dtb should
> result in the random mac address going away.

 Unfortunately, I am not seeing any change with the patch (at least
 with Kernel 4.12.9 from stable).

 netconsole: network logging started
 davinci_emac davinci_emac.1: incompatible machine/device type for
 reading mac address
 davinci_emac davinci_emac.1: using random MAC addr: ee:74:c3:3a:15:be

 Looking at the source for cpsw-common.c function ti_cm_get_macid()
 doesn't have a case for the  ti,davinci-dm6467-emac so I wonder if
 there might be more to it.
>>>
>>> Hmm, it did solve the issue for me when I tried latest -next. And
>>> reverting the patch brought back the random mac address usage. Could you
>>> try latest mainline or -next?
>>>
>>> Meanwhile let me see whats going on with the observations you have.
>>
>> I will try again with -next this afternoon and see what I can find.
>> Can you tell me which U-Boot version you're using? I want to match
>> your setup. I want to see if something is missing during the hand-off
>> between the Bootloader and Linux.
>>
> 
> I wonder if U-Boot isn't pushing something to Linux because it 

Re: [PATCH net-next 0/4] nsh: headers, GSO

2017-08-29 Thread Yang, Yi
On Wed, Aug 30, 2017 at 06:17:07AM +0800, David Miller wrote:
> From: Jiri Benc 
> Date: Mon, 28 Aug 2017 21:43:20 +0200
> 
> > This adds header structs and helpers for NSH together with GSO support.
> > 
> > Note there is no code in this patchset that actually manipulates the NSH
> > headers. That was sent to netdev by Yi Yang ("[PATCH net-next v6 0/3]
> > openvswitch: add NSH support"). The aim of this series is to lay the
> > groundwork and ease the implementation for him.
> > 
> > In addition to openvswitch, the NSH support should be added to tc (flower to
> > match, act_nsh to push/pop NSH headers). That will come later. There's
> > currently no plan to support NSH by other means than those two.
> > 
> > The patch 3 in this patchset was written by Yi Yang, I took it from the
> > aforementioned series and slightly modified it - see the note in the patch.
> 
> Series applied, thanks Jiri.

Hi, Jiri

David has merged your NSH GSO series, do you have one more patch series
to fix VxLAN-gpe+NSH GSO issue? I think we still need do something in
skb_udp_tunnel_segment in net/ipv4/udp_offload.c to support
VxLAN-gpe+NSH GSO.

I'll post new OVS NSH kernel datapath patch series if NSH GSO is done.


Re: [PATCH net-next v3 02/13] phy: add the mvebu cp110 comphy driver

2017-08-29 Thread Kishon Vijay Abraham I
Hi,

On Tuesday 29 August 2017 06:42 PM, Antoine Tenart wrote:
> Hi Kishon,
> 
> On Tue, Aug 29, 2017 at 05:55:06PM +0530, Kishon Vijay Abraham I wrote:
>> On Tuesday 29 August 2017 04:53 PM, Antoine Tenart wrote:
>>> On Tue, Aug 29, 2017 at 04:34:17PM +0530, Kishon Vijay Abraham I wrote:
 On Monday 28 August 2017 08:27 PM, Antoine Tenart wrote:
> +static const struct mvebu_comhy_conf mvebu_comphy_cp110_modes[] = {
> + /* lane 0 */
> + MVEBU_COMPHY_CONF(0, 1, PHY_MODE_SGMII, 0x1),
> + /* lane 1 */
> + MVEBU_COMPHY_CONF(1, 2, PHY_MODE_SGMII, 0x1),
> + /* lane 2 */
> + MVEBU_COMPHY_CONF(2, 0, PHY_MODE_SGMII, 0x1),
> + MVEBU_COMPHY_CONF(2, 0, PHY_MODE_10GKR, 0x1),
> + /* lane 3 */
> + MVEBU_COMPHY_CONF(3, 1, PHY_MODE_SGMII, 0x2),
> + /* lane 4 */
> + MVEBU_COMPHY_CONF(4, 0, PHY_MODE_SGMII, 0x2),
> + MVEBU_COMPHY_CONF(4, 0, PHY_MODE_10GKR, 0x2),
> + MVEBU_COMPHY_CONF(4, 1, PHY_MODE_SGMII, 0x1),
> + /* lane 5 */
> + MVEBU_COMPHY_CONF(5, 2, PHY_MODE_SGMII, 0x1),
> +};

 IMHO all the lane and mode configuration should come from dt. That would 
 make
 it more reusable when comphy is configured differently.
>>>
>>> These connexions between engines and the comphy lanes are inside the
>>> SoC. They won't change for a given SoC, and the actual configuration is
>>> at the board level to know what is connected to the output of a given
>>> lane, which is already described into the dt (the lane phandle).
>>>
>>> So I think we can keep this inside the driver, and we'll had other
>>> tables if the same comphy is ever used in another SoC.
>>>
>>> What do you think?
>>
>> I'd like to avoid adding tables for every SoC. These are configuration 
>> details
>> and can come from dt.
> 
> Actually this is per CP design, not SoC (this one is used in both 7k and
> 8k SoCs from Marvell, and probably others). I'm still not convinced this
> is a good idea to put this into the dt. First of all we would end up with
> something like (and this is only for a single lane, out of *6*):
> 
> cpm_comphy: phy@phy@12 {
>   compatible = "marvell,comphy-cp110";
>   reg = <0x12 0x6000>;
>   marvell,system-controller = <_syscon0>;
>   #address-cells = <1>;
>   #size-cells = <0>;
> 
>   cpm_comphy0: phy@0 {
>   reg = <0>;
>   #phy-cells = <1>;
> 
>   mode@0 {
>   phy-mode = PHY_MODE_SGMII;
>   selector = <0x1>;
>   pipe-selector = <0x0>;
>   port = <0>;
>   };
> 
>   mode@1 {
>   phy-mode = PHY_MODE_HS_SGMII;
>   selector = <0x1>;
>   pipe-selector = <0x0>;
>   port = <0>;
>   };
> 
>   mode@2 {
>   phy-mode = PHY_MODE_RXAUI;
>   selector = <0x2>;
>   pipe-selector = <0x0>;
>   port = <0>;
>   };
> 
>   mode@3 {
>   phy-mode = PHY_MODE_10GKR;
>   selector = <0x2>;
>   pipe-selector = <0x0>;
>   port = <0>;
>   };
> 
>   mode@4 {
>   phy-mode = PHY_MODE_SATA;
>   selector = <0x4>;
>   pipe-selector = <0x0>;
>   port = <1>;
>   };
> 
>   mode@5 {
>   phy-mode = PHY_MODE_USB;
>   selector = <0x0>;
>   pipe-selector = <0x1>;
>   port = <2>;
>   };
> 
>   mode@6 {
>   phy-mode = PHY_MODE_USB;
>   selector = <0x0>;
>   pipe-selector = <0x2>;
>   port = <0>;
>   };

I think we should just select the mode that a particular lane has been
configured here instead of populating all the modes. But I think that doesn't
make sense since the mode is set by the consumer and the initial mode is
INVALID. So ignore my comment on having it in dt.
> 
>   ... + PCIe, other ports ...
>   };
> 
>   cpm_comphy1: phy@1 {
>   ...
>   };
> 
>   ...
> };
> 
> And this "configuration" makes us think the comphy driver would be
> somehow generic with only these parameters to update when using a
> different CP. In reality we do have one other comphy in another CP, and
> it requires more than just updating the above parameters: the init
> functions also are SoC specific. So the table used in the patch proposed
> only is a small part of this "configuration". In fact it's not a
> configuration at all, but only a mode-to-bit indirection, used in the
> comphy init functions.
> 
> What is proposed instead, is very close to what actually changes a lot,
> and what a designer can change: the CP 

[net-next:master 427/429] arch/sh/include/asm/io.h:30:43: warning: 'return' with a value, in function returning void

2017-08-29 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   eaa72dc47488d599439cd0fd0f8c4f1bcb3906bb
commit: 69d2ea9c798983c4a7157278ec84ff969d1cd8e8 [427/429] net: bcmgenet: Use 
correct I/O accessors
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout 69d2ea9c798983c4a7157278ec84ff969d1cd8e8
# save the attached .config to linux build tree
make.cross ARCH=sh 

All warnings (new ones prefixed by >>):

   In file included from include/linux/io.h:25:0,
from include/linux/irq.h:24,
from arch/sh/include/asm/hardirq.h:5,
from include/linux/hardirq.h:8,
from include/linux/interrupt.h:12,
from drivers/net//ethernet/broadcom/genet/bcmgenet.c:18:
   drivers/net//ethernet/broadcom/genet/bcmgenet.h: In function 
'bcmgenet_ext_writel':
>> arch/sh/include/asm/io.h:30:43: warning: 'return' with a value, in function 
>> returning void
#define __raw_writel(v,a) (__chk_io_ptr(a), *(volatile u32 __force *)(a) = 
(v))
  
^
   drivers/net//ethernet/broadcom/genet/bcmgenet.h:687:10: note: in expansion 
of macro '__raw_writel'
  return __raw_writel(val, priv->base + offset + off); \
 ^~~~
   drivers/net//ethernet/broadcom/genet/bcmgenet.h:692:1: note: in expansion of 
macro 'GENET_IO_MACRO'
GENET_IO_MACRO(ext, GENET_EXT_OFF);
^~
   In file included from drivers/net//ethernet/broadcom/genet/bcmgenet.c:49:0:
   drivers/net//ethernet/broadcom/genet/bcmgenet.h:683:20: note: declared here
static inline void bcmgenet_##name##_writel(struct bcmgenet_priv *priv, \
   ^
   drivers/net//ethernet/broadcom/genet/bcmgenet.h:692:1: note: in expansion of 
macro 'GENET_IO_MACRO'
GENET_IO_MACRO(ext, GENET_EXT_OFF);
^~
   In file included from include/linux/io.h:25:0,
from include/linux/irq.h:24,
from arch/sh/include/asm/hardirq.h:5,
from include/linux/hardirq.h:8,
from include/linux/interrupt.h:12,
from drivers/net//ethernet/broadcom/genet/bcmgenet.c:18:
   drivers/net//ethernet/broadcom/genet/bcmgenet.h: In function 
'bcmgenet_umac_writel':
>> arch/sh/include/asm/io.h:30:43: warning: 'return' with a value, in function 
>> returning void
#define __raw_writel(v,a) (__chk_io_ptr(a), *(volatile u32 __force *)(a) = 
(v))
  
^
   drivers/net//ethernet/broadcom/genet/bcmgenet.h:687:10: note: in expansion 
of macro '__raw_writel'
  return __raw_writel(val, priv->base + offset + off); \
 ^~~~
   drivers/net//ethernet/broadcom/genet/bcmgenet.h:693:1: note: in expansion of 
macro 'GENET_IO_MACRO'
GENET_IO_MACRO(umac, GENET_UMAC_OFF);
^~
   In file included from drivers/net//ethernet/broadcom/genet/bcmgenet.c:49:0:
   drivers/net//ethernet/broadcom/genet/bcmgenet.h:683:20: note: declared here
static inline void bcmgenet_##name##_writel(struct bcmgenet_priv *priv, \
   ^
   drivers/net//ethernet/broadcom/genet/bcmgenet.h:693:1: note: in expansion of 
macro 'GENET_IO_MACRO'
GENET_IO_MACRO(umac, GENET_UMAC_OFF);
^~
   In file included from include/linux/io.h:25:0,
from include/linux/irq.h:24,
from arch/sh/include/asm/hardirq.h:5,
from include/linux/hardirq.h:8,
from include/linux/interrupt.h:12,
from drivers/net//ethernet/broadcom/genet/bcmgenet.c:18:
   drivers/net//ethernet/broadcom/genet/bcmgenet.h: In function 
'bcmgenet_sys_writel':
>> arch/sh/include/asm/io.h:30:43: warning: 'return' with a value, in function 
>> returning void
#define __raw_writel(v,a) (__chk_io_ptr(a), *(volatile u32 __force *)(a) = 
(v))
  
^
   drivers/net//ethernet/broadcom/genet/bcmgenet.h:687:10: note: in expansion 
of macro '__raw_writel'
  return __raw_writel(val, priv->base + offset + off); \
 ^~~~
   drivers/net//ethernet/broadcom/genet/bcmgenet.h:694:1: note: in expansion of 
macro 'GENET_IO_MACRO'
GENET_IO_MACRO(sys, GENET_SYS_OFF);
^~
   In file included from drivers/net//ethernet/broadcom/genet/bcmgenet.c:49:0:
   drivers/net//ethernet/broadcom/genet/bcmgenet.h:683:20: note: declared here
static inline void bcmgenet_##name##_writel(struct bcmgenet_priv *priv, \
   ^
   

Re: [PATCH net-next v3 02/13] phy: add the mvebu cp110 comphy driver

2017-08-29 Thread Kishon Vijay Abraham I
Hi Antoine,

On Monday 28 August 2017 08:27 PM, Antoine Tenart wrote:
> On the CP110 unit, which can be found on various Marvell platforms such
> as the 7k and 8k (currently), a comphy (common PHYs) hardware block can
> be found. This block provides a number of PHYs which can be used in
> various modes by other controllers (network, SATA ...). These common
> PHYs must be configured for the controllers using them to work correctly
> either at boot time, or when the system runs to switch the mode used.
> This patch adds a driver for this comphy hardware block, providing
> callbacks for the its PHYs so that consumers can configure the modes
> used.
> 
> As of this commit, two modes are supported by the comphy driver: sgmii
> and 10gkr.

Have one more minor comment in addition to my previous comments..
> 
> Signed-off-by: Antoine Tenart 
> ---
>  drivers/phy/marvell/Kconfig  |  10 +
>  drivers/phy/marvell/Makefile |   1 +
>  drivers/phy/marvell/phy-mvebu-cp110-comphy.c | 656 
> +++
>  3 files changed, 667 insertions(+)
>  create mode 100644 drivers/phy/marvell/phy-mvebu-cp110-comphy.c
> 
> diff --git a/drivers/phy/marvell/Kconfig b/drivers/phy/marvell/Kconfig
> index 048d8893bc2e..26755f3d1a9a 100644
> --- a/drivers/phy/marvell/Kconfig
> +++ b/drivers/phy/marvell/Kconfig
> @@ -21,6 +21,16 @@ config PHY_BERLIN_USB
>   help
> Enable this to support the USB PHY on Marvell Berlin SoCs.
>  
> +config PHY_MVEBU_CP110_COMPHY
> + tristate "Marvell CP110 comphy driver"
> + depends on ARCH_MVEBU && OF
> + select GENERIC_PHY
> + help
> +   This driver allows to control the comphy, an hardware block providing
> +   shared serdes PHYs on Marvell Armada 7k/8k (in the CP110). Its serdes
> +   lanes can be used by various controllers (Ethernet, sata, usb,
> +   PCIe...).
> +
>  config PHY_MVEBU_SATA
>   def_bool y
>   depends on ARCH_DOVE || MACH_DOVE || MACH_KIRKWOOD
> diff --git a/drivers/phy/marvell/Makefile b/drivers/phy/marvell/Makefile
> index 3fc188f59118..0cf6a7cbaf9f 100644
> --- a/drivers/phy/marvell/Makefile
> +++ b/drivers/phy/marvell/Makefile
> @@ -1,6 +1,7 @@
>  obj-$(CONFIG_ARMADA375_USBCLUSTER_PHY)   += phy-armada375-usb2.o
>  obj-$(CONFIG_PHY_BERLIN_SATA)+= phy-berlin-sata.o
>  obj-$(CONFIG_PHY_BERLIN_USB) += phy-berlin-usb.o
> +obj-$(CONFIG_PHY_MVEBU_CP110_COMPHY) += phy-mvebu-cp110-comphy.o
>  obj-$(CONFIG_PHY_MVEBU_SATA) += phy-mvebu-sata.o
>  obj-$(CONFIG_PHY_PXA_28NM_HSIC)  += phy-pxa-28nm-hsic.o
>  obj-$(CONFIG_PHY_PXA_28NM_USB2)  += phy-pxa-28nm-usb2.o
> diff --git a/drivers/phy/marvell/phy-mvebu-cp110-comphy.c 
> b/drivers/phy/marvell/phy-mvebu-cp110-comphy.c
> new file mode 100644
> index ..41556e790856
> --- /dev/null
> +++ b/drivers/phy/marvell/phy-mvebu-cp110-comphy.c
> @@ -0,0 +1,656 @@
> +/*
> + * Copyright (C) 2017 Marvell
> + *
> + * Antoine Tenart 
> + *
> + * This file is licensed under the terms of the GNU General Public
> + * License version 2. This program is licensed "as is" without any
> + * warranty of any kind, whether express or implied.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/* Relative to priv->base */
> +#define MVEBU_COMPHY_SERDES_CFG0(n)  (0x0 + (n) * 0x1000)
> +#define MVEBU_COMPHY_SERDES_CFG0_PU_PLL  BIT(1)
> +#define MVEBU_COMPHY_SERDES_CFG0_GEN_RX(n)   ((n) << 3)
> +#define MVEBU_COMPHY_SERDES_CFG0_GEN_TX(n)   ((n) << 7)
> +#define MVEBU_COMPHY_SERDES_CFG0_PU_RX   BIT(11)
> +#define MVEBU_COMPHY_SERDES_CFG0_PU_TX   BIT(12)
> +#define MVEBU_COMPHY_SERDES_CFG0_HALF_BUSBIT(14)
> +#define MVEBU_COMPHY_SERDES_CFG1(n)  (0x4 + (n) * 0x1000)
> +#define MVEBU_COMPHY_SERDES_CFG1_RESET   BIT(3)
> +#define MVEBU_COMPHY_SERDES_CFG1_RX_INIT BIT(4)
> +#define MVEBU_COMPHY_SERDES_CFG1_CORE_RESET  BIT(5)
> +#define MVEBU_COMPHY_SERDES_CFG1_RF_RESETBIT(6)
> +#define MVEBU_COMPHY_SERDES_CFG2(n)  (0x8 + (n) * 0x1000)
> +#define MVEBU_COMPHY_SERDES_CFG2_DFE_EN  BIT(4)
> +#define MVEBU_COMPHY_SERDES_STATUS0(n)   (0x18 + (n) * 0x1000)
> +#define MVEBU_COMPHY_SERDES_STATUS0_TX_PLL_RDY   BIT(2)
> +#define MVEBU_COMPHY_SERDES_STATUS0_RX_PLL_RDY   BIT(3)
> +#define MVEBU_COMPHY_SERDES_STATUS0_RX_INIT  BIT(4)
> +#define MVEBU_COMPHY_PWRPLL_CTRL(n)  (0x804 + (n) * 0x1000)
> +#define MVEBU_COMPHY_PWRPLL_CTRL_RFREQ(n)((n) << 0)
> +#define MVEBU_COMPHY_PWRPLL_PHY_MODE(n)  ((n) << 5)
> +#define MVEBU_COMPHY_IMP_CAL(n)  (0x80c + (n) * 0x1000)
> +#define MVEBU_COMPHY_IMP_CAL_TX_EXT(n)   ((n) << 10)
> +#define MVEBU_COMPHY_IMP_CAL_TX_EXT_EN   BIT(15)
> +#define MVEBU_COMPHY_DFE_RES(n)  (0x81c + 

[PATCH net-next] net: bcmgenet: Do not return from void function

2017-08-29 Thread Florian Fainelli
A stray return was added in the macro bcmgenet_##name##_writel where it
should not, drop it.

Reported-by: kbuild test robot 
Fixes: 69d2ea9c7989 ("net: bcmgenet: Use correct I/O accessors")
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/genet/bcmgenet.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.h 
b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
index 2bfeaefcca0f..4c49d0b97748 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.h
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
@@ -684,7 +684,7 @@ static inline void bcmgenet_##name##_writel(struct 
bcmgenet_priv *priv, \
u32 val, u32 off)   \
 {  \
if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)) \
-   return __raw_writel(val, priv->base + offset + off);\
+   __raw_writel(val, priv->base + offset + off);   \
else\
writel_relaxed(val, priv->base + offset + off); \
 }
-- 
2.11.0



[PATCH net-next 3/3 v11] drivers: net: ethernet: qualcomm: rmnet: Initial implementation

2017-08-29 Thread Subash Abhinov Kasiviswanathan
RmNet driver provides a transport agnostic MAP (multiplexing and
aggregation protocol) support in embedded module. Module provides
virtual network devices which can be attached to any IP-mode
physical device. This will be used to provide all MAP functionality
on future hardware in a single consistent location.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 Documentation/networking/rmnet.txt |  82 
 drivers/net/ethernet/qualcomm/Kconfig  |   2 +
 drivers/net/ethernet/qualcomm/Makefile |   2 +
 drivers/net/ethernet/qualcomm/rmnet/Kconfig|  12 +
 drivers/net/ethernet/qualcomm/rmnet/Makefile   |  10 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 419 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h |  56 +++
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   | 271 +
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.h   |  26 ++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h|  88 +
 .../ethernet/qualcomm/rmnet/rmnet_map_command.c| 107 ++
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   | 105 ++
 .../net/ethernet/qualcomm/rmnet/rmnet_private.h|  45 +++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 170 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h|  29 ++
 15 files changed, 1424 insertions(+)
 create mode 100644 Documentation/networking/rmnet.txt
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/Kconfig
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/Makefile
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h

diff --git a/Documentation/networking/rmnet.txt 
b/Documentation/networking/rmnet.txt
new file mode 100644
index 000..6b341ea
--- /dev/null
+++ b/Documentation/networking/rmnet.txt
@@ -0,0 +1,82 @@
+1. Introduction
+
+rmnet driver is used for supporting the Multiplexing and aggregation
+Protocol (MAP). This protocol is used by all recent chipsets using Qualcomm
+Technologies, Inc. modems.
+
+This driver can be used to register onto any physical network device in
+IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator.
+
+Multiplexing allows for creation of logical netdevices (rmnet devices) to
+handle multiple private data networks (PDN) like a default internet, tethering,
+multimedia messaging service (MMS) or IP media subsystem (IMS). Hardware sends
+packets with MAP headers to rmnet. Based on the multiplexer id, rmnet
+routes to the appropriate PDN after removing the MAP header.
+
+Aggregation is required to achieve high data rates. This involves hardware
+sending aggregated bunch of MAP frames. rmnet driver will de-aggregate
+these MAP frames and send them to appropriate PDN's.
+
+2. Packet format
+
+a. MAP packet (data / control)
+
+MAP header has the same endianness of the IP packet.
+
+Packet format -
+
+Bit 0 1   2-7  8 - 15   16 - 31
+Function   Command / Data   Reserved Pad   Multiplexer IDPayload length
+Bit32 - x
+Function Raw  Bytes
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Control packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Reserved bits are usually zeroed out and to be ignored by receiver.
+
+Padding is number of bytes to be added for 4 byte alignment if required by
+hardware.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+b. MAP packet (command specific)
+
+Bit 0 1   2-7  8 - 15   16 - 31
+Function   Command Reserved Pad   Multiplexer IDPayload length
+Bit  32 - 3940 - 4546 - 47   48 - 63
+Function   Command nameReserved   Command Type   Reserved
+Bit  64 - 95
+Function   Transaction ID
+Bit  96 - 127
+Function   Command data
+
+Command 1 indicates disabling flow while 2 is enabling flow
+
+Command types -
+0 for MAP command request
+1 is to acknowledge the receipt of a command
+2 is for unsupported commands
+3 is for error during processing of commands
+
+c. Aggregation
+
+Aggregation is multiple MAP packets (can be data or command) delivered 

[PATCH net-next 2/3 v11] net: arp: Add support for raw IP device

2017-08-29 Thread Subash Abhinov Kasiviswanathan
Define the raw IP type. This is needed for raw IP net devices
like rmnet.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 include/uapi/linux/if_arp.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/if_arp.h b/include/uapi/linux/if_arp.h
index cf73510..a2a6356 100644
--- a/include/uapi/linux/if_arp.h
+++ b/include/uapi/linux/if_arp.h
@@ -59,6 +59,7 @@
 #define ARPHRD_LAPB516 /* LAPB */
 #define ARPHRD_DDCMP517/* Digital's DDCMP protocol */
 #define ARPHRD_RAWHDLC 518 /* Raw HDLC */
+#define ARPHRD_RAWIP519/* Raw IP   */
 
 #define ARPHRD_TUNNEL  768 /* IPIP tunnel  */
 #define ARPHRD_TUNNEL6 769 /* IP6IP6 tunnel*/
-- 
1.9.1



[PATCH net-next 0/3 v11] Add support for rmnet driver

2017-08-29 Thread Subash Abhinov Kasiviswanathan
This patch series adds support for the rmnet driver which is required to
support recent chipsets using Qualcomm Technologies, Inc. modems. The data
from hardware follows the multiplexing and aggregation protocol (MAP).

This driver can be used to register onto any physical network device in
IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator.

rmnet driver helps to decode these packets and queue them to network
stack (and encode and transmit it to the physical device).

v1: Same as the RFC patch with some minor fixes for issues reported by
kbuild test robot.

v1->v2: Change datatypes and remove config IOCTL as mentioned by David.
Also fix checkpatch issues and remove some unused code.

v2->v3: Move location to drivers/net and rename to rmnet. Change the
userspace - netlink communication from custom netlink to rtnl_link_ops.
Refactor some code. Use a fixed config for ingress and egress.

v3->v4: Move location to drivers/net/ethernet/qualcomm/.
Fix comments from Stephen and Jiri -
Split the ether and arp type changes into seperate patches.
Remove debug and custom logging and switch to standard netdevice log.
Remove module parameters. Refactor and change some code style issues.

v4->v5: Rename some structs and variables. Move the initializer
before the for loop start. Put the arp type in correct sequence.

v5->v6: Fix comments from Dan -
Use the upper link API. As a result, remove all the refcounting logic.
Device refcount is explicitly held on real_dev on rx_handler
registration only. Modifiy the flow control struct. Remove the unused
ethernet mode handling.

v6->v7: Fix comments from David - Add newline to end of Makefile. Remove
inline from .c files. Move the module init/exit to rmnet config. Fix an
error reported by kbuild test robot for an unused file.

v7->v8: Use a smaller value for ETH_P_MAP as mentioned by David. Change
netdev_info to netdev_dbg as mentioned by Andew. Fix comments from
Stephen regarding netdev_priv and sparse related errors of using 0 as NULL

v8->v9: Fix comments from David - Remove the CFLAG rule. Change the way
rmnet devices are freed. Instead of using a workqueue to unregister devices
individually, go through the list and free all devices within the rtnl_lock().

v9->v10: Actually fix the locking as mentioned by David. The locking scheme is
mentioned in a comment in rmnet_config.c. Change comment near MAP type
definition as mentioned by Dan. Refactor some code.

v10->v11: Allow RMNET to compile as a module as mentioned by David

Subash Abhinov Kasiviswanathan (3):
  net: ether: Add support for multiplexing and aggregation type
  net: arp: Add support for raw IP device
  drivers: net: ethernet: qualcomm: rmnet: Initial implementation

 Documentation/networking/rmnet.txt |  82 
 drivers/net/ethernet/qualcomm/Kconfig  |   2 +
 drivers/net/ethernet/qualcomm/Makefile |   2 +
 drivers/net/ethernet/qualcomm/rmnet/Kconfig|  12 +
 drivers/net/ethernet/qualcomm/rmnet/Makefile   |  10 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 419 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h |  56 +++
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   | 271 +
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.h   |  26 ++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h|  88 +
 .../ethernet/qualcomm/rmnet/rmnet_map_command.c| 107 ++
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   | 105 ++
 .../net/ethernet/qualcomm/rmnet/rmnet_private.h|  45 +++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 170 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h|  29 ++
 include/uapi/linux/if_arp.h|   1 +
 include/uapi/linux/if_ether.h  |   3 +
 17 files changed, 1428 insertions(+)
 create mode 100644 Documentation/networking/rmnet.txt
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/Kconfig
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/Makefile
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h

-- 
1.9.1



[PATCH net-next 1/3 v11] net: ether: Add support for multiplexing and aggregation type

2017-08-29 Thread Subash Abhinov Kasiviswanathan
Define the Qualcomm multiplexing and aggregation (MAP) ether type 0x00F9.
This is needed for receiving data in the MAP protocol like RMNET. This is
not an officially registered ID.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 include/uapi/linux/if_ether.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/if_ether.h b/include/uapi/linux/if_ether.h
index 5bc9bfd..30526db 100644
--- a/include/uapi/linux/if_ether.h
+++ b/include/uapi/linux/if_ether.h
@@ -137,6 +137,9 @@
 #define ETH_P_IEEE802154 0x00F6/* IEEE802.15.4 frame   
*/
 #define ETH_P_CAIF 0x00F7  /* ST-Ericsson CAIF protocol*/
 #define ETH_P_XDSA 0x00F8  /* Multiplexed DSA protocol */
+#define ETH_P_MAP  0x00F9  /* Qualcomm multiplexing and
+* aggregation protocol
+*/
 
 /*
  * This is an Ethernet frame header.
-- 
1.9.1



Re: [PATCH net-next v3 0/3] NCSI VLAN Filtering Support

2017-08-29 Thread Samuel Mendoza-Jonas
On Mon, 2017-08-28 at 16:50 -0700, David Miller wrote:
> From: Samuel Mendoza-Jonas 
> Date: Mon, 28 Aug 2017 16:18:40 +1000
> 
> > This series (mainly patch 2) adds VLAN filtering to the NCSI implementation.
> > A fair amount of code already exists in the NCSI stack for VLAN filtering 
> > but
> > none of it is actually hooked up. This goes the final mile and fixes a few
> > bugs in the existing code found along the way (patch 1).
> > 
> > Patch 3 adds the appropriate flag and callbacks to the ftgmac100 driver to
> > enable filtering as it's a large consumer of NCSI (and what I've been
> > testing on).
> > 
> > v3:   - Add comment describing change to ncsi_find_filter()
> >   - Catch NULL in clear_one_vid() from ncsi_get_filter()
> >   - Simplify state changes when kicking updated channel
> 
> Series applied.

Thanks David,

The kbuild bot caught a build error where the add/kill callbacks aren't
defined without CONFIG_NET_NCSI:

>> ERROR: "ncsi_vlan_rx_kill_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] 
>> undefined!
>> ERROR: "ncsi_vlan_rx_add_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] 
>> undefined!

It's a quick fixup to patch 3 as below, would you like me to send it as a v4?


diff --git a/include/net/ncsi.h b/include/net/ncsi.h
index 1f96af46df49..2b13b6b91a4d 100644
--- a/include/net/ncsi.h
+++ b/include/net/ncsi.h
@@ -36,6 +36,14 @@ int ncsi_start_dev(struct ncsi_dev *nd);
 void ncsi_stop_dev(struct ncsi_dev *nd);
 void ncsi_unregister_dev(struct ncsi_dev *nd);
 #else /* !CONFIG_NET_NCSI */
+int ncsi_vlan_rx_add_vid(struct net_device *dev, __be16 proto, u16 vid)
+{
+   return -ENOTTY;
+}
+int ncsi_vlan_rx_kill_vid(struct net_device *dev, __be16 proto, u16 vid)
+{
+   return -ENOTTY;
+}
 static inline struct ncsi_dev *ncsi_register_dev(struct net_device *dev,
void (*notifier)(struct ncsi_dev *nd))
 {




Re: [PATCH v2 net-next 1/8] bpf: Add support for recursively running cgroup sock filters

2017-08-29 Thread Alexei Starovoitov
On Tue, Aug 29, 2017 at 09:38:16PM -0600, David Ahern wrote:
> On 8/29/17 8:58 PM, Alexei Starovoitov wrote:
> > On Tue, Aug 29, 2017 at 07:03:43PM -0600, David Ahern wrote:
> >> On 8/28/17 10:11 PM, Alexei Starovoitov wrote:
> >>>
> >>> Agree on the above, but you're mixing semantics of the new recurse
> >>> flag and implementation of it. Ex: we don't have to copy this flag
> >>> from prog->attr into cgroup. So this reset or non-reset discussion
> >>> only makes sense in the context of your current implementation.
> >>> We can implement the logic differently. Like don't copy that flag
> >>> at all and at attach time walk parent->parent->parent and see
> >>> what programs are attached. All of them should have prog->attr & 
> >>> recurse_bit set
> >>> In such implementation detach from 'b' is a nop from reset/non-reset
> >>> point of view. When socket creation in 'c' is invoked the program
> >>> 'c' is called first then the code keeps walking parents until root
> >>> invoking 'a' along the way.
> >>
> >> So you are suggesting there is no recursive flag per cgroup? How do you
> >> know you need to walk cgroups? How do you know when to stop running
> >> programs?
> > 
> > you're talking about implementation, right?
> > My 'proposed' implemenation of walking from cgroup all the way to the root
> > is just an example. It's not efficient. More below...
> > 
> >>> I'm not saying it will be an efficient implementation. The point
> >>> is to discuss UAPI independent of implementation.
> >>>
>  ###
> 
>  Also, let's agree on this intention. Based on the new ground rule, I
>  want to point out this example:
> 
>  If 'a' gets a program installed with no recurse flag set, ONLY processes
>  in 'a' have the 'a' program run. Processes in groups 'b', 'c' and 'd'
>  all stop at cgroup 'b' program.
> >>>
> >>> I'm proposing that such situation should not be allowed to happen.
> >>> In a->b->c->d cgroup scenario if override+recurse prog attached to 'b'
> >>> then only the same override+recurse can be attached to c, d, a.
> >>> So at detach time there can be gaps (like only 'b' and 'd' have
> >>> override+recurse progs), but walking up until root from any point
> >>> will guarantee that only override+recurse programs are seen.
> >>>
> >>
> >> That seems very limiting to me. Seems like you are suggesting the entire
> >> cgroup tree is recursive or non-recursive, but never a mix.
> > 
> > Entire cgroup subtree. Yes. It's the simplest uapi I could think of.
> 
> So 10 email exchanges later you agree on the UAPI I implemented in this
> patch: user opts in to recursive behavior via a new flag at attach time,
> and once a recursive program is installed at some point in the cgroup
> tree it applies to all descendant cgroups.
> 
> So all of these exchanges weren't about the UAPI, but your disagreement
> in my implementation. The only user visible change here is only programs
> marked recursive are run versus going back to the first cgroup marked
> non-recursive.

you cannot be serious. Your implemention is not at all what i'm proposing
above as a simplest uapi. Should we all go back and re-read from the beginning?

> > Easy to understand and argue about and I think it's solving your use case.
> > It's also easily extendable. New combination and features won't break
> > the users. It feels you're in rush to get this stuff for this merge
> > window, therefore I want to agree on something that is simple,
> > non-controversial and extensible.
> 
> I am in no-rush, but this does not to fall by the wayside like the net
> namespace specification.

It's more than that! I think you only looking at it from vrf perspective
whereas cgroup-bpf became a corner stone feature and enabler for tcp-bpf
which in turn became a stepping stone for bpf_sk_redirect.
So no, I'm not going to let something half baked that touches
the core idea of cgroup-bpf slide in.
Tejun and Andy also need to take a look, so yes it will take long
time for everyone to agree on this core uapi.

> Given the pending release of 4.13 net-next will close which gives a 2+
> week window to work on v3 before the next merge window. Plenty of time
> for me to work it into the many other things on my plate.

As I proposed several emails ago, please repost patches 2 and 3 that
I already acked and we can continue discussing this patch without
the agitation.



Re: multi-queue over IFF_NO_QUEUE "virtual" devices

2017-08-29 Thread Florian Fainelli
Le 08/07/17 à 15:26, Florian Fainelli a écrit :
> Hi,
> 
> Most DSA supported Broadcom switches have multiple queues per ports
> (usually 8) and each of these queues can be configured with different
> pause, drop, hysteresis thresholds and so on in order to make use of the
> switch's internal buffering scheme and have some queues achieve some
> kind of lossless behavior (e.g: LAN to LAN traffic for Q7 has a higher
> priority than LAN to WAN for Q0).
> 
> This is obviously very workload specific, so I'd want maximum
> programmability as much as possible.
> 
> This brings me to a few questions:
> 
> 1) If we have the DSA slave network devices currently flagged with
> IFF_NO_QUEUE becoming multi-queue (on TX) aware such that an application
> can control exactly which switch egress queue is used on a per-flow
> basis, would that be a problem (this is the dynamic selection of the TX
> queue)?

So I have this part figured out, with a bunch of changes network devices
created by DSA are now multiqueue aware and the Broadcom tag layer is
capable of extracting the queue index, passing it in the tag where
expected and having the switch forward to the appropriate switch port
and queue within that port. It also sets the queue mapping in the SKB
for later consumption by the master network device driver: bcmsysport.c
because of 2).

> 
> 2) The conduit interface (CPU) port network interface has a congestion
> control scheme which requires each of its TX queues (32 or 16) to be
> statically mapped to each of the underlying switch port queues because
> the congestion/ HW needs to inspect the queue depths of the switch to
> accept/reject a packet at the CPU's TX ring level. Do we have a good way
> with tc to map a virtual/stacked device's queue(s) on-top of its
> physical/underlying device's queues (this is the static queue mapping
> necessary for congestion to work)?

That part I have not figured out yet, with some static mapping I can
obtain the results that I want and was even considering the possibility
of doing something like this:

- register a network device notifier with bcmsysport.c (master network
device) for this setup
- expose a helper function allowing me to obtain a given DSA network
device port index
- whenever DSA creates network devices reconfigure the ring and queue
mapping of the TX queues managed by bcmsysport.c with the DSA network
device port index that has just been registered and just do a 1-1
mapping of the 8 queues

You would end-up with something like:

gphy (port 0) queues 0-7 mapped to systemport queues 0-7
rgmii_1 (port 1) queues 0-7 mapped to systemport queues 8-15
rgmii_2 (port 2) queues 0-7 mapped to systemport queues 16 through 23
moca (port 7) queues 0-7 mapped to systemport queues 24-31

This should be working because bcmsysport's TX queues are not under
direct control by the user, they are used via DSA created network
devices which indicate the queue they want to use. When the DSA
interfaces are brought down, their respective systemport queues now
become unused. This also works because the number of physical ports of
the switch times the number of queues is matching the number of TX
queues from systemport (like if someone designed it with that exact
purpose in mind ;)).

The only problem with that approach of course is that it embeds a policy
within the systemport driver.

Ideally I would really like to configure this via tc by setting up a
mapping between queues of one network devices to queues of another
network device, is that a possible thing, Jamal, Cong, Jiri, do you know?
-- 
Florian


[net-next:master 427/429] drivers/net/ethernet/broadcom/genet/bcmgenet.h:687:10: note: in expansion of macro '__raw_writel'

2017-08-29 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   eaa72dc47488d599439cd0fd0f8c4f1bcb3906bb
commit: 69d2ea9c798983c4a7157278ec84ff969d1cd8e8 [427/429] net: bcmgenet: Use 
correct I/O accessors
config: blackfin-allyesconfig (attached as .config)
compiler: bfin-uclinux-gcc (GCC) 6.2.0
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout 69d2ea9c798983c4a7157278ec84ff969d1cd8e8
# save the attached .config to linux build tree
make.cross ARCH=blackfin 

All error/warnings (new ones prefixed by >>):

   In file included from arch/blackfin/mach-bf533/include/mach/blackfin.h:15:0,
from arch/blackfin/include/asm/irqflags.h:11,
from include/linux/irqflags.h:15,
from arch/blackfin/include/asm/bitops.h:33,
from include/linux/bitops.h:36,
from include/linux/kernel.h:10,
from drivers/net/ethernet/broadcom/genet/bcmgenet.c:13:
   drivers/net/ethernet/broadcom/genet/bcmgenet.h: In function 
'bcmgenet_ext_writel':
>> arch/blackfin/include/asm/def_LPBlackfin.h:38:2: error: expected expression 
>> before '__asm__'
 __asm__ __volatile__( \
 ^
>> arch/blackfin/include/asm/def_LPBlackfin.h:51:33: note: in expansion of 
>> macro '_bfin_writeX'
#define bfin_write32(addr, val) _bfin_writeX(addr, val, 32,  )
^~~~
>> arch/blackfin/include/asm/io.h:20:33: note: in expansion of macro 
>> 'bfin_write32'
#define __raw_writel(val, addr) bfin_write32(addr, val)
^~~~
>> drivers/net/ethernet/broadcom/genet/bcmgenet.h:687:10: note: in expansion of 
>> macro '__raw_writel'
  return __raw_writel(val, priv->base + offset + off); \
 ^~~~
   drivers/net/ethernet/broadcom/genet/bcmgenet.h:692:1: note: in expansion of 
macro 'GENET_IO_MACRO'
GENET_IO_MACRO(ext, GENET_EXT_OFF);
^~
>> arch/blackfin/include/asm/def_LPBlackfin.h:38:2: warning: 'return' with a 
>> value, in function returning void
 __asm__ __volatile__( \
 ^
>> arch/blackfin/include/asm/def_LPBlackfin.h:51:33: note: in expansion of 
>> macro '_bfin_writeX'
#define bfin_write32(addr, val) _bfin_writeX(addr, val, 32,  )
^~~~
>> arch/blackfin/include/asm/io.h:20:33: note: in expansion of macro 
>> 'bfin_write32'
#define __raw_writel(val, addr) bfin_write32(addr, val)
^~~~
>> drivers/net/ethernet/broadcom/genet/bcmgenet.h:687:10: note: in expansion of 
>> macro '__raw_writel'
  return __raw_writel(val, priv->base + offset + off); \
 ^~~~
   drivers/net/ethernet/broadcom/genet/bcmgenet.h:692:1: note: in expansion of 
macro 'GENET_IO_MACRO'
GENET_IO_MACRO(ext, GENET_EXT_OFF);
^~
   In file included from drivers/net/ethernet/broadcom/genet/bcmgenet.c:49:0:
   drivers/net/ethernet/broadcom/genet/bcmgenet.h:683:20: note: declared here
static inline void bcmgenet_##name##_writel(struct bcmgenet_priv *priv, \
   ^
   drivers/net/ethernet/broadcom/genet/bcmgenet.h:692:1: note: in expansion of 
macro 'GENET_IO_MACRO'
GENET_IO_MACRO(ext, GENET_EXT_OFF);
^~
   In file included from arch/blackfin/mach-bf533/include/mach/blackfin.h:15:0,
from arch/blackfin/include/asm/irqflags.h:11,
from include/linux/irqflags.h:15,
from arch/blackfin/include/asm/bitops.h:33,
from include/linux/bitops.h:36,
from include/linux/kernel.h:10,
from drivers/net/ethernet/broadcom/genet/bcmgenet.c:13:
   drivers/net/ethernet/broadcom/genet/bcmgenet.h: In function 
'bcmgenet_umac_writel':
>> arch/blackfin/include/asm/def_LPBlackfin.h:38:2: error: expected expression 
>> before '__asm__'
 __asm__ __volatile__( \
 ^
>> arch/blackfin/include/asm/def_LPBlackfin.h:51:33: note: in expansion of 
>> macro '_bfin_writeX'
#define bfin_write32(addr, val) _bfin_writeX(addr, val, 32,  )
^~~~
>> arch/blackfin/include/asm/io.h:20:33: note: in expansion of macro 
>> 'bfin_write32'
#define __raw_writel(val, addr) bfin_write32(addr, val)
^~~~
>> drivers/net/ethernet/broadcom/genet/bcmgenet.h:687:10: note: in expansion of 
>> macro '__raw_writel'
  return __raw_writel(val, priv->base + offset + off); \
 ^~~~
   drivers/net/ethernet/broadcom/genet/bcmgenet.h:693:1: note: in expansion of 
macro 'GENET_IO_MACRO'
GENET_IO_MACRO(umac, GENET_UMAC_OFF);
^~
>> arch/blackfin/include/asm/def_LPBlackfin.h:38:2: warning: 'return' with a 
>> value, in function returning 

Re: [PATCH v2 net-next 1/8] bpf: Add support for recursively running cgroup sock filters

2017-08-29 Thread David Ahern
On 8/29/17 8:58 PM, Alexei Starovoitov wrote:
> On Tue, Aug 29, 2017 at 07:03:43PM -0600, David Ahern wrote:
>> On 8/28/17 10:11 PM, Alexei Starovoitov wrote:
>>>
>>> Agree on the above, but you're mixing semantics of the new recurse
>>> flag and implementation of it. Ex: we don't have to copy this flag
>>> from prog->attr into cgroup. So this reset or non-reset discussion
>>> only makes sense in the context of your current implementation.
>>> We can implement the logic differently. Like don't copy that flag
>>> at all and at attach time walk parent->parent->parent and see
>>> what programs are attached. All of them should have prog->attr & 
>>> recurse_bit set
>>> In such implementation detach from 'b' is a nop from reset/non-reset
>>> point of view. When socket creation in 'c' is invoked the program
>>> 'c' is called first then the code keeps walking parents until root
>>> invoking 'a' along the way.
>>
>> So you are suggesting there is no recursive flag per cgroup? How do you
>> know you need to walk cgroups? How do you know when to stop running
>> programs?
> 
> you're talking about implementation, right?
> My 'proposed' implemenation of walking from cgroup all the way to the root
> is just an example. It's not efficient. More below...
> 
>>> I'm not saying it will be an efficient implementation. The point
>>> is to discuss UAPI independent of implementation.
>>>
 ###

 Also, let's agree on this intention. Based on the new ground rule, I
 want to point out this example:

 If 'a' gets a program installed with no recurse flag set, ONLY processes
 in 'a' have the 'a' program run. Processes in groups 'b', 'c' and 'd'
 all stop at cgroup 'b' program.
>>>
>>> I'm proposing that such situation should not be allowed to happen.
>>> In a->b->c->d cgroup scenario if override+recurse prog attached to 'b'
>>> then only the same override+recurse can be attached to c, d, a.
>>> So at detach time there can be gaps (like only 'b' and 'd' have
>>> override+recurse progs), but walking up until root from any point
>>> will guarantee that only override+recurse programs are seen.
>>>
>>
>> That seems very limiting to me. Seems like you are suggesting the entire
>> cgroup tree is recursive or non-recursive, but never a mix.
> 
> Entire cgroup subtree. Yes. It's the simplest uapi I could think of.

So 10 email exchanges later you agree on the UAPI I implemented in this
patch: user opts in to recursive behavior via a new flag at attach time,
and once a recursive program is installed at some point in the cgroup
tree it applies to all descendant cgroups.

So all of these exchanges weren't about the UAPI, but your disagreement
in my implementation. The only user visible change here is only programs
marked recursive are run versus going back to the first cgroup marked
non-recursive.

> Easy to understand and argue about and I think it's solving your use case.
> It's also easily extendable. New combination and features won't break
> the users. It feels you're in rush to get this stuff for this merge
> window, therefore I want to agree on something that is simple,
> non-controversial and extensible.

I am in no-rush, but this does not to fall by the wayside like the net
namespace specification.

Given the pending release of 4.13 net-next will close which gives a 2+
week window to work on v3 before the next merge window. Plenty of time
for me to work it into the many other things on my plate.


Re: [PATCH net-next] ipv6: Use rt6i_idev index for echo replies to a local address

2017-08-29 Thread Eric Dumazet
On Mon, 2017-08-28 at 13:53 -0700, David Ahern wrote:
> Tariq repored local pings to linklocal address is failing:
> $ ifconfig ens8
> ens8: flags=4163  mtu 1500
> inet 11.141.16.6  netmask 255.255.0.0  broadcast 11.141.255.255
> inet6 fe80::7efe:90ff:fecb:7502  prefixlen 64  scopeid 0x20
> ether 7c:fe:90:cb:75:02  txqueuelen 1000  (Ethernet)
> RX packets 12  bytes 1164 (1.1 KiB)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 30  bytes 2484 (2.4 KiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> 
> $  /bin/ping6 -c 3 fe80::7efe:90ff:fecb:7502%ens8
> PING fe80::7efe:90ff:fecb:7502%ens8(fe80::7efe:90ff:fecb:7502) 56 data bytes
> 
> --- fe80::7efe:90ff:fecb:7502%ens8 ping statistics ---

Note that the presence of this leading --- had the effect of truncating
the merged patch from this point. (all tags were ignored)

> 3 packets transmitted, 0 received, 100% packet loss, time 2043ms
> 
> icmpv6_echo_reply needs to use the rt6i_idev dev index for local traffic
> similar to how icmp6_send does. Convert the change for icmp6_send into a
> helper that can be used in both places. Add the long over due
> skb_rt6_info helper to convert dst on an skb to rt6_info similar to
> skb_rtable for ipv4.
> 
> Fixes: 4832c30d5458 ("net: ipv6: put host and anycast routes on
>device with address")
> Reported-by: Tariq Toukan 
> Signed-off-by: David Ahern 
> ---





Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-29 Thread Willem de Bruijn
On Tue, Aug 29, 2017 at 9:45 PM, Jason Wang  wrote:
>
>
> On 2017年08月30日 03:35, Willem de Bruijn wrote:
>>
>> On Fri, Aug 25, 2017 at 9:03 PM, Willem de Bruijn
>>  wrote:
>>>
>>> On Fri, Aug 25, 2017 at 7:32 PM, Michael S. Tsirkin 
>>> wrote:

 On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:
>>
>> We don't enable network watchdog on virtio but we could and maybe
>> should.
>
> Can you elaborate?

 The issue is that holding onto buffers for very long times makes
 guests
 think they are stuck. This is funamentally because from guest point
 of
 view this is a NIC, so it is supposed to transmit things out in
 a timely manner. If host backs the virtual NIC by something that is
 not
 a NIC, with traffic shaping etc introducing unbounded latencies,
 guest will be confused.
>>>
>>> That assumes that guests are fragile in this regard. A linux guest
>>> does not make such assumptions.
>>
>> Yes it does. Examples above:
>>  > > - a single slow flow can occupy the whole ring, you will
>> not
>>  > >   be able to make any new buffers available for the fast
>> flow
>
> Oh, right. Though those are due to vring_desc pool exhaustion
> rather than an upper bound on latency of any single packet.
>
> Limiting the number of zerocopy packets in flight to some fraction
> of the ring ensures that fast flows can always grab a slot.
> Running
> out of ubuf_info slots reverts to copy, so indirectly does this. But
> I read it correclty the zerocopy pool may be equal to or larger than
> the descriptor pool. Should we refine the zcopy_used test
>
>  (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
>
> to also return false if the number of outstanding ubuf_info is greater
> than, say, vq->num >> 1?


 We'll need to think about where to put the threshold, but I think it's
 a good idea.

 Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
 resources.

 In a sense it still means once you run out of slots zcopt gets disabled
 possibly permanently.

 Need to experiment with some numbers.
>>>
>>> I can take a stab with two flows, one delayed in a deep host qdisc
>>> queue. See how this change affects the other flow and also how
>>> sensitive that is to the chosen threshold value.
>>
>> Incomplete results at this stage, but I do see this correlation between
>> flows. It occurs even while not running out of zerocopy descriptors,
>> which I cannot yet explain.
>>
>> Running two threads in a guest, each with a udp socket, each
>> sending up to 100 datagrams, or until EAGAIN, every msec.
>>
>> Sender A sends 1B datagrams.
>> Sender B sends VHOST_GOODCOPY_LEN, which is enough
>> to trigger zcopy_used in vhost net.
>>
>> A local receive process on the host receives both flows. To avoid
>> a deep copy when looping the packet onto the receive path,
>> changed skb_orphan_frags_rx to always return false (gross hack).
>>
>> The flow with the larger packets is redirected through netem on ifb0:
>>
>>modprobe ifb
>>ip link set dev ifb0 up
>>tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit
>>
>>tc qdisc add dev tap0 ingress
>>tc filter add dev tap0 parent : protocol ip \
>>u32 match ip dport 8000 0x \
>>action mirred egress redirect dev ifb0
>>
>> For 10 second run, packet count with various ifb0 queue lengths $LIMIT:
>>
>> no filter
>>rx.A: ~840,000
>>rx.B: ~840,000
>
>
> Just to make sure I understand the case here. What did rx.B mean here? I
> thought all traffic sent by Sender B has been redirected to ifb0?

It has been, but the packet still arrives at the destination socket.
IFB is a special virtual device that applies traffic shaping and
then reinjects it back at the point it was intercept by mirred.

rx.B is indeed arrival rate at the receiver, similar to rx.A.

>>
>> limit 1
>>rx.A: ~500,000
>>rx.B: ~3100
>>ifb0: 3273 sent, 371141 dropped
>>
>> limit 100
>>rx.A: ~9000
>>rx.B: ~4200
>>ifb0: 4630 sent, 1491 dropped
>>
>> limit 1000
>>rx.A: ~6800
>>rx.B: ~4200
>>ifb0: 4651 sent, 0 dropped
>>
>> Sender B is always correctly rate limited to 1 MBps or less. With a
>> short queue, it ends up dropping a lot and sending even less.
>>
>> When a queue builds up for sender B, sender A throughput is strongly
>> correlated with queue length. With queue length 1, it can send almost
>> at unthrottled speed. But even at limit 100 its throughput is on the
>> same order as sender B.
>>
>> What is surprising to me is that this happens even though the number
>> of ubuf_info in use at limit 100 is around 100 at all times. In other
>> words,
>> it does not exhaust the pool.
>>
>> When 

Re: [PATCH v2 net-next 2/6] udp: Constify skb argument in lookup functions

2017-08-29 Thread Tom Herbert
On Tue, Aug 29, 2017 at 5:58 PM, David Miller  wrote:
> From: Tom Herbert 
> Date: Tue, 29 Aug 2017 16:27:07 -0700
>
>> For UDP socket lookup functions, and associateed functions that take an
>> skbuf as argument, declare the skb argument as constant.
>>
>> One caveat is that reuseport_select_sock can be called from the UDP
>> lookup functions with an skb argument. This function temporarily
>> modifies the skbuff data pointer (in bpf_run via a pull/push sequence).
>> To resolve compiler warning I added a local skbuf declaration that is
>> not const and assigned to the skb argument with an explicit cast.
>>
>> Signed-off-by: Tom Herbert 
>
> Please don't do this.
>
> If reuseport_select_sock() modifies anything in the SKB, especially
> skb->data, it infects the entire call chain.  So you can't mark it
> const in this family of calls.
>
reuseport_select_sock calls run_bpf that calls pskb_pull to
"temporarily advance data past protocol header" and it calls
bpf_prog_run_save_cb which takes non-constant skb argument. This is
the only instance in all the udp lookup functions where non-constant
is needed. It's logical that constant skbuf makes sense for socket
lookup-- I doubt any caller would expect the skbuf to be modified as a
side effect. It's also an implicit characteristic since
reuseport_select_sock may just clone the socket before calling BPF.

The problem is that all the flow dissector functions operate on const
skbs (again that's logical :-) ). So if we want to be able to call
lookup functions or even BPF to do flow dissection, then I think
something needs to change. I really don't want to unconsitify the flow
dissector functions. We could just always do the skb before calling
BPF, but I suppose that is a potential performance hit. Is there a
better way to resolve this?

Thanks,
Tom


Re: [PATCH v2 net-next 1/8] bpf: Add support for recursively running cgroup sock filters

2017-08-29 Thread Alexei Starovoitov
On Tue, Aug 29, 2017 at 07:03:43PM -0600, David Ahern wrote:
> On 8/28/17 10:11 PM, Alexei Starovoitov wrote:
> > 
> > Agree on the above, but you're mixing semantics of the new recurse
> > flag and implementation of it. Ex: we don't have to copy this flag
> > from prog->attr into cgroup. So this reset or non-reset discussion
> > only makes sense in the context of your current implementation.
> > We can implement the logic differently. Like don't copy that flag
> > at all and at attach time walk parent->parent->parent and see
> > what programs are attached. All of them should have prog->attr & 
> > recurse_bit set
> > In such implementation detach from 'b' is a nop from reset/non-reset
> > point of view. When socket creation in 'c' is invoked the program
> > 'c' is called first then the code keeps walking parents until root
> > invoking 'a' along the way.
> 
> So you are suggesting there is no recursive flag per cgroup? How do you
> know you need to walk cgroups? How do you know when to stop running
> programs?

you're talking about implementation, right?
My 'proposed' implemenation of walking from cgroup all the way to the root
is just an example. It's not efficient. More below...

> > I'm not saying it will be an efficient implementation. The point
> > is to discuss UAPI independent of implementation.
> > 
> >> ###
> >>
> >> Also, let's agree on this intention. Based on the new ground rule, I
> >> want to point out this example:
> >>
> >> If 'a' gets a program installed with no recurse flag set, ONLY processes
> >> in 'a' have the 'a' program run. Processes in groups 'b', 'c' and 'd'
> >> all stop at cgroup 'b' program.
> > 
> > I'm proposing that such situation should not be allowed to happen.
> > In a->b->c->d cgroup scenario if override+recurse prog attached to 'b'
> > then only the same override+recurse can be attached to c, d, a.
> > So at detach time there can be gaps (like only 'b' and 'd' have
> > override+recurse progs), but walking up until root from any point
> > will guarantee that only override+recurse programs are seen.
> > 
> 
> That seems very limiting to me. Seems like you are suggesting the entire
> cgroup tree is recursive or non-recursive, but never a mix.

Entire cgroup subtree. Yes. It's the simplest uapi I could think of.
Easy to understand and argue about and I think it's solving your use case.
It's also easily extendable. New combination and features won't break
the users. It feels you're in rush to get this stuff for this merge
window, therefore I want to agree on something that is simple,
non-controversial and extensible.
If you're not in rush (I'm not), we can come up with more flexible uapi.
For example: another way of thinking about your 'recursive' requirement
is to think that all 'program to be run' should be present as a link list
in a given cgroup. So no walking a chain of parents.
Instead of 'recursive' let's call this new flag 'multiprog'.
Now in a->b->c->d scenario. We can install 'multiprog' prog in 'b'.
The kernel will automatically propage it (like it does right now
with css_for_each_descendant_pre() loop) to 'c' and to 'd'.
Now we allow users to attach another 'multiprog' program to 'c'.
The kernel will maintain a link list of programs in every cgroup,
so there will be a link list of two programs in 'c' and 'd'
and invocation of the programs will be faster than walking
cgroup->parent->parent and checking some flags at every step,
since there will be less pointer dereferences and no flags to check.
Just invoke all programs in the current cgroup. Kernel took care
of ordering at the time of attach/detach.
I believe Andy proposed something like this back in Jan/Feb.



RE: Question about ip_defrag

2017-08-29 Thread liujian (CE)



Best Regards,
liujian


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Florian Westphal
> Sent: Tuesday, August 29, 2017 9:47 PM
> To: liujian (CE)
> Cc: Florian Westphal; Jesper Dangaard Brouer; netdev@vger.kernel.org;
> Wangkefeng (Kevin); weiyongjun (A)
> Subject: Re: Question about ip_defrag
> 
> liujian (CE)  wrote:
> 
> [ trimming cc list ]
> 
> > Now, I have not the real environment.
> > I use iperf generate fragment packets; and I always change NIC rx
> > irq's affinity cpu, to make sure frag_mem_limit reach to thresh.
> > my test machine, CPU num is 384.
> 
> Oh well, that explains it.
> 
> > > > +   if (frag_mem_limit(nf) > nf->low_thresh) {
> > > > inet_frag_schedule_worker(f);
> > > > +   update_frag_mem_limit(nf, SKB_TRUESIZE(1500) * 16);
> > > > +   }
> 
> You need to reduce this to a lower value.
> Your cpu count * batch_value needs to be less than low_thresh to avoid
> problems.
> 
> Wtih 384 cpus its close to 12 mbyte...
> 
> Perhaps do this:
> 
> update_frag_mem_limit(nf, 2 * 1024*1024 / NR_CPUS);
> 
> 
> However, I think its better to revert the percpu counter change and move back
> to a single atomic_t count.

Ok. 
Florian and Jesper, many thanks for this issue. 


Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-29 Thread Jason Wang



On 2017年08月30日 03:35, Willem de Bruijn wrote:

On Fri, Aug 25, 2017 at 9:03 PM, Willem de Bruijn
 wrote:

On Fri, Aug 25, 2017 at 7:32 PM, Michael S. Tsirkin  wrote:

On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:

We don't enable network watchdog on virtio but we could and maybe
should.

Can you elaborate?

The issue is that holding onto buffers for very long times makes guests
think they are stuck. This is funamentally because from guest point of
view this is a NIC, so it is supposed to transmit things out in
a timely manner. If host backs the virtual NIC by something that is not
a NIC, with traffic shaping etc introducing unbounded latencies,
guest will be confused.

That assumes that guests are fragile in this regard. A linux guest
does not make such assumptions.

Yes it does. Examples above:
 > > - a single slow flow can occupy the whole ring, you will not
 > >   be able to make any new buffers available for the fast flow

Oh, right. Though those are due to vring_desc pool exhaustion
rather than an upper bound on latency of any single packet.

Limiting the number of zerocopy packets in flight to some fraction
of the ring ensures that fast flows can always grab a slot.
Running
out of ubuf_info slots reverts to copy, so indirectly does this. But
I read it correclty the zerocopy pool may be equal to or larger than
the descriptor pool. Should we refine the zcopy_used test

 (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx

to also return false if the number of outstanding ubuf_info is greater
than, say, vq->num >> 1?


We'll need to think about where to put the threshold, but I think it's
a good idea.

Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
resources.

In a sense it still means once you run out of slots zcopt gets disabled 
possibly permanently.

Need to experiment with some numbers.

I can take a stab with two flows, one delayed in a deep host qdisc
queue. See how this change affects the other flow and also how
sensitive that is to the chosen threshold value.

Incomplete results at this stage, but I do see this correlation between
flows. It occurs even while not running out of zerocopy descriptors,
which I cannot yet explain.

Running two threads in a guest, each with a udp socket, each
sending up to 100 datagrams, or until EAGAIN, every msec.

Sender A sends 1B datagrams.
Sender B sends VHOST_GOODCOPY_LEN, which is enough
to trigger zcopy_used in vhost net.

A local receive process on the host receives both flows. To avoid
a deep copy when looping the packet onto the receive path,
changed skb_orphan_frags_rx to always return false (gross hack).

The flow with the larger packets is redirected through netem on ifb0:

   modprobe ifb
   ip link set dev ifb0 up
   tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit

   tc qdisc add dev tap0 ingress
   tc filter add dev tap0 parent : protocol ip \
   u32 match ip dport 8000 0x \
   action mirred egress redirect dev ifb0

For 10 second run, packet count with various ifb0 queue lengths $LIMIT:

no filter
   rx.A: ~840,000
   rx.B: ~840,000


Just to make sure I understand the case here. What did rx.B mean here? I 
thought all traffic sent by Sender B has been redirected to ifb0?




limit 1
   rx.A: ~500,000
   rx.B: ~3100
   ifb0: 3273 sent, 371141 dropped

limit 100
   rx.A: ~9000
   rx.B: ~4200
   ifb0: 4630 sent, 1491 dropped

limit 1000
   rx.A: ~6800
   rx.B: ~4200
   ifb0: 4651 sent, 0 dropped

Sender B is always correctly rate limited to 1 MBps or less. With a
short queue, it ends up dropping a lot and sending even less.

When a queue builds up for sender B, sender A throughput is strongly
correlated with queue length. With queue length 1, it can send almost
at unthrottled speed. But even at limit 100 its throughput is on the
same order as sender B.

What is surprising to me is that this happens even though the number
of ubuf_info in use at limit 100 is around 100 at all times. In other words,
it does not exhaust the pool.

When forcing zcopy_used to be false for all packets, this effect of
sender A throughput being correlated with sender B does not happen.

no filter
   rx.A: ~850,000
   rx.B: ~850,000

limit 100
   rx.A: ~850,000
   rx.B: ~4200
   ifb0: 4518 sent, 876182 dropped

Also relevant is that with zerocopy, the sender processes back off
and report the same count as the receiver. Without zerocopy,
both senders send at full speed, even if only 4200 packets from flow
B arrive at the receiver.

This is with the default virtio_net driver, so without napi-tx.


What kind of qdisc do you use in guest? I suspect we should use 
something which could do fair queueing (e.g sfq).




It appears that the zerocopy notifications are pausing the guest.
Will look at that now.


Another factor is the tx interrupt coalescing parameters of ifb0, maybe 
we should disable it during the test.


Thanks



By the way, I have 

Re: [PATCH net] sch_hhf: fix null pointer dereference on init failure

2017-08-29 Thread Cong Wang
On Tue, Aug 29, 2017 at 12:02 PM, Nikolay Aleksandrov
 wrote:
> First I did it with the check in the for () conditional, but this is more
> visible and explicit. Let me know if you'd like the shorter version. :-)

Or, if you want to make the patch size smaller, just check NULL
before for():

if (!q->hh_flows)
return;

for (...)

Up to you, I have no strong opinion here, slightly prefer a smaller
one for backport.


Re: [PATCH net-next 0/3 v10] Add support for rmnet driver

2017-08-29 Thread Subash Abhinov Kasiviswanathan

On 2017-08-29 19:12, David Miller wrote:

Sigh, I had to revert.

You only allow RMNET to take on the values 'y' and 'n'.

You must allow for it to be 'm' and modular as well.


Hi David

I'll fix this now.

Sorry about the cover letter. I'll explain it better in subsequent 
submission.

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project


Re: [PATCH net-next 0/3 v10] Add support for rmnet driver

2017-08-29 Thread David Miller

Sigh, I had to revert.

You only allow RMNET to take on the values 'y' and 'n'.

You must allow for it to be 'm' and modular as well.


Re: [PATCH net-next 0/3 v10] Add support for rmnet driver

2017-08-29 Thread David Miller
From: Subash Abhinov Kasiviswanathan 
Date: Tue, 29 Aug 2017 18:47:55 -0600

> I have updated the locking scheme as follows -

Series applied, but this is not how you write a header posting for a
patch set.

This posting is where you say at a high level what the patch series is
doing, how it is doing it, and why it is doing it that way.

You can explain what changes happened, and why, but that belongs
in the changelog at the end of this posting.  Here you've made
an explaination for one change the entire content of the text.

You not even saying what rmnet is, why we would want to add it to the
kernel, and what it's all about.  So now when someone tries to read
the merge commit that contains this text, they will have no context
about you and me talking about locking and they will thus ask
themselves "what is this person talking about here?  it's not
explaining the patch series at all"


Re: [PATCH v2 net-next 1/8] bpf: Add support for recursively running cgroup sock filters

2017-08-29 Thread David Ahern
On 8/28/17 10:11 PM, Alexei Starovoitov wrote:
> 
> Agree on the above, but you're mixing semantics of the new recurse
> flag and implementation of it. Ex: we don't have to copy this flag
> from prog->attr into cgroup. So this reset or non-reset discussion
> only makes sense in the context of your current implementation.
> We can implement the logic differently. Like don't copy that flag
> at all and at attach time walk parent->parent->parent and see
> what programs are attached. All of them should have prog->attr & recurse_bit 
> set
> In such implementation detach from 'b' is a nop from reset/non-reset
> point of view. When socket creation in 'c' is invoked the program
> 'c' is called first then the code keeps walking parents until root
> invoking 'a' along the way.

So you are suggesting there is no recursive flag per cgroup? How do you
know you need to walk cgroups? How do you know when to stop running
programs?

> I'm not saying it will be an efficient implementation. The point
> is to discuss UAPI independent of implementation.
> 
>> ###
>>
>> Also, let's agree on this intention. Based on the new ground rule, I
>> want to point out this example:
>>
>> If 'a' gets a program installed with no recurse flag set, ONLY processes
>> in 'a' have the 'a' program run. Processes in groups 'b', 'c' and 'd'
>> all stop at cgroup 'b' program.
> 
> I'm proposing that such situation should not be allowed to happen.
> In a->b->c->d cgroup scenario if override+recurse prog attached to 'b'
> then only the same override+recurse can be attached to c, d, a.
> So at detach time there can be gaps (like only 'b' and 'd' have
> override+recurse progs), but walking up until root from any point
> will guarantee that only override+recurse programs are seen.
> 

That seems very limiting to me. Seems like you are suggesting the entire
cgroup tree is recursive or non-recursive, but never a mix.


Re: [PATCH v2 net-next 3/6] flow_dissector: Add protocol specific flow dissection offload

2017-08-29 Thread David Miller
From: Tom Herbert 
Date: Tue, 29 Aug 2017 16:27:08 -0700

> +#define GOTO_BY_RESULT(ret) do { \
> + switch (ret) {  \
> + case FLOW_DISSECT_RET_OUT_GOOD: \
> + goto out_good;  \
> + case FLOW_DISSECT_RET_PROTO_AGAIN:  \
> + goto proto_again;   \
> + case FLOW_DISSECT_RET_IPPROTO_AGAIN:\
> + goto ip_proto_again;\
> + case FLOW_DISSECT_RET_OUT_BAD:  \
> + default:\
> + goto out_bad;   \
> + }   \
> +} while (0)
> +
> +#define GOTO_OR_CONT_BY_RESULT(ret) do { \
> + enum flow_dissect_ret __ret = (ret);\
> + \
> + if (__ret != FLOW_DISSECT_RET_CONTINUE) \
> + GOTO_BY_RESULT(__ret);  \
> +} while (0)

Please don't hide major control flow changes inside of a macro.  This
means returns and gotos.

It makes code impossible to audit.

Yes, this applies even if the macro has the word "GOTO" in it :-)


Re: [PATCH v2 net-next 2/6] udp: Constify skb argument in lookup functions

2017-08-29 Thread David Miller
From: Tom Herbert 
Date: Tue, 29 Aug 2017 16:27:07 -0700

> For UDP socket lookup functions, and associateed functions that take an
> skbuf as argument, declare the skb argument as constant.
> 
> One caveat is that reuseport_select_sock can be called from the UDP
> lookup functions with an skb argument. This function temporarily
> modifies the skbuff data pointer (in bpf_run via a pull/push sequence).
> To resolve compiler warning I added a local skbuf declaration that is
> not const and assigned to the skb argument with an explicit cast.
> 
> Signed-off-by: Tom Herbert 

Please don't do this.

If reuseport_select_sock() modifies anything in the SKB, especially
skb->data, it infects the entire call chain.  So you can't mark it
const in this family of calls.



[PATCH net-next 3/3 v10] drivers: net: ethernet: qualcomm: rmnet: Initial implementation

2017-08-29 Thread Subash Abhinov Kasiviswanathan
RmNet driver provides a transport agnostic MAP (multiplexing and
aggregation protocol) support in embedded module. Module provides
virtual network devices which can be attached to any IP-mode
physical device. This will be used to provide all MAP functionality
on future hardware in a single consistent location.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 Documentation/networking/rmnet.txt |  82 
 drivers/net/ethernet/qualcomm/Kconfig  |   2 +
 drivers/net/ethernet/qualcomm/Makefile |   2 +
 drivers/net/ethernet/qualcomm/rmnet/Kconfig|  12 +
 drivers/net/ethernet/qualcomm/rmnet/Makefile   |  10 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 419 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h |  56 +++
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   | 271 +
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.h   |  26 ++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h|  88 +
 .../ethernet/qualcomm/rmnet/rmnet_map_command.c| 107 ++
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   | 105 ++
 .../net/ethernet/qualcomm/rmnet/rmnet_private.h|  45 +++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 170 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h|  29 ++
 15 files changed, 1424 insertions(+)
 create mode 100644 Documentation/networking/rmnet.txt
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/Kconfig
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/Makefile
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_command.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h

diff --git a/Documentation/networking/rmnet.txt 
b/Documentation/networking/rmnet.txt
new file mode 100644
index 000..6b341ea
--- /dev/null
+++ b/Documentation/networking/rmnet.txt
@@ -0,0 +1,82 @@
+1. Introduction
+
+rmnet driver is used for supporting the Multiplexing and aggregation
+Protocol (MAP). This protocol is used by all recent chipsets using Qualcomm
+Technologies, Inc. modems.
+
+This driver can be used to register onto any physical network device in
+IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator.
+
+Multiplexing allows for creation of logical netdevices (rmnet devices) to
+handle multiple private data networks (PDN) like a default internet, tethering,
+multimedia messaging service (MMS) or IP media subsystem (IMS). Hardware sends
+packets with MAP headers to rmnet. Based on the multiplexer id, rmnet
+routes to the appropriate PDN after removing the MAP header.
+
+Aggregation is required to achieve high data rates. This involves hardware
+sending aggregated bunch of MAP frames. rmnet driver will de-aggregate
+these MAP frames and send them to appropriate PDN's.
+
+2. Packet format
+
+a. MAP packet (data / control)
+
+MAP header has the same endianness of the IP packet.
+
+Packet format -
+
+Bit 0 1   2-7  8 - 15   16 - 31
+Function   Command / Data   Reserved Pad   Multiplexer IDPayload length
+Bit32 - x
+Function Raw  Bytes
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Control packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Reserved bits are usually zeroed out and to be ignored by receiver.
+
+Padding is number of bytes to be added for 4 byte alignment if required by
+hardware.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+b. MAP packet (command specific)
+
+Bit 0 1   2-7  8 - 15   16 - 31
+Function   Command Reserved Pad   Multiplexer IDPayload length
+Bit  32 - 3940 - 4546 - 47   48 - 63
+Function   Command nameReserved   Command Type   Reserved
+Bit  64 - 95
+Function   Transaction ID
+Bit  96 - 127
+Function   Command data
+
+Command 1 indicates disabling flow while 2 is enabling flow
+
+Command types -
+0 for MAP command request
+1 is to acknowledge the receipt of a command
+2 is for unsupported commands
+3 is for error during processing of commands
+
+c. Aggregation
+
+Aggregation is multiple MAP packets (can be data or command) delivered 

[PATCH net-next 0/3 v10] Add support for rmnet driver

2017-08-29 Thread Subash Abhinov Kasiviswanathan
Hi David

I have updated the locking scheme as follows -

The shared resource which needs to be protected is realdev->rx_handler_data.

For the writer path, this is using rtnl_lock(). The writer paths are
rmnet_newlink(), rmnet_dellink() and rmnet_force_unassociate_device(). These
paths are already called with rtnl_lock() acquired in. There is also an
ASSERT_RTNL() to ensure that we are calling with rtnl acquired. For
dereference here, we will need to use rtnl_dereference(). Dev list writing
needs to happen with rtnl_lock() acquired for netdev_master_upper_dev_link().

For the reader path, the real_dev->rx_handler_data is called in the TX / RX
path. We only need rcu_read_lock() for these scenarios. In these cases,
the rcu_read_lock() is held in __dev_queue_xmit() and
netif_receive_skb_internal(), so readers need to use rcu_dereference_rtnl()
to get the relevant information. For dev list reading, we again acquire
rcu_read_lock() in rmnet_dellink() for netdev_master_upper_dev_get_rcu().
We also use unregister_netdevice_many() to free all rmnet devices in
rmnet_force_unassociate_device() so we dont lose the rtnl_lock() and free in
same context.

I have also added this as a comment in rmnet_config.c.

--
v1: Same as the RFC patch with some minor fixes for issues reported by
kbuild test robot.

v1->v2: Change datatypes and remove config IOCTL as mentioned by David.
Also fix checkpatch issues and remove some unused code.

v2->v3: Move location to drivers/net and rename to rmnet. Change the
userspace - netlink communication from custom netlink to rtnl_link_ops.
Refactor some code. Use a fixed config for ingress and egress.

v3->v4: Move location to drivers/net/ethernet/qualcomm/.
Fix comments from Stephen and Jiri -
Split the ether and arp type changes into seperate patches.
Remove debug and custom logging and switch to standard netdevice log.
Remove module parameters. Refactor and change some code style issues.

v4->v5: Rename some structs and variables. Move the initializer
before the for loop start. Put the arp type in correct sequence.

v5->v6: Fix comments from Dan -
Use the upper link API. As a result, remove all the refcounting logic.
Device refcount is explicitly held on real_dev on rx_handler
registration only. Modifiy the flow control struct. Remove the unused
ethernet mode handling.

v6->v7: Fix comments from David - Add newline to end of Makefile. Remove
inline from .c files. Move the module init/exit to rmnet config. Fix an
error reported by kbuild test robot for an unused file.

v7->v8: Use a smaller value for ETH_P_MAP as mentioned by David. Change
netdev_info to netdev_dbg as mentioned by Andew. Fix comments from
Stephen regarding netdev_priv and sparse related errors of using 0 as NULL

v8->v9: Fix comments from David - Remove the CFLAG rule. Change the way
rmnet devices are freed. Instead of using a workqueue to unregister devices
individually, go through the list and free all devices within the rtnl_lock().

v9->v10: Fix the locking scheme as mentioned by David. Change comment near
MAP type definition as mentioned by Dan. Refactor some code.

Subash Abhinov Kasiviswanathan (3):
  net: ether: Add support for multiplexing and aggregation type
  net: arp: Add support for raw IP device
  drivers: net: ethernet: qualcomm: rmnet: Initial implementation

 Documentation/networking/rmnet.txt |  82 
 drivers/net/ethernet/qualcomm/Kconfig  |   2 +
 drivers/net/ethernet/qualcomm/Makefile |   2 +
 drivers/net/ethernet/qualcomm/rmnet/Kconfig|  12 +
 drivers/net/ethernet/qualcomm/rmnet/Makefile   |  10 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 419 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h |  56 +++
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   | 271 +
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.h   |  26 ++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h|  88 +
 .../ethernet/qualcomm/rmnet/rmnet_map_command.c| 107 ++
 .../net/ethernet/qualcomm/rmnet/rmnet_map_data.c   | 105 ++
 .../net/ethernet/qualcomm/rmnet/rmnet_private.h|  45 +++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c| 170 +
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.h|  29 ++
 include/uapi/linux/if_arp.h|   1 +
 include/uapi/linux/if_ether.h  |   3 +
 17 files changed, 1428 insertions(+)
 create mode 100644 Documentation/networking/rmnet.txt
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/Kconfig
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/Makefile
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.h
 create mode 100644 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
 create mode 

[PATCH net-next 1/3 v10] net: ether: Add support for multiplexing and aggregation type

2017-08-29 Thread Subash Abhinov Kasiviswanathan
Define the Qualcomm multiplexing and aggregation (MAP) ether type 0x00F9.
This is needed for receiving data in the MAP protocol like RMNET. This is
not an officially registered ID.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 include/uapi/linux/if_ether.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/if_ether.h b/include/uapi/linux/if_ether.h
index 5bc9bfd..30526db 100644
--- a/include/uapi/linux/if_ether.h
+++ b/include/uapi/linux/if_ether.h
@@ -137,6 +137,9 @@
 #define ETH_P_IEEE802154 0x00F6/* IEEE802.15.4 frame   
*/
 #define ETH_P_CAIF 0x00F7  /* ST-Ericsson CAIF protocol*/
 #define ETH_P_XDSA 0x00F8  /* Multiplexed DSA protocol */
+#define ETH_P_MAP  0x00F9  /* Qualcomm multiplexing and
+* aggregation protocol
+*/
 
 /*
  * This is an Ethernet frame header.
-- 
1.9.1



[PATCH net-next 2/3 v10] net: arp: Add support for raw IP device

2017-08-29 Thread Subash Abhinov Kasiviswanathan
Define the raw IP type. This is needed for raw IP net devices
like rmnet.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 include/uapi/linux/if_arp.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/if_arp.h b/include/uapi/linux/if_arp.h
index cf73510..a2a6356 100644
--- a/include/uapi/linux/if_arp.h
+++ b/include/uapi/linux/if_arp.h
@@ -59,6 +59,7 @@
 #define ARPHRD_LAPB516 /* LAPB */
 #define ARPHRD_DDCMP517/* Digital's DDCMP protocol */
 #define ARPHRD_RAWHDLC 518 /* Raw HDLC */
+#define ARPHRD_RAWIP519/* Raw IP   */
 
 #define ARPHRD_TUNNEL  768 /* IPIP tunnel  */
 #define ARPHRD_TUNNEL6 769 /* IP6IP6 tunnel*/
-- 
1.9.1



Re: Fwd: DA850-evm MAC Address is random

2017-08-29 Thread Adam Ford
t; v4.13 itself.
>>>>>
>>>>> I will test it.  Do need to run some instruction or do something
>>>>> special in U-Boot to pass this in the proper place for the kernel to
>>>>> pull it?  Tony's patch reference showed
>>>>> command for fdt set, but I am not sure I fully understand the
>>>>> parameters that went along with that.
>>>>
>>>> Nope, just applying the patch and booting the with the new dtb should
>>>> result in the random mac address going away.
>>>
>>> Unfortunately, I am not seeing any change with the patch (at least
>>> with Kernel 4.12.9 from stable).
>>>
>>> netconsole: network logging started
>>> davinci_emac davinci_emac.1: incompatible machine/device type for
>>> reading mac address
>>> davinci_emac davinci_emac.1: using random MAC addr: ee:74:c3:3a:15:be
>>>
>>> Looking at the source for cpsw-common.c function ti_cm_get_macid()
>>> doesn't have a case for the  ti,davinci-dm6467-emac so I wonder if
>>> there might be more to it.
>>
>> Hmm, it did solve the issue for me when I tried latest -next. And
>> reverting the patch brought back the random mac address usage. Could you
>> try latest mainline or -next?
>>
>> Meanwhile let me see whats going on with the observations you have.
>
> I will try again with -next this afternoon and see what I can find.
> Can you tell me which U-Boot version you're using? I want to match
> your setup. I want to see if something is missing during the hand-off
> between the Bootloader and Linux.
>

I wonder if U-Boot isn't pushing something to Linux because it doesn't
appear to be running some of the da850 specific code even when I run
linux-next.  Can you tell me what verision of U-Boot you're using?
Other than using davinci_all_defconfig, did you change the
configuration at all?

[1.411107] netconsole: network logging started
[1.416237] davinci_emac davinci_emac.1: incompatible machine/device type for
 reading mac address
[1.424496] davinci_emac davinci_emac.1: using random MAC addr: be:e2:84:ed:3
c:87

I also confirmed the SPI-flash has the MAC programmed:

# hexdump /dev/mtd5ro
000 0800 04ee 8e32     
010        



Here is my full log:

[0.00] Booting Linux on physical CPU 0x0
[0.00] Linux version 4.13.0-rc7-next-20170829 (aford@ubuntu16) (gcc vers
ion 4.8.3 20140320 (prerelease) (Sourcery CodeBench Lite 2014.05-29)) #1 PREEMPT
 Tue Aug 29 19:11:06 CDT 2017
[0.00] CPU: ARM926EJ-S [41069265] revision 5 (ARMv5TEJ), cr=0005317f
[0.00] CPU: VIVT data cache, VIVT instruction cache
[0.00] OF: fdt: Machine model: DA850/AM1808/OMAP-L138 EVM
[0.00] Memory policy: Data cache writethrough
[0.00] DaVinci da850/omap-l138 variant 0x0
[0.00] On node 0 totalpages: 8192
[0.00] free_area_init_node: node 0, pgdat c05dfec8, node_mem_map c1fb900
0
[0.00]   DMA zone: 64 pages used for memmap
[0.00]   DMA zone: 0 pages reserved
[0.00]   DMA zone: 8192 pages, LIFO batch:0
[0.00] random: fast init done
[0.00] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[0.00] pcpu-alloc: [0] 0
[0.00] Built 1 zonelists, mobility grouping on.  Total pages: 8128
[0.00] Kernel command line: mem=32M console=ttyS2,115200n8 root=/dev/mmc
blk0p2 rw wait noinitrd
[0.00] PID hash table entries: 128 (order: -3, 512 bytes)
[0.00] Dentry cache hash table entries: 4096 (order: 2, 16384 bytes)
[0.00] Inode-cache hash table entries: 2048 (order: 1, 8192 bytes)
[0.00] Memory: 26196K/32768K available (4412K kernel code, 330K rwdata,
1012K rodata, 224K init, 146K bss, 6572K reserved, 0K cma-reserved)
[0.00] Virtual kernel memory layout:
   vector  : 0x - 0x1000   (   4 kB)
   fixmap  : 0xffc0 - 0xfff0   (3072 kB)
   vmalloc : 0xc280 - 0xff80   ( 976 MB)
   lowmem  : 0xc000 - 0xc200   (  32 MB)
   modules : 0xbf00 - 0xc000   (  16 MB)
 .text : 0xc0008000 - 0xc04574e8   (4414 kB)
 .init : 0xc0556000 - 0xc058e000   ( 224 kB)
 .data : 0xc058e000 - 0xc05e09e0   ( 331 kB)
  .bss : 0xc05e4ebc - 0xc06097b4   ( 147 kB)\x00c - 0xc06097
b4   ( 147 kB)
[0.00] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[0.00] Preemptible hierarchical RCU implementation.
[0.00]  Tasks RCU enabled.\x00d.
[0.00] NR_IRQS: 245
[0.00] clocksource: timer0_1: mask: 0x max_cycles: 0x, m
ax_idle_ns: 79635851949 ns
[0.00] sched_clock: 32 bits at 24MHz, resolution 4

Re: [RESEND PATCH] Allow passing tid or pid in SCM_CREDENTIALS without CAP_SYS_ADMIN

2017-08-29 Thread Eric W. Biederman
"prakash.sangappa"  writes:

> On 08/29/2017 04:02 PM, David Miller wrote:
>> From: Prakash Sangappa 
>> Date: Mon, 28 Aug 2017 17:12:20 -0700
>>
>>> Currently passing tid(gettid(2)) of a thread in struct ucred in
>>> SCM_CREDENTIALS message requires CAP_SYS_ADMIN capability otherwise
>>> it fails with EPERM error. Some applications deal with thread id
>>> of a thread(tid) and so it would help to allow tid in SCM_CREDENTIALS
>>> message. Basically, either tgid(pid of the process) or the tid of
>>> the thread should be allowed without the need for CAP_SYS_ADMIN capability.
>>>
>>> SCM_CREDENTIALS will be used to determine the global id of a process or
>>> a thread running inside a pid namespace.
>>>
>>> This patch adds necessary check to accept tid in SCM_CREDENTIALS
>>> struct ucred.
>>>
>>> Signed-off-by: Prakash Sangappa 
>> I'm pretty sure that by the descriptions in previous changes to this
>> function, what you are proposing is basically a minor form of PID
>> spoofing which we only want someone with CAP_SYS_ADMIN over the
>> PID namespace to be able to do.
>
> The fix is to allow passing tid of the calling thread itself not of any
> other thread or process. Curious why would this be considered
> as pid spoofing?
>
> This change would enable a thread in a multi threaded process, running
> inside a pid namespace to be identified by the recipient of the
> message easily.

I think a more practical problem is that change, changes what is being
passed in the SCM_CREDENTIALS from a pid of a process to a tid of a
thread.  That could be confusing and that confusion could be exploited.

It is definitely confusing because in some instances a value can be both
a tgid and a tid.

I definitely think this needs to be talked about in terms of changing
what is passed in that field and what the consequences could be.

I suspect you are ok.  As nothing allows passing a tid today.  But I
don't see any analysis on why passing a tid instead of a tgid will not
confuse the receiving application, and in such confusion introduce a
security hole.

Eric


Re: [RESEND PATCH] Allow passing tid or pid in SCM_CREDENTIALS without CAP_SYS_ADMIN

2017-08-29 Thread prakash.sangappa



On 08/29/2017 04:02 PM, David Miller wrote:

From: Prakash Sangappa 
Date: Mon, 28 Aug 2017 17:12:20 -0700


Currently passing tid(gettid(2)) of a thread in struct ucred in
SCM_CREDENTIALS message requires CAP_SYS_ADMIN capability otherwise
it fails with EPERM error. Some applications deal with thread id
of a thread(tid) and so it would help to allow tid in SCM_CREDENTIALS
message. Basically, either tgid(pid of the process) or the tid of
the thread should be allowed without the need for CAP_SYS_ADMIN capability.

SCM_CREDENTIALS will be used to determine the global id of a process or
a thread running inside a pid namespace.

This patch adds necessary check to accept tid in SCM_CREDENTIALS
struct ucred.

Signed-off-by: Prakash Sangappa 

I'm pretty sure that by the descriptions in previous changes to this
function, what you are proposing is basically a minor form of PID
spoofing which we only want someone with CAP_SYS_ADMIN over the
PID namespace to be able to do.


The fix is to allow passing tid of the calling thread itself not of any
other thread or process. Curious why would this be considered
as pid spoofing?

This change would enable a thread in a multi threaded process, running
inside a pid namespace to be identified by the recipient of the
message easily.




Sorry, I'm not applying this.




[PATCH v2 net-next 2/6] udp: Constify skb argument in lookup functions

2017-08-29 Thread Tom Herbert
For UDP socket lookup functions, and associateed functions that take an
skbuf as argument, declare the skb argument as constant.

One caveat is that reuseport_select_sock can be called from the UDP
lookup functions with an skb argument. This function temporarily
modifies the skbuff data pointer (in bpf_run via a pull/push sequence).
To resolve compiler warning I added a local skbuf declaration that is
not const and assigned to the skb argument with an explicit cast.

Signed-off-by: Tom Herbert 
---
 include/net/ip.h |  2 +-
 include/net/sock_reuseport.h |  2 +-
 include/net/udp.h| 11 ++-
 net/core/sock_reuseport.c|  5 +++--
 net/ipv4/udp.c   | 11 ++-
 net/ipv6/udp.c   | 10 +-
 6 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 9896f46cbbf1..8c0d84ffc659 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -79,7 +79,7 @@ struct ipcm_cookie {
 #define PKTINFO_SKB_CB(skb) ((struct in_pktinfo *)((skb)->cb))
 
 /* return enslaved device index if relevant */
-static inline int inet_sdif(struct sk_buff *skb)
+static inline int inet_sdif(const struct sk_buff *skb)
 {
 #if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
if (skb && ipv4_l3mdev_skb(IPCB(skb)->flags))
diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index aecd30308d50..d25352a848d9 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -20,7 +20,7 @@ extern int reuseport_add_sock(struct sock *sk, struct sock 
*sk2);
 extern void reuseport_detach_sock(struct sock *sk);
 extern struct sock *reuseport_select_sock(struct sock *sk,
  u32 hash,
- struct sk_buff *skb,
+ const struct sk_buff *skb,
  int hdr_len);
 extern struct bpf_prog *reuseport_attach_prog(struct sock *sk,
  struct bpf_prog *prog);
diff --git a/include/net/udp.h b/include/net/udp.h
index 4e5f23fec35e..f3d1de6f0983 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -167,7 +167,7 @@ static inline void udp_csum_pull_header(struct sk_buff *skb)
UDP_SKB_CB(skb)->cscov -= sizeof(struct udphdr);
 }
 
-typedef struct sock *(*udp_lookup_t)(struct sk_buff *skb, __be16 sport,
+typedef struct sock *(*udp_lookup_t)(const struct sk_buff *skb, __be16 sport,
 __be16 dport);
 
 struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *skb,
@@ -288,8 +288,9 @@ struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, 
__be16 sport,
 __be32 daddr, __be16 dport, int dif);
 struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
   __be32 daddr, __be16 dport, int dif, int sdif,
-  struct udp_table *tbl, struct sk_buff *skb);
-struct sock *udp4_lib_lookup_skb(struct sk_buff *skb,
+  struct udp_table *tbl,
+  const struct sk_buff *skb);
+struct sock *udp4_lib_lookup_skb(const struct sk_buff *skb,
 __be16 sport, __be16 dport);
 struct sock *udp6_lib_lookup(struct net *net,
 const struct in6_addr *saddr, __be16 sport,
@@ -299,8 +300,8 @@ struct sock *__udp6_lib_lookup(struct net *net,
   const struct in6_addr *saddr, __be16 sport,
   const struct in6_addr *daddr, __be16 dport,
   int dif, int sdif, struct udp_table *tbl,
-  struct sk_buff *skb);
-struct sock *udp6_lib_lookup_skb(struct sk_buff *skb,
+  const struct sk_buff *skb);
+struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb,
 __be16 sport, __be16 dport);
 
 /* UDP uses skb->dev_scratch to cache as much information as possible and avoid
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index eed1ebf7f29d..a17f13b33189 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -164,9 +164,10 @@ void reuseport_detach_sock(struct sock *sk)
 EXPORT_SYMBOL(reuseport_detach_sock);
 
 static struct sock *run_bpf(struct sock_reuseport *reuse, u16 socks,
-   struct bpf_prog *prog, struct sk_buff *skb,
+   struct bpf_prog *prog, const struct sk_buff *_skb,
int hdr_len)
 {
+   struct sk_buff *skb = (struct sk_buff *)_skb; /* Override const */
struct sk_buff *nskb = NULL;
u32 index;
 
@@ -205,7 +206,7 @@ static struct sock *run_bpf(struct sock_reuseport *reuse, 
u16 socks,
  */
 struct sock *reuseport_select_sock(struct sock *sk,
   u32 hash,
- 

[PATCH v2 net-next 4/6] udp: flow dissector offload

2017-08-29 Thread Tom Herbert
Add support to perform UDP specific flow dissection. This is
primarily intended for dissecting encapsulated packets in UDP
encapsulation.

This patch adds a flow_dissect offload for UDP4 and UDP6. The backend
function performs a socket lookup and calls the flow_dissect function
if a socket is found.

Signed-off-by: Tom Herbert 
---
 include/linux/udp.h  |  8 
 include/net/udp.h|  8 
 include/net/udp_tunnel.h |  8 
 net/ipv4/udp_offload.c   | 45 +
 net/ipv4/udp_tunnel.c|  1 +
 net/ipv6/udp_offload.c   | 13 +
 6 files changed, 83 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index eaea63bc79bb..2e90b189ef6a 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -79,6 +79,14 @@ struct udp_sock {
int (*gro_complete)(struct sock *sk,
struct sk_buff *skb,
int nhoff);
+   /* Flow dissector function for a UDP socket */
+   enum flow_dissect_ret (*flow_dissect)(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
 
/* udp_recvmsg try to use this before splicing sk_receive_queue */
struct sk_buff_head reader_queue cacheline_aligned_in_smp;
diff --git a/include/net/udp.h b/include/net/udp.h
index f3d1de6f0983..499e4faf8b14 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -174,6 +174,14 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, 
struct sk_buff *skb,
 struct udphdr *uh, udp_lookup_t lookup);
 int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup);
 
+enum flow_dissect_ret udp_flow_dissect(const struct sk_buff *skb,
+   udp_lookup_t lookup,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
+
 static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 {
struct udphdr *uh;
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index 10cce0dd4450..b7102e0f41a9 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -69,6 +69,13 @@ typedef struct sk_buff **(*udp_tunnel_gro_receive_t)(struct 
sock *sk,
 struct sk_buff *skb);
 typedef int (*udp_tunnel_gro_complete_t)(struct sock *sk, struct sk_buff *skb,
 int nhoff);
+typedef enum flow_dissect_ret (*udp_tunnel_flow_dissect_t)(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
 
 struct udp_tunnel_sock_cfg {
void *sk_user_data; /* user data used by encap_rcv call back */
@@ -78,6 +85,7 @@ struct udp_tunnel_sock_cfg {
udp_tunnel_encap_destroy_t encap_destroy;
udp_tunnel_gro_receive_t gro_receive;
udp_tunnel_gro_complete_t gro_complete;
+   udp_tunnel_flow_dissect_t flow_dissect;
 };
 
 /* Setup the given (UDP) sock to receive UDP encapsulated packets */
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 97658bfc1b58..7f0a7ed4a6f7 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -328,11 +328,56 @@ static int udp4_gro_complete(struct sk_buff *skb, int 
nhoff)
return udp_gro_complete(skb, nhoff, udp4_lib_lookup_skb);
 }
 
+enum flow_dissect_ret udp_flow_dissect(const struct sk_buff *skb,
+   udp_lookup_t lookup,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+   struct udphdr *uh, _uh;
+   struct sock *sk;
+
+   uh = __skb_header_pointer(skb, *p_nhoff, sizeof(_uh), data,
+ *p_hlen, &_uh);
+   if (!uh)
+   return 

[PATCH v2 net-next 6/6] vxlan: support flow dissect

2017-08-29 Thread Tom Herbert
Populate offload flow_dissect callback appropriately for VXLAN and
VXLAN-GPE.

Signed-off-by: Tom Herbert 
---
 drivers/net/vxlan.c | 50 ++
 1 file changed, 50 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ae3a1da703c2..41e50de40af4 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1336,6 +1336,55 @@ static bool vxlan_ecn_decapsulate(struct vxlan_sock *vs, 
void *oiph,
return err <= 1;
 }
 
+static enum flow_dissect_ret vxlan_flow_dissect(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   __be16 protocol = htons(ETH_P_TEB);
+   struct vxlanhdr *vhdr, _vhdr;
+   struct vxlan_sock *vs;
+
+   vhdr = __skb_header_pointer(skb, *p_nhoff + sizeof(struct udphdr),
+   sizeof(_vhdr), data, *p_hlen, &_vhdr);
+   if (!vhdr)
+   return FLOW_DISSECT_RET_OUT_BAD;
+
+   vs = rcu_dereference_sk_user_data(sk);
+   if (!vs)
+   return FLOW_DISSECT_RET_OUT_BAD;
+
+   if (vs->flags & VXLAN_F_GPE) {
+   struct vxlanhdr_gpe *gpe = (struct vxlanhdr_gpe *)vhdr;
+
+   /* Need to have Next Protocol set for interfaces in GPE mode. */
+   if (gpe->version != 0 || !gpe->np_applied || gpe->oam_flag)
+   return FLOW_DISSECT_RET_CONTINUE;
+
+   switch (gpe->next_protocol) {
+   case VXLAN_GPE_NP_IPV4:
+   protocol = htons(ETH_P_IP);
+   break;
+   case VXLAN_GPE_NP_IPV6:
+   protocol = htons(ETH_P_IPV6);
+   break;
+   case VXLAN_GPE_NP_ETHERNET:
+   protocol = htons(ETH_P_TEB);
+   break;
+   default:
+   return FLOW_DISSECT_RET_CONTINUE;
+   }
+   }
+
+   *p_nhoff += sizeof(struct udphdr) + sizeof(_vhdr);
+   *p_proto = protocol;
+
+   return FLOW_DISSECT_RET_PROTO_AGAIN;
+}
+
 /* Callback from net/ipv4/udp.c to receive packets */
 static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
 {
@@ -2864,6 +2913,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net 
*net, bool ipv6,
tunnel_cfg.encap_destroy = NULL;
tunnel_cfg.gro_receive = vxlan_gro_receive;
tunnel_cfg.gro_complete = vxlan_gro_complete;
+   tunnel_cfg.flow_dissect = vxlan_flow_dissect;
 
setup_udp_tunnel_sock(net, sock, _cfg);
 
-- 
2.11.0



[PATCH v2 net-next 5/6] fou: Support flow dissection

2017-08-29 Thread Tom Herbert
Populate offload flow_dissect callabck appropriately for fou and gue.

Signed-off-by: Tom Herbert 
---
 net/ipv4/fou.c | 63 ++
 1 file changed, 63 insertions(+)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 1540db65241a..a831dd49fb28 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -282,6 +282,20 @@ static int fou_gro_complete(struct sock *sk, struct 
sk_buff *skb,
return err;
 }
 
+static enum flow_dissect_ret fou_flow_dissect(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   *p_ip_proto = fou_from_sock(sk)->protocol;
+   *p_nhoff += sizeof(struct udphdr);
+
+   return FLOW_DISSECT_RET_IPPROTO_AGAIN;
+}
+
 static struct guehdr *gue_gro_remcsum(struct sk_buff *skb, unsigned int off,
  struct guehdr *guehdr, void *data,
  size_t hdrlen, struct gro_remcsum *grc,
@@ -500,6 +514,53 @@ static int gue_gro_complete(struct sock *sk, struct 
sk_buff *skb, int nhoff)
return err;
 }
 
+static enum flow_dissect_ret gue_flow_dissect(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   struct guehdr *guehdr, _guehdr;
+
+   guehdr = __skb_header_pointer(skb, *p_nhoff + sizeof(struct udphdr),
+ sizeof(_guehdr), data, *p_hlen, &_guehdr);
+   if (!guehdr)
+   return FLOW_DISSECT_RET_OUT_BAD;
+
+   switch (guehdr->version) {
+   case 0:
+   if (unlikely(guehdr->control))
+   return FLOW_DISSECT_RET_CONTINUE;
+
+   *p_ip_proto = guehdr->proto_ctype;
+   *p_nhoff += sizeof(struct udphdr) +
+   sizeof(*guehdr) + (guehdr->hlen << 2);
+
+   break;
+   case 1:
+   switch (((struct iphdr *)guehdr)->version) {
+   case 4:
+   *p_ip_proto = IPPROTO_IPIP;
+   break;
+   case 6:
+   *p_ip_proto = IPPROTO_IPV6;
+   break;
+   default:
+   return FLOW_DISSECT_RET_CONTINUE;
+   }
+
+   *p_nhoff += sizeof(struct udphdr);
+
+   break;
+   default:
+   return FLOW_DISSECT_RET_CONTINUE;
+   }
+
+   return FLOW_DISSECT_RET_IPPROTO_AGAIN;
+}
+
 static int fou_add_to_port_list(struct net *net, struct fou *fou)
 {
struct fou_net *fn = net_generic(net, fou_net_id);
@@ -570,12 +631,14 @@ static int fou_create(struct net *net, struct fou_cfg 
*cfg,
tunnel_cfg.encap_rcv = fou_udp_recv;
tunnel_cfg.gro_receive = fou_gro_receive;
tunnel_cfg.gro_complete = fou_gro_complete;
+   tunnel_cfg.flow_dissect = fou_flow_dissect;
fou->protocol = cfg->protocol;
break;
case FOU_ENCAP_GUE:
tunnel_cfg.encap_rcv = gue_udp_recv;
tunnel_cfg.gro_receive = gue_gro_receive;
tunnel_cfg.gro_complete = gue_gro_complete;
+   tunnel_cfg.flow_dissect = gue_flow_dissect;
break;
default:
err = -EINVAL;
-- 
2.11.0



[PATCH v2 net-next 3/6] flow_dissector: Add protocol specific flow dissection offload

2017-08-29 Thread Tom Herbert
Add offload capability for performing protocol specific flow dissection
(either by EtherType or IP protocol).

Specifically:

- Add flow_dissect to offload callbacks
- Move flow_dissect_ret enum to flow_dissector.h, cleanup names and add a
  couple of values
- Create GOTO_BY_RESULT macro to use in the main flow dissector switch to
  simplify handling of functions that return flow_dissect_ret enum
- In __skb_flow_dissect, add default case for switch(proto) as well as
  switch(ip_proto) that looks up and calls protocol specific flow
  dissection

Signed-off-by: Tom Herbert 
---
 include/linux/netdevice.h|   7 +++
 include/net/flow_dissector.h |   9 +++
 net/core/dev.c   |  14 +
 net/core/flow_dissector.c| 132 +++
 net/ipv4/route.c |   4 +-
 5 files changed, 128 insertions(+), 38 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c5475b37a631..90ccb434e127 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2208,6 +2208,12 @@ struct offload_callbacks {
struct sk_buff  **(*gro_receive)(struct sk_buff **head,
 struct sk_buff *skb);
int (*gro_complete)(struct sk_buff *skb, int nhoff);
+   enum flow_dissect_ret (*flow_dissect)(const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
 };
 
 struct packet_offload {
@@ -3253,6 +3259,7 @@ struct sk_buff *napi_get_frags(struct napi_struct *napi);
 gro_result_t napi_gro_frags(struct napi_struct *napi);
 struct packet_offload *gro_find_receive_by_type(__be16 type);
 struct packet_offload *gro_find_complete_by_type(__be16 type);
+struct packet_offload *flow_dissect_find_by_type(__be16 type);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index e2663e900b0a..ad75bbfd1c9c 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -19,6 +19,14 @@ struct flow_dissector_key_control {
 #define FLOW_DIS_FIRST_FRAGBIT(1)
 #define FLOW_DIS_ENCAPSULATION BIT(2)
 
+enum flow_dissect_ret {
+   FLOW_DISSECT_RET_OUT_GOOD,
+   FLOW_DISSECT_RET_OUT_BAD,
+   FLOW_DISSECT_RET_PROTO_AGAIN,
+   FLOW_DISSECT_RET_IPPROTO_AGAIN,
+   FLOW_DISSECT_RET_CONTINUE,
+};
+
 /**
  * struct flow_dissector_key_basic:
  * @thoff: Transport header offset
@@ -205,6 +213,7 @@ enum flow_dissector_key_id {
 #define FLOW_DISSECTOR_F_STOP_AT_L3BIT(1)
 #define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABELBIT(2)
 #define FLOW_DISSECTOR_F_STOP_AT_ENCAP BIT(3)
+#define FLOW_DISSECTOR_F_STOP_AT_L4BIT(4)
 
 struct flow_dissector_key {
enum flow_dissector_key_id key_id;
diff --git a/net/core/dev.c b/net/core/dev.c
index 270b54754821..22ea8daa930c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4860,6 +4860,20 @@ struct packet_offload *gro_find_receive_by_type(__be16 
type)
 }
 EXPORT_SYMBOL(gro_find_receive_by_type);
 
+struct packet_offload *flow_dissect_find_by_type(__be16 type)
+{
+   struct list_head *offload_head = _base;
+   struct packet_offload *ptype;
+
+   list_for_each_entry_rcu(ptype, offload_head, list) {
+   if (ptype->type != type || !ptype->callbacks.flow_dissect)
+   continue;
+   return ptype;
+   }
+   return NULL;
+}
+EXPORT_SYMBOL(flow_dissect_find_by_type);
+
 struct packet_offload *gro_find_complete_by_type(__be16 type)
 {
struct list_head *offload_head = _base;
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 12302acdb073..6a2cf240069a 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -115,12 +116,6 @@ __be32 __skb_flow_get_ports(const struct sk_buff *skb, int 
thoff, u8 ip_proto,
 }
 EXPORT_SYMBOL(__skb_flow_get_ports);
 
-enum flow_dissect_ret {
-   FLOW_DISSECT_RET_OUT_GOOD,
-   FLOW_DISSECT_RET_OUT_BAD,
-   FLOW_DISSECT_RET_OUT_PROTO_AGAIN,
-};
-
 static enum flow_dissect_ret
 __skb_flow_dissect_mpls(const struct sk_buff *skb,
struct flow_dissector *flow_dissector,
@@ -322,7 +317,7 @@ __skb_flow_dissect_gre(const struct sk_buff *skb,
if (flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP)
return FLOW_DISSECT_RET_OUT_GOOD;
 
-   return FLOW_DISSECT_RET_OUT_PROTO_AGAIN;
+   return FLOW_DISSECT_RET_PROTO_AGAIN;
 }
 
 static void
@@ -383,6 +378,27 @@ __skb_flow_dissect_ipv6(const struct sk_buff *skb,
key_ip->ttl = 

[PATCH v2 net-next 1/6] flow_dissector: Move ETH_P_TEB processing to main switch

2017-08-29 Thread Tom Herbert
Support for processing TEB is currently in GRE flow dissection as a
special case. This can be moved to be a case the main proto switch in
__skb_flow_dissect.

Signed-off-by: Tom Herbert 
---
 net/core/flow_dissector.c | 44 +++-
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index e2eaa1ff948d..12302acdb073 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -288,27 +288,8 @@ __skb_flow_dissect_gre(const struct sk_buff *skb,
if (hdr->flags & GRE_SEQ)
offset += sizeof(((struct pptp_gre_header *) 0)->seq);
 
-   if (gre_ver == 0) {
-   if (*p_proto == htons(ETH_P_TEB)) {
-   const struct ethhdr *eth;
-   struct ethhdr _eth;
-
-   eth = __skb_header_pointer(skb, *p_nhoff + offset,
-  sizeof(_eth),
-  data, *p_hlen, &_eth);
-   if (!eth)
-   return FLOW_DISSECT_RET_OUT_BAD;
-   *p_proto = eth->h_proto;
-   offset += sizeof(*eth);
-
-   /* Cap headers that we access via pointers at the
-* end of the Ethernet header as our maximum alignment
-* at that point is only 2 bytes.
-*/
-   if (NET_IP_ALIGN)
-   *p_hlen = *p_nhoff + offset;
-   }
-   } else { /* version 1, must be PPTP */
+   /* version 1, must be PPTP */
+   if (gre_ver == 1) {
u8 _ppp_hdr[PPP_HDRLEN];
u8 *ppp_hdr;
 
@@ -573,6 +554,27 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 
break;
}
+   case htons(ETH_P_TEB): {
+   const struct ethhdr *eth;
+   struct ethhdr _eth;
+
+   eth = __skb_header_pointer(skb, nhoff, sizeof(_eth),
+  data, hlen, &_eth);
+   if (!eth)
+   goto out_bad;
+
+   proto = eth->h_proto;
+   nhoff += sizeof(*eth);
+
+   /* Cap headers that we access via pointers at the
+* end of the Ethernet header as our maximum alignment
+* at that point is only 2 bytes.
+*/
+   if (NET_IP_ALIGN)
+   hlen = nhoff;
+
+   goto proto_again;
+   }
case htons(ETH_P_8021AD):
case htons(ETH_P_8021Q): {
const struct vlan_hdr *vlan;
-- 
2.11.0



[PATCH v2 net-next 0/6] flow_dissector: Protocol specific flow dissector offload

2017-08-29 Thread Tom Herbert
This patch set adds a new offload type to perform flow dissection for
specific protocols (either by EtherType or by IP protocol). This is
primary useful to crack open UDP encapsulations (like VXLAN, GUE) for
the purposes of parsing the encapsulated packet.

Items in this patch set:
- Constify skb argument to UDP lookup functions
- Create new protocol case in __skb_dissect for ETH_P_TEB. This is based
  on the code in the GRE dissect function and the special handling in
  GRE can now be removed (it sets protocol to ETH_P_TEB and returns so
  goto proto_again is done)
- Add infrastructure for protocol specific flow dissection offload
- Add infrastructure to perform UDP flow dissection. Uses same model of
  GRO where a flow_dissect callback can be associated with a UDP
  socket
- Use the infrastructure to support flow dissection of VXLAN and GUE

Tested:

Forced RPS to call flow dissection for VXLAN, FOU, and GUE. Observed
that inner packet was being properly dissected.

v2: Add signed off

Tom Herbert (6):
  flow_dissector: Move ETH_P_TEB processing to main switch
  udp: Constify skb argument in lookup functions
  flow_dissector: Add protocol specific flow dissection offload
  udp: flow dissector offload
  fou: Support flow dissection
  vxlan: support flow dissect

 drivers/net/vxlan.c  |  50 
 include/linux/netdevice.h|   7 ++
 include/linux/udp.h  |   8 ++
 include/net/flow_dissector.h |   9 +++
 include/net/ip.h |   2 +-
 include/net/sock_reuseport.h |   2 +-
 include/net/udp.h|  19 +++--
 include/net/udp_tunnel.h |   8 ++
 net/core/dev.c   |  14 
 net/core/flow_dissector.c| 176 +--
 net/core/sock_reuseport.c|   5 +-
 net/ipv4/fou.c   |  63 
 net/ipv4/route.c |   4 +-
 net/ipv4/udp.c   |  11 +--
 net/ipv4/udp_offload.c   |  45 +++
 net/ipv4/udp_tunnel.c|   1 +
 net/ipv6/udp.c   |  10 +--
 net/ipv6/udp_offload.c   |  13 
 18 files changed, 369 insertions(+), 78 deletions(-)

-- 
2.11.0



Re: [PATCH net-next] neigh: increase queue_len_bytes to match wmem_default

2017-08-29 Thread Eric Dumazet
On Tue, 2017-08-29 at 15:16 -0700, Eric Dumazet wrote:
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index
> 568ccfd6dd371d88136ffabe5cfcc36f099786b6..7616cd76f6f6a62f395da897baef2c66c0098193
>  100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6086,9 +6086,9 @@ int tcp_conn_request(struct request_sock_ops
> *rsk_ops,
> struct tcp_sock *tp = tcp_sk(sk);
> struct net *net = sock_net(sk);
> struct sock *fastopen_sk = NULL;
> -   struct dst_entry *dst = NULL;
> struct request_sock *req;
> bool want_cookie = false;
> +   struct dst_entry *dst;
> struct flowi fl;
>  
> /* TW buckets are converted to open req

This part was meant to belong to a separate patch :/

No big deal, this was also one of your suggestion David.




Re: [PATCH net-next] neigh: increase queue_len_bytes to match wmem_default

2017-08-29 Thread David Miller
From: Eric Dumazet 
Date: Tue, 29 Aug 2017 16:15:28 -0700

> On Tue, 2017-08-29 at 15:16 -0700, Eric Dumazet wrote:
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index
>> 568ccfd6dd371d88136ffabe5cfcc36f099786b6..7616cd76f6f6a62f395da897baef2c66c0098193
>>  100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -6086,9 +6086,9 @@ int tcp_conn_request(struct request_sock_ops
>> *rsk_ops,
>> struct tcp_sock *tp = tcp_sk(sk);
>> struct net *net = sock_net(sk);
>> struct sock *fastopen_sk = NULL;
>> -   struct dst_entry *dst = NULL;
>> struct request_sock *req;
>> bool want_cookie = false;
>> +   struct dst_entry *dst;
>> struct flowi fl;
>>  
>> /* TW buckets are converted to open req
> 
> This part was meant to belong to a separate patch :/
> 
> No big deal, this was also one of your suggestion David.

Yeah, no big deal.  But thanks for pointing it out.


Re: [PATCH] drivers: net: xgene: Correct probe sequence handling

2017-08-29 Thread David Miller
From: Iyappan Subramanian 
Date: Tue, 29 Aug 2017 15:43:12 -0700

> From: Quan Nguyen 
> 
> The phy is connected at early stage of probe but not properly
> disconnected if error occurs.  This patch fixes the issue.
> 
> Also changing the return type of xgene_enet_check_phy_handle(),
> since this function always returns success.
> 
> Signed-off-by: Quan Nguyen 
> Signed-off-by: Iyappan Subramanian 

Applied, thank you.


Re: [PATCH net] sch_htb: fix crash on init failure

2017-08-29 Thread Nikolay Aleksandrov
On 30.08.2017 02:06, David Miller wrote:
> 
> I expect that you will resubmit all of these similar fixes as a patch
> series after you have sorted everything out.
> 
> Correct?
> 

Yes, I will. There are a few more places that need fixing, I'll resubmit
them all as a set.




Re: [PATCH net-next] neigh: increase queue_len_bytes to match wmem_default

2017-08-29 Thread David Miller
From: Eric Dumazet 
Date: Tue, 29 Aug 2017 15:16:01 -0700

> From: Eric Dumazet 
> 
> Florian reported UDP xmit drops that could be root caused to the
> too small neigh limit.
> 
> Current limit is 64 KB, meaning that even a single UDP socket would hit
> it, since its default sk_sndbuf comes from net.core.wmem_default
> (~212992 bytes on 64bit arches).
> 
> Once ARP/ND resolution is in progress, we should allow a little more
> packets to be queued, at least for one producer.
> 
> Once neigh arp_queue is filled, a rogue socket should hit its sk_sndbuf
> limit and either block in sendmsg() or return -EAGAIN.
> 
> Signed-off-by: Eric Dumazet 
> Reported-by: Florian Fainelli 

Applied, thanks for following up on this.


Re: [PATCH net-next v2] net: bcmgenet: Use correct I/O accessors

2017-08-29 Thread David Miller
From: Florian Fainelli 
Date: Tue, 29 Aug 2017 12:25:31 -0700

> The GENET driver currently uses __raw_{read,write}l which means
> native I/O endian. This works correctly for an ARM LE kernel (default)
> but fails miserably on an ARM BE (BE8) kernel where registers are kept
> little endian, so replace uses with {read,write}l_relaxed here which is
> what we want because this is all performance sensitive code.
> 
> Signed-off-by: Florian Fainelli 

Applied.


Re: [PATCH] net: remove dmaengine.h inclusion from netdevice.h

2017-08-29 Thread David Miller
From: Dave Jiang 
Date: Tue, 29 Aug 2017 13:17:51 -0700

> Since the removal of NET_DMA, dmaengine.h header file shouldn't be needed
> by netdevice.h anymore.
> 
> Signed-off-by: Dave Jiang 

Applied to net-next, but it would have been really great for you to
have provided a proper "Fixes: " tag referencing the NET_DMA
removal change.

Thanks.


Re: [PATCH V2 net-next] liquidio: show NIC's U-Boot version in a dev_info() message

2017-08-29 Thread David Miller
From: Felix Manlunas 
Date: Tue, 29 Aug 2017 12:19:57 -0700

> From: Weilin Chang 
> 
> Signed-off-by: Weilin Chang 
> Signed-off-by: Felix Manlunas 
> ---
> Patch Change Log:
>   V1 -> V2:
> * Move octeon_get_uboot_version() to a proper place to avoid forward
>   declaration.
> * Remove complicated for-loops that search for substrings; replace them
>   with calls to strstr().
> * Don't add unnecessary fields to struct octeon_device.

Applied.


Re: [PATCH net] nfp: double free on error in probe

2017-08-29 Thread David Miller
From: Dan Carpenter 
Date: Tue, 29 Aug 2017 22:15:16 +0300

> Both the nfp_net_pf_app_start() and the nfp_net_pci_probe() functions
> call nfp_net_pf_app_stop_ctrl(pf) so there is a double free.  The free
> should be done from the probe function because it's allocated there so
> I have removed the call from nfp_net_pf_app_start().
> 
> Fixes: 02082701b974 ("nfp: create control vNICs and wire up rx/tx")
> Signed-off-by: Dan Carpenter 

Applied.


Re: [PATCH net] sch_htb: fix crash on init failure

2017-08-29 Thread David Miller

I expect that you will resubmit all of these similar fixes as a patch
series after you have sorted everything out.

Correct?


Re: [PATCH] net: dsa: make some structures const

2017-08-29 Thread David Miller
From: Bhumika Goyal 
Date: Tue, 29 Aug 2017 22:17:52 +0530

> Make these const as they are not modified anywhere.
> 
> Signed-off-by: Bhumika Goyal 

Applied.


Re: [PATCH ] net: frag: print frag_mem_limit value in sockstat proc file

2017-08-29 Thread David Miller
From: 
Date: Tue, 29 Aug 2017 21:05:34 +0800

> From: liujian 
> 
> From 6d7b857d5( net: use lib/percpu_counter API for fragmentation mem
> accounting),
> frag_mem_limit and sum_frag_mem_limit have different value if there are
> multiple NIC RX CPU.
> Print frag_mem_limit value, then we can get more debug info.
> 
> Signed-off-by: liujian 

Sorry, I don't think it is justified to change the deprecated procfs
output just for this.

Use tracepoints and other similar facilities to debug situations
like this if you like.


Re: [RESEND PATCH] Allow passing tid or pid in SCM_CREDENTIALS without CAP_SYS_ADMIN

2017-08-29 Thread David Miller
From: Prakash Sangappa 
Date: Mon, 28 Aug 2017 17:12:20 -0700

> Currently passing tid(gettid(2)) of a thread in struct ucred in
> SCM_CREDENTIALS message requires CAP_SYS_ADMIN capability otherwise
> it fails with EPERM error. Some applications deal with thread id
> of a thread(tid) and so it would help to allow tid in SCM_CREDENTIALS
> message. Basically, either tgid(pid of the process) or the tid of
> the thread should be allowed without the need for CAP_SYS_ADMIN capability.
> 
> SCM_CREDENTIALS will be used to determine the global id of a process or
> a thread running inside a pid namespace.
> 
> This patch adds necessary check to accept tid in SCM_CREDENTIALS
> struct ucred.
> 
> Signed-off-by: Prakash Sangappa 

I'm pretty sure that by the descriptions in previous changes to this
function, what you are proposing is basically a minor form of PID
spoofing which we only want someone with CAP_SYS_ADMIN over the
PID namespace to be able to do.

Sorry, I'm not applying this.


Re: [PATCH net-next v3 2/2] tcp_diag: report TCP MD5 signing keys and addresses

2017-08-29 Thread Eric Dumazet
On Tue, 2017-08-29 at 15:29 -0700, Ivan Delalande wrote:
> Report TCP MD5 (RFC2385) signing keys, addresses and address prefixes to
> processes with CAP_NET_ADMIN requesting INET_DIAG_INFO. Currently it is
> not possible to retrieve these from the kernel once they have been
> configured on sockets.
> 
> Signed-off-by: Ivan Delalande 
> ---
>  include/uapi/linux/inet_diag.h |   1 +
>  net/ipv4/tcp_diag.c| 115 
> ++---
>  2 files changed, 110 insertions(+), 6 deletions(-)
> 
> diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
> index 678496897a68..f52ff62bfabe 100644
> --- a/include/uapi/linux/inet_diag.h
> +++ b/include/uapi/linux/inet_diag.h
> @@ -143,6 +143,7 @@ enum {
>   INET_DIAG_MARK,
>   INET_DIAG_BBRINFO,
>   INET_DIAG_CLASS_ID,
> + INET_DIAG_MD5SIG,
>   __INET_DIAG_MAX,
>  };
>  
> diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
> index a748c74aa8b7..f972f9f7eae4 100644
> --- a/net/ipv4/tcp_diag.c
> +++ b/net/ipv4/tcp_diag.c
> @@ -16,6 +16,7 @@
>  
>  #include 
>  
> +#include 
>  #include 
>  
>  static void tcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
> @@ -36,6 +37,106 @@ static void tcp_diag_get_info(struct sock *sk, struct 
> inet_diag_msg *r,
>   tcp_get_info(sk, info);
>  }
>  
> +#ifdef CONFIG_TCP_MD5SIG
> +static void inet_diag_md5sig_fill(struct tcp_md5sig *info,
> +   const struct tcp_md5sig_key *key)
> +{
> + #if IS_ENABLED(CONFIG_IPV6)
> + if (key->family == AF_INET6) {
> + struct sockaddr_in6 *sin6 =
> + (struct sockaddr_in6 *)>tcpm_addr;
> +
> + memcpy(>sin6_addr, >addr.a6,
> +sizeof(struct in6_addr));
> + } else
> + #endif
> + {
> + struct sockaddr_in *sin =
> + (struct sockaddr_in *)>tcpm_addr;
> +
> + memcpy(>sin_addr, >addr.a4, sizeof(struct in_addr));

You probably need a memset(... 0  ...) to clear the XX bytes that are
not used by IPv4 address.

Otherwise your patch is going to 'leak'  124 bytes of kernel memory to
user space and offer yet another way for attackers to build exploits.

AFAIK, struct tcp_md5sig has a huge hole, since it uses

"struct __kernel_sockaddr_storage tcpm_addr "

So a similar problem exists for IPv6.

Maybe it is time to define a much smaller structure, using only 16 bytes
instead of 128 for the address.




Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-29 Thread Willem de Bruijn
On Tue, Aug 29, 2017 at 4:40 PM, Michael S. Tsirkin  wrote:
> On Tue, Aug 29, 2017 at 03:53:08PM -0400, Willem de Bruijn wrote:
>> On Tue, Aug 29, 2017 at 3:42 PM, Michael S. Tsirkin  wrote:
>> > On Tue, Aug 29, 2017 at 03:35:38PM -0400, Willem de Bruijn wrote:
>> >> By the way, I have had an unrelated patch outstanding for a while
>> >> to have virtio-net support the VIRTIO_CONFIG_S_NEEDS_RESET
>> >> command. Will send that as RFC.
>> >
>> > Oh nice.
>>
>> Great :)
>>
>> > One needs to be careful about locking there which is why
>> > no devices support that yet.
>>
>> I originally wrote it based on the virtnet_reset function introduced
>> for xdp. Calling this from virtnet_config_changed_work is non trivial,
>> as virtnet_freeze_down waits until no config worker is running.
>>
>> Otherwise, I could not find any constraints on when freeze may be
>> called, and it largely follows the same path. I hope I didn't miss anything.
>
> The issue is that on freeze processes are not running so we
> generally know no new packets will arrive (might be wrong
> for bridging, then it's a bug). On device error you must
> prevent new skbs from coming in, etc.

Thanks a lot for the quick review. I had indeed not yet figured out
which invariants freeze can depend on that are not universal. Same
for the virtnet_reset call from the .ndo_xdp. Will need to take a much
better look at that.


[PATCH] drivers: net: xgene: Correct probe sequence handling

2017-08-29 Thread Iyappan Subramanian
From: Quan Nguyen 

The phy is connected at early stage of probe but not properly
disconnected if error occurs.  This patch fixes the issue.

Also changing the return type of xgene_enet_check_phy_handle(),
since this function always returns success.

Signed-off-by: Quan Nguyen 
Signed-off-by: Iyappan Subramanian 
---
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 27 
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c 
b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
index 1d307f2..6e253d9 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
@@ -1661,21 +1661,21 @@ static int xgene_enet_get_irqs(struct xgene_enet_pdata 
*pdata)
return 0;
 }
 
-static int xgene_enet_check_phy_handle(struct xgene_enet_pdata *pdata)
+static void xgene_enet_check_phy_handle(struct xgene_enet_pdata *pdata)
 {
int ret;
 
if (pdata->phy_mode == PHY_INTERFACE_MODE_XGMII)
-   return 0;
+   return;
 
if (!IS_ENABLED(CONFIG_MDIO_XGENE))
-   return 0;
+   return;
 
ret = xgene_enet_phy_connect(pdata->ndev);
if (!ret)
pdata->mdio_driver = true;
 
-   return 0;
+   return;
 }
 
 static void xgene_enet_gpiod_get(struct xgene_enet_pdata *pdata)
@@ -1779,10 +1779,6 @@ static int xgene_enet_get_resources(struct 
xgene_enet_pdata *pdata)
if (ret)
return ret;
 
-   ret = xgene_enet_check_phy_handle(pdata);
-   if (ret)
-   return ret;
-
xgene_enet_gpiod_get(pdata);
 
pdata->clk = devm_clk_get(>dev, NULL);
@@ -2097,9 +2093,11 @@ static int xgene_enet_probe(struct platform_device *pdev)
goto err;
}
 
+   xgene_enet_check_phy_handle(pdata);
+
ret = xgene_enet_init_hw(pdata);
if (ret)
-   goto err;
+   goto err2;
 
link_state = pdata->mac_ops->link_state;
if (pdata->phy_mode == PHY_INTERFACE_MODE_XGMII) {
@@ -2117,29 +2115,30 @@ static int xgene_enet_probe(struct platform_device 
*pdev)
spin_lock_init(>stats_lock);
ret = xgene_extd_stats_init(pdata);
if (ret)
-   goto err2;
+   goto err1;
 
xgene_enet_napi_add(pdata);
ret = register_netdev(ndev);
if (ret) {
netdev_err(ndev, "Failed to register netdev\n");
-   goto err2;
+   goto err1;
}
 
return 0;
 
-err2:
+err1:
/*
 * If necessary, free_netdev() will call netif_napi_del() and undo
 * the effects of xgene_enet_napi_add()'s calls to netif_napi_add().
 */
 
+   xgene_enet_delete_desc_rings(pdata);
+
+err2:
if (pdata->mdio_driver)
xgene_enet_phy_disconnect(pdata);
else if (phy_interface_mode_is_rgmii(pdata->phy_mode))
xgene_enet_mdio_remove(pdata);
-err1:
-   xgene_enet_delete_desc_rings(pdata);
 err:
free_netdev(ndev);
return ret;
-- 
2.7.4



Re: [PATCH RFC 2/2] virtio_net: enable probing for NEEDS_RESET support

2017-08-29 Thread Willem de Bruijn
On Tue, Aug 29, 2017 at 5:13 PM, Michael S. Tsirkin  wrote:
> On Tue, Aug 29, 2017 at 05:02:29PM -0400, Willem de Bruijn wrote:
>> + virtio-dev
>>
>> On Tue, Aug 29, 2017 at 4:38 PM, Michael S. Tsirkin  wrote:
>> > On Tue, Aug 29, 2017 at 04:27:41PM -0400, Willem de Bruijn wrote:
>> >> On Tue, Aug 29, 2017 at 4:16 PM, Michael S. Tsirkin  
>> >> wrote:
>> >> > On Tue, Aug 29, 2017 at 04:07:59PM -0400, Willem de Bruijn wrote:
>> >> >> From: Willem de Bruijn 
>> >> >>
>> >> >> Implement a mechanism to signal that a virtio device implements the
>> >> >> VIRTIO_CONFIG_S_NEEDS_RESET command.
>> >> >>
>> >> >> Testing for VIRTIO_CONFIG_S_NEEDS_RESET support by issuing the request
>> >> >> and verifying the reset state would require an expensive state change.
>> >> >>
>> >> >> To avoid that, add a status bit to the feature register used during
>> >> >> feature negotiation between host and guest.
>> >> >>
>> >> >> Set the bit for virtio-net, which supports the feature.
>> >> >>
>> >> >> Signed-off-by: Willem de Bruijn 
>> >> >
>> >> > All virtio 1 devices have the reset feature
>> >>
>> >> I don't quite follow. No device drivers implement this request currently?
>> >
>> > Depends. Spec 1.0 describes the bit and that driver can respond
>> > by reseting the device. You seem to do something else
>> > in this patchset, but as designed in 1.0 it does not seem to need
>> > a feature bit.
>>
>> I see. So support is designed to be best effort?
>>
>> The feature bit is only needed if driver support is optional.
>>
>> The ack response is necessary if the device acts differently
>> depending on whether the reset happened. The device has
>> to reset its local state, too, so I think that this is needed.
>
> That reset should only happen when guest driver resets the device.
> And spec already has a mechanism for that anyway.

And the device can read the ring state to see whether it has reset,
perhaps.

>>
>> >> > so maybe guest does
>> >> > not need this flag. Does device need it? Does device really
>> >> > care that guest can't recover?
>> >>
>> >> If all device drivers support it, then the flag is not needed.
>> >>
>> >> But as long as legacy device drivers can exist that do not support
>> >> this feature, it has to have a way of differentiating between the two.
>> >
>> > Why? Device won't set this unless it's in a bad state. In that case,
>> > setting the bit is harmless even if drivers ignore it.
>>
>> The goal is for the device to be able to rely on the driver reset to get
>> to a good state even if it gets it into a bad state.
>>
>> That allows it to implement optimizations that could get it into that bad
>> state.
>
> I see. I don't think this is what the need reset was designed for.
>
> We can extend it to cover this case but let's add a bit more
> documentation then.

Okay.

> And in particular won't it be better if we could just reset one ring,
> and not the whole device state? This might be nicer so flows on other
> rings aren't disrupted.

Indeed. But that would require a different request, then? It also
depends on the use case. A full device reset is a big hammer,
but if used only to get out of rare edge cases, it is good enough.

>>
>> In particular, in the edge case where the device performs backpressure,
>> takes the descriptor out of the avail ring and does not immediately post
>> it to the used ring.
>>
>> A reset will make the guest free all delayed packets and treat any
>> unsent and unacknowledged as network drops. This allows the
>> device to indeed drop long delayed packets when they eventually
>> surface (e.g., leave a qdisc queue).
>
> In this particular case, won't it be easier for device to just
> report all packets as used, without involving the guest?

When the device can just iterate over the outstanding packet, yeah,
that's actually simpler.

>> This is of course not safe with
>> zerocopy packets.
>
>
> I wonder if we can teach kernel to drop zero copy packets too.

Your point about changing frags[] underneath a cloned skb really does
make that hard. We might be able to mitigate individual specific cases
of high latency, such as TC queue occupancy.

>
> --
> MST


Re: [PATCH net-next v3 1/2] inet_diag: allow protocols to provide additional data

2017-08-29 Thread Stephen Hemminger
On Tue, 29 Aug 2017 15:29:53 -0700
Ivan Delalande  wrote:

> @@ -452,13 +466,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo 
> *hashinfo,
>   struct net *net = sock_net(in_skb->sk);
>   struct sk_buff *rep;
>   struct sock *sk;
> + bool net_admin = netlink_net_capable(in_skb, CAP_NET_ADMIN);

Please keep declarations in Christmas tree order if possible.

int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
  struct sk_buff *skb, const struct inet_diag_req_v2 *req,
  struct user_namespace *user_ns,
  u32 portid, u32 seq, u16 nlmsg_flags,
  const struct nlmsghdr *unlh,
  bool net_admin)
{
bool net_admin = netlink_net_capable(in_skb, CAP_NET_ADMIN);
const struct tcp_congestion_ops *ca_ops;
const struct inet_diag_handler *handler;
int ext = req->idiag_ext;
struct inet_diag_msg *r;
struct nlmsghdr  *nlh;
struct nlattr *attr;
void *info = NULL;


Re: [PATCH net-next 0/6] flow_dissector: Protocol specific flow dissector offload

2017-08-29 Thread David Miller

Please add proper signoffs to your patches.

Thanks.


Re: [PATCH net-next] ipv6: Use rt6i_idev index for echo replies to a local address

2017-08-29 Thread David Miller
From: David Ahern 
Date: Mon, 28 Aug 2017 13:53:34 -0700

> Tariq repored local pings to linklocal address is failing:
> $ ifconfig ens8
> ens8: flags=4163  mtu 1500
> inet 11.141.16.6  netmask 255.255.0.0  broadcast 11.141.255.255
> inet6 fe80::7efe:90ff:fecb:7502  prefixlen 64  scopeid 0x20
> ether 7c:fe:90:cb:75:02  txqueuelen 1000  (Ethernet)
> RX packets 12  bytes 1164 (1.1 KiB)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 30  bytes 2484 (2.4 KiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> 
> $  /bin/ping6 -c 3 fe80::7efe:90ff:fecb:7502%ens8
> PING fe80::7efe:90ff:fecb:7502%ens8(fe80::7efe:90ff:fecb:7502) 56 data bytes
> 
> --- fe80::7efe:90ff:fecb:7502%ens8 ping statistics ---
> 3 packets transmitted, 0 received, 100% packet loss, time 2043ms
> 
> icmpv6_echo_reply needs to use the rt6i_idev dev index for local traffic
> similar to how icmp6_send does. Convert the change for icmp6_send into a
> helper that can be used in both places. Add the long over due
> skb_rt6_info helper to convert dst on an skb to rt6_info similar to
> skb_rtable for ipv4.
> 
> Fixes: 4832c30d5458 ("net: ipv6: put host and anycast routes on
>device with address")
> Reported-by: Tariq Toukan 
> Signed-off-by: David Ahern 

Applied but please do not break up Fixes: tags even if the line is
very long.

Otherwise it's a pain to grep the logs for specific multi-word string
sequences in the Fixes tag.

Thanks.


Re: [PATCH net-next v1] amd-xgbe: Interrupt summary bits are h/w version dependent

2017-08-29 Thread David Miller
From: Tom Lendacky 
Date: Mon, 28 Aug 2017 15:29:34 -0500

> There is a difference in the bit position of the normal interrupt summary
> enable (NIE) and abnormal interrupt summary enable (AIE) between revisions
> of the hardware.  For older revisions the NIE and AIE bits are positions
> 16 and 15 respectively.  For newer revisions the NIE and AIE bits are
> positions 15 and 14.  The effect in changing the bit position is that
> newer hardware won't receive AIE interrupts in the current version of the
> driver.  Specifically, the driver uses this interrupt to collect
> statistics on when a receive buffer unavailable event occurs and to
> restart the driver/device when a fatal bus error occurs.
> 
> Update the driver to set the interrupt enable bit based on the reported
> version of the hardware.
> 
> Signed-off-by: Tom Lendacky 

Applied.


[PATCH net-next v3 0/2] report TCP MD5 signing keys and addresses

2017-08-29 Thread Ivan Delalande
Allow userspace to retrieve MD5 signature keys and addresses configured
on TCP sockets through inet_diag.

Thank you Eric Dumazet for the useful explanations and feedback.

v3: - rename inet_diag_*md5sig in tcp_diag.c to tcp_diag_* for
  consistency,
- don't lock the socket tcp_diag_put_md5sig,
- add checks on md5sig_count in tcp_diag_put_md5sig to not create
  the netlink attribute if the list is empty, and to avoid overflows
  or memory leaks if the list has changed in the meantime.

v2: - move changes to tcp_diag.c and extend inet_diag_handler to allow
  protocols to provide additional data on INET_DIAG_INFO,
- lock socket before calling tcp_diag_put_md5sig.


I also have a patch for iproute2/ss to test this change, making it print
this new attribute. I'm planning to polish and send it if this series
gets applied.


Ivan Delalande (2):
  inet_diag: allow protocols to provide additional data
  tcp_diag: report TCP MD5 signing keys and addresses

 include/linux/inet_diag.h  |   7 +++
 include/uapi/linux/inet_diag.h |   1 +
 net/ipv4/inet_diag.c   |  22 ++--
 net/ipv4/tcp_diag.c| 115 ++---
 4 files changed, 135 insertions(+), 10 deletions(-)

-- 
2.14.1



[PATCH net-next v3 2/2] tcp_diag: report TCP MD5 signing keys and addresses

2017-08-29 Thread Ivan Delalande
Report TCP MD5 (RFC2385) signing keys, addresses and address prefixes to
processes with CAP_NET_ADMIN requesting INET_DIAG_INFO. Currently it is
not possible to retrieve these from the kernel once they have been
configured on sockets.

Signed-off-by: Ivan Delalande 
---
 include/uapi/linux/inet_diag.h |   1 +
 net/ipv4/tcp_diag.c| 115 ++---
 2 files changed, 110 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index 678496897a68..f52ff62bfabe 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -143,6 +143,7 @@ enum {
INET_DIAG_MARK,
INET_DIAG_BBRINFO,
INET_DIAG_CLASS_ID,
+   INET_DIAG_MD5SIG,
__INET_DIAG_MAX,
 };
 
diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
index a748c74aa8b7..f972f9f7eae4 100644
--- a/net/ipv4/tcp_diag.c
+++ b/net/ipv4/tcp_diag.c
@@ -16,6 +16,7 @@
 
 #include 
 
+#include 
 #include 
 
 static void tcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
@@ -36,6 +37,106 @@ static void tcp_diag_get_info(struct sock *sk, struct 
inet_diag_msg *r,
tcp_get_info(sk, info);
 }
 
+#ifdef CONFIG_TCP_MD5SIG
+static void inet_diag_md5sig_fill(struct tcp_md5sig *info,
+ const struct tcp_md5sig_key *key)
+{
+   #if IS_ENABLED(CONFIG_IPV6)
+   if (key->family == AF_INET6) {
+   struct sockaddr_in6 *sin6 =
+   (struct sockaddr_in6 *)>tcpm_addr;
+
+   memcpy(>sin6_addr, >addr.a6,
+  sizeof(struct in6_addr));
+   } else
+   #endif
+   {
+   struct sockaddr_in *sin =
+   (struct sockaddr_in *)>tcpm_addr;
+
+   memcpy(>sin_addr, >addr.a4, sizeof(struct in_addr));
+   }
+
+   info->tcpm_addr.ss_family = key->family;
+   info->tcpm_prefixlen = key->prefixlen;
+   info->tcpm_keylen = key->keylen;
+   memcpy(info->tcpm_key, key->key, key->keylen);
+}
+
+static int inet_diag_put_md5sig(struct sk_buff *skb,
+   const struct tcp_md5sig_info *md5sig)
+{
+   const struct tcp_md5sig_key *key;
+   struct nlattr *attr;
+   struct tcp_md5sig *info;
+   int md5sig_count = 0;
+
+   hlist_for_each_entry_rcu(key, >head, node)
+   md5sig_count++;
+   if (md5sig_count == 0)
+   return 0;
+
+   attr = nla_reserve(skb, INET_DIAG_MD5SIG,
+  md5sig_count * sizeof(struct tcp_md5sig));
+   if (!attr)
+   return -EMSGSIZE;
+
+   info = nla_data(attr);
+   hlist_for_each_entry_rcu(key, >head, node) {
+   inet_diag_md5sig_fill(info++, key);
+   if (--md5sig_count == 0)
+   break;
+   }
+   if (md5sig_count > 0)
+   memset(info, 0, md5sig_count * sizeof(struct tcp_md5sig));
+
+   return 0;
+}
+#endif
+
+static int tcp_diag_get_aux(struct sock *sk, bool net_admin,
+   struct sk_buff *skb)
+{
+#ifdef CONFIG_TCP_MD5SIG
+   if (net_admin) {
+   struct tcp_md5sig_info *md5sig;
+   int err = 0;
+
+   rcu_read_lock();
+   md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info);
+   if (md5sig)
+   err = inet_diag_put_md5sig(skb, md5sig);
+   rcu_read_unlock();
+   if (err < 0)
+   return err;
+   }
+#endif
+
+   return 0;
+}
+
+static size_t tcp_diag_get_aux_size(struct sock *sk, bool net_admin)
+{
+   size_t size = 0;
+
+#ifdef CONFIG_TCP_MD5SIG
+   if (sk_fullsock(sk)) {
+   const struct tcp_md5sig_info *md5sig;
+   const struct tcp_md5sig_key *key;
+
+   rcu_read_lock();
+   md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info);
+   if (md5sig) {
+   hlist_for_each_entry_rcu(key, >head, node)
+   size += sizeof(struct tcp_md5sig);
+   }
+   rcu_read_unlock();
+   }
+#endif
+
+   return size;
+}
+
 static void tcp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
  const struct inet_diag_req_v2 *r, struct nlattr *bc)
 {
@@ -68,13 +169,15 @@ static int tcp_diag_destroy(struct sk_buff *in_skb,
 #endif
 
 static const struct inet_diag_handler tcp_diag_handler = {
-   .dump= tcp_diag_dump,
-   .dump_one= tcp_diag_dump_one,
-   .idiag_get_info  = tcp_diag_get_info,
-   .idiag_type  = IPPROTO_TCP,
-   .idiag_info_size = sizeof(struct tcp_info),
+   .dump   = tcp_diag_dump,
+   .dump_one   = tcp_diag_dump_one,
+   .idiag_get_info = tcp_diag_get_info,
+   .idiag_get_aux  = tcp_diag_get_aux,
+   .idiag_get_aux_size = 

[PATCH net-next v3 1/2] inet_diag: allow protocols to provide additional data

2017-08-29 Thread Ivan Delalande
Extend inet_diag_handler to allow individual protocols to report
additional data on INET_DIAG_INFO through idiag_get_aux. The size
can be dynamic and is computed by idiag_get_aux_size.

Signed-off-by: Ivan Delalande 
---
 include/linux/inet_diag.h |  7 +++
 net/ipv4/inet_diag.c  | 22 ++
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h
index 65da430e260f..ee251c585854 100644
--- a/include/linux/inet_diag.h
+++ b/include/linux/inet_diag.h
@@ -25,6 +25,13 @@ struct inet_diag_handler {
  struct inet_diag_msg *r,
  void *info);
 
+   int (*idiag_get_aux)(struct sock *sk,
+bool net_admin,
+struct sk_buff *skb);
+
+   size_t  (*idiag_get_aux_size)(struct sock *sk,
+ bool net_admin);
+
int (*destroy)(struct sk_buff *in_skb,
   const struct inet_diag_req_v2 *req);
 
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 67325d5832d7..8a88ef373395 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -93,8 +93,17 @@ void inet_diag_msg_common_fill(struct inet_diag_msg *r, 
struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(inet_diag_msg_common_fill);
 
-static size_t inet_sk_attr_size(void)
+static size_t inet_sk_attr_size(struct sock *sk,
+   const struct inet_diag_req_v2 *req,
+   bool net_admin)
 {
+   const struct inet_diag_handler *handler;
+   size_t aux = 0;
+
+   handler = inet_diag_table[req->sdiag_protocol];
+   if (handler && handler->idiag_get_aux_size)
+   aux = handler->idiag_get_aux_size(sk, net_admin);
+
returnnla_total_size(sizeof(struct tcp_info))
+ nla_total_size(1) /* INET_DIAG_SHUTDOWN */
+ nla_total_size(1) /* INET_DIAG_TOS */
@@ -105,6 +114,7 @@ static size_t inet_sk_attr_size(void)
+ nla_total_size(SK_MEMINFO_VARS * sizeof(u32))
+ nla_total_size(TCP_CA_NAME_MAX)
+ nla_total_size(sizeof(struct tcpvegas_info))
+   + nla_total_size(aux)
+ 64;
 }
 
@@ -260,6 +270,10 @@ int inet_sk_diag_fill(struct sock *sk, struct 
inet_connection_sock *icsk,
 
handler->idiag_get_info(sk, r, info);
 
+   if (ext & (1 << (INET_DIAG_INFO - 1)) && handler->idiag_get_aux)
+   if (handler->idiag_get_aux(sk, net_admin, skb) < 0)
+   goto errout;
+
if (sk->sk_state < TCP_TIME_WAIT) {
union tcp_cc_info info;
size_t sz = 0;
@@ -452,13 +466,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo 
*hashinfo,
struct net *net = sock_net(in_skb->sk);
struct sk_buff *rep;
struct sock *sk;
+   bool net_admin = netlink_net_capable(in_skb, CAP_NET_ADMIN);
int err;
 
sk = inet_diag_find_one_icsk(net, hashinfo, req);
if (IS_ERR(sk))
return PTR_ERR(sk);
 
-   rep = nlmsg_new(inet_sk_attr_size(), GFP_KERNEL);
+   rep = nlmsg_new(inet_sk_attr_size(sk, req, net_admin), GFP_KERNEL);
if (!rep) {
err = -ENOMEM;
goto out;
@@ -467,8 +482,7 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
err = sk_diag_fill(sk, rep, req,
   sk_user_ns(NETLINK_CB(in_skb).sk),
   NETLINK_CB(in_skb).portid,
-  nlh->nlmsg_seq, 0, nlh,
-  netlink_net_capable(in_skb, CAP_NET_ADMIN));
+  nlh->nlmsg_seq, 0, nlh, net_admin);
if (err < 0) {
WARN_ON(err == -EMSGSIZE);
nlmsg_free(rep);
-- 
2.14.1



Re: Permissions for eBPF objects

2017-08-29 Thread Mickaël Salaün

On 29/08/2017 03:44, Chenbo Feng wrote:
> On Mon, Aug 28, 2017 at 6:15 PM, Alexei Starovoitov
>  wrote:
>> On Mon, Aug 28, 2017 at 05:47:19PM -0700, Chenbo Feng wrote:
>>> On Fri, Aug 25, 2017 at 6:03 PM, Alexei Starovoitov
>>>  wrote:
 On Fri, Aug 25, 2017 at 10:07:27PM +0200, Daniel Borkmann wrote:
> On 08/25/2017 09:52 PM, Chenbo Feng wrote:
>> On Fri, Aug 25, 2017 at 12:45 PM, Jeffrey Vander Stoep 
>>  wrote:
>>> On Fri, Aug 25, 2017 at 12:26 PM, Stephen Smalley  
>>> wrote:
 On Fri, 2017-08-25 at 11:01 -0700, Jeffrey Vander Stoep via Selinux
 wrote:
> I’d like to get your thoughts on adding LSM permission checks on BPF
> objects.

 before reinventing the wheel please take a look at landlock work.
 Everything that was discussed in this thread is covered by it.
 The patches have been in development for more than a year and most of the 
 early
 issues have been resolved.
 It will be presented again during security summit in LA in September.

>>> I am not very familiar with landlock lsm, isn't this module also
>>> depend on the lsm hooks to do
>>> the landlock check? If so then adding lsm hooks for eBPF object seems
>>> not conflict with the
>>> work on progress.
>>
>> I see. I got it the other way around. What lsm checks are you proposing?
>> and why unprivileged_bpf_disabled is not enough?
>> you want to allow unpriv only for specific user(s) ?
>>
> Exactly, the proposal patch I am currently working on will add checks
> before map creation,
> map read,  and map modify, since all these functionalities will be
> available to all users when
> unprivileged_bpf_disabled is turned off. And eBPF prog_load may also
> need a check as well
> since loading some types of program is not restricted either.
> 

It would be interesting to be able to check a wide range of actions
performed with the BPF syscall: the command and the union bpf_attr
argument. Because it is a multiplexer, that may be challenging, though.



signature.asc
Description: OpenPGP digital signature


Re: [PATCH net-next 0/4] nsh: headers, GSO

2017-08-29 Thread David Miller
From: Jiri Benc 
Date: Mon, 28 Aug 2017 21:43:20 +0200

> This adds header structs and helpers for NSH together with GSO support.
> 
> Note there is no code in this patchset that actually manipulates the NSH
> headers. That was sent to netdev by Yi Yang ("[PATCH net-next v6 0/3]
> openvswitch: add NSH support"). The aim of this series is to lay the
> groundwork and ease the implementation for him.
> 
> In addition to openvswitch, the NSH support should be added to tc (flower to
> match, act_nsh to push/pop NSH headers). That will come later. There's
> currently no plan to support NSH by other means than those two.
> 
> The patch 3 in this patchset was written by Yi Yang, I took it from the
> aforementioned series and slightly modified it - see the note in the patch.

Series applied, thanks Jiri.


[PATCH net-next] neigh: increase queue_len_bytes to match wmem_default

2017-08-29 Thread Eric Dumazet
From: Eric Dumazet 

Florian reported UDP xmit drops that could be root caused to the
too small neigh limit.

Current limit is 64 KB, meaning that even a single UDP socket would hit
it, since its default sk_sndbuf comes from net.core.wmem_default
(~212992 bytes on 64bit arches).

Once ARP/ND resolution is in progress, we should allow a little more
packets to be queued, at least for one producer.

Once neigh arp_queue is filled, a rogue socket should hit its sk_sndbuf
limit and either block in sendmsg() or return -EAGAIN.

Signed-off-by: Eric Dumazet 
Reported-by: Florian Fainelli 
---
 Documentation/networking/ip-sysctl.txt |7 +--
 include/net/sock.h |   10 ++
 net/core/sock.c|   10 --
 net/decnet/dn_neigh.c  |2 +-
 net/ipv4/arp.c |2 +-
 net/ipv4/tcp_input.c   |2 +-
 net/ipv6/ndisc.c   |2 +-
 7 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 
6b0bc0f715346a097a6df46e2ba2771359abcd23..b3345d0fe0a67e477a6754848e7fc7be144322d5
 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -109,7 +109,10 @@ neigh/default/unres_qlen_bytes - INTEGER
queued for each unresolved address by other network layers.
(added in linux 3.3)
Setting negative value is meaningless and will return error.
-   Default: 65536 Bytes(64KB)
+   Default: SK_WMEM_MAX, (same as net.core.wmem_default).
+   Exact value depends on architecture and kernel options,
+   but should be enough to allow queuing 256 packets
+   of medium size.
 
 neigh/default/unres_qlen - INTEGER
The maximum number of packets which may be queued for each
@@ -119,7 +122,7 @@ neigh/default/unres_qlen - INTEGER
unexpected packet loss. The current default value is calculated
according to default value of unres_qlen_bytes and true size of
packet.
-   Default: 31
+   Default: 101
 
 mtu_expires - INTEGER
Time, in seconds, that cached PMTU information is kept.
diff --git a/include/net/sock.h b/include/net/sock.h
index 
1c2912d433e81b10f3fdc87bcfcbb091570edc03..03a362568357acc7278a318423dd3873103f90ca
 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2368,6 +2368,16 @@ bool sk_net_capable(const struct sock *sk, int cap);
 
 void sk_get_meminfo(const struct sock *sk, u32 *meminfo);
 
+/* Take into consideration the size of the struct sk_buff overhead in the
+ * determination of these values, since that is non-constant across
+ * platforms.  This makes socket queueing behavior and performance
+ * not depend upon such differences.
+ */
+#define _SK_MEM_PACKETS256
+#define _SK_MEM_OVERHEAD   SKB_TRUESIZE(256)
+#define SK_WMEM_MAX(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
+#define SK_RMEM_MAX(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
+
 extern __u32 sysctl_wmem_max;
 extern __u32 sysctl_rmem_max;
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 
dfdd14cac775e9bfcee0085ee32ffcd0ab28b67b..9b7b6bbb2a23e7652a1f34a305f29d49de00bc8c
 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -307,16 +307,6 @@ static struct lock_class_key af_wlock_keys[AF_MAX];
 static struct lock_class_key af_elock_keys[AF_MAX];
 static struct lock_class_key af_kern_callback_keys[AF_MAX];
 
-/* Take into consideration the size of the struct sk_buff overhead in the
- * determination of these values, since that is non-constant across
- * platforms.  This makes socket queueing behavior and performance
- * not depend upon such differences.
- */
-#define _SK_MEM_PACKETS256
-#define _SK_MEM_OVERHEAD   SKB_TRUESIZE(256)
-#define SK_WMEM_MAX(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
-#define SK_RMEM_MAX(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
-
 /* Run time adjustable parameters. */
 __u32 sysctl_wmem_max __read_mostly = SK_WMEM_MAX;
 EXPORT_SYMBOL(sysctl_wmem_max);
diff --git a/net/decnet/dn_neigh.c b/net/decnet/dn_neigh.c
index 
21dedf6fd0f76dec22b2b3685beb89cfefea7ded..22bf0b95d6edc3c27ef3a99d27cb70a1551e3e0e
 100644
--- a/net/decnet/dn_neigh.c
+++ b/net/decnet/dn_neigh.c
@@ -94,7 +94,7 @@ struct neigh_table dn_neigh_table = {
[NEIGH_VAR_BASE_REACHABLE_TIME] = 30 * HZ,
[NEIGH_VAR_DELAY_PROBE_TIME] = 5 * HZ,
[NEIGH_VAR_GC_STALETIME] = 60 * HZ,
-   [NEIGH_VAR_QUEUE_LEN_BYTES] = 64*1024,
+   [NEIGH_VAR_QUEUE_LEN_BYTES] = SK_WMEM_MAX,
[NEIGH_VAR_PROXY_QLEN] = 0,
[NEIGH_VAR_ANYCAST_DELAY] = 0,
[NEIGH_VAR_PROXY_DELAY] = 0,
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 

Re: [PATCH net-next 0/3] tc: act_ife: handle IEEE IFE ethertype as default

2017-08-29 Thread David Miller
From: Alexander Aring 
Date: Mon, 28 Aug 2017 15:03:12 -0400

> this patch series will introduce the IFE ethertype which is registered by
> IEEE. If the netlink act_ife type netlink attribute is not given it will
> use this value by default now.
> At least it will introduce some UAPI testcases to check if the default type
> is used if not specified and vice versa.

Series applied, thank you.


Re: [PATCH net-next] Documentation: networking: Add blurb about patches in patchwork

2017-08-29 Thread David Miller
From: Florian Fainelli 
Date: Tue, 29 Aug 2017 15:07:51 -0700

> Explain that the patch queue in patchwork should not be touched by patch
> submitters.
> 
> Signed-off-by: Florian Fainelli 

Applied.


Re: [PATCH net-next 0/4] Endian fixes for SYSTEMPORT/SF2/MDIO

2017-08-29 Thread David Miller
From: Florian Fainelli 
Date: Tue, 29 Aug 2017 15:08:00 -0700

> On 08/29/2017 02:52 PM, David Miller wrote:
>> From: Florian Fainelli 
>> Date: Tue, 29 Aug 2017 14:45:30 -0700
>> 
>>> On 08/29/2017 02:42 PM, David Miller wrote:
 From: Florian Fainelli 
 Date: Tue, 29 Aug 2017 11:39:41 -0700

> While trying an ARM BE kernel for kinks, the 3 drivers below started not
> working and the reasons why became pretty obvious because the register 
> space
> remains LE (hardwired), except for Broadcom MIPS where it follows the 
> CPU's
> native endian (let's call that a feature).

 Series applied, thanks Florian.
>>>
>>> If you have not pushed yet, seems like not, can you check you applied v2
>>> of this patch series, in particular this patch:
>>>
>>> http://patchwork.ozlabs.org/patch/807296/
>> 
>> If you didn't keep messing with the patchwork state of patches in my
>> queue this confusion would not have happened.
>> 
>> I did happen to apply v2, but because of all of the confusion you keep
>> creating, I used the header message from v1.
>> 
>> It's already pushed out so there is nothing we can do about it.
>> 
> 
> It's all good, thanks! /me goes updating netdev-FAQ.txt to mention not
> touching patchwork.

Thank you.


Re: [PATCH net v2 0/6] net:ethernet:aquantia: Atlantic driver Update 2017-08-23

2017-08-29 Thread David Miller
From: Pavel Belous 
Date: Mon, 28 Aug 2017 21:52:07 +0300

> From: Pavel Belous 
> 
> This series contains updates for aQuantia Atlantic driver.
> 
> It has bugfixes and some improvements.
> 
> Changes in v2:
>  - "MCP state change" fix removed (will be sent as
> a separate fix after further investigation.)

Series applied, thanks.


Re: [PATCH] packet: Don't write vnet header beyond end of buffer

2017-08-29 Thread David Miller
From: Benjamin Poirier 
Date: Mon, 28 Aug 2017 14:29:41 -0400

> ... which may happen with certain values of tp_reserve and maclen.
> 
> Fixes: 58d19b19cd99 ("packet: vnet_hdr support for tpacket_rcv")
> Signed-off-by: Benjamin Poirier 
> Cc: Willem de Bruijn 

Applied and queued up for -stable.


Re: [PATCH net v1 1/1] tipc: permit bond slave as bearer

2017-08-29 Thread David Miller
From: Parthasarathy Bhuvaragan 
Date: Mon, 28 Aug 2017 17:57:02 +0200

> For a bond slave device as a tipc bearer, the dev represents the bond
> interface and orig_dev represents the slave in tipc_l2_rcv_msg().
> Since we decode the tipc_ptr from bonding device (dev), we fail to
> find the bearer and thus tipc links are not established.
> 
> In this commit, we register the tipc protocol callback per device and
> look for tipc bearer from both the devices.
> 
> Signed-off-by: Parthasarathy Bhuvaragan 
> 

Applied.


Re: [PATCH net-next 0/4] Endian fixes for SYSTEMPORT/SF2/MDIO

2017-08-29 Thread Florian Fainelli
On 08/29/2017 02:52 PM, David Miller wrote:
> From: Florian Fainelli 
> Date: Tue, 29 Aug 2017 14:45:30 -0700
> 
>> On 08/29/2017 02:42 PM, David Miller wrote:
>>> From: Florian Fainelli 
>>> Date: Tue, 29 Aug 2017 11:39:41 -0700
>>>
 While trying an ARM BE kernel for kinks, the 3 drivers below started not
 working and the reasons why became pretty obvious because the register 
 space
 remains LE (hardwired), except for Broadcom MIPS where it follows the CPU's
 native endian (let's call that a feature).
>>>
>>> Series applied, thanks Florian.
>>
>> If you have not pushed yet, seems like not, can you check you applied v2
>> of this patch series, in particular this patch:
>>
>> http://patchwork.ozlabs.org/patch/807296/
> 
> If you didn't keep messing with the patchwork state of patches in my
> queue this confusion would not have happened.
> 
> I did happen to apply v2, but because of all of the confusion you keep
> creating, I used the header message from v1.
> 
> It's already pushed out so there is nothing we can do about it.
> 

It's all good, thanks! /me goes updating netdev-FAQ.txt to mention not
touching patchwork.
-- 
Florian


[PATCH net-next] Documentation: networking: Add blurb about patches in patchwork

2017-08-29 Thread Florian Fainelli
Explain that the patch queue in patchwork should not be touched by patch
submitters.

Signed-off-by: Florian Fainelli 
---
 Documentation/networking/netdev-FAQ.txt | 8 
 1 file changed, 8 insertions(+)

diff --git a/Documentation/networking/netdev-FAQ.txt 
b/Documentation/networking/netdev-FAQ.txt
index 247a30ba8e17..cfc66ea72329 100644
--- a/Documentation/networking/netdev-FAQ.txt
+++ b/Documentation/networking/netdev-FAQ.txt
@@ -111,6 +111,14 @@ A: Generally speaking, the patches get triaged quickly (in 
less than 48h).
patch is a good way to ensure your patch is ignored or pushed to
the bottom of the priority list.
 
+Q: I submitted multiple versions of the patch series, should I directly update
+   patchwork for the previous versions of these patch series?
+
+A: No, please don't interfere with the patch status on patchwork, leave it to
+   the maintainer to figure out what is the most recent and current version 
that
+   should be applied. If there is any doubt, the maintainer will reply and ask
+   what should be done.
+
 Q: How can I tell what patches are queued up for backporting to the
various stable releases?
 
-- 
2.9.3



Re: [PATCH net-next 0/4] mlx4 misc patches

2017-08-29 Thread David Miller
From: Tariq Toukan 
Date: Mon, 28 Aug 2017 16:38:19 +0300

> This patchset contains misc patches from the team
> to the mlx4 Core and Eth drivers.
> 
> Patch 1 by Eran replaces large static allocations by dynamic ones.
> Patch 2 by Leon makes an explicit conversion and solves a smatch warning.
> In patch 3 I fix a misplaced brackets of the sizeof operation.
> Patch 4 by Moshe adds the ability to inform the FW regarding user mac updates.
> 
> Series generated against net-next commit:
> 901c5d2fbfcd ARM: dts: rk3228-evb: Fix the compiling error

Series applied, thanks.


Re: [PATCH v2 net-next 0/8] bpf: Add option to set mark and priority in cgroup sock programs

2017-08-29 Thread David Miller
From: David Ahern 
Date: Fri, 25 Aug 2017 12:05:33 -0700

> Add option to set mark and priority in addition to bound device for newly
> created sockets. Also, allow the bpf programs to use the get_current_uid_gid
> helper meaning socket marks, priority and device can be set base on the
> uid/gid of the running process.
> 
> For flexbility in deploying these programs, option is added to allow cgroups
> to be walked from current to root running any program attached. This allows
> one cgroup level to control the device a socket is bound to (e.g, a VRF) while
> cgroups can be used to set socket marks and priority.
> 
> Sample programs are updated to demonstrate the new options.
> 
> v2
> - added flag to control recursive behavior as requested by Alexei
> - added comment to sock_filter_func_proto regarding use of
>   get_current_uid_gid helper
> - updated test programs for recursive option

I'm marking this patch series as "deferred" while the semantic issues
keep getting discussed.

Thanks.


Re: [PATCH net-next 0/4] Endian fixes for SYSTEMPORT/SF2/MDIO

2017-08-29 Thread David Miller
From: Florian Fainelli 
Date: Tue, 29 Aug 2017 14:45:30 -0700

> On 08/29/2017 02:42 PM, David Miller wrote:
>> From: Florian Fainelli 
>> Date: Tue, 29 Aug 2017 11:39:41 -0700
>> 
>>> While trying an ARM BE kernel for kinks, the 3 drivers below started not
>>> working and the reasons why became pretty obvious because the register space
>>> remains LE (hardwired), except for Broadcom MIPS where it follows the CPU's
>>> native endian (let's call that a feature).
>> 
>> Series applied, thanks Florian.
> 
> If you have not pushed yet, seems like not, can you check you applied v2
> of this patch series, in particular this patch:
> 
> http://patchwork.ozlabs.org/patch/807296/

If you didn't keep messing with the patchwork state of patches in my
queue this confusion would not have happened.

I did happen to apply v2, but because of all of the confusion you keep
creating, I used the header message from v1.

It's already pushed out so there is nothing we can do about it.


Re: [PATCH net-next v3] bridge: fdb add and delete tracepoints

2017-08-29 Thread David Miller
From: Roopa Prabhu 
Date: Tue, 29 Aug 2017 13:16:57 -0700

> From: Roopa Prabhu 
> 
> A few useful tracepoints to trace bridge forwarding
> database updates.
> 
> Signed-off-by: Roopa Prabhu 
> ---
> v2: address comments from florian
> v3: remove stray character '=' in print (pointed out by florian)

Applied, thanks.


Re: [PATCH net-next 0/4] Endian fixes for SYSTEMPORT/SF2/MDIO

2017-08-29 Thread Florian Fainelli
On 08/29/2017 02:42 PM, David Miller wrote:
> From: Florian Fainelli 
> Date: Tue, 29 Aug 2017 11:39:41 -0700
> 
>> While trying an ARM BE kernel for kinks, the 3 drivers below started not
>> working and the reasons why became pretty obvious because the register space
>> remains LE (hardwired), except for Broadcom MIPS where it follows the CPU's
>> native endian (let's call that a feature).
> 
> Series applied, thanks Florian.

If you have not pushed yet, seems like not, can you check you applied v2
of this patch series, in particular this patch:

http://patchwork.ozlabs.org/patch/807296/

Thanks!
-- 
Florian


Re: [PATCH net-next 0/4] Endian fixes for SYSTEMPORT/SF2/MDIO

2017-08-29 Thread David Miller
From: Florian Fainelli 
Date: Tue, 29 Aug 2017 11:39:41 -0700

> While trying an ARM BE kernel for kinks, the 3 drivers below started not
> working and the reasons why became pretty obvious because the register space
> remains LE (hardwired), except for Broadcom MIPS where it follows the CPU's
> native endian (let's call that a feature).

Series applied, thanks Florian.


Re: [PATCH 0/4] irda: move it to drivers/staging so we can delete it

2017-08-29 Thread Ondrej Zary
On Tuesday 29 August 2017 01:42:08 David Miller wrote:
> From: Greg Kroah-Hartman 
> Date: Sun, 27 Aug 2017 17:03:30 +0200
>
> > The IRDA code has long been obsolete and broken.  So, to keep people
> > from trying to use it, and to prevent people from having to maintain it,
> > let's move it to drivers/staging/ so that we can delete it entirely from
> > the kernel in a few releases.
>
> No objection, I'll apply this to net-next, thanks Greg.

IRDA works fine in Debian 9 (kernel 4.9) and I use it for simple file 
transfer. Hope I'm not the only one...

# irattach /dev/ttyS0 -d tekram -s
# irdadump
21:28:52.830350 xid:cmd aed8eb79 >  S=6 s=0 (14)
21:28:52.922368 xid:cmd aed8eb79 >  S=6 s=1 (14)
21:28:53.014350 xid:cmd aed8eb79 >  S=6 s=2 (14)
21:28:53.106338 xid:cmd aed8eb79 >  S=6 s=3 (14)
21:28:53.190276 xid:rsp aed8eb79 < 35d1 S=6 s=3 Nokia 6230i hint=b125 [ 
PnP Modem Fax Telephony IrCOMM IrOBEX ] (28)
21:28:53.198384 xid:cmd aed8eb79 >  S=6 s=4 (14)
21:28:53.290382 xid:cmd aed8eb79 >  S=6 s=5 (14)
21:28:53.382341 xid:cmd aed8eb79 >  S=6 s=* pentium hint=0400 [ 
Computer ] (23)
^C
8 packets received by filter

$ obexftp -i -l MMC
Connecting..\done
Receiving "MMC".../
 ]>






$ obexftp -i -c MMC -g Image004.jpg
Connecting..\done
Sending "MMC"...|done
Receiving "Image004.jpg"...-done
Disconnecting..\done


-- 
Ondrej Zary


Re: [PATCH] DSA support for Micrel KSZ8895

2017-08-29 Thread Florian Fainelli
On 08/29/2017 02:15 PM, Pavel Machek wrote:
> On Tue 2017-08-29 14:26:04, Andrew Lunn wrote:
>>> But the MDIO emaulation code is from their driver, after lots of
>>> deletions.
>>
>> Is this driver supposed to run on lots of different OSs? That would
>> explain why they ignored the Linux MDIO and PHY layers.
> 
> It did not look particulary portable.

Part of the problem is that they need to duplicate the standard MII
definitions, whereas we could re-use those from include/linux/mii.h
and/or mdio.h.

> 
>> If possible, please make use of the Linux infrastructure.
> 
> I did not find any infrastructure I could use instead
> ksz_mdio_emulation.

fixed PHY/swphy.c is as close as it could get, but it is a highly
simplified version of this.

> 
> Now, drivers/net/phy/spi_ks8995.c can access the PHY registers, and I
> can not see any translation there, so there may be something I'm
> missing.

I don't see anything in that driver that seems to access PHY registers
what makes you think it does?

There's got to be a way to perform indirect accesses through SPI,
Woojung, do you know?

> 
> Pointers would be welcome at this point.
> 
> Thanks,
>   Pavel
> 


-- 
Florian


Re: mlxsw and rtnl lock

2017-08-29 Thread Arkadi Sharshevsky


On 08/29/2017 11:04 PM, David Ahern wrote:
> On 8/29/17 12:10 AM, Arkadi Sharshevsky wrote:
>>
>>
>> On 08/28/2017 09:00 PM, David Ahern wrote:
>>> On 8/26/17 11:04 AM, Ido Schimmel wrote:
 Regarding the silent abort, that's intentional. You can look at the same
 code in v4.9 - when the chain was still blocking - and you'll see that
 we didn't propagate the error even then. This was discussed in the past
 and the conclusion was that user doesn't expect to operation to fail. If
 hardware resources are exceeded, we let the kernel take care of the
 forwarding instead.

>>>
>>> In addition to Roopa's comments... The silent abort is not a good user
>>> experience. Right now it's add a network address or route, cross fingers
>>> and hope it does not overflow some limit (nexthop, ecmp, neighbor,
>>> prefix, etc) that triggers the offload abort.
>>>
>>> The mlxsw driver queries for some limits (e.g., max rifs) but I don't
>>> see any query related to current usage, and there is no API to pass any
>>> of that data to user space so user space has no programmatic way to
>>> handle this. I realize you are aware of this limitation. The point is to
>>> emphasize the need to resolve this.
>>>
>>
>> We actually thought about providing he user some tools to understand
>> the ASIC's limitations by introducing the 'resource' object to devlink.
>>
>> By linking dpipe tables to resources the user can understand which
>> hardware processes share a common resource, furthermore this resources
>> usage could be observed. By this more visibility can be obtained.
>>
>> Its not a remedy for the silent abort, but, maybe a notification
>> can be sent from devlink in case of abort that some resources is
>> full.
>>
>> This proposition was sent as RFC several weeks ago.
>>
> 
> Do you have patches (kernel and devlink) for the proposal?
> 

No, only the design RFC which describe the UAPI, devlink
commands and the devlink/driver interactions.

I wanted to receive some feedback before the coding.


Re: [PATCH] DSA support for Micrel KSZ8895

2017-08-29 Thread Pavel Machek
On Tue 2017-08-29 14:26:04, Andrew Lunn wrote:
> > But the MDIO emaulation code is from their driver, after lots of
> > deletions.
> 
> Is this driver supposed to run on lots of different OSs? That would
> explain why they ignored the Linux MDIO and PHY layers.

It did not look particulary portable.

> If possible, please make use of the Linux infrastructure.

I did not find any infrastructure I could use instead
ksz_mdio_emulation.

Now, drivers/net/phy/spi_ks8995.c can access the PHY registers, and I
can not see any translation there, so there may be something I'm
missing.

Pointers would be welcome at this point.

Thanks,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: [PATCH RFC 2/2] virtio_net: enable probing for NEEDS_RESET support

2017-08-29 Thread Michael S. Tsirkin
On Tue, Aug 29, 2017 at 05:02:29PM -0400, Willem de Bruijn wrote:
> + virtio-dev
> 
> On Tue, Aug 29, 2017 at 4:38 PM, Michael S. Tsirkin  wrote:
> > On Tue, Aug 29, 2017 at 04:27:41PM -0400, Willem de Bruijn wrote:
> >> On Tue, Aug 29, 2017 at 4:16 PM, Michael S. Tsirkin  
> >> wrote:
> >> > On Tue, Aug 29, 2017 at 04:07:59PM -0400, Willem de Bruijn wrote:
> >> >> From: Willem de Bruijn 
> >> >>
> >> >> Implement a mechanism to signal that a virtio device implements the
> >> >> VIRTIO_CONFIG_S_NEEDS_RESET command.
> >> >>
> >> >> Testing for VIRTIO_CONFIG_S_NEEDS_RESET support by issuing the request
> >> >> and verifying the reset state would require an expensive state change.
> >> >>
> >> >> To avoid that, add a status bit to the feature register used during
> >> >> feature negotiation between host and guest.
> >> >>
> >> >> Set the bit for virtio-net, which supports the feature.
> >> >>
> >> >> Signed-off-by: Willem de Bruijn 
> >> >
> >> > All virtio 1 devices have the reset feature
> >>
> >> I don't quite follow. No device drivers implement this request currently?
> >
> > Depends. Spec 1.0 describes the bit and that driver can respond
> > by reseting the device. You seem to do something else
> > in this patchset, but as designed in 1.0 it does not seem to need
> > a feature bit.
> 
> I see. So support is designed to be best effort?
> 
> The feature bit is only needed if driver support is optional.
> 
> The ack response is necessary if the device acts differently
> depending on whether the reset happened. The device has
> to reset its local state, too, so I think that this is needed.

That reset should only happen when guest driver resets the device.
And spec already has a mechanism for that anyway.

> 
> >> > so maybe guest does
> >> > not need this flag. Does device need it? Does device really
> >> > care that guest can't recover?
> >>
> >> If all device drivers support it, then the flag is not needed.
> >>
> >> But as long as legacy device drivers can exist that do not support
> >> this feature, it has to have a way of differentiating between the two.
> >
> > Why? Device won't set this unless it's in a bad state. In that case,
> > setting the bit is harmless even if drivers ignore it.
> 
> The goal is for the device to be able to rely on the driver reset to get
> to a good state even if it gets it into a bad state.
> 
> That allows it to implement optimizations that could get it into that bad
> state.

I see. I don't think this is what the need reset was designed for.

We can extend it to cover this case but let's add a bit more
documentation then.

And in particular won't it be better if we could just reset one ring,
and not the whole device state? This might be nicer so flows on other
rings aren't disrupted.

> 
> In particular, in the edge case where the device performs backpressure,
> takes the descriptor out of the avail ring and does not immediately post
> it to the used ring.
> 
> A reset will make the guest free all delayed packets and treat any
> unsent and unacknowledged as network drops. This allows the
> device to indeed drop long delayed packets when they eventually
> surface (e.g., leave a qdisc queue).

In this particular case, won't it be easier for device to just
report all packets as used, without involving the guest?

> This is of course not safe with
> zerocopy packets.


I wonder if we can teach kernel to drop zero copy packets too.

-- 
MST


Re: [PATCH RFC 2/2] virtio_net: enable probing for NEEDS_RESET support

2017-08-29 Thread Willem de Bruijn
+ virtio-dev

On Tue, Aug 29, 2017 at 4:38 PM, Michael S. Tsirkin  wrote:
> On Tue, Aug 29, 2017 at 04:27:41PM -0400, Willem de Bruijn wrote:
>> On Tue, Aug 29, 2017 at 4:16 PM, Michael S. Tsirkin  wrote:
>> > On Tue, Aug 29, 2017 at 04:07:59PM -0400, Willem de Bruijn wrote:
>> >> From: Willem de Bruijn 
>> >>
>> >> Implement a mechanism to signal that a virtio device implements the
>> >> VIRTIO_CONFIG_S_NEEDS_RESET command.
>> >>
>> >> Testing for VIRTIO_CONFIG_S_NEEDS_RESET support by issuing the request
>> >> and verifying the reset state would require an expensive state change.
>> >>
>> >> To avoid that, add a status bit to the feature register used during
>> >> feature negotiation between host and guest.
>> >>
>> >> Set the bit for virtio-net, which supports the feature.
>> >>
>> >> Signed-off-by: Willem de Bruijn 
>> >
>> > All virtio 1 devices have the reset feature
>>
>> I don't quite follow. No device drivers implement this request currently?
>
> Depends. Spec 1.0 describes the bit and that driver can respond
> by reseting the device. You seem to do something else
> in this patchset, but as designed in 1.0 it does not seem to need
> a feature bit.

I see. So support is designed to be best effort?

The feature bit is only needed if driver support is optional.

The ack response is necessary if the device acts differently
depending on whether the reset happened. The device has
to reset its local state, too, so I think that this is needed.


>> > so maybe guest does
>> > not need this flag. Does device need it? Does device really
>> > care that guest can't recover?
>>
>> If all device drivers support it, then the flag is not needed.
>>
>> But as long as legacy device drivers can exist that do not support
>> this feature, it has to have a way of differentiating between the two.
>
> Why? Device won't set this unless it's in a bad state. In that case,
> setting the bit is harmless even if drivers ignore it.

The goal is for the device to be able to rely on the driver reset to get
to a good state even if it gets it into a bad state.

That allows it to implement optimizations that could get it into that bad
state.

In particular, in the edge case where the device performs backpressure,
takes the descriptor out of the avail ring and does not immediately post
it to the used ring.

A reset will make the guest free all delayed packets and treat any
unsent and unacknowledged as network drops. This allows the
device to indeed drop long delayed packets when they eventually
surface (e.g., leave a qdisc queue). This is of course not safe with
zerocopy packets.


[PATCH net-next v2 2/4] net: dsa: bcm_sf2: Use correct I/O accessors

2017-08-29 Thread Florian Fainelli
The Starfigther 2 driver currently uses __raw_{read,write}l which means
native I/O endian. This works correctly for an ARM LE kernel (default)
but fails miserably on an ARM BE (BE8) kernel where registers are kept
little endian, so replace uses with {read,write}l_relaxed here which is
what we want because this is all performance sensitive code.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.h b/drivers/net/dsa/bcm_sf2.h
index 7d3030e04f11..d9c96b281fc0 100644
--- a/drivers/net/dsa/bcm_sf2.h
+++ b/drivers/net/dsa/bcm_sf2.h
@@ -130,12 +130,12 @@ static inline u32 bcm_sf2_mangle_addr(struct bcm_sf2_priv 
*priv, u32 off)
 #define SF2_IO_MACRO(name) \
 static inline u32 name##_readl(struct bcm_sf2_priv *priv, u32 off) \
 {  \
-   return __raw_readl(priv->name + off);   \
+   return readl_relaxed(priv->name + off); \
 }  \
 static inline void name##_writel(struct bcm_sf2_priv *priv,\
  u32 val, u32 off) \
 {  \
-   __raw_writel(val, priv->name + off);\
+   writel_relaxed(val, priv->name + off);  \
 }  \
 
 /* Accesses to 64-bits register requires us to latch the hi/lo pairs
@@ -179,23 +179,23 @@ static inline u32 bcm_sf2_mangle_addr(struct bcm_sf2_priv 
*priv, u32 off)
 static inline u32 core_readl(struct bcm_sf2_priv *priv, u32 off)
 {
u32 tmp = bcm_sf2_mangle_addr(priv, off);
-   return __raw_readl(priv->core + tmp);
+   return readl_relaxed(priv->core + tmp);
 }
 
 static inline void core_writel(struct bcm_sf2_priv *priv, u32 val, u32 off)
 {
u32 tmp = bcm_sf2_mangle_addr(priv, off);
-   __raw_writel(val, priv->core + tmp);
+   writel_relaxed(val, priv->core + tmp);
 }
 
 static inline u32 reg_readl(struct bcm_sf2_priv *priv, u16 off)
 {
-   return __raw_readl(priv->reg + priv->reg_offsets[off]);
+   return readl_relaxed(priv->reg + priv->reg_offsets[off]);
 }
 
 static inline void reg_writel(struct bcm_sf2_priv *priv, u32 val, u16 off)
 {
-   __raw_writel(val, priv->reg + priv->reg_offsets[off]);
+   writel_relaxed(val, priv->reg + priv->reg_offsets[off]);
 }
 
 SF2_IO64_MACRO(core);
-- 
1.9.1



[PATCH net-next v2 0/4] Endian fixes for SYSTEMPORT/SF2/MDIO

2017-08-29 Thread Florian Fainelli
Hi David,

While trying an ARM BE kernel for kinks, the 3 drivers below started not
working and the reasons why became pretty obvious because the register space
remains LE (hardwired), except for Broadcom MIPS where it follows the CPU's
native endian (let's call that a feature).

Thanks!

Changes in v2:

- correctly set RSB_SWAP1 and RSB_SWAP0 for all combinations, now properly
  tested on SYSTEMPORT (BCM7445), SYSTEMPORT Lite (BCM7278) under both endian

Florian Fainelli (4):
  net: systemport: Use correct I/O accessors
  net: dsa: bcm_sf2: Use correct I/O accessors
  net: systemport: Set correct RSB endian bits based on host
  net: phy: mdio-bcm-unimac: Use correct I/O accessors

 drivers/net/dsa/bcm_sf2.h  | 12 +--
 drivers/net/ethernet/broadcom/bcmsysport.c | 20 +++
 drivers/net/phy/mdio-bcm-unimac.c  | 32 --
 3 files changed, 44 insertions(+), 20 deletions(-)

-- 
1.9.1



Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi

2017-08-29 Thread Michael S. Tsirkin
On Tue, Aug 29, 2017 at 03:53:08PM -0400, Willem de Bruijn wrote:
> On Tue, Aug 29, 2017 at 3:42 PM, Michael S. Tsirkin  wrote:
> > On Tue, Aug 29, 2017 at 03:35:38PM -0400, Willem de Bruijn wrote:
> >> By the way, I have had an unrelated patch outstanding for a while
> >> to have virtio-net support the VIRTIO_CONFIG_S_NEEDS_RESET
> >> command. Will send that as RFC.
> >
> > Oh nice.
> 
> Great :)
> 
> > One needs to be careful about locking there which is why
> > no devices support that yet.
> 
> I originally wrote it based on the virtnet_reset function introduced
> for xdp. Calling this from virtnet_config_changed_work is non trivial,
> as virtnet_freeze_down waits until no config worker is running.
> 
> Otherwise, I could not find any constraints on when freeze may be
> called, and it largely follows the same path. I hope I didn't miss anything.

The issue is that on freeze processes are not running so we
generally know no new packets will arrive (might be wrong
for bridging, then it's a bug). On device error you must
prevent new skbs from coming in, etc.

-- 
MST


[PATCH net-next v2 1/4] net: systemport: Use correct I/O accessors

2017-08-29 Thread Florian Fainelli
The SYSTEMPORT driver currently uses __raw_{read,write}l which means
native I/O endian. This works correctly for an ARM LE kernel (default)
but fails miserably on an ARM BE (BE8) kernel where registers are kept
little endian, so replace uses with {read,write}l_relaxed here which is
what we want because this is all performance sensitive code.

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index b3a21418f511..a7e84292af50 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -32,13 +32,13 @@
 #define BCM_SYSPORT_IO_MACRO(name, offset) \
 static inline u32 name##_readl(struct bcm_sysport_priv *priv, u32 off) \
 {  \
-   u32 reg = __raw_readl(priv->base + offset + off);   \
+   u32 reg = readl_relaxed(priv->base + offset + off); \
return reg; \
 }  \
 static inline void name##_writel(struct bcm_sysport_priv *priv,
\
  u32 val, u32 off) \
 {  \
-   __raw_writel(val, priv->base + offset + off);   \
+   writel_relaxed(val, priv->base + offset + off); \
 }  \
 
 BCM_SYSPORT_IO_MACRO(intrl2_0, SYS_PORT_INTRL2_0_OFFSET);
@@ -59,14 +59,14 @@ static inline u32 rdma_readl(struct bcm_sysport_priv *priv, 
u32 off)
 {
if (priv->is_lite && off >= RDMA_STATUS)
off += 4;
-   return __raw_readl(priv->base + SYS_PORT_RDMA_OFFSET + off);
+   return readl_relaxed(priv->base + SYS_PORT_RDMA_OFFSET + off);
 }
 
 static inline void rdma_writel(struct bcm_sysport_priv *priv, u32 val, u32 off)
 {
if (priv->is_lite && off >= RDMA_STATUS)
off += 4;
-   __raw_writel(val, priv->base + SYS_PORT_RDMA_OFFSET + off);
+   writel_relaxed(val, priv->base + SYS_PORT_RDMA_OFFSET + off);
 }
 
 static inline u32 tdma_control_bit(struct bcm_sysport_priv *priv, u32 bit)
@@ -110,10 +110,10 @@ static inline void dma_desc_set_addr(struct 
bcm_sysport_priv *priv,
 dma_addr_t addr)
 {
 #ifdef CONFIG_PHYS_ADDR_T_64BIT
-   __raw_writel(upper_32_bits(addr) & DESC_ADDR_HI_MASK,
+   writel_relaxed(upper_32_bits(addr) & DESC_ADDR_HI_MASK,
 d + DESC_ADDR_HI_STATUS_LEN);
 #endif
-   __raw_writel(lower_32_bits(addr), d + DESC_ADDR_LO);
+   writel_relaxed(lower_32_bits(addr), d + DESC_ADDR_LO);
 }
 
 static inline void tdma_port_write_desc_addr(struct bcm_sysport_priv *priv,
-- 
1.9.1



[PATCH net-next v2 3/4] net: systemport: Set correct RSB endian bits based on host

2017-08-29 Thread Florian Fainelli
RSB_SWAP0 needs to match the host CPU endian, and it needs to be set
for LE and clear for BE. RSB_SWAP1 must always be cleared for SYSTEMPORT
Lite.

With these settings, we have the Receive Status Block always match the
host endian and we do not need to perform any conversion. Since there is
not necessarily a CONFIG_CPU_LITTLE_ENDIAN option defined, we test for
!CONFIG_CPU_BIG_ENDIAN which is guaranteed to be set.

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index a7e84292af50..931751e4f369 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -1762,10 +1762,14 @@ static void rbuf_init(struct bcm_sysport_priv *priv)
reg = rbuf_readl(priv, RBUF_CONTROL);
reg |= RBUF_4B_ALGN | RBUF_RSB_EN;
/* Set a correct RSB format on SYSTEMPORT Lite */
-   if (priv->is_lite) {
+   if (priv->is_lite)
reg &= ~RBUF_RSB_SWAP1;
+
+   /* Set a correct RSB format based on host endian */
+   if (!IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
reg |= RBUF_RSB_SWAP0;
-   }
+   else
+   reg &= ~RBUF_RSB_SWAP0;
rbuf_writel(priv, reg, RBUF_CONTROL);
 }
 
-- 
1.9.1



[PATCH net-next v2 4/4] net: phy: mdio-bcm-unimac: Use correct I/O accessors

2017-08-29 Thread Florian Fainelli
The driver currently uses __raw_{read,write}l which works for all
platforms supported: Broadcom MIPS LE/BE (native endian), ARM LE (native
endian) but not ARM BE (registers are still LE). Switch to using the
proper accessors for all platforms and explain why Broadcom MIPS BE is
special here, in doing so, we introduce a couple of helper functions to
abstract these differences.

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/mdio-bcm-unimac.c | 32 ++--
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/drivers/net/phy/mdio-bcm-unimac.c 
b/drivers/net/phy/mdio-bcm-unimac.c
index 73c5267a11fd..08e0647b85e2 100644
--- a/drivers/net/phy/mdio-bcm-unimac.c
+++ b/drivers/net/phy/mdio-bcm-unimac.c
@@ -47,18 +47,38 @@ struct unimac_mdio_priv {
void*wait_func_data;
 };
 
+static inline u32 unimac_mdio_readl(struct unimac_mdio_priv *priv, u32 offset)
+{
+   /* MIPS chips strapped for BE will automagically configure the
+* peripheral registers for CPU-native byte order.
+*/
+   if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
+   return __raw_readl(priv->base + offset);
+   else
+   return readl_relaxed(priv->base + offset);
+}
+
+static inline void unimac_mdio_writel(struct unimac_mdio_priv *priv, u32 val,
+ u32 offset)
+{
+   if (IS_ENABLED(CONFIG_MIPS) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
+   __raw_writel(val, priv->base + offset);
+   else
+   writel_relaxed(val, priv->base + offset);
+}
+
 static inline void unimac_mdio_start(struct unimac_mdio_priv *priv)
 {
u32 reg;
 
-   reg = __raw_readl(priv->base + MDIO_CMD);
+   reg = unimac_mdio_readl(priv, MDIO_CMD);
reg |= MDIO_START_BUSY;
-   __raw_writel(reg, priv->base + MDIO_CMD);
+   unimac_mdio_writel(priv, reg, MDIO_CMD);
 }
 
 static inline unsigned int unimac_mdio_busy(struct unimac_mdio_priv *priv)
 {
-   return __raw_readl(priv->base + MDIO_CMD) & MDIO_START_BUSY;
+   return unimac_mdio_readl(priv, MDIO_CMD) & MDIO_START_BUSY;
 }
 
 static int unimac_mdio_poll(void *wait_func_data)
@@ -87,7 +107,7 @@ static int unimac_mdio_read(struct mii_bus *bus, int phy_id, 
int reg)
 
/* Prepare the read operation */
cmd = MDIO_RD | (phy_id << MDIO_PMD_SHIFT) | (reg << MDIO_REG_SHIFT);
-   __raw_writel(cmd, priv->base + MDIO_CMD);
+   unimac_mdio_writel(priv, cmd, MDIO_CMD);
 
/* Start MDIO transaction */
unimac_mdio_start(priv);
@@ -96,7 +116,7 @@ static int unimac_mdio_read(struct mii_bus *bus, int phy_id, 
int reg)
if (ret)
return ret;
 
-   cmd = __raw_readl(priv->base + MDIO_CMD);
+   cmd = unimac_mdio_readl(priv, MDIO_CMD);
 
/* Some broken devices are known not to release the line during
 * turn-around, e.g: Broadcom BCM53125 external switches, so check for
@@ -118,7 +138,7 @@ static int unimac_mdio_write(struct mii_bus *bus, int 
phy_id,
/* Prepare the write operation */
cmd = MDIO_WR | (phy_id << MDIO_PMD_SHIFT) |
(reg << MDIO_REG_SHIFT) | (0x & val);
-   __raw_writel(cmd, priv->base + MDIO_CMD);
+   unimac_mdio_writel(priv, cmd, MDIO_CMD);
 
unimac_mdio_start(priv);
 
-- 
1.9.1



Re: [PATCH RFC 2/2] virtio_net: enable probing for NEEDS_RESET support

2017-08-29 Thread Michael S. Tsirkin
On Tue, Aug 29, 2017 at 04:27:41PM -0400, Willem de Bruijn wrote:
> On Tue, Aug 29, 2017 at 4:16 PM, Michael S. Tsirkin  wrote:
> > On Tue, Aug 29, 2017 at 04:07:59PM -0400, Willem de Bruijn wrote:
> >> From: Willem de Bruijn 
> >>
> >> Implement a mechanism to signal that a virtio device implements the
> >> VIRTIO_CONFIG_S_NEEDS_RESET command.
> >>
> >> Testing for VIRTIO_CONFIG_S_NEEDS_RESET support by issuing the request
> >> and verifying the reset state would require an expensive state change.
> >>
> >> To avoid that, add a status bit to the feature register used during
> >> feature negotiation between host and guest.
> >>
> >> Set the bit for virtio-net, which supports the feature.
> >>
> >> Signed-off-by: Willem de Bruijn 
> >
> > All virtio 1 devices have the reset feature
> 
> I don't quite follow. No device drivers implement this request currently?

Depends. Spec 1.0 describes the bit and that driver can respond
by reseting the device. You seem to do something else
in this patchset, but as designed in 1.0 it does not seem to need
a feature bit.

> > so maybe guest does
> > not need this flag. Does device need it? Does device really
> > care that guest can't recover?
> 
> If all device drivers support it, then the flag is not needed.
> 
> But as long as legacy device drivers can exist that do not support
> this feature, it has to have a way of differentiating between the two.

Why? Device won't set this unless it's in a bad state. In that case,
setting the bit is harmless even if drivers ignore it.

-- 
MST


Re: [PATCH net-next v3] bridge: fdb add and delete tracepoints

2017-08-29 Thread Nikolay Aleksandrov
On 29/08/17 23:16, Roopa Prabhu wrote:
> From: Roopa Prabhu 
> 
> A few useful tracepoints to trace bridge forwarding
> database updates.
> 
> Signed-off-by: Roopa Prabhu 
> ---
> v2: address comments from florian
> v3: remove stray character '=' in print (pointed out by florian)
> 
>  include/trace/events/bridge.h |   98 
> +
>  net/bridge/br_fdb.c   |7 +++
>  net/core/net-traces.c |6 +++
>  3 files changed, 111 insertions(+)
>  create mode 100644 include/trace/events/bridge.h
> 

Acked-by: Nikolay Aleksandrov 



Re: [PATCH] Documentation: fix little inconsistencies

2017-08-29 Thread Pavel Machek
On Tue 2017-08-29 13:09:45, Jonathan Corbet wrote:
> On Tue, 29 Aug 2017 09:50:57 -0700
> "Darrick J. Wong"  wrote:
> 
> > > Anything wrong here? It is fixing extra '+' in the ascii art...  
> > 
> > There's nothing incorrect here, I merely thought it odd to send a fix
> > for networking documentation to the ext4 list, but not netdev?
> 
> In fact, davem likes to handle networking documentation patches himself,
> so resending this one to netdev would be a good idea.

Its a typo in documentation, do we really need to waste any more time
with it? Dave, can you just ack it?

> > index 5e40e1f..6309e90 100644
> > --- a/Documentation/networking/switchdev.txt
> > +++ b/Documentation/networking/switchdev.txt
> > @@ -29,7 +29,7 @@ with SR-IOV or soft switches, such as OVS, are
> > possible.
> >    sw1p1  +  sw1p3  +  sw1p5  +  eth1
> >  +|+|+|+
> >  |||||||
> > - +--++++-+--++---+  +-+-+
> > + +--++++++---+  +-+-+

Besides, if documentation fixes should go to Davem/netdev,
getmaintainer.pl should say so:

pavel@duo:/data/l/linux-n900$ scripts/get_maintainer.pl -f
Documentation/networking
Jonathan Corbet  (maintainer:DOCUMENTATION)
"David S. Miller"  (commit_signer:52/87=60%)
David Howells 
(commit_signer:7/87=8%,authored:8/87=9%)
Florian Fainelli  (commit_signer:6/87=7%)
Mauro Carvalho Chehab 
(commit_signer:6/87=7%,authored:5/87=6%)
Yuchung Cheng  (commit_signer:6/87=7%)
Hangbin Liu  (authored:5/87=6%)
linux-...@vger.kernel.org (open list:DOCUMENTATION)
linux-ker...@vger.kernel.org (open list)
pavel@duo:/data/l/linux-n900$


Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


  1   2   3   >