Re: [RFC PATCH bpf-next 05/12] xdp: add MEM_TYPE_ZERO_COPY

2018-05-16 Thread Jesper Dangaard Brouer
On Tue, 15 May 2018 21:06:08 +0200
Björn Töpel  wrote:

> @@ -82,6 +88,10 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
>   int metasize;
>   int headroom;
>  
> + // XXX implement clone, copy, use "native" MEM_TYPE
> + if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
> + return NULL;
> +

There are going to be significant tradeoffs between AF_XDP zero-copy and
the copy-variant.  The copy-variant still has very attractive
RX-performance, and other benefits, like not exposing unrelated packets
to userspace (limiting those to the XDP filter).

Thus, as a user I would like to choose between AF_XDP zero-copy and the
copy-variant. Even if my NIC supports zero-copy, I may be interested in
enabling only the copy-variant. This patchset doesn't let me choose.

How do we expose this to userspace?
(Maybe as simple as an sockaddr_xdp->sxdp_flags flag?)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH 11/40] ipv6/flowlabel: simplify pid namespace lookup

2018-05-16 Thread Eric W. Biederman
Christoph Hellwig  writes:

> On Sat, May 05, 2018 at 07:37:33AM -0500, Eric W. Biederman wrote:
>> Christoph Hellwig  writes:
>> 
>> > The whole seq_file sequence already operates under a single RCU lock pair,
>> > so move the pid namespace lookup into it, and stop grabbing a reference
>> > and remove all kinds of boilerplate code.
>> 
>> This is wrong.
>> 
>> Moving task_active_pid_ns(current) from open to seq_start means that
>> the results will change if you pass this proc file between callers.
>> So this breaks file descriptor passing.
>> 
>> Open is a bad place to access current.  Accessing current in the
>> middle of read/write is broken.
>> 
>> 
>> In this particular instance looking up the pid namespace with
>> task_active_pid_ns was a personal brain fart.  What the code should be
>> doing (with an appropriate helper) is:
>> 
>> struct pid_namespace *pid_ns = inode->i_sb->s_fs_info;
>> 
>> Because each mount of proc is bound to a pid namespace.  Looking up the
>> pid namespace from the super_block is a much better way to go.
>
> What do you have in mind for the helper?  For now I've thrown it in
> opencoded into my working tree, but I'd be glad to add a helper.
>
> struct pid_namespace *proc_pid_namespace(struct inode *inode)
> {
>   // maybe warn on for s_magic not on procfs??
>   return inode->i_sb->s_fs_info;
> }

That should work.  Ideally out of line for the proc_fs.h version.
Basically it should be a cousin of PDE_DATA.

Eric



Re: [PATCH net-next 1/3] net: ethernet: ti: Allow most drivers with COMPILE_TEST

2018-05-16 Thread kbuild test robot
Hi Florian,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-Allow-more-drivers-with-COMPILE_TEST/20180517-092807
config: xtensa-allyesconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa 

All warnings (new ones prefixed by >>):

   drivers/net//ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit':
>> drivers/net//ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 
>> 1 of 'writel_relaxed' makes integer from pointer without a cast 
>> [-Wint-conversion]
 writel_relaxed(token, &desc->sw_token);
^
   In file included from arch/xtensa/include/asm/io.h:83:0,
from include/linux/scatterlist.h:9,
from include/linux/dma-mapping.h:11,
from drivers/net//ethernet/ti/davinci_cpdma.c:21:
   include/asm-generic/io.h:303:24: note: expected 'u32 {aka unsigned int}' but 
argument is of type 'void *'
#define writel_relaxed writel_relaxed
   ^
>> include/asm-generic/io.h:304:20: note: in expansion of macro 'writel_relaxed'
static inline void writel_relaxed(u32 value, volatile void __iomem *addr)
   ^~
--
   drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit':
   drivers/net/ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 1 
of 'writel_relaxed' makes integer from pointer without a cast [-Wint-conversion]
 writel_relaxed(token, &desc->sw_token);
^
   In file included from arch/xtensa/include/asm/io.h:83:0,
from include/linux/scatterlist.h:9,
from include/linux/dma-mapping.h:11,
from drivers/net/ethernet/ti/davinci_cpdma.c:21:
   include/asm-generic/io.h:303:24: note: expected 'u32 {aka unsigned int}' but 
argument is of type 'void *'
#define writel_relaxed writel_relaxed
   ^
>> include/asm-generic/io.h:304:20: note: in expansion of macro 'writel_relaxed'
static inline void writel_relaxed(u32 value, volatile void __iomem *addr)
   ^~

vim +/writel_relaxed +1083 drivers/net//ethernet/ti/davinci_cpdma.c

ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1029  
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1030  int cpdma_chan_submit(struct cpdma_chan *chan, void *token, void *data,
aef614e1 drivers/net/ethernet/ti/davinci_cpdma.c Sebastian Siewior 2013-04-23  
1031   int len, int directed)
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1032  {
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1033 struct cpdma_ctlr   *ctlr = chan->ctlr;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1034 struct cpdma_desc __iomem   *desc;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1035 dma_addr_t  buffer;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1036 unsigned long   flags;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1037 u32 mode;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1038 int ret = 0;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1039  
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1040 spin_lock_irqsave(&chan->lock, flags);
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1041  
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1042 if (chan->state == CPDMA_STATE_TEARDOWN) {
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1043 ret = -EINVAL;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1044 goto unlock_ret;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1045 }
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1046  
742fb20f drivers/net/ethernet/ti/davinci_cpdma.c Grygorii Strashko 2016-06-27  
1047 if (chan->count >= chan->desc_num)  {
742fb20f drivers/net/ethernet/ti/davinci_cpdma.c Grygorii Strashko 2016-06-27  
1048 chan->stats.desc_alloc_fail++;
742fb20f drivers/net/ethernet/ti/davinci_cpdma.c Grygorii Strashko 2016-06-27  
1049 ret = -ENOMEM;
742fb20f 



Re: [PATCH net-next 2/3] net: ethernet: freescale: Allow FEC with COMPILE_TEST

2018-05-16 Thread kbuild test robot
Hi Florian,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-Allow-more-drivers-with-COMPILE_TEST/20180517-092807
config: m68k-allmodconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   In file included from include/linux/swab.h:5:0,
from include/uapi/linux/byteorder/big_endian.h:13,
from include/linux/byteorder/big_endian.h:5,
from arch/m68k/include/uapi/asm/byteorder.h:5,
from include/asm-generic/bitops/le.h:6,
from arch/m68k/include/asm/bitops.h:519,
from include/linux/bitops.h:38,
from include/linux/kernel.h:11,
from include/linux/list.h:9,
from include/linux/module.h:9,
from drivers/net//ethernet/freescale/fec_main.c:24:
   drivers/net//ethernet/freescale/fec_main.c: In function 'fec_restart':
>> drivers/net//ethernet/freescale/fec_main.c:959:26: error: 'FEC_RACC' 
>> undeclared (first use in this function); did you mean 'FEC_RXIC1'?
  val = readl(fep->hwp + FEC_RACC);
 ^
   include/uapi/linux/swab.h:117:32: note: in definition of macro '__swab32'
 (__builtin_constant_p((__u32)(x)) ? \
   ^
   include/linux/byteorder/generic.h:89:21: note: in expansion of macro 
'__le32_to_cpu'
#define le32_to_cpu __le32_to_cpu
^
   arch/m68k/include/asm/io_mm.h:452:26: note: in expansion of macro 'in_le32'
#define readl(addr)  in_le32(addr)
 ^~~
   drivers/net//ethernet/freescale/fec_main.c:959:9: note: in expansion of 
macro 'readl'
  val = readl(fep->hwp + FEC_RACC);
^
   drivers/net//ethernet/freescale/fec_main.c:959:26: note: each undeclared 
identifier is reported only once for each function it appears in
  val = readl(fep->hwp + FEC_RACC);
 ^
   include/uapi/linux/swab.h:117:32: note: in definition of macro '__swab32'
 (__builtin_constant_p((__u32)(x)) ? \
   ^
   include/linux/byteorder/generic.h:89:21: note: in expansion of macro 
'__le32_to_cpu'
#define le32_to_cpu __le32_to_cpu
^
   arch/m68k/include/asm/io_mm.h:452:26: note: in expansion of macro 'in_le32'
#define readl(addr)  in_le32(addr)
 ^~~
   drivers/net//ethernet/freescale/fec_main.c:959:9: note: in expansion of 
macro 'readl'
  val = readl(fep->hwp + FEC_RACC);
^
   In file included from arch/m68k/include/asm/io_mm.h:27:0,
from arch/m68k/include/asm/io.h:5,
from include/linux/scatterlist.h:9,
from include/linux/dma-mapping.h:11,
from include/linux/skbuff.h:34,
from include/linux/if_ether.h:23,
from include/uapi/linux/ethtool.h:19,
from include/linux/ethtool.h:18,
from include/linux/netdevice.h:41,
from drivers/net//ethernet/freescale/fec_main.c:34:
   drivers/net//ethernet/freescale/fec_main.c:968:38: error: 'FEC_FTRL' 
undeclared (first use in this function); did you mean 'FEC_ECNTRL'?
  writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
 ^
   arch/m68k/include/asm/raw_io.h:48:64: note: in definition of macro 'out_le32'
#define out_le32(addr,l) (void)((*(__force volatile __le32 *) (addr)) = 
cpu_to_le32(l))
   ^~~~
   drivers/net//ethernet/freescale/fec_main.c:968:3: note: in expansion of 
macro 'writel'
  writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
  ^~
   drivers/net//ethernet/freescale/fec_main.c:1034:38: error: 'FEC_R_FIFO_RSEM' 
undeclared (first use in this function); did you mean 'FEC_FIFO_RAM'?
  writel(FEC_ENET_RSEM_V, fep->hwp + FEC_R_FIFO_RSEM);
 ^
   arch/m68k/include/asm/raw_io.h:48:64: note: in definition of macro 'out_le32'
#define out_le32(addr,l) (void)((*(__force volatile __le32 *) (addr)) = 
cpu_to_le32(l))
   ^~~~
   drivers/net//ethernet/freescale/fec_main.c:1034:3: note: in expansion of 
macro 'writel'
  writel(FEC_ENET_RSEM_V, fep->hwp + FEC_R_FIFO_RSEM);
  ^~
   drivers/net//ethernet/freescale/fec_main.c:1035:38: error: 'FEC_R_FIFO_RSFL' 
undeclared (first use in 

[PATCH net-next] vmxnet3: Replace msleep(1) with usleep_range()

2018-05-16 Thread YueHaibing
As documented in Documentation/timers/timers-howto.txt,
replace msleep(1) with usleep_range().

Signed-off-by: YueHaibing 
---
 drivers/net/vmxnet3/vmxnet3_drv.c | 6 +++---
 drivers/net/vmxnet3/vmxnet3_ethtool.c | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c 
b/drivers/net/vmxnet3/vmxnet3_drv.c
index 9ebe2a6..2234a33 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -2945,7 +2945,7 @@ vmxnet3_close(struct net_device *netdev)
 * completion.
 */
while (test_and_set_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state))
-   msleep(1);
+   usleep_range(1000, 2000);
 
vmxnet3_quiesce_dev(adapter);
 
@@ -2995,7 +2995,7 @@ vmxnet3_change_mtu(struct net_device *netdev, int new_mtu)
 * completion.
 */
while (test_and_set_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state))
-   msleep(1);
+   usleep_range(1000, 2000);
 
if (netif_running(netdev)) {
vmxnet3_quiesce_dev(adapter);
@@ -3567,7 +3567,7 @@ static void vmxnet3_shutdown_device(struct pci_dev *pdev)
 * completion.
 */
while (test_and_set_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state))
-   msleep(1);
+   usleep_range(1000, 2000);
 
if (test_and_set_bit(VMXNET3_STATE_BIT_QUIESCED,
 &adapter->state)) {
diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c 
b/drivers/net/vmxnet3/vmxnet3_ethtool.c
index 2ff2731..559db05 100644
--- a/drivers/net/vmxnet3/vmxnet3_ethtool.c
+++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
@@ -600,7 +600,7 @@ vmxnet3_set_ringparam(struct net_device *netdev,
 * completion.
 */
while (test_and_set_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state))
-   msleep(1);
+   usleep_range(1000, 2000);
 
if (netif_running(netdev)) {
vmxnet3_quiesce_dev(adapter);
-- 
2.7.0




Re: [PATCH net-next 2/3] net: ethernet: freescale: Allow FEC with COMPILE_TEST

2018-05-16 Thread kbuild test robot
Hi Florian,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-Allow-more-drivers-with-COMPILE_TEST/20180517-092807
config: m68k-allyesconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=m68k 

All warnings (new ones prefixed by >>):

   In file included from include/linux/swab.h:5:0,
from include/uapi/linux/byteorder/big_endian.h:13,
from include/linux/byteorder/big_endian.h:5,
from arch/m68k/include/uapi/asm/byteorder.h:5,
from include/asm-generic/bitops/le.h:6,
from arch/m68k/include/asm/bitops.h:519,
from include/linux/bitops.h:38,
from include/linux/kernel.h:11,
from include/linux/list.h:9,
from include/linux/module.h:9,
from drivers/net/ethernet/freescale/fec_main.c:24:
   drivers/net/ethernet/freescale/fec_main.c: In function 'fec_restart':
   drivers/net/ethernet/freescale/fec_main.c:959:26: error: 'FEC_RACC' 
undeclared (first use in this function); did you mean 'FEC_RXIC0'?
  val = readl(fep->hwp + FEC_RACC);
 ^
   include/uapi/linux/swab.h:117:32: note: in definition of macro '__swab32'
 (__builtin_constant_p((__u32)(x)) ? \
   ^
>> include/linux/byteorder/generic.h:89:21: note: in expansion of macro 
>> '__le32_to_cpu'
#define le32_to_cpu __le32_to_cpu
^
>> arch/m68k/include/asm/io_mm.h:452:26: note: in expansion of macro 'in_le32'
#define readl(addr)  in_le32(addr)
 ^~~
>> drivers/net/ethernet/freescale/fec_main.c:959:9: note: in expansion of macro 
>> 'readl'
  val = readl(fep->hwp + FEC_RACC);
^
   drivers/net/ethernet/freescale/fec_main.c:959:26: note: each undeclared 
identifier is reported only once for each function it appears in
  val = readl(fep->hwp + FEC_RACC);
 ^
   include/uapi/linux/swab.h:117:32: note: in definition of macro '__swab32'
 (__builtin_constant_p((__u32)(x)) ? \
   ^
>> include/linux/byteorder/generic.h:89:21: note: in expansion of macro 
>> '__le32_to_cpu'
#define le32_to_cpu __le32_to_cpu
^
>> arch/m68k/include/asm/io_mm.h:452:26: note: in expansion of macro 'in_le32'
#define readl(addr)  in_le32(addr)
 ^~~
>> drivers/net/ethernet/freescale/fec_main.c:959:9: note: in expansion of macro 
>> 'readl'
  val = readl(fep->hwp + FEC_RACC);
^
   In file included from arch/m68k/include/asm/io_mm.h:27:0,
from arch/m68k/include/asm/io.h:5,
from include/linux/scatterlist.h:9,
from include/linux/dma-mapping.h:11,
from include/linux/skbuff.h:34,
from include/linux/if_ether.h:23,
from include/uapi/linux/ethtool.h:19,
from include/linux/ethtool.h:18,
from include/linux/netdevice.h:41,
from drivers/net/ethernet/freescale/fec_main.c:34:
   drivers/net/ethernet/freescale/fec_main.c:968:38: error: 'FEC_FTRL' 
undeclared (first use in this function); did you mean 'FEC_ECNTRL'?
  writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
 ^
   arch/m68k/include/asm/raw_io.h:48:64: note: in definition of macro 'out_le32'
#define out_le32(addr,l) (void)((*(__force volatile __le32 *) (addr)) = 
cpu_to_le32(l))
   ^~~~
>> drivers/net/ethernet/freescale/fec_main.c:968:3: note: in expansion of macro 
>> 'writel'
  writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
  ^~
   drivers/net/ethernet/freescale/fec_main.c:1034:38: error: 'FEC_R_FIFO_RSEM' 
undeclared (first use in this function); did you mean 'FEC_FIFO_RAM'?
  writel(FEC_ENET_RSEM_V, fep->hwp + FEC_R_FIFO_RSEM);
 ^
   arch/m68k/include/asm/raw_io.h:48:64: note: in definition of macro 'out_le32'
#define out_le32(addr,l) (void)((*(__force volatile __le32 *) (addr)) = 
cpu_to_le32(l))
   ^~~~
   drivers/net/ethernet/freescale/fec_main.c:1034:3: note: in expansion of 
macro 'writel'
  writel(FEC_ENET_RSEM_V, fep->hwp + FEC_R_FIFO_RSEM);
  ^~
   drivers/net/ethernet/freescale/fec_main.c:1035:38: error: 'FEC_R_FIFO_RSFL' 
undeclared (first 




Re: [PATCH net-next 1/3] net: ethernet: ti: Allow most drivers with COMPILE_TEST

2018-05-16 Thread kbuild test robot
Hi Florian,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-Allow-more-drivers-with-COMPILE_TEST/20180517-092807
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from arch/x86/include/asm/realmode.h:15:0,
from arch/x86/include/asm/acpi.h:33,
from arch/x86/include/asm/fixmap.h:19,
from arch/x86/include/asm/apic.h:10,
from arch/x86/include/asm/smp.h:13,
from include/linux/smp.h:64,
from include/linux/topology.h:33,
from include/linux/gfp.h:9,
from include/linux/idr.h:16,
from include/linux/kernfs.h:14,
from include/linux/sysfs.h:16,
from include/linux/kobject.h:20,
from include/linux/device.h:16,
from drivers/net/ethernet/ti/davinci_cpdma.c:17:
   drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit':
>> drivers/net/ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 1 
>> of '__writel' makes integer from pointer without a cast [-Wint-conversion]
 writel_relaxed(token, &desc->sw_token);
^
   arch/x86/include/asm/io.h:88:39: note: in definition of macro 
'writel_relaxed'
#define writel_relaxed(v, a) __writel(v, a)
  ^
   arch/x86/include/asm/io.h:71:18: note: expected 'unsigned int' but argument 
is of type 'void *'
build_mmio_write(__writel, "l", unsigned int, "r", )
 ^
   arch/x86/include/asm/io.h:53:20: note: in definition of macro 
'build_mmio_write'
static inline void name(type val, volatile void __iomem *addr) \
   ^~~~

vim +/__writel +1083 drivers/net/ethernet/ti/davinci_cpdma.c

ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1029  
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1030  int cpdma_chan_submit(struct cpdma_chan *chan, void *token, void *data,
aef614e1 drivers/net/ethernet/ti/davinci_cpdma.c Sebastian Siewior 2013-04-23  
1031   int len, int directed)
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1032  {
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1033 struct cpdma_ctlr   *ctlr = chan->ctlr;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1034 struct cpdma_desc __iomem   *desc;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1035 dma_addr_t  buffer;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1036 unsigned long   flags;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1037 u32 mode;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1038 int ret = 0;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1039  
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1040 spin_lock_irqsave(&chan->lock, flags);
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1041  
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1042 if (chan->state == CPDMA_STATE_TEARDOWN) {
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1043 ret = -EINVAL;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1044 goto unlock_ret;
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1045 }
ef8c2dab drivers/net/davinci_cpdma.c Cyril Chemparathy 2010-09-15  
1046  
742fb20f drivers/net/ethernet/ti/davinci_cpdma.c Grygorii Strashko 2016-06-27  
1047 if (chan->count >= chan->desc_num)  {
742fb20f drivers/net/ethernet/ti/davinci_cpdma.c Grygorii Strashko 2016-06-27  
1048 chan->stats.desc_alloc_fail++;
742fb20f drivers/net/ethernet/ti/davinci_cpdma.c Grygorii Strashko 2016-06-27  
1049 ret = -ENOMEM;
742fb20f drivers/net/ethernet/ti/davinci_cpdma.c Grygorii Strashko 2016-06-27  
1050 goto unlock_ret;
742fb20f drivers/net/ethernet/ti/davinci_cpdma.c Grygorii Strashko 2016-06-27  
1051 }
742fb20f drivers/net/ethernet/ti/davinci_cpdma.c Grygorii Strashko 2016-06-27  
1052  
742fb20f drivers/net/ethernet/ti/davinci_cpdma.c Grygorii Strashko 2016-06-27  
1053 desc = cpdma_desc_alloc(ctlr->pool);
ef8c2dab 

Re: xdp and fragments with virtio

2018-05-16 Thread David Ahern
On 5/16/18 1:24 AM, Jason Wang wrote:
> 
> 
> On 2018年05月16日 11:51, David Ahern wrote:
>> Hi Jason:
>>
>> I am trying to test MTU changes to the BPF fib_lookup helper and seeing
>> something odd. Hoping you can help.
>>
>> I have a VM with multiple virtio based NICs and tap backends. I install
>> the xdp program on eth1 and eth2 to do forwarding. In the host I send a
>> large packet to eth1:
>>
>> $ ping -s 1500 9.9.9.9
>>
>>
>> The tap device in the host sees 2 packets:
>>
>> $ sudo tcpdump -nv -i vm02-eth1
>> 20:44:33.943160 IP (tos 0x0, ttl 64, id 58746, offset 0, flags [+],
>> proto ICMP (1), length 1500)
>>  10.100.1.254 > 9.9.9.9: ICMP echo request, id 17917, seq 1,
>> length 1480
>> 20:44:33.943172 IP (tos 0x0, ttl 64, id 58746, offset 1480, flags
>> [none], proto ICMP (1), length 48)
>>  10.100.1.254 > 9.9.9.9: ip-proto-1
>>
>>
>> In the VM, the XDP program only sees the first packet, not the fragment.
>> I added a printk to the program (see diff below):
>>
>> $ cat trace_pipe
>>    -0 [003] ..s2   254.436467: 0: packet length 1514
>>
>>
>> Anything come to mind in the virtio xdp implementation that affects
>> fragment packets? I see this with both IPv4 and v6.
> 
> Not yet. But we do turn off tap gso when virtio has XDP set, but it
> shouldn't matter this case.
> 
> Will try to see what's wrong.
> 

I added this to the command line for the NICs and it works:

"mrg_rxbuf=off,guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off"

XDP program sees the full size packet and the fragment.

Fun fact: only adding mrg_rxbuf=off so that mergeable_rx_bufs is false
but big_packets is true generates a panic when it receives large packets.


Re: pull-request: bpf-next 2018-05-17

2018-05-16 Thread David Miller
From: Daniel Borkmann 
Date: Thu, 17 May 2018 03:09:48 +0200

> The following pull-request contains BPF updates for your *net-next*
> tree.

Looks good, pulled, thanks Daniel.


Re: [PATCH net-next v3 1/3] ipv4: support sport, dport and ip_proto in RTM_GETROUTE

2018-05-16 Thread David Miller
From: Roopa Prabhu 
Date: Wed, 16 May 2018 13:30:28 -0700

> yes, but we hold rcu read lock before calling the reply function for
> fib result.  I did consider allocating the skb before the read
> lock..but then the refactoring (into a separate netlink reply func)
> would seem unnecessary.
> 
> I am fine with pre-allocating and undoing the refactoring if that works 
> better.

Hmmm... I also notice that with this change we end up doing the
rtnl_unicast() under the RCU lock which is unnecessary too.

So yes, please pull the "out_skb" allocation before the
rcu_read_lock(), and push the rtnl_unicast() after the
rcu_read_unlock().

It really is a shame that sharing the ETH_P_IP skb between the route
lookup and the netlink response doesn't work properly.

I was using RTM_GETROUTE at one point for route/fib lookup performance
measurements.  It never was great at that, but now that there is going
to be two SKB allocations instead of one it is going to be even less
useful for that kind of usage.


[PATCH] bonding: introduce link change helper

2018-05-16 Thread Tonghao Zhang
Introduce a new common helper to avoid redundancy.

Signed-off-by: Tonghao Zhang 
---
 drivers/net/bonding/bond_main.c | 40 
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 718e491..3063a9c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2135,6 +2135,24 @@ static int bond_miimon_inspect(struct bonding *bond)
return commit;
 }
 
+static void bond_miimon_link_change(struct bonding *bond,
+   struct slave *slave,
+   char link)
+{
+   switch (BOND_MODE(bond)) {
+   case BOND_MODE_8023AD:
+   bond_3ad_handle_link_change(slave, link);
+   break;
+   case BOND_MODE_TLB:
+   case BOND_MODE_ALB:
+   bond_alb_handle_link_change(bond, slave, link);
+   break;
+   case BOND_MODE_XOR:
+   bond_update_slave_arr(bond, NULL);
+   break;
+   }
+}
+
 static void bond_miimon_commit(struct bonding *bond)
 {
struct list_head *iter;
@@ -2176,16 +2194,7 @@ static void bond_miimon_commit(struct bonding *bond)
slave->speed == SPEED_UNKNOWN ? 0 : 
slave->speed,
slave->duplex ? "full" : "half");
 
-   /* notify ad that the link status has changed */
-   if (BOND_MODE(bond) == BOND_MODE_8023AD)
-   bond_3ad_handle_link_change(slave, BOND_LINK_UP);
-
-   if (bond_is_lb(bond))
-   bond_alb_handle_link_change(bond, slave,
-   BOND_LINK_UP);
-
-   if (BOND_MODE(bond) == BOND_MODE_XOR)
-   bond_update_slave_arr(bond, NULL);
+   bond_miimon_link_change(bond, slave, BOND_LINK_UP);
 
if (!bond->curr_active_slave || slave == primary)
goto do_failover;
@@ -2207,16 +2216,7 @@ static void bond_miimon_commit(struct bonding *bond)
netdev_info(bond->dev, "link status definitely down for 
interface %s, disabling it\n",
slave->dev->name);
 
-   if (BOND_MODE(bond) == BOND_MODE_8023AD)
-   bond_3ad_handle_link_change(slave,
-   BOND_LINK_DOWN);
-
-   if (bond_is_lb(bond))
-   bond_alb_handle_link_change(bond, slave,
-   BOND_LINK_DOWN);
-
-   if (BOND_MODE(bond) == BOND_MODE_XOR)
-   bond_update_slave_arr(bond, NULL);
+   bond_miimon_link_change(bond, slave, BOND_LINK_DOWN);
 
if (slave == rcu_access_pointer(bond->curr_active_slave))
goto do_failover;
-- 
1.8.3.1



Re: [PATCH ghak81 V3 3/3] audit: collect audit task parameters

2018-05-16 Thread kbuild test robot
Hi Richard,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20180516]
[cannot apply to linus/master tip/sched/core v4.17-rc5 v4.17-rc4 v4.17-rc3 
v4.17-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Richard-Guy-Briggs/audit-group-task-params/20180517-090703
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   kernel/fork.c: In function 'copy_process':
>> kernel/fork.c:1739:3: error: 'struct task_struct' has no member named 'audit'
 p->audit = NULL;
  ^~

vim +1739 kernel/fork.c

  1728  
  1729  p->default_timer_slack_ns = current->timer_slack_ns;
  1730  
  1731  task_io_accounting_init(>ioac);
  1732  acct_clear_integrals(p);
  1733  
  1734  posix_cpu_timers_init(p);
  1735  
  1736  p->start_time = ktime_get_ns();
  1737  p->real_start_time = ktime_get_boot_ns();
  1738  p->io_context = NULL;
> 1739  p->audit = NULL;
  1740  cgroup_fork(p);
  1741  #ifdef CONFIG_NUMA
  1742  p->mempolicy = mpol_dup(p->mempolicy);
  1743  if (IS_ERR(p->mempolicy)) {
  1744  retval = PTR_ERR(p->mempolicy);
  1745  p->mempolicy = NULL;
  1746  goto bad_fork_cleanup_threadgroup_lock;
  1747  }
  1748  #endif
  1749  #ifdef CONFIG_CPUSETS
  1750  p->cpuset_mem_spread_rotor = NUMA_NO_NODE;
  1751  p->cpuset_slab_spread_rotor = NUMA_NO_NODE;
  1752  seqcount_init(>mems_allowed_seq);
  1753  #endif
  1754  #ifdef CONFIG_TRACE_IRQFLAGS
  1755  p->irq_events = 0;
  1756  p->hardirqs_enabled = 0;
  1757  p->hardirq_enable_ip = 0;
  1758  p->hardirq_enable_event = 0;
  1759  p->hardirq_disable_ip = _THIS_IP_;
  1760  p->hardirq_disable_event = 0;
  1761  p->softirqs_enabled = 1;
  1762  p->softirq_enable_ip = _THIS_IP_;
  1763  p->softirq_enable_event = 0;
  1764  p->softirq_disable_ip = 0;
  1765  p->softirq_disable_event = 0;
  1766  p->hardirq_context = 0;
  1767  p->softirq_context = 0;
  1768  #endif
  1769  
  1770  p->pagefault_disabled = 0;
  1771  

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




linux-next: manual merge of the net-next tree with the vfs tree

2018-05-16 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/ipv4/ipconfig.c

between commits:

  3f3942aca6da ("proc: introduce proc_create_single{,_data}")
  c04d2cb2009f ("ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_servers")

from the vfs tree and commit:

  4d019b3f80dc ("ipconfig: Create /proc/net/ipconfig directory")

from the net-next tree.

I fixed it up (see below - there may be more to do) and can carry the
fix as necessary. This is now fixed as far as linux-next is concerned,
but any non trivial conflicts should be mentioned to your upstream
maintainer when your tree is submitted for merging.  You may also want
to consider cooperating with the maintainer of the conflicting tree to
minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/ipv4/ipconfig.c
index bbcbcc113d19,86c9f755de3d..
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@@ -1282,6 -1317,74 +1317,61 @@@ static int pnp_seq_show(struct seq_fil
   &ic_servaddr);
return 0;
  }
 -
 -static int pnp_seq_open(struct inode *indoe, struct file *file)
 -{
 -  return single_open(file, pnp_seq_show, NULL);
 -}
 -
 -static const struct file_operations pnp_seq_fops = {
 -  .open   = pnp_seq_open,
 -  .read   = seq_read,
 -  .llseek = seq_lseek,
 -  .release= single_release,
 -};
 -
+ /* Create the /proc/net/ipconfig directory */
+ static int __init ipconfig_proc_net_init(void)
+ {
+   ipconfig_dir = proc_net_mkdir(&init_net, "ipconfig", init_net.proc_net);
+   if (!ipconfig_dir)
+   return -ENOMEM;
+ 
+   return 0;
+ }
+ 
+ /* Create a new file under /proc/net/ipconfig */
+ static int ipconfig_proc_net_create(const char *name,
+   const struct file_operations *fops)
+ {
+   char *pname;
+   struct proc_dir_entry *p;
+ 
+   if (!ipconfig_dir)
+   return -ENOMEM;
+ 
+   pname = kasprintf(GFP_KERNEL, "%s%s", "ipconfig/", name);
+   if (!pname)
+   return -ENOMEM;
+ 
+   p = proc_create(pname, 0444, init_net.proc_net, fops);
+   kfree(pname);
+   if (!p)
+   return -ENOMEM;
+ 
+   return 0;
+ }
+ 
+ /* Write NTP server IP addresses to /proc/net/ipconfig/ntp_servers */
+ static int ntp_servers_seq_show(struct seq_file *seq, void *v)
+ {
+   int i;
+ 
+   for (i = 0; i < CONF_NTP_SERVERS_MAX; i++) {
+   if (ic_ntp_servers[i] != NONE)
+   seq_printf(seq, "%pI4\n", &ic_ntp_servers[i]);
+   }
+   return 0;
+ }
+ 
+ static int ntp_servers_seq_open(struct inode *inode, struct file *file)
+ {
+   return single_open(file, ntp_servers_seq_show, NULL);
+ }
+ 
+ static const struct file_operations ntp_servers_seq_fops = {
+   .open   = ntp_servers_seq_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+ };
  #endif /* CONFIG_PROC_FS */
  
  /*
@@@ -1356,8 -1459,20 +1446,20 @@@ static int __init ip_auto_config(void
int err;
unsigned int i;
  
+   /* Initialise all name servers and NTP servers to NONE (but only if the
+* "ip=" or "nfsaddrs=" kernel command line parameters weren't decoded,
+* otherwise we'll overwrite the IP addresses specified there)
+*/
+   if (ic_set_manually == 0) {
+   ic_nameservers_predef();
+   ic_ntp_servers_predef();
+   }
+ 
  #ifdef CONFIG_PROC_FS
 -  proc_create("pnp", 0444, init_net.proc_net, &pnp_seq_fops);
 +  proc_create_single("pnp", 0444, init_net.proc_net, pnp_seq_show);
+ 
+   if (ipconfig_proc_net_init() == 0)
+   ipconfig_proc_net_create("ntp_servers", &ntp_servers_seq_fops);
  #endif /* CONFIG_PROC_FS */
  
if (!ic_enable)




Re: [PATCH 34/40] atm: simplify procfs code

2018-05-16 Thread Eric W. Biederman
Christoph Hellwig  writes:

> On Sat, May 05, 2018 at 07:51:18AM -0500, Eric W. Biederman wrote:
>> Christoph Hellwig  writes:
>> 
>> > Use remove_proc_subtree to remove the whole subtree on cleanup, and
>> > unwind the registration loop into individual calls.  Switch to use
>> > proc_create_seq where applicable.
>> 
>> Can you please explain why you are removing the error handling when
>> you are unwinding the registration loop?
>
> Because there is no point in handling these errors.  The code works
> perfectly fine without procfs, or without given proc files, and the
> removal works just fine if they don't exist either.  This is a very
> common pattern in various parts of the kernel already.
>
> I'll document it better in the changelog.

Thank you.  That is the kind of thing that could be a signal of
inattentiveness and problems, especially when it is not documented.

Eric



pull-request: bpf-next 2018-05-17

2018-05-16 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Provide a new BPF helper for doing a FIB and neighbor lookup
   in the kernel tables from an XDP or tc BPF program. The helper
   provides a fast-path for forwarding packets. The API supports
   IPv4, IPv6 and MPLS protocols, but currently IPv4 and IPv6 are
   implemented in this initial work, from David (Ahern).

2) Just a tiny diff but huge feature enabled for nfp driver by
   extending the BPF offload beyond a pure host processing offload.
   Offloaded XDP programs are allowed to set the RX queue index and
   thus opening the door for defining a fully programmable RSS/n-tuple
   filter replacement. Once BPF decided on a queue already, the device
   data-path will skip the conventional RSS processing completely,
   from Jakub.

3) The original sockmap implementation was array based similar to
   devmap. However unlike devmap where an ifindex has a 1:1 mapping
   into the map there are use cases with sockets that need to be
   referenced using longer keys. Hence, sockhash map is added reusing
   as much of the sockmap code as possible, from John.

4) Introduce BTF ID. The ID is allocated through an IDR, similarly
   to BPF maps and progs. It also makes BTF accessible to user
   space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
   through BPF_OBJ_GET_INFO_BY_FD, from Martin.

5) Enable BPF stackmap with build_id also in NMI context. The
   mmap_sem needed for parsing build_id can only be try-locked there,
   and the matching up_read() must not run in NMI context, so this
   work defers the up_read() via a per-cpu irq_work so that at
   least limited support can be enabled, from Song.

6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
   JIT conversion as well as implementation of an optimized 32/64 bit
   immediate load in the arm64 JIT that allows to reduce the number of
   emitted instructions; in case of tested real-world programs they
   were shrinking by three percent, from Daniel.

7) Add ifindex parameter to the libbpf loader in order to enable
   BPF offload support. Right now only iproute2 can load offloaded
   BPF and this will also enable libbpf for direct integration into
   other applications, from David (Beckett).

8) Convert the plain text documentation under Documentation/bpf/ into
   RST format since this is the appropriate standard the kernel is
   moving to for all documentation. Also add an overview README.rst,
   from Jesper.

9) Add __printf verification attribute to the bpf_verifier_vlog()
   helper. Though it uses va_list we can still allow gcc to check
   the format string, from Mathieu.

10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
is a bash 4.0+ feature which is not guaranteed to be available
when calling out to shell, therefore use a more portable variant,
from Joe.

11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
instead of relying on the gcc built-in, from Björn.

12) Fix a sock hashmap kmalloc warning reported by syzbot when an
overly large key size is used in hashmap then causing overflows
in htab->elem_size. Reject bogus attr->key_size early in the
sock_hash_alloc(), from Yonghong.

13) Ensure in BPF selftests when urandom_read is being linked that
--build-id is always enabled so that test_stacktrace_build_id[_nmi]
won't be failing, from Alexei.

14) Add bitsperlong.h as well as errno.h uapi headers into the tools
header infrastructure which point to one of the arch specific
uapi headers. This was needed in order to fix a build error on
some systems for the BPF selftests, from Sirio.

15) Allow for short options to be used in the xdp_monitor BPF sample
code. And also a bpf.h tools uapi header sync in order to fix a
selftest build failure. Both from Prashant.

16) More formally clarify the meaning of ID in the direct packet access
section of the BPF documentation, from Wang.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!



The following changes since commit 53a7bdfb2a2756cce8003b90817f8a6fb4d830d9:

  dt-bindings: dsa: Remove unnecessary #address/#size-cells (2018-05-08 
20:28:44 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to e23afe5e7cba89cd0744c5218eda1b3553455c17:

  bpf: sockmap, on update propagate errors back to userspace (2018-05-17 
01:48:22 +0200)


Alexei Starovoitov (4):
  Merge branch 'bpf-jit-cleanups'
  Merge branch 'fix-samples'
  Merge branch 'convert-doc-to-rst'
  selftests/bpf: make sure build-id is on

Björn Töpel (1):
  xsk: fix 64-bit division

Daniel Borkmann (14):
  Merge branch 'bpf-btf-id'
  Merge branch 'bpf-nfp-programmable-rss'
  

Re: [RFC bpf-next 06/11] bpf: Add reference tracking to verifier

2018-05-16 Thread Joe Stringer
On 14 May 2018 at 20:04, Alexei Starovoitov
 wrote:
> On Wed, May 09, 2018 at 02:07:04PM -0700, Joe Stringer wrote:
>> Allow helper functions to acquire a reference and return it into a
>> register. Specific pointer types such as the PTR_TO_SOCKET will
>> implicitly represent such a reference. The verifier must ensure that
>> these references are released exactly once in each path through the
>> program.
>>
>> To achieve this, this commit assigns an id to the pointer and tracks it
>> in the 'bpf_func_state', then when the function or program exits,
>> verifies that all of the acquired references have been freed. When the
>> pointer is passed to a function that frees the reference, it is removed
>> from the 'bpf_func_state` and all existing copies of the pointer in
>> registers are marked invalid.
>>
>> Signed-off-by: Joe Stringer 
>> ---
>>  include/linux/bpf_verifier.h |  18 ++-
>>  kernel/bpf/verifier.c| 295 
>> ---
>>  2 files changed, 292 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>> index 9dcd87f1d322..8dbee360b3ec 100644
>> --- a/include/linux/bpf_verifier.h
>> +++ b/include/linux/bpf_verifier.h
>> @@ -104,6 +104,11 @@ struct bpf_stack_state {
>>   u8 slot_type[BPF_REG_SIZE];
>>  };
>>
>> +struct bpf_reference_state {
>> + int id;
>> + int insn_idx; /* allocation insn */
>
> the insn_idx is for more verbose messages, right?
> It doesn't seem to affect the safety of the algorithm.
> Please add a comment to clarify that.

Yup, will do.

>> +/* Acquire a pointer id from the env and update the state->refs to include
>> + * this new pointer reference.
>> + * On success, returns a valid pointer id to associate with the register
>> + * On failure, returns a negative errno.
>> + */
>> +static int acquire_reference_state(struct bpf_verifier_env *env, int 
>> insn_idx)
>> +{
>> + struct bpf_func_state *state = cur_func(env);
>> + int new_ofs = state->acquired_refs;
>> + int id, err;
>> +
>> + err = realloc_reference_state(state, state->acquired_refs + 1, true);
>> + if (err)
>> + return err;
>> + id = ++env->id_gen;
>> + state->refs[new_ofs].id = id;
>> + state->refs[new_ofs].insn_idx = insn_idx;
>
> I thought that we may avoid this extra 'ref_state' array if we store
> 'id' into 'aux' array which is one to one to array of instructions
> and avoid this expensive reallocs, but then I realized we can go
> through the same instruction that returns a pointer to socket
> multiple times and every time it needs to be different 'id' and
> tracked independently, so yeah. All that infra is necessary.
> Would be good to document the algorithm a bit more.

Good point, I'll add these details to the bpf_reference_state definition.
Will consider other areas that could receive some docs attention.

>> @@ -2498,6 +2711,15 @@ static int check_helper_call(struct bpf_verifier_env 
>> *env, int func_id, int insn
>>   return err;
>>   }
>>
>> + /* If the function is a release() function, mark all copies of the same
>> +  * pointer as "freed" in all registers and in the stack.
>> +  */
>> + if (is_release_function(func_id)) {
>> + err = release_reference(env);
>
> I think this can be improved if check_func_arg() stores ptr_id into meta.
> Then this loop
>  for (i = BPF_REG_1; i < BPF_REG_6; i++) {
>        if (reg_is_refcounted(&regs[i])) {
> in release_reference() won't be needed.

That's a nice cleanup.

> Also the macros from the previous patch look ugly, but considering this patch
> I guess it's justified. At least I don't see a better way of doing it.

Completely agree, ugly, but I also didn't see a great alternative.


Re: [PATCH v3] {net, IB}/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'

2018-05-16 Thread Saeed Mahameed
On Wed, 2018-05-16 at 21:07 +0200, Christophe JAILLET wrote:
> When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used
> to
> free it.
> 
> Fixes: 1cbe6fc86ccfe ("IB/mlx5: Add support for CQE compressing")
> Fixes: fed9ce22bf8ae ("net/mlx5: E-Switch, Add API to create vport rx
> rules")
> Fixes: 9efa75254593d ("net/mlx5_core: Introduce access functions to
> query vport RoCE fields")
> Signed-off-by: Christophe JAILLET 
> ---
> v1 -> v2: More places to update have been added to the patch
> v2 -> v3: Add Fixes tag
> 
> 3 patches with one Fixes tag each should probably be better, but
> honestly, I won't send a v4.
Feel free to split it if needed.

Applied to mlx5-next, thanks Christophe!

> ---
>  drivers/infiniband/hw/mlx5/cq.c| 2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/vport.c| 6 +++---
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/mlx5/cq.c
> b/drivers/infiniband/hw/mlx5/cq.c
> index 77d257ec899b..6d52ea03574e 100644
> --- a/drivers/infiniband/hw/mlx5/cq.c
> +++ b/drivers/infiniband/hw/mlx5/cq.c
> @@ -849,7 +849,7 @@ static int create_cq_user(struct mlx5_ib_dev
> *dev, struct ib_udata *udata,
>   return 0;
>  
>  err_cqb:
> - kfree(*cqb);
> + kvfree(*cqb);
>  
>  err_db:
>   mlx5_ib_db_unmap_user(to_mucontext(context), &cq->db);
> diff --git
> a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index 35e256eb2f6e..b123f8a52ad8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -663,7 +663,7 @@ static int esw_create_vport_rx_group(struct
> mlx5_eswitch *esw)
>  
>   esw->offloads.vport_rx_group = g;
>  out:
> - kfree(flow_group_in);
> + kvfree(flow_group_in);
>   return err;
>  }
>  
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> index 177e076b8d17..719cecb182c6 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> @@ -511,7 +511,7 @@ int mlx5_query_nic_vport_system_image_guid(struct
> mlx5_core_dev *mdev,
>   *system_image_guid = MLX5_GET64(query_nic_vport_context_out, out,
>   nic_vport_context.system_image_guid);
>  
> - kfree(out);
> + kvfree(out);
>  
>   return 0;
>  }
> @@ -531,7 +531,7 @@ int mlx5_query_nic_vport_node_guid(struct
> mlx5_core_dev *mdev, u64 *node_guid)
>   *node_guid = MLX5_GET64(query_nic_vport_context_out, out,
>   nic_vport_context.node_guid);
>  
> - kfree(out);
> + kvfree(out);
>  
>   return 0;
>  }
> @@ -587,7 +587,7 @@ int mlx5_query_nic_vport_qkey_viol_cntr(struct
> mlx5_core_dev *mdev,
>   *qkey_viol_cntr = MLX5_GET(query_nic_vport_context_out, out,
>  nic_vport_context.qkey_violation_counter);
>  
> - kfree(out);
> + kvfree(out);
>  
>   return 0;
>  }

[PATCH net] erspan: fix invalid erspan version.

2018-05-16 Thread William Tu
ERSPAN only supports versions 1 and 2.  When a packet is sent to an
erspan device which does not have a proper version number set,
drop the packet.  In a real case, we observed multicast packets
sent to the erspan pernet device, erspan0, which does not have
an erspan version configured.

Reported-by: Greg Rose 
Signed-off-by: William Tu 
---
 net/ipv4/ip_gre.c  | 4 +++-
 net/ipv6/ip6_gre.c | 5 -
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 2409e648454d..2d8efeecf619 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -734,10 +734,12 @@ static netdev_tx_t erspan_xmit(struct sk_buff *skb,
erspan_build_header(skb, ntohl(tunnel->parms.o_key),
tunnel->index,
truncate, true);
-   else
+   else if (tunnel->erspan_ver == 2)
erspan_build_header_v2(skb, ntohl(tunnel->parms.o_key),
   tunnel->dir, tunnel->hwid,
   truncate, true);
+   else
+   goto free_skb;
 
tunnel->parms.o_flags &= ~TUNNEL_KEY;
__gre_xmit(skb, dev, >parms.iph, htons(ETH_P_ERSPAN));
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index bede77f24784..d20072fc38cb 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -991,11 +991,14 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff 
*skb,
erspan_build_header(skb, ntohl(t->parms.o_key),
t->parms.index,
truncate, false);
-   else
+   else if (t->parms.erspan_ver == 2)
erspan_build_header_v2(skb, ntohl(t->parms.o_key),
   t->parms.dir,
   t->parms.hwid,
   truncate, false);
+   else
+   goto tx_err;
+
fl6.daddr = t->parms.raddr;
}
 
-- 
2.7.4



Re: [RFC bpf-next 04/11] bpf: Add PTR_TO_SOCKET verifier type

2018-05-16 Thread Joe Stringer
On 14 May 2018 at 19:37, Alexei Starovoitov
 wrote:
> On Wed, May 09, 2018 at 02:07:02PM -0700, Joe Stringer wrote:
>> Teach the verifier a little bit about a new type of pointer, a
>> PTR_TO_SOCKET. This pointer type is accessed from BPF through the
>> 'struct bpf_sock' structure.
>>
>> Signed-off-by: Joe Stringer 
>> ---
>>  include/linux/bpf.h  | 19 +-
>>  include/linux/bpf_verifier.h |  2 ++
>>  kernel/bpf/verifier.c| 86 
>> ++--
>>  net/core/filter.c| 30 +---
>>  4 files changed, 114 insertions(+), 23 deletions(-)
>
> Ack for patches 1-3. In this one few nits:
>
>> @@ -1723,6 +1752,16 @@ static int check_mem_access(struct bpf_verifier_env 
>> *env, int insn_idx, u32 regn
>>   err = check_packet_access(env, regno, off, size, false);
>>   if (!err && t == BPF_READ && value_regno >= 0)
>>   mark_reg_unknown(env, regs, value_regno);
>> +
>> + } else if (reg->type == PTR_TO_SOCKET) {
>> + if (t == BPF_WRITE) {
>> + verbose(env, "cannot write into socket\n");
>> + return -EACCES;
>> + }
>> + err = check_sock_access(env, regno, off, size, t);
>> + if (!err && t == BPF_READ && value_regno >= 0)
>
> t == BPF_READ check is unnecessary.
>
>> @@ -5785,7 +5845,13 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr 
>> *attr)
>>
>>   if (ret == 0)
>>   /* program is valid, convert *(u32*)(ctx + off) accesses */
>> - ret = convert_ctx_accesses(env);
>> + ret = convert_ctx_accesses(env, env->ops->convert_ctx_access,
>> +PTR_TO_CTX);
>> +
>> + if (ret == 0)
>> + /* Convert *(u32*)(sock_ops + off) accesses */
>> + ret = convert_ctx_accesses(env, bpf_sock_convert_ctx_access,
>> +PTR_TO_SOCKET);
>
> Overall looks great.
> Only this part is missing for PTR_TO_SOCKET:
>  } else if (dst_reg_type != *prev_dst_type &&
> (dst_reg_type == PTR_TO_CTX ||
>  *prev_dst_type == PTR_TO_CTX)) {
>  verbose(env, "same insn cannot be used with different 
> pointers\n");
>  return -EINVAL;
> similar logic has to be added.
> Otherwise the following will be accepted:
>
> R1 = sock_ptr
> goto X;
> ...
> R1 = some_other_valid_ptr;
> goto X;
> ...
>
> R2 = *(u32 *)(R1 + 0);
> this will be rewritten for first branch,
> but it's wrong for second.
>

Thanks for the review, will address these comments.


Re: [bpf-next PATCH] bpf: sockmap, on update propagate errors back to userspace

2018-05-16 Thread Daniel Borkmann
On 05/17/2018 01:38 AM, John Fastabend wrote:
> When an error happens in the update sockmap element logic also pass
> the err up to the user.
> 
> Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work with 
> hashmap")
> Signed-off-by: John Fastabend 

Agree, applied to bpf-next, thanks John!


[PATCH bpf] bpf: fix truncated jump targets on heavy expansions

2018-05-16 Thread Daniel Borkmann
Recently during testing, I ran into the following panic:

  [  207.892422] Internal error: Accessing user space memory outside uaccess.h 
routines: 9604 [#1] SMP
  [  207.901637] Modules linked in: binfmt_misc [...]
  [  207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: GW  
   4.17.0-rc3+ #7
  [  207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 
03/31/2017
  [  207.982428] pstate: 6045 (nZCv daif +PAN -UAO)
  [  207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
  [  207.992603] lr : 0x00bdb754
  [  207.996080] sp : 13703ca0
  [  207.999384] x29: 13703ca0 x28: 0001
  [  208.004688] x27: 0001 x26: 
  [  208.009992] x25: 13703ce0 x24: 800fb4afcb00
  [  208.015295] x23: 7d2f5038 x22: 7d2f5000
  [  208.020599] x21: feff2a6f x20: 000a
  [  208.025903] x19: 09578000 x18: 0a03
  [  208.031206] x17:  x16: 
  [  208.036510] x15: 9de83000 x14: 
  [  208.041813] x13:  x12: 
  [  208.047116] x11: 0001 x10: 089e7f18
  [  208.052419] x9 : feff2a6f x8 : 
  [  208.057723] x7 : 000a x6 : 00280c616000
  [  208.063026] x5 : 0018 x4 : 7db6
  [  208.068329] x3 : 0008647a x2 : 19868179b1484500
  [  208.073632] x1 :  x0 : 09578c08
  [  208.078938] Process test_verifier (pid: 2256, stack limit = 
0x49ca7974)
  [  208.086235] Call trace:
  [  208.088672]  bpf_skb_load_helper_8_no_cache+0x34/0xc0
  [  208.093713]  0x00bdb754
  [  208.096845]  bpf_test_run+0x78/0xf8
  [  208.100324]  bpf_prog_test_run_skb+0x148/0x230
  [  208.104758]  sys_bpf+0x314/0x1198
  [  208.108064]  el0_svc_naked+0x30/0x34
  [  208.111632] Code: 91302260 f941 f9001fa1 d281 (29500680)
  [  208.117717] ---[ end trace 263cb8a59b5bf29f ]---

The program itself which caused this had a long jump over the whole
instruction sequence where all of the inner instructions required
heavy expansions into multiple BPF instructions. Additionally, I also
had BPF hardening enabled which requires once more rewrites of all
constant values in order to blind them. Each time we rewrite insns,
bpf_adj_branches() would need to potentially adjust branch targets
which cross the patchlet boundary to accommodate for the additional
delta. Eventually that led to the case where the target offset could
no longer fit into insn->off's upper 0x7fff limit, where the offset
then wraps around and becomes negative (in the s16 universe), or vice
versa depending on the jump direction.

Therefore it becomes necessary to detect and reject any such occasions
in a generic way for native eBPF and cBPF to eBPF migrations. For
the latter we can simply check bounds in the bpf_convert_filter()'s
BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
of subsequent hardening) is a bit more complex in that we need to
detect such truncations before hitting the bpf_prog_realloc(). Thus
the latter is split into an extra pass to probe problematic offsets
on the original program in order to fail early. With that in place
and carefully tested I no longer hit the panic and the rewrites are
rejected properly. The above example panic I've seen on bpf-next,
though the issue itself is generic in that a guard against this issue
in bpf seems more appropriate in this case.

Signed-off-by: Daniel Borkmann 
---
 [ Will follow up with an additional test case in bpf-next. ]

 kernel/bpf/core.c | 100 --
 net/core/filter.c |  11 --
 2 files changed, 84 insertions(+), 27 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ba03ec3..6ef6746 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -218,47 +218,84 @@ int bpf_prog_calc_tag(struct bpf_prog *fp)
return 0;
 }
 
-static void bpf_adj_branches(struct bpf_prog *prog, u32 pos, u32 delta)
+static int bpf_adj_delta_to_imm(struct bpf_insn *insn, u32 pos, u32 delta,
+   u32 curr, const bool probe_pass)
 {
+   const s64 imm_min = S32_MIN, imm_max = S32_MAX;
+   s64 imm = insn->imm;
+
+   if (curr < pos && curr + imm + 1 > pos)
+   imm += delta;
+   else if (curr > pos + delta && curr + imm + 1 <= pos + delta)
+   imm -= delta;
+   if (imm < imm_min || imm > imm_max)
+   return -ERANGE;
+   if (!probe_pass)
+   insn->imm = imm;
+   return 0;
+}
+
+static int bpf_adj_delta_to_off(struct bpf_insn *insn, u32 pos, u32 delta,
+   u32 curr, const bool probe_pass)
+{
+   const s32 off_min = S16_MIN, off_max = S16_MAX;
+   s32 off = insn->off;
+
+   if (curr < pos && 

[bpf-next PATCH] bpf: sockmap, on update propagate errors back to userspace

2018-05-16 Thread John Fastabend
When an error happens in the update sockmap element logic also pass
the err up to the user.

Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work with 
hashmap")
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 79f5e89..c6de139 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1875,7 +1875,7 @@ static int sock_map_ctx_update_elem(struct 
bpf_sock_ops_kern *skops,
write_unlock_bh(>sk_callback_lock);
}
 out:
-   return 0;
+   return err;
 }
 
 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)



[PATCH net-next 5/8] tcp: new helper tcp_timeout_mark_lost

2018-05-16 Thread Yuchung Cheng
Refactor using a new helper, tcp_timeout_mark_lost(), that marks packets
lost upon RTO.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Reviewed-by: Eric Dumazet 
Reviewed-by: Soheil Hassas Yeganeh 
Reviewed-by: Priyaranjan Jha 
---
 net/ipv4/tcp_input.c | 50 +---
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6fb0a28977a0..af32accda2a9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1917,18 +1917,43 @@ static inline void tcp_init_undo(struct tcp_sock *tp)
tp->undo_retrans = tp->retrans_out ? : -1;
 }
 
-/* Enter Loss state. If we detect SACK reneging, forget all SACK information
+/* If we detect SACK reneging, forget all SACK information
  * and reset tags completely, otherwise preserve SACKs. If receiver
  * dropped its ofo queue, we will know this due to reneging detection.
  */
+static void tcp_timeout_mark_lost(struct sock *sk)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+   struct sk_buff *skb;
+   bool is_reneg;  /* is receiver reneging on SACKs? */
+
+   skb = tcp_rtx_queue_head(sk);
+   is_reneg = skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED);
+   if (is_reneg) {
+   NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
+   tp->sacked_out = 0;
+   /* Mark SACK reneging until we recover from this loss event. */
+   tp->is_sack_reneg = 1;
+   } else if (tcp_is_reno(tp)) {
+   tcp_reset_reno_sack(tp);
+   }
+
+   skb_rbtree_walk_from(skb) {
+   if (is_reneg)
+   TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
+   tcp_mark_skb_lost(sk, skb);
+   }
+   tcp_verify_left_out(tp);
+   tcp_clear_all_retrans_hints(tp);
+}
+
+/* Enter Loss state. */
 void tcp_enter_loss(struct sock *sk)
 {
const struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct net *net = sock_net(sk);
-   struct sk_buff *skb;
bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
-   bool is_reneg;  /* is receiver reneging on SACKs? */
 
/* Reduce ssthresh if it has not yet been made inside this window. */
if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
@@ -1944,24 +1969,7 @@ void tcp_enter_loss(struct sock *sk)
tp->snd_cwnd_cnt   = 0;
tp->snd_cwnd_stamp = tcp_jiffies32;
 
-   if (tcp_is_reno(tp))
-   tcp_reset_reno_sack(tp);
-
-   skb = tcp_rtx_queue_head(sk);
-   is_reneg = skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED);
-   if (is_reneg) {
-   NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
-   tp->sacked_out = 0;
-   /* Mark SACK reneging until we recover from this loss event. */
-   tp->is_sack_reneg = 1;
-   }
-   skb_rbtree_walk_from(skb) {
-   if (is_reneg)
-   TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
-   tcp_mark_skb_lost(sk, skb);
-   }
-   tcp_verify_left_out(tp);
-   tcp_clear_all_retrans_hints(tp);
+   tcp_timeout_mark_lost(sk);
 
/* Timeout in disordered state after receiving substantial DUPACKs
 * suggests that the degree of reordering is over-estimated.
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH net-next 1/8] tcp: support DUPACK threshold in RACK

2018-05-16 Thread Yuchung Cheng
This patch adds support for the classic DUPACK threshold rule
(#DupThresh) in RACK.

When the number of packets SACKed is greater than or equal to the
threshold, RACK sets the reordering window to zero, which would
immediately mark all the unsacked packets below the highest SACKed
sequence as lost. Since this approach is known to not work well with
reordering, RACK only uses it if no reordering has been observed.

The DUPACK threshold rule is a particularly useful extension to the
fast recoveries triggered by RACK reordering timer. For example
data-center transfers where the RTT is much smaller than a timer
tick, or high RTT path where the default RTT/4 may take too long.

Note that this patch differs slightly from RFC6675. RFC6675
considers a packet lost when at least #DupThresh higher-sequence
packets are SACKed.

With RACK, for connections that have seen reordering, RACK
continues to use a dynamically-adaptive time-based reordering
window to detect losses. But for connections on which we have not
yet seen reordering, this patch considers a packet lost when at
least one higher sequence packet is SACKed and the total number
of SACKed packets is at least DupThresh. For example, suppose a
connection has not seen reordering, and sends 10 packets, and
packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2
lost. RACK considers packets 1, 2, 4, 6 lost.

There is some small risk of spurious retransmits here due to
reordering. However, this is mostly limited to the first flight of
a connection on which the sender receives SACKs from reordering.
And RFC 6675 and FACK loss detection have a similar risk on the
first flight with reordering (it's just that the risk of spurious
retransmits from reordering was slightly narrower for those older
algorithms due to the margin of 3*MSS).

Also the minimum reordering window is reduced from 1 msec to 0
to recover quicker on short RTT transfers. Therefore RACK is more
aggressive in marking packets lost during recovery to reduce the
reordering window timeouts.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Reviewed-by: Eric Dumazet 
Reviewed-by: Soheil Hassas Yeganeh 
Reviewed-by: Priyaranjan Jha 
---
 Documentation/networking/ip-sysctl.txt |  1 +
 include/net/tcp.h  |  1 +
 net/ipv4/tcp_recovery.c| 40 +-
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 59afc9a10b4f..13bbac50dc8b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -451,6 +451,7 @@ tcp_recovery - INTEGER
RACK: 0x1 enables the RACK loss detection for fast detection of lost
  retransmissions and tail drops.
RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
+   RACK: 0x4 disables RACK's DUPACK threshold heuristic
 
Default: 0x1
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3b1d617b0110..85000c85ddcd 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -245,6 +245,7 @@ extern long sysctl_tcp_mem[3];
 
 #define TCP_RACK_LOSS_DETECTION  0x1 /* Use RACK to detect losses */
 #define TCP_RACK_STATIC_REO_WND  0x2 /* Use static RACK reo wnd */
+#define TCP_RACK_NO_DUPTHRESH0x4 /* Do not use DUPACK threshold in RACK */
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 3a81720ac0c4..1c1bdf12a96f 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -21,6 +21,32 @@ static bool tcp_rack_sent_after(u64 t1, u64 t2, u32 seq1, 
u32 seq2)
return t1 > t2 || (t1 == t2 && after(seq1, seq2));
 }
 
+u32 tcp_rack_reo_wnd(const struct sock *sk)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+
+   if (!tp->rack.reord) {
+   /* If reordering has not been observed, be aggressive during
+* the recovery or starting the recovery by DUPACK threshold.
+*/
+   if (inet_csk(sk)->icsk_ca_state >= TCP_CA_Recovery)
+   return 0;
+
+   if (tp->sacked_out >= tp->reordering &&
+   !(sock_net(sk)->ipv4.sysctl_tcp_recovery & 
TCP_RACK_NO_DUPTHRESH))
+   return 0;
+   }
+
+   /* To be more reordering resilient, allow min_rtt/4 settling delay.
+* Use min_rtt instead of the smoothed RTT because reordering is
+* often a path property and less related to queuing or delayed ACKs.
+* Upon receiving DSACKs, linearly increase the window up to the
+* smoothed RTT.
+*/
+   return min((tcp_min_rtt(tp) >> 2) * tp->rack.reo_wnd_steps,
+  tp->srtt_us >> 3);
+}
+
 /* RACK loss detection (IETF draft draft-ietf-tcpm-rack-01):
  *
  * Marks a packet 

[PATCH net-next 4/8] tcp: account lost retransmit after timeout

2018-05-16 Thread Yuchung Cheng
The previous approach for the lost and retransmit bits was to
wipe the slate clean: zero all the lost and retransmit bits,
correspondingly zero the lost_out and retrans_out counters, and
then add back the lost bits (and correspondingly increment lost_out).

The new approach is to treat this very much like marking packets
lost in fast recovery. We don’t wipe the slate clean. We just say
that for all packets that were not yet marked sacked or lost, we now
mark them as lost in exactly the same way we do for fast recovery.

This fixes the lost retransmit accounting at RTO time and greatly
simplifies the RTO code by sharing much of the logic with Fast
Recovery.
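The incremental accounting described above can be sketched as a small userspace model. The flag bits and counter struct here are simplified assumptions standing in for `TCP_SKB_CB(skb)->sacked` and the `lost_out`/`retrans_out` fields, not the kernel's real layout.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for TCPCB_LOST / TCPCB_SACKED_RETRANS */
enum { CB_LOST = 0x1, CB_RETRANS = 0x2 };

struct acct {
    int lost_out;
    int retrans_out;
};

/* Mark one skb lost while keeping counters consistent: a newly lost
 * segment bumps lost_out, and a lost retransmit is also removed from
 * retrans_out -- the accounting this patch fixes at RTO time. */
static void mark_skb_lost(uint8_t *cb, struct acct *a, int pcount)
{
    if (!(*cb & CB_LOST)) {
        *cb |= CB_LOST;
        a->lost_out += pcount;
    }
    if (*cb & CB_RETRANS) {
        *cb &= ~CB_RETRANS;
        a->retrans_out -= pcount;
    }
}
```

Marking is idempotent on `lost_out`, which is what lets the RTO path share it with fast recovery instead of zeroing the counters first.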

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Reviewed-by: Eric Dumazet 
Reviewed-by: Soheil Hassas Yeganeh 
Reviewed-by: Priyaranjan Jha 
---
 include/net/tcp.h   |  1 +
 net/ipv4/tcp_input.c| 18 +++---
 net/ipv4/tcp_recovery.c |  4 ++--
 3 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d7f81325bee5..402484ed9b57 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1878,6 +1878,7 @@ void tcp_v4_init(void);
 void tcp_init(void);
 
 /* tcp_recovery.c */
+void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb);
 void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced);
 extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 076206873e3e..6fb0a28977a0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1929,7 +1929,6 @@ void tcp_enter_loss(struct sock *sk)
struct sk_buff *skb;
bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
bool is_reneg;  /* is receiver reneging on SACKs? */
-   bool mark_lost;
 
/* Reduce ssthresh if it has not yet been made inside this window. */
if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
@@ -1945,9 +1944,6 @@ void tcp_enter_loss(struct sock *sk)
tp->snd_cwnd_cnt   = 0;
tp->snd_cwnd_stamp = tcp_jiffies32;
 
-   tp->retrans_out = 0;
-   tp->lost_out = 0;
-
if (tcp_is_reno(tp))
tcp_reset_reno_sack(tp);
 
@@ -1959,21 +1955,13 @@ void tcp_enter_loss(struct sock *sk)
/* Mark SACK reneging until we recover from this loss event. */
tp->is_sack_reneg = 1;
}
-   tcp_clear_all_retrans_hints(tp);
-
skb_rbtree_walk_from(skb) {
-   mark_lost = (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) ||
-is_reneg);
-   if (mark_lost)
-   tcp_sum_lost(tp, skb);
-   TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS)|TCPCB_SACKED_ACKED;
-   if (mark_lost) {
+   if (is_reneg)
TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
-   TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
-   tp->lost_out += tcp_skb_pcount(skb);
-   }
+   tcp_mark_skb_lost(sk, skb);
}
tcp_verify_left_out(tp);
+   tcp_clear_all_retrans_hints(tp);
 
/* Timeout in disordered state after receiving substantial DUPACKs
 * suggests that the degree of reordering is over-estimated.
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 299b0e38aa9a..b2f9be388bf3 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -2,7 +2,7 @@
 #include 
 #include 
 
-static void tcp_rack_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
+void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
@@ -95,7 +95,7 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
remaining = tp->rack.rtt_us + reo_wnd -
tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
if (remaining <= 0) {
-   tcp_rack_mark_skb_lost(sk, skb);
+   tcp_mark_skb_lost(sk, skb);
list_del_init(&skb->tcp_tsorted_anchor);
} else {
/* Record maximum wait time */
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH net-next 8/8] tcp: don't mark recently sent packets lost on RTO

2018-05-16 Thread Yuchung Cheng
An RTO event indicates the head has not been acked for a long time
after its last (re)transmission. But the other packets are not
necessarily lost if they have been only sent recently (for example
due to application limit). This patch prohibits marking packets
sent within one RTT as lost on an RTO event, using logic similar to
TCP RACK loss detection.

Normally the head (SND.UNA) would be marked lost since RTO should
fire strictly after the head was sent. An exception is when the
most recent RACK RTT measurement is larger than the (previous)
RTO. To address this exception the head is always marked lost.

Congestion control interaction: since we may not mark every packet
lost, the congestion window may be more than 1 (inflight plus 1).
But only one packet will be retransmitted after RTO, since
tcp_retransmit_timer() calls tcp_retransmit_skb(...,segs=1). The
connection still performs slow start from one packet (with Cubic
congestion control).

This commit was tested in an A/B test with Google web servers,
and showed a reduction of 2% in (spurious) retransmits post
timeout (SlowStartRetrans), and correspondingly reduced DSACKs
(DSACKIgnoredOld) by 7%.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Reviewed-by: Eric Dumazet 
Reviewed-by: Soheil Hassas Yeganeh 
Reviewed-by: Priyaranjan Jha 
---
 net/ipv4/tcp_input.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ba8a8e3464aa..0bf032839548 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1929,11 +1929,11 @@ static bool tcp_is_rack(const struct sock *sk)
 static void tcp_timeout_mark_lost(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
-   struct sk_buff *skb;
+   struct sk_buff *skb, *head;
bool is_reneg;  /* is receiver reneging on SACKs? */
 
-   skb = tcp_rtx_queue_head(sk);
-   is_reneg = skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED);
+   head = tcp_rtx_queue_head(sk);
+   is_reneg = head && (TCP_SKB_CB(head)->sacked & TCPCB_SACKED_ACKED);
if (is_reneg) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
tp->sacked_out = 0;
@@ -1943,9 +1943,13 @@ static void tcp_timeout_mark_lost(struct sock *sk)
tcp_reset_reno_sack(tp);
}
 
+   skb = head;
skb_rbtree_walk_from(skb) {
if (is_reneg)
TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
+   else if (tcp_is_rack(sk) && skb != head &&
+tcp_rack_skb_timeout(tp, skb, 0) > 0)
+   continue; /* Don't mark recently sent ones lost yet */
tcp_mark_skb_lost(sk, skb);
}
tcp_verify_left_out(tp);
@@ -1972,7 +1976,7 @@ void tcp_enter_loss(struct sock *sk)
tcp_ca_event(sk, CA_EVENT_LOSS);
tcp_init_undo(tp);
}
-   tp->snd_cwnd   = 1;
+   tp->snd_cwnd   = tcp_packets_in_flight(tp) + 1;
tp->snd_cwnd_cnt   = 0;
tp->snd_cwnd_stamp = tcp_jiffies32;
 
-- 
2.17.0.441.gb46fe60e1d-goog
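The marking rule in `tcp_timeout_mark_lost()` above -- head always lost, other packets spared if (re)sent within the last RACK RTT -- can be modeled in userspace C. This is an illustrative sketch with assumed microsecond timestamps, not the kernel function itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Should an RTO mark this skb lost?  The head is always marked; any
 * other skb is spared if it was sent within the last RACK RTT,
 * i.e. tcp_rack_skb_timeout(tp, skb, 0) > 0 in the patch. */
static bool rto_mark_lost(bool is_head, int64_t now_us,
                          int64_t sent_us, uint32_t rack_rtt_us)
{
    if (is_head)
        return true;
    int64_t timeout = (int64_t)rack_rtt_us - (now_us - sent_us);
    return timeout <= 0;
}
```

So with a 5 ms RACK RTT, a non-head packet sent 1 ms ago survives the RTO sweep, while one sent 10 ms ago is marked lost.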



[PATCH net-next 7/8] tcp: new helper tcp_rack_skb_timeout

2018-05-16 Thread Yuchung Cheng
Create and export a new helper tcp_rack_skb_timeout and move tcp_is_rack
to prepare for the final RTO change.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Reviewed-by: Eric Dumazet 
Reviewed-by: Soheil Hassas Yeganeh 
Reviewed-by: Priyaranjan Jha 
---
 include/net/tcp.h   |  2 ++
 net/ipv4/tcp_input.c| 10 +-
 net/ipv4/tcp_recovery.c |  9 +++--
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 402484ed9b57..b46d0f9adbdb 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1880,6 +1880,8 @@ void tcp_init(void);
 /* tcp_recovery.c */
 void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb);
 void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced);
+extern s32 tcp_rack_skb_timeout(struct tcp_sock *tp, struct sk_buff *skb,
+   u32 reo_wnd);
 extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 u64 xmit_time);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1ccc97b368c7..ba8a8e3464aa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1917,6 +1917,11 @@ static inline void tcp_init_undo(struct tcp_sock *tp)
tp->undo_retrans = tp->retrans_out ? : -1;
 }
 
+static bool tcp_is_rack(const struct sock *sk)
+{
+   return sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION;
+}
+
 /* If we detect SACK reneging, forget all SACK information
  * and reset tags completely, otherwise preserve SACKs. If receiver
  * dropped its ofo queue, we will know this due to reneging detection.
@@ -2031,11 +2036,6 @@ static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
return tp->sacked_out + 1;
 }
 
-static bool tcp_is_rack(const struct sock *sk)
-{
-   return sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION;
-}
-
 /* Linux NewReno/SACK/ECN state machine.
  * --
  *
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index b2f9be388bf3..30cbfb69b1de 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -47,6 +47,12 @@ u32 tcp_rack_reo_wnd(const struct sock *sk)
   tp->srtt_us >> 3);
 }
 
+s32 tcp_rack_skb_timeout(struct tcp_sock *tp, struct sk_buff *skb, u32 reo_wnd)
+{
+   return tp->rack.rtt_us + reo_wnd -
+  tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
+}
+
 /* RACK loss detection (IETF draft draft-ietf-tcpm-rack-01):
  *
  * Marks a packet lost, if some packet sent later has been (s)acked.
@@ -92,8 +98,7 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
/* A packet is lost if it has not been s/acked beyond
 * the recent RTT plus the reordering window.
 */
-   remaining = tp->rack.rtt_us + reo_wnd -
-   tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
+   remaining = tcp_rack_skb_timeout(tp, skb, reo_wnd);
if (remaining <= 0) {
tcp_mark_skb_lost(sk, skb);
list_del_init(&skb->tcp_tsorted_anchor);
-- 
2.17.0.441.gb46fe60e1d-goog
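The exported helper's arithmetic is simple enough to model directly. The sketch below is a userspace restatement with assumed microsecond units; signedness matters, since a negative or zero result means the skb has already crossed the RACK loss threshold.

```c
#include <assert.h>
#include <stdint.h>

/* How long until this skb crosses the RACK loss threshold
 * (rtt + reo_wnd after it was sent)?  <= 0 means "already lost",
 * > 0 is the remaining wait used to arm the reordering timer. */
static int64_t rack_skb_timeout(uint32_t rack_rtt_us, uint32_t reo_wnd_us,
                                int64_t now_us, int64_t skb_sent_us)
{
    return (int64_t)rack_rtt_us + reo_wnd_us - (now_us - skb_sent_us);
}
```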



[PATCH net-next 6/8] tcp: separate loss marking and state update on RTO

2018-05-16 Thread Yuchung Cheng
Previously when TCP times out, it first updates cwnd and ssthresh,
marks packets lost, and then updates congestion state again. This
was fine because everything not yet delivered is marked lost,
so the inflight is always 0 and cwnd can be safely set to 1 to
retransmit one packet on timeout.

But the inflight may not always be 0 on timeout if TCP changes to
mark packets lost based on packet sent time. Therefore we must
first mark the packet lost, then set the cwnd based on the
(updated) inflight.

This is not a pure refactor. Congestion control may potentially
break if it uses the (not yet updated) inflight to compute ssthresh.
Fortunately, no existing congestion control module does that.
This also changes the inflight observed when CA_EVENT_LOSS is raised;
only Westwood processes that event, and it does not use inflight.

This change has two other minor side benefits:
1) consistent with Fast Recovery s.t. the inflight is updated
   first before tcp_enter_recovery flips state to CA_Recovery.

2) avoid intertwining loss marking with state update, making the
   code more readable.
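The ordering requirement can be shown with a toy model: combining the loss sweep from this patch with the cwnd formula from the last patch in the series, cwnd must be derived from the inflight left *after* marking. The function below is purely illustrative.

```c
#include <assert.h>

/* Toy model: marking losses first shrinks inflight, and only then is
 * cwnd set to inflight + 1 (allowing one retransmit).  When every
 * unacked packet is marked lost this degenerates to cwnd = 1, matching
 * the old behavior. */
static int rto_cwnd(int inflight, int newly_marked_lost)
{
    inflight -= newly_marked_lost; /* lost packets leave the pipe */
    return inflight + 1;
}
```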

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Reviewed-by: Eric Dumazet 
Reviewed-by: Soheil Hassas Yeganeh 
Reviewed-by: Priyaranjan Jha 
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index af32accda2a9..1ccc97b368c7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1955,6 +1955,8 @@ void tcp_enter_loss(struct sock *sk)
struct net *net = sock_net(sk);
bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
 
+   tcp_timeout_mark_lost(sk);
+
/* Reduce ssthresh if it has not yet been made inside this window. */
if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
!after(tp->high_seq, tp->snd_una) ||
@@ -1969,8 +1971,6 @@ void tcp_enter_loss(struct sock *sk)
tp->snd_cwnd_cnt   = 0;
tp->snd_cwnd_stamp = tcp_jiffies32;
 
-   tcp_timeout_mark_lost(sk);
-
/* Timeout in disordered state after receiving substantial DUPACKs
 * suggests that the degree of reordering is over-estimated.
 */
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH net-next 2/8] tcp: disable RFC6675 loss detection

2018-05-16 Thread Yuchung Cheng
This patch disables RFC6675 loss detection and makes the sysctl
net.ipv4.tcp_recovery a binary choice between RACK (1) and
RFC6675 (0).

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Reviewed-by: Eric Dumazet 
Reviewed-by: Soheil Hassas Yeganeh 
Reviewed-by: Priyaranjan Jha 
---
 Documentation/networking/ip-sysctl.txt |  3 ++-
 net/ipv4/tcp_input.c   | 12 
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 13bbac50dc8b..ea304a23c8d7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -449,7 +449,8 @@ tcp_recovery - INTEGER
features.
 
RACK: 0x1 enables the RACK loss detection for fast detection of lost
- retransmissions and tail drops.
+ retransmissions and tail drops. It also subsumes and disables
+ RFC6675 recovery for SACK connections.
RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
RACK: 0x4 disables RACK's DUPACK threshold heuristic
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b188e0d75edd..ccbe04f80040 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2035,6 +2035,11 @@ static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
return tp->sacked_out + 1;
 }
 
+static bool tcp_is_rack(const struct sock *sk)
+{
+   return sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION;
+}
+
 /* Linux NewReno/SACK/ECN state machine.
  * --
  *
@@ -2141,7 +2146,7 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
return true;
 
/* Not-A-Trick#2 : Classic rule... */
-   if (tcp_dupack_heuristics(tp) > tp->reordering)
+   if (!tcp_is_rack(sk) && tcp_dupack_heuristics(tp) > tp->reordering)
return true;
 
return false;
@@ -2722,8 +2727,7 @@ static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   /* Use RACK to detect loss */
-   if (sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION) {
+   if (tcp_is_rack(sk)) {
u32 prior_retrans = tp->retrans_out;
 
tcp_rack_mark_lost(sk);
@@ -2862,7 +2866,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
fast_rexmit = 1;
}
 
-   if (do_lost)
+   if (!tcp_is_rack(sk) && do_lost)
tcp_update_scoreboard(sk, fast_rexmit);
*rexmit = REXMIT_LOST;
 }
-- 
2.17.0.441.gb46fe60e1d-goog
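The gating this patch adds to `tcp_time_to_recover()` can be sketched as a standalone predicate. The function name and parameters below are illustrative simplifications, not kernel code; the flag value is the one defined in include/net/tcp.h.

```c
#include <assert.h>
#include <stdbool.h>

#define TCP_RACK_LOSS_DETECTION 0x1 /* bit 0 of net.ipv4.tcp_recovery */

/* With RACK enabled, the classic RFC6675 DUPACK-threshold trigger is
 * skipped entirely; otherwise recovery starts once the DUPACK
 * heuristic exceeds the reordering threshold. */
static bool rfc6675_time_to_recover(unsigned int sysctl_tcp_recovery,
                                    unsigned int dupack_heuristics,
                                    unsigned int reordering)
{
    if (sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION)
        return false; /* RACK subsumes RFC6675 */
    return dupack_heuristics > reordering;
}
```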



[PATCH net-next 3/8] tcp: simpler NewReno implementation

2018-05-16 Thread Yuchung Cheng
This is a rewrite of NewReno loss recovery implementation that is
simpler and standalone for readability and better performance by
using less states.

Note that NewReno refers to the RFC6582 modification to the fast
recovery algorithm. It is used in Linux only when the connection
does not support SACK. It should not be confused with the Reno
(AIMD) congestion control.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Neal Cardwell 
Reviewed-by: Eric Dumazet 
Reviewed-by: Soheil Hassas Yeganeh 
Reviewed-by: Priyaranjan Jha 
---
 include/net/tcp.h   |  1 +
 net/ipv4/tcp_input.c| 19 +++
 net/ipv4/tcp_recovery.c | 27 +++
 3 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 85000c85ddcd..d7f81325bee5 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1878,6 +1878,7 @@ void tcp_v4_init(void);
 void tcp_init(void);
 
 /* tcp_recovery.c */
+void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced);
 extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 u64 xmit_time);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ccbe04f80040..076206873e3e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2223,9 +2223,7 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (tcp_is_reno(tp)) {
-   tcp_mark_head_lost(sk, 1, 1);
-   } else {
+   if (tcp_is_sack(tp)) {
int sacked_upto = tp->sacked_out - tp->reordering;
if (sacked_upto >= 0)
tcp_mark_head_lost(sk, sacked_upto, 0);
@@ -2723,11 +2721,16 @@ static bool tcp_try_undo_partial(struct sock *sk, u32 prior_snd_una)
return false;
 }
 
-static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag)
+static void tcp_identify_packet_loss(struct sock *sk, int *ack_flag)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
-   if (tcp_is_rack(sk)) {
+   if (tcp_rtx_queue_empty(sk))
+   return;
+
+   if (unlikely(tcp_is_reno(tp))) {
+   tcp_newreno_mark_lost(sk, *ack_flag & FLAG_SND_UNA_ADVANCED);
+   } else if (tcp_is_rack(sk)) {
u32 prior_retrans = tp->retrans_out;
 
tcp_rack_mark_lost(sk);
@@ -2823,11 +2826,11 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
tcp_try_keep_open(sk);
return;
}
-   tcp_rack_identify_loss(sk, ack_flag);
+   tcp_identify_packet_loss(sk, ack_flag);
break;
case TCP_CA_Loss:
tcp_process_loss(sk, flag, is_dupack, rexmit);
-   tcp_rack_identify_loss(sk, ack_flag);
+   tcp_identify_packet_loss(sk, ack_flag);
if (!(icsk->icsk_ca_state == TCP_CA_Open ||
  (*ack_flag & FLAG_LOST_RETRANS)))
return;
@@ -2844,7 +2847,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
if (icsk->icsk_ca_state <= TCP_CA_Disorder)
tcp_try_undo_dsack(sk);
 
-   tcp_rack_identify_loss(sk, ack_flag);
+   tcp_identify_packet_loss(sk, ack_flag);
if (!tcp_time_to_recover(sk, flag)) {
tcp_try_to_open(sk, flag);
return;
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 1c1bdf12a96f..299b0e38aa9a 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -216,3 +216,30 @@ void tcp_rack_update_reo_wnd(struct sock *sk, struct rate_sample *rs)
tp->rack.reo_wnd_steps = 1;
}
 }
+
+/* RFC6582 NewReno recovery for non-SACK connection. It simply retransmits
+ * the next unacked packet upon receiving
+ * a) three or more DUPACKs to start the fast recovery
+ * b) an ACK acknowledging new data during the fast recovery.
+ */
+void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced)
+{
+   const u8 state = inet_csk(sk)->icsk_ca_state;
+   struct tcp_sock *tp = tcp_sk(sk);
+
+   if ((state < TCP_CA_Recovery && tp->sacked_out >= tp->reordering) ||
+   (state == TCP_CA_Recovery && snd_una_advanced)) {
+   struct sk_buff *skb = tcp_rtx_queue_head(sk);
+   u32 mss;
+
+   if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST)
+   return;
+
+   mss = tcp_skb_mss(skb);
+   if (tcp_skb_pcount(skb) > 1 && skb->len > mss)
+   tcp_fragment(sk, TCP_FRAG_IN_RTX_QUEUE, skb,
+    mss, mss, GFP_ATOMIC);
+
+   tcp_skb_mark_lost_uncond_verify(tp, 
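The trigger condition implemented by `tcp_newreno_mark_lost()` can be modeled in userspace C. The enum and field names below are simplified assumptions mirroring the kernel's CA-state ordering, not its definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified CA states, in the kernel's enum order */
enum ca_state { CA_OPEN, CA_DISORDER, CA_CWR, CA_RECOVERY, CA_LOSS };

/* RFC6582 trigger: mark the head lost either on enough DUPACKs before
 * recovery starts, or on a partial ACK (snd_una advanced) while in
 * recovery. */
static bool newreno_should_mark_head(enum ca_state state,
                                     uint32_t sacked_out,
                                     uint32_t reordering,
                                     bool snd_una_advanced)
{
    return (state < CA_RECOVERY && sacked_out >= reordering) ||
           (state == CA_RECOVERY && snd_una_advanced);
}
```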

[PATCH net-next 0/8] tcp: default RACK loss recovery

2018-05-16 Thread Yuchung Cheng
This patch set implements the features corresponding to the
draft-ietf-tcpm-rack-03 version of the RACK draft.
https://datatracker.ietf.org/meeting/101/materials/slides-101-tcpm-update-on-tcp-rack-00

1. SACK: implement equivalent DUPACK threshold heuristic in RACK to
   replace existing RFC6675 recovery (tcp_mark_head_lost).

2. Non-SACK: simplify RFC6582 NewReno implementation

3. RTO: apply RACK's time-based approach to avoid spuriously
   marking very recently sent packets lost.

4. with (1)(2)(3), make RACK the exclusive fast recovery mechanism to
   mark losses based on time on S/ACK. Tail loss probe and F-RTO remain
   enabled by default as complementary mechanisms to send probes in
   CA_Open and CA_Loss states. The probes would solicit S/ACKs to trigger
   RACK time-based loss detection.

All Google web and internal servers have been running RACK-only mode
(4) for a while now. A/B experiments indicate RACK/TLP on average
reduces recovery latency by 10% compared to RFC6675. RFC6675 is now
off by default but can be re-enabled, in case of unforeseen issues,
by disabling RACK (sysctl net.ipv4.tcp_recovery=0).

Yuchung Cheng (8):
  tcp: support DUPACK threshold in RACK
  tcp: disable RFC6675 loss detection
  tcp: simpler NewReno implementation
  tcp: account lost retransmit after timeout
  tcp: new helper tcp_timeout_mark_lost
  tcp: separate loss marking and state update on RTO
  tcp: new helper tcp_rack_skb_timeout
  tcp: don't mark recently sent packets lost on RTO

 Documentation/networking/ip-sysctl.txt |  4 +-
 include/net/tcp.h  |  5 ++
 net/ipv4/tcp_input.c   | 99 ++
 net/ipv4/tcp_recovery.c| 80 -
 4 files changed, 124 insertions(+), 64 deletions(-)

-- 
2.17.0.441.gb46fe60e1d-goog



Re: [PATCH bpf-next] libbpf: add ifindex to enable offload support

2018-05-16 Thread Daniel Borkmann
On 05/16/2018 11:02 PM, Jakub Kicinski wrote:
> From: David Beckett 
> 
> BPF programs currently can only be offloaded using iproute2. This
> patch will allow programs to be offloaded using libbpf calls.
> 
> Signed-off-by: David Beckett 
> Reviewed-by: Jakub Kicinski 

Applied to bpf-next, thanks guys!


Re: [PATCH] bpf: add __printf verification to bpf_verifier_vlog

2018-05-16 Thread Daniel Borkmann
On 05/16/2018 10:27 PM, Mathieu Malaterre wrote:
> __printf is useful to verify format and arguments. ‘bpf_verifier_vlog’
> function is used twice in verifier.c in both cases the caller function
> already uses the __printf gcc attribute.
> 
> Remove the following warning, triggered with W=1:
> 
>   kernel/bpf/verifier.c:176:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format]
> 
> Signed-off-by: Mathieu Malaterre 

Looks good, applied to bpf-next, thanks Mathieu!


Re: [PATCH bpf-next] bpf: fix sock hashmap kmalloc warning

2018-05-16 Thread Daniel Borkmann
On 05/16/2018 11:06 PM, Yonghong Song wrote:
> syzbot reported a kernel warning below:
>   WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
>   Kernel panic - not syncing: panic_on_warn set ...
> 
>   CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9
>   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>   Call Trace:
>__dump_stack lib/dump_stack.c:77 [inline]
>dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>panic+0x22f/0x4de kernel/panic.c:184
>__warn.cold.8+0x163/0x1b3 kernel/panic.c:536
>report_bug+0x252/0x2d0 lib/bug.c:186
>fixup_bug arch/x86/kernel/traps.c:178 [inline]
>do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
>do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
>   RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996
>   RSP: 0018:8801d907fc58 EFLAGS: 00010246
>   RAX:  RBX: 8801aeecb280 RCX: 8185ebd7
>   RDX:  RSI:  RDI: ffe1
>   RBP: 8801d907fc58 R08: 8801adb5e1c0 R09: ed0035a84700
>   R10: ed0035a84700 R11: 8801ad423803 R12: 8801aeecb280
>   R13: fff4 R14: 8801ad891a00 R15: 014200c0
>__do_kmalloc mm/slab.c:3713 [inline]
>__kmalloc+0x25/0x760 mm/slab.c:3727
>kmalloc include/linux/slab.h:517 [inline]
>map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858
>__do_sys_bpf kernel/bpf/syscall.c:2131 [inline]
>__se_sys_bpf kernel/bpf/syscall.c:2096 [inline]
>__x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096
>do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> The test case is against sock hashmap with a key size 0xffe1.
> Such a large key size will cause the below code in function
> sock_hash_alloc() overflowing and produces a smaller elem_size,
> hence map creation will be successful.
> htab->elem_size = sizeof(struct htab_elem) +
>   round_up(htab->map.key_size, 8);
> 
> Later, when map_get_next_key is called and kernel tries
> to allocate the key unsuccessfully, it will issue
> the above warning.
> 
> Similar to hashtab, ensure the key size is at most
> MAX_BPF_STACK for a successful map creation.
> 
> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> Reported-by: syzbot+e4566d29080e7f346...@syzkaller.appspotmail.com
> Signed-off-by: Yonghong Song 

Applied to bpf-next, thanks Yonghong!


Re: kernel BUG at lib/string.c:LINE! (4)

2018-05-16 Thread Julian Anastasov

Hello,

On Wed, 16 May 2018, syzbot wrote:

> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:0b7d9978406f Merge branch 'Microsemi-Ocelot-Ethernet-switc..
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=16e9101780
> kernel config:  https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1
> dashboard link: https://syzkaller.appspot.com/bug?extid=aac887f77319868646df
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=1665d63780
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1051710780
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+aac887f7731986864...@syzkaller.appspotmail.com
> 
> IPVS: Unknown mcast interface: veth1_to???a
> IPVS: Unknown mcast interface: veth1_to???a
> IPVS: Unknown mcast interface: veth1_to???a
> detected buffer overflow in strlen
> [ cut here ]
> kernel BUG at lib/string.c:1052!
> invalid opcode:  [#1] SMP KASAN
> Dumping ftrace buffer:
>   (ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 373 Comm: syz-executor936 Not tainted 4.17.0-rc4+ #45
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
> 01/01/2011
> RIP: 0010:fortify_panic+0x13/0x20 lib/string.c:1051
> RSP: 0018:8801c976f800 EFLAGS: 00010282
> RAX: 0022 RBX: 0040 RCX: 
> RDX: 0022 RSI: 8160f6f1 RDI: ed00392edef6
> RBP: 8801c976f800 R08: 8801cf4c62c0 R09: ed003b5e4fb0
> R10: ed003b5e4fb0 R11: 8801daf27d87 R12: 8801c976fa20
> R13: 8801c976fae4 R14: 8801c976fae0 R15: 048b
> FS:  7fd99f75e700() GS:8801daf0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 21c0 CR3: 0001d6843000 CR4: 001406e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
> strlen include/linux/string.h:270 [inline]
> strlcpy include/linux/string.h:293 [inline]
> do_ip_vs_set_ctl+0x31c/0x1d00 net/netfilter/ipvs/ip_vs_ctl.c:2388
> nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
> nf_setsockopt+0x7d/0xd0 net/netfilter/nf_sockopt.c:115
> ip_setsockopt+0xd8/0xf0 net/ipv4/ip_sockglue.c:1253
> udp_setsockopt+0x62/0xa0 net/ipv4/udp.c:2487
> ipv6_setsockopt+0x149/0x170 net/ipv6/ipv6_sockglue.c:917
> tcp_setsockopt+0x93/0xe0 net/ipv4/tcp.c:3057
> sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3046
> __sys_setsockopt+0x1bd/0x390 net/socket.c:1903
> __do_sys_setsockopt net/socket.c:1914 [inline]
> __se_sys_setsockopt net/socket.c:1911 [inline]
> __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
> do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x447369
> RSP: 002b:7fd99f75dda8 EFLAGS: 0246 ORIG_RAX: 0036
> RAX: ffda RBX: 006e39e4 RCX: 00447369
> RDX: 048b RSI:  RDI: 0003
> RBP:  R08: 0018 R09: 
> R10: 21c0 R11: 0246 R12: 006e39e0
> R13: 75a1ff93f0896195 R14: 6f745f3168746576 R15: 0001
> Code: 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 48 89 df e8 d2 8f 48 fa eb de
> 55 48 89 fe 48 c7 c7 60 65 64 88 48 89 e5 e8 91 dd f3 f9 <0f> 0b 90 90 90 90
> 90 90 90 90 90 90 90 55 48 89 e5 41 57 41 56
> RIP: fortify_panic+0x13/0x20 lib/string.c:1051 RSP: 8801c976f800
> ---[ end trace 624046f2d9af7702 ]---

Just to let you know that I tested a patch with
the syzbot, will do more tests before submitting...

Regards

--
Julian Anastasov 


Re: [PATCH net-next] erspan: set bso bit based on mirrored packet's len

2018-05-16 Thread Tobin C. Harding
On Wed, May 16, 2018 at 07:05:34AM -0700, William Tu wrote:
> On Mon, May 14, 2018 at 10:33 PM, Tobin C. Harding  wrote:
> > On Mon, May 14, 2018 at 04:54:36PM -0700, William Tu wrote:
> >> Before the patch, the erspan BSO bit (Bad/Short/Oversized) is not
> >> handled.  BSO has 4 possible values:
> >>   00 --> Good frame with no error, or unknown integrity
> >>   11 --> Payload is a Bad Frame with CRC or Alignment Error
> >>   01 --> Payload is a Short Frame
> >>   10 --> Payload is an Oversized Frame
> >>
> >> Based the short/oversized definitions in RFC1757, the patch sets
> >> the bso bit based on the mirrored packet's size.
> >>
> >> Reported-by: Xiaoyan Jin 
> >> Signed-off-by: William Tu 
> >> ---
> >>  include/net/erspan.h | 25 +
> >>  1 file changed, 25 insertions(+)
> >>
> >> diff --git a/include/net/erspan.h b/include/net/erspan.h
> >> index d044aa60cc76..5eb95f78ad45 100644
> >> --- a/include/net/erspan.h
> >> +++ b/include/net/erspan.h
> >> @@ -219,6 +219,30 @@ static inline __be32 erspan_get_timestamp(void)
> >>   return htonl((u32)h_usecs);
> >>  }
> >>
> >> +/* ERSPAN BSO (Bad/Short/Oversized)
> >> + *   00b --> Good frame with no error, or unknown integrity
> >> + *   01b --> Payload is a Short Frame
> >> + *   10b --> Payload is an Oversized Frame
> >> + *   11b --> Payload is a Bad Frame with CRC or Alignment Error
> >> + */
> >> +enum erspan_bso {
> >> + BSO_NOERROR,
> >> + BSO_SHORT,
> >> + BSO_OVERSIZED,
> >> + BSO_BAD,
> >> +};
> >
> > If we are relying on the values perhaps this would be clearer
> >
> > BSO_NOERROR = 0x00,
> > BSO_SHORT   = 0x01,
> > BSO_OVERSIZED   = 0x02,
> > BSO_BAD = 0x03,
> >
> 
> Yes, thanks. I will change in v2.
> 
> >> +
> >> +static inline u8 erspan_detect_bso(struct sk_buff *skb)
> >> +{
> >> + if (skb->len < ETH_ZLEN)
> >> + return BSO_SHORT;
> >> +
> >> + if (skb->len > ETH_FRAME_LEN)
> >> + return BSO_OVERSIZED;
> >> +
> >> + return BSO_NOERROR;
> >> +}
> >
> > Without having much contextual knowledge around this patch; should we be
> > doing some check on CRC or alignment (at some stage)?  Having BSO_BAD
> > seems to imply so?
> >
> 
> The definition of BSO_BAD:
> etherStatsCRCAlignErrors OBJECT-TYPE
>   SYNTAX Counter
>   ACCESS read-only
>   STATUS mandatory
>   DESCRIPTION
>   "The total number of packets received that
>   had a length (excluding framing bits, but
>   including FCS octets) of between 64 and 1518
>   octets, inclusive, but but had either a bad
>   Frame Check Sequence (FCS) with an integral
>   number of octets (FCS Error) or a bad FCS with
>   a non-integral number of octets (Alignment Error)."
>
> But I don't know how to check CRC error at this code point.
> Isn't it done by the NIC hardware?

I'll just start with this: I don't know anything about ERSPAN.

"ERSPAN is a Cisco proprietary feature and is available only to
Catalyst 6500, 7600, Nexus, and ASR 1000 platforms to date. The
ASR 1000 supports ERSPAN source (monitoring) only on Fast
Ethernet, Gigabit Ethernet, and port-channel interfaces."

https://supportforums.cisco.com/t5/network-infrastructure-documents/understanding-span-rspan-and-erspan/ta-p/3144951

I dug around a bit and none of the files that currently import erspan.h
actually use the 'bso' field

$ grep bso $(git grep -l 'erspan\.h')
include/net/erspan.h:   u8 bso = 0; /* Bad/Short/Oversized */
include/net/erspan.h:   ershdr->en = bso;
net/ipv4/ip_gre.c: ICMP in the real Internet is absolutely infeasible.
net/ipv4/ip_gre.c:   * ICMP in the real Internet is absolutely infeasible.


Normally, AFAICT, the FCS does not get passed to the operating system
since its a link layer mechanism.  If ERSPAN is passing the FCS when it
mirrors frames (does it mirror frames or packets, I don't know?) then
surely ERSPAN should provide a function to return the BSO value.

So IMHO this patch seems like just a pretense and is not really doing
anything.

Hope this helps,
Tobin.
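For reference, the length-only classification proposed in the patch under discussion compiles down to a trivially testable function. This userspace sketch mirrors the proposed `erspan_detect_bso()` with the standard Ethernet length constants; as noted in the thread, BSO_BAD cannot be detected this way because the FCS normally never reaches the stack.

```c
#include <assert.h>

#define ETH_ZLEN      60   /* minimum Ethernet frame length, no FCS */
#define ETH_FRAME_LEN 1514 /* maximum untagged frame length, no FCS */

/* BSO values from the patch: 00 good, 01 short, 10 oversized, 11 bad */
enum erspan_bso {
    BSO_NOERROR   = 0x0,
    BSO_SHORT     = 0x1,
    BSO_OVERSIZED = 0x2,
    BSO_BAD       = 0x3, /* CRC/alignment error: needs FCS, not length */
};

/* Length-based classification only, as in the proposed helper */
static enum erspan_bso detect_bso(unsigned int skb_len)
{
    if (skb_len < ETH_ZLEN)
        return BSO_SHORT;
    if (skb_len > ETH_FRAME_LEN)
        return BSO_OVERSIZED;
    return BSO_NOERROR;
}
```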


[PATCH] net: ethernet: ti: cpsw: disable mq feature for "AM33xx ES1.0" devices

2018-05-16 Thread Ivan Khoronzhuk
The early versions of am33xx devices, related to ES1.0 SoC revision
have errata limiting mq support. That's the same errata as
commit 7da1160002f1 ("drivers: net: cpsw: add am335x errata workarround for interrutps")

AM33xx Errata [1] Advisory 1.0.9
http://www.ti.com/lit/er/sprz360f/sprz360f.pdf

Further investigation found that the driver's workaround is applied
on all AM33xx SoCs and on DM814x, although the erratum exists only
for ES1.0 of the AM33xx family, needlessly limiting mq support on
revisions after ES1.0. So, disable mq support only for the affected
SoCs and use separate polls for revisions that allow mq.

Signed-off-by: Ivan Khoronzhuk 
---

Based on net-next/master

 drivers/net/ethernet/ti/cpsw.c | 109 ++---
 1 file changed, 60 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 28d893b93d30..a7285dddfd29 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -957,7 +958,7 @@ static irqreturn_t cpsw_rx_interrupt(int irq, void *dev_id)
return IRQ_HANDLED;
 }
 
-static int cpsw_tx_poll(struct napi_struct *napi_tx, int budget)
+static int cpsw_tx_mq_poll(struct napi_struct *napi_tx, int budget)
 {
u32 ch_map;
int num_tx, cur_budget, ch;
@@ -984,7 +985,21 @@ static int cpsw_tx_poll(struct napi_struct *napi_tx, int budget)
if (num_tx < budget) {
napi_complete(napi_tx);
writel(0xff, &cpsw->wr_regs->tx_en);
-   if (cpsw->quirk_irq && cpsw->tx_irq_disabled) {
+   }
+
+   return num_tx;
+}
+
+static int cpsw_tx_poll(struct napi_struct *napi_tx, int budget)
+{
+   struct cpsw_common *cpsw = napi_to_cpsw(napi_tx);
+   int num_tx;
+
+   num_tx = cpdma_chan_process(cpsw->txv[0].ch, budget);
+   if (num_tx < budget) {
+   napi_complete(napi_tx);
+   writel(0xff, &cpsw->wr_regs->tx_en);
+   if (cpsw->tx_irq_disabled) {
cpsw->tx_irq_disabled = false;
enable_irq(cpsw->irqs_table[1]);
}
@@ -993,7 +1008,7 @@ static int cpsw_tx_poll(struct napi_struct *napi_tx, int budget)
return num_tx;
 }
 
-static int cpsw_rx_poll(struct napi_struct *napi_rx, int budget)
+static int cpsw_rx_mq_poll(struct napi_struct *napi_rx, int budget)
 {
u32 ch_map;
int num_rx, cur_budget, ch;
@@ -1020,7 +1035,21 @@ static int cpsw_rx_poll(struct napi_struct *napi_rx, int budget)
if (num_rx < budget) {
napi_complete_done(napi_rx, num_rx);
writel(0xff, &cpsw->wr_regs->rx_en);
-   if (cpsw->quirk_irq && cpsw->rx_irq_disabled) {
+   }
+
+   return num_rx;
+}
+
+static int cpsw_rx_poll(struct napi_struct *napi_rx, int budget)
+{
+   struct cpsw_common *cpsw = napi_to_cpsw(napi_rx);
+   int num_rx;
+
+   num_rx = cpdma_chan_process(cpsw->rxv[0].ch, budget);
+   if (num_rx < budget) {
+   napi_complete_done(napi_rx, num_rx);
+   writel(0xff, &cpsw->wr_regs->rx_en);
+   if (cpsw->rx_irq_disabled) {
cpsw->rx_irq_disabled = false;
enable_irq(cpsw->irqs_table[0]);
}
@@ -2364,9 +2393,9 @@ static void cpsw_get_channels(struct net_device *ndev,
 {
struct cpsw_common *cpsw = ndev_to_cpsw(ndev);
 
+   ch->max_rx = cpsw->quirk_irq ? 1 : CPSW_MAX_QUEUES;
+   ch->max_tx = cpsw->quirk_irq ? 1 : CPSW_MAX_QUEUES;
ch->max_combined = 0;
-   ch->max_rx = CPSW_MAX_QUEUES;
-   ch->max_tx = CPSW_MAX_QUEUES;
ch->max_other = 0;
ch->other_count = 0;
ch->rx_count = cpsw->rx_ch_num;
@@ -2377,6 +2406,11 @@ static void cpsw_get_channels(struct net_device *ndev,
 static int cpsw_check_ch_settings(struct cpsw_common *cpsw,
  struct ethtool_channels *ch)
 {
+   if (cpsw->quirk_irq) {
+   dev_err(cpsw->dev, "Maximum one tx/rx queue is allowed");
+   return -EOPNOTSUPP;
+   }
+
if (ch->combined_count)
return -EINVAL;
 
@@ -2917,44 +2951,20 @@ static int cpsw_probe_dual_emac(struct cpsw_priv *priv)
return ret;
 }
 
-#define CPSW_QUIRK_IRQ BIT(0)
-
-static const struct platform_device_id cpsw_devtype[] = {
-   {
-   /* keep it for existing comaptibles */
-   .name = "cpsw",
-   .driver_data = CPSW_QUIRK_IRQ,
-   }, {
-   .name = "am335x-cpsw",
-   .driver_data = CPSW_QUIRK_IRQ,
-   }, {
-   .name = "am4372-cpsw",
-   .driver_data = 0,
-   }, {
-   .name = "dra7-cpsw",
-   .driver_data = 0,
-   }, {
-   /* sentinel */
-   }
-};

Re: [PATCH bpf-next 2/7] bpf: introduce bpf subcommand BPF_PERF_EVENT_QUERY

2018-05-16 Thread Yonghong Song



On 5/16/18 4:27 AM, Peter Zijlstra wrote:

On Tue, May 15, 2018 at 04:45:16PM -0700, Yonghong Song wrote:

Currently, suppose a userspace application has loaded a bpf program
and attached it to a tracepoint/kprobe/uprobe, and a bpf
introspection tool, e.g., bpftool, wants to show which bpf program
is attached to which tracepoint/kprobe/uprobe. Such attachment
information will be really useful to understand the overall bpf
deployment in the system.

There is a name field (16 bytes) for each program, which could
be used to encode the attachment point. There are some drawbacks
for this approaches. First, bpftool user (e.g., an admin) may not
really understand the association between the name and the
attachment point. Second, if one program is attached to multiple
places, encoding a proper name which can imply all these
attachments becomes difficult.

This patch introduces a new bpf subcommand BPF_PERF_EVENT_QUERY.
Given a pid and fd, if the pid/fd pair is associated with a
tracepoint/kprobe/uprobe perf event, BPF_PERF_EVENT_QUERY will return
. prog_id
. tracepoint name, or
. k[ret]probe funcname + offset or kernel addr, or
. u[ret]probe filename + offset
to the userspace.
The user can use "bpftool prog" to find more information about
bpf program itself with prog_id.

Signed-off-by: Yonghong Song 
---
  include/linux/trace_events.h |  15 ++
  include/uapi/linux/bpf.h |  25 ++
  kernel/bpf/syscall.c | 113 +++
  kernel/trace/bpf_trace.c |  53 
  kernel/trace/trace_kprobe.c  |  29 +++
  kernel/trace/trace_uprobe.c  |  22 +
  6 files changed, 257 insertions(+)


Why is the command called *_PERF_EVENT_* ? Are there not a lot of !perf
places to attach BPF proglets?


Just to give a complete picture, below are the major places to attach
BPF programs:
   . perf based (through perf ioctl)
   . raw tracepoint based (through bpf interface)

   . netlink interface for tc, xdp, tunneling
   . setsockopt for socket filters
   . cgroup based (bpf attachment subcommand)
 mostly networking and io devices
   . some other networking socket related (sk_skb stream/parser/verdict,
 sk_msg verdict) through bpf attachment subcommand.

Currently, for cgroup based attachment, we have BPF_PROG_QUERY with an
input cgroup file descriptor. For other networking based queries, we
may need to enumerate tc filters, networking devices, open sockets, etc.
to get the attachment information.

So having a single BPF_QUERY command may be too complex to
cover all cases.

But you are right that BPF_PERF_EVENT_QUERY name is too narrow since
it should be used for other (pid, fd) based queries as well (e.g., 
socket, or other potential uses in the future).


How about the subcommand name BPF_TASK_FD_QUERY and make 
bpf_attr.task_fd_query extensible?


Thanks!


Re: [PATCH bpf-next v5 3/6] bpf: Add IPv6 Segment Routing helpers

2018-05-16 Thread Mathieu Xhonneux
 2018-05-14 23:40 GMT+01:00 Daniel Borkmann :
> On 05/12/2018 07:25 PM, Mathieu Xhonneux wrote:
> [...]
>> +BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
>> +const void *, from, u32, len)
>> +{
>> +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
>> + struct seg6_bpf_srh_state *srh_state =
>> + this_cpu_ptr(&seg6_bpf_srh_states);
>> + void *srh_tlvs, *srh_end, *ptr;
>> + struct ipv6_sr_hdr *srh;
>> + int srhoff = 0;
>> +
>> + if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
>> + return -EINVAL;
>> +
>> + srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
>> + srh_tlvs = (void *)((char *)srh + ((srh->first_segment + 1) << 4));
>> + srh_end = (void *)((char *)srh + sizeof(*srh) + srh_state->hdrlen);
>
> Do we need to check that this cannot go out of bounds wrt skb data?
input_action_bpf_end (which calls the BPF program) already verifies
using get_srh() that the whole SRH is accessible and is not out of
bounds.
The seg6 helpers (e.g. bpf_lwt_seg6_adjust_srh) then modify
srh_state->hdrlen following the evolution of the SRH in size. I don't
think that a check on srh_end is needed here as the SRH is already
verified once, and srh_state->hdrlen is then updated to keep this
bound correct.

>> + ptr = skb->data + offset;
>> + if (ptr >= srh_tlvs && ptr + len <= srh_end)
>> + srh_state->valid = 0;
>> + else if (ptr < (void *)&srh->flags ||
>> +  ptr + len > (void *)&srh->segments)
>> + return -EFAULT;
>> +
>> + if (unlikely(bpf_try_make_writable(skb, offset + len)))
>> + return -EFAULT;
>> +
>> + memcpy(ptr, from, len);
>
> You have a use after free here. bpf_try_make_writable() is potentially 
> changing
> underlying skb->data (e.g. see pskb_expand_head()). Therefore memcpy()'ing 
> into
> cached ptr is invalid.
>
OK.

>> + if (len > 0) {
>> + ret = skb_cow_head(skb, len);
>> + if (unlikely(ret < 0))
>> + return ret;
>> +
>> + ret = bpf_skb_net_hdr_push(skb, offset, len);
>> + } else {
>> + ret = bpf_skb_net_hdr_pop(skb, offset, -1 * len);
>> + }
>> + if (unlikely(ret < 0))
>> + return ret;
>
> And here as well. You changed underlying pointers via skb_cow_head(), but in
> the error path you leave the cached pointers that now point to already freed
> buffer. Thus, you'd now be able to access the new skb data out of bounds since
> cb->data_end is still the old one due to missing 
> bpf_compute_data_pointers(skb).
> Please fix and audit your whole series carefully against these types of subtle
> bugs.

Right.
I went through the whole series again. I found a similar mistake in
bpf_push_seg6_encap, and added also a bpf_compute_data_pointers(skb)
there. I didn't find anything else, so I hope that we're covered here
(bpf_lwt_seg6_store_bytes, bpf_lwt_seg6_adjust_srh and
bpf_push_seg6_encap are the only functions modifying the packet in
this series).

Thanks. I'll submit a v6 ASAP.


Re: [PATCH 00/14] Modify action API for implementing lockless actions

2018-05-16 Thread Jiri Pirko
Wed, May 16, 2018 at 11:23:41PM CEST, vla...@mellanox.com wrote:
>
>On Wed 16 May 2018 at 17:36, Roman Mashak  wrote:
>> Vlad Buslov  writes:
>>
>>> On Wed 16 May 2018 at 14:38, Roman Mashak  wrote:
 On Wed, May 16, 2018 at 2:43 AM, Vlad Buslov  wrote:
> I'm trying to run tdc, but keep getting following error even on clean
> branch without my patches:

 Vlad, not sure if you saw my email:
 Apply Roman's patch and try again

 https://marc.info/?l=linux-netdev=152639369112020=2

 cheers,
 jamal
>>>
>>> With patch applied I get following error:
>>>
>>> Test 7d50: Add skbmod action to set destination mac
>>> exit: 255 0
>>> dst MAC address <11:22:33:44:55:66>
>>> RTNETLINK answers: No such file or directory
>>> We have an error talking to the kernel
>>>
>>
>> You may actually have broken something with your patches in this case.
>
> Results is for net-next without my patches.

 Do you have skbmod compiled in kernel or as a module?
>>>
>>> Thanks, already figured out that default config has some actions
>>> disabled.
>>> Have more errors now. Everything related to ife:
>>>
>>> Test 7682: Create valid ife encode action with mark and pass control
>>> exit: 255 0
>>> IFE type 0xED3E
>>> RTNETLINK answers: No such file or directory
>>> We have an error talking to the kernel
>>>
>>> Test ef47: Create valid ife encode action with mark and pipe control
>>> exit: 255 0
>>> IFE type 0xED3E
>>> RTNETLINK answers: No space left on device
>>> We have an error talking to the kernel
>>>
>>> Test df43: Create valid ife encode action with mark and continue control
>>> exit: 255 0
>>> IFE type 0xED3E
>>> RTNETLINK answers: No space left on device
>>> We have an error talking to the kernel
>>>
>>> Test e4cf: Create valid ife encode action with mark and drop control
>>> exit: 255 0
>>> IFE type 0xED3E
>>> RTNETLINK answers: No space left on device
>>> We have an error talking to the kernel
>>>
>>> Test ccba: Create valid ife encode action with mark and reclassify control
>>> exit: 255 0
>>> IFE type 0xED3E
>>> RTNETLINK answers: No space left on device
>>> We have an error talking to the kernel
>>>
>>> Test a1cf: Create valid ife encode action with mark and jump control
>>> exit: 255 0
>>> IFE type 0xED3E
>>> RTNETLINK answers: No space left on device
>>> We have an error talking to the kernel
>>>
>>> ...
>>>
>>>
>>
>> Please make sure you have these in your kernel config:
>>
>> CONFIG_NET_ACT_IFE=y
>> CONFIG_NET_IFE_SKBMARK=m
>> CONFIG_NET_IFE_SKBPRIO=m
>> CONFIG_NET_IFE_SKBTCINDEX=m

Roman, could you please add this to some file? Something similar to:
tools/testing/selftests/net/forwarding/config

Thanks!

>>
>> For tdc to run all the tests, it is assumed that all the supported tc
>> actions/filters are enabled and compiled.
>
>Enabling these options allowed all ife tests to pass. Thanks!
>
>Error in u32 test still appears however:
>
>Test e9a3: Add u32 with source match
>
>-> prepare stage *** Could not execute: "$TC qdisc add dev $DEV1 ingress"
>
>-> prepare stage *** Error message: "Cannot find device "v0p1"


[bpf PATCH 1/2] bpf: sockmap update rollback on error can incorrectly dec prog refcnt

2018-05-16 Thread John Fastabend
If the user were to only attach one of the parse or verdict programs
then it is possible a subsequent sockmap update could incorrectly
decrement the refcnt on the program. This happens because in the
rollback logic, after an error, we have to decrement the program
reference count when it has been incremented. However, we only increment
the program reference count if the user has both a verdict and a
parse program. The reason for this is because, at least at the
moment, both are required for any one to be meaningful. The problem
fixed here is that in the rollback path we decrement the program refcnt
even if only one exists. But we never incremented the refcnt in
the first place, creating an imbalance.

This patch fixes the error path to handle this case.

Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
Reported-by: Daniel Borkmann 
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 098eca5..f03aaa8 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1717,10 +1717,10 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
if (tx_msg) {
tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);
if (IS_ERR(tx_msg)) {
-   if (verdict)
-   bpf_prog_put(verdict);
-   if (parse)
+   if (parse && verdict) {
bpf_prog_put(parse);
+   bpf_prog_put(verdict);
+   }
return PTR_ERR(tx_msg);
}
}
@@ -1805,10 +1805,10 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 out_free:
smap_release_sock(psock, sock);
 out_progs:
-   if (verdict)
-   bpf_prog_put(verdict);
-   if (parse)
+   if (parse && verdict) {
bpf_prog_put(parse);
+   bpf_prog_put(verdict);
+   }
if (tx_msg)
bpf_prog_put(tx_msg);
write_unlock_bh(&sock->sk_callback_lock);



[bpf PATCH 2/2] bpf: parse and verdict prog attach may race with bpf map update

2018-05-16 Thread John Fastabend
In the sockmap design BPF programs (SK_SKB_STREAM_PARSER and
SK_SKB_STREAM_VERDICT) are attached to the sockmap map type and when
a sock is added to the map the programs are used by the socket.
However, sockmap updates from both userspace and BPF programs can
happen concurrently with the attach and detach of these programs.

To resolve this we use bpf_prog_inc_not_zero and a READ_ONCE()
primitive to ensure the program pointer is not refetched and
possibly NULL'd before the refcnt increment. This happens inside
an RCU critical section so although the pointer reference in the map
object may be NULL (by a concurrent detach operation) the reference
from READ_ONCE will not be free'd until after a grace period. This
ensures the object returned by READ_ONCE() is valid through the
RCU critical section and safe to use as long as we "know" it may
be free'd shortly.

Daniel spotted a case in the sock update API where instead of using
the READ_ONCE() program reference we used the pointer from the
original map, stab->bpf_{verdict|parse}. The problem with this is
the logic checks the object returned from the READ_ONCE() is not
NULL and then tries to reference the object again but using the
above map pointer, which may have already been NULL'd by a parallel
detach operation. If this happened bpf_prog_inc_not_zero could
dereference a NULL pointer.

Fix this by using variable returned by READ_ONCE() that is checked
for NULL.

Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
Reported-by: Daniel Borkmann 
Signed-off-by: John Fastabend 
---
 kernel/bpf/sockmap.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index f03aaa8..583c1eb 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1703,11 +1703,11 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 * we increment the refcnt. If this is the case abort with an
 * error.
 */
-   verdict = bpf_prog_inc_not_zero(stab->bpf_verdict);
+   verdict = bpf_prog_inc_not_zero(verdict);
if (IS_ERR(verdict))
return PTR_ERR(verdict);
 
-   parse = bpf_prog_inc_not_zero(stab->bpf_parse);
+   parse = bpf_prog_inc_not_zero(parse);
if (IS_ERR(parse)) {
bpf_prog_put(verdict);
return PTR_ERR(parse);



Re: [PATCH bpf-next v6 1/4] bpf: sockmap, refactor sockmap routines to work with hashmap

2018-05-16 Thread John Fastabend
On 05/15/2018 12:19 PM, Daniel Borkmann wrote:
> On 05/14/2018 07:00 PM, John Fastabend wrote:
> [...]


[...]

> 
> As you say in the comment above the function wrt locking notes that the
> __sock_map_ctx_update_elem() can be called concurrently.
> 
>   All operations operate on sock_map using cmpxchg and xchg operations to 
> ensure we
>   do not get stale references. Any reads into the map must be done with 
> READ_ONCE()
>   because of this.
> 
> You initially use the READ_ONCE() on the verdict/parse/tx_msg, but later on 
> when
> grabbing the reference you use again progs->bpf_verdict/bpf_parse/bpf_tx_msg 
> which
> would potentially refetch it, but if updates would happen concurrently e.g. 
> to the
> three progs, they could be NULL in the mean-time, no? bpf_prog_inc_not_zero() 
> would
> then crash. Why are not the ones used that you fetched previously via 
> READ_ONCE()
> for taking the ref?

Nice catch. We should use the reference fetched by READ_ONCE in all cases.

> 
> The second question I had is that verdict/parse/tx_msg are updated 
> independently
> from each other and each could be NULL or non-NULL. What if, say, parse is 
> NULL
> and verdict as well as tx_msg is non-NULL and the bpf_prog_inc_not_zero() on 
> the
> tx_msg prog fails. Doesn't this cause a use-after-free since a ref on verdict 
> wasn't
> taken earlier but the bpf_prog_put() will cause accidental misbalance/free of 
> the
> progs?

Also good catch. I'll send patches for both now. Thanks.

> 
> It would probably help to clarify the locking comment a bit more if indeed the
> above should be okay as is.
> 
> Thanks,
> Daniel
> 



Re: [PATCH 00/14] Modify action API for implementing lockless actions

2018-05-16 Thread Vlad Buslov

On Wed 16 May 2018 at 18:10, Davide Caratti  wrote:
> On Wed, 2018-05-16 at 13:36 -0400, Roman Mashak wrote:
>> Vlad Buslov  writes:
>> 
>> > On Wed 16 May 2018 at 14:38, Roman Mashak  wrote:
>> > > On Wed, May 16, 2018 at 2:43 AM, Vlad Buslov  wrote:
>> > > > > > > > I'm trying to run tdc, but keep getting following error even 
>> > > > > > > > on clean
>> > > > > > > > branch without my patches:
>> > > > > > > 
>> > > > > > > Vlad, not sure if you saw my email:
>> > > > > > > Apply Roman's patch and try again
>> > > > > > > 
>> > > > > > > https://marc.info/?l=linux-netdev=152639369112020=2
>> > > > > > > 
>> > > > > > > cheers,
>> > > > > > > jamal
>> > > > > > 
>> > > > > > With patch applied I get following error:
>> > > > > > 
>> > > > > > Test 7d50: Add skbmod action to set destination mac
>> > > > > > exit: 255 0
>> > > > > > dst MAC address <11:22:33:44:55:66>
>> > > > > > RTNETLINK answers: No such file or directory
>> > > > > > We have an error talking to the kernel
>> > > > > > 
>> > > > > 
>> > > > > You may actually have broken something with your patches in this 
>> > > > > case.
>> > > > 
>> > > > Results is for net-next without my patches.
>> > > 
>> > > Do you have skbmod compiled in kernel or as a module?
>> > 
>> > Thanks, already figured out that default config has some actions
>> > disabled.
>> > Have more errors now. Everything related to ife:
>> > 
>> > Test 7682: Create valid ife encode action with mark and pass control
>> > exit: 255 0
>> > IFE type 0xED3E
>> > RTNETLINK answers: No such file or directory
>> > We have an error talking to the kernel
>> > 
>> > Test ef47: Create valid ife encode action with mark and pipe control
>> > exit: 255 0
>> > IFE type 0xED3E
>> > RTNETLINK answers: No space left on device
>> > We have an error talking to the kernel
>> > 
>> > Test df43: Create valid ife encode action with mark and continue control
>> > exit: 255 0
>> > IFE type 0xED3E
>> > RTNETLINK answers: No space left on device
>> > We have an error talking to the kernel
>> > 
>> > Test e4cf: Create valid ife encode action with mark and drop control
>> > exit: 255 0
>> > IFE type 0xED3E
>> > RTNETLINK answers: No space left on device
>> > We have an error talking to the kernel
>> > 
>> > Test ccba: Create valid ife encode action with mark and reclassify control
>> > exit: 255 0
>> > IFE type 0xED3E
>> > RTNETLINK answers: No space left on device
>> > We have an error talking to the kernel
>> > 
>> > Test a1cf: Create valid ife encode action with mark and jump control
>> > exit: 255 0
>> > IFE type 0xED3E
>> > RTNETLINK answers: No space left on device
>> > We have an error talking to the kernel
>> > 
>> > ...
>> > 
>> > 
>> 
>> Please make sure you have these in your kernel config:
>> 
>> CONFIG_NET_ACT_IFE=y
>> CONFIG_NET_IFE_SKBMARK=m
>> CONFIG_NET_IFE_SKBPRIO=m
>> CONFIG_NET_IFE_SKBTCINDEX=m
>> 
>> For tdc to run all the tests, it is assumed that all the supported tc
>> actions/filters are enabled and compiled.
> hello,
>
> looking at ife.json, it seems that we have at least 4 typos in
> 'teardown'. 
>
> It does
>
> $TC actions flush action skbedit
>
> in place of 
>
> $TC actions flush action ife
>
> On my fedora28 (with fedora28 kernel), fixing them made test 7682 return
> 'ok' (and all others in ife category, except ee94, 7ee0 and 0a7d).
>
> regards,

I can confirm that on the net-next kernel version I use, there are also
multiple teardowns of action type skbedit after actually creating an ife
action in file ife.json. However, the tests pass once I enabled the config
options that Roman suggested:

ok 119 - 7682 # Create valid ife encode action with mark and pass control


Re: [PATCH iproute2-next] tc-netem: fix limit description in man page

2018-05-16 Thread Stephen Hemminger
On Wed, 16 May 2018 15:17:50 -0600
David Ahern  wrote:

> On 5/15/18 6:49 PM, Marcelo Ricardo Leitner wrote:
> > As the kernel code says, limit is actually the amount of packets it can
> > hold queued at a time, as per:
> > 
> > static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> >  struct sk_buff **to_free)
> > {
> > ...
> > if (unlikely(sch->q.qlen >= sch->limit))
> > return qdisc_drop_all(skb, sch, to_free);
> > 
> > So lets fix the description of the field in the man page.
> > 
> > Signed-off-by: Marcelo Ricardo Leitner 
> > ---
> >  man/man8/tc-netem.8 | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >   
> 
> applied to iproute2-next. Thanks,
> 

Since it is an error, I will put it in master.


Re: [PATCH 00/14] Modify action API for implementing lockless actions

2018-05-16 Thread Vlad Buslov

On Wed 16 May 2018 at 17:36, Roman Mashak  wrote:
> Vlad Buslov  writes:
>
>> On Wed 16 May 2018 at 14:38, Roman Mashak  wrote:
>>> On Wed, May 16, 2018 at 2:43 AM, Vlad Buslov  wrote:
 I'm trying to run tdc, but keep getting following error even on clean
 branch without my patches:
>>>
>>> Vlad, not sure if you saw my email:
>>> Apply Roman's patch and try again
>>>
>>> https://marc.info/?l=linux-netdev=152639369112020=2
>>>
>>> cheers,
>>> jamal
>>
>> With patch applied I get following error:
>>
>> Test 7d50: Add skbmod action to set destination mac
>> exit: 255 0
>> dst MAC address <11:22:33:44:55:66>
>> RTNETLINK answers: No such file or directory
>> We have an error talking to the kernel
>>
>
> You may actually have broken something with your patches in this case.

 Results is for net-next without my patches.
>>>
>>> Do you have skbmod compiled in kernel or as a module?
>>
>> Thanks, already figured out that default config has some actions
>> disabled.
>> Have more errors now. Everything related to ife:
>>
>> Test 7682: Create valid ife encode action with mark and pass control
>> exit: 255 0
>> IFE type 0xED3E
>> RTNETLINK answers: No such file or directory
>> We have an error talking to the kernel
>>
>> Test ef47: Create valid ife encode action with mark and pipe control
>> exit: 255 0
>> IFE type 0xED3E
>> RTNETLINK answers: No space left on device
>> We have an error talking to the kernel
>>
>> Test df43: Create valid ife encode action with mark and continue control
>> exit: 255 0
>> IFE type 0xED3E
>> RTNETLINK answers: No space left on device
>> We have an error talking to the kernel
>>
>> Test e4cf: Create valid ife encode action with mark and drop control
>> exit: 255 0
>> IFE type 0xED3E
>> RTNETLINK answers: No space left on device
>> We have an error talking to the kernel
>>
>> Test ccba: Create valid ife encode action with mark and reclassify control
>> exit: 255 0
>> IFE type 0xED3E
>> RTNETLINK answers: No space left on device
>> We have an error talking to the kernel
>>
>> Test a1cf: Create valid ife encode action with mark and jump control
>> exit: 255 0
>> IFE type 0xED3E
>> RTNETLINK answers: No space left on device
>> We have an error talking to the kernel
>>
>> ...
>>
>>
>
> Please make sure you have these in your kernel config:
>
> CONFIG_NET_ACT_IFE=y
> CONFIG_NET_IFE_SKBMARK=m
> CONFIG_NET_IFE_SKBPRIO=m
> CONFIG_NET_IFE_SKBTCINDEX=m
>
> For tdc to run all the tests, it is assumed that all the supported tc
> actions/filters are enabled and compiled.

Enabling these options allowed all ife tests to pass. Thanks!

Error in u32 test still appears however:

Test e9a3: Add u32 with source match

-> prepare stage *** Could not execute: "$TC qdisc add dev $DEV1 ingress"

-> prepare stage *** Error message: "Cannot find device "v0p1"


Re: [PATCH iproute2-next] tc-netem: fix limit description in man page

2018-05-16 Thread David Ahern
On 5/15/18 6:49 PM, Marcelo Ricardo Leitner wrote:
> As the kernel code says, limit is actually the amount of packets it can
> hold queued at a time, as per:
> 
> static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>  struct sk_buff **to_free)
> {
>   ...
> if (unlikely(sch->q.qlen >= sch->limit))
> return qdisc_drop_all(skb, sch, to_free);
> 
> So lets fix the description of the field in the man page.
> 
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  man/man8/tc-netem.8 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

applied to iproute2-next. Thanks,



Re: [PATCH net-next v12 2/7] sch_cake: Add ingress mode

2018-05-16 Thread Toke Høiland-Jørgensen
Cong Wang  writes:

> On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
>> +   if (tb[TCA_CAKE_AUTORATE]) {
>> +   if (!!nla_get_u32(tb[TCA_CAKE_AUTORATE]))
>> +   q->rate_flags |= CAKE_FLAG_AUTORATE_INGRESS;
>> +   else
>> +   q->rate_flags &= ~CAKE_FLAG_AUTORATE_INGRESS;
>> +   }
>> +
>> +   if (tb[TCA_CAKE_INGRESS]) {
>> +   if (!!nla_get_u32(tb[TCA_CAKE_INGRESS]))
>> +   q->rate_flags |= CAKE_FLAG_INGRESS;
>> +   else
>> +   q->rate_flags &= ~CAKE_FLAG_INGRESS;
>> +   }
>> +
>> if (tb[TCA_CAKE_MEMORY])
>> q->buffer_config_limit = nla_get_u32(tb[TCA_CAKE_MEMORY]);
>>
>> @@ -1559,6 +1628,14 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
>> if (nla_put_u32(skb, TCA_CAKE_MEMORY, q->buffer_config_limit))
>> goto nla_put_failure;
>>
>> +   if (nla_put_u32(skb, TCA_CAKE_AUTORATE,
>> +   !!(q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS)))
>> +   goto nla_put_failure;
>> +
>> +   if (nla_put_u32(skb, TCA_CAKE_INGRESS,
>> +   !!(q->rate_flags & CAKE_FLAG_INGRESS)))
>> +   goto nla_put_failure;
>> +
>
> Why do you want to dump each bit of the rate_flags separately rather than
> dumping the whole rate_flags as an integer?

Well, these were added one at a time, each as a new option. Isn't that
more or less congruent with how netlink attributes are supposed to be
used?

-Toke


Re: [PATCH net-next v12 4/7] sch_cake: Add NAT awareness to packet classifier

2018-05-16 Thread Toke Høiland-Jørgensen
Cong Wang  writes:

> On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
>> When CAKE is deployed on a gateway that also performs NAT (which is a
>> common deployment mode), the host fairness mechanism cannot distinguish
>> internal hosts from each other, and so fails to work correctly.
>>
>> To fix this, we add an optional NAT awareness mode, which will query the
>> kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
>> and use that in the flow and host hashing.
>>
>> When the shaper is enabled and the host is already performing NAT, the cost
>> of this lookup is negligible. However, in unlimited mode with no NAT being
>> performed, there is a significant CPU cost at higher bandwidths. For this
>> reason, the feature is turned off by default.
>>
>> Signed-off-by: Toke Høiland-Jørgensen 
>> ---
>>  net/sched/sch_cake.c |   73 ++
>>  1 file changed, 73 insertions(+)
>>
>> diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
>> index 65439b643c92..e1038a7b6686 100644
>> --- a/net/sched/sch_cake.c
>> +++ b/net/sched/sch_cake.c
>> @@ -71,6 +71,12 @@
>>  #include 
>>  #include 
>>
>> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
>> +#include 
>> +#include 
>> +#include 
>> +#endif
>> +
>>  #define CAKE_SET_WAYS (8)
>>  #define CAKE_MAX_TINS (8)
>>  #define CAKE_QUEUES (1024)
>> @@ -514,6 +520,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
>> return drop;
>>  }
>>
>> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
>> +
>> +static void cake_update_flowkeys(struct flow_keys *keys,
>> +const struct sk_buff *skb)
>> +{
>> +   const struct nf_conntrack_tuple *tuple;
>> +   enum ip_conntrack_info ctinfo;
>> +   struct nf_conn *ct;
>> +   bool rev = false;
>> +
>> +   if (tc_skb_protocol(skb) != htons(ETH_P_IP))
>> +   return;
>> +
>> +   ct = nf_ct_get(skb, &ctinfo);
>> +   if (ct) {
>> +   tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
>> +   } else {
>> +   const struct nf_conntrack_tuple_hash *hash;
>> +   struct nf_conntrack_tuple srctuple;
>> +
>> +   if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
>> +  NFPROTO_IPV4, dev_net(skb->dev),
>> +  &srctuple))
>> +   return;
>> +
>> +   hash = nf_conntrack_find_get(dev_net(skb->dev),
>> +&nf_ct_zone_dflt,
>> +&srctuple);
>> +   if (!hash)
>> +   return;
>> +
>> +   rev = true;
>> +   ct = nf_ct_tuplehash_to_ctrack(hash);
>> +   tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
>> +   }
>> +
>> +   keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
>> +   keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
>> +
>> +   if (keys->ports.ports) {
>> +   keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
>> +   keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
>> +   }
>> +   if (rev)
>> +   nf_ct_put(ct);
>> +}
>> +#else
>> +static void cake_update_flowkeys(struct flow_keys *keys,
>> +const struct sk_buff *skb)
>> +{
>> +   /* There is nothing we can do here without CONNTRACK */
>> +}
>> +#endif
>> +
>>  /* Cake has several subtle multiple bit settings. In these cases you
>>   *  would be matching triple isolate mode as well.
>>   */
>> @@ -541,6 +601,9 @@ static u32 cake_hash(struct cake_tin_data *q, const struct sk_buff *skb,
>> skb_flow_dissect_flow_keys(skb, &keys,
>>FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
>>
>> +   if (flow_mode & CAKE_FLOW_NAT_FLAG)
>> +   cake_update_flowkeys(&keys, skb);
>> +
>> /* flow_hash_from_keys() sorts the addresses by value, so we have
>>  * to preserve their order in a separate data structure to treat
>>  * src and dst host addresses as independently selectable.
>> @@ -1727,6 +1790,12 @@ static int cake_change(struct Qdisc *sch, struct nlattr *opt,
>> q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) &
>> CAKE_FLOW_MASK);
>>
>> +   if (tb[TCA_CAKE_NAT]) {
>> +   q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
>> +   q->flow_mode |= CAKE_FLOW_NAT_FLAG *
>> +   !!nla_get_u32(tb[TCA_CAKE_NAT]);
>> +   }
>
>
> I think it's better to return -EOPNOTSUPP when CONFIG_NF_CONNTRACK
> is not enabled.

Good point, will fix :)

-Toke


Re: [PATCH bpf-next] bpf: fix sock hashmap kmalloc warning

2018-05-16 Thread John Fastabend
On 05/16/2018 02:06 PM, Yonghong Song wrote:
> syzbot reported a kernel warning below:
>   WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 
> mm/slab_common.c:996
>   Kernel panic - not syncing: panic_on_warn set ...
> 
>   CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9
>   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
>   Call Trace:
>__dump_stack lib/dump_stack.c:77 [inline]
>dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>panic+0x22f/0x4de kernel/panic.c:184
>__warn.cold.8+0x163/0x1b3 kernel/panic.c:536
>report_bug+0x252/0x2d0 lib/bug.c:186
>fixup_bug arch/x86/kernel/traps.c:178 [inline]
>do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
>do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
>   RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996
>   RSP: 0018:8801d907fc58 EFLAGS: 00010246
>   RAX:  RBX: 8801aeecb280 RCX: 8185ebd7
>   RDX:  RSI:  RDI: ffe1
>   RBP: 8801d907fc58 R08: 8801adb5e1c0 R09: ed0035a84700
>   R10: ed0035a84700 R11: 8801ad423803 R12: 8801aeecb280
>   R13: fff4 R14: 8801ad891a00 R15: 014200c0
>__do_kmalloc mm/slab.c:3713 [inline]
>__kmalloc+0x25/0x760 mm/slab.c:3727
>kmalloc include/linux/slab.h:517 [inline]
>map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858
>__do_sys_bpf kernel/bpf/syscall.c:2131 [inline]
>__se_sys_bpf kernel/bpf/syscall.c:2096 [inline]
>__x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096
>do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> The test case is against sock hashmap with a key size 0xffe1.
> Such a large key size will cause the below code in function
> sock_hash_alloc() to overflow and produce a smaller elem_size,
> so map creation will succeed.
> htab->elem_size = sizeof(struct htab_elem) +
>   round_up(htab->map.key_size, 8);
> 
> Later, when map_get_next_key is called and the kernel fails
> to allocate the key, it will issue the above warning.
> 
> Similar to hashtab, ensure the key size is at most
> MAX_BPF_STACK for a successful map creation.
> 
> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> Reported-by: syzbot+e4566d29080e7f346...@syzkaller.appspotmail.com
> Signed-off-by: Yonghong Song 
> ---
>  kernel/bpf/sockmap.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 56879c9fd3a4..79f5e899 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -1990,6 +1990,12 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr 
> *attr)
>   attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
>   return ERR_PTR(-EINVAL);
>  
> + if (attr->key_size > MAX_BPF_STACK)
> + /* eBPF programs initialize keys on stack, so they cannot be
> +  * larger than max stack size
> +  */
> + return ERR_PTR(-E2BIG);
> +
>   err = bpf_tcp_ulp_register();
>   if (err && err != -EEXIST)
>   return ERR_PTR(err);
> 

Thanks!

Acked-by: John Fastabend 


Re: [iproute2-next v2 1/1] tipc: fixed node and name table listings

2018-05-16 Thread David Ahern
On 5/15/18 7:54 AM, Jon Maloy wrote:
> We make it easier for users to correlate 128-bit node
> identities and 32-bit node hash numbers by extending the 'node list'
> command to also show the hash number.
> 
> We also improve the 'nametable show' command to show the node identity
> instead of the node hash number. Since the former potentially is much
> longer than the latter, we make room for it by eliminating the (to the
> user) irrelevant publication key. We also reorder some of the columns so
> that the node id comes last, since this looks nicer and is more logical.
> 
> ---
> v2: Fixed compiler warning as per comment from David Ahern
> 
> Signed-off-by: Jon Maloy 
> ---
>  tipc/misc.c  | 18 ++
>  tipc/misc.h  |  1 +
>  tipc/nametable.c | 18 ++
>  tipc/node.c  | 19 ---
>  tipc/peer.c  |  4 
>  5 files changed, 41 insertions(+), 19 deletions(-)
> 
> diff --git a/tipc/misc.c b/tipc/misc.c
> index 16849f1..e8b726f 100644
> --- a/tipc/misc.c
> +++ b/tipc/misc.c
> @@ -13,6 +13,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>  #include "misc.h"
>  
>  #define IN_RANGE(val, low, high) ((val) <= (high) && (val) >= (low))
> @@ -109,3 +112,18 @@ void nodeid2str(uint8_t *id, char *str)
>   for (i = 31; str[i] == '0'; i--)
>   str[i] = 0;
>  }
> +
> +void hash2nodestr(uint32_t hash, char *str)
> +{
> + struct tipc_sioc_nodeid_req nr = {};
> + int sd;
> +
> + sd = socket(AF_TIPC, SOCK_RDM, 0);
> + if (sd < 0) {
> + fprintf(stderr, "opening TIPC socket: %s\n", strerror(errno));
> + return;
> + }
> + nr.peer = hash;
> + if (!ioctl(sd, SIOCGETNODEID, &nr))
> + nodeid2str((uint8_t *)nr.node_id, str);
> +}

you are leaking sd



Re: [PATCH net-next v12 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-16 Thread Toke Høiland-Jørgensen
Cong Wang  writes:

> On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
>> +
>> +static struct Qdisc *cake_leaf(struct Qdisc *sch, unsigned long arg)
>> +{
>> +   return NULL;
>> +}
>> +
>> +static unsigned long cake_find(struct Qdisc *sch, u32 classid)
>> +{
>> +   return 0;
>> +}
>> +
>> +static void cake_walk(struct Qdisc *sch, struct qdisc_walker *arg)
>> +{
>> +}
>
>
> Thanks for adding the support to other TC filters, it is much better
> now!

You're welcome. Turned out not to be that hard :)

> A quick question: why class_ops->dump_stats is still NULL?
>
> It is supposed to dump the stats of each flow. Is there still any
> difficulty to map it to tc class? I thought you figured it out when
> you added the tcf_classify().

On the classify side, I solved the "multiple sets of queues" problem by
using skb->priority to select the tin (diffserv tier) and the classifier
output to select the queue within that tin. This would not work for
dumping stats; some other way of mapping queues to the linear class
space would be needed. And since we are not actually collecting any
per-flow stats that I could print, I thought it wasn't worth coming up
with a half-baked proposal for this just to add an API hook that no one
in the existing CAKE user base has ever asked for...

-Toke


Re: [PATCH net-next v12 2/7] sch_cake: Add ingress mode

2018-05-16 Thread Cong Wang
On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
> +   if (tb[TCA_CAKE_AUTORATE]) {
> +   if (!!nla_get_u32(tb[TCA_CAKE_AUTORATE]))
> +   q->rate_flags |= CAKE_FLAG_AUTORATE_INGRESS;
> +   else
> +   q->rate_flags &= ~CAKE_FLAG_AUTORATE_INGRESS;
> +   }
> +
> +   if (tb[TCA_CAKE_INGRESS]) {
> +   if (!!nla_get_u32(tb[TCA_CAKE_INGRESS]))
> +   q->rate_flags |= CAKE_FLAG_INGRESS;
> +   else
> +   q->rate_flags &= ~CAKE_FLAG_INGRESS;
> +   }
> +
> if (tb[TCA_CAKE_MEMORY])
> q->buffer_config_limit = nla_get_u32(tb[TCA_CAKE_MEMORY]);
>
> @@ -1559,6 +1628,14 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff 
> *skb)
> if (nla_put_u32(skb, TCA_CAKE_MEMORY, q->buffer_config_limit))
> goto nla_put_failure;
>
> +   if (nla_put_u32(skb, TCA_CAKE_AUTORATE,
> +   !!(q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS)))
> +   goto nla_put_failure;
> +
> +   if (nla_put_u32(skb, TCA_CAKE_INGRESS,
> +   !!(q->rate_flags & CAKE_FLAG_INGRESS)))
> +   goto nla_put_failure;
> +

Why do you want to dump each bit of the rate_flags separately rather than
dumping the whole rate_flags as an integer?


[PATCH bpf-next] bpf: fix sock hashmap kmalloc warning

2018-05-16 Thread Yonghong Song
syzbot reported a kernel warning below:
  WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 
mm/slab_common.c:996
  Kernel panic - not syncing: panic_on_warn set ...

  CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x1b9/0x294 lib/dump_stack.c:113
   panic+0x22f/0x4de kernel/panic.c:184
   __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
   report_bug+0x252/0x2d0 lib/bug.c:186
   fixup_bug arch/x86/kernel/traps.c:178 [inline]
   do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
   do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
   invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
  RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996
  RSP: 0018:8801d907fc58 EFLAGS: 00010246
  RAX:  RBX: 8801aeecb280 RCX: 8185ebd7
  RDX:  RSI:  RDI: ffe1
  RBP: 8801d907fc58 R08: 8801adb5e1c0 R09: ed0035a84700
  R10: ed0035a84700 R11: 8801ad423803 R12: 8801aeecb280
  R13: fff4 R14: 8801ad891a00 R15: 014200c0
   __do_kmalloc mm/slab.c:3713 [inline]
   __kmalloc+0x25/0x760 mm/slab.c:3727
   kmalloc include/linux/slab.h:517 [inline]
   map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858
   __do_sys_bpf kernel/bpf/syscall.c:2131 [inline]
   __se_sys_bpf kernel/bpf/syscall.c:2096 [inline]
   __x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096
   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

The test case is against sock hashmap with a key size 0xffe1.
Such a large key size will cause the below code in function
sock_hash_alloc() to overflow and produce a smaller elem_size,
so map creation will succeed.
htab->elem_size = sizeof(struct htab_elem) +
  round_up(htab->map.key_size, 8);

Later, when map_get_next_key is called and the kernel fails
to allocate the key, it will issue the above warning.

Similar to hashtab, ensure the key size is at most
MAX_BPF_STACK for a successful map creation.

Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Reported-by: syzbot+e4566d29080e7f346...@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song 
---
 kernel/bpf/sockmap.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 56879c9fd3a4..79f5e899 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1990,6 +1990,12 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr 
*attr)
attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
return ERR_PTR(-EINVAL);
 
+   if (attr->key_size > MAX_BPF_STACK)
+   /* eBPF programs initialize keys on stack, so they cannot be
+* larger than max stack size
+*/
+   return ERR_PTR(-E2BIG);
+
err = bpf_tcp_ulp_register();
if (err && err != -EEXIST)
return ERR_PTR(err);
-- 
2.14.3



[PATCH v3 1/2] media: rc: introduce BPF_PROG_RAWIR_EVENT

2018-05-16 Thread Sean Young
Add support for BPF_PROG_RAWIR_EVENT. This type of BPF program can call
rc_keydown() to report decoded IR scancodes, or rc_repeat() to report
that the last key should be repeated.

The bpf program can be attached to the rc device using the bpf(BPF_PROG_ATTACH) syscall;
the target_fd must be the /dev/lircN device.

Signed-off-by: Sean Young 
---
 drivers/media/rc/Kconfig   |  13 ++
 drivers/media/rc/Makefile  |   1 +
 drivers/media/rc/bpf-rawir-event.c | 363 +
 drivers/media/rc/lirc_dev.c|  24 ++
 drivers/media/rc/rc-core-priv.h|  24 ++
 drivers/media/rc/rc-ir-raw.c   |  14 +-
 include/linux/bpf_rcdev.h  |  30 +++
 include/linux/bpf_types.h  |   3 +
 include/uapi/linux/bpf.h   |  55 -
 kernel/bpf/syscall.c   |   7 +
 10 files changed, 531 insertions(+), 3 deletions(-)
 create mode 100644 drivers/media/rc/bpf-rawir-event.c
 create mode 100644 include/linux/bpf_rcdev.h

diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig
index eb2c3b6eca7f..2172d65b0213 100644
--- a/drivers/media/rc/Kconfig
+++ b/drivers/media/rc/Kconfig
@@ -25,6 +25,19 @@ config LIRC
   passes raw IR to and from userspace, which is needed for
   IR transmitting (aka "blasting") and for the lirc daemon.
 
+config BPF_RAWIR_EVENT
+   bool "Support for eBPF programs attached to lirc devices"
+   depends on BPF_SYSCALL
+   depends on RC_CORE=y
+   depends on LIRC
+   help
+  Allow attaching eBPF programs to a lirc device using the bpf(2)
+  syscall command BPF_PROG_ATTACH. This is supported for raw IR
+  receivers.
+
+  These eBPF programs can be used to decode IR into scancodes, for
+  IR protocols not supported by the kernel decoders.
+
 menuconfig RC_DECODERS
bool "Remote controller decoders"
depends on RC_CORE
diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile
index 2e1c87066f6c..74907823bef8 100644
--- a/drivers/media/rc/Makefile
+++ b/drivers/media/rc/Makefile
@@ -5,6 +5,7 @@ obj-y += keymaps/
 obj-$(CONFIG_RC_CORE) += rc-core.o
 rc-core-y := rc-main.o rc-ir-raw.o
 rc-core-$(CONFIG_LIRC) += lirc_dev.o
+rc-core-$(CONFIG_BPF_RAWIR_EVENT) += bpf-rawir-event.o
 obj-$(CONFIG_IR_NEC_DECODER) += ir-nec-decoder.o
 obj-$(CONFIG_IR_RC5_DECODER) += ir-rc5-decoder.o
 obj-$(CONFIG_IR_RC6_DECODER) += ir-rc6-decoder.o
diff --git a/drivers/media/rc/bpf-rawir-event.c 
b/drivers/media/rc/bpf-rawir-event.c
new file mode 100644
index ..7cb48b8d87b5
--- /dev/null
+++ b/drivers/media/rc/bpf-rawir-event.c
@@ -0,0 +1,363 @@
+// SPDX-License-Identifier: GPL-2.0
+// bpf-rawir-event.c - handles bpf
+//
+// Copyright (C) 2018 Sean Young 
+
+#include 
+#include 
+#include 
+#include "rc-core-priv.h"
+
+/*
+ * BPF interface for raw IR
+ */
+const struct bpf_prog_ops rawir_event_prog_ops = {
+};
+
+BPF_CALL_1(bpf_rc_repeat, struct bpf_rawir_event*, event)
+{
+   struct ir_raw_event_ctrl *ctrl;
+
+   ctrl = container_of(event, struct ir_raw_event_ctrl, bpf_rawir_event);
+
+   rc_repeat(ctrl->dev);
+
+   return 0;
+}
+
+static const struct bpf_func_proto rc_repeat_proto = {
+   .func  = bpf_rc_repeat,
+   .gpl_only  = true, /* rc_repeat is EXPORT_SYMBOL_GPL */
+   .ret_type  = RET_INTEGER,
+   .arg1_type = ARG_PTR_TO_CTX,
+};
+
+BPF_CALL_4(bpf_rc_keydown, struct bpf_rawir_event*, event, u32, protocol,
+  u32, scancode, u32, toggle)
+{
+   struct ir_raw_event_ctrl *ctrl;
+
+   ctrl = container_of(event, struct ir_raw_event_ctrl, bpf_rawir_event);
+
+   rc_keydown(ctrl->dev, protocol, scancode, toggle != 0);
+
+   return 0;
+}
+
+static const struct bpf_func_proto rc_keydown_proto = {
+   .func  = bpf_rc_keydown,
+   .gpl_only  = true, /* rc_keydown is EXPORT_SYMBOL_GPL */
+   .ret_type  = RET_INTEGER,
+   .arg1_type = ARG_PTR_TO_CTX,
+   .arg2_type = ARG_ANYTHING,
+   .arg3_type = ARG_ANYTHING,
+   .arg4_type = ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto *
+rawir_event_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+   switch (func_id) {
+   case BPF_FUNC_rc_repeat:
+   return &rc_repeat_proto;
+   case BPF_FUNC_rc_keydown:
+   return &rc_keydown_proto;
+   case BPF_FUNC_map_lookup_elem:
+   return &bpf_map_lookup_elem_proto;
+   case BPF_FUNC_map_update_elem:
+   return &bpf_map_update_elem_proto;
+   case BPF_FUNC_map_delete_elem:
+   return &bpf_map_delete_elem_proto;
+   case BPF_FUNC_ktime_get_ns:
+   return &bpf_ktime_get_ns_proto;
+   case BPF_FUNC_tail_call:
+   return &bpf_tail_call_proto;
+   case BPF_FUNC_get_prandom_u32:
+   return &bpf_get_prandom_u32_proto;
+   case BPF_FUNC_trace_printk:
+   if (capable(CAP_SYS_ADMIN))
+   return 

[PATCH v3 2/2] bpf: add selftest for rawir_event type program

2018-05-16 Thread Sean Young
This is a simple test over rc-loopback.

Signed-off-by: Sean Young 
---
 tools/bpf/bpftool/prog.c  |   1 +
 tools/include/uapi/linux/bpf.h|  57 +++-
 tools/lib/bpf/libbpf.c|   1 +
 tools/testing/selftests/bpf/Makefile  |   8 +-
 tools/testing/selftests/bpf/bpf_helpers.h |   6 +
 tools/testing/selftests/bpf/test_rawir.sh |  37 +
 .../selftests/bpf/test_rawir_event_kern.c |  26 
 .../selftests/bpf/test_rawir_event_user.c | 130 ++
 8 files changed, 261 insertions(+), 5 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/test_rawir.sh
 create mode 100644 tools/testing/selftests/bpf/test_rawir_event_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_rawir_event_user.c

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 9bdfdf2d3fbe..8889a4ee8577 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -71,6 +71,7 @@ static const char * const prog_type_name[] = {
[BPF_PROG_TYPE_SK_MSG]  = "sk_msg",
[BPF_PROG_TYPE_RAW_TRACEPOINT]  = "raw_tracepoint",
[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
+   [BPF_PROG_TYPE_RAWIR_EVENT] = "rawir_event",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1205d86a7a29..243e141e8a5b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -141,6 +141,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SK_MSG,
BPF_PROG_TYPE_RAW_TRACEPOINT,
BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+   BPF_PROG_TYPE_RAWIR_EVENT,
 };
 
 enum bpf_attach_type {
@@ -158,6 +159,7 @@ enum bpf_attach_type {
BPF_CGROUP_INET6_CONNECT,
BPF_CGROUP_INET4_POST_BIND,
BPF_CGROUP_INET6_POST_BIND,
+   BPF_RAWIR_EVENT,
__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1829,7 +1831,6 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- *
  * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 
flags)
  * Description
  * Do FIB lookup in kernel tables using parameters in *params*.
@@ -1856,6 +1857,7 @@ union bpf_attr {
  * Egress device index on success, 0 if packet needs to continue
  * up the stack for further processing or a negative error in case
  * of failure.
+ *
  * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map 
*map, void *key, u64 flags)
  * Description
  * Add an entry to, or update a sockhash *map* referencing sockets.
@@ -1902,6 +1904,35 @@ union bpf_attr {
  * egress otherwise). This is the only flag supported for now.
  * Return
  * **SK_PASS** on success, or **SK_DROP** on error.
+ *
+ * int bpf_rc_keydown(void *ctx, u32 protocol, u32 scancode, u32 toggle)
+ * Description
+ * Report decoded scancode with toggle value. For use in
+ * BPF_PROG_TYPE_RAWIR_EVENT, to report a successfully
+ * decoded scancode. This will generate a keydown event,
+ * and a keyup event once the scancode is no longer repeated.
+ *
+ * *ctx* pointer to bpf_rawir_event, *protocol* is decoded
+ * protocol (see RC_PROTO_* enum).
+ *
+ * Some protocols include a toggle bit, in case the button
+ * was released and pressed again between consecutive scancodes,
+ * copy this bit into *toggle* if it exists, else set to 0.
+ *
+ * Return
+ * Always return 0 (for now)
+ *
+ * int bpf_rc_repeat(void *ctx)
+ * Description
+ * Repeat the last decoded scancode; some IR protocols like
+ * NEC have a special IR message for repeat last button,
+ * in case user is holding a button down; the scancode is
+ * not repeated.
+ *
+ * *ctx* pointer to bpf_rawir_event.
+ *
+ * Return
+ * Always return 0 (for now)
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -1976,7 +2007,9 @@ union bpf_attr {
FN(fib_lookup), \
FN(sock_hash_update),   \
FN(msg_redirect_hash),  \
-   FN(sk_redirect_hash),
+   FN(sk_redirect_hash),   \
+   FN(rc_repeat),  \
+   FN(rc_keydown),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2043,6 +2076,26 @@ enum bpf_hdr_start_off {
BPF_HDR_START_NET,
 };
 
+/*
+ * user accessible mirror of in-kernel ir_raw_event
+ */
+#define BPF_RAWIR_EVENT_SPACE  0
+#define BPF_RAWIR_EVENT_PULSE  1
+#define BPF_RAWIR_EVENT_TIMEOUT2
+#define BPF_RAWIR_EVENT_RESET  3
+#define BPF_RAWIR_EVENT_CARRIER 

[PATCH v3 0/2] IR decoding using BPF

2018-05-16 Thread Sean Young
The kernel IR decoders (drivers/media/rc/ir-*-decoder.c) support the most
widely used IR protocols, but there are many protocols which are not
supported[1]. For example, the lirc-remotes[2] repo has over 2700 remotes,
many of which are not supported by rc-core. There is a "long tail" of
unsupported IR protocols, for which lircd is needed to decode the IR.

IR encoding is done in such a way that some simple circuit can decode it;
therefore, bpf is ideal.

In order to support all these protocols, here we have bpf based IR decoding.
The idea is that user-space can define a decoder in bpf, attach it to
the rc device through the lirc chardev.

Separate work is underway to extend ir-keytable to have an extensive library
of bpf-based decoders, and a much expanded library of rc keymaps.

Another future application would be to compile IRP[3] to an IR BPF program, and
so support virtually every remote without having to write a decoder for each.
It might also be possible to support non-button devices such as analog
directional pads or air conditioning remote controls and decode the target
temperature in bpf, and pass that to an input device.

Thanks,

Sean Young

[1] http://www.hifi-remote.com/wiki/index.php?title=DecodeIR
[2] https://sourceforge.net/p/lirc-remotes/code/ci/master/tree/remotes/
[3] http://www.hifi-remote.com/wiki/index.php?title=IRP_Notation

Changes since v2:
 - Fixed locking issues
 - Improved self-test to cover more cases
 - Rebased on bpf-next again

Changes since v1:
 - Code review comments from Y Song  and
   Randy Dunlap 
 - Re-wrote sample bpf to be selftest
 - Renamed RAWIR_DECODER -> RAWIR_EVENT (Kconfig, context, bpf prog type)
 - Rebase on bpf-next
 - Introduced bpf_rawir_event context structure with simpler access checking

Sean Young (2):
  media: rc: introduce BPF_PROG_RAWIR_EVENT
  bpf: add selftest for rawir_event type program

 drivers/media/rc/Kconfig  |  13 +
 drivers/media/rc/Makefile |   1 +
 drivers/media/rc/bpf-rawir-event.c| 363 ++
 drivers/media/rc/lirc_dev.c   |  24 ++
 drivers/media/rc/rc-core-priv.h   |  24 ++
 drivers/media/rc/rc-ir-raw.c  |  14 +-
 include/linux/bpf_rcdev.h |  30 ++
 include/linux/bpf_types.h |   3 +
 include/uapi/linux/bpf.h  |  55 ++-
 kernel/bpf/syscall.c  |   7 +
 tools/bpf/bpftool/prog.c  |   1 +
 tools/include/uapi/linux/bpf.h|  57 ++-
 tools/lib/bpf/libbpf.c|   1 +
 tools/testing/selftests/bpf/Makefile  |   8 +-
 tools/testing/selftests/bpf/bpf_helpers.h |   6 +
 tools/testing/selftests/bpf/test_rawir.sh |  37 ++
 .../selftests/bpf/test_rawir_event_kern.c |  26 ++
 .../selftests/bpf/test_rawir_event_user.c | 130 +++
 18 files changed, 792 insertions(+), 8 deletions(-)
 create mode 100644 drivers/media/rc/bpf-rawir-event.c
 create mode 100644 include/linux/bpf_rcdev.h
 create mode 100755 tools/testing/selftests/bpf/test_rawir.sh
 create mode 100644 tools/testing/selftests/bpf/test_rawir_event_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_rawir_event_user.c

-- 
2.17.0



[PATCH bpf-next] libbpf: add ifindex to enable offload support

2018-05-16 Thread Jakub Kicinski
From: David Beckett 

BPF programs currently can only be offloaded using iproute2. This
patch will allow programs to be offloaded using libbpf calls.

Signed-off-by: David Beckett 
Reviewed-by: Jakub Kicinski 
---
 tools/lib/bpf/bpf.c|  2 ++
 tools/lib/bpf/bpf.h|  2 ++
 tools/lib/bpf/libbpf.c | 18 +++---
 tools/lib/bpf/libbpf.h |  1 +
 4 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index a3a8fb2ac697..6a8a00097fd8 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -91,6 +91,7 @@ int bpf_create_map_xattr(const struct bpf_create_map_attr 
*create_attr)
attr.btf_fd = create_attr->btf_fd;
attr.btf_key_id = create_attr->btf_key_id;
attr.btf_value_id = create_attr->btf_value_id;
+   attr.map_ifindex = create_attr->map_ifindex;
 
	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }
@@ -201,6 +202,7 @@ int bpf_load_program_xattr(const struct 
bpf_load_program_attr *load_attr,
attr.log_size = 0;
attr.log_level = 0;
attr.kern_version = load_attr->kern_version;
+   attr.prog_ifindex = load_attr->prog_ifindex;
memcpy(attr.prog_name, load_attr->name,
   min(name_len, BPF_OBJ_NAME_LEN - 1));
 
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index fb3a146d92ff..15bff7728cf1 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -38,6 +38,7 @@ struct bpf_create_map_attr {
__u32 btf_fd;
__u32 btf_key_id;
__u32 btf_value_id;
+   __u32 map_ifindex;
 };
 
 int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr);
@@ -64,6 +65,7 @@ struct bpf_load_program_attr {
size_t insns_cnt;
const char *license;
__u32 kern_version;
+   __u32 prog_ifindex;
 };
 
 /* Recommend log buffer size */
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index df54c4c9e48a..3dbe217bf23e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -178,6 +178,7 @@ struct bpf_program {
/* Index in elf obj file, for relocation use. */
int idx;
char *name;
+   int prog_ifindex;
char *section_name;
struct bpf_insn *insns;
size_t insns_cnt, main_prog_cnt;
@@ -213,6 +214,7 @@ struct bpf_map {
int fd;
char *name;
size_t offset;
+   int map_ifindex;
struct bpf_map_def def;
uint32_t btf_key_id;
uint32_t btf_value_id;
@@ -1091,6 +1093,7 @@ bpf_object__create_maps(struct bpf_object *obj)
		int *pfd = &map->fd;
 
create_attr.name = map->name;
+   create_attr.map_ifindex = map->map_ifindex;
create_attr.map_type = def->type;
create_attr.map_flags = def->map_flags;
create_attr.key_size = def->key_size;
@@ -1273,7 +1276,7 @@ static int bpf_object__collect_reloc(struct bpf_object 
*obj)
 static int
 load_program(enum bpf_prog_type type, enum bpf_attach_type 
expected_attach_type,
 const char *name, struct bpf_insn *insns, int insns_cnt,
-char *license, u32 kern_version, int *pfd)
+char *license, u32 kern_version, int *pfd, int prog_ifindex)
 {
struct bpf_load_program_attr load_attr;
char *log_buf;
@@ -1287,6 +1290,7 @@ load_program(enum bpf_prog_type type, enum 
bpf_attach_type expected_attach_type,
load_attr.insns_cnt = insns_cnt;
load_attr.license = license;
load_attr.kern_version = kern_version;
+   load_attr.prog_ifindex = prog_ifindex;
 
if (!load_attr.insns || !load_attr.insns_cnt)
return -EINVAL;
@@ -1368,7 +1372,8 @@ bpf_program__load(struct bpf_program *prog,
}
err = load_program(prog->type, prog->expected_attach_type,
   prog->name, prog->insns, prog->insns_cnt,
-  license, kern_version, &fd);
+  license, kern_version, &fd,
+  prog->prog_ifindex);
if (!err)
prog->instances.fds[0] = fd;
goto out;
@@ -1399,7 +1404,8 @@ bpf_program__load(struct bpf_program *prog,
err = load_program(prog->type, prog->expected_attach_type,
   prog->name, result.new_insn_ptr,
   result.new_insn_cnt,
-  license, kern_version, &fd);
+  license, kern_version, &fd,
+  prog->prog_ifindex);
 
if (err) {
pr_warning("Loading the %dth instance of program '%s' 
failed\n",
@@ -2188,6 +2194,7 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr 
*attr,
enum bpf_attach_type expected_attach_type;
enum bpf_prog_type 

Re: [PATCH 0/3] ibmvnic: Fix bugs and memory leaks

2018-05-16 Thread Thomas Falcon
On 05/16/2018 03:49 PM, Thomas Falcon wrote:
> This is a small patch series fixing up some bugs and memory leaks
> in the ibmvnic driver. The first fix frees up previously allocated
> memory that should be freed in case of an error. The second fixes
> a reset case that was failing due to TX/RX queue IRQ's being
> erroneously disabled without being enabled again. The final patch
> fixes incorrect reallocation of statistics buffers during a device
> reset, resulting in loss of statistics information and a memory leak.
>
> Thomas Falcon (3):
>   ibmvnic: Free coherent DMA memory if FW map failed
>   ibmvnic: Fix non-fatal firmware error reset
>   ibmvnic: Fix statistics buffers memory leak

Sorry, these are meant for the 'net' tree.

Tom

>
>  drivers/net/ethernet/ibm/ibmvnic.c | 28 +---
>  1 file changed, 17 insertions(+), 11 deletions(-)
>



Re: [PATCH net-next v12 4/7] sch_cake: Add NAT awareness to packet classifier

2018-05-16 Thread Cong Wang
On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
> When CAKE is deployed on a gateway that also performs NAT (which is a
> common deployment mode), the host fairness mechanism cannot distinguish
> internal hosts from each other, and so fails to work correctly.
>
> To fix this, we add an optional NAT awareness mode, which will query the
> kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
> and use that in the flow and host hashing.
>
> When the shaper is enabled and the host is already performing NAT, the cost
> of this lookup is negligible. However, in unlimited mode with no NAT being
> performed, there is a significant CPU cost at higher bandwidths. For this
> reason, the feature is turned off by default.
>
> Signed-off-by: Toke Høiland-Jørgensen 
> ---
>  net/sched/sch_cake.c |   73 
> ++
>  1 file changed, 73 insertions(+)
>
> diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
> index 65439b643c92..e1038a7b6686 100644
> --- a/net/sched/sch_cake.c
> +++ b/net/sched/sch_cake.c
> @@ -71,6 +71,12 @@
>  #include 
>  #include 
>
> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> +#include 
> +#include 
> +#include 
> +#endif
> +
>  #define CAKE_SET_WAYS (8)
>  #define CAKE_MAX_TINS (8)
>  #define CAKE_QUEUES (1024)
> @@ -514,6 +520,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
> return drop;
>  }
>
> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> +
> +static void cake_update_flowkeys(struct flow_keys *keys,
> +const struct sk_buff *skb)
> +{
> +   const struct nf_conntrack_tuple *tuple;
> +   enum ip_conntrack_info ctinfo;
> +   struct nf_conn *ct;
> +   bool rev = false;
> +
> +   if (tc_skb_protocol(skb) != htons(ETH_P_IP))
> +   return;
> +
> +   ct = nf_ct_get(skb, &ctinfo);
> +   if (ct) {
> +   tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
> +   } else {
> +   const struct nf_conntrack_tuple_hash *hash;
> +   struct nf_conntrack_tuple srctuple;
> +
> +   if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
> +  NFPROTO_IPV4, dev_net(skb->dev),
> +  &srctuple))
> +   return;
> +
> +   hash = nf_conntrack_find_get(dev_net(skb->dev),
> +&nf_ct_zone_dflt,
> +&srctuple);
> +   if (!hash)
> +   return;
> +
> +   rev = true;
> +   ct = nf_ct_tuplehash_to_ctrack(hash);
> +   tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
> +   }
> +
> +   keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
> +   keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
> +
> +   if (keys->ports.ports) {
> +   keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
> +   keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
> +   }
> +   if (rev)
> +   nf_ct_put(ct);
> +}
> +#else
> +static void cake_update_flowkeys(struct flow_keys *keys,
> +const struct sk_buff *skb)
> +{
> +   /* There is nothing we can do here without CONNTRACK */
> +}
> +#endif
> +
>  /* Cake has several subtle multiple bit settings. In these cases you
>   *  would be matching triple isolate mode as well.
>   */
> @@ -541,6 +601,9 @@ static u32 cake_hash(struct cake_tin_data *q, const 
> struct sk_buff *skb,
> skb_flow_dissect_flow_keys(skb, ,
>FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
>
> +   if (flow_mode & CAKE_FLOW_NAT_FLAG)
> +   cake_update_flowkeys(, skb);
> +
> /* flow_hash_from_keys() sorts the addresses by value, so we have
>  * to preserve their order in a separate data structure to treat
>  * src and dst host addresses as independently selectable.
> @@ -1727,6 +1790,12 @@ static int cake_change(struct Qdisc *sch, struct 
> nlattr *opt,
> q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) &
> CAKE_FLOW_MASK);
>
> +   if (tb[TCA_CAKE_NAT]) {
> +   q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
> +   q->flow_mode |= CAKE_FLOW_NAT_FLAG *
> +   !!nla_get_u32(tb[TCA_CAKE_NAT]);
> +   }


I think it's better to return -EOPNOTSUPP when CONFIG_NF_CONNTRACK
is not enabled.


> +
> if (tb[TCA_CAKE_RTT]) {
> q->interval = nla_get_u32(tb[TCA_CAKE_RTT]);
>
> @@ -1892,6 +1961,10 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff 
> *skb)
> if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
> goto nla_put_failure;
>
> +   if (nla_put_u32(skb, TCA_CAKE_NAT,
> +   !!(q->flow_mode & 

Re: [PATCH v2 net] net/ipv4: Initialize proto and ports in flow struct

2018-05-16 Thread Roopa Prabhu
On Wed, May 16, 2018 at 1:36 PM, David Ahern  wrote:
> Updating the FIB tracepoint for the recent change to allow rules using
> the protocol and ports exposed a few places where the entries in the flow
> struct are not initialized.
>
> For __fib_validate_source add the call to fib4_rules_early_flow_dissect
> since it is invoked for the input path. For netfilter, add the memset on
> the flow struct to avoid future problems like this. In ip_route_input_slow
> need to set the fields if the skb dissection does not happen.
>
> Fixes: bfff4862653b ("net: fib_rules: support for match on ip_proto, sport 
> and dport")
> Signed-off-by: David Ahern 
> ---

LGTM,
Acked-by: Roopa Prabhu 


[PATCH 1/3] ibmvnic: Free coherent DMA memory if FW map failed

2018-05-16 Thread Thomas Falcon
If the firmware map fails for whatever reason, remember to free
the allocated DMA memory afterwards.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 6e8d6a6..9e08917 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -192,6 +192,7 @@ static int alloc_long_term_buff(struct ibmvnic_adapter 
*adapter,
if (adapter->fw_done_rc) {
dev_err(dev, "Couldn't map long term buffer,rc = %d\n",
adapter->fw_done_rc);
+   dma_free_coherent(dev, ltb->size, ltb->buff, ltb->addr);
return -1;
}
return 0;
-- 
1.8.3.1



[PATCH 2/3] ibmvnic: Fix non-fatal firmware error reset

2018-05-16 Thread Thomas Falcon
It is not necessary to disable interrupt lines here during a reset
to handle a non-fatal firmware error. Move that call into the code
block that handles the other cases, which do require interrupts to be
disabled and re-enabled.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 9e08917..1b9c22f 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1822,9 +1822,8 @@ static int do_reset(struct ibmvnic_adapter *adapter,
if (rc)
return rc;
}
+   ibmvnic_disable_irqs(adapter);
}
-
-   ibmvnic_disable_irqs(adapter);
adapter->state = VNIC_CLOSED;
 
if (reset_state == VNIC_CLOSED)
-- 
1.8.3.1



[PATCH 3/3] ibmvnic: Fix statistics buffers memory leak

2018-05-16 Thread Thomas Falcon
Move initialization of statistics buffers from ibmvnic_init function
into ibmvnic_probe. In the current state, ibmvnic_init will be called
again during a device reset, resulting in the allocation of new
buffers without freeing the old ones.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 1b9c22f..4bb4646 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -4586,14 +4586,6 @@ static int ibmvnic_init(struct ibmvnic_adapter *adapter)
release_crq_queue(adapter);
}
 
-   rc = init_stats_buffers(adapter);
-   if (rc)
-   return rc;
-
-   rc = init_stats_token(adapter);
-   if (rc)
-   return rc;
-
return rc;
 }
 
@@ -4662,13 +4654,21 @@ static int ibmvnic_probe(struct vio_dev *dev, const 
struct vio_device_id *id)
goto ibmvnic_init_fail;
} while (rc == EAGAIN);
 
+   rc = init_stats_buffers(adapter);
+   if (rc)
+   goto ibmvnic_init_fail;
+
+   rc = init_stats_token(adapter);
+   if (rc)
+   goto ibmvnic_stats_fail;
+
netdev->mtu = adapter->req_mtu - ETH_HLEN;
netdev->min_mtu = adapter->min_mtu - ETH_HLEN;
netdev->max_mtu = adapter->max_mtu - ETH_HLEN;
 
rc = device_create_file(>dev, _attr_failover);
if (rc)
-   goto ibmvnic_init_fail;
+   goto ibmvnic_dev_file_err;
 
netif_carrier_off(netdev);
rc = register_netdev(netdev);
@@ -4687,6 +4687,12 @@ static int ibmvnic_probe(struct vio_dev *dev, const 
struct vio_device_id *id)
 ibmvnic_register_fail:
device_remove_file(>dev, _attr_failover);
 
+ibmvnic_dev_file_err:
+   release_stats_token(adapter);
+
+ibmvnic_stats_fail:
+   release_stats_buffers(adapter);
+
 ibmvnic_init_fail:
release_sub_crqs(adapter, 1);
release_crq_queue(adapter);
-- 
1.8.3.1



[PATCH 0/3] ibmvnic: Fix bugs and memory leaks

2018-05-16 Thread Thomas Falcon
This is a small patch series fixing up some bugs and memory leaks
in the ibmvnic driver. The first fix frees up previously allocated
memory that should be freed in case of an error. The second fixes
a reset case that was failing due to TX/RX queue IRQs being
erroneously disabled without being enabled again. The final patch
fixes incorrect reallocation of statistics buffers during a device
reset, resulting in loss of statistics information and a memory leak.

Thomas Falcon (3):
  ibmvnic: Free coherent DMA memory if FW map failed
  ibmvnic: Fix non-fatal firmware error reset
  ibmvnic: Fix statistics buffers memory leak

 drivers/net/ethernet/ibm/ibmvnic.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

-- 
1.8.3.1



Re: [PATCH net-next v12 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-16 Thread Cong Wang
On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
> +
> +static struct Qdisc *cake_leaf(struct Qdisc *sch, unsigned long arg)
> +{
> +   return NULL;
> +}
> +
> +static unsigned long cake_find(struct Qdisc *sch, u32 classid)
> +{
> +   return 0;
> +}
> +
> +static void cake_walk(struct Qdisc *sch, struct qdisc_walker *arg)
> +{
> +}


Thanks for adding the support to other TC filters, it is much better now!

A quick question: why class_ops->dump_stats is still NULL?

It is supposed to dump the stats of each flow. Is there still any difficulty
in mapping it to a tc class? I thought you figured it out when you added the
tcf_classify().


Re: [PATCH 1/3] sh_eth: add RGMII support

2018-05-16 Thread Andrew Lunn
> > Hi Sergei
> > 
> > What about
> > PHY_INTERFACE_MODE_RGMII_ID,
> > PHY_INTERFACE_MODE_RGMII_RXID,
> > PHY_INTERFACE_MODE_RGMII_TXID,
> 
>Oops, totally forgot about those... :-/

Everybody does. I keep intending to write a email template for
this, and phy_interface_mode_is_rgmii() :-)

Andrew


[PATCH v2 net] net/ipv4: Initialize proto and ports in flow struct

2018-05-16 Thread David Ahern
Updating the FIB tracepoint for the recent change to allow rules using
the protocol and ports exposed a few places where the entries in the flow
struct are not initialized.

For __fib_validate_source add the call to fib4_rules_early_flow_dissect
since it is invoked for the input path. For netfilter, add the memset on
the flow struct to avoid future problems like this. In ip_route_input_slow
need to set the fields if the skb dissection does not happen.

Fixes: bfff4862653b ("net: fib_rules: support for match on ip_proto, sport and 
dport")
Signed-off-by: David Ahern 
---
Have not seen any problems with the IPv6 version

v2
- do not remove tracepoint in __fib_validate_source (sent the net-next
  version of this patch)
- add set of ports and proto to ip_route_input_slow if skb dissect
  is not done

 net/ipv4/fib_frontend.c   | 8 +++-
 net/ipv4/netfilter/ipt_rpfilter.c | 2 +-
 net/ipv4/route.c  | 7 ++-
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index f05afaf3235c..4d622112bf95 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -326,10 +326,11 @@ static int __fib_validate_source(struct sk_buff *skb, 
__be32 src, __be32 dst,
 u8 tos, int oif, struct net_device *dev,
 int rpf, struct in_device *idev, u32 *itag)
 {
+   struct net *net = dev_net(dev);
+   struct flow_keys flkeys;
int ret, no_addr;
struct fib_result res;
struct flowi4 fl4;
-   struct net *net = dev_net(dev);
bool dev_match;
 
fl4.flowi4_oif = 0;
@@ -347,6 +348,11 @@ static int __fib_validate_source(struct sk_buff *skb, 
__be32 src, __be32 dst,
no_addr = idev->ifa_list == NULL;
 
fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0;
+   if (!fib4_rules_early_flow_dissect(net, skb, , )) {
+   fl4.flowi4_proto = 0;
+   fl4.fl4_sport = 0;
+   fl4.fl4_dport = 0;
+   }
 
trace_fib_validate_source(dev, );
 
diff --git a/net/ipv4/netfilter/ipt_rpfilter.c 
b/net/ipv4/netfilter/ipt_rpfilter.c
index fd01f13c896a..12843c9ef142 100644
--- a/net/ipv4/netfilter/ipt_rpfilter.c
+++ b/net/ipv4/netfilter/ipt_rpfilter.c
@@ -89,10 +89,10 @@ static bool rpfilter_mt(const struct sk_buff *skb, struct 
xt_action_param *par)
return true ^ invert;
}
 
+   memset(, 0, sizeof(flow));
flow.flowi4_iif = LOOPBACK_IFINDEX;
flow.daddr = iph->saddr;
flow.saddr = rpfilter_get_saddr(iph->daddr);
-   flow.flowi4_oif = 0;
flow.flowi4_mark = info->flags & XT_RPFILTER_VALID_MARK ? skb->mark : 0;
flow.flowi4_tos = RT_TOS(iph->tos);
flow.flowi4_scope = RT_SCOPE_UNIVERSE;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 29268efad247..2cfa1b518f8d 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1961,8 +1961,13 @@ static int ip_route_input_slow(struct sk_buff *skb, 
__be32 daddr, __be32 saddr,
fl4.saddr = saddr;
fl4.flowi4_uid = sock_net_uid(net, NULL);
 
-   if (fib4_rules_early_flow_dissect(net, skb, , &_flkeys))
+   if (fib4_rules_early_flow_dissect(net, skb, , &_flkeys)) {
flkeys = &_flkeys;
+   } else {
+   fl4.flowi4_proto = 0;
+   fl4.fl4_sport = 0;
+   fl4.fl4_dport = 0;
+   }
 
err = fib_lookup(net, , res, 0);
if (err != 0) {
-- 
2.11.0



Re: [PATCH 1/3] sh_eth: add RGMII support

2018-05-16 Thread Sergei Shtylyov
On 05/16/2018 11:30 PM, Andrew Lunn wrote:

>> The R-Car V3H (AKA R8A77980) GEther controller  adds support for the RGMII
>> PHY interface mode as a new  value  for the RMII_MII register.
>>
>> Based on the original (and large) patch by Vladimir Barinov.
>>
>> Signed-off-by: Vladimir Barinov 
>> Signed-off-by: Sergei Shtylyov 
>>
>> ---
>>  drivers/net/ethernet/renesas/sh_eth.c |3 +++
>>  1 file changed, 3 insertions(+)
>>
>> Index: net-next/drivers/net/ethernet/renesas/sh_eth.c
>> ===
>> --- net-next.orig/drivers/net/ethernet/renesas/sh_eth.c
>> +++ net-next/drivers/net/ethernet/renesas/sh_eth.c
>> @@ -466,6 +466,9 @@ static void sh_eth_select_mii(struct net
>>  u32 value;
>>  
>>  switch (mdp->phy_interface) {
>> +case PHY_INTERFACE_MODE_RGMII:
>> +value = 0x3;
>> +break;
> 
> Hi Sergei
> 
> What about
>   PHY_INTERFACE_MODE_RGMII_ID,
>   PHY_INTERFACE_MODE_RGMII_RXID,
>   PHY_INTERFACE_MODE_RGMII_TXID,

   Oops, totally forgot about those... :-/

>  Andrew

MBR, Sergei


Re: [PATCH 1/3] sh_eth: add RGMII support

2018-05-16 Thread Andrew Lunn
On Wed, May 16, 2018 at 10:56:45PM +0300, Sergei Shtylyov wrote:
> The R-Car V3H (AKA R8A77980) GEther controller  adds support for the RGMII
> PHY interface mode as a new  value  for the RMII_MII register.
> 
> Based on the original (and large) patch by Vladimir Barinov.
> 
> Signed-off-by: Vladimir Barinov 
> Signed-off-by: Sergei Shtylyov 
> 
> ---
>  drivers/net/ethernet/renesas/sh_eth.c |3 +++
>  1 file changed, 3 insertions(+)
> 
> Index: net-next/drivers/net/ethernet/renesas/sh_eth.c
> ===
> --- net-next.orig/drivers/net/ethernet/renesas/sh_eth.c
> +++ net-next/drivers/net/ethernet/renesas/sh_eth.c
> @@ -466,6 +466,9 @@ static void sh_eth_select_mii(struct net
>   u32 value;
>  
>   switch (mdp->phy_interface) {
> + case PHY_INTERFACE_MODE_RGMII:
> + value = 0x3;
> + break;

Hi Sergei

What about
PHY_INTERFACE_MODE_RGMII_ID,
PHY_INTERFACE_MODE_RGMII_RXID,
PHY_INTERFACE_MODE_RGMII_TXID,

 Andrew


Re: [PATCH net-next v3 1/3] ipv4: support sport, dport and ip_proto in RTM_GETROUTE

2018-05-16 Thread Roopa Prabhu
On Wed, May 16, 2018 at 11:37 AM, David Miller  wrote:
> From: Roopa Prabhu 
> Date: Tue, 15 May 2018 20:55:06 -0700
>
>> +static int inet_rtm_getroute_reply(struct sk_buff *in_skb, struct nlmsghdr 
>> *nlh,
>> +__be32 dst, __be32 src, struct flowi4 *fl4,
>> +struct rtable *rt, struct fib_result *res)
>> +{
>> + struct net *net = sock_net(in_skb->sk);
>> + struct rtmsg *rtm = nlmsg_data(nlh);
>> + u32 table_id = RT_TABLE_MAIN;
>> + struct sk_buff *skb;
>> + int err = 0;
>> +
>> + skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
>> + if (!skb)
>> + return -ENOMEM;
>
> If the caller can use GFP_KERNEL, so can this allocation.

yes, but we hold the rcu read lock before calling the reply function for the
fib result. I did consider allocating the skb before the read lock, but then the
refactoring (into a separate netlink reply func) would seem
unnecessary.

I am fine with pre-allocating and undoing the refactoring if that works better.


[PATCH net-next v12 5/7] sch_cake: Add DiffServ handling

2018-05-16 Thread Toke Høiland-Jørgensen
This adds support for DiffServ-based priority queueing to CAKE. If the
shaper is in use, each priority tier gets its own virtual clock, which
limits that tier's rate to a fraction of the overall shaped rate, to
discourage trying to game the priority mechanism.

CAKE defaults to a simple, three-tier mode that interprets most code points
as "best effort", but places CS1 traffic into a low-priority "bulk" tier
which is assigned 1/16 of the total rate, and a few code points indicating
latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7)
into a "latency sensitive" high-priority tier, which is assigned 1/4 rate.
The other supported DiffServ modes are a 4-tier mode matching the 802.11e
precedence rules, as well as two 8-tier modes, one of which implements
strict precedence of the eight priority levels.

This commit also adds an optional DiffServ 'wash' mode, which will zero out
the DSCP fields of any packet passing through CAKE. While this can
technically be done with other mechanisms in the kernel, having the feature
available in CAKE significantly decreases configuration complexity; and the
implementation cost is low on top of the other DiffServ-handling code.

Filters and applications can set the skb->priority field to override the
DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches
CAKE's qdisc handle, the minor number will be interpreted as a priority
tier if it is less than or equal to the number of configured priority
tiers.

Signed-off-by: Toke Høiland-Jørgensen 
---
 net/sched/sch_cake.c |  407 +-
 1 file changed, 401 insertions(+), 6 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index e1038a7b6686..f0f94d536e51 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -297,6 +297,68 @@ static void cobalt_set_enqueue_time(struct sk_buff *skb,
 
 static u16 quantum_div[CAKE_QUEUES + 1] = {0};
 
+/* Diffserv lookup tables */
+
+static const u8 precedence[] = {
+   0, 0, 0, 0, 0, 0, 0, 0,
+   1, 1, 1, 1, 1, 1, 1, 1,
+   2, 2, 2, 2, 2, 2, 2, 2,
+   3, 3, 3, 3, 3, 3, 3, 3,
+   4, 4, 4, 4, 4, 4, 4, 4,
+   5, 5, 5, 5, 5, 5, 5, 5,
+   6, 6, 6, 6, 6, 6, 6, 6,
+   7, 7, 7, 7, 7, 7, 7, 7,
+};
+
+static const u8 diffserv8[] = {
+   2, 5, 1, 2, 4, 2, 2, 2,
+   0, 2, 1, 2, 1, 2, 1, 2,
+   5, 2, 4, 2, 4, 2, 4, 2,
+   3, 2, 3, 2, 3, 2, 3, 2,
+   6, 2, 3, 2, 3, 2, 3, 2,
+   6, 2, 2, 2, 6, 2, 6, 2,
+   7, 2, 2, 2, 2, 2, 2, 2,
+   7, 2, 2, 2, 2, 2, 2, 2,
+};
+
+static const u8 diffserv4[] = {
+   0, 2, 0, 0, 2, 0, 0, 0,
+   1, 0, 0, 0, 0, 0, 0, 0,
+   2, 0, 2, 0, 2, 0, 2, 0,
+   2, 0, 2, 0, 2, 0, 2, 0,
+   3, 0, 2, 0, 2, 0, 2, 0,
+   3, 0, 0, 0, 3, 0, 3, 0,
+   3, 0, 0, 0, 0, 0, 0, 0,
+   3, 0, 0, 0, 0, 0, 0, 0,
+};
+
+static const u8 diffserv3[] = {
+   0, 0, 0, 0, 2, 0, 0, 0,
+   1, 0, 0, 0, 0, 0, 0, 0,
+   0, 0, 0, 0, 0, 0, 0, 0,
+   0, 0, 0, 0, 0, 0, 0, 0,
+   0, 0, 0, 0, 0, 0, 0, 0,
+   0, 0, 0, 0, 2, 0, 2, 0,
+   2, 0, 0, 0, 0, 0, 0, 0,
+   2, 0, 0, 0, 0, 0, 0, 0,
+};
+
+static const u8 besteffort[] = {
+   0, 0, 0, 0, 0, 0, 0, 0,
+   0, 0, 0, 0, 0, 0, 0, 0,
+   0, 0, 0, 0, 0, 0, 0, 0,
+   0, 0, 0, 0, 0, 0, 0, 0,
+   0, 0, 0, 0, 0, 0, 0, 0,
+   0, 0, 0, 0, 0, 0, 0, 0,
+   0, 0, 0, 0, 0, 0, 0, 0,
+   0, 0, 0, 0, 0, 0, 0, 0,
+};
+
+/* tin priority order for stats dumping */
+
+static const u8 normal_order[] = {0, 1, 2, 3, 4, 5, 6, 7};
+static const u8 bulk_order[] = {1, 0, 2, 3};
+
 #define REC_INV_SQRT_CACHE (16)
 static u32 cobalt_rec_inv_sqrt_cache[REC_INV_SQRT_CACHE] = {0};
 
@@ -1219,6 +1281,46 @@ static unsigned int cake_drop(struct Qdisc *sch, struct 
sk_buff **to_free)
return idx + (tin << 16);
 }
 
+static void cake_wash_diffserv(struct sk_buff *skb)
+{
+   switch (skb->protocol) {
+   case htons(ETH_P_IP):
+   ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+   break;
+   case htons(ETH_P_IPV6):
+   ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0);
+   break;
+   default:
+   break;
+   }
+}
+
+static u8 cake_handle_diffserv(struct sk_buff *skb, u16 wash)
+{
+   u8 dscp;
+
+   switch (skb->protocol) {
+   case htons(ETH_P_IP):
+   dscp = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
+   if (wash && dscp)
+   ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+   return dscp;
+
+   case htons(ETH_P_IPV6):
+   dscp = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
+   if (wash && dscp)
+   ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0);
+   return dscp;
+
+   case htons(ETH_P_ARP):
+   return 0x38;  /* CS7 - Net Control */
+
+   default:
+   /* If there is no Diffserv field, treat 

[PATCH net-next v12 7/7] sch_cake: Conditionally split GSO segments

2018-05-16 Thread Toke Høiland-Jørgensen
At lower bandwidths, the transmission time of a single GSO segment can add
an unacceptable amount of latency due to HOL blocking. Furthermore, with a
software shaper, any tuning mechanism employed by the kernel to control the
maximum size of GSO segments is thrown off by the artificial limit on
bandwidth. For this reason, we split GSO segments into their individual
packets iff the shaper is active and configured to a bandwidth <= 1 Gbps.

Signed-off-by: Toke Høiland-Jørgensen 
---
 net/sched/sch_cake.c |   99 +-
 1 file changed, 73 insertions(+), 26 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 1ce81d919f73..dca276806e9f 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -82,6 +82,7 @@
 #define CAKE_QUEUES (1024)
 #define CAKE_FLOW_MASK 63
 #define CAKE_FLOW_NAT_FLAG 64
+#define CAKE_SPLIT_GSO_THRESHOLD (12500) /* 1Gbps */
 
 /* struct cobalt_params - contains codel and blue parameters
  * @interval:  codel initial drop rate
@@ -1474,36 +1475,73 @@ static s32 cake_enqueue(struct sk_buff *skb, struct 
Qdisc *sch,
if (unlikely(len > b->max_skblen))
b->max_skblen = len;
 
-   cobalt_set_enqueue_time(skb, now);
-   get_cobalt_cb(skb)->adjusted_len = cake_overhead(q, skb);
-   flow_queue_add(flow, skb);
-
-   if (q->ack_filter)
-   ack = cake_ack_filter(q, flow);
+   if (skb_is_gso(skb) && q->rate_flags & CAKE_FLAG_SPLIT_GSO) {
+   struct sk_buff *segs, *nskb;
+   netdev_features_t features = netif_skb_features(skb);
+   unsigned int slen = 0;
+
+   segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
+   if (IS_ERR_OR_NULL(segs))
+   return qdisc_drop(skb, sch, to_free);
+
+   while (segs) {
+   nskb = segs->next;
+   segs->next = NULL;
+   qdisc_skb_cb(segs)->pkt_len = segs->len;
+   cobalt_set_enqueue_time(segs, now);
+   get_cobalt_cb(segs)->adjusted_len = cake_overhead(q,
+ segs);
+   flow_queue_add(flow, segs);
+
+   sch->q.qlen++;
+   slen += segs->len;
+   q->buffer_used += segs->truesize;
+   b->packets++;
+   segs = nskb;
+   }
 
-   if (ack) {
-   b->ack_drops++;
-   sch->qstats.drops++;
-   b->bytes += qdisc_pkt_len(ack);
-   len -= qdisc_pkt_len(ack);
-   q->buffer_used += skb->truesize - ack->truesize;
-   if (q->rate_flags & CAKE_FLAG_INGRESS)
-   cake_advance_shaper(q, b, ack, now, true);
+   /* stats */
+   b->bytes+= slen;
+   b->backlogs[idx]+= slen;
+   b->tin_backlog  += slen;
+   sch->qstats.backlog += slen;
+   q->avg_window_bytes += slen;
 
-   qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(ack));
-   consume_skb(ack);
+   qdisc_tree_reduce_backlog(sch, 1, len);
+   consume_skb(skb);
} else {
-   sch->q.qlen++;
-   q->buffer_used  += skb->truesize;
-   }
+   /* not splitting */
+   cobalt_set_enqueue_time(skb, now);
+   get_cobalt_cb(skb)->adjusted_len = cake_overhead(q, skb);
+   flow_queue_add(flow, skb);
+
+   if (q->ack_filter)
+   ack = cake_ack_filter(q, flow);
+
+   if (ack) {
+   b->ack_drops++;
+   sch->qstats.drops++;
+   b->bytes += qdisc_pkt_len(ack);
+   len -= qdisc_pkt_len(ack);
+   q->buffer_used += skb->truesize - ack->truesize;
+   if (q->rate_flags & CAKE_FLAG_INGRESS)
+   cake_advance_shaper(q, b, ack, now, true);
+
+   qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(ack));
+   consume_skb(ack);
+   } else {
+   sch->q.qlen++;
+   q->buffer_used  += skb->truesize;
+   }
 
-   /* stats */
-   b->packets++;
-   b->bytes+= len;
-   b->backlogs[idx]+= len;
-   b->tin_backlog  += len;
-   sch->qstats.backlog += len;
-   q->avg_window_bytes += len;
+   /* stats */
+   b->packets++;
+   b->bytes+= len;
+   b->backlogs[idx]+= len;
+   b->tin_backlog  += len;
+   sch->qstats.backlog += len;
+   q->avg_window_bytes += len;
+   }
 
if 

[PATCH net-next v12 3/7] sch_cake: Add optional ACK filter

2018-05-16 Thread Toke Høiland-Jørgensen
The ACK filter is an optional feature of CAKE which is designed to improve
performance on links with very asymmetrical rate limits. On such links
(which are unfortunately quite prevalent, especially for DSL and cable
subscribers), the downstream throughput can be limited by the number of
ACKs capable of being transmitted in the *upstream* direction.

Filtering ACKs can, in general, have adverse effects on TCP performance
because it interferes with ACK clocking (especially in slow start), and it
reduces the flow's resiliency to ACKs being dropped further along the path.
To alleviate these drawbacks, the ACK filter in CAKE tries its best to
always keep enough ACKs queued to ensure forward progress in the TCP flow
being filtered. It does this by only filtering redundant ACKs. In its
default 'conservative' mode, the filter will always keep at least two
redundant ACKs in the queue, while in 'aggressive' mode, it will filter
down to a single ACK.

The ACK filter works by inspecting the per-flow queue on every packet
enqueue. Starting at the head of the queue, the filter looks for another
eligible packet to drop (so the ACK being dropped is always closer to the
head of the queue than the packet being enqueued). An ACK is eligible only
if it ACKs *fewer* cumulative bytes than the new packet being enqueued.
This prevents duplicate ACKs from being filtered (unless there are also SACK
options present), to avoid interfering with retransmission logic. In
aggressive mode, an eligible packet is always dropped, while in
conservative mode, at least two ACKs are kept in the queue. Only pure ACKs
(with no data segments) are considered eligible for dropping, but when an
ACK with data segments is enqueued, this can cause another pure ACK to
become eligible for dropping.

The approach described above ensures that this ACK filter avoids most of
the drawbacks of a naive filtering mechanism that only keeps flow state but
does not inspect the queue. This is the rationale for including the ACK
filter in CAKE itself rather than as separate module (as the TC filter, for
instance).

Our performance evaluation has shown that on a 30/1 Mbps link with a
bidirectional traffic test (RRUL), turning on the ACK filter on the
upstream link improves downstream throughput by ~20% (both modes) and
upstream throughput by ~12% in conservative mode and ~40% in aggressive
mode, at the cost of ~5ms of inter-flow latency due to the increased
congestion.

In *really* pathological cases, the effect can be a lot more; for instance,
the ACK filter increases the achievable downstream throughput on a link
with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5
Mbps to ~25 Mbps).

Finally, even though we consider the ACK filter to be safer than most, we
do not recommend turning it on everywhere: on more symmetrical link
bandwidths the effect is negligible at best.

Signed-off-by: Toke Høiland-Jørgensen 
---
 net/sched/sch_cake.c |  260 ++
 1 file changed, 258 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index d515f18f8460..65439b643c92 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -755,6 +755,239 @@ static void flow_queue_add(struct cake_flow *flow, struct 
sk_buff *skb)
skb->next = NULL;
 }
 
+static struct iphdr *cake_get_iphdr(const struct sk_buff *skb,
+   struct ipv6hdr *buf)
+{
+   unsigned int offset = skb_network_offset(skb);
+   struct iphdr *iph;
+
+   iph = skb_header_pointer(skb, offset, sizeof(struct iphdr), buf);
+
+   if (!iph)
+   return NULL;
+
+   if (iph->version == 4 && iph->protocol == IPPROTO_IPV6)
+   return skb_header_pointer(skb, offset + iph->ihl * 4,
+ sizeof(struct ipv6hdr), buf);
+
+   else if (iph->version == 4)
+   return iph;
+
+   else if (iph->version == 6)
+   return skb_header_pointer(skb, offset, sizeof(struct ipv6hdr),
+ buf);
+
+   return NULL;
+}
+
+static struct tcphdr *cake_get_tcphdr(const struct sk_buff *skb,
+ void *buf, unsigned int bufsize)
+{
+   unsigned int offset = skb_network_offset(skb);
+   const struct ipv6hdr *ipv6h;
+   const struct tcphdr *tcph;
+   const struct iphdr *iph;
+   struct ipv6hdr _ipv6h;
+   struct tcphdr _tcph;
+
+   ipv6h = skb_header_pointer(skb, offset, sizeof(_ipv6h), &_ipv6h);
+
+   if (!ipv6h)
+   return NULL;
+
+   if (ipv6h->version == 4) {
+   iph = (struct iphdr *)ipv6h;
+   offset += iph->ihl * 4;
+
+   /* special-case 6in4 tunnelling, as that is a common way to get
+* v6 connectivity in the home
+*/
+   if (iph->protocol == IPPROTO_IPV6) {
+   ipv6h = 

[PATCH net-next v12 6/7] sch_cake: Add overhead compensation support to the rate shaper

2018-05-16 Thread Toke Høiland-Jørgensen
This commit adds configurable overhead compensation support to the rate
shaper. With this feature, userspace can configure the actual bottleneck
link overhead and encapsulation mode used, which will be used by the shaper
to calculate the precise duration of each packet on the wire.

This feature is needed because CAKE is often deployed one or two hops
upstream of the actual bottleneck (which can be, e.g., inside a DSL or
cable modem). In this case, the link layer characteristics and overhead
reported by the kernel do not match the actual bottleneck. Being able to
set the actual values in use makes it possible to configure the shaper rate
much closer to the actual bottleneck rate (our experience shows it is
possible to get within 0.1% of the actual physical bottleneck rate), thus
keeping latency low without sacrificing bandwidth.

The overhead compensation has three tunables: A fixed per-packet overhead
size (which, if set, will be accounted from the IP packet header), a
minimum packet size (MPU) and a framing mode supporting either ATM or PTM
framing. We include a set of common keywords in TC to help users configure
the right parameters. If no overhead value is set, the value reported by
the kernel is used.

Signed-off-by: Toke Høiland-Jørgensen 
---
 net/sched/sch_cake.c |  124 ++
 1 file changed, 123 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index f0f94d536e51..1ce81d919f73 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -271,6 +271,7 @@ enum {
 
 struct cobalt_skb_cb {
ktime_t enqueue_time;
+   u32 adjusted_len;
 };
 
 static u64 us_to_ns(u64 us)
@@ -1120,6 +1121,88 @@ static u64 cake_ewma(u64 avg, u64 sample, u32 shift)
return avg;
 }
 
+static u32 cake_calc_overhead(struct cake_sched_data *q, u32 len, u32 off)
+{
+   if (q->rate_flags & CAKE_FLAG_OVERHEAD)
+   len -= off;
+
+   if (q->max_netlen < len)
+   q->max_netlen = len;
+   if (q->min_netlen > len)
+   q->min_netlen = len;
+
+   len += q->rate_overhead;
+
+   if (len < q->rate_mpu)
+   len = q->rate_mpu;
+
+   if (q->atm_mode == CAKE_ATM_ATM) {
+   len += 47;
+   len /= 48;
+   len *= 53;
+   } else if (q->atm_mode == CAKE_ATM_PTM) {
+   /* Add one byte per 64 bytes or part thereof.
+* This is conservative and easier to calculate than the
+* precise value.
+*/
+   len += (len + 63) / 64;
+   }
+
+   if (q->max_adjlen < len)
+   q->max_adjlen = len;
+   if (q->min_adjlen > len)
+   q->min_adjlen = len;
+
+   return len;
+}
+
+static u32 cake_overhead(struct cake_sched_data *q, const struct sk_buff *skb)
+{
+   const struct skb_shared_info *shinfo = skb_shinfo(skb);
+   unsigned int hdr_len, last_len = 0;
+   u32 off = skb_network_offset(skb);
+   u32 len = qdisc_pkt_len(skb);
+   u16 segs = 1;
+
+   q->avg_netoff = cake_ewma(q->avg_netoff, off << 16, 8);
+
+   if (!shinfo->gso_size)
+   return cake_calc_overhead(q, len, off);
+
+   /* borrowed from qdisc_pkt_len_init() */
+   hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
+
+   /* + transport layer */
+   if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 |
+   SKB_GSO_TCPV6))) {
+   const struct tcphdr *th;
+   struct tcphdr _tcphdr;
+
+   th = skb_header_pointer(skb, skb_transport_offset(skb),
+   sizeof(_tcphdr), &_tcphdr);
+   if (likely(th))
+   hdr_len += __tcp_hdrlen(th);
+   } else {
+   struct udphdr _udphdr;
+
+   if (skb_header_pointer(skb, skb_transport_offset(skb),
+  sizeof(_udphdr), &_udphdr))
+   hdr_len += sizeof(struct udphdr);
+   }
+
+   if (unlikely(shinfo->gso_type & SKB_GSO_DODGY))
+   segs = DIV_ROUND_UP(skb->len - hdr_len,
+   shinfo->gso_size);
+   else
+   segs = shinfo->gso_segs;
+
+   len = shinfo->gso_size + hdr_len;
+   last_len = skb->len - shinfo->gso_size * (segs - 1);
+
+   return (cake_calc_overhead(q, len, off) * (segs - 1) +
+   cake_calc_overhead(q, last_len, off));
+}
+
 static void cake_heap_swap(struct cake_sched_data *q, u16 i, u16 j)
 {
struct cake_heap_entry ii = q->overflow_heap[i];
@@ -1197,7 +1280,7 @@ static int cake_advance_shaper(struct cake_sched_data *q,
   struct sk_buff *skb,
   ktime_t now, bool drop)
 {
-   u32 len = qdisc_pkt_len(skb);
+   u32 len = get_cobalt_cb(skb)->adjusted_len;
 
/* charge packet bandwidth to 

[PATCH net-next v12 4/7] sch_cake: Add NAT awareness to packet classifier

2018-05-16 Thread Toke Høiland-Jørgensen
When CAKE is deployed on a gateway that also performs NAT (which is a
common deployment mode), the host fairness mechanism cannot distinguish
internal hosts from each other, and so fails to work correctly.

To fix this, we add an optional NAT awareness mode, which will query the
kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
and use that in the flow and host hashing.

When the shaper is enabled and the host is already performing NAT, the cost
of this lookup is negligible. However, in unlimited mode with no NAT being
performed, there is a significant CPU cost at higher bandwidths. For this
reason, the feature is turned off by default.

Signed-off-by: Toke Høiland-Jørgensen 
---
 net/sched/sch_cake.c |   73 ++
 1 file changed, 73 insertions(+)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 65439b643c92..e1038a7b6686 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -71,6 +71,12 @@
 #include 
 #include 
 
+#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
+#include 
+#include 
+#include 
+#endif
+
 #define CAKE_SET_WAYS (8)
 #define CAKE_MAX_TINS (8)
 #define CAKE_QUEUES (1024)
@@ -514,6 +520,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
return drop;
 }
 
+#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
+
+static void cake_update_flowkeys(struct flow_keys *keys,
+const struct sk_buff *skb)
+{
+   const struct nf_conntrack_tuple *tuple;
+   enum ip_conntrack_info ctinfo;
+   struct nf_conn *ct;
+   bool rev = false;
+
+   if (tc_skb_protocol(skb) != htons(ETH_P_IP))
+   return;
+
+   ct = nf_ct_get(skb, );
+   if (ct) {
+   tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
+   } else {
+   const struct nf_conntrack_tuple_hash *hash;
+   struct nf_conntrack_tuple srctuple;
+
+   if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
+  NFPROTO_IPV4, dev_net(skb->dev),
+  &srctuple))
+   return;
+
+   hash = nf_conntrack_find_get(dev_net(skb->dev),
+&nf_ct_zone_dflt,
+&srctuple);
+   if (!hash)
+   return;
+
+   rev = true;
+   ct = nf_ct_tuplehash_to_ctrack(hash);
+   tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
+   }
+
+   keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
+   keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
+
+   if (keys->ports.ports) {
+   keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
+   keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
+   }
+   if (rev)
+   nf_ct_put(ct);
+}
+#else
+static void cake_update_flowkeys(struct flow_keys *keys,
+const struct sk_buff *skb)
+{
+   /* There is nothing we can do here without CONNTRACK */
+}
+#endif
+
 /* Cake has several subtle multiple bit settings. In these cases you
  *  would be matching triple isolate mode as well.
  */
@@ -541,6 +601,9 @@ static u32 cake_hash(struct cake_tin_data *q, const struct 
sk_buff *skb,
	skb_flow_dissect_flow_keys(skb, &keys,
   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
 
+   if (flow_mode & CAKE_FLOW_NAT_FLAG)
+   cake_update_flowkeys(&keys, skb);
+
/* flow_hash_from_keys() sorts the addresses by value, so we have
 * to preserve their order in a separate data structure to treat
 * src and dst host addresses as independently selectable.
@@ -1727,6 +1790,12 @@ static int cake_change(struct Qdisc *sch, struct nlattr 
*opt,
q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) &
CAKE_FLOW_MASK);
 
+   if (tb[TCA_CAKE_NAT]) {
+   q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
+   q->flow_mode |= CAKE_FLOW_NAT_FLAG *
+   !!nla_get_u32(tb[TCA_CAKE_NAT]);
+   }
+
if (tb[TCA_CAKE_RTT]) {
q->interval = nla_get_u32(tb[TCA_CAKE_RTT]);
 
@@ -1892,6 +1961,10 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff 
*skb)
if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
goto nla_put_failure;
 
+   if (nla_put_u32(skb, TCA_CAKE_NAT,
+   !!(q->flow_mode & CAKE_FLOW_NAT_FLAG)))
+   goto nla_put_failure;
+
return nla_nest_end(skb, opts);
 
 nla_put_failure:



[PATCH net-next v12 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-16 Thread Toke Høiland-Jørgensen
sch_cake targets the home router use case and is intended to squeeze the
most bandwidth and latency out of even the slowest ISP links and routers,
while presenting an API simple enough that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash

CAKE is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* A deficit-based shaper that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Extensive support for DSL framing types.
* Support for ack filtering.
* Extensive statistics for measuring loss, ECN markings, and latency
  variation.
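
The 8-way set-associative hashing mentioned in the feature list deserves a
sketch: rather than mapping a flow hash directly to one of the 1024 queues,
CAKE searches an 8-entry "set" for the flow's existing queue or a free way,
so two flows only collide once a whole set is occupied. A hypothetical
userspace illustration of the idea (not the driver's actual cake_hash(),
which additionally tracks host hashes and decaying entries):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QUEUES 1024
#define WAYS   8
#define SETS   (QUEUES / WAYS)

/* Each queue remembers which flow hash currently owns it (0 = free;
 * a real implementation would need a separate "occupied" flag). */
static uint32_t owner[QUEUES];

/* Return a queue index for 'hash'.  Within the 8-entry set selected by
 * the hash, reuse the queue already owned by this flow, fall back to a
 * free way, and only accept a collision when the whole set is busy. */
static int assign_queue(uint32_t hash)
{
    uint32_t set = (hash % SETS) * WAYS;
    uint32_t i;

    for (i = 0; i < WAYS; i++)          /* existing entry for this flow? */
        if (owner[set + i] == hash)
            return set + i;
    for (i = 0; i < WAYS; i++)          /* otherwise, a free way? */
        if (owner[set + i] == 0) {
            owner[set + i] = hash;
            return set + i;
        }
    return set;                          /* set full: collide */
}
```

With direct (non-associative) hashing, any two flows whose hashes share the
low bits would collide immediately; here the first eight such flows each get
their own queue.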

A paper describing the design of CAKE is available at
https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE
International Symposium on Local and Metropolitan Area Networks (LANMAN).

This patch adds the base shaper and packet scheduler, while subsequent
commits add the optional (configurable) features. The full userspace API
and most data structures are included in this commit, but options not
understood in the base version will be ignored.

Various versions baking have been available as an out of tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

CAKE's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the c...@lists.bufferbloat.net mailing list.

tc -s qdisc show dev eth2
qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 
100.0ms raw overhead 0
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 0b of 500b
 capacity estimate: 100Mbit
 min/max network layer size:65535 /   0
 min/max overhead-adjusted size:65535 /   0
 average network hdr offset:0

   Bulk  Best EffortVoice
  thresh   6250Kbit  100Mbit   25Mbit
  target  5.0ms5.0ms5.0ms
  interval  100.0ms  100.0ms  100.0ms
  pk_delay  0us  0us  0us
  av_delay  0us  0us  0us
  sp_delay  0us  0us  0us
  pkts000
  bytes   000
  way_inds000
  way_miss000
  way_cols000
  drops   000
  marks   000
  ack_drop000
  sp_flows000
  bk_flows000
  un_flows000
  max_len 000
  quantum   300 1514  762

Tested-by: Pete Heist 
Tested-by: Georgios Amanakis 
Signed-off-by: Dave Taht 
Signed-off-by: Toke Høiland-Jørgensen 
---
 include/uapi/linux/pkt_sched.h |  105 ++
 net/sched/Kconfig  |   11 
 net/sched/Makefile |1 
 net/sched/sch_cake.c   | 1739 
 4 files changed, 1856 insertions(+)
 create mode 100644 net/sched/sch_cake.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096ae97b..883e84f008d7 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,109 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+/* CAKE */
+enum {
+   TCA_CAKE_UNSPEC,
+   TCA_CAKE_BASE_RATE64,
+   TCA_CAKE_DIFFSERV_MODE,
+   TCA_CAKE_ATM,
+   TCA_CAKE_FLOW_MODE,
+   TCA_CAKE_OVERHEAD,
+   TCA_CAKE_RTT,
+   TCA_CAKE_TARGET,
+   TCA_CAKE_AUTORATE,
+   TCA_CAKE_MEMORY,
+   TCA_CAKE_NAT,
+   TCA_CAKE_RAW,
+   TCA_CAKE_WASH,
+   TCA_CAKE_MPU,
+ 

[PATCH net-next v12 0/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-16 Thread Toke Høiland-Jørgensen
This patch series adds the CAKE qdisc, and has been split up to ease
review.

I have attempted to split out each configurable feature into its own patch.
The first commit adds the base shaper and packet scheduler, while
subsequent commits add the optional features. The full userspace API and
most data structures are included in this commit, but options not
understood in the base version will be ignored.

The result of applying the entire series is identical to the out of tree
version that have seen extensive testing in previous deployments, most
notably as an out of tree patch to OpenWrt. However, note that I have only
compile tested the individual patches; so the whole series should be
considered as a unit.

---
Changelog

v12:
  - Get rid of custom time typedefs. Use ktime_t for time and u64 for
duration instead.

v11:
  - Fix overhead compensation calculation for GSO packets
  - Change configured rate to be u64 (I ran out of bits before I ran out
of CPU when testing the effects of the above)

v10:
  - Christmas tree gardening (fix variable declarations to be in reverse
line length order)

v9:
  - Remove duplicated checks around kvfree() and just call it
unconditionally.
  - Don't pass __GFP_NOWARN when allocating memory
  - Move options in cake_dump() that are related to optional features to
later patches implementing the features.
  - Support attaching filters to the qdisc and use the classification
result to select flow queue.
  - Support overriding diffserv priority tin from skb->priority

v8:
  - Remove inline keyword from function definitions
  - Simplify ACK filter; remove the complex state handling to make the
logic easier to follow. This will potentially be a bit less efficient,
but I have not been able to measure a difference.

v7:
  - Split up patch into a series to ease review.
  - Constify the ACK filter.

v6:
  - Fix 6in4 encapsulation checks in ACK filter code
  - Checkpatch fixes

v5:
  - Refactor ACK filter code and hopefully fix the safety issues
properly this time.

v4:
  - Only split GSO packets if shaping at speeds <= 1Gbps
  - Fix overhead calculation code to also work for GSO packets
  - Don't re-implement kvzalloc()
  - Remove local header include from out-of-tree build (fixes kbuild-bot
complaint).
  - Several fixes to the ACK filter:
- Check pskb_may_pull() before deref of transport headers.
- Don't run ACK filter logic on split GSO packets
- Fix TCP sequence number compare to deal with wraparounds

v3:
  - Use IS_REACHABLE() macro to fix compilation when sch_cake is
built-in and conntrack is a module.
  - Switch the stats output to use nested netlink attributes instead
of a versioned struct.
  - Remove GPL boilerplate.
  - Fix array initialisation style.

v2:
  - Fix kbuild test bot complaint
  - Clean up the netlink ABI
  - Fix checkpatch complaints
  - A few tweaks to the behaviour of cake based on testing carried out
while writing the paper.

---

Toke Høiland-Jørgensen (7):
  sched: Add Common Applications Kept Enhanced (cake) qdisc
  sch_cake: Add ingress mode
  sch_cake: Add optional ACK filter
  sch_cake: Add NAT awareness to packet classifier
  sch_cake: Add DiffServ handling
  sch_cake: Add overhead compensation support to the rate shaper
  sch_cake: Conditionally split GSO segments


 include/uapi/linux/pkt_sched.h |  105 ++
 net/sched/Kconfig  |   11 
 net/sched/Makefile |1 
 net/sched/sch_cake.c   | 2709 
 4 files changed, 2826 insertions(+)
 create mode 100644 net/sched/sch_cake.c



[PATCH net-next v12 2/7] sch_cake: Add ingress mode

2018-05-16 Thread Toke Høiland-Jørgensen
The ingress mode is meant to be enabled when CAKE runs downlink of the
actual bottleneck (such as on an IFB device). The mode changes the shaper
to also account dropped packets to the shaped rate, as these have already
traversed the bottleneck.

Enabling ingress mode will also tune the AQM to always keep at least two
packets queued *for each flow*. This is done by scaling the minimum queue
occupancy level that will disable the AQM by the number of active bulk
flows. The rationale for this is that retransmits are more expensive in
ingress mode, since dropped packets have to traverse the bottleneck again
when they are retransmitted; thus, being more lenient and keeping a minimum
number of packets queued will improve throughput in cases where the number
of active flows is so large that they saturate the bottleneck even at
their minimum window size.

This commit also adds a separate switch to enable ingress mode rate
autoscaling. If enabled, the autoscaling code will observe the actual
traffic rate and adjust the shaper rate to match it. This can help avoid
latency increases in the case where the actual bottleneck rate decreases
below the shaped rate. The scaling filters out spikes by an EWMA filter.

Signed-off-by: Toke Høiland-Jørgensen 
---
 net/sched/sch_cake.c |   85 --
 1 file changed, 81 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 422cfccbf37f..d515f18f8460 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -433,7 +433,8 @@ static bool cobalt_queue_empty(struct cobalt_vars *vars,
 static bool cobalt_should_drop(struct cobalt_vars *vars,
   struct cobalt_params *p,
   ktime_t now,
-  struct sk_buff *skb)
+  struct sk_buff *skb,
+  u32 bulk_flows)
 {
bool next_due, over_target, drop = false;
ktime_t schedule;
@@ -457,6 +458,7 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
sojourn = ktime_to_ns(ktime_sub(now, cobalt_get_enqueue_time(skb)));
schedule = ktime_sub(now, vars->drop_next);
over_target = sojourn > p->target &&
+ sojourn > p->mtu_time * bulk_flows * 2 &&
  sojourn > p->mtu_time * 4;
next_due = vars->count && schedule >= 0;
 
@@ -910,6 +912,9 @@ static unsigned int cake_drop(struct Qdisc *sch, struct 
sk_buff **to_free)
b->tin_dropped++;
sch->qstats.drops++;
 
+   if (q->rate_flags & CAKE_FLAG_INGRESS)
+   cake_advance_shaper(q, b, skb, now, true);
+
__qdisc_drop(skb, to_free);
sch->q.qlen--;
 
@@ -986,8 +991,46 @@ static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc 
*sch,
cake_heapify_up(q, b->overflow_idx[idx]);
 
/* incoming bandwidth capacity estimate */
-   q->avg_window_bytes = 0;
-   q->last_packet_time = now;
+   if (q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS) {
+   u64 packet_interval = \
+   ktime_to_ns(ktime_sub(now, q->last_packet_time));
+
+   if (packet_interval > NSEC_PER_SEC)
+   packet_interval = NSEC_PER_SEC;
+
+   /* filter out short-term bursts, eg. wifi aggregation */
+   q->avg_packet_interval = \
+   cake_ewma(q->avg_packet_interval,
+ packet_interval,
+ (packet_interval > q->avg_packet_interval ?
+ 2 : 8));
+
+   q->last_packet_time = now;
+
+   if (packet_interval > q->avg_packet_interval) {
+   u64 window_interval = \
+   ktime_to_ns(ktime_sub(now,
+ q->avg_window_begin));
+   u64 b = q->avg_window_bytes * (u64)NSEC_PER_SEC;
+
+   do_div(b, window_interval);
+   q->avg_peak_bandwidth =
+   cake_ewma(q->avg_peak_bandwidth, b,
+ b > q->avg_peak_bandwidth ? 2 : 8);
+   q->avg_window_bytes = 0;
+   q->avg_window_begin = now;
+
+   if (ktime_after(now,
+   ktime_add_ms(q->last_reconfig_time,
+250))) {
+   q->rate_bps = (q->avg_peak_bandwidth * 15) >> 4;
+   cake_reconfigure(sch);
+   }
+   }
+   } else {
+   q->avg_window_bytes = 0;
+   q->last_packet_time = now;
+   }
 
/* flowchain */
if (!flow->set || flow->set == CAKE_SET_DECAYING) {
@@ -1246,14 +1289,26 @@ static struct sk_buff 

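
The EWMA filter used by the autorate code above (cake_ewma(), not shown in
the quoted hunks) is a shift-based moving average. A userspace sketch of the
same shape, offered as an illustration rather than the exact kernel helper:

```c
#include <assert.h>
#include <stdint.h>

/* Shift-based EWMA: avg += (sample - avg) / 2^shift, in integer form.
 * Larger shifts give a heavier (slower-moving) average. */
static uint64_t ewma(uint64_t avg, uint64_t sample, unsigned shift)
{
    avg -= avg >> shift;
    avg += sample >> shift;
    return avg;
}
```

The enqueue path in the patch selects shift 2 (weight 1/4) when the new
sample exceeds the average and shift 8 (weight 1/256) otherwise, which is
how it filters out short-term bursts such as wifi aggregation while still
reacting quickly to genuine rate changes.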
[PATCH] bpf: add __printf verification to bpf_verifier_vlog

2018-05-16 Thread Mathieu Malaterre
__printf is useful to verify format and arguments. ‘bpf_verifier_vlog’
function is used twice in verifier.c in both cases the caller function
already uses the __printf gcc attribute.

Remove the following warning, triggered with W=1:

  kernel/bpf/verifier.c:176:2: warning: function might be possible candidate 
for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format]

Signed-off-by: Mathieu Malaterre 
---
 include/linux/bpf_verifier.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 7e61c395fddf..ebf78f8ddfa1 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -197,8 +197,8 @@ struct bpf_verifier_env {
u32 subprog_cnt;
 };
 
-void bpf_verifier_vlog(struct bpf_verifier_log *log, const char *fmt,
-  va_list args);
+__printf(2, 0) void bpf_verifier_vlog(struct bpf_verifier_log *log,
+ const char *fmt, va_list args);
 __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
   const char *fmt, ...);
 
-- 
2.11.0
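
The effect of __printf(2, 0) can be reproduced in plain GCC C. The names
below (log_write, log_vwrite, struct verifier_log) are hypothetical
userspace stand-ins, not the kernel's actual API; the attribute usage is
the same:

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-in for the kernel's __printf() macro.  An argument
 * index of 0 marks a vprintf-style function: GCC validates the format
 * string itself but skips the (uncheckable) va_list contents. */
#define __printf(fmt, args) __attribute__((format(printf, fmt, args)))

struct verifier_log { char buf[128]; };

__printf(2, 0)
static void log_vwrite(struct verifier_log *log, const char *fmt, va_list args)
{
    vsnprintf(log->buf, sizeof(log->buf), fmt, args);
}

__printf(2, 3)
static void log_write(struct verifier_log *log, const char *fmt, ...)
{
    va_list args;

    va_start(args, fmt);
    /* With __printf(2, 0) on log_vwrite(), building with W=1 style
     * flags no longer suggests a format attribute here, and callers
     * of log_write() with mismatched arguments still get -Wformat. */
    log_vwrite(log, fmt, args);
    va_end(args);
}
```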



Re: [PATCH net-next v2 2/2] drivers: net: Remove device_node checks with of_mdiobus_register()

2018-05-16 Thread Sergei Shtylyov
Hello!

On 05/16/2018 02:56 AM, Florian Fainelli wrote:

> A number of drivers have the following pattern:
> 
> if (np)
>   of_mdiobus_register()
> else
>   mdiobus_register()
> 
> which the implementation of of_mdiobus_register() now takes care of.
> Remove that pattern in drivers that strictly adhere to it.
> 
> Signed-off-by: Florian Fainelli 
[...]

> diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
> index ac621f44237a..02e8982519ce 100644
> --- a/drivers/net/dsa/bcm_sf2.c
> +++ b/drivers/net/dsa/bcm_sf2.c
> @@ -450,12 +450,8 @@ static int bcm_sf2_mdio_register(struct dsa_switch *ds)
>   priv->slave_mii_bus->parent = ds->dev->parent;
>   priv->slave_mii_bus->phy_mask = ~priv->indir_phy_mask;
>  
> - if (dn)
> - err = of_mdiobus_register(priv->slave_mii_bus, dn);
> - else
> - err = mdiobus_register(priv->slave_mii_bus);
> -
> - if (err)
> + err = of_mdiobus_register(priv->slave_mii_bus, dn);
> + if (err && dn)

   of_node_put() checks for NULL.

>   of_node_put(dn);
>  
>   return err;
[...]
> diff --git a/drivers/net/ethernet/freescale/fec_main.c 
> b/drivers/net/ethernet/freescale/fec_main.c
> index d4604bc8eb5b..f3e43db0d6cb 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -2052,13 +2052,9 @@ static int fec_enet_mii_init(struct platform_device 
> *pdev)
>   fep->mii_bus->parent = &pdev->dev;
>  
>   node = of_get_child_by_name(pdev->dev.of_node, "mdio");
> - if (node) {
> - err = of_mdiobus_register(fep->mii_bus, node);
> + err = of_mdiobus_register(fep->mii_bus, node);
> + if (node)
>   of_node_put(node);

   Same comment here.

[...]
> diff --git a/drivers/net/ethernet/renesas/sh_eth.c 
> b/drivers/net/ethernet/renesas/sh_eth.c
> index 5970d9e5ddf1..8dd41e08a6c6 100644
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> @@ -3025,15 +3025,10 @@ static int sh_mdio_init(struct sh_eth_private *mdp,
>pdev->name, pdev->id);
>  
>   /* register MDIO bus */
> - if (dev->of_node) {
> - ret = of_mdiobus_register(mdp->mii_bus, dev->of_node);
> - } else {
> - if (pd->phy_irq > 0)
> - mdp->mii_bus->irq[pd->phy] = pd->phy_irq;
> -
> - ret = mdiobus_register(mdp->mii_bus);
> - }
> + if (pd->phy_irq > 0)
> + mdp->mii_bus->irq[pd->phy] = pd->phy_irq;
>  
> + ret = of_mdiobus_register(mdp->mii_bus, dev->of_node);
>   if (ret)
>   goto out_free_bus;
>  

   This part is:

Acked-by: Sergei Shtylyov 

[...]
> diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
> index 91761436709a..8dff87ec6d99 100644
> --- a/drivers/net/usb/lan78xx.c
> +++ b/drivers/net/usb/lan78xx.c
> @@ -1843,12 +1843,9 @@ static int lan78xx_mdio_init(struct lan78xx_net *dev)
>   }
>  
>   node = of_get_child_by_name(dev->udev->dev.of_node, "mdio");
> - if (node) {
> - ret = of_mdiobus_register(dev->mdiobus, node);
> + ret = of_mdiobus_register(dev->mdiobus, node);
> + if (node)
>   of_node_put(node);

   of_node_put() checks for NULL, again...

MBR, Sergei


Re: [PATCH net-next 3/3] udp: only use paged allocation with scatter-gather

2018-05-16 Thread Willem de Bruijn
On Tue, May 15, 2018 at 7:57 PM, Willem de Bruijn
 wrote:
> On Tue, May 15, 2018 at 4:04 PM, Willem de Bruijn
>  wrote:
>> On Tue, May 15, 2018 at 10:14 AM, Willem de Bruijn
>>  wrote:
>>> On Mon, May 14, 2018 at 7:45 PM, Eric Dumazet  
>>> wrote:


 On 05/14/2018 04:30 PM, Willem de Bruijn wrote:

> I don't quite follow. The reported crash happens in the protocol layer,
> because of this check. With pagedlen we have not allocated
> sufficient space for the skb_put.
>
> if (!(rt->dst.dev->features & NETIF_F_SG)) {
> unsigned int off;
>
> off = skb->len;
> if (getfrag(from, skb_put(skb, copy),
> offset, copy, off, skb) < 0) {
> __skb_trim(skb, off);
> err = -EFAULT;
> goto error;
> }
> } else {
> int i = skb_shinfo(skb)->nr_frags;
>
> Are you referring to a separate potential issue in the gso layer?
> If a bonding device advertises SG, but a slave does not, then
> skb_segment on the slave should build linear segs? I have not
> tested that.

 Given that the device attribute could change under us, we need to not
 crash, even if initially we thought NETIF_F_SG was available.

 Unless you want to hold RTNL in UDP xmit :)

 Ideally, GSO should be always on, as we did for TCP.

 Otherwise, I can guarantee syzkaller will hit again.
>>>
>>> Ah, right. Thanks, Eric!
>>>
>>> I'll read that feature bit only once.
>>
>> This issue is actually deeper and not specific to gso.
>> With corking it is trivial to turn off sg in between calls.
>>
>> I'll need to send a separate fix for that.
>
> This would do it. The extra branch is unfortunate, but I see no easy
> way around it for the corking case.
>
> It will obviously not build a linear skb, but validate_xmit_skb will clean
> that up for such edge cases.
>
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 66340ab750e6..e7daec7c7421 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1040,7 +1040,8 @@ static int __ip_append_data(struct sock *sk,
> if (copy > length)
> copy = length;
>
> -   if (!(rt->dst.dev->features & NETIF_F_SG)) {
> +   if (!(rt->dst.dev->features & NETIF_F_SG) &&
> +   skb_tailroom(skb) >= copy) {
> unsigned int off;

Reminder that this is a separate draft patch to net unrelated to gso.

A simpler branch

> -   if (!(rt->dst.dev->features & NETIF_F_SG)) {
> +   if (skb_tailroom(skb) >= copy) {

is probably sufficient, but might have subtle side-effects when SG is
off, where allocation padding allows data to fit that would currently is
added as frag. Risky for a stable patch with no significant benefit.

On the other extreme, I can define

  bool sg = rt->dst.dev->features & NETIF_F_SG;

and refer to that in both current sites that test the flag. But this
will not help the corking case where the function is entered twice
for the same skb. I'll add that in the net-next gso fix where the flag
is tested three times.

But intend to send this snippet (also for v6) as is.
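
Eric's point about the feature bits changing underneath the sender reduces
to a simple rule: snapshot the flag once per call instead of re-reading a
concurrently mutable word. A hypothetical illustration with a fake device
struct (not the kernel's netdev_features_t plumbing), where a callback
simulates `ethtool -K` racing the append loop:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NETIF_F_SG (1u << 0)

struct fake_dev { uint32_t features; };  /* may change concurrently */

/* Build one "packet" out of 'chunks' pieces.  Re-reading dev->features
 * per chunk (the buggy pattern) can mix linear and frag decisions within
 * a single packet if the flag flips mid-loop; snapshotting it once keeps
 * the whole packet internally consistent. */
static int build_packet(struct fake_dev *dev, int chunks,
                        void (*mid)(struct fake_dev *))
{
    const bool sg = dev->features & NETIF_F_SG;  /* single snapshot */
    int frag_chunks = 0;
    int i;

    for (i = 0; i < chunks; i++) {
        if (i == chunks / 2 && mid)
            mid(dev);            /* simulates ethtool -K racing us */
        if (sg)                  /* decision is stable for this call */
            frag_chunks++;
    }
    return frag_chunks;
}

static void drop_sg(struct fake_dev *dev) { dev->features &= ~NETIF_F_SG; }
```

The corking case discussed above is harder, because the same skb is revisited
across separate calls; that is what the skb_tailroom() check addresses.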


Re: [PATCH bpf-next] samples/bpf: Decrement ttl in fib forwarding example

2018-05-16 Thread Daniel Borkmann
On 05/16/2018 01:20 AM, David Ahern wrote:
> Only consider forwarding packets if ttl in received packet is > 1 and
> decrement ttl before handing off to bpf_redirect_map.
> 
> Signed-off-by: David Ahern 

Looks good, applied to bpf-next, thanks David!


Re: [PATCH bpf-next v6 2/4] bpf: sockmap, add hash map support

2018-05-16 Thread Daniel Borkmann
On 05/15/2018 11:09 PM, Y Song wrote:
> On Tue, May 15, 2018 at 12:01 PM, Daniel Borkmann  
> wrote:
>> On 05/14/2018 07:00 PM, John Fastabend wrote:
[...]
>>>  enum bpf_prog_type {
>>> @@ -1855,6 +1856,52 @@ struct bpf_stack_build_id {
>>>   * Egress device index on success, 0 if packet needs to 
>>> continue
>>>   * up the stack for further processing or a negative error in 
>>> case
>>>   * of failure.
>>> + * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct 
>>> bpf_map *map, void *key, u64 flags)
>>
>> When you rebase please fix this up properly next time and add a newline in 
>> between
>> the helpers. I fixed this up while applying.
> 
> I guess the tools/include/uapi/linux/bpf.h may also need fixup to be
> in sync with main bpf.h.

Yep agree, just fixed it up, thanks!


Re: [RFC bpf-next 00/11] Add socket lookup support

2018-05-16 Thread Alexei Starovoitov
On Wed, May 16, 2018 at 12:05:06PM -0700, Joe Stringer wrote:
> >
> > A few open points:
> > * Currently, the lookup interface only returns either a valid socket or a 
> > NULL
> >   pointer. This means that if there is any kind of issue with the tuple, 
> > such
> >   as it provides an unsupported protocol number, or the socket can't be 
> > found,
> >   then we are unable to differentiate these cases from one another. One 
> > natural
> >   approach to improve this could be to return an ERR_PTR from the
> >   bpf_sk_lookup() helper. This would be more complicated but maybe it's
> >   worthwhile.
> 
> This suggestion would add a lot of complexity, and there's not many
> legitimately different error cases. There's:
> * Unsupported socket type
> * Cannot find netns
> * Tuple argument is the wrong size
> * Can't find socket
> 
> If we split the helpers into protocol-specific types, the first one
> would be addressed. The last one is addressed by returning NULL. It
> seems like a reasonable compromise to me to return NULL also in the
> middle two cases as well, and rely on the BPF writer to provide valid
> arguments.
> 
> > * No ordering is defined between sockets. If the tuple could find multiple
> >   sockets, then it will arbitrarily return one. It is up to the caller to
> >   handle this. If we wish to handle this more reliably in future, we could
> >   encode an ordering preference in the flags field.
> 
> Doesn't need to be addressed with this series, there is scope for
> addressing these cases when the use case arises.

Thanks for summarizing the conf call discussion.
Looking forward to non-rfc patches :)



[PATCH 3/3] sh_eth: add R8A77980 support

2018-05-16 Thread Sergei Shtylyov
Finally, add support for the DT probing of the R-Car V3H (AKA R8A77980) --
it's the only R-Car gen3 SoC having the GEther controller -- others have
only EtherAVB...

Based on the original (and large) patch by Vladimir Barinov.

Signed-off-by: Vladimir Barinov 
Signed-off-by: Sergei Shtylyov 

---
 Documentation/devicetree/bindings/net/sh_eth.txt |1 
 drivers/net/ethernet/renesas/sh_eth.c|   44 +++
 2 files changed, 45 insertions(+)

Index: net-next/Documentation/devicetree/bindings/net/sh_eth.txt
===
--- net-next.orig/Documentation/devicetree/bindings/net/sh_eth.txt
+++ net-next/Documentation/devicetree/bindings/net/sh_eth.txt
@@ -14,6 +14,7 @@ Required properties:
  "renesas,ether-r8a7791"  if the device is a part of R8A7791 SoC.
  "renesas,ether-r8a7793"  if the device is a part of R8A7793 SoC.
  "renesas,ether-r8a7794"  if the device is a part of R8A7794 SoC.
+ "renesas,gether-r8a77980" if the device is a part of R8A77980 SoC.
  "renesas,ether-r7s72100" if the device is a part of R7S72100 SoC.
  "renesas,rcar-gen1-ether" for a generic R-Car Gen1 device.
  "renesas,rcar-gen2-ether" for a generic R-Car Gen2 or RZ/G1
Index: net-next/drivers/net/ethernet/renesas/sh_eth.c
===
--- net-next.orig/drivers/net/ethernet/renesas/sh_eth.c
+++ net-next/drivers/net/ethernet/renesas/sh_eth.c
@@ -753,6 +753,49 @@ static struct sh_eth_cpu_data rcar_gen2_
.rmiimode   = 1,
.magic  = 1,
 };
+
+/* R8A77980 */
+static struct sh_eth_cpu_data r8a77980_data = {
+   .soft_reset = sh_eth_soft_reset_gether,
+
+   .set_duplex = sh_eth_set_duplex,
+   .set_rate   = sh_eth_set_rate_gether,
+
+   .register_type  = SH_ETH_REG_GIGABIT,
+
+   .edtrr_trns = EDTRR_TRNS_GETHER,
+   .ecsr_value = ECSR_PSRTO | ECSR_LCHNG | ECSR_ICD | ECSR_MPD,
+   .ecsipr_value   = ECSIPR_PSRTOIP | ECSIPR_LCHNGIP | ECSIPR_ICDIP |
+ ECSIPR_MPDIP,
+   .eesipr_value   = EESIPR_RFCOFIP | EESIPR_ECIIP |
+ EESIPR_FTCIP | EESIPR_TDEIP | EESIPR_TFUFIP |
+ EESIPR_FRIP | EESIPR_RDEIP | EESIPR_RFOFIP |
+ EESIPR_RMAFIP | EESIPR_RRFIP |
+ EESIPR_RTLFIP | EESIPR_RTSFIP |
+ EESIPR_PREIP | EESIPR_CERFIP,
+
+   .tx_check   = EESR_FTC | EESR_CD | EESR_RTO,
+   .eesr_err_check = EESR_TWB1 | EESR_TWB | EESR_TABT | EESR_RABT |
+ EESR_RFE | EESR_RDE | EESR_RFRMER |
+ EESR_TFE | EESR_TDE | EESR_ECI,
+   .fdr_value  = 0x070f,
+
+   .apr= 1,
+   .mpr= 1,
+   .tpauser= 1,
+   .bculr  = 1,
+   .hw_swap= 1,
+   .nbst   = 1,
+   .rpadir = 1,
+   .rpadir_value   = 2 << 16,
+   .no_trimd   = 1,
+   .no_ade = 1,
+   .xdfar_rw   = 1,
+   .hw_checksum= 1,
+   .select_mii = 1,
+   .magic  = 1,
+   .cexcr  = 1,
+};
 #endif /* CONFIG_OF */
 
 static void sh_eth_set_rate_sh7724(struct net_device *ndev)
@@ -3134,6 +3177,7 @@ static const struct of_device_id sh_eth_
{ .compatible = "renesas,ether-r8a7791", .data = &rcar_gen2_data },
{ .compatible = "renesas,ether-r8a7793", .data = &rcar_gen2_data },
{ .compatible = "renesas,ether-r8a7794", .data = &rcar_gen2_data },
+   { .compatible = "renesas,gether-r8a77980", .data = &r8a77980_data },
{ .compatible = "renesas,ether-r7s72100", .data = &r7s72100_data },
{ .compatible = "renesas,rcar-gen1-ether", .data = &rcar_gen1_data },
{ .compatible = "renesas,rcar-gen2-ether", .data = &rcar_gen2_data },


[PATCH 2/3] sh_eth: add EDMR.NBST support

2018-05-16 Thread Sergei Shtylyov
The R-Car V3H (AKA R8A77980) GEther controller adds the DMA burst mode bit
(NBST) in EDMR and the manual tells to always set it before doing any DMA.

Based on the original (and large) patch by Vladimir Barinov.

Signed-off-by: Vladimir Barinov 
Signed-off-by: Sergei Shtylyov 

---
 drivers/net/ethernet/renesas/sh_eth.c |4 
 drivers/net/ethernet/renesas/sh_eth.h |2 ++
 2 files changed, 6 insertions(+)

Index: net-next/drivers/net/ethernet/renesas/sh_eth.c
===
--- net-next.orig/drivers/net/ethernet/renesas/sh_eth.c
+++ net-next/drivers/net/ethernet/renesas/sh_eth.c
@@ -1434,6 +1434,10 @@ static int sh_eth_dev_init(struct net_de
 
sh_eth_write(ndev, mdp->cd->trscer_err_mask, TRSCER);
 
+   /* DMA transfer burst mode */
+   if (mdp->cd->nbst)
+   sh_eth_modify(ndev, EDMR, EDMR_NBST, EDMR_NBST);
+
if (mdp->cd->bculr)
sh_eth_write(ndev, 0x800, BCULR);   /* Burst sycle set */
 
Index: net-next/drivers/net/ethernet/renesas/sh_eth.h
===
--- net-next.orig/drivers/net/ethernet/renesas/sh_eth.h
+++ net-next/drivers/net/ethernet/renesas/sh_eth.h
@@ -184,6 +184,7 @@ enum GECMR_BIT {
 
 /* EDMR */
 enum DMAC_M_BIT {
+   EDMR_NBST = 0x80,
EDMR_EL = 0x40, /* Litte endian */
EDMR_DL1 = 0x20, EDMR_DL0 = 0x10,
EDMR_SRST_GETHER = 0x03,
@@ -505,6 +506,7 @@ struct sh_eth_cpu_data {
unsigned bculr:1;   /* EtherC have BCULR */
unsigned tsu:1; /* EtherC have TSU */
unsigned hw_swap:1; /* E-DMAC have DE bit in EDMR */
+   unsigned nbst:1;/* E-DMAC has NBST bit in EDMR */
unsigned rpadir:1;  /* E-DMAC have RPADIR */
unsigned no_trimd:1;/* E-DMAC DO NOT have TRIMD */
unsigned no_ade:1;  /* E-DMAC DO NOT have ADE bit in EESR */


[PATCH 1/3] sh_eth: add RGMII support

2018-05-16 Thread Sergei Shtylyov
The R-Car V3H (AKA R8A77980) GEther controller adds support for the RGMII
PHY interface mode as a new value for the RMII_MII register.

Based on the original (and large) patch by Vladimir Barinov.

Signed-off-by: Vladimir Barinov 
Signed-off-by: Sergei Shtylyov 

---
 drivers/net/ethernet/renesas/sh_eth.c |3 +++
 1 file changed, 3 insertions(+)

Index: net-next/drivers/net/ethernet/renesas/sh_eth.c
===
--- net-next.orig/drivers/net/ethernet/renesas/sh_eth.c
+++ net-next/drivers/net/ethernet/renesas/sh_eth.c
@@ -466,6 +466,9 @@ static void sh_eth_select_mii(struct net
u32 value;
 
switch (mdp->phy_interface) {
+   case PHY_INTERFACE_MODE_RGMII:
+   value = 0x3;
+   break;
case PHY_INTERFACE_MODE_GMII:
value = 0x2;
break;


[PATCH 0/3] Add R8A77980 GEther support

2018-05-16 Thread Sergei Shtylyov
Hello!

Here's a set of 3 patches against DaveM's 'net-next.git' repo. They (gradually)
add R8A77980 GEther support to the 'sh_eth' driver, starting with couple new
register bits/values introduced with this chip, and ending with adding a new
'struct sh_eth_cpu_data' instance connected to the new DT "compatible" prop
value...

[1/1] sh_eth: add RGMII support
[2/3] sh_eth: add EDMR.NBST support
[3/3] sh_eth: add R8A77980 support

MBR, Sergei


[RFC PATCH] net: hns3: hns3_pci_sriov_configure() can be static

2018-05-16 Thread kbuild test robot

Fixes: fdb793670a00 ("net: hns3: Add support of .sriov_configure in HNS3 
driver")
Signed-off-by: Fengguang Wu 
---
 hns3_enet.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index e85ff38..3617b9d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1579,7 +1579,7 @@ static void hns3_remove(struct pci_dev *pdev)
  * Enable or change the number of VFs. Called when the user updates the number
  * of VFs in sysfs.
  **/
-int hns3_pci_sriov_configure(struct pci_dev *pdev, int num_vfs)
+static int hns3_pci_sriov_configure(struct pci_dev *pdev, int num_vfs)
 {
int ret;
 


Re: [PATCH net-next 09/10] net: hns3: Add support of .sriov_configure in HNS3 driver

2018-05-16 Thread kbuild test robot
Hi Peng,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Salil-Mehta/Misc-Bug-Fixes-and-clean-ups-for-HNS3-Driver/20180516-211239
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:266:16: sparse: expression 
using sizeof(void)
   drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:266:16: sparse: expression 
using sizeof(void)
>> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:1582:5: sparse: symbol 
>> 'hns3_pci_sriov_configure' was not declared. Should it be static?
   drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:2513:21: sparse: expression 
using sizeof(void)
   drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:2706:22: sparse: expression 
using sizeof(void)
   drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:2706:22: sparse: expression 
using sizeof(void)

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation
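For readers unfamiliar with the sparse warning being fixed here: "symbol '...' was not declared. Should it be static?" fires when a function with external linkage has no prototype in any included header. If the function is only referenced from its own file, as `hns3_pci_sriov_configure` is, marking it `static` both silences the warning and keeps the symbol out of the kernel's global namespace. A minimal stand-in (illustrative name and body, not the driver's real code):

```c
/* A file-local function should carry internal linkage: 'static'
 * removes it from the global symbol table and satisfies sparse's
 * missing-prototype check. */
static int configure_vfs_sketch(int num_vfs)
{
	/* Disabling SR-IOV is conventionally requested with num_vfs == 0;
	 * clamp negative inputs for this sketch. */
	return num_vfs > 0 ? num_vfs : 0;
}
```

The alternative fix, declaring the function in a shared header, is only appropriate when another translation unit actually needs to call it.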


Re: [PATCH net-next v2 0/3] net: Allow more drivers with COMPILE_TEST

2018-05-16 Thread Florian Fainelli
On 05/16/2018 12:07 PM, David Miller wrote:
> From: David Miller 
> Date: Wed, 16 May 2018 15:06:59 -0400 (EDT)
> 
>> So applied, thanks.
> 
> Nevermind, eventually got a build failure:
> 
> ERROR: "knav_queue_open" [drivers/net/ethernet/ti/keystone_netcp.ko] 
> undefined!
> make[1]: *** [scripts/Makefile.modpost:92: __modpost] Error 1
> make: *** [Makefile:1276: modules] Error 2

Snap, ok, let me do some more serious build testing with different
architectures here.

Sorry about that.
-- 
Florian


[PATCH v3] {net, IB}/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'

2018-05-16 Thread Christophe JAILLET
When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used to
free it.

Fixes: 1cbe6fc86ccfe ("IB/mlx5: Add support for CQE compressing")
Fixes: fed9ce22bf8ae ("net/mlx5: E-Switch, Add API to create vport rx rules")
Fixes: 9efa75254593d ("net/mlx5_core: Introduce access functions to query vport 
RoCE fields")
Signed-off-by: Christophe JAILLET 
---
v1 -> v2: More places to update have been added to the patch
v2 -> v3: Add Fixes tag

3 patches with one Fixes tag each should probably be better, but honestly, I 
won't send a v4.
Feel free to split it if needed.
---
 drivers/infiniband/hw/mlx5/cq.c| 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/vport.c| 6 +++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 77d257ec899b..6d52ea03574e 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -849,7 +849,7 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct 
ib_udata *udata,
return 0;
 
 err_cqb:
-   kfree(*cqb);
+   kvfree(*cqb);
 
 err_db:
mlx5_ib_db_unmap_user(to_mucontext(context), &cq->db);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 35e256eb2f6e..b123f8a52ad8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -663,7 +663,7 @@ static int esw_create_vport_rx_group(struct mlx5_eswitch 
*esw)
 
esw->offloads.vport_rx_group = g;
 out:
-   kfree(flow_group_in);
+   kvfree(flow_group_in);
return err;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c 
b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index 177e076b8d17..719cecb182c6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -511,7 +511,7 @@ int mlx5_query_nic_vport_system_image_guid(struct 
mlx5_core_dev *mdev,
*system_image_guid = MLX5_GET64(query_nic_vport_context_out, out,
nic_vport_context.system_image_guid);
 
-   kfree(out);
+   kvfree(out);
 
return 0;
 }
@@ -531,7 +531,7 @@ int mlx5_query_nic_vport_node_guid(struct mlx5_core_dev 
*mdev, u64 *node_guid)
*node_guid = MLX5_GET64(query_nic_vport_context_out, out,
nic_vport_context.node_guid);
 
-   kfree(out);
+   kvfree(out);
 
return 0;
 }
@@ -587,7 +587,7 @@ int mlx5_query_nic_vport_qkey_viol_cntr(struct 
mlx5_core_dev *mdev,
*qkey_viol_cntr = MLX5_GET(query_nic_vport_context_out, out,
   nic_vport_context.qkey_violation_counter);
 
-   kfree(out);
+   kvfree(out);
 
return 0;
 }
-- 
2.17.0
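For context on the bug class this patch fixes: kvzalloc() may satisfy a request from either of two allocators (kmalloc, or vmalloc as a fallback for large or fragmented allocations), so only kvfree(), which detects the buffer's provenance and dispatches to the matching free routine, can safely release it; plain kfree() is correct only on the kmalloc path. A userspace sketch of that contract, where a small header stands in for the kernel's is_vmalloc_addr() check (names are illustrative, not kernel API):

```c
#include <stdlib.h>
#include <string.h>

enum alloc_src { SRC_SMALL, SRC_LARGE };

struct alloc_hdr { enum alloc_src src; };

/* kvzalloc()-like: pick an "allocator" based on size, zero the buffer. */
static void *kvzalloc_sketch(size_t size)
{
	struct alloc_hdr *h = malloc(sizeof(*h) + size);

	if (!h)
		return NULL;
	/* Stand-in for the kmalloc-then-vmalloc fallback decision. */
	h->src = size > 4096 ? SRC_LARGE : SRC_SMALL;
	memset(h + 1, 0, size);
	return h + 1;
}

/* kvfree()-like: recover the provenance and free accordingly; in the
 * kernel this is is_vmalloc_addr() choosing between vfree() and kfree(). */
static void kvfree_sketch(void *p)
{
	if (!p)
		return;
	free((struct alloc_hdr *)p - 1);
}
```

With that contract in mind the patch is mechanical: every buffer obtained via kvzalloc() must flow to kvfree(), never kfree().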



Re: [PATCH net-next v2 0/3] net: Allow more drivers with COMPILE_TEST

2018-05-16 Thread David Miller
From: David Miller 
Date: Wed, 16 May 2018 15:06:59 -0400 (EDT)

> So applied, thanks.

Nevermind, eventually got a build failure:

ERROR: "knav_queue_open" [drivers/net/ethernet/ti/keystone_netcp.ko] undefined!
make[1]: *** [scripts/Makefile.modpost:92: __modpost] Error 1
make: *** [Makefile:1276: modules] Error 2

Reverted.


Re: [PATCH net-next v2 0/3] net: Allow more drivers with COMPILE_TEST

2018-05-16 Thread David Miller
From: Florian Fainelli 
Date: Wed, 16 May 2018 11:52:55 -0700

> This patch series includes more drivers to be build tested with COMPILE_TEST
> enabled. This helps cover some of the issues I just ran into with missing
> a driver *sigh*.
> 
> Changes in v2:
> 
> - allow FEC to build outside of CONFIG_ARM/ARM64 by defining a layout of
>   registers, this is not meant to run, so this is not a real issue if we
>   are not matching the correct register layout

Ok, this is a lot better.

But man, some of these drivers...

drivers/net/ethernet/ti/davinci_cpdma.c: In function ‘cpdma_desc_pool_destroy’:
drivers/net/ethernet/ti/davinci_cpdma.c:194:7: warning: format ‘%d’ expects 
argument of type ‘int’, but argument 2 has type ‘size_t {aka long unsigned 
int}’ [-Wformat=]
   "cpdma_desc_pool size %d != avail %d",
   ^
   gen_pool_size(pool->gen_pool),
   ~

and on and on and on...

But I'm really happy to see FEC and others at least being build tested
in more scenarios.

So applied, thanks.
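The -Wformat warning quoted above is the classic size_t mismatch: `%d` expects an `int`, but the argument is a `size_t`, which is `unsigned long` on LP64 targets. The portable fix is the C99 `%zu` length modifier rather than a cast. A minimal sketch of the corrected formatting (the message text mirrors the driver's, the helper name is illustrative):

```c
#include <stdio.h>

/* Format a pool-size message portably: %zu is the C99 conversion for
 * size_t; %d would be flagged by -Wformat and is wrong on LP64, where
 * size_t is unsigned long. */
static void format_pool_size(char *out, size_t outlen, size_t pool_size)
{
	snprintf(out, outlen, "cpdma_desc_pool size %zu", pool_size);
}
```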


Re: [RFC bpf-next 00/11] Add socket lookup support

2018-05-16 Thread Joe Stringer
On 9 May 2018 at 14:06, Joe Stringer  wrote:
> This series proposes a new helper for the BPF API which allows BPF programs to
> perform lookups for sockets in a network namespace. This would allow programs
> to determine early on in processing whether the stack is expecting to receive
> the packet, and perform some action (eg drop, forward somewhere) based on this
> information.
>
> The series is structured roughly into:
> * Misc refactor
> * Add the socket pointer type
> * Add reference tracking to ensure that socket references are freed
> * Extend the BPF API to add sk_lookup() / sk_release() functions
> * Add tests/documentation
>
> The helper proposed in this series includes a parameter for a tuple which must
> be filled in by the caller to determine the socket to look up. The simplest
> case would be filling with the contents of the packet, ie mapping the packet's
> 5-tuple into the parameter. In common cases, it may alternatively be useful to
> reverse the direction of the tuple and perform a lookup, to find the socket
> that initiates this connection; and if the BPF program ever performs a form of
> IP address translation, it may further be useful to be able to look up
> arbitrary tuples that are not based upon the packet, but instead based on 
> state
> held in BPF maps or hardcoded in the BPF program.
>
> Currently, access into the socket's fields are limited to those which are
> otherwise already accessible, and are restricted to read-only access.
>
> A few open points:
> * Currently, the lookup interface only returns either a valid socket or a NULL
>   pointer. This means that if there is any kind of issue with the tuple, such
>   as it provides an unsupported protocol number, or the socket can't be found,
>   then we are unable to differentiate these cases from one another. One 
> natural
>   approach to improve this could be to return an ERR_PTR from the
>   bpf_sk_lookup() helper. This would be more complicated but maybe it's
>   worthwhile.

This suggestion would add a lot of complexity, and there aren't many
legitimately different error cases. There's:
* Unsupported socket type
* Cannot find netns
* Tuple argument is the wrong size
* Can't find socket

If we split the helpers into protocol-specific types, the first one
would be addressed. The last one is addressed by returning NULL. It
seems like a reasonable compromise to me to return NULL also in the
middle two cases as well, and rely on the BPF writer to provide valid
arguments.

> * No ordering is defined between sockets. If the tuple could find multiple
>   sockets, then it will arbitrarily return one. It is up to the caller to
>   handle this. If we wish to handle this more reliably in future, we could
>   encode an ordering preference in the flags field.

Doesn't need to be addressed with this series, there is scope for
addressing these cases when the use case arises.

> * Currently this helper is only defined for TC hook point, but it should also
>   be valid at XDP and perhaps some other hooks.

Easy to add support for XDP on demand, initial implementation doesn't need it.
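One concrete point from the cover letter worth illustrating is the "reverse the direction of the tuple" lookup, used to find the socket that initiated a connection rather than the one receiving the packet. Sketched in plain C with a hypothetical tuple layout (not the series' actual struct; the RFC's helper takes a caller-filled tuple argument of broadly similar shape):

```c
#include <stdint.h>

/* Hypothetical 5-tuple for illustration only. */
struct lookup_tuple {
	uint32_t saddr, daddr;	/* IPv4 addresses, network byte order */
	uint16_t sport, dport;	/* ports, network byte order */
	uint8_t  proto;		/* IPPROTO_TCP / IPPROTO_UDP */
};

/* To look up the socket that *initiated* the connection a packet
 * belongs to, swap source and destination before the lookup; the
 * protocol is direction-agnostic and stays as-is. */
static struct lookup_tuple reverse_tuple(struct lookup_tuple t)
{
	struct lookup_tuple r = t;

	r.saddr = t.daddr;
	r.daddr = t.saddr;
	r.sport = t.dport;
	r.dport = t.sport;
	return r;
}
```

The same swap covers the NAT-style use case in the cover letter: a program translating addresses can build an arbitrary tuple from map state instead of the packet, then reverse it as needed.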


Re: [PATCH net-next] cxgb4: update LE-TCAM collection for T6

2018-05-16 Thread David Miller
From: Rahul Lakkireddy 
Date: Wed, 16 May 2018 19:51:15 +0530

> For T6, clip table is separated from main TCAM. So, update LE-TCAM
> collection logic to collect clip table TCAM as well. IPv6 takes
> 4 entries in clip table TCAM compared to 2 entries in main TCAM.
> 
> Also, in case of errors, keep LE-TCAM collected so far and set the
> status to partial dump.
> 
> Signed-off-by: Rahul Lakkireddy 
> Signed-off-by: Ganesh Goudar 

Applied, thanks.

