Re: [RFC PATCH bpf-next 05/12] xdp: add MEM_TYPE_ZERO_COPY
On Tue, 15 May 2018 21:06:08 +0200 Björn Töpel wrote:

> @@ -82,6 +88,10 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
>  	int metasize;
>  	int headroom;
>
> +	// XXX implement clone, copy, use "native" MEM_TYPE
> +	if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
> +		return NULL;
> +

There are going to be significant tradeoffs between AF_XDP zero-copy and the
copy variant. The copy variant still has very attractive RX performance, and
other benefits such as not exposing unrelated packets to userspace (but
limiting these to the XDP filter).

Thus, as a user I would like to choose between AF_XDP zero-copy and the copy
variant. Even if my NIC supports zero-copy, I may be interested in enabling
only the copy variant. This patchset doesn't let me choose.

How do we expose this to userspace? (Maybe as simple as a
sockaddr_xdp->sxdp_flags flag?)

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH 11/40] ipv6/flowlabel: simplify pid namespace lookup
Christoph Hellwig writes:

> On Sat, May 05, 2018 at 07:37:33AM -0500, Eric W. Biederman wrote:
>> Christoph Hellwig writes:
>>
>> > The whole seq_file sequence already operates under a single RCU lock pair,
>> > so move the pid namespace lookup into it, and stop grabbing a reference
>> > and remove all kinds of boilerplate code.
>>
>> This is wrong.
>>
>> Moving task_active_pid_ns(current) from open to seq_start means that if
>> you pass this proc file between callers the results will change. So
>> this breaks file descriptor passing.
>>
>> Open is a bad place to access current. In the middle of read/write is
>> broken.
>>
>> In this particular instance looking up the pid namespace with
>> task_active_pid_ns was a personal brain fart. What the code should be
>> doing (with an appropriate helper) is:
>>
>> 	struct pid_namespace *pid_ns = inode->i_sb->s_fs_info;
>>
>> Because each mount of proc is bound to a pid namespace. Looking up the
>> pid namespace from the super_block is a much better way to go.
>
> What do you have in mind for the helper? For now I've open-coded it in
> my working tree, but I'd be glad to add a helper.
>
> struct pid_namespace *proc_pid_namespace(struct inode *inode)
> {
> 	// maybe warn on for s_magic not on procfs??
> 	return inode->i_sb->s_fs_info;
> }

That should work. Ideally out of line for the proc_fs.h version.
Basically it should be a cousin of PDE_DATA.

Eric
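[Putting the two suggestions from this exchange together, the helper might
look like the following kernel-side sketch. This is not runnable code from
the thread; the use of PROC_SUPER_MAGIC (from linux/magic.h) is my guess at
how the "maybe warn on s_magic" comment would be implemented:]

```c
/* Sketch only: a cousin of PDE_DATA, resolving the pid namespace that a
 * given mount of proc is bound to. Each proc superblock stores its pid
 * namespace in s_fs_info, so no reference to current is needed. */
static inline struct pid_namespace *proc_pid_namespace(struct inode *inode)
{
	/* catch callers passing an inode from some other filesystem */
	WARN_ON_ONCE(inode->i_sb->s_magic != PROC_SUPER_MAGIC);
	return inode->i_sb->s_fs_info;
}
```

[Because the namespace comes from the superblock rather than from current,
the result stays stable when the file descriptor is passed between
processes — which is exactly the breakage Eric describes above.]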
Re: [PATCH net-next 1/3] net: ethernet: ti: Allow most drivers with COMPILE_TEST
Hi Florian,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-Allow-more-drivers-with-COMPILE_TEST/20180517-092807
config: xtensa-allyesconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa

All warnings (new ones prefixed by >>):

   drivers/net//ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit':
>> drivers/net//ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 1 of 'writel_relaxed' makes integer from pointer without a cast [-Wint-conversion]
     writel_relaxed(token, &desc->sw_token);
                    ^
   In file included from arch/xtensa/include/asm/io.h:83:0,
                    from include/linux/scatterlist.h:9,
                    from include/linux/dma-mapping.h:11,
                    from drivers/net//ethernet/ti/davinci_cpdma.c:21:
   include/asm-generic/io.h:303:24: note: expected 'u32 {aka unsigned int}' but argument is of type 'void *'
    #define writel_relaxed writel_relaxed
                           ^
>> include/asm-generic/io.h:304:20: note: in expansion of macro 'writel_relaxed'
    static inline void writel_relaxed(u32 value, volatile void __iomem *addr)
                       ^~
--
   drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit':
   drivers/net/ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 1 of 'writel_relaxed' makes integer from pointer without a cast [-Wint-conversion]
     writel_relaxed(token, &desc->sw_token);
                    ^
   In file included from arch/xtensa/include/asm/io.h:83:0,
                    from include/linux/scatterlist.h:9,
                    from include/linux/dma-mapping.h:11,
                    from drivers/net/ethernet/ti/davinci_cpdma.c:21:
   include/asm-generic/io.h:303:24: note: expected 'u32 {aka unsigned int}' but argument is of type 'void *'
    #define writel_relaxed writel_relaxed
                           ^
>> include/asm-generic/io.h:304:20: note: in expansion of macro 'writel_relaxed'
    static inline void writel_relaxed(u32 value, volatile void __iomem *addr)
                       ^~

vim +/writel_relaxed +1083 drivers/net//ethernet/ti/davinci_cpdma.c

ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1029
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1030  int cpdma_chan_submit(struct cpdma_chan *chan, void *token, void *data,
aef614e1  drivers/net/ethernet/ti/davinci_cpdma.c  Sebastian Siewior  2013-04-23  1031  		      int len, int directed)
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1032  {
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1033  	struct cpdma_ctlr *ctlr = chan->ctlr;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1034  	struct cpdma_desc __iomem *desc;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1035  	dma_addr_t buffer;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1036  	unsigned long flags;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1037  	u32 mode;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1038  	int ret = 0;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1039
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1040  	spin_lock_irqsave(&chan->lock, flags);
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1041
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1042  	if (chan->state == CPDMA_STATE_TEARDOWN) {
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1043  		ret = -EINVAL;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1044  		goto unlock_ret;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1045  	}
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1046
742fb20f  drivers/net/ethernet/ti/davinci_cpdma.c  Grygorii Strashko  2016-06-27  1047  	if (chan->count >= chan->desc_num) {
742fb20f  drivers/net/ethernet/ti/davinci_cpdma.c  Grygorii Strashko  2016-06-27  1048  		chan->stats.desc_alloc_fail++;
742fb20f  drivers/net/ethernet/ti/davinci_cpdma.c  Grygorii Strashko  2016-06-27  1049  		ret = -ENOMEM;
742fb20f
Re: [PATCH net-next 2/3] net: ethernet: freescale: Allow FEC with COMPILE_TEST
Hi Florian,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-Allow-more-drivers-with-COMPILE_TEST/20180517-092807
config: m68k-allmodconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k

All errors (new ones prefixed by >>):

   In file included from include/linux/swab.h:5:0,
                    from include/uapi/linux/byteorder/big_endian.h:13,
                    from include/linux/byteorder/big_endian.h:5,
                    from arch/m68k/include/uapi/asm/byteorder.h:5,
                    from include/asm-generic/bitops/le.h:6,
                    from arch/m68k/include/asm/bitops.h:519,
                    from include/linux/bitops.h:38,
                    from include/linux/kernel.h:11,
                    from include/linux/list.h:9,
                    from include/linux/module.h:9,
                    from drivers/net//ethernet/freescale/fec_main.c:24:
   drivers/net//ethernet/freescale/fec_main.c: In function 'fec_restart':
>> drivers/net//ethernet/freescale/fec_main.c:959:26: error: 'FEC_RACC' undeclared (first use in this function); did you mean 'FEC_RXIC1'?
      val = readl(fep->hwp + FEC_RACC);
                             ^
   include/uapi/linux/swab.h:117:32: note: in definition of macro '__swab32'
     (__builtin_constant_p((__u32)(x)) ? \
                                   ^
   include/linux/byteorder/generic.h:89:21: note: in expansion of macro '__le32_to_cpu'
    #define le32_to_cpu __le32_to_cpu
                        ^
   arch/m68k/include/asm/io_mm.h:452:26: note: in expansion of macro 'in_le32'
    #define readl(addr)      in_le32(addr)
                             ^~~
   drivers/net//ethernet/freescale/fec_main.c:959:9: note: in expansion of macro 'readl'
     val = readl(fep->hwp + FEC_RACC);
           ^
   drivers/net//ethernet/freescale/fec_main.c:959:26: note: each undeclared identifier is reported only once for each function it appears in
     val = readl(fep->hwp + FEC_RACC);
                            ^
   include/uapi/linux/swab.h:117:32: note: in definition of macro '__swab32'
     (__builtin_constant_p((__u32)(x)) ? \
                                   ^
   include/linux/byteorder/generic.h:89:21: note: in expansion of macro '__le32_to_cpu'
    #define le32_to_cpu __le32_to_cpu
                        ^
   arch/m68k/include/asm/io_mm.h:452:26: note: in expansion of macro 'in_le32'
    #define readl(addr)      in_le32(addr)
                             ^~~
   drivers/net//ethernet/freescale/fec_main.c:959:9: note: in expansion of macro 'readl'
     val = readl(fep->hwp + FEC_RACC);
           ^
   In file included from arch/m68k/include/asm/io_mm.h:27:0,
                    from arch/m68k/include/asm/io.h:5,
                    from include/linux/scatterlist.h:9,
                    from include/linux/dma-mapping.h:11,
                    from include/linux/skbuff.h:34,
                    from include/linux/if_ether.h:23,
                    from include/uapi/linux/ethtool.h:19,
                    from include/linux/ethtool.h:18,
                    from include/linux/netdevice.h:41,
                    from drivers/net//ethernet/freescale/fec_main.c:34:
   drivers/net//ethernet/freescale/fec_main.c:968:38: error: 'FEC_FTRL' undeclared (first use in this function); did you mean 'FEC_ECNTRL'?
      writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
                                         ^
   arch/m68k/include/asm/raw_io.h:48:64: note: in definition of macro 'out_le32'
    #define out_le32(addr,l)	(void)((*(__force volatile __le32 *) (addr)) = cpu_to_le32(l))
                                                                   ^~~~
   drivers/net//ethernet/freescale/fec_main.c:968:3: note: in expansion of macro 'writel'
      writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
      ^~
   drivers/net//ethernet/freescale/fec_main.c:1034:38: error: 'FEC_R_FIFO_RSEM' undeclared (first use in this function); did you mean 'FEC_FIFO_RAM'?
      writel(FEC_ENET_RSEM_V, fep->hwp + FEC_R_FIFO_RSEM);
                                         ^
   arch/m68k/include/asm/raw_io.h:48:64: note: in definition of macro 'out_le32'
    #define out_le32(addr,l)	(void)((*(__force volatile __le32 *) (addr)) = cpu_to_le32(l))
                                                                   ^~~~
   drivers/net//ethernet/freescale/fec_main.c:1034:3: note: in expansion of macro 'writel'
      writel(FEC_ENET_RSEM_V, fep->hwp + FEC_R_FIFO_RSEM);
      ^~
   drivers/net//ethernet/freescale/fec_main.c:1035:38: error: 'FEC_R_FIFO_RSFL' undeclared (first use in
[PATCH net-next] vmxnet3: Replace msleep(1) with usleep_range()
As documented in Documentation/timers/timers-howto.txt, replace msleep(1)
with usleep_range().

Signed-off-by: YueHaibing
---
 drivers/net/vmxnet3/vmxnet3_drv.c     | 6 +++---
 drivers/net/vmxnet3/vmxnet3_ethtool.c | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index 9ebe2a6..2234a33 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -2945,7 +2945,7 @@ vmxnet3_close(struct net_device *netdev)
 	 * completion.
 	 */
 	while (test_and_set_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state))
-		msleep(1);
+		usleep_range(1000, 2000);
 
 	vmxnet3_quiesce_dev(adapter);
 
@@ -2995,7 +2995,7 @@ vmxnet3_change_mtu(struct net_device *netdev, int new_mtu)
 	 * completion.
 	 */
 	while (test_and_set_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state))
-		msleep(1);
+		usleep_range(1000, 2000);
 
 	if (netif_running(netdev)) {
 		vmxnet3_quiesce_dev(adapter);
@@ -3567,7 +3567,7 @@ static void vmxnet3_shutdown_device(struct pci_dev *pdev)
 	 * completion.
 	 */
 	while (test_and_set_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state))
-		msleep(1);
+		usleep_range(1000, 2000);
 
 	if (test_and_set_bit(VMXNET3_STATE_BIT_QUIESCED, &adapter->state)) {
 
diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c b/drivers/net/vmxnet3/vmxnet3_ethtool.c
index 2ff2731..559db05 100644
--- a/drivers/net/vmxnet3/vmxnet3_ethtool.c
+++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
@@ -600,7 +600,7 @@ vmxnet3_set_ringparam(struct net_device *netdev,
 	 * completion.
 	 */
 	while (test_and_set_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state))
-		msleep(1);
+		usleep_range(1000, 2000);
 
 	if (netif_running(netdev)) {
 		vmxnet3_quiesce_dev(adapter);
-- 
2.7.0
Re: [PATCH net-next 2/3] net: ethernet: freescale: Allow FEC with COMPILE_TEST
Hi Florian,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-Allow-more-drivers-with-COMPILE_TEST/20180517-092807
config: m68k-allyesconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k

All warnings (new ones prefixed by >>):

   In file included from include/linux/swab.h:5:0,
                    from include/uapi/linux/byteorder/big_endian.h:13,
                    from include/linux/byteorder/big_endian.h:5,
                    from arch/m68k/include/uapi/asm/byteorder.h:5,
                    from include/asm-generic/bitops/le.h:6,
                    from arch/m68k/include/asm/bitops.h:519,
                    from include/linux/bitops.h:38,
                    from include/linux/kernel.h:11,
                    from include/linux/list.h:9,
                    from include/linux/module.h:9,
                    from drivers/net/ethernet/freescale/fec_main.c:24:
   drivers/net/ethernet/freescale/fec_main.c: In function 'fec_restart':
   drivers/net/ethernet/freescale/fec_main.c:959:26: error: 'FEC_RACC' undeclared (first use in this function); did you mean 'FEC_RXIC0'?
      val = readl(fep->hwp + FEC_RACC);
                             ^
   include/uapi/linux/swab.h:117:32: note: in definition of macro '__swab32'
     (__builtin_constant_p((__u32)(x)) ? \
                                   ^
>> include/linux/byteorder/generic.h:89:21: note: in expansion of macro '__le32_to_cpu'
    #define le32_to_cpu __le32_to_cpu
                        ^
>> arch/m68k/include/asm/io_mm.h:452:26: note: in expansion of macro 'in_le32'
    #define readl(addr)      in_le32(addr)
                             ^~~
>> drivers/net/ethernet/freescale/fec_main.c:959:9: note: in expansion of macro 'readl'
     val = readl(fep->hwp + FEC_RACC);
           ^
   drivers/net/ethernet/freescale/fec_main.c:959:26: note: each undeclared identifier is reported only once for each function it appears in
     val = readl(fep->hwp + FEC_RACC);
                            ^
   include/uapi/linux/swab.h:117:32: note: in definition of macro '__swab32'
     (__builtin_constant_p((__u32)(x)) ? \
                                   ^
>> include/linux/byteorder/generic.h:89:21: note: in expansion of macro '__le32_to_cpu'
    #define le32_to_cpu __le32_to_cpu
                        ^
>> arch/m68k/include/asm/io_mm.h:452:26: note: in expansion of macro 'in_le32'
    #define readl(addr)      in_le32(addr)
                             ^~~
>> drivers/net/ethernet/freescale/fec_main.c:959:9: note: in expansion of macro 'readl'
     val = readl(fep->hwp + FEC_RACC);
           ^
   In file included from arch/m68k/include/asm/io_mm.h:27:0,
                    from arch/m68k/include/asm/io.h:5,
                    from include/linux/scatterlist.h:9,
                    from include/linux/dma-mapping.h:11,
                    from include/linux/skbuff.h:34,
                    from include/linux/if_ether.h:23,
                    from include/uapi/linux/ethtool.h:19,
                    from include/linux/ethtool.h:18,
                    from include/linux/netdevice.h:41,
                    from drivers/net/ethernet/freescale/fec_main.c:34:
   drivers/net/ethernet/freescale/fec_main.c:968:38: error: 'FEC_FTRL' undeclared (first use in this function); did you mean 'FEC_ECNTRL'?
      writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
                                         ^
   arch/m68k/include/asm/raw_io.h:48:64: note: in definition of macro 'out_le32'
    #define out_le32(addr,l)	(void)((*(__force volatile __le32 *) (addr)) = cpu_to_le32(l))
                                                                   ^~~~
>> drivers/net/ethernet/freescale/fec_main.c:968:3: note: in expansion of macro 'writel'
      writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
      ^~
   drivers/net/ethernet/freescale/fec_main.c:1034:38: error: 'FEC_R_FIFO_RSEM' undeclared (first use in this function); did you mean 'FEC_FIFO_RAM'?
      writel(FEC_ENET_RSEM_V, fep->hwp + FEC_R_FIFO_RSEM);
                                         ^
   arch/m68k/include/asm/raw_io.h:48:64: note: in definition of macro 'out_le32'
    #define out_le32(addr,l)	(void)((*(__force volatile __le32 *) (addr)) = cpu_to_le32(l))
                                                                   ^~~~
   drivers/net/ethernet/freescale/fec_main.c:1034:3: note: in expansion of macro 'writel'
      writel(FEC_ENET_RSEM_V, fep->hwp + FEC_R_FIFO_RSEM);
      ^~
   drivers/net/ethernet/freescale/fec_main.c:1035:38: error: 'FEC_R_FIFO_RSFL' undeclared (first
Re: [PATCH net-next 1/3] net: ethernet: ti: Allow most drivers with COMPILE_TEST
Hi Florian,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Florian-Fainelli/net-Allow-more-drivers-with-COMPILE_TEST/20180517-092807
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All warnings (new ones prefixed by >>):

   In file included from arch/x86/include/asm/realmode.h:15:0,
                    from arch/x86/include/asm/acpi.h:33,
                    from arch/x86/include/asm/fixmap.h:19,
                    from arch/x86/include/asm/apic.h:10,
                    from arch/x86/include/asm/smp.h:13,
                    from include/linux/smp.h:64,
                    from include/linux/topology.h:33,
                    from include/linux/gfp.h:9,
                    from include/linux/idr.h:16,
                    from include/linux/kernfs.h:14,
                    from include/linux/sysfs.h:16,
                    from include/linux/kobject.h:20,
                    from include/linux/device.h:16,
                    from drivers/net/ethernet/ti/davinci_cpdma.c:17:
   drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_chan_submit':
>> drivers/net/ethernet/ti/davinci_cpdma.c:1083:17: warning: passing argument 1 of '__writel' makes integer from pointer without a cast [-Wint-conversion]
     writel_relaxed(token, &desc->sw_token);
                    ^
   arch/x86/include/asm/io.h:88:39: note: in definition of macro 'writel_relaxed'
    #define writel_relaxed(v, a) __writel(v, a)
                                          ^
   arch/x86/include/asm/io.h:71:18: note: expected 'unsigned int' but argument is of type 'void *'
    build_mmio_write(__writel, "l", unsigned int, "r", )
                     ^
   arch/x86/include/asm/io.h:53:20: note: in definition of macro 'build_mmio_write'
    static inline void name(type val, volatile void __iomem *addr) \
                       ^~~~

vim +/__writel +1083 drivers/net/ethernet/ti/davinci_cpdma.c

ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1029
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1030  int cpdma_chan_submit(struct cpdma_chan *chan, void *token, void *data,
aef614e1  drivers/net/ethernet/ti/davinci_cpdma.c  Sebastian Siewior  2013-04-23  1031  		      int len, int directed)
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1032  {
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1033  	struct cpdma_ctlr *ctlr = chan->ctlr;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1034  	struct cpdma_desc __iomem *desc;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1035  	dma_addr_t buffer;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1036  	unsigned long flags;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1037  	u32 mode;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1038  	int ret = 0;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1039
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1040  	spin_lock_irqsave(&chan->lock, flags);
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1041
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1042  	if (chan->state == CPDMA_STATE_TEARDOWN) {
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1043  		ret = -EINVAL;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1044  		goto unlock_ret;
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1045  	}
ef8c2dab  drivers/net/davinci_cpdma.c              Cyril Chemparathy  2010-09-15  1046
742fb20f  drivers/net/ethernet/ti/davinci_cpdma.c  Grygorii Strashko  2016-06-27  1047  	if (chan->count >= chan->desc_num) {
742fb20f  drivers/net/ethernet/ti/davinci_cpdma.c  Grygorii Strashko  2016-06-27  1048  		chan->stats.desc_alloc_fail++;
742fb20f  drivers/net/ethernet/ti/davinci_cpdma.c  Grygorii Strashko  2016-06-27  1049  		ret = -ENOMEM;
742fb20f  drivers/net/ethernet/ti/davinci_cpdma.c  Grygorii Strashko  2016-06-27  1050  		goto unlock_ret;
742fb20f  drivers/net/ethernet/ti/davinci_cpdma.c  Grygorii Strashko  2016-06-27  1051  	}
742fb20f  drivers/net/ethernet/ti/davinci_cpdma.c  Grygorii Strashko  2016-06-27  1052
742fb20f  drivers/net/ethernet/ti/davinci_cpdma.c  Grygorii Strashko  2016-06-27  1053  	desc = cpdma_desc_alloc(ctlr->pool);
ef8c2dab
Re: xdp and fragments with virtio
On 5/16/18 1:24 AM, Jason Wang wrote:
>
> On 2018-05-16 at 11:51, David Ahern wrote:
>> Hi Jason:
>>
>> I am trying to test MTU changes to the BPF fib_lookup helper and seeing
>> something odd. Hoping you can help.
>>
>> I have a VM with multiple virtio based NICs and tap backends. I install
>> the xdp program on eth1 and eth2 to do forwarding. In the host I send a
>> large packet to eth1:
>>
>> $ ping -s 1500 9.9.9.9
>>
>> The tap device in the host sees 2 packets:
>>
>> $ sudo tcpdump -nv -i vm02-eth1
>> 20:44:33.943160 IP (tos 0x0, ttl 64, id 58746, offset 0, flags [+],
>> proto ICMP (1), length 1500)
>>     10.100.1.254 > 9.9.9.9: ICMP echo request, id 17917, seq 1, length 1480
>> 20:44:33.943172 IP (tos 0x0, ttl 64, id 58746, offset 1480, flags
>> [none], proto ICMP (1), length 48)
>>     10.100.1.254 > 9.9.9.9: ip-proto-1
>>
>> In the VM, the XDP program only sees the first packet, not the fragment.
>> I added a printk to the program (see diff below):
>>
>> $ cat trace_pipe
>> <idle>-0     [003] ..s2   254.436467: 0: packet length 1514
>>
>> Anything come to mind in the virtio xdp implementation that affects
>> fragmented packets? I see this with both IPv4 and v6.
>
> Not yet. But we do turn off tap gso when virtio has XDP set, but it
> shouldn't matter in this case.
>
> Will try to see what's wrong.

I added this to the command line for the NICs and it works:

    "mrg_rxbuf=off,guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off"

The XDP program now sees the full size packet and the fragment.

Fun fact: adding only mrg_rxbuf=off, so that mergeable_rx_bufs is false but
big_packets is true, generates a panic when it receives large packets.
Re: pull-request: bpf-next 2018-05-17
From: Daniel Borkmann
Date: Thu, 17 May 2018 03:09:48 +0200

> The following pull-request contains BPF updates for your *net-next*
> tree.

Looks good, pulled, thanks Daniel.
Re: [PATCH net-next v3 1/3] ipv4: support sport, dport and ip_proto in RTM_GETROUTE
From: Roopa Prabhu
Date: Wed, 16 May 2018 13:30:28 -0700

> yes, but we hold the rcu read lock before calling the reply function for
> the fib result. I did consider allocating the skb before the read
> lock... but then the refactoring (into a separate netlink reply func)
> would seem unnecessary.
>
> I am fine with pre-allocating and undoing the refactoring if that works
> better.

Hmmm... I also notice that with this change we end up doing the
rtnl_unicast() under the RCU lock, which is unnecessary too.

So yes, please pull the "out_skb" allocation before the rcu_read_lock(),
and push the rtnl_unicast() after the rcu_read_unlock().

It really is a shame that sharing the ETH_P_IP skb between the route
lookup and the netlink response doesn't work properly.

I was using RTM_GETROUTE at one point for route/fib lookup performance
measurements. It never was great at that, but now that there are going to
be two SKB allocations instead of one it is going to be even less useful
for that kind of usage.
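[In pseudocode, the ordering being requested looks like this. This is a
sketch only — the helper names other than rcu_read_lock()/rcu_read_unlock()
and rtnl_unicast() are placeholders for the code under discussion, not real
functions from the patch:]

```c
/* Pseudocode: allocation (which may sleep) happens before the RCU
 * section; only the lookup and netlink fill happen under RCU; the
 * unicast send happens after the section ends. */
out_skb = alloc_reply_skb(...);		/* BEFORE rcu_read_lock() */
if (!out_skb)
	return -ENOMEM;

rcu_read_lock();
err = fib_lookup_and_fill_reply(out_skb, ...);	/* lookup + fill only */
rcu_read_unlock();

if (err) {
	kfree_skb(out_skb);
	return err;
}
return rtnl_unicast(out_skb, net, portid);	/* AFTER rcu_read_unlock() */
```

[The two constraints are independent: the allocation must not sleep inside
the RCU read-side critical section, and the unicast simply has no reason to
hold the lock.]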
[PATCH] bonding: introduce link change helper
Introduce a new common helper to avoid redundancy.

Signed-off-by: Tonghao Zhang
---
 drivers/net/bonding/bond_main.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 718e491..3063a9c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2135,6 +2135,24 @@ static int bond_miimon_inspect(struct bonding *bond)
 	return commit;
 }
 
+static void bond_miimon_link_change(struct bonding *bond,
+				    struct slave *slave,
+				    char link)
+{
+	switch (BOND_MODE(bond)) {
+	case BOND_MODE_8023AD:
+		bond_3ad_handle_link_change(slave, link);
+		break;
+	case BOND_MODE_TLB:
+	case BOND_MODE_ALB:
+		bond_alb_handle_link_change(bond, slave, link);
+		break;
+	case BOND_MODE_XOR:
+		bond_update_slave_arr(bond, NULL);
+		break;
+	}
+}
+
 static void bond_miimon_commit(struct bonding *bond)
 {
 	struct list_head *iter;
@@ -2176,16 +2194,7 @@ static void bond_miimon_commit(struct bonding *bond)
 				    slave->speed == SPEED_UNKNOWN ? 0 : slave->speed,
 				    slave->duplex ? "full" : "half");
 
-			/* notify ad that the link status has changed */
-			if (BOND_MODE(bond) == BOND_MODE_8023AD)
-				bond_3ad_handle_link_change(slave, BOND_LINK_UP);
-
-			if (bond_is_lb(bond))
-				bond_alb_handle_link_change(bond, slave,
-							    BOND_LINK_UP);
-
-			if (BOND_MODE(bond) == BOND_MODE_XOR)
-				bond_update_slave_arr(bond, NULL);
+			bond_miimon_link_change(bond, slave, BOND_LINK_UP);
 
 			if (!bond->curr_active_slave || slave == primary)
 				goto do_failover;
@@ -2207,16 +2216,7 @@ static void bond_miimon_commit(struct bonding *bond)
 			netdev_info(bond->dev, "link status definitely down for interface %s, disabling it\n",
 				    slave->dev->name);
 
-			if (BOND_MODE(bond) == BOND_MODE_8023AD)
-				bond_3ad_handle_link_change(slave,
-							    BOND_LINK_DOWN);
-
-			if (bond_is_lb(bond))
-				bond_alb_handle_link_change(bond, slave,
-							    BOND_LINK_DOWN);
-
-			if (BOND_MODE(bond) == BOND_MODE_XOR)
-				bond_update_slave_arr(bond, NULL);
+			bond_miimon_link_change(bond, slave, BOND_LINK_DOWN);
 
 			if (slave == rcu_access_pointer(bond->curr_active_slave))
 				goto do_failover;
-- 
1.8.3.1
Re: [PATCH ghak81 V3 3/3] audit: collect audit task parameters
Hi Richard,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20180516]
[cannot apply to linus/master tip/sched/core v4.17-rc5 v4.17-rc4 v4.17-rc3 v4.17-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Richard-Guy-Briggs/audit-group-task-params/20180517-090703
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   kernel/fork.c: In function 'copy_process':
>> kernel/fork.c:1739:3: error: 'struct task_struct' has no member named 'audit'
     p->audit = NULL;
       ^~

vim +1739 kernel/fork.c

  1728	
  1729		p->default_timer_slack_ns = current->timer_slack_ns;
  1730	
  1731		task_io_accounting_init(&p->ioac);
  1732		acct_clear_integrals(p);
  1733	
  1734		posix_cpu_timers_init(p);
  1735	
  1736		p->start_time = ktime_get_ns();
  1737		p->real_start_time = ktime_get_boot_ns();
  1738		p->io_context = NULL;
> 1739		p->audit = NULL;
  1740		cgroup_fork(p);
  1741	#ifdef CONFIG_NUMA
  1742		p->mempolicy = mpol_dup(p->mempolicy);
  1743		if (IS_ERR(p->mempolicy)) {
  1744			retval = PTR_ERR(p->mempolicy);
  1745			p->mempolicy = NULL;
  1746			goto bad_fork_cleanup_threadgroup_lock;
  1747		}
  1748	#endif
  1749	#ifdef CONFIG_CPUSETS
  1750		p->cpuset_mem_spread_rotor = NUMA_NO_NODE;
  1751		p->cpuset_slab_spread_rotor = NUMA_NO_NODE;
  1752		seqcount_init(&p->mems_allowed_seq);
  1753	#endif
  1754	#ifdef CONFIG_TRACE_IRQFLAGS
  1755		p->irq_events = 0;
  1756		p->hardirqs_enabled = 0;
  1757		p->hardirq_enable_ip = 0;
  1758		p->hardirq_enable_event = 0;
  1759		p->hardirq_disable_ip = _THIS_IP_;
  1760		p->hardirq_disable_event = 0;
  1761		p->softirqs_enabled = 1;
  1762		p->softirq_enable_ip = _THIS_IP_;
  1763		p->softirq_enable_event = 0;
  1764		p->softirq_disable_ip = 0;
  1765		p->softirq_disable_event = 0;
  1766		p->hardirq_context = 0;
  1767		p->softirq_context = 0;
  1768	#endif
  1769	
  1770		p->pagefault_disabled = 0;
  1771	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
linux-next: manual merge of the net-next tree with the vfs tree
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/ipv4/ipconfig.c

between commits:

  3f3942aca6da ("proc: introduce proc_create_single{,_data}")
  c04d2cb2009f ("ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_servers")

from the vfs tree and commit:

  4d019b3f80dc ("ipconfig: Create /proc/net/ipconfig directory")

from the net-next tree.

I fixed it up (see below - there may be more to do) and can carry the fix
as necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer when
your tree is submitted for merging. You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.

--
Cheers,
Stephen Rothwell

diff --cc net/ipv4/ipconfig.c
index bbcbcc113d19,86c9f755de3d..
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@@ -1282,6 -1317,74 +1317,61 @@@ static int pnp_seq_show(struct seq_fil
 		   &ic_servaddr);
 	return 0;
 }
-
-static int pnp_seq_open(struct inode *indoe, struct file *file)
-{
-	return single_open(file, pnp_seq_show, NULL);
-}
-
-static const struct file_operations pnp_seq_fops = {
-	.open		= pnp_seq_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
-
+ /* Create the /proc/net/ipconfig directory */
+ static int __init ipconfig_proc_net_init(void)
+ {
+ 	ipconfig_dir = proc_net_mkdir(&init_net, "ipconfig", init_net.proc_net);
+ 	if (!ipconfig_dir)
+ 		return -ENOMEM;
+ 
+ 	return 0;
+ }
+ 
+ /* Create a new file under /proc/net/ipconfig */
+ static int ipconfig_proc_net_create(const char *name,
+ 				    const struct file_operations *fops)
+ {
+ 	char *pname;
+ 	struct proc_dir_entry *p;
+ 
+ 	if (!ipconfig_dir)
+ 		return -ENOMEM;
+ 
+ 	pname = kasprintf(GFP_KERNEL, "%s%s", "ipconfig/", name);
+ 	if (!pname)
+ 		return -ENOMEM;
+ 
+ 	p = proc_create(pname, 0444, init_net.proc_net, fops);
+ 	kfree(pname);
+ 	if (!p)
+ 		return -ENOMEM;
+ 
+ 	return 0;
+ }
+ 
+ /* Write NTP server IP addresses to /proc/net/ipconfig/ntp_servers */
+ static int ntp_servers_seq_show(struct seq_file *seq, void *v)
+ {
+ 	int i;
+ 
+ 	for (i = 0; i < CONF_NTP_SERVERS_MAX; i++) {
+ 		if (ic_ntp_servers[i] != NONE)
+ 			seq_printf(seq, "%pI4\n", &ic_ntp_servers[i]);
+ 	}
+ 	return 0;
+ }
+ 
+ static int ntp_servers_seq_open(struct inode *inode, struct file *file)
+ {
+ 	return single_open(file, ntp_servers_seq_show, NULL);
+ }
+ 
+ static const struct file_operations ntp_servers_seq_fops = {
+ 	.open		= ntp_servers_seq_open,
+ 	.read		= seq_read,
+ 	.llseek		= seq_lseek,
+ 	.release	= single_release,
+ };
 #endif /* CONFIG_PROC_FS */
 
 /*
@@@ -1356,8 -1459,20 +1446,20 @@@ static int __init ip_auto_config(void
 	int err;
 	unsigned int i;
 
+	/* Initialise all name servers and NTP servers to NONE (but only if the
+	 * "ip=" or "nfsaddrs=" kernel command line parameters weren't decoded,
+	 * otherwise we'll overwrite the IP addresses specified there)
+	 */
+	if (ic_set_manually == 0) {
+		ic_nameservers_predef();
+		ic_ntp_servers_predef();
+	}
+
 #ifdef CONFIG_PROC_FS
- 	proc_create("pnp", 0444, init_net.proc_net, &pnp_seq_fops);
+ 	proc_create_single("pnp", 0444, init_net.proc_net, pnp_seq_show);
+ 
+ 	if (ipconfig_proc_net_init() == 0)
+ 		ipconfig_proc_net_create("ntp_servers", &ntp_servers_seq_fops);
 #endif /* CONFIG_PROC_FS */
 
 	if (!ic_enable)

pgp6lRKBz8avo.pgp
Description: OpenPGP digital signature
Re: [PATCH 34/40] atm: simplify procfs code
Christoph Hellwig writes:

> On Sat, May 05, 2018 at 07:51:18AM -0500, Eric W. Biederman wrote:
>> Christoph Hellwig writes:
>>
>> > Use remove_proc_subtree to remove the whole subtree on cleanup, and
>> > unwind the registration loop into individual calls. Switch to use
>> > proc_create_seq where applicable.
>>
>> Can you please explain why you are removing the error handling when
>> you are unwinding the registration loop?
>
> Because there is no point in handling these errors. The code works
> perfectly fine without procfs, or without the given proc files, and the
> removal works just fine if they don't exist either. This is a very
> common pattern in various parts of the kernel already.
>
> I'll document it better in the changelog.

Thank you. That is the kind of thing that could be a signal of
inattentiveness and problems, especially when it is not documented.

Eric
pull-request: bpf-next 2018-05-17
Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Provide a new BPF helper for doing a FIB and neighbor lookup in the
   kernel tables from an XDP or tc BPF program. The helper provides a
   fast-path for forwarding packets. The API supports IPv4, IPv6 and MPLS
   protocols, but currently IPv4 and IPv6 are implemented in this initial
   work, from David (Ahern).

2) Just a tiny diff but a huge feature enabled for the nfp driver by
   extending the BPF offload beyond a pure host processing offload.
   Offloaded XDP programs are allowed to set the RX queue index, thus
   opening the door for defining a fully programmable RSS/n-tuple filter
   replacement. Once BPF has decided on a queue already, the device
   data-path will skip the conventional RSS processing completely,
   from Jakub.

3) The original sockmap implementation was array based, similar to devmap.
   However, unlike devmap where an ifindex has a 1:1 mapping into the map,
   there are use cases with sockets that need to be referenced using
   longer keys. Hence, a sockhash map is added, reusing as much of the
   sockmap code as possible, from John.

4) Introduce BTF ID. The ID is allocated through an IDR, similar as with
   BPF maps and progs. It also makes BTF accessible to user space via
   BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data through
   BPF_OBJ_GET_INFO_BY_FD, from Martin.

5) Enable BPF stackmap with build_id also in NMI context. Due to the
   up_read() of current->mm->mmap_sem, build_id cannot be parsed there.
   This work defers the up_read() via a per-cpu irq_work so that at least
   limited support can be enabled, from Song.

6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
   JIT conversion, as well as implementation of an optimized 32/64 bit
   immediate load in the arm64 JIT that allows to reduce the number of
   emitted instructions; in case of tested real-world programs they were
   shrinking by three percent, from Daniel.
7) Add an ifindex parameter to the libbpf loader in order to enable BPF
   offload support. Right now only iproute2 can load offloaded BPF, and
   this will also enable libbpf for direct integration into other
   applications, from David (Beckett).

8) Convert the plain text documentation under Documentation/bpf/ into RST
   format since this is the appropriate standard the kernel is moving to
   for all documentation. Also add an overview README.rst, from Jesper.

9) Add a __printf verification attribute to the bpf_verifier_vlog()
   helper. Though it uses va_list, we can still allow gcc to check the
   format string, from Mathieu.

10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...' is
    a bash 4.0+ feature which is not guaranteed to be available when
    calling out to shell, therefore use a more portable variant, from Joe.

11) Fix a 64 bit division in xdp_umem_reg() by using div_u64() instead of
    relying on the gcc built-in, from Björn.

12) Fix a sock hashmap kmalloc warning reported by syzbot when an overly
    large key size is used in the hashmap, which then causes overflows in
    htab->elem_size. Reject a bogus attr->key_size early in
    sock_hash_alloc(), from Yonghong.

13) Ensure in the BPF selftests, when urandom_read is being linked, that
    --build-id is always enabled so that test_stacktrace_build_id[_nmi]
    won't be failing, from Alexei.

14) Add bitsperlong.h as well as errno.h uapi headers into the tools
    header infrastructure, which point to one of the arch specific uapi
    headers. This was needed in order to fix a build error on some systems
    for the BPF selftests, from Sirio.

15) Allow for short options to be used in the xdp_monitor BPF sample
    code. And also a bpf.h tools uapi header sync in order to fix a
    selftest build failure. Both from Prashant.

16) More formally clarify the meaning of ID in the direct packet access
    section of the BPF documentation, from Wang.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!
The following changes since commit 53a7bdfb2a2756cce8003b90817f8a6fb4d830d9:

  dt-bindings: dsa: Remove unnecessary #address/#size-cells (2018-05-08 20:28:44 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

for you to fetch changes up to e23afe5e7cba89cd0744c5218eda1b3553455c17:

  bpf: sockmap, on update propagate errors back to userspace (2018-05-17 01:48:22 +0200)

Alexei Starovoitov (4):
      Merge branch 'bpf-jit-cleanups'
      Merge branch 'fix-samples'
      Merge branch 'convert-doc-to-rst'
      selftests/bpf: make sure build-id is on

Björn Töpel (1):
      xsk: fix 64-bit division

Daniel Borkmann (14):
      Merge branch 'bpf-btf-id'
      Merge branch 'bpf-nfp-programmable-rss'
Re: [RFC bpf-next 06/11] bpf: Add reference tracking to verifier
On 14 May 2018 at 20:04, Alexei Starovoitov wrote:
> On Wed, May 09, 2018 at 02:07:04PM -0700, Joe Stringer wrote:
>> Allow helper functions to acquire a reference and return it into a
>> register. Specific pointer types such as the PTR_TO_SOCKET will
>> implicitly represent such a reference. The verifier must ensure that
>> these references are released exactly once in each path through the
>> program.
>>
>> To achieve this, this commit assigns an id to the pointer and tracks it
>> in the 'bpf_func_state', then when the function or program exits,
>> verifies that all of the acquired references have been freed. When the
>> pointer is passed to a function that frees the reference, it is removed
>> from the 'bpf_func_state' and all existing copies of the pointer in
>> registers are marked invalid.
>>
>> Signed-off-by: Joe Stringer
>> ---
>>  include/linux/bpf_verifier.h |  18 ++-
>>  kernel/bpf/verifier.c        | 295 ---
>>  2 files changed, 292 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>> index 9dcd87f1d322..8dbee360b3ec 100644
>> --- a/include/linux/bpf_verifier.h
>> +++ b/include/linux/bpf_verifier.h
>> @@ -104,6 +104,11 @@ struct bpf_stack_state {
>>  	u8 slot_type[BPF_REG_SIZE];
>>  };
>>
>> +struct bpf_reference_state {
>> +	int id;
>> +	int insn_idx; /* allocation insn */
>
> the insn_idx is for more verbose messages, right?
> It doesn't seem to affect the safety of the algorithm.
> Please add a comment to clarify that.

Yup, will do.

>> +/* Acquire a pointer id from the env and update the state->refs to include
>> + * this new pointer reference.
>> + * On success, returns a valid pointer id to associate with the register
>> + * On failure, returns a negative errno.
>> + */
>> +static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx)
>> +{
>> +	struct bpf_func_state *state = cur_func(env);
>> +	int new_ofs = state->acquired_refs;
>> +	int id, err;
>> +
>> +	err = realloc_reference_state(state, state->acquired_refs + 1, true);
>> +	if (err)
>> +		return err;
>> +	id = ++env->id_gen;
>> +	state->refs[new_ofs].id = id;
>> +	state->refs[new_ofs].insn_idx = insn_idx;
>
> I thought that we may avoid this extra 'ref_state' array if we store
> 'id' into the 'aux' array which is one to one to the array of instructions
> and avoid these expensive reallocs, but then I realized we can go
> through the same instruction that returns a pointer to socket
> multiple times and every time it needs to be a different 'id' and
> tracked independently, so yeah. All that infra is necessary.
> Would be good to document the algorithm a bit more.

Good point, I'll add these details to the bpf_reference_state definition.
Will consider other areas that could receive some docs attention.

>> @@ -2498,6 +2711,15 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>>  		return err;
>>  	}
>>
>> +	/* If the function is a release() function, mark all copies of the same
>> +	 * pointer as "freed" in all registers and in the stack.
>> +	 */
>> +	if (is_release_function(func_id)) {
>> +		err = release_reference(env);
>
> I think this can be improved if check_func_arg() stores ptr_id into meta.
> Then this loop
>     for (i = BPF_REG_1; i < BPF_REG_6; i++) {
>         if (reg_is_refcounted(&regs[i])) {
> in release_reference() won't be needed.

That's a nice cleanup.

> Also the macros from the previous patch look ugly, but considering this patch
> I guess it's justified. At least I don't see a better way of doing it.

Completely agree, ugly, but I also didn't see a great alternative.
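The scheme discussed above — acquire allocates a fresh id, release removes it and invalidates all register copies, and program exit fails if anything is still outstanding — can be sketched as a toy Python model. This is only an illustration of the algorithm, not the kernel verifier code; the class and method names are invented for the example.

```python
class RefTrackingError(Exception):
    """Raised when the toy verifier rejects a program path."""

class FuncState:
    """Toy model of bpf_func_state reference tracking."""

    def __init__(self):
        self.refs = []             # ids acquired but not yet released
        self.regs = [None] * 10    # register -> ref id (None = not a ref)
        self._id_gen = 0

    def acquire(self, dst_reg):
        # Like acquire_reference_state(): hand out a fresh id and record it.
        self._id_gen += 1
        self.refs.append(self._id_gen)
        self.regs[dst_reg] = self._id_gen
        return self._id_gen

    def release(self, reg):
        # Like release_reference(): drop the id and invalidate every copy.
        ref_id = self.regs[reg]
        if ref_id is None or ref_id not in self.refs:
            raise RefTrackingError("release of untracked or freed pointer")
        self.refs.remove(ref_id)
        self.regs = [None if r == ref_id else r for r in self.regs]

    def check_exit(self):
        # At program exit, every acquired reference must have been freed.
        if self.refs:
            raise RefTrackingError("unreleased references: %r" % self.refs)
```

Note how releasing through a *copy* of the pointer is fine (the id, not the register, is tracked), while leaking or double-freeing is rejected — the two properties the patch enforces.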
Re: [PATCH v3] {net, IB}/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'
On Wed, 2018-05-16 at 21:07 +0200, Christophe JAILLET wrote:
> When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used
> to free it.
>
> Fixes: 1cbe6fc86ccfe ("IB/mlx5: Add support for CQE compressing")
> Fixes: fed9ce22bf8ae ("net/mlx5: E-Switch, Add API to create vport rx rules")
> Fixes: 9efa75254593d ("net/mlx5_core: Introduce access functions to query vport RoCE fields")
> Signed-off-by: Christophe JAILLET
> ---
> v1 -> v2: More places to update have been added to the patch
> v2 -> v3: Add Fixes tag
>
> 3 patches with one Fixes tag each should probably be better, but
> honestly, I won't send a v4.
> Feel free to split it if needed.

Applied to mlx5-next, thanks Christophe!

> ---
>  drivers/infiniband/hw/mlx5/cq.c                            | 2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/vport.c            | 6 +++---
>  3 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
> index 77d257ec899b..6d52ea03574e 100644
> --- a/drivers/infiniband/hw/mlx5/cq.c
> +++ b/drivers/infiniband/hw/mlx5/cq.c
> @@ -849,7 +849,7 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
>  	return 0;
>
>  err_cqb:
> -	kfree(*cqb);
> +	kvfree(*cqb);
>
>  err_db:
>  	mlx5_ib_db_unmap_user(to_mucontext(context), &cq->db);
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index 35e256eb2f6e..b123f8a52ad8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -663,7 +663,7 @@ static int esw_create_vport_rx_group(struct mlx5_eswitch *esw)
>
>  	esw->offloads.vport_rx_group = g;
>  out:
> -	kfree(flow_group_in);
> +	kvfree(flow_group_in);
>  	return err;
>  }
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> index 177e076b8d17..719cecb182c6 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> @@ -511,7 +511,7 @@ int mlx5_query_nic_vport_system_image_guid(struct mlx5_core_dev *mdev,
>  	*system_image_guid = MLX5_GET64(query_nic_vport_context_out, out,
>  					nic_vport_context.system_image_guid);
>
> -	kfree(out);
> +	kvfree(out);
>
>  	return 0;
>  }
> @@ -531,7 +531,7 @@ int mlx5_query_nic_vport_node_guid(struct mlx5_core_dev *mdev, u64 *node_guid)
>  	*node_guid = MLX5_GET64(query_nic_vport_context_out, out,
>  				nic_vport_context.node_guid);
>
> -	kfree(out);
> +	kvfree(out);
>
>  	return 0;
>  }
> @@ -587,7 +587,7 @@ int mlx5_query_nic_vport_qkey_viol_cntr(struct mlx5_core_dev *mdev,
>  	*qkey_viol_cntr = MLX5_GET(query_nic_vport_context_out, out,
>  				   nic_vport_context.qkey_violation_counter);
>
> -	kfree(out);
> +	kvfree(out);
>
>  	return 0;
>  }
[PATCH net] erspan: fix invalid erspan version.
ERSPAN only supports versions 1 and 2. When packets are sent to an
erspan device which does not have a proper version number set, drop
the packet. In a real case, we observed multicast packets sent to the
erspan per-net device, erspan0, which does not have an erspan version
configured.

Reported-by: Greg Rose
Signed-off-by: William Tu
---
 net/ipv4/ip_gre.c  | 4 +++-
 net/ipv6/ip6_gre.c | 5 ++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 2409e648454d..2d8efeecf619 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -734,10 +734,12 @@ static netdev_tx_t erspan_xmit(struct sk_buff *skb,
 		erspan_build_header(skb, ntohl(tunnel->parms.o_key),
 				    tunnel->index,
 				    truncate, true);
-	else
+	else if (tunnel->erspan_ver == 2)
 		erspan_build_header_v2(skb, ntohl(tunnel->parms.o_key),
 				       tunnel->dir, tunnel->hwid,
 				       truncate, true);
+	else
+		goto free_skb;
 
 	tunnel->parms.o_flags &= ~TUNNEL_KEY;
 	__gre_xmit(skb, dev, &tunnel->parms.iph, htons(ETH_P_ERSPAN));
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index bede77f24784..d20072fc38cb 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -991,11 +991,14 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff *skb,
 			erspan_build_header(skb, ntohl(t->parms.o_key),
 					    t->parms.index,
 					    truncate, false);
-		else
+		else if (t->parms.erspan_ver == 2)
 			erspan_build_header_v2(skb, ntohl(t->parms.o_key),
 					       t->parms.dir,
 					       t->parms.hwid,
 					       truncate, false);
+		else
+			goto tx_err;
+
 		fl6.daddr = t->parms.raddr;
 	}
-- 
2.7.4
Re: [RFC bpf-next 04/11] bpf: Add PTR_TO_SOCKET verifier type
On 14 May 2018 at 19:37, Alexei Starovoitov wrote:
> On Wed, May 09, 2018 at 02:07:02PM -0700, Joe Stringer wrote:
>> Teach the verifier a little bit about a new type of pointer, a
>> PTR_TO_SOCKET. This pointer type is accessed from BPF through the
>> 'struct bpf_sock' structure.
>>
>> Signed-off-by: Joe Stringer
>> ---
>>  include/linux/bpf.h          | 19 +-
>>  include/linux/bpf_verifier.h |  2 ++
>>  kernel/bpf/verifier.c        | 86 ++--
>>  net/core/filter.c            | 30 +---
>>  4 files changed, 114 insertions(+), 23 deletions(-)
>
> Ack for patches 1-3. In this one a few nits:
>
>> @@ -1723,6 +1752,16 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
>>  		err = check_packet_access(env, regno, off, size, false);
>>  		if (!err && t == BPF_READ && value_regno >= 0)
>>  			mark_reg_unknown(env, regs, value_regno);
>> +
>> +	} else if (reg->type == PTR_TO_SOCKET) {
>> +		if (t == BPF_WRITE) {
>> +			verbose(env, "cannot write into socket\n");
>> +			return -EACCES;
>> +		}
>> +		err = check_sock_access(env, regno, off, size, t);
>> +		if (!err && t == BPF_READ && value_regno >= 0)
>
> t == BPF_READ check is unnecessary.
>
>> @@ -5785,7 +5845,13 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
>>
>>  	if (ret == 0)
>>  		/* program is valid, convert *(u32*)(ctx + off) accesses */
>> -		ret = convert_ctx_accesses(env);
>> +		ret = convert_ctx_accesses(env, env->ops->convert_ctx_access,
>> +					   PTR_TO_CTX);
>> +
>> +	if (ret == 0)
>> +		/* Convert *(u32*)(sock_ops + off) accesses */
>> +		ret = convert_ctx_accesses(env, bpf_sock_convert_ctx_access,
>> +					   PTR_TO_SOCKET);
>
> Overall looks great.
> Only this part is missing for PTR_TO_SOCKET:
>                 } else if (dst_reg_type != *prev_dst_type &&
>                            (dst_reg_type == PTR_TO_CTX ||
>                             *prev_dst_type == PTR_TO_CTX)) {
>                         verbose(env, "same insn cannot be used with different pointers\n");
>                         return -EINVAL;
> similar logic has to be added.
> Otherwise the following will be accepted:
>
> R1 = sock_ptr
> goto X;
> ...
> R1 = some_other_valid_ptr;
> goto X;
> ...
> R2 = *(u32 *)(R1 + 0);
> this will be rewritten for first branch,
> but it's wrong for second.

Thanks for the review, will address these comments.
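The hazard Alexei describes — one load instruction reached on two paths with two pointer types, where the ctx/sock access rewrite applied for one type is wrong for the other — can be modelled in a few lines. This is a toy illustration of the check being requested, not the kernel verifier; the names and the `REWRITTEN_TYPES` set are invented for the example.

```python
# Pointer types whose loads get rewritten by convert_ctx_accesses();
# mixing one of these with any other type at the same insn is unsafe.
PTR_TO_CTX, PTR_TO_SOCKET, PTR_TO_PACKET = "ctx", "sock", "pkt"
REWRITTEN_TYPES = {PTR_TO_CTX, PTR_TO_SOCKET}

def check_mem_access(seen_types, insn_idx, ptr_type):
    """seen_types maps insn_idx -> pointer type seen on a previous path.

    Reject the program when the same load instruction is reached with
    incompatible pointer types, mirroring the existing
    'same insn cannot be used with different pointers' check.
    """
    prev = seen_types.setdefault(insn_idx, ptr_type)
    if prev != ptr_type and (prev in REWRITTEN_TYPES or
                             ptr_type in REWRITTEN_TYPES):
        raise ValueError("same insn cannot be used with different pointers")
```

In the quoted example, both branches funnel into the load at label X: the first path records `sock`, the second path arrives with another type, and the check fires instead of letting the wrong rewrite through.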
Re: [bpf-next PATCH] bpf: sockmap, on update propagate errors back to userspace
On 05/17/2018 01:38 AM, John Fastabend wrote:
> When an error happens in the update sockmap element logic also pass
> the err up to the user.
>
> Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work with hashmap")
> Signed-off-by: John Fastabend

Agree, applied to bpf-next, thanks John!
[PATCH bpf] bpf: fix truncated jump targets on heavy expansions
Recently during testing, I ran into the following panic:

[  207.892422] Internal error: Accessing user space memory outside uaccess.h routines: 9604 [#1] SMP
[  207.901637] Modules linked in: binfmt_misc [...]
[  207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: G        W    4.17.0-rc3+ #7
[  207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
[  207.982428] pstate: 6045 (nZCv daif +PAN -UAO)
[  207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
[  207.992603] lr : 0x00bdb754
[  207.996080] sp : 13703ca0
[  207.999384] x29: 13703ca0 x28: 0001
[  208.004688] x27: 0001 x26:
[  208.009992] x25: 13703ce0 x24: 800fb4afcb00
[  208.015295] x23: 7d2f5038 x22: 7d2f5000
[  208.020599] x21: feff2a6f x20: 000a
[  208.025903] x19: 09578000 x18: 0a03
[  208.031206] x17: x16:
[  208.036510] x15: 9de83000 x14:
[  208.041813] x13: x12:
[  208.047116] x11: 0001 x10: 089e7f18
[  208.052419] x9 : feff2a6f x8 :
[  208.057723] x7 : 000a x6 : 00280c616000
[  208.063026] x5 : 0018 x4 : 7db6
[  208.068329] x3 : 0008647a x2 : 19868179b1484500
[  208.073632] x1 : x0 : 09578c08
[  208.078938] Process test_verifier (pid: 2256, stack limit = 0x49ca7974)
[  208.086235] Call trace:
[  208.088672]  bpf_skb_load_helper_8_no_cache+0x34/0xc0
[  208.093713]  0x00bdb754
[  208.096845]  bpf_test_run+0x78/0xf8
[  208.100324]  bpf_prog_test_run_skb+0x148/0x230
[  208.104758]  sys_bpf+0x314/0x1198
[  208.108064]  el0_svc_naked+0x30/0x34
[  208.111632] Code: 91302260 f941 f9001fa1 d281 (29500680)
[  208.117717] ---[ end trace 263cb8a59b5bf29f ]---

The program itself which caused this had a long jump over the whole
instruction sequence where all of the inner instructions required heavy
expansions into multiple BPF instructions. Additionally, I also had BPF
hardening enabled, which requires once more rewrites of all constant
values in order to blind them.

Each time we rewrite insns, bpf_adj_branches() would need to potentially
adjust branch targets which cross the patchlet boundary to accommodate
for the additional delta. Eventually that led to the case where the
target offset could not fit into insn->off's upper 0x7fff limit anymore,
at which point the offset wraps around and becomes negative (in the s16
universe), or vice versa depending on the jump direction.

Therefore it becomes necessary to detect and reject any such occurrences
in a generic way for native eBPF and cBPF to eBPF migrations. For the
latter we can simply check bounds in the bpf_convert_filter()'s
BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case of
subsequent hardening) is a bit more complex in that we need to detect
such truncations before hitting the bpf_prog_realloc(). Thus the latter
is split into an extra pass to probe problematic offsets on the original
program in order to fail early. With that in place and carefully tested,
I no longer hit the panic and the rewrites are rejected properly. The
above example panic I've seen on bpf-next, though the issue itself is
generic in that a guard against this issue in bpf seems more appropriate
in this case.

Signed-off-by: Daniel Borkmann
---
[ Will follow up with an additional test case in bpf-next. ]

 kernel/bpf/core.c | 100 --
 net/core/filter.c |  11 --
 2 files changed, 84 insertions(+), 27 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ba03ec3..6ef6746 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -218,47 +218,84 @@ int bpf_prog_calc_tag(struct bpf_prog *fp)
 	return 0;
 }
 
-static void bpf_adj_branches(struct bpf_prog *prog, u32 pos, u32 delta)
+static int bpf_adj_delta_to_imm(struct bpf_insn *insn, u32 pos, u32 delta,
+				u32 curr, const bool probe_pass)
 {
+	const s64 imm_min = S32_MIN, imm_max = S32_MAX;
+	s64 imm = insn->imm;
+
+	if (curr < pos && curr + imm + 1 > pos)
+		imm += delta;
+	else if (curr > pos + delta && curr + imm + 1 <= pos + delta)
+		imm -= delta;
+	if (imm < imm_min || imm > imm_max)
+		return -ERANGE;
+	if (!probe_pass)
+		insn->imm = imm;
+	return 0;
+}
+
+static int bpf_adj_delta_to_off(struct bpf_insn *insn, u32 pos, u32 delta,
+				u32 curr, const bool probe_pass)
+{
+	const s32 off_min = S16_MIN, off_max = S16_MAX;
+	s32 off = insn->off;
+
+	if (curr < pos &&
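The adjustment-plus-bounds-check at the heart of the fix is compact enough to model directly. The sketch below mirrors the `bpf_adj_delta_to_off()` logic from the patch in Python — a toy model, not the kernel code: when `delta` instructions are patched in at position `pos`, any jump that crosses the patched region must grow or shrink by `delta`, and the result has to stay inside the s16 range of `insn->off`, otherwise the program is rejected instead of letting the offset silently wrap.

```python
S16_MIN, S16_MAX = -0x8000, 0x7fff

def adj_delta_to_off(off, pos, delta, curr):
    """Adjust a jump offset after inserting `delta` insns at `pos`.

    `curr` is the index of the jump instruction itself. Raises
    OverflowError where the kernel would return -ERANGE.
    """
    if curr < pos and curr + off + 1 > pos:
        off += delta        # forward jump across the patched region
    elif curr > pos + delta and curr + off + 1 <= pos + delta:
        off -= delta        # backward jump across the patched region
    if off < S16_MIN or off > S16_MAX:
        raise OverflowError("jump target no longer fits in s16 off")
    return off
```

This also shows why the extra probe pass is needed: the overflow can only be detected once the delta is known, so the kernel probes all offsets on the original program first and fails early, before reallocating and patching anything.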
[bpf-next PATCH] bpf: sockmap, on update propagate errors back to userspace
When an error happens in the update sockmap element logic also pass
the err up to the user.

Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work with hashmap")
Signed-off-by: John Fastabend
---
 kernel/bpf/sockmap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 79f5e89..c6de139 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1875,7 +1875,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		write_unlock_bh(>sk_callback_lock);
 	}
 out:
-	return 0;
+	return err;
 }
 
 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
[PATCH net-next 5/8] tcp: new helper tcp_timeout_mark_lost
Refactor using a new helper, tcp_timeout_mark_lost(), that marks packets
lost upon RTO.

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Reviewed-by: Eric Dumazet
Reviewed-by: Soheil Hassas Yeganeh
Reviewed-by: Priyaranjan Jha
---
 net/ipv4/tcp_input.c | 50 +---
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6fb0a28977a0..af32accda2a9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1917,18 +1917,43 @@ static inline void tcp_init_undo(struct tcp_sock *tp)
 	tp->undo_retrans = tp->retrans_out ? : -1;
 }
 
-/* Enter Loss state. If we detect SACK reneging, forget all SACK information
+/* If we detect SACK reneging, forget all SACK information
  * and reset tags completely, otherwise preserve SACKs. If receiver
  * dropped its ofo queue, we will know this due to reneging detection.
  */
+static void tcp_timeout_mark_lost(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb;
+	bool is_reneg;			/* is receiver reneging on SACKs? */
+
+	skb = tcp_rtx_queue_head(sk);
+	is_reneg = skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED);
+	if (is_reneg) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
+		tp->sacked_out = 0;
+		/* Mark SACK reneging until we recover from this loss event. */
+		tp->is_sack_reneg = 1;
+	} else if (tcp_is_reno(tp)) {
+		tcp_reset_reno_sack(tp);
+	}
+
+	skb_rbtree_walk_from(skb) {
+		if (is_reneg)
+			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
+		tcp_mark_skb_lost(sk, skb);
+	}
+	tcp_verify_left_out(tp);
+	tcp_clear_all_retrans_hints(tp);
+}
+
+/* Enter Loss state. */
 void tcp_enter_loss(struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct net *net = sock_net(sk);
-	struct sk_buff *skb;
 	bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
-	bool is_reneg;			/* is receiver reneging on SACKs? */
 
 	/* Reduce ssthresh if it has not yet been made inside this window. */
 	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
@@ -1944,24 +1969,7 @@ void tcp_enter_loss(struct sock *sk)
 	tp->snd_cwnd_cnt   = 0;
 	tp->snd_cwnd_stamp = tcp_jiffies32;
 
-	if (tcp_is_reno(tp))
-		tcp_reset_reno_sack(tp);
-
-	skb = tcp_rtx_queue_head(sk);
-	is_reneg = skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED);
-	if (is_reneg) {
-		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
-		tp->sacked_out = 0;
-		/* Mark SACK reneging until we recover from this loss event. */
-		tp->is_sack_reneg = 1;
-	}
-	skb_rbtree_walk_from(skb) {
-		if (is_reneg)
-			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
-		tcp_mark_skb_lost(sk, skb);
-	}
-	tcp_verify_left_out(tp);
-	tcp_clear_all_retrans_hints(tp);
+	tcp_timeout_mark_lost(sk);
 
 	/* Timeout in disordered state after receiving substantial DUPACKs
 	 * suggests that the degree of reordering is over-estimated.
-- 
2.17.0.441.gb46fe60e1d-goog
[PATCH net-next 1/8] tcp: support DUPACK threshold in RACK
This patch adds support for the classic DUPACK threshold rule
(#DupThresh) in RACK. When the number of packets SACKed is greater than
or equal to the threshold, RACK sets the reordering window to zero,
which would immediately mark all the unsacked packets below the highest
SACKed sequence lost. Since this approach is known to not work well with
reordering, RACK only uses it if no reordering has been observed.

The DUPACK threshold rule is a particularly useful extension to the fast
recoveries triggered by RACK reordering timer. For example, data-center
transfers where the RTT is much smaller than a timer tick, or high RTT
paths where the default RTT/4 may take too long.

Note that this patch differs slightly from RFC 6675. RFC 6675 considers
a packet lost when at least #DupThresh higher-sequence packets are
SACKed. With RACK, for connections that have seen reordering, RACK
continues to use a dynamically-adaptive time-based reordering window to
detect losses. But for connections on which we have not yet seen
reordering, this patch considers a packet lost when at least one
higher-sequence packet is SACKed and the total number of SACKed packets
is at least DupThresh. For example, suppose a connection has not seen
reordering, and sends 10 packets, and packets 3, 5, 7 are SACKed.
RFC 6675 considers packets 1 and 2 lost. RACK considers packets 1, 2, 4,
6 lost.

There is some small risk of spurious retransmits here due to reordering.
However, this is mostly limited to the first flight of a connection on
which the sender receives SACKs from reordering. And RFC 6675 and FACK
loss detection have a similar risk on the first flight with reordering
(it's just that the risk of spurious retransmits from reordering was
slightly narrower for those older algorithms due to the margin of
3*MSS).

Also the minimum reordering window is reduced from 1 msec to 0 to
recover quicker on short RTT transfers. Therefore RACK is more
aggressive in marking packets lost during recovery to reduce the
reordering window timeouts.

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Reviewed-by: Eric Dumazet
Reviewed-by: Soheil Hassas Yeganeh
Reviewed-by: Priyaranjan Jha
---
 Documentation/networking/ip-sysctl.txt |  1 +
 include/net/tcp.h                      |  1 +
 net/ipv4/tcp_recovery.c                | 40 +-
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 59afc9a10b4f..13bbac50dc8b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -451,6 +451,7 @@ tcp_recovery - INTEGER
 	RACK: 0x1 enables the RACK loss detection for fast detection of lost
 	      retransmissions and tail drops.
 	RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
+	RACK: 0x4 disables RACK's DUPACK threshold heuristic
 
 	Default: 0x1
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3b1d617b0110..85000c85ddcd 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -245,6 +245,7 @@ extern long sysctl_tcp_mem[3];
 
 #define TCP_RACK_LOSS_DETECTION  0x1 /* Use RACK to detect losses */
 #define TCP_RACK_STATIC_REO_WND  0x2 /* Use static RACK reo wnd */
+#define TCP_RACK_NO_DUPTHRESH    0x4 /* Do not use DUPACK threshold in RACK */
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 3a81720ac0c4..1c1bdf12a96f 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -21,6 +21,32 @@ static bool tcp_rack_sent_after(u64 t1, u64 t2, u32 seq1, u32 seq2)
 	return t1 > t2 || (t1 == t2 && after(seq1, seq2));
 }
 
+u32 tcp_rack_reo_wnd(const struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (!tp->rack.reord) {
+		/* If reordering has not been observed, be aggressive during
+		 * the recovery or starting the recovery by DUPACK threshold.
+		 */
+		if (inet_csk(sk)->icsk_ca_state >= TCP_CA_Recovery)
+			return 0;
+
+		if (tp->sacked_out >= tp->reordering &&
+		    !(sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_NO_DUPTHRESH))
+			return 0;
+	}
+
+	/* To be more reordering resilient, allow min_rtt/4 settling delay.
+	 * Use min_rtt instead of the smoothed RTT because reordering is
+	 * often a path property and less related to queuing or delayed ACKs.
+	 * Upon receiving DSACKs, linearly increase the window up to the
+	 * smoothed RTT.
+	 */
+	return min((tcp_min_rtt(tp) >> 2) * tp->rack.reo_wnd_steps,
+		   tp->srtt_us >> 3);
+}
+
 /* RACK loss detection (IETF draft draft-ietf-tcpm-rack-01):
  *
  * Marks a packet
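The 10-packet example from the changelog can be reproduced with a small model of the two marking rules. This is a toy illustration of the changelog's comparison, not the kernel code; the function names are invented for the example.

```python
def rfc6675_lost(sent, sacked, dupthresh=3):
    """RFC 6675 rule: a packet is lost once at least `dupthresh`
    higher-sequence packets have been SACKed."""
    return [p for p in sent
            if p not in sacked
            and sum(1 for s in sacked if s > p) >= dupthresh]

def rack_dupthresh_lost(sent, sacked, dupthresh=3, reord_seen=False):
    """RACK DUPACK-threshold extension: once the total number of SACKed
    packets reaches `dupthresh` (and no reordering has been observed),
    every un-SACKed packet below the highest SACKed sequence is lost.
    Otherwise fall back to the time-based reordering window (modelled
    here as marking nothing)."""
    if reord_seen or len(sacked) < dupthresh:
        return []
    highest = max(sacked)
    return [p for p in sent if p not in sacked and p < highest]
```

With packets 1..10 sent and 3, 5, 7 SACKed, the RFC 6675 rule marks only 1 and 2 (each has three higher SACKed packets), while the RACK rule also sweeps up 4 and 6 — exactly the difference the changelog calls out, and the source of the slightly higher spurious-retransmit risk under reordering.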
[PATCH net-next 4/8] tcp: account lost retransmit after timeout
The previous approach for the lost and retransmit bits was to wipe the
slate clean: zero all the lost and retransmit bits, correspondingly zero
the lost_out and retrans_out counters, and then add back the lost bits
(and correspondingly increment lost_out).

The new approach is to treat this very much like marking packets lost in
fast recovery. We don't wipe the slate clean. We just say that for all
packets that were not yet marked sacked or lost, we now mark them as
lost in exactly the same way we do for fast recovery.

This fixes the lost retransmit accounting at RTO time and greatly
simplifies the RTO code by sharing much of the logic with Fast Recovery.

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Reviewed-by: Eric Dumazet
Reviewed-by: Soheil Hassas Yeganeh
Reviewed-by: Priyaranjan Jha
---
 include/net/tcp.h       |  1 +
 net/ipv4/tcp_input.c    | 18 +++---
 net/ipv4/tcp_recovery.c |  4 ++--
 3 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d7f81325bee5..402484ed9b57 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1878,6 +1878,7 @@ void tcp_v4_init(void);
 void tcp_init(void);
 
 /* tcp_recovery.c */
+void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb);
 void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced);
 extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 076206873e3e..6fb0a28977a0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1929,7 +1929,6 @@ void tcp_enter_loss(struct sock *sk)
 	struct sk_buff *skb;
 	bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
 	bool is_reneg;			/* is receiver reneging on SACKs? */
-	bool mark_lost;
 
 	/* Reduce ssthresh if it has not yet been made inside this window. */
 	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
@@ -1945,9 +1944,6 @@ void tcp_enter_loss(struct sock *sk)
 	tp->snd_cwnd_cnt   = 0;
 	tp->snd_cwnd_stamp = tcp_jiffies32;
 
-	tp->retrans_out = 0;
-	tp->lost_out = 0;
-
 	if (tcp_is_reno(tp))
 		tcp_reset_reno_sack(tp);
 
@@ -1959,21 +1955,13 @@ void tcp_enter_loss(struct sock *sk)
 		/* Mark SACK reneging until we recover from this loss event. */
 		tp->is_sack_reneg = 1;
 	}
-	tcp_clear_all_retrans_hints(tp);
 
 	skb_rbtree_walk_from(skb) {
-		mark_lost = (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) ||
-			     is_reneg);
-		if (mark_lost)
-			tcp_sum_lost(tp, skb);
-		TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS)|TCPCB_SACKED_ACKED;
-		if (mark_lost) {
+		if (is_reneg)
 			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
-			TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
-			tp->lost_out += tcp_skb_pcount(skb);
-		}
+		tcp_mark_skb_lost(sk, skb);
 	}
 	tcp_verify_left_out(tp);
+	tcp_clear_all_retrans_hints(tp);
 
 	/* Timeout in disordered state after receiving substantial DUPACKs
 	 * suggests that the degree of reordering is over-estimated.
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 299b0e38aa9a..b2f9be388bf3 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -2,7 +2,7 @@
 #include <linux/tcp.h>
 #include <net/tcp.h>
 
-static void tcp_rack_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
+void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
@@ -95,7 +95,7 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 		remaining = tp->rack.rtt_us + reo_wnd -
 			    tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
 		if (remaining <= 0) {
-			tcp_rack_mark_skb_lost(sk, skb);
+			tcp_mark_skb_lost(sk, skb);
 			list_del_init(>tcp_tsorted_anchor);
 		} else {
 			/* Record maximum wait time */
-- 
2.17.0.441.gb46fe60e1d-goog
[PATCH net-next 8/8] tcp: don't mark recently sent packets lost on RTO
An RTO event indicates the head has not been acked for a long time after its last (re)transmission. But the other packets are not necessarily lost if they have been only sent recently (for example due to application limit). This patch would prohibit marking packets sent within an RTT to be lost on RTO event, using similar logic in TCP RACK detection. Normally the head (SND.UNA) would be marked lost since RTO should fire strictly after the head was sent. An exception is when the most recent RACK RTT measurement is larger than the (previous) RTO. To address this exception the head is always marked lost. Congestion control interaction: since we may not mark every packet lost, the congestion window may be more than 1 (inflight plus 1). But only one packet will be retransmitted after RTO, since tcp_retransmit_timer() calls tcp_retransmit_skb(...,segs=1). The connection still performs slow start from one packet (with Cubic congestion control). This commit was tested in an A/B test with Google web servers, and showed a reduction of 2% in (spurious) retransmits post timeout (SlowStartRetrans), and correspondingly reduced DSACKs (DSACKIgnoredOld) by 7%. Signed-off-by: Yuchung ChengSigned-off-by: Neal Cardwell Reviewed-by: Eric Dumazet Reviewed-by: Soheil Hassas Yeganeh Reviewed-by: Priyaranjan Jha --- net/ipv4/tcp_input.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index ba8a8e3464aa..0bf032839548 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1929,11 +1929,11 @@ static bool tcp_is_rack(const struct sock *sk) static void tcp_timeout_mark_lost(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); - struct sk_buff *skb; + struct sk_buff *skb, *head; bool is_reneg; /* is receiver reneging on SACKs? 
*/ - skb = tcp_rtx_queue_head(sk); - is_reneg = skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED); + head = tcp_rtx_queue_head(sk); + is_reneg = head && (TCP_SKB_CB(head)->sacked & TCPCB_SACKED_ACKED); if (is_reneg) { NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSACKRENEGING); tp->sacked_out = 0; @@ -1943,9 +1943,13 @@ static void tcp_timeout_mark_lost(struct sock *sk) tcp_reset_reno_sack(tp); } + skb = head; skb_rbtree_walk_from(skb) { if (is_reneg) TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED; + else if (tcp_is_rack(sk) && skb != head && +tcp_rack_skb_timeout(tp, skb, 0) > 0) + continue; /* Don't mark recently sent ones lost yet */ tcp_mark_skb_lost(sk, skb); } tcp_verify_left_out(tp); @@ -1972,7 +1976,7 @@ void tcp_enter_loss(struct sock *sk) tcp_ca_event(sk, CA_EVENT_LOSS); tcp_init_undo(tp); } - tp->snd_cwnd = 1; + tp->snd_cwnd = tcp_packets_in_flight(tp) + 1; tp->snd_cwnd_cnt = 0; tp->snd_cwnd_stamp = tcp_jiffies32; -- 2.17.0.441.gb46fe60e1d-goog
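[Editorial note: a minimal model of the sparing rule may help. This is an illustrative sketch, not kernel code: it assumes microsecond timestamps and collapses tcp_rack_skb_timeout() plus the head-always-marked exception into two small functions.]

```c
#include <assert.h>

typedef long long s64;

/* remaining = rack_rtt + reo_wnd - (now - xmit_time); positive means the
 * packet was (re)sent too recently to be declared lost yet. */
static s64 rack_skb_timeout(s64 rack_rtt_us, s64 reo_wnd_us,
			    s64 now_us, s64 xmit_time_us)
{
	return rack_rtt_us + reo_wnd_us - (now_us - xmit_time_us);
}

static int should_mark_lost(int is_head, s64 timeout_remaining)
{
	/* the head is always marked: RTO fired strictly after it was sent
	 * (modulo the RACK-RTT-larger-than-RTO exception noted above) */
	return is_head || timeout_remaining <= 0;
}
```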
[PATCH net-next 7/8] tcp: new helper tcp_rack_skb_timeout
Create and export a new helper tcp_rack_skb_timeout and move tcp_is_rack to prepare the final RTO change. Signed-off-by: Yuchung ChengSigned-off-by: Neal Cardwell Reviewed-by: Eric Dumazet Reviewed-by: Soheil Hassas Yeganeh Reviewed-by: Priyaranjan Jha --- include/net/tcp.h | 2 ++ net/ipv4/tcp_input.c| 10 +- net/ipv4/tcp_recovery.c | 9 +++-- 3 files changed, 14 insertions(+), 7 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 402484ed9b57..b46d0f9adbdb 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1880,6 +1880,8 @@ void tcp_init(void); /* tcp_recovery.c */ void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb); void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced); +extern s32 tcp_rack_skb_timeout(struct tcp_sock *tp, struct sk_buff *skb, + u32 reo_wnd); extern void tcp_rack_mark_lost(struct sock *sk); extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq, u64 xmit_time); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 1ccc97b368c7..ba8a8e3464aa 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1917,6 +1917,11 @@ static inline void tcp_init_undo(struct tcp_sock *tp) tp->undo_retrans = tp->retrans_out ? : -1; } +static bool tcp_is_rack(const struct sock *sk) +{ + return sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION; +} + /* If we detect SACK reneging, forget all SACK information * and reset tags completely, otherwise preserve SACKs. If receiver * dropped its ofo queue, we will know this due to reneging detection. @@ -2031,11 +2036,6 @@ static inline int tcp_dupack_heuristics(const struct tcp_sock *tp) return tp->sacked_out + 1; } -static bool tcp_is_rack(const struct sock *sk) -{ - return sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION; -} - /* Linux NewReno/SACK/ECN state machine. 
* -- * diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c index b2f9be388bf3..30cbfb69b1de 100644 --- a/net/ipv4/tcp_recovery.c +++ b/net/ipv4/tcp_recovery.c @@ -47,6 +47,12 @@ u32 tcp_rack_reo_wnd(const struct sock *sk) tp->srtt_us >> 3); } +s32 tcp_rack_skb_timeout(struct tcp_sock *tp, struct sk_buff *skb, u32 reo_wnd) +{ + return tp->rack.rtt_us + reo_wnd - + tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp); +} + /* RACK loss detection (IETF draft draft-ietf-tcpm-rack-01): * * Marks a packet lost, if some packet sent later has been (s)acked. @@ -92,8 +98,7 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout) /* A packet is lost if it has not been s/acked beyond * the recent RTT plus the reordering window. */ - remaining = tp->rack.rtt_us + reo_wnd - - tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp); + remaining = tcp_rack_skb_timeout(tp, skb, reo_wnd); if (remaining <= 0) { tcp_mark_skb_lost(sk, skb); list_del_init(>tcp_tsorted_anchor); -- 2.17.0.441.gb46fe60e1d-goog
[PATCH net-next 6/8] tcp: separate loss marking and state update on RTO
Previously when TCP times out, it first updates cwnd and ssthresh, marks packets lost, and then updates the congestion state again. This was fine because everything not yet delivered is marked lost, so the inflight is always 0 and cwnd can safely be set to 1 to retransmit one packet on timeout.

But the inflight may not always be 0 on timeout if TCP changes to mark packets lost based on packet sent time. Therefore we must first mark the packets lost, then set the cwnd based on the (updated) inflight.

This is not a pure refactor. Congestion control could potentially break if it used the (not yet updated) inflight to compute ssthresh. Fortunately, none of the existing congestion control modules does that. The change also alters the inflight observed when CA_EVENT_LOSS is fired; only Westwood processes that event, and it does not use inflight.

This change has two other minor side benefits:
1) it is consistent with Fast Recovery, in that the inflight is updated before tcp_enter_recovery flips the state to CA_Recovery;
2) it avoids intertwining loss marking with the state update, making the code more readable.

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Reviewed-by: Eric Dumazet
Reviewed-by: Soheil Hassas Yeganeh
Reviewed-by: Priyaranjan Jha
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index af32accda2a9..1ccc97b368c7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1955,6 +1955,8 @@ void tcp_enter_loss(struct sock *sk)
 	struct net *net = sock_net(sk);
 	bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;

+	tcp_timeout_mark_lost(sk);
+
 	/* Reduce ssthresh if it has not yet been made inside this window.
*/ if (icsk->icsk_ca_state <= TCP_CA_Disorder || !after(tp->high_seq, tp->snd_una) || @@ -1969,8 +1971,6 @@ void tcp_enter_loss(struct sock *sk) tp->snd_cwnd_cnt = 0; tp->snd_cwnd_stamp = tcp_jiffies32; - tcp_timeout_mark_lost(sk); - /* Timeout in disordered state after receiving substantial DUPACKs * suggests that the degree of reordering is over-estimated. */ -- 2.17.0.441.gb46fe60e1d-goog
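[Editorial note: the interaction described above hinges on the kernel's in-flight formula, packets_out - (sacked_out + lost_out) + retrans_out (see tcp_packets_in_flight()). A standalone sketch with illustrative numbers:]

```c
#include <assert.h>

/* Left-out is sacked_out + lost_out; in-flight adds back retransmits.
 * Marking losses BEFORE touching cwnd means the post-RTO cwnd can be
 * derived from an up-to-date scoreboard. */
static unsigned int packets_in_flight(unsigned int packets_out,
				      unsigned int sacked_out,
				      unsigned int lost_out,
				      unsigned int retrans_out)
{
	return packets_out - (sacked_out + lost_out) + retrans_out;
}
```

With losses marked first, a timeout that marks 4 of 10 outstanding packets lost yields inflight 6, so the final patch in this series can set cwnd = inflight + 1 = 7 instead of the old hard-coded 1.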
[PATCH net-next 2/8] tcp: disable RFC6675 loss detection
This patch disables RFC6675 loss detection and makes sysctl
net.ipv4.tcp_recovery = 1 control a binary choice between RACK (1) and
RFC6675 (0).

Signed-off-by: Yuchung Cheng
Signed-off-by: Neal Cardwell
Reviewed-by: Eric Dumazet
Reviewed-by: Soheil Hassas Yeganeh
Reviewed-by: Priyaranjan Jha
---
 Documentation/networking/ip-sysctl.txt |  3 ++-
 net/ipv4/tcp_input.c                   | 12 
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 13bbac50dc8b..ea304a23c8d7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -449,7 +449,8 @@ tcp_recovery - INTEGER
 	features.

 	RACK: 0x1 enables the RACK loss detection for fast detection of lost
-	retransmissions and tail drops.
+	retransmissions and tail drops. It also subsumes and disables
+	RFC6675 recovery for SACK connections.
 	RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
 	RACK: 0x4 disables RACK's DUPACK threshold heuristic

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b188e0d75edd..ccbe04f80040 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2035,6 +2035,11 @@ static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
 	return tp->sacked_out + 1;
 }

+static bool tcp_is_rack(const struct sock *sk)
+{
+	return sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION;
+}
+
 /* Linux NewReno/SACK/ECN state machine.
  * --
  *
@@ -2141,7 +2146,7 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
 		return true;

 	/* Not-A-Trick#2 : Classic rule...
 	 */
-	if (tcp_dupack_heuristics(tp) > tp->reordering)
+	if (!tcp_is_rack(sk) && tcp_dupack_heuristics(tp) > tp->reordering)
 		return true;

 	return false;
@@ -2722,8 +2727,7 @@ static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);

-	/* Use RACK to detect loss */
-	if (sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION) {
+	if (tcp_is_rack(sk)) {
 		u32 prior_retrans = tp->retrans_out;

 		tcp_rack_mark_lost(sk);
@@ -2862,7 +2866,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 			fast_rexmit = 1;
 		}

-	if (do_lost)
+	if (!tcp_is_rack(sk) && do_lost)
 		tcp_update_scoreboard(sk, fast_rexmit);
 	*rexmit = REXMIT_LOST;
 }
-- 
2.17.0.441.gb46fe60e1d-goog
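[Editorial note: the sysctl is a bitmask. The decoding below is a standalone sketch based on the ip-sysctl.txt text quoted in the patch; the macro names match the kernel's, but treat the snippet as illustrative rather than authoritative.]

```c
#include <assert.h>

/* Bit layout of net.ipv4.tcp_recovery per the documentation above. */
#define TCP_RACK_LOSS_DETECTION	0x1 /* RACK subsumes RFC6675 for SACK flows */
#define TCP_RACK_STATIC_REO_WND	0x2 /* static reordering window (min_rtt/4) */
#define TCP_RACK_NO_DUPTHRESH	0x4 /* disable the DUPACK threshold heuristic */

static int is_rack(unsigned int sysctl_tcp_recovery)
{
	return (sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION) != 0;
}
```

So tcp_recovery=1 (the default after this series) selects RACK, and tcp_recovery=0 falls back to RFC6675; the other bits tune RACK without affecting this binary choice.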
[PATCH net-next 3/8] tcp: simpler NewReno implementation
This is a rewrite of NewReno loss recovery implementation that is simpler and standalone for readability and better performance by using less states. Note that NewReno refers to RFC6582 as a modification to the fast recovery algorithm. It is used only if the connection does not support SACK in Linux. It should not to be confused with the Reno (AIMD) congestion control. Signed-off-by: Yuchung ChengSigned-off-by: Neal Cardwell Reviewed-by: Eric Dumazet Reviewed-by: Soheil Hassas Yeganeh Reviewed-by: Priyaranjan Jha --- include/net/tcp.h | 1 + net/ipv4/tcp_input.c| 19 +++ net/ipv4/tcp_recovery.c | 27 +++ 3 files changed, 39 insertions(+), 8 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 85000c85ddcd..d7f81325bee5 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1878,6 +1878,7 @@ void tcp_v4_init(void); void tcp_init(void); /* tcp_recovery.c */ +void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced); extern void tcp_rack_mark_lost(struct sock *sk); extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq, u64 xmit_time); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index ccbe04f80040..076206873e3e 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2223,9 +2223,7 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit) { struct tcp_sock *tp = tcp_sk(sk); - if (tcp_is_reno(tp)) { - tcp_mark_head_lost(sk, 1, 1); - } else { + if (tcp_is_sack(tp)) { int sacked_upto = tp->sacked_out - tp->reordering; if (sacked_upto >= 0) tcp_mark_head_lost(sk, sacked_upto, 0); @@ -2723,11 +2721,16 @@ static bool tcp_try_undo_partial(struct sock *sk, u32 prior_snd_una) return false; } -static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag) +static void tcp_identify_packet_loss(struct sock *sk, int *ack_flag) { struct tcp_sock *tp = tcp_sk(sk); - if (tcp_is_rack(sk)) { + if (tcp_rtx_queue_empty(sk)) + return; + + if (unlikely(tcp_is_reno(tp))) { + tcp_newreno_mark_lost(sk, 
*ack_flag & FLAG_SND_UNA_ADVANCED); + } else if (tcp_is_rack(sk)) { u32 prior_retrans = tp->retrans_out; tcp_rack_mark_lost(sk); @@ -2823,11 +2826,11 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una, tcp_try_keep_open(sk); return; } - tcp_rack_identify_loss(sk, ack_flag); + tcp_identify_packet_loss(sk, ack_flag); break; case TCP_CA_Loss: tcp_process_loss(sk, flag, is_dupack, rexmit); - tcp_rack_identify_loss(sk, ack_flag); + tcp_identify_packet_loss(sk, ack_flag); if (!(icsk->icsk_ca_state == TCP_CA_Open || (*ack_flag & FLAG_LOST_RETRANS))) return; @@ -2844,7 +2847,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una, if (icsk->icsk_ca_state <= TCP_CA_Disorder) tcp_try_undo_dsack(sk); - tcp_rack_identify_loss(sk, ack_flag); + tcp_identify_packet_loss(sk, ack_flag); if (!tcp_time_to_recover(sk, flag)) { tcp_try_to_open(sk, flag); return; diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c index 1c1bdf12a96f..299b0e38aa9a 100644 --- a/net/ipv4/tcp_recovery.c +++ b/net/ipv4/tcp_recovery.c @@ -216,3 +216,30 @@ void tcp_rack_update_reo_wnd(struct sock *sk, struct rate_sample *rs) tp->rack.reo_wnd_steps = 1; } } + +/* RFC6582 NewReno recovery for non-SACK connection. It simply retransmits + * the next unacked packet upon receiving + * a) three or more DUPACKs to start the fast recovery + * b) an ACK acknowledging new data during the fast recovery. 
+ */ +void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced) +{ + const u8 state = inet_csk(sk)->icsk_ca_state; + struct tcp_sock *tp = tcp_sk(sk); + + if ((state < TCP_CA_Recovery && tp->sacked_out >= tp->reordering) || + (state == TCP_CA_Recovery && snd_una_advanced)) { + struct sk_buff *skb = tcp_rtx_queue_head(sk); + u32 mss; + + if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST) + return; + + mss = tcp_skb_mss(skb); + if (tcp_skb_pcount(skb) > 1 && skb->len > mss) + tcp_fragment(sk, TCP_FRAG_IN_RTX_QUEUE, skb, +mss, mss, GFP_ATOMIC); + + tcp_skb_mark_lost_uncond_verify(tp,
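[Editorial note: the two trigger conditions in tcp_newreno_mark_lost() can be modeled compactly. Illustrative sketch, not kernel code; the state ordering mirrors the kernel's tcp_ca_state enum (Open < Disorder < CWR < Recovery < Loss).]

```c
#include <assert.h>

enum ca_state { CA_OPEN, CA_DISORDER, CA_CWR, CA_RECOVERY, CA_LOSS };

/* Mark the head lost when either:
 * (a) enough DUPACKs arrived to start fast recovery, or
 * (b) we are already in fast recovery and an ACK advanced SND.UNA. */
static int newreno_should_mark(enum ca_state state, unsigned int sacked_out,
			       unsigned int reordering, int snd_una_advanced)
{
	return (state < CA_RECOVERY && sacked_out >= reordering) ||
	       (state == CA_RECOVERY && snd_una_advanced);
}
```

Note that nothing fires in CA_Loss: after an RTO the timeout path, not NewReno, owns loss marking.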
[PATCH net-next 0/8] tcp: default RACK loss recovery
This patch set implements the features corresponding to the draft-ietf-tcpm-rack-03 version of the RACK draft.
https://datatracker.ietf.org/meeting/101/materials/slides-101-tcpm-update-on-tcp-rack-00

1. SACK: implement an equivalent DUPACK threshold heuristic in RACK to replace the existing RFC6675 recovery (tcp_mark_head_lost).
2. Non-SACK: simplify the RFC6582 NewReno implementation.
3. RTO: apply RACK's time-based approach to avoid spuriously marking very recently sent packets lost.
4. With (1)(2)(3), make RACK the exclusive fast recovery mechanism to mark losses based on time on S/ACK.

Tail loss probe and F-RTO remain enabled by default as complementary mechanisms to send probes in the CA_Open and CA_Loss states. The probes solicit S/ACKs to trigger RACK time-based loss detection.

All Google web and internal servers have been running RACK-only mode (4) for a while now. A/B experiments indicate RACK/TLP on average reduces recovery latency by 10% compared to RFC6675. RFC6675 is now off by default but can be re-enabled by disabling RACK (sysctl net.ipv4.tcp_recovery=0) should unforeseen issues arise.

Yuchung Cheng (8):
  tcp: support DUPACK threshold in RACK
  tcp: disable RFC6675 loss detection
  tcp: simpler NewReno implementation
  tcp: account lost retransmit after timeout
  tcp: new helper tcp_timeout_mark_lost
  tcp: separate loss marking and state update on RTO
  tcp: new helper tcp_rack_skb_timeout
  tcp: don't mark recently sent packets lost on RTO

 Documentation/networking/ip-sysctl.txt |  4 +-
 include/net/tcp.h                      |  5 ++
 net/ipv4/tcp_input.c                   | 99 ++
 net/ipv4/tcp_recovery.c                | 80 -
 4 files changed, 124 insertions(+), 64 deletions(-)

-- 
2.17.0.441.gb46fe60e1d-goog
Re: [PATCH bpf-next] libbpf: add ifindex to enable offload support
On 05/16/2018 11:02 PM, Jakub Kicinski wrote: > From: David Beckett> > BPF programs currently can only be offloaded using iproute2. This > patch will allow programs to be offloaded using libbpf calls. > > Signed-off-by: David Beckett > Reviewed-by: Jakub Kicinski Applied to bpf-next, thanks guys!
Re: [PATCH] bpf: add __printf verification to bpf_verifier_vlog
On 05/16/2018 10:27 PM, Mathieu Malaterre wrote: > __printf is useful to verify format and arguments. ‘bpf_verifier_vlog’ > function is used twice in verifier.c in both cases the caller function > already uses the __printf gcc attribute. > > Remove the following warning, triggered with W=1: > > kernel/bpf/verifier.c:176:2: warning: function might be possible candidate > for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format] > > Signed-off-by: Mathieu MalaterreLooks good, applied to bpf-next, thanks Mathieu!
Re: [PATCH bpf-next] bpf: fix sock hashmap kmalloc warning
On 05/16/2018 11:06 PM, Yonghong Song wrote: > syzbot reported a kernel warning below: > WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 > mm/slab_common.c:996 > Kernel panic - not syncing: panic_on_warn set ... > > CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: >__dump_stack lib/dump_stack.c:77 [inline] >dump_stack+0x1b9/0x294 lib/dump_stack.c:113 >panic+0x22f/0x4de kernel/panic.c:184 >__warn.cold.8+0x163/0x1b3 kernel/panic.c:536 >report_bug+0x252/0x2d0 lib/bug.c:186 >fixup_bug arch/x86/kernel/traps.c:178 [inline] >do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296 >do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315 >invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992 > RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996 > RSP: 0018:8801d907fc58 EFLAGS: 00010246 > RAX: RBX: 8801aeecb280 RCX: 8185ebd7 > RDX: RSI: RDI: ffe1 > RBP: 8801d907fc58 R08: 8801adb5e1c0 R09: ed0035a84700 > R10: ed0035a84700 R11: 8801ad423803 R12: 8801aeecb280 > R13: fff4 R14: 8801ad891a00 R15: 014200c0 >__do_kmalloc mm/slab.c:3713 [inline] >__kmalloc+0x25/0x760 mm/slab.c:3727 >kmalloc include/linux/slab.h:517 [inline] >map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858 >__do_sys_bpf kernel/bpf/syscall.c:2131 [inline] >__se_sys_bpf kernel/bpf/syscall.c:2096 [inline] >__x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096 >do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 >entry_SYSCALL_64_after_hwframe+0x49/0xbe > > The test case is against sock hashmap with a key size 0xffe1. > Such a large key size will cause the below code in function > sock_hash_alloc() overflowing and produces a smaller elem_size, > hence map creation will be successful. 
> htab->elem_size = sizeof(struct htab_elem) + > round_up(htab->map.key_size, 8); > > Later, when map_get_next_key is called and kernel tries > to allocate the key unsuccessfully, it will issue > the above warning. > > Similar to hashtab, ensure the key size is at most > MAX_BPF_STACK for a successful map creation. > > Fixes: 81110384441a ("bpf: sockmap, add hash map support") > Reported-by: syzbot+e4566d29080e7f346...@syzkaller.appspotmail.com > Signed-off-by: Yonghong SongApplied to bpf-next, thanks Yonghong!
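[Editorial note: the overflow mechanism can be demonstrated with plain arithmetic. This is purely illustrative — the 48-byte header constant and the 16-bit container are stand-ins, not the actual kernel types — but it shows how rounding a ~64KB key size can wrap into a small, "valid-looking" element size, so map creation succeeds and the failure only surfaces later in kmalloc.]

```c
#include <assert.h>
#include <stdint.h>

/* round_up to a power-of-two boundary, as the kernel macro does */
#define ROUND_UP(x, a) (((x) + (a) - 1) & ~((uint32_t)(a) - 1))

/* Hypothetical narrow container: 48 stands in for sizeof(struct htab_elem).
 * 48 + round_up(0xffe1, 8) = 65560, which wraps past 65535 to 24. */
static uint16_t truncated_elem_size(uint32_t key_size)
{
	return (uint16_t)(48 + ROUND_UP(key_size, 8));
}
```

Bounding key_size by MAX_BPF_STACK (512), as the fix does, keeps the sum far below any wrap-around point.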
Re: kernel BUG at lib/string.c:LINE! (4)
Hello, On Wed, 16 May 2018, syzbot wrote: > Hello, > > syzbot found the following crash on: > > HEAD commit:0b7d9978406f Merge branch 'Microsemi-Ocelot-Ethernet-switc.. > git tree: net-next > console output: https://syzkaller.appspot.com/x/log.txt?x=16e9101780 > kernel config: https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1 > dashboard link: https://syzkaller.appspot.com/bug?extid=aac887f77319868646df > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=1665d63780 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1051710780 > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+aac887f7731986864...@syzkaller.appspotmail.com > > IPVS: Unknown mcast interface: veth1_to???a > IPVS: Unknown mcast interface: veth1_to???a > IPVS: Unknown mcast interface: veth1_to???a > detected buffer overflow in strlen > [ cut here ] > kernel BUG at lib/string.c:1052! > invalid opcode: [#1] SMP KASAN > Dumping ftrace buffer: > (ftrace buffer empty) > Modules linked in: > CPU: 1 PID: 373 Comm: syz-executor936 Not tainted 4.17.0-rc4+ #45 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google > 01/01/2011 > RIP: 0010:fortify_panic+0x13/0x20 lib/string.c:1051 > RSP: 0018:8801c976f800 EFLAGS: 00010282 > RAX: 0022 RBX: 0040 RCX: > RDX: 0022 RSI: 8160f6f1 RDI: ed00392edef6 > RBP: 8801c976f800 R08: 8801cf4c62c0 R09: ed003b5e4fb0 > R10: ed003b5e4fb0 R11: 8801daf27d87 R12: 8801c976fa20 > R13: 8801c976fae4 R14: 8801c976fae0 R15: 048b > FS: 7fd99f75e700() GS:8801daf0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 21c0 CR3: 0001d6843000 CR4: 001406e0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > strlen include/linux/string.h:270 [inline] > strlcpy include/linux/string.h:293 [inline] > do_ip_vs_set_ctl+0x31c/0x1d00 net/netfilter/ipvs/ip_vs_ctl.c:2388 > nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] > nf_setsockopt+0x7d/0xd0 
net/netfilter/nf_sockopt.c:115 > ip_setsockopt+0xd8/0xf0 net/ipv4/ip_sockglue.c:1253 > udp_setsockopt+0x62/0xa0 net/ipv4/udp.c:2487 > ipv6_setsockopt+0x149/0x170 net/ipv6/ipv6_sockglue.c:917 > tcp_setsockopt+0x93/0xe0 net/ipv4/tcp.c:3057 > sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3046 > __sys_setsockopt+0x1bd/0x390 net/socket.c:1903 > __do_sys_setsockopt net/socket.c:1914 [inline] > __se_sys_setsockopt net/socket.c:1911 [inline] > __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911 > do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > RIP: 0033:0x447369 > RSP: 002b:7fd99f75dda8 EFLAGS: 0246 ORIG_RAX: 0036 > RAX: ffda RBX: 006e39e4 RCX: 00447369 > RDX: 048b RSI: RDI: 0003 > RBP: R08: 0018 R09: > R10: 21c0 R11: 0246 R12: 006e39e0 > R13: 75a1ff93f0896195 R14: 6f745f3168746576 R15: 0001 > Code: 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 48 89 df e8 d2 8f 48 fa eb de > 55 48 89 fe 48 c7 c7 60 65 64 88 48 89 e5 e8 91 dd f3 f9 <0f> 0b 90 90 90 90 > 90 90 90 90 90 90 90 55 48 89 e5 41 57 41 56 > RIP: fortify_panic+0x13/0x20 lib/string.c:1051 RSP: 8801c976f800 > ---[ end trace 624046f2d9af7702 ]--- Just to let you know that I tested a patch with the syzbot, will do more tests before submitting... Regards -- Julian Anastasov
Re: [PATCH net-next] erspan: set bso bit based on mirrored packet's len
On Wed, May 16, 2018 at 07:05:34AM -0700, William Tu wrote: > On Mon, May 14, 2018 at 10:33 PM, Tobin C. Hardingwrote: > > On Mon, May 14, 2018 at 04:54:36PM -0700, William Tu wrote: > >> Before the patch, the erspan BSO bit (Bad/Short/Oversized) is not > >> handled. BSO has 4 possible values: > >> 00 --> Good frame with no error, or unknown integrity > >> 11 --> Payload is a Bad Frame with CRC or Alignment Error > >> 01 --> Payload is a Short Frame > >> 10 --> Payload is an Oversized Frame > >> > >> Based the short/oversized definitions in RFC1757, the patch sets > >> the bso bit based on the mirrored packet's size. > >> > >> Reported-by: Xiaoyan Jin > >> Signed-off-by: William Tu > >> --- > >> include/net/erspan.h | 25 + > >> 1 file changed, 25 insertions(+) > >> > >> diff --git a/include/net/erspan.h b/include/net/erspan.h > >> index d044aa60cc76..5eb95f78ad45 100644 > >> --- a/include/net/erspan.h > >> +++ b/include/net/erspan.h > >> @@ -219,6 +219,30 @@ static inline __be32 erspan_get_timestamp(void) > >> return htonl((u32)h_usecs); > >> } > >> > >> +/* ERSPAN BSO (Bad/Short/Oversized) > >> + * 00b --> Good frame with no error, or unknown integrity > >> + * 01b --> Payload is a Short Frame > >> + * 10b --> Payload is an Oversized Frame > >> + * 11b --> Payload is a Bad Frame with CRC or Alignment Error > >> + */ > >> +enum erspan_bso { > >> + BSO_NOERROR, > >> + BSO_SHORT, > >> + BSO_OVERSIZED, > >> + BSO_BAD, > >> +}; > > > > If we are relying on the values perhaps this would be clearer > > > > BSO_NOERROR = 0x00, > > BSO_SHORT = 0x01, > > BSO_OVERSIZED = 0x02, > > BSO_BAD = 0x03, > > > > Yes, thanks. I will change in v2. 
> > >> + > >> +static inline u8 erspan_detect_bso(struct sk_buff *skb) > >> +{ > >> + if (skb->len < ETH_ZLEN) > >> + return BSO_SHORT; > >> + > >> + if (skb->len > ETH_FRAME_LEN) > >> + return BSO_OVERSIZED; > >> + > >> + return BSO_NOERROR; > >> +} > > > > Without having much contextual knowledge around this patch; should we be > > doing some check on CRC or alignment (at some stage)? Having BSO_BAD > > seems to imply so? > > > > The definition of BSO_BAD: > etherStatsCRCAlignErrors OBJECT-TYPE > SYNTAX Counter > ACCESS read-only > STATUS mandatory > DESCRIPTION > "The total number of packets received that > had a length (excluding framing bits, but > including FCS octets) of between 64 and 1518 > octets, inclusive, but but had either a bad > Frame Check Sequence (FCS) with an integral > number of octets (FCS Error) or a bad FCS with > a non-integral number of octets (Alignment Error)." > > But I don't know how to check CRC error at this code point. > Isn't it done by the NIC hardware? I'll just start with; I don't know anything about ERSPAN "ERSPAN is a Cisco proprietary feature and is available only to Catalyst 6500, 7600, Nexus, and ASR 1000 platforms to date. The ASR 1000 supports ERSPAN source (monitoring) only on Fast Ethernet, Gigabit Ethernet, and port-channel interfaces." https://supportforums.cisco.com/t5/network-infrastructure-documents/understanding-span-rspan-and-erspan/ta-p/3144951 I dug around a bit and none of the files that currently import erspan.h actually use the 'bso' field $ grep bso $(git grep -l 'erspan\.h') include/net/erspan.h: u8 bso = 0; /* Bad/Short/Oversized */ include/net/erspan.h: ershdr->en = bso; net/ipv4/ip_gre.c: ICMP in the real Internet is absolutely infeasible. net/ipv4/ip_gre.c: * ICMP in the real Internet is absolutely infeasible. Normally, AFAICT, the FCS does not get passed to the operating system since its a link layer mechanism. 
If ERSPAN is passing the FCS when it mirrors frames (does it mirror frames or packets, I don't know?) then surely ERSPAN should provide a function to return the BSO value. So IMHO this patch seems like a just pretense and not really doing anything. Hope this helps, Tobin.
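[Editorial note: for completeness, the proposed helper with the v2 explicit enum values can be exercised standalone — skb->len replaced by a plain length, with 60 and 1514 being the standard ETH_ZLEN and ETH_FRAME_LEN values.]

```c
#include <assert.h>

enum erspan_bso {
	BSO_NOERROR   = 0x00,
	BSO_SHORT     = 0x01,
	BSO_OVERSIZED = 0x02,
	BSO_BAD       = 0x03, /* CRC/alignment error: not derivable from len */
};

static enum erspan_bso erspan_detect_bso(unsigned int len)
{
	if (len < 60)		/* ETH_ZLEN */
		return BSO_SHORT;
	if (len > 1514)		/* ETH_FRAME_LEN */
		return BSO_OVERSIZED;
	return BSO_NOERROR;
}
```

As Tobin notes, BSO_BAD (CRC/alignment) cannot be determined from the length alone, so this helper never returns it.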
[PATCH] net: ethernet: ti: cpsw: disable mq feature for "AM33xx ES1.0" devices
The early versions of am33xx devices, related to ES1.0 SoC revision have errata limiting mq support. That's the same errata as commit 7da1160002f1 ("drivers: net: cpsw: add am335x errata workarround for interrutps") AM33xx Errata [1] Advisory 1.0.9 http://www.ti.com/lit/er/sprz360f/sprz360f.pdf After additional investigation were found that drivers w/a is propagated on all AM33xx SoCs and on DM814x. But the errata exists only for ES1.0 of AM33xx family, limiting mq support for revisions after ES1.0. So, disable mq support only for related SoCs and use separate polls for revisions allowing mq. Signed-off-by: Ivan Khoronzhuk--- Based on net-next/master drivers/net/ethernet/ti/cpsw.c | 109 ++--- 1 file changed, 60 insertions(+), 49 deletions(-) diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c index 28d893b93d30..a7285dddfd29 100644 --- a/drivers/net/ethernet/ti/cpsw.c +++ b/drivers/net/ethernet/ti/cpsw.c @@ -36,6 +36,7 @@ #include #include #include +#include #include @@ -957,7 +958,7 @@ static irqreturn_t cpsw_rx_interrupt(int irq, void *dev_id) return IRQ_HANDLED; } -static int cpsw_tx_poll(struct napi_struct *napi_tx, int budget) +static int cpsw_tx_mq_poll(struct napi_struct *napi_tx, int budget) { u32 ch_map; int num_tx, cur_budget, ch; @@ -984,7 +985,21 @@ static int cpsw_tx_poll(struct napi_struct *napi_tx, int budget) if (num_tx < budget) { napi_complete(napi_tx); writel(0xff, >wr_regs->tx_en); - if (cpsw->quirk_irq && cpsw->tx_irq_disabled) { + } + + return num_tx; +} + +static int cpsw_tx_poll(struct napi_struct *napi_tx, int budget) +{ + struct cpsw_common *cpsw = napi_to_cpsw(napi_tx); + int num_tx; + + num_tx = cpdma_chan_process(cpsw->txv[0].ch, budget); + if (num_tx < budget) { + napi_complete(napi_tx); + writel(0xff, >wr_regs->tx_en); + if (cpsw->tx_irq_disabled) { cpsw->tx_irq_disabled = false; enable_irq(cpsw->irqs_table[1]); } @@ -993,7 +1008,7 @@ static int cpsw_tx_poll(struct napi_struct *napi_tx, int budget) return 
num_tx; } -static int cpsw_rx_poll(struct napi_struct *napi_rx, int budget) +static int cpsw_rx_mq_poll(struct napi_struct *napi_rx, int budget) { u32 ch_map; int num_rx, cur_budget, ch; @@ -1020,7 +1035,21 @@ static int cpsw_rx_poll(struct napi_struct *napi_rx, int budget) if (num_rx < budget) { napi_complete_done(napi_rx, num_rx); writel(0xff, >wr_regs->rx_en); - if (cpsw->quirk_irq && cpsw->rx_irq_disabled) { + } + + return num_rx; +} + +static int cpsw_rx_poll(struct napi_struct *napi_rx, int budget) +{ + struct cpsw_common *cpsw = napi_to_cpsw(napi_rx); + int num_rx; + + num_rx = cpdma_chan_process(cpsw->rxv[0].ch, budget); + if (num_rx < budget) { + napi_complete_done(napi_rx, num_rx); + writel(0xff, >wr_regs->rx_en); + if (cpsw->rx_irq_disabled) { cpsw->rx_irq_disabled = false; enable_irq(cpsw->irqs_table[0]); } @@ -2364,9 +2393,9 @@ static void cpsw_get_channels(struct net_device *ndev, { struct cpsw_common *cpsw = ndev_to_cpsw(ndev); + ch->max_rx = cpsw->quirk_irq ? 1 : CPSW_MAX_QUEUES; + ch->max_tx = cpsw->quirk_irq ? 
1 : CPSW_MAX_QUEUES; ch->max_combined = 0; - ch->max_rx = CPSW_MAX_QUEUES; - ch->max_tx = CPSW_MAX_QUEUES; ch->max_other = 0; ch->other_count = 0; ch->rx_count = cpsw->rx_ch_num; @@ -2377,6 +2406,11 @@ static void cpsw_get_channels(struct net_device *ndev, static int cpsw_check_ch_settings(struct cpsw_common *cpsw, struct ethtool_channels *ch) { + if (cpsw->quirk_irq) { + dev_err(cpsw->dev, "Maximum one tx/rx queue is allowed"); + return -EOPNOTSUPP; + } + if (ch->combined_count) return -EINVAL; @@ -2917,44 +2951,20 @@ static int cpsw_probe_dual_emac(struct cpsw_priv *priv) return ret; } -#define CPSW_QUIRK_IRQ BIT(0) - -static const struct platform_device_id cpsw_devtype[] = { - { - /* keep it for existing comaptibles */ - .name = "cpsw", - .driver_data = CPSW_QUIRK_IRQ, - }, { - .name = "am335x-cpsw", - .driver_data = CPSW_QUIRK_IRQ, - }, { - .name = "am4372-cpsw", - .driver_data = 0, - }, { - .name = "dra7-cpsw", - .driver_data = 0, - }, { - /* sentinel */ - } -};
Re: [PATCH bpf-next 2/7] bpf: introduce bpf subcommand BPF_PERF_EVENT_QUERY
On 5/16/18 4:27 AM, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 04:45:16PM -0700, Yonghong Song wrote:
>> Currently, suppose a userspace application has loaded a bpf program
>> and attached it to a tracepoint/kprobe/uprobe, and a bpf introspection
>> tool, e.g., bpftool, wants to show which bpf program is attached to
>> which tracepoint/kprobe/uprobe. Such attachment information will be
>> really useful to understand the overall bpf deployment in the system.
>>
>> There is a name field (16 bytes) for each program, which could be used
>> to encode the attachment point. There are some drawbacks for this
>> approach. First, bpftool user (e.g., an admin) may not really
>> understand the association between the name and the attachment point.
>> Second, if one program is attached to multiple places, encoding a
>> proper name which can imply all these attachments becomes difficult.
>>
>> This patch introduces a new bpf subcommand BPF_PERF_EVENT_QUERY.
>> Given a pid and fd, if the <pid, fd> is associated with a
>> tracepoint/kprobe/uprobe perf event, BPF_PERF_EVENT_QUERY will return
>>    . prog_id
>>    . tracepoint name, or
>>    . k[ret]probe funcname + offset or kernel addr, or
>>    . u[ret]probe filename + offset
>> to the userspace.
>> The user can use "bpftool prog" to find more information about bpf
>> program itself with prog_id.
>>
>> Signed-off-by: Yonghong Song
>> ---
>>  include/linux/trace_events.h |  15 ++
>>  include/uapi/linux/bpf.h     |  25 ++
>>  kernel/bpf/syscall.c         | 113 +++
>>  kernel/trace/bpf_trace.c     |  53 
>>  kernel/trace/trace_kprobe.c  |  29 +++
>>  kernel/trace/trace_uprobe.c  |  22 +
>>  6 files changed, 257 insertions(+)
>
> Why is the command called *_PERF_EVENT_* ? Are there not a lot of !perf
> places to attach BPF proglets?

Just to give a complete picture, the below are the major places to attach
BPF programs:
  . perf based (through perf ioctl)
  . raw tracepoint based (through bpf interface)
  . netlink interface for tc, xdp, tunneling
  . setsockopt for socket filters
  . cgroup based (bpf attachment subcommand), mostly networking and
    io devices
  . some other networking socket related (sk_skb stream/parser/verdict,
    sk_msg verdict) through bpf attachment subcommand.

Currently, for cgroup based attachment, we have BPF_PROG_QUERY with an
input cgroup file descriptor. For other networking based queries, we may
need to enumerate tc filters, networking devices, open sockets, etc. to
get the attachment information. So to have one BPF_QUERY command may be
too complex to cover all cases.

But you are right that the BPF_PERF_EVENT_QUERY name is too narrow since
it should be used for other (pid, fd) based queries as well (e.g.,
socket, or other potential uses in the future). How about the subcommand
name BPF_TASK_FD_QUERY and make bpf_attr.task_fd_query extensible?

Thanks!
Re: [PATCH bpf-next v5 3/6] bpf: Add IPv6 Segment Routing helpers
2018-05-14 23:40 GMT+01:00 Daniel Borkmann:
> On 05/12/2018 07:25 PM, Mathieu Xhonneux wrote:
> [...]
>> +BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
>> +	   const void *, from, u32, len)
>> +{
>> +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
>> +	struct seg6_bpf_srh_state *srh_state =
>> +		this_cpu_ptr(&seg6_bpf_srh_states);
>> +	void *srh_tlvs, *srh_end, *ptr;
>> +	struct ipv6_sr_hdr *srh;
>> +	int srhoff = 0;
>> +
>> +	if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
>> +		return -EINVAL;
>> +
>> +	srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
>> +	srh_tlvs = (void *)((char *)srh + ((srh->first_segment + 1) << 4));
>> +	srh_end = (void *)((char *)srh + sizeof(*srh) + srh_state->hdrlen);
>
> Do we need to check that this cannot go out of bounds wrt skb data?

input_action_bpf_end (which calls the BPF program) already verifies using
get_srh() that the whole SRH is accessible and is not out of bounds. The
seg6 helpers (e.g. bpf_lwt_seg6_adjust_srh) then modify srh_state->hdrlen
following the evolution of the SRH in size. I don't think that a check on
srh_end is needed here as the SRH is already verified once, and
srh_state->hdrlen is then updated to keep this bound correct.

>> +	ptr = skb->data + offset;
>> +	if (ptr >= srh_tlvs && ptr + len <= srh_end)
>> +		srh_state->valid = 0;
>> +	else if (ptr < (void *)&srh->flags ||
>> +		 ptr + len > (void *)&srh->segments)
>> +		return -EFAULT;
>> +
>> +	if (unlikely(bpf_try_make_writable(skb, offset + len)))
>> +		return -EFAULT;
>> +
>> +	memcpy(ptr, from, len);
>
> You have a use after free here. bpf_try_make_writable() is potentially
> changing underlying skb->data (e.g. see pskb_expand_head()). Therefore
> memcpy()'ing into the cached ptr is invalid.

OK.

>> +	if (len > 0) {
>> +		ret = skb_cow_head(skb, len);
>> +		if (unlikely(ret < 0))
>> +			return ret;
>> +
>> +		ret = bpf_skb_net_hdr_push(skb, offset, len);
>> +	} else {
>> +		ret = bpf_skb_net_hdr_pop(skb, offset, -1 * len);
>> +	}
>> +	if (unlikely(ret < 0))
>> +		return ret;
>
> And here as well. You changed underlying pointers via skb_cow_head(), but
> in the error path you leave the cached pointers that now point to an
> already freed buffer. Thus, you'd now be able to access the new skb data
> out of bounds since cb->data_end is still the old one due to missing
> bpf_compute_data_pointers(skb).
> Please fix and audit your whole series carefully against these types of
> subtle bugs.

Right. I went through the whole series again. I found a similar mistake
in bpf_push_seg6_encap, and added also a bpf_compute_data_pointers(skb)
there. I didn't find anything else, so I hope that we're covered here
(bpf_lwt_seg6_store_bytes, bpf_lwt_seg6_adjust_srh and bpf_push_seg6_encap
are the only functions modifying the packet in this series). Thanks.
I'll submit a v6 ASAP.
Re: [PATCH 00/14] Modify action API for implementing lockless actions
Wed, May 16, 2018 at 11:23:41PM CEST, vla...@mellanox.com wrote: > >On Wed 16 May 2018 at 17:36, Roman Mashak wrote: >> Vlad Buslov writes: >> >>> On Wed 16 May 2018 at 14:38, Roman Mashak wrote: On Wed, May 16, 2018 at 2:43 AM, Vlad Buslov wrote: > I'm trying to run tdc, but keep getting following error even on clean > branch without my patches: Vlad, not sure if you saw my email: Apply Roman's patch and try again https://marc.info/?l=linux-netdev&m=152639369112020&w=2 cheers, jamal >>> >>> With patch applied I get following error: >>> >>> Test 7d50: Add skbmod action to set destination mac >>> exit: 255 0 >>> dst MAC address <11:22:33:44:55:66> >>> RTNETLINK answers: No such file or directory >>> We have an error talking to the kernel >>> >> >> You may actually have broken something with your patches in this case. > > Results is for net-next without my patches. Do you have skbmod compiled in kernel or as a module? >>> >>> Thanks, already figured out that default config has some actions >>> disabled. >>> Have more errors now. 
Everything related to ife: >>> >>> Test 7682: Create valid ife encode action with mark and pass control >>> exit: 255 0 >>> IFE type 0xED3E >>> RTNETLINK answers: No such file or directory >>> We have an error talking to the kernel >>> >>> Test ef47: Create valid ife encode action with mark and pipe control >>> exit: 255 0 >>> IFE type 0xED3E >>> RTNETLINK answers: No space left on device >>> We have an error talking to the kernel >>> >>> Test df43: Create valid ife encode action with mark and continue control >>> exit: 255 0 >>> IFE type 0xED3E >>> RTNETLINK answers: No space left on device >>> We have an error talking to the kernel >>> >>> Test e4cf: Create valid ife encode action with mark and drop control >>> exit: 255 0 >>> IFE type 0xED3E >>> RTNETLINK answers: No space left on device >>> We have an error talking to the kernel >>> >>> Test ccba: Create valid ife encode action with mark and reclassify control >>> exit: 255 0 >>> IFE type 0xED3E >>> RTNETLINK answers: No space left on device >>> We have an error talking to the kernel >>> >>> Test a1cf: Create valid ife encode action with mark and jump control >>> exit: 255 0 >>> IFE type 0xED3E >>> RTNETLINK answers: No space left on device >>> We have an error talking to the kernel >>> >>> ... >>> >>> >> >> Please make sure you have these in your kernel config: >> >> CONFIG_NET_ACT_IFE=y >> CONFIG_NET_IFE_SKBMARK=m >> CONFIG_NET_IFE_SKBPRIO=m >> CONFIG_NET_IFE_SKBTCINDEX=m Roman, could you please add this to some file? Something similar to: tools/testing/selftests/net/forwarding/config Thanks! >> >> For tdc to run all the tests, it is assumed that all the supported tc >> actions/filters are enabled and compiled. > >Enabling these options allowed all ife tests to pass. Thanks! > >Error in u32 test still appears however: > >Test e9a3: Add u32 with source match > >-> prepare stage *** Could not execute: "$TC qdisc add dev $DEV1 ingress" > >-> prepare stage *** Error message: "Cannot find device "v0p1"
[bpf PATCH 1/2] bpf: sockmap update rollback on error can incorrectly dec prog refcnt
If the user were to only attach one of the parse or verdict programs then
it is possible a subsequent sockmap update could incorrectly decrement
the refcnt on the program. This happens because, in the rollback logic
after an error, we have to decrement the program reference count when it
has been incremented. However, we only increment the program reference
count if the user has both a verdict and a parse program. The reason for
this is because, at least at the moment, both are required for any one to
be meaningful. The problem fixed here is that in the rollback path we
decrement the program refcnt even if only one exists. But we never
incremented the refcnt in the first place, creating an imbalance.

This patch fixes the error path to handle this case.

Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
Reported-by: Daniel Borkmann
Signed-off-by: John Fastabend
---
 kernel/bpf/sockmap.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 098eca5..f03aaa8 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1717,10 +1717,10 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 	if (tx_msg) {
 		tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);
 		if (IS_ERR(tx_msg)) {
-			if (verdict)
-				bpf_prog_put(verdict);
-			if (parse)
+			if (parse && verdict) {
 				bpf_prog_put(parse);
+				bpf_prog_put(verdict);
+			}
 			return PTR_ERR(tx_msg);
 		}
 	}
@@ -1805,10 +1805,10 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 out_free:
 	smap_release_sock(psock, sock);
 out_progs:
-	if (verdict)
-		bpf_prog_put(verdict);
-	if (parse)
+	if (parse && verdict) {
 		bpf_prog_put(parse);
+		bpf_prog_put(verdict);
+	}
 	if (tx_msg)
 		bpf_prog_put(tx_msg);
 	write_unlock_bh(&sock->sk_callback_lock);
[bpf PATCH 2/2] bpf: parse and verdict prog attach may race with bpf map update
In the sockmap design BPF programs (SK_SKB_STREAM_PARSER and
SK_SKB_STREAM_VERDICT) are attached to the sockmap map type and when a
sock is added to the map the programs are used by the socket. However,
sockmap updates from both userspace and BPF programs can happen
concurrently with the attach and detach of these programs.

To resolve this we use bpf_prog_inc_not_zero and a READ_ONCE()
primitive to ensure the program pointer is not refetched and possibly
NULL'd before the refcnt increment. This happens inside a RCU critical
section so although the pointer reference in the map object may be NULL
(by a concurrent detach operation) the reference from READ_ONCE will not
be free'd until after the grace period. This ensures the object returned
by READ_ONCE() is valid through the RCU critical section and safe to use
as long as we "know" it may be free'd shortly.

Daniel spotted a case in the sock update API where instead of using the
READ_ONCE() program reference we used the pointer from the original map,
stab->bpf_{verdict|parse}. The problem with this is the logic checks the
object returned from the READ_ONCE() is not NULL and then tries to
reference the object again but using the above map pointer, which may
have already been NULL'd by a parallel detach operation. If this
happened bpf_prog_inc_not_zero could dereference a NULL pointer.

Fix this by using the variable returned by READ_ONCE() that is checked
for NULL.

Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
Reported-by: Daniel Borkmann
Signed-off-by: John Fastabend
---
 kernel/bpf/sockmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index f03aaa8..583c1eb 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1703,11 +1703,11 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 	 * we increment the refcnt. If this is the case abort with an
 	 * error.
 	 */
-	verdict = bpf_prog_inc_not_zero(stab->bpf_verdict);
+	verdict = bpf_prog_inc_not_zero(verdict);
 	if (IS_ERR(verdict))
 		return PTR_ERR(verdict);
 
-	parse = bpf_prog_inc_not_zero(stab->bpf_parse);
+	parse = bpf_prog_inc_not_zero(parse);
 	if (IS_ERR(parse)) {
 		bpf_prog_put(verdict);
 		return PTR_ERR(parse);
Re: [PATCH bpf-next v6 1/4] bpf: sockmap, refactor sockmap routines to work with hashmap
On 05/15/2018 12:19 PM, Daniel Borkmann wrote: > On 05/14/2018 07:00 PM, John Fastabend wrote: > [...] [...] > > As you say in the comment above the function wrt locking notes that the > __sock_map_ctx_update_elem() can be called concurrently. > > All operations operate on sock_map using cmpxchg and xchg operations to > ensure we > do not get stale references. Any reads into the map must be done with > READ_ONCE() > because of this. > > You initially use the READ_ONCE() on the verdict/parse/tx_msg, but later on > when > grabbing the reference you use again progs->bpf_verdict/bpf_parse/bpf_tx_msg > which > would potentially refetch it, but if updates would happen concurrently e.g. > to the > three progs, they could be NULL in the mean-time, no? bpf_prog_inc_not_zero() > would > then crash. Why are not the ones used that you fetched previously via > READ_ONCE() > for taking the ref? Nice catch. We should use the reference fetched by READ_ONCE in all cases. > > The second question I had is that verdict/parse/tx_msg are updated > independently > from each other and each could be NULL or non-NULL. What if, say, parse is > NULL > and verdict as well as tx_msg is non-NULL and the bpf_prog_inc_not_zero() on > the > tx_msg prog fails. Doesn't this cause a use-after-free since a ref on verdict > wasn't > taken earlier but the bpf_prog_put() will cause accidental misbalance/free of > the > progs? Also good catch. I'll send patches for both now. Thanks. > > It would probably help to clarify the locking comment a bit more if indeed the > above should be okay as is. > > Thanks, > Daniel >
Re: [PATCH 00/14] Modify action API for implementing lockless actions
On Wed 16 May 2018 at 18:10, Davide Caratti wrote: > On Wed, 2018-05-16 at 13:36 -0400, Roman Mashak wrote: >> Vlad Buslov writes: >> >> > On Wed 16 May 2018 at 14:38, Roman Mashak wrote: >> > > On Wed, May 16, 2018 at 2:43 AM, Vlad Buslov wrote: >> > > > > > > > I'm trying to run tdc, but keep getting following error even >> > > > > > > > on clean >> > > > > > > > branch without my patches: >> > > > > > > >> > > > > > > Vlad, not sure if you saw my email: >> > > > > > > Apply Roman's patch and try again >> > > > > > > >> > > > > > > https://marc.info/?l=linux-netdev&m=152639369112020&w=2 >> > > > > > > >> > > > > > > cheers, >> > > > > > > jamal >> > > > > > >> > > > > > With patch applied I get following error: >> > > > > > >> > > > > > Test 7d50: Add skbmod action to set destination mac >> > > > > > exit: 255 0 >> > > > > > dst MAC address <11:22:33:44:55:66> >> > > > > > RTNETLINK answers: No such file or directory >> > > > > > We have an error talking to the kernel >> > > > > > >> > > > > >> > > > > You may actually have broken something with your patches in this >> > > > > case. >> > > > >> > > > Results is for net-next without my patches. >> > > >> > > Do you have skbmod compiled in kernel or as a module? >> > >> > Thanks, already figured out that default config has some actions >> > disabled. >> > Have more errors now. 
Everything related to ife: >> > >> > Test 7682: Create valid ife encode action with mark and pass control >> > exit: 255 0 >> > IFE type 0xED3E >> > RTNETLINK answers: No such file or directory >> > We have an error talking to the kernel >> > >> > Test ef47: Create valid ife encode action with mark and pipe control >> > exit: 255 0 >> > IFE type 0xED3E >> > RTNETLINK answers: No space left on device >> > We have an error talking to the kernel >> > >> > Test df43: Create valid ife encode action with mark and continue control >> > exit: 255 0 >> > IFE type 0xED3E >> > RTNETLINK answers: No space left on device >> > We have an error talking to the kernel >> > >> > Test e4cf: Create valid ife encode action with mark and drop control >> > exit: 255 0 >> > IFE type 0xED3E >> > RTNETLINK answers: No space left on device >> > We have an error talking to the kernel >> > >> > Test ccba: Create valid ife encode action with mark and reclassify control >> > exit: 255 0 >> > IFE type 0xED3E >> > RTNETLINK answers: No space left on device >> > We have an error talking to the kernel >> > >> > Test a1cf: Create valid ife encode action with mark and jump control >> > exit: 255 0 >> > IFE type 0xED3E >> > RTNETLINK answers: No space left on device >> > We have an error talking to the kernel >> > >> > ... >> > >> > >> >> Please make sure you have these in your kernel config: >> >> CONFIG_NET_ACT_IFE=y >> CONFIG_NET_IFE_SKBMARK=m >> CONFIG_NET_IFE_SKBPRIO=m >> CONFIG_NET_IFE_SKBTCINDEX=m >> >> For tdc to run all the tests, it is assumed that all the supported tc >> actions/filters are enabled and compiled. > hello, > > looking at ife.json, it seems that we have at least 4 typos in > 'teardown'. > > It does > > $TC actions flush action skbedit > > in place of > > $TC actions flush action ife > > On my fedora28 (with fedora28 kernel), fixing them made test 7682 return > 'ok' (and all others in ife category, except ee94, 7ee0 and 0a7d). 
> > regards, I can confirm that on net-next kernel version that I use, there are also multiple teardowns of actions type skbedit after actually creating ife action in file ife.json. However, tests pass when I enabled config options that Roman suggested: ok 119 - 7682 # Create valid ife encode action with mark and pass control
Re: [PATCH iproute2-next] tc-netem: fix limit description in man page
On Wed, 16 May 2018 15:17:50 -0600 David Ahern wrote: > On 5/15/18 6:49 PM, Marcelo Ricardo Leitner wrote: > > As the kernel code says, limit is actually the amount of packets it can > > hold queued at a time, as per: > > > > static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch, > > struct sk_buff **to_free) > > { > > ... > > if (unlikely(sch->q.qlen >= sch->limit)) > > return qdisc_drop_all(skb, sch, to_free); > > > > So lets fix the description of the field in the man page. > > > > Signed-off-by: Marcelo Ricardo Leitner > > --- > > man/man8/tc-netem.8 | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > applied to iproute2-next. Thanks, > Since it is an error, I will put it in master.
Re: [PATCH 00/14] Modify action API for implementing lockless actions
On Wed 16 May 2018 at 17:36, Roman Mashak wrote: > Vlad Buslov writes: > >> On Wed 16 May 2018 at 14:38, Roman Mashak wrote: >>> On Wed, May 16, 2018 at 2:43 AM, Vlad Buslov wrote: I'm trying to run tdc, but keep getting following error even on clean branch without my patches: >>> >>> Vlad, not sure if you saw my email: >>> Apply Roman's patch and try again >>> >>> https://marc.info/?l=linux-netdev&m=152639369112020&w=2 >>> >>> cheers, >>> jamal >> >> With patch applied I get following error: >> >> Test 7d50: Add skbmod action to set destination mac >> exit: 255 0 >> dst MAC address <11:22:33:44:55:66> >> RTNETLINK answers: No such file or directory >> We have an error talking to the kernel >> > > You may actually have broken something with your patches in this case. Results is for net-next without my patches. >>> >>> Do you have skbmod compiled in kernel or as a module? >> >> Thanks, already figured out that default config has some actions >> disabled. >> Have more errors now. Everything related to ife: >> >> Test 7682: Create valid ife encode action with mark and pass control >> exit: 255 0 >> IFE type 0xED3E >> RTNETLINK answers: No such file or directory >> We have an error talking to the kernel >> >> Test ef47: Create valid ife encode action with mark and pipe control >> exit: 255 0 >> IFE type 0xED3E >> RTNETLINK answers: No space left on device >> We have an error talking to the kernel >> >> Test df43: Create valid ife encode action with mark and continue control >> exit: 255 0 >> IFE type 0xED3E >> RTNETLINK answers: No space left on device >> We have an error talking to the kernel >> >> Test e4cf: Create valid ife encode action with mark and drop control >> exit: 255 0 >> IFE type 0xED3E >> RTNETLINK answers: No space left on device >> We have an error talking to the kernel >> >> Test ccba: Create valid ife encode action with mark and reclassify control >> exit: 255 0 >> IFE type 0xED3E >> RTNETLINK answers: No space left on device >> We have an error talking 
to the kernel >> >> Test a1cf: Create valid ife encode action with mark and jump control >> exit: 255 0 >> IFE type 0xED3E >> RTNETLINK answers: No space left on device >> We have an error talking to the kernel >> >> ... >> >> > > Please make sure you have these in your kernel config: > > CONFIG_NET_ACT_IFE=y > CONFIG_NET_IFE_SKBMARK=m > CONFIG_NET_IFE_SKBPRIO=m > CONFIG_NET_IFE_SKBTCINDEX=m > > For tdc to run all the tests, it is assumed that all the supported tc > actions/filters are enabled and compiled. Enabling these options allowed all ife tests to pass. Thanks! Error in u32 test still appears however: Test e9a3: Add u32 with source match -> prepare stage *** Could not execute: "$TC qdisc add dev $DEV1 ingress" -> prepare stage *** Error message: "Cannot find device "v0p1"
Re: [PATCH iproute2-next] tc-netem: fix limit description in man page
On 5/15/18 6:49 PM, Marcelo Ricardo Leitner wrote: > As the kernel code says, limit is actually the amount of packets it can > hold queued at a time, as per: > > static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch, > struct sk_buff **to_free) > { > ... > if (unlikely(sch->q.qlen >= sch->limit)) > return qdisc_drop_all(skb, sch, to_free); > > So lets fix the description of the field in the man page. > > Signed-off-by: Marcelo Ricardo Leitner > --- > man/man8/tc-netem.8 | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > applied to iproute2-next. Thanks,
Re: [PATCH net-next v12 2/7] sch_cake: Add ingress mode
Cong Wang writes: > On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen wrote: >> + if (tb[TCA_CAKE_AUTORATE]) { >> + if (!!nla_get_u32(tb[TCA_CAKE_AUTORATE])) >> + q->rate_flags |= CAKE_FLAG_AUTORATE_INGRESS; >> + else >> + q->rate_flags &= ~CAKE_FLAG_AUTORATE_INGRESS; >> + } >> + >> + if (tb[TCA_CAKE_INGRESS]) { >> + if (!!nla_get_u32(tb[TCA_CAKE_INGRESS])) >> + q->rate_flags |= CAKE_FLAG_INGRESS; >> + else >> + q->rate_flags &= ~CAKE_FLAG_INGRESS; >> + } >> + >> if (tb[TCA_CAKE_MEMORY]) >> q->buffer_config_limit = nla_get_u32(tb[TCA_CAKE_MEMORY]); >> >> @@ -1559,6 +1628,14 @@ static int cake_dump(struct Qdisc *sch, struct >> sk_buff *skb) >> if (nla_put_u32(skb, TCA_CAKE_MEMORY, q->buffer_config_limit)) >> goto nla_put_failure; >> >> + if (nla_put_u32(skb, TCA_CAKE_AUTORATE, >> + !!(q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS))) >> + goto nla_put_failure; >> + >> + if (nla_put_u32(skb, TCA_CAKE_INGRESS, >> + !!(q->rate_flags & CAKE_FLAG_INGRESS))) >> + goto nla_put_failure; >> + > > Why do you want to dump each bit of the rate_flags separately rather than > dumping the whole rate_flags as an integer? Well, these were added one at a time, each as a new option. Isn't that more or less congruent with how netlink attributes are supposed to be used? -Toke
Re: [PATCH net-next v12 4/7] sch_cake: Add NAT awareness to packet classifier
Cong Wang writes: > On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen wrote: >> When CAKE is deployed on a gateway that also performs NAT (which is a >> common deployment mode), the host fairness mechanism cannot distinguish >> internal hosts from each other, and so fails to work correctly. >> >> To fix this, we add an optional NAT awareness mode, which will query the >> kernel conntrack mechanism to obtain the pre-NAT addresses for each packet >> and use that in the flow and host hashing. >> >> When the shaper is enabled and the host is already performing NAT, the cost >> of this lookup is negligible. However, in unlimited mode with no NAT being >> performed, there is a significant CPU cost at higher bandwidths. For this >> reason, the feature is turned off by default. >> >> Signed-off-by: Toke Høiland-Jørgensen >> --- >> net/sched/sch_cake.c | 73 >> ++ >> 1 file changed, 73 insertions(+) >> >> diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c >> index 65439b643c92..e1038a7b6686 100644 >> --- a/net/sched/sch_cake.c >> +++ b/net/sched/sch_cake.c >> @@ -71,6 +71,12 @@ >> #include >> #include >> >> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK) >> +#include >> +#include >> +#include >> +#endif >> + >> #define CAKE_SET_WAYS (8) >> #define CAKE_MAX_TINS (8) >> #define CAKE_QUEUES (1024) >> @@ -514,6 +520,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars, >> return drop; >> } >> >> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK) >> + >> +static void cake_update_flowkeys(struct flow_keys *keys, >> +const struct sk_buff *skb) >> +{ >> + const struct nf_conntrack_tuple *tuple; >> + enum ip_conntrack_info ctinfo; >> + struct nf_conn *ct; >> + bool rev = false; >> + >> + if (tc_skb_protocol(skb) != htons(ETH_P_IP)) >> + return; >> + >> + ct = nf_ct_get(skb, &ctinfo); >> + if (ct) { >> + tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo)); >> + } else { >> + const struct nf_conntrack_tuple_hash *hash; >> + struct nf_conntrack_tuple srctuple; >> + >> + if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb), >> + NFPROTO_IPV4, dev_net(skb->dev), >> + &srctuple)) >> + return; >> + >> + hash = nf_conntrack_find_get(dev_net(skb->dev), >> + &nf_ct_zone_dflt, >> + &srctuple); >> + if (!hash) >> + return; >> + >> + rev = true; >> + ct = nf_ct_tuplehash_to_ctrack(hash); >> + tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir); >> + } >> + >> + keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip; >> + keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip; >> + >> + if (keys->ports.ports) { >> + keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all; >> + keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all; >> + } >> + if (rev) >> + nf_ct_put(ct); >> +} >> +#else >> +static void cake_update_flowkeys(struct flow_keys *keys, >> +const struct sk_buff *skb) >> +{ >> + /* There is nothing we can do here without CONNTRACK */ >> +} >> +#endif >> + >> /* Cake has several subtle multiple bit settings. In these cases you >> * would be matching triple isolate mode as well. >> */ >> @@ -541,6 +601,9 @@ static u32 cake_hash(struct cake_tin_data *q, const >> struct sk_buff *skb, >> skb_flow_dissect_flow_keys(skb, &keys, >> FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL); >> >> + if (flow_mode & CAKE_FLOW_NAT_FLAG) >> + cake_update_flowkeys(&keys, skb); >> + >> /* flow_hash_from_keys() sorts the addresses by value, so we have >> * to preserve their order in a separate data structure to treat >> * src and dst host addresses as independently selectable. >> @@ -1727,6 +1790,12 @@ static int cake_change(struct Qdisc *sch, struct >> nlattr *opt, >> q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) & >> CAKE_FLOW_MASK); >> >> + if (tb[TCA_CAKE_NAT]) { >> + q->flow_mode &= ~CAKE_FLOW_NAT_FLAG; >> + q->flow_mode |= CAKE_FLOW_NAT_FLAG * >> + !!nla_get_u32(tb[TCA_CAKE_NAT]); >> + } > > > I think it's better to return -EOPNOTSUPP when CONFIG_NF_CONNTRACK > is not enabled. Good point, will fix :) -Toke
Re: [PATCH bpf-next] bpf: fix sock hashmap kmalloc warning
On 05/16/2018 02:06 PM, Yonghong Song wrote: > syzbot reported a kernel warning below: > WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 > mm/slab_common.c:996 > Kernel panic - not syncing: panic_on_warn set ... > > CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: >__dump_stack lib/dump_stack.c:77 [inline] >dump_stack+0x1b9/0x294 lib/dump_stack.c:113 >panic+0x22f/0x4de kernel/panic.c:184 >__warn.cold.8+0x163/0x1b3 kernel/panic.c:536 >report_bug+0x252/0x2d0 lib/bug.c:186 >fixup_bug arch/x86/kernel/traps.c:178 [inline] >do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296 >do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315 >invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992 > RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996 > RSP: 0018:8801d907fc58 EFLAGS: 00010246 > RAX: RBX: 8801aeecb280 RCX: 8185ebd7 > RDX: RSI: RDI: ffe1 > RBP: 8801d907fc58 R08: 8801adb5e1c0 R09: ed0035a84700 > R10: ed0035a84700 R11: 8801ad423803 R12: 8801aeecb280 > R13: fff4 R14: 8801ad891a00 R15: 014200c0 >__do_kmalloc mm/slab.c:3713 [inline] >__kmalloc+0x25/0x760 mm/slab.c:3727 >kmalloc include/linux/slab.h:517 [inline] >map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858 >__do_sys_bpf kernel/bpf/syscall.c:2131 [inline] >__se_sys_bpf kernel/bpf/syscall.c:2096 [inline] >__x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096 >do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 >entry_SYSCALL_64_after_hwframe+0x49/0xbe > > The test case is against sock hashmap with a key size 0xffe1. > Such a large key size will cause the below code in function > sock_hash_alloc() overflowing and produces a smaller elem_size, > hence map creation will be successful. 
>    htab->elem_size = sizeof(struct htab_elem) +
>                      round_up(htab->map.key_size, 8);
>
> Later, when map_get_next_key is called and kernel tries
> to allocate the key unsuccessfully, it will issue
> the above warning.
>
> Similar to hashtab, ensure the key size is at most
> MAX_BPF_STACK for a successful map creation.
>
> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> Reported-by: syzbot+e4566d29080e7f346...@syzkaller.appspotmail.com
> Signed-off-by: Yonghong Song
> ---
>  kernel/bpf/sockmap.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 56879c9fd3a4..79f5e899 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -1990,6 +1990,12 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
>  	    attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
>  		return ERR_PTR(-EINVAL);
>  
> +	if (attr->key_size > MAX_BPF_STACK)
> +		/* eBPF programs initialize keys on stack, so they cannot be
> +		 * larger than max stack size
> +		 */
> +		return ERR_PTR(-E2BIG);
> +
>  	err = bpf_tcp_ulp_register();
>  	if (err && err != -EEXIST)
>  		return ERR_PTR(err);

Thanks!

Acked-by: John Fastabend
Re: [iproute2-next v2 1/1] tipc: fixed node and name table listings
On 5/15/18 7:54 AM, Jon Maloy wrote: > We make it easier for users to correlate between 128-bit node > identities and 32-bit node hash number by extending the 'node list' > command to also show the hash number. > > We also improve the 'nametable show' command to show the node identity > instead of the node hash number. Since the former potentially is much > longer than the latter, we make room for it by eliminating the (to the > user) irrelevant publication key. We also reorder some of the columns so > that the node id comes last, since this looks nicer and is more logical. > > --- > v2: Fixed compiler warning as per comment from David Ahern > > Signed-off-by: Jon Maloy > --- > tipc/misc.c | 18 ++ > tipc/misc.h | 1 + > tipc/nametable.c | 18 ++ > tipc/node.c | 19 --- > tipc/peer.c | 4 > 5 files changed, 41 insertions(+), 19 deletions(-) > > diff --git a/tipc/misc.c b/tipc/misc.c > index 16849f1..e8b726f 100644 > --- a/tipc/misc.c > +++ b/tipc/misc.c > @@ -13,6 +13,9 @@ > #include > #include > #include > +#include > +#include > +#include > #include "misc.h" > > #define IN_RANGE(val, low, high) ((val) <= (high) && (val) >= (low)) > @@ -109,3 +112,18 @@ void nodeid2str(uint8_t *id, char *str) > for (i = 31; str[i] == '0'; i--) > str[i] = 0; > } > + > +void hash2nodestr(uint32_t hash, char *str) > +{ > + struct tipc_sioc_nodeid_req nr = {}; > + int sd; > + > + sd = socket(AF_TIPC, SOCK_RDM, 0); > + if (sd < 0) { > + fprintf(stderr, "opening TIPC socket: %s\n", strerror(errno)); > + return; > + } > + nr.peer = hash; > + if (!ioctl(sd, SIOCGETNODEID, &nr)) > + nodeid2str((uint8_t *)nr.node_id, str); > +} you are leaking sd
Re: [PATCH net-next v12 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc
Cong Wang writes: > On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen wrote: >> + >> +static struct Qdisc *cake_leaf(struct Qdisc *sch, unsigned long arg) >> +{ >> + return NULL; >> +} >> + >> +static unsigned long cake_find(struct Qdisc *sch, u32 classid) >> +{ >> + return 0; >> +} >> + >> +static void cake_walk(struct Qdisc *sch, struct qdisc_walker *arg) >> +{ >> +} > > > Thanks for adding the support to other TC filters, it is much better > now! You're welcome. Turned out not to be that hard :) > A quick question: why class_ops->dump_stats is still NULL? > > It is supposed to dump the stats of each flow. Is there still any > difficulty to map it to tc class? I thought you figured it out when > you added the tcf_classify(). On the classify side, I solved the "multiple sets of queues" problem by using skb->priority to select the tin (diffserv tier) and the classifier output to select the queue within that tin. This would not work for dumping stats; some other way of mapping queues to the linear class space would be needed. And since we are not actually collecting any per-flow stats that I could print, I thought it wasn't worth coming up with a half-baked proposal for this just to add an API hook that no one in the existing CAKE user base has ever asked for... -Toke
Re: [PATCH net-next v12 2/7] sch_cake: Add ingress mode
On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen wrote: > + if (tb[TCA_CAKE_AUTORATE]) { > + if (!!nla_get_u32(tb[TCA_CAKE_AUTORATE])) > + q->rate_flags |= CAKE_FLAG_AUTORATE_INGRESS; > + else > + q->rate_flags &= ~CAKE_FLAG_AUTORATE_INGRESS; > + } > + > + if (tb[TCA_CAKE_INGRESS]) { > + if (!!nla_get_u32(tb[TCA_CAKE_INGRESS])) > + q->rate_flags |= CAKE_FLAG_INGRESS; > + else > + q->rate_flags &= ~CAKE_FLAG_INGRESS; > + } > + > if (tb[TCA_CAKE_MEMORY]) > q->buffer_config_limit = nla_get_u32(tb[TCA_CAKE_MEMORY]); > > @@ -1559,6 +1628,14 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff > *skb) > if (nla_put_u32(skb, TCA_CAKE_MEMORY, q->buffer_config_limit)) > goto nla_put_failure; > > + if (nla_put_u32(skb, TCA_CAKE_AUTORATE, > + !!(q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS))) > + goto nla_put_failure; > + > + if (nla_put_u32(skb, TCA_CAKE_INGRESS, > + !!(q->rate_flags & CAKE_FLAG_INGRESS))) > + goto nla_put_failure; > + Why do you want to dump each bit of the rate_flags separately rather than dumping the whole rate_flags as an integer?
[PATCH bpf-next] bpf: fix sock hashmap kmalloc warning
syzbot reported a kernel warning below: WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 panic+0x22f/0x4de kernel/panic.c:184 __warn.cold.8+0x163/0x1b3 kernel/panic.c:536 report_bug+0x252/0x2d0 lib/bug.c:186 fixup_bug arch/x86/kernel/traps.c:178 [inline] do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992 RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996 RSP: 0018:8801d907fc58 EFLAGS: 00010246 RAX: RBX: 8801aeecb280 RCX: 8185ebd7 RDX: RSI: RDI: ffe1 RBP: 8801d907fc58 R08: 8801adb5e1c0 R09: ed0035a84700 R10: ed0035a84700 R11: 8801ad423803 R12: 8801aeecb280 R13: fff4 R14: 8801ad891a00 R15: 014200c0 __do_kmalloc mm/slab.c:3713 [inline] __kmalloc+0x25/0x760 mm/slab.c:3727 kmalloc include/linux/slab.h:517 [inline] map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858 __do_sys_bpf kernel/bpf/syscall.c:2131 [inline] __se_sys_bpf kernel/bpf/syscall.c:2096 [inline] __x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe The test case is against sock hashmap with a key size 0xffe1. Such a large key size will cause the below code in function sock_hash_alloc() overflowing and produces a smaller elem_size, hence map creation will be successful. htab->elem_size = sizeof(struct htab_elem) + round_up(htab->map.key_size, 8); Later, when map_get_next_key is called and kernel tries to allocate the key unsuccessfully, it will issue the above warning. Similar to hashtab, ensure the key size is at most MAX_BPF_STACK for a successful map creation. 
Fixes: 81110384441a ("bpf: sockmap, add hash map support") Reported-by: syzbot+e4566d29080e7f346...@syzkaller.appspotmail.com Signed-off-by: Yonghong Song --- kernel/bpf/sockmap.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index 56879c9fd3a4..79f5e899 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -1990,6 +1990,12 @@ static struct bpf_map *sock_hash_alloc(union bpf_attr *attr) attr->map_flags & ~SOCK_CREATE_FLAG_MASK) return ERR_PTR(-EINVAL); + if (attr->key_size > MAX_BPF_STACK) + /* eBPF programs initialize keys on stack, so they cannot be + * larger than max stack size + */ + return ERR_PTR(-E2BIG); + err = bpf_tcp_ulp_register(); if (err && err != -EEXIST) return ERR_PTR(err); -- 2.14.3
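The arithmetic hazard described in the commit message can be sketched in plain C. Note that HTAB_ELEM_OVERHEAD and round_up_u32() below are illustrative stand-ins, not the kernel's definitions; MAX_BPF_STACK really is 512 in the kernel, which is the bound the patch enforces:

```c
#include <assert.h>
#include <stdint.h>

/* elem_size is a u32 in the kernel, so
 *   elem_size = sizeof(struct htab_elem) + round_up(key_size, 8)
 * can wrap for a huge key_size and yield a deceptively small element
 * size, letting map creation succeed with an unusable key size. */
#define MAX_BPF_STACK 512u
#define HTAB_ELEM_OVERHEAD 48u	/* stand-in for sizeof(struct htab_elem) */

static uint32_t round_up_u32(uint32_t x, uint32_t a)
{
	return (x + a - 1) & ~(a - 1);	/* wraps when x is near UINT32_MAX */
}

/* The guard the patch adds: 0 on success, -1 (think -E2BIG) otherwise.
 * Keys are initialised on the BPF stack, so they can never be larger
 * than MAX_BPF_STACK anyway. */
static int check_key_size(uint32_t key_size)
{
	return key_size > MAX_BPF_STACK ? -1 : 0;
}
```

The syzbot key size 0xffe1 does not itself wrap round_up(), but it sails past any size an eBPF program could ever construct on its 512-byte stack, so rejecting it at map-creation time is the right fix.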
[PATCH v3 1/2] media: rc: introduce BPF_PROG_RAWIR_EVENT
Add support for BPF_PROG_RAWIR_EVENT. This type of BPF program can call rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report that the last key should be repeated. The bpf program can be attached to using the bpf(BPF_PROG_ATTACH) syscall; the target_fd must be the /dev/lircN device. Signed-off-by: Sean Young--- drivers/media/rc/Kconfig | 13 ++ drivers/media/rc/Makefile | 1 + drivers/media/rc/bpf-rawir-event.c | 363 + drivers/media/rc/lirc_dev.c| 24 ++ drivers/media/rc/rc-core-priv.h| 24 ++ drivers/media/rc/rc-ir-raw.c | 14 +- include/linux/bpf_rcdev.h | 30 +++ include/linux/bpf_types.h | 3 + include/uapi/linux/bpf.h | 55 - kernel/bpf/syscall.c | 7 + 10 files changed, 531 insertions(+), 3 deletions(-) create mode 100644 drivers/media/rc/bpf-rawir-event.c create mode 100644 include/linux/bpf_rcdev.h diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig index eb2c3b6eca7f..2172d65b0213 100644 --- a/drivers/media/rc/Kconfig +++ b/drivers/media/rc/Kconfig @@ -25,6 +25,19 @@ config LIRC passes raw IR to and from userspace, which is needed for IR transmitting (aka "blasting") and for the lirc daemon. +config BPF_RAWIR_EVENT + bool "Support for eBPF programs attached to lirc devices" + depends on BPF_SYSCALL + depends on RC_CORE=y + depends on LIRC + help + Allow attaching eBPF programs to a lirc device using the bpf(2) + syscall command BPF_PROG_ATTACH. This is supported for raw IR + receivers. + + These eBPF programs can be used to decode IR into scancodes, for + IR protocols not supported by the kernel decoders. 
+ menuconfig RC_DECODERS bool "Remote controller decoders" depends on RC_CORE diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile index 2e1c87066f6c..74907823bef8 100644 --- a/drivers/media/rc/Makefile +++ b/drivers/media/rc/Makefile @@ -5,6 +5,7 @@ obj-y += keymaps/ obj-$(CONFIG_RC_CORE) += rc-core.o rc-core-y := rc-main.o rc-ir-raw.o rc-core-$(CONFIG_LIRC) += lirc_dev.o +rc-core-$(CONFIG_BPF_RAWIR_EVENT) += bpf-rawir-event.o obj-$(CONFIG_IR_NEC_DECODER) += ir-nec-decoder.o obj-$(CONFIG_IR_RC5_DECODER) += ir-rc5-decoder.o obj-$(CONFIG_IR_RC6_DECODER) += ir-rc6-decoder.o diff --git a/drivers/media/rc/bpf-rawir-event.c b/drivers/media/rc/bpf-rawir-event.c new file mode 100644 index ..7cb48b8d87b5 --- /dev/null +++ b/drivers/media/rc/bpf-rawir-event.c @@ -0,0 +1,363 @@ +// SPDX-License-Identifier: GPL-2.0 +// bpf-rawir-event.c - handles bpf +// +// Copyright (C) 2018 Sean Young + +#include +#include +#include +#include "rc-core-priv.h" + +/* + * BPF interface for raw IR + */ +const struct bpf_prog_ops rawir_event_prog_ops = { +}; + +BPF_CALL_1(bpf_rc_repeat, struct bpf_rawir_event*, event) +{ + struct ir_raw_event_ctrl *ctrl; + + ctrl = container_of(event, struct ir_raw_event_ctrl, bpf_rawir_event); + + rc_repeat(ctrl->dev); + + return 0; +} + +static const struct bpf_func_proto rc_repeat_proto = { + .func = bpf_rc_repeat, + .gpl_only = true, /* rc_repeat is EXPORT_SYMBOL_GPL */ + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; + +BPF_CALL_4(bpf_rc_keydown, struct bpf_rawir_event*, event, u32, protocol, + u32, scancode, u32, toggle) +{ + struct ir_raw_event_ctrl *ctrl; + + ctrl = container_of(event, struct ir_raw_event_ctrl, bpf_rawir_event); + + rc_keydown(ctrl->dev, protocol, scancode, toggle != 0); + + return 0; +} + +static const struct bpf_func_proto rc_keydown_proto = { + .func = bpf_rc_keydown, + .gpl_only = true, /* rc_keydown is EXPORT_SYMBOL_GPL */ + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = 
ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_ANYTHING, +}; + +static const struct bpf_func_proto * +rawir_event_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + switch (func_id) { + case BPF_FUNC_rc_repeat: + return _repeat_proto; + case BPF_FUNC_rc_keydown: + return _keydown_proto; + case BPF_FUNC_map_lookup_elem: + return _map_lookup_elem_proto; + case BPF_FUNC_map_update_elem: + return _map_update_elem_proto; + case BPF_FUNC_map_delete_elem: + return _map_delete_elem_proto; + case BPF_FUNC_ktime_get_ns: + return _ktime_get_ns_proto; + case BPF_FUNC_tail_call: + return _tail_call_proto; + case BPF_FUNC_get_prandom_u32: + return _get_prandom_u32_proto; + case BPF_FUNC_trace_printk: + if (capable(CAP_SYS_ADMIN)) + return
[PATCH v3 2/2] bpf: add selftest for rawir_event type program
This is simple test over rc-loopback. Signed-off-by: Sean Young--- tools/bpf/bpftool/prog.c | 1 + tools/include/uapi/linux/bpf.h| 57 +++- tools/lib/bpf/libbpf.c| 1 + tools/testing/selftests/bpf/Makefile | 8 +- tools/testing/selftests/bpf/bpf_helpers.h | 6 + tools/testing/selftests/bpf/test_rawir.sh | 37 + .../selftests/bpf/test_rawir_event_kern.c | 26 .../selftests/bpf/test_rawir_event_user.c | 130 ++ 8 files changed, 261 insertions(+), 5 deletions(-) create mode 100755 tools/testing/selftests/bpf/test_rawir.sh create mode 100644 tools/testing/selftests/bpf/test_rawir_event_kern.c create mode 100644 tools/testing/selftests/bpf/test_rawir_event_user.c diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c index 9bdfdf2d3fbe..8889a4ee8577 100644 --- a/tools/bpf/bpftool/prog.c +++ b/tools/bpf/bpftool/prog.c @@ -71,6 +71,7 @@ static const char * const prog_type_name[] = { [BPF_PROG_TYPE_SK_MSG] = "sk_msg", [BPF_PROG_TYPE_RAW_TRACEPOINT] = "raw_tracepoint", [BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr", + [BPF_PROG_TYPE_RAWIR_EVENT] = "rawir_event", }; static void print_boot_time(__u64 nsecs, char *buf, unsigned int size) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 1205d86a7a29..243e141e8a5b 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -141,6 +141,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_SK_MSG, BPF_PROG_TYPE_RAW_TRACEPOINT, BPF_PROG_TYPE_CGROUP_SOCK_ADDR, + BPF_PROG_TYPE_RAWIR_EVENT, }; enum bpf_attach_type { @@ -158,6 +159,7 @@ enum bpf_attach_type { BPF_CGROUP_INET6_CONNECT, BPF_CGROUP_INET4_POST_BIND, BPF_CGROUP_INET6_POST_BIND, + BPF_RAWIR_EVENT, __MAX_BPF_ATTACH_TYPE }; @@ -1829,7 +1831,6 @@ union bpf_attr { * Return * 0 on success, or a negative error in case of failure. * - * * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags) * Description * Do FIB lookup in kernel tables using parameters in *params*. 
@@ -1856,6 +1857,7 @@ union bpf_attr { * Egress device index on success, 0 if packet needs to continue * up the stack for further processing or a negative error in case * of failure. + * * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags) * Description * Add an entry to, or update a sockhash *map* referencing sockets. @@ -1902,6 +1904,35 @@ union bpf_attr { * egress otherwise). This is the only flag supported for now. * Return * **SK_PASS** on success, or **SK_DROP** on error. + * + * int bpf_rc_keydown(void *ctx, u32 protocol, u32 scancode, u32 toggle) + * Description + * Report decoded scancode with toggle value. For use in + * BPF_PROG_TYPE_RAWIR_EVENT, to report a successfully + * decoded scancode. This is will generate a keydown event, + * and a keyup event once the scancode is no longer repeated. + * + * *ctx* pointer to bpf_rawir_event, *protocol* is decoded + * protocol (see RC_PROTO_* enum). + * + * Some protocols include a toggle bit, in case the button + * was released and pressed again between consecutive scancodes, + * copy this bit into *toggle* if it exists, else set to 0. + * + * Return + * Always return 0 (for now) + * + * int bpf_rc_repeat(void *ctx) + * Description + * Repeat the last decoded scancode; some IR protocols like + * NEC have a special IR message for repeat last button, + * in case user is holding a button down; the scancode is + * not repeated. + * + * *ctx* pointer to bpf_rawir_event. 
+ * + * Return + * Always return 0 (for now) */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -1976,7 +2007,9 @@ union bpf_attr { FN(fib_lookup), \ FN(sock_hash_update), \ FN(msg_redirect_hash), \ - FN(sk_redirect_hash), + FN(sk_redirect_hash), \ + FN(rc_repeat), \ + FN(rc_keydown), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -2043,6 +2076,26 @@ enum bpf_hdr_start_off { BPF_HDR_START_NET, }; +/* + * user accessible mirror of in-kernel ir_raw_event + */ +#define BPF_RAWIR_EVENT_SPACE 0 +#define BPF_RAWIR_EVENT_PULSE 1 +#define BPF_RAWIR_EVENT_TIMEOUT2 +#define BPF_RAWIR_EVENT_RESET 3 +#define BPF_RAWIR_EVENT_CARRIER
[PATCH v3 0/2] IR decoding using BPF
The kernel IR decoders (drivers/media/rc/ir-*-decoder.c) support the most widely used IR protocols, but there are many protocols which are not supported[1]. For example, the lirc-remotes[2] repo has over 2700 remotes, many of which are not supported by rc-core. There is a "long tail" of unsupported IR protocols, for which lircd is needed to decode the IR. IR encoding is done in such a way that some simple circuit can decode it; therefore, bpf is ideal. In order to support all these protocols, here we have bpf based IR decoding. The idea is that user-space can define a decoder in bpf and attach it to the rc device through the lirc chardev. Separate work is underway to extend ir-keytable to have an extensive library of bpf-based decoders, and a much expanded library of rc keymaps. Another future application would be to compile IRP[3] to an IR BPF program, and so support virtually every remote without having to write a decoder for each. It might also be possible to support non-button devices such as analog directional pads or air conditioning remote controls and decode the target temperature in bpf, and pass that to an input device.
Thanks, Sean Young [1] http://www.hifi-remote.com/wiki/index.php?title=DecodeIR [2] https://sourceforge.net/p/lirc-remotes/code/ci/master/tree/remotes/ [3] http://www.hifi-remote.com/wiki/index.php?title=IRP_Notation Changes since v2: - Fixed locking issues - Improved self-test to cover more cases - Rebased on bpf-next again Changes since v1: - Code review comments from Y Songand Randy Dunlap - Re-wrote sample bpf to be selftest - Renamed RAWIR_DECODER -> RAWIR_EVENT (Kconfig, context, bpf prog type) - Rebase on bpf-next - Introduced bpf_rawir_event context structure with simpler access checking Sean Young (2): media: rc: introduce BPF_PROG_RAWIR_EVENT bpf: add selftest for rawir_event type program drivers/media/rc/Kconfig | 13 + drivers/media/rc/Makefile | 1 + drivers/media/rc/bpf-rawir-event.c| 363 ++ drivers/media/rc/lirc_dev.c | 24 ++ drivers/media/rc/rc-core-priv.h | 24 ++ drivers/media/rc/rc-ir-raw.c | 14 +- include/linux/bpf_rcdev.h | 30 ++ include/linux/bpf_types.h | 3 + include/uapi/linux/bpf.h | 55 ++- kernel/bpf/syscall.c | 7 + tools/bpf/bpftool/prog.c | 1 + tools/include/uapi/linux/bpf.h| 57 ++- tools/lib/bpf/libbpf.c| 1 + tools/testing/selftests/bpf/Makefile | 8 +- tools/testing/selftests/bpf/bpf_helpers.h | 6 + tools/testing/selftests/bpf/test_rawir.sh | 37 ++ .../selftests/bpf/test_rawir_event_kern.c | 26 ++ .../selftests/bpf/test_rawir_event_user.c | 130 +++ 18 files changed, 792 insertions(+), 8 deletions(-) create mode 100644 drivers/media/rc/bpf-rawir-event.c create mode 100644 include/linux/bpf_rcdev.h create mode 100755 tools/testing/selftests/bpf/test_rawir.sh create mode 100644 tools/testing/selftests/bpf/test_rawir_event_kern.c create mode 100644 tools/testing/selftests/bpf/test_rawir_event_user.c -- 2.17.0
[PATCH bpf-next] libbpf: add ifindex to enable offload support
From: David BeckettBPF programs currently can only be offloaded using iproute2. This patch will allow programs to be offloaded using libbpf calls. Signed-off-by: David Beckett Reviewed-by: Jakub Kicinski --- tools/lib/bpf/bpf.c| 2 ++ tools/lib/bpf/bpf.h| 2 ++ tools/lib/bpf/libbpf.c | 18 +++--- tools/lib/bpf/libbpf.h | 1 + 4 files changed, 20 insertions(+), 3 deletions(-) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index a3a8fb2ac697..6a8a00097fd8 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -91,6 +91,7 @@ int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr) attr.btf_fd = create_attr->btf_fd; attr.btf_key_id = create_attr->btf_key_id; attr.btf_value_id = create_attr->btf_value_id; + attr.map_ifindex = create_attr->map_ifindex; return sys_bpf(BPF_MAP_CREATE, , sizeof(attr)); } @@ -201,6 +202,7 @@ int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr, attr.log_size = 0; attr.log_level = 0; attr.kern_version = load_attr->kern_version; + attr.prog_ifindex = load_attr->prog_ifindex; memcpy(attr.prog_name, load_attr->name, min(name_len, BPF_OBJ_NAME_LEN - 1)); diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index fb3a146d92ff..15bff7728cf1 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -38,6 +38,7 @@ struct bpf_create_map_attr { __u32 btf_fd; __u32 btf_key_id; __u32 btf_value_id; + __u32 map_ifindex; }; int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr); @@ -64,6 +65,7 @@ struct bpf_load_program_attr { size_t insns_cnt; const char *license; __u32 kern_version; + __u32 prog_ifindex; }; /* Recommend log buffer size */ diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index df54c4c9e48a..3dbe217bf23e 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -178,6 +178,7 @@ struct bpf_program { /* Index in elf obj file, for relocation use. 
*/ int idx; char *name; + int prog_ifindex; char *section_name; struct bpf_insn *insns; size_t insns_cnt, main_prog_cnt; @@ -213,6 +214,7 @@ struct bpf_map { int fd; char *name; size_t offset; + int map_ifindex; struct bpf_map_def def; uint32_t btf_key_id; uint32_t btf_value_id; @@ -1091,6 +1093,7 @@ bpf_object__create_maps(struct bpf_object *obj) int *pfd = >fd; create_attr.name = map->name; + create_attr.map_ifindex = map->map_ifindex; create_attr.map_type = def->type; create_attr.map_flags = def->map_flags; create_attr.key_size = def->key_size; @@ -1273,7 +1276,7 @@ static int bpf_object__collect_reloc(struct bpf_object *obj) static int load_program(enum bpf_prog_type type, enum bpf_attach_type expected_attach_type, const char *name, struct bpf_insn *insns, int insns_cnt, -char *license, u32 kern_version, int *pfd) +char *license, u32 kern_version, int *pfd, int prog_ifindex) { struct bpf_load_program_attr load_attr; char *log_buf; @@ -1287,6 +1290,7 @@ load_program(enum bpf_prog_type type, enum bpf_attach_type expected_attach_type, load_attr.insns_cnt = insns_cnt; load_attr.license = license; load_attr.kern_version = kern_version; + load_attr.prog_ifindex = prog_ifindex; if (!load_attr.insns || !load_attr.insns_cnt) return -EINVAL; @@ -1368,7 +1372,8 @@ bpf_program__load(struct bpf_program *prog, } err = load_program(prog->type, prog->expected_attach_type, prog->name, prog->insns, prog->insns_cnt, - license, kern_version, ); + license, kern_version, , + prog->prog_ifindex); if (!err) prog->instances.fds[0] = fd; goto out; @@ -1399,7 +1404,8 @@ bpf_program__load(struct bpf_program *prog, err = load_program(prog->type, prog->expected_attach_type, prog->name, result.new_insn_ptr, result.new_insn_cnt, - license, kern_version, ); + license, kern_version, , + prog->prog_ifindex); if (err) { pr_warning("Loading the %dth instance of program '%s' failed\n", @@ -2188,6 +2194,7 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr, enum bpf_attach_type 
expected_attach_type; enum bpf_prog_type
Re: [PATCH 0/3] ibmvnic: Fix bugs and memory leaks
On 05/16/2018 03:49 PM, Thomas Falcon wrote: > This is a small patch series fixing up some bugs and memory leaks > in the ibmvnic driver. The first fix frees up previously allocated > memory that should be freed in case of an error. The second fixes > a reset case that was failing due to TX/RX queue IRQ's being > erroneously disabled without being enabled again. The final patch > fixes incorrect reallocated of statistics buffers during a device > reset, resulting in loss of statistics information and a memory leak. > > Thomas Falcon (3): > ibmvnic: Free coherent DMA memory if FW map failed > ibmvnic: Fix non-fatal firmware error reset > ibmvnic: Fix statistics buffers memory leak Sorry, these are meant for the 'net' tree. Tom > > drivers/net/ethernet/ibm/ibmvnic.c | 28 +--- > 1 file changed, 17 insertions(+), 11 deletions(-) >
Re: [PATCH net-next v12 4/7] sch_cake: Add NAT awareness to packet classifier
On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen wrote: > When CAKE is deployed on a gateway that also performs NAT (which is a > common deployment mode), the host fairness mechanism cannot distinguish > internal hosts from each other, and so fails to work correctly. > > To fix this, we add an optional NAT awareness mode, which will query the > kernel conntrack mechanism to obtain the pre-NAT addresses for each packet > and use that in the flow and host hashing. > > When the shaper is enabled and the host is already performing NAT, the cost > of this lookup is negligible. However, in unlimited mode with no NAT being > performed, there is a significant CPU cost at higher bandwidths. For this > reason, the feature is turned off by default. > > Signed-off-by: Toke Høiland-Jørgensen > --- > net/sched/sch_cake.c | 73 > ++ > 1 file changed, 73 insertions(+) > > diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c > index 65439b643c92..e1038a7b6686 100644 > --- a/net/sched/sch_cake.c > +++ b/net/sched/sch_cake.c > @@ -71,6 +71,12 @@ > #include > #include > > +#if IS_REACHABLE(CONFIG_NF_CONNTRACK) > +#include > +#include > +#include > +#endif > + > #define CAKE_SET_WAYS (8) > #define CAKE_MAX_TINS (8) > #define CAKE_QUEUES (1024) > @@ -514,6 +520,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars, > return drop; > } > > +#if IS_REACHABLE(CONFIG_NF_CONNTRACK) > + > +static void cake_update_flowkeys(struct flow_keys *keys, > +const struct sk_buff *skb) > +{ > + const struct nf_conntrack_tuple *tuple; > + enum ip_conntrack_info ctinfo; > + struct nf_conn *ct; > + bool rev = false; > + > + if (tc_skb_protocol(skb) != htons(ETH_P_IP)) > + return; > + > + ct = nf_ct_get(skb, &ctinfo); > + if (ct) { > + tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo)); > + } else { > + const struct nf_conntrack_tuple_hash *hash; > + struct nf_conntrack_tuple srctuple; > + > + if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb), > + NFPROTO_IPV4, dev_net(skb->dev), > + &srctuple)) > + return; > + > + hash = nf_conntrack_find_get(dev_net(skb->dev), > + &nf_ct_zone_dflt, > + &srctuple); > + if (!hash) > + return; > + > + rev = true; > + ct = nf_ct_tuplehash_to_ctrack(hash); > + tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir); > + } > + > + keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip; > + keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip; > + > + if (keys->ports.ports) { > + keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all; > + keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all; > + } > + if (rev) > + nf_ct_put(ct); > +} > +#else > +static void cake_update_flowkeys(struct flow_keys *keys, > +const struct sk_buff *skb) > +{ > + /* There is nothing we can do here without CONNTRACK */ > +} > +#endif > + > /* Cake has several subtle multiple bit settings. In these cases you > * would be matching triple isolate mode as well. > */ > @@ -541,6 +601,9 @@ static u32 cake_hash(struct cake_tin_data *q, const > struct sk_buff *skb, > skb_flow_dissect_flow_keys(skb, &keys, > FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL); > > + if (flow_mode & CAKE_FLOW_NAT_FLAG) > + cake_update_flowkeys(&keys, skb); > + > /* flow_hash_from_keys() sorts the addresses by value, so we have > * to preserve their order in a separate data structure to treat > * src and dst host addresses as independently selectable. > @@ -1727,6 +1790,12 @@ static int cake_change(struct Qdisc *sch, struct > nlattr *opt, > q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) & > CAKE_FLOW_MASK); > > + if (tb[TCA_CAKE_NAT]) { > + q->flow_mode &= ~CAKE_FLOW_NAT_FLAG; > + q->flow_mode |= CAKE_FLOW_NAT_FLAG * > + !!nla_get_u32(tb[TCA_CAKE_NAT]); > + } I think it's better to return -EOPNOTSUPP when CONFIG_NF_CONNTRACK is not enabled.
> + > if (tb[TCA_CAKE_RTT]) { > q->interval = nla_get_u32(tb[TCA_CAKE_RTT]); > > @@ -1892,6 +1961,10 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff > *skb) > if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter)) > goto nla_put_failure; > > + if (nla_put_u32(skb, TCA_CAKE_NAT, > + !!(q->flow_mode &
Re: [PATCH v2 net] net/ipv4: Initialize proto and ports in flow struct
On Wed, May 16, 2018 at 1:36 PM, David Ahern wrote: > Updating the FIB tracepoint for the recent change to allow rules using > the protocol and ports exposed a few places where the entries in the flow > struct are not initialized. > > For __fib_validate_source add the call to fib4_rules_early_flow_dissect > since it is invoked for the input path. For netfilter, add the memset on > the flow struct to avoid future problems like this. In ip_route_input_slow > need to set the fields if the skb dissection does not happen. > > Fixes: bfff4862653b ("net: fib_rules: support for match on ip_proto, sport > and dport") > Signed-off-by: David Ahern > --- LGTM, Acked-by: Roopa Prabhu
[PATCH 1/3] ibmvnic: Free coherent DMA memory if FW map failed
If the firmware map fails for whatever reason, remember to free up the memory afterwards. Signed-off-by: Thomas Falcon --- drivers/net/ethernet/ibm/ibmvnic.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index 6e8d6a6..9e08917 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -192,6 +192,7 @@ static int alloc_long_term_buff(struct ibmvnic_adapter *adapter, if (adapter->fw_done_rc) { dev_err(dev, "Couldn't map long term buffer,rc = %d\n", adapter->fw_done_rc); + dma_free_coherent(dev, ltb->size, ltb->buff, ltb->addr); return -1; } return 0; -- 1.8.3.1
[PATCH 2/3] ibmvnic: Fix non-fatal firmware error reset
It is not necessary to disable interrupt lines here during a reset to handle a non-fatal firmware error. Move that call within the code block that handles the other cases that do require interrupts to be disabled and re-enabled. Signed-off-by: Thomas Falcon --- drivers/net/ethernet/ibm/ibmvnic.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index 9e08917..1b9c22f 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -1822,9 +1822,8 @@ static int do_reset(struct ibmvnic_adapter *adapter, if (rc) return rc; } + ibmvnic_disable_irqs(adapter); } - - ibmvnic_disable_irqs(adapter); adapter->state = VNIC_CLOSED; if (reset_state == VNIC_CLOSED) -- 1.8.3.1
[PATCH 3/3] ibmvnic: Fix statistics buffers memory leak
Move initialization of statistics buffers from ibmvnic_init function into ibmvnic_probe. In the current state, ibmvnic_init will be called again during a device reset, resulting in the allocation of new buffers without freeing the old ones. Signed-off-by: Thomas Falcon--- drivers/net/ethernet/ibm/ibmvnic.c | 24 +++- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index 1b9c22f..4bb4646 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -4586,14 +4586,6 @@ static int ibmvnic_init(struct ibmvnic_adapter *adapter) release_crq_queue(adapter); } - rc = init_stats_buffers(adapter); - if (rc) - return rc; - - rc = init_stats_token(adapter); - if (rc) - return rc; - return rc; } @@ -4662,13 +4654,21 @@ static int ibmvnic_probe(struct vio_dev *dev, const struct vio_device_id *id) goto ibmvnic_init_fail; } while (rc == EAGAIN); + rc = init_stats_buffers(adapter); + if (rc) + goto ibmvnic_init_fail; + + rc = init_stats_token(adapter); + if (rc) + goto ibmvnic_stats_fail; + netdev->mtu = adapter->req_mtu - ETH_HLEN; netdev->min_mtu = adapter->min_mtu - ETH_HLEN; netdev->max_mtu = adapter->max_mtu - ETH_HLEN; rc = device_create_file(>dev, _attr_failover); if (rc) - goto ibmvnic_init_fail; + goto ibmvnic_dev_file_err; netif_carrier_off(netdev); rc = register_netdev(netdev); @@ -4687,6 +4687,12 @@ static int ibmvnic_probe(struct vio_dev *dev, const struct vio_device_id *id) ibmvnic_register_fail: device_remove_file(>dev, _attr_failover); +ibmvnic_dev_file_err: + release_stats_token(adapter); + +ibmvnic_stats_fail: + release_stats_buffers(adapter); + ibmvnic_init_fail: release_sub_crqs(adapter, 1); release_crq_queue(adapter); -- 1.8.3.1
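The probe-time error unwinding this patch introduces follows the kernel's usual goto-label pattern: each later failure jumps to a label that releases everything initialised before it, in reverse order. A stand-alone sketch of that pattern (the init/release functions are stubs that just record call order so the unwind ordering is testable — this is not the ibmvnic code itself):

```c
#include <assert.h>
#include <string.h>

static char call_log[64];
static void record(const char *s) { strcat(call_log, s); }

/* Stand-ins for init_stats_buffers()/init_stats_token() and their
 * release counterparts; `fail` forces an error for testing. */
static int init_buffers(int fail) { if (fail) return -1; record("B+"); return 0; }
static int init_token(int fail)   { if (fail) return -1; record("T+"); return 0; }
static void release_buffers(void) { record("B-"); }

static int probe(int fail_at)
{
	if (init_buffers(fail_at == 1))
		goto init_fail;		/* nothing to unwind yet */
	if (init_token(fail_at == 2))
		goto stats_fail;	/* unwind only what succeeded */
	return 0;			/* fully initialised */

stats_fail:
	release_buffers();
init_fail:
	return -1;
}
```

The point of the ordering is that a failure in step N releases exactly steps 1..N-1 and nothing else — which is also why moving the stats allocations out of ibmvnic_init() (called again on every reset) and into probe, with matching labels, closes the leak.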
[PATCH 0/3] ibmvnic: Fix bugs and memory leaks
This is a small patch series fixing up some bugs and memory leaks in the ibmvnic driver. The first fix frees up previously allocated memory that should be freed in case of an error. The second fixes a reset case that was failing due to TX/RX queue IRQ's being erroneously disabled without being enabled again. The final patch fixes incorrect reallocation of statistics buffers during a device reset, resulting in loss of statistics information and a memory leak. Thomas Falcon (3): ibmvnic: Free coherent DMA memory if FW map failed ibmvnic: Fix non-fatal firmware error reset ibmvnic: Fix statistics buffers memory leak drivers/net/ethernet/ibm/ibmvnic.c | 28 +--- 1 file changed, 17 insertions(+), 11 deletions(-) -- 1.8.3.1
Re: [PATCH net-next v12 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc
On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen wrote: > + > +static struct Qdisc *cake_leaf(struct Qdisc *sch, unsigned long arg) > +{ > + return NULL; > +} > + > +static unsigned long cake_find(struct Qdisc *sch, u32 classid) > +{ > + return 0; > +} > + > +static void cake_walk(struct Qdisc *sch, struct qdisc_walker *arg) > +{ > +} Thanks for adding the support to other TC filters, it is much better now! A quick question: why class_ops->dump_stats is still NULL? It is supposed to dump the stats of each flow. Is there still any difficulty to map it to tc class? I thought you figured it out when you added the tcf_classify().
Re: [PATCH 1/3] sh_eth: add RGMII support
> > Hi Sergei > > > > What about > > PHY_INTERFACE_MODE_RGMII_ID, > > PHY_INTERFACE_MODE_RGMII_RXID, > > PHY_INTERFACE_MODE_RGMII_TXID, > > Oops, totally forgot about those... :-/ Everybody does. I keep intending to write an email template for this, and phy_interface_mode_is_rgmii() :-) Andrew
[PATCH v2 net] net/ipv4: Initialize proto and ports in flow struct
Updating the FIB tracepoint for the recent change to allow rules using the protocol and ports exposed a few places where the entries in the flow struct are not initialized. For __fib_validate_source add the call to fib4_rules_early_flow_dissect since it is invoked for the input path. For netfilter, add the memset on the flow struct to avoid future problems like this. In ip_route_input_slow need to set the fields if the skb dissection does not happen. Fixes: bfff4862653b ("net: fib_rules: support for match on ip_proto, sport and dport") Signed-off-by: David Ahern--- Have not seen any problems with the IPv6 version v2 - do not remove tracepoint in __fib_validate_source (sent the net-next version of this patch) - add set of ports and proto to ip_route_input_slow if skb dissect is not done net/ipv4/fib_frontend.c | 8 +++- net/ipv4/netfilter/ipt_rpfilter.c | 2 +- net/ipv4/route.c | 7 ++- 3 files changed, 14 insertions(+), 3 deletions(-) diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index f05afaf3235c..4d622112bf95 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -326,10 +326,11 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, u8 tos, int oif, struct net_device *dev, int rpf, struct in_device *idev, u32 *itag) { + struct net *net = dev_net(dev); + struct flow_keys flkeys; int ret, no_addr; struct fib_result res; struct flowi4 fl4; - struct net *net = dev_net(dev); bool dev_match; fl4.flowi4_oif = 0; @@ -347,6 +348,11 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, no_addr = idev->ifa_list == NULL; fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? 
skb->mark : 0; + if (!fib4_rules_early_flow_dissect(net, skb, &fl4, &flkeys)) { + fl4.flowi4_proto = 0; + fl4.fl4_sport = 0; + fl4.fl4_dport = 0; + } trace_fib_validate_source(dev, &fl4); diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c index fd01f13c896a..12843c9ef142 100644 --- a/net/ipv4/netfilter/ipt_rpfilter.c +++ b/net/ipv4/netfilter/ipt_rpfilter.c @@ -89,10 +89,10 @@ static bool rpfilter_mt(const struct sk_buff *skb, struct xt_action_param *par) return true ^ invert; } + memset(&flow, 0, sizeof(flow)); flow.flowi4_iif = LOOPBACK_IFINDEX; flow.daddr = iph->saddr; flow.saddr = rpfilter_get_saddr(iph->daddr); - flow.flowi4_oif = 0; flow.flowi4_mark = info->flags & XT_RPFILTER_VALID_MARK ? skb->mark : 0; flow.flowi4_tos = RT_TOS(iph->tos); flow.flowi4_scope = RT_SCOPE_UNIVERSE; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 29268efad247..2cfa1b518f8d 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1961,8 +1961,13 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr, fl4.saddr = saddr; fl4.flowi4_uid = sock_net_uid(net, NULL); - if (fib4_rules_early_flow_dissect(net, skb, &fl4, &_flkeys)) + if (fib4_rules_early_flow_dissect(net, skb, &fl4, &_flkeys)) { flkeys = &_flkeys; + } else { + fl4.flowi4_proto = 0; + fl4.fl4_sport = 0; + fl4.fl4_dport = 0; + } err = fib_lookup(net, &fl4, res, 0); if (err != 0) { -- 2.11.0
Re: [PATCH 1/3] sh_eth: add RGMII support
On 05/16/2018 11:30 PM, Andrew Lunn wrote: >> The R-Car V3H (AKA R8A77980) GEther controller adds support for the RGMII >> PHY interface mode as a new value for the RMII_MII register. >> >> Based on the original (and large) patch by Vladimir Barinov. >> >> Signed-off-by: Vladimir Barinov>> Signed-off-by: Sergei Shtylyov >> >> --- >> drivers/net/ethernet/renesas/sh_eth.c |3 +++ >> 1 file changed, 3 insertions(+) >> >> Index: net-next/drivers/net/ethernet/renesas/sh_eth.c >> === >> --- net-next.orig/drivers/net/ethernet/renesas/sh_eth.c >> +++ net-next/drivers/net/ethernet/renesas/sh_eth.c >> @@ -466,6 +466,9 @@ static void sh_eth_select_mii(struct net >> u32 value; >> >> switch (mdp->phy_interface) { >> +case PHY_INTERFACE_MODE_RGMII: >> +value = 0x3; >> +break; > > Hi Sergei > > What about > PHY_INTERFACE_MODE_RGMII_ID, > PHY_INTERFACE_MODE_RGMII_RXID, > PHY_INTERFACE_MODE_RGMII_TXID, Oops, totally forgot about those... :-/ > Andrew MBR, Sergei
Re: [PATCH 1/3] sh_eth: add RGMII support
On Wed, May 16, 2018 at 10:56:45PM +0300, Sergei Shtylyov wrote: > The R-Car V3H (AKA R8A77980) GEther controller adds support for the RGMII > PHY interface mode as a new value for the RMII_MII register. > > Based on the original (and large) patch by Vladimir Barinov. > > Signed-off-by: Vladimir Barinov> Signed-off-by: Sergei Shtylyov > > --- > drivers/net/ethernet/renesas/sh_eth.c |3 +++ > 1 file changed, 3 insertions(+) > > Index: net-next/drivers/net/ethernet/renesas/sh_eth.c > === > --- net-next.orig/drivers/net/ethernet/renesas/sh_eth.c > +++ net-next/drivers/net/ethernet/renesas/sh_eth.c > @@ -466,6 +466,9 @@ static void sh_eth_select_mii(struct net > u32 value; > > switch (mdp->phy_interface) { > + case PHY_INTERFACE_MODE_RGMII: > + value = 0x3; > + break; Hi Sergei What about PHY_INTERFACE_MODE_RGMII_ID, PHY_INTERFACE_MODE_RGMII_RXID, PHY_INTERFACE_MODE_RGMII_TXID, Andrew
Re: [PATCH net-next v3 1/3] ipv4: support sport, dport and ip_proto in RTM_GETROUTE
On Wed, May 16, 2018 at 11:37 AM, David Miller wrote: > From: Roopa Prabhu > Date: Tue, 15 May 2018 20:55:06 -0700 > >> +static int inet_rtm_getroute_reply(struct sk_buff *in_skb, struct nlmsghdr >> *nlh, >> +__be32 dst, __be32 src, struct flowi4 *fl4, >> +struct rtable *rt, struct fib_result *res) >> +{ >> + struct net *net = sock_net(in_skb->sk); >> + struct rtmsg *rtm = nlmsg_data(nlh); >> + u32 table_id = RT_TABLE_MAIN; >> + struct sk_buff *skb; >> + int err = 0; >> + >> + skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC); >> + if (!skb) >> + return -ENOMEM; > > If the caller can use GFP_KERNEL, so can this allocation. Yes, but we hold the RCU read lock before calling the reply function for the fib result. I did consider allocating the skb before the read lock... but then the refactoring (into a separate netlink reply func) would seem unnecessary. I am fine with pre-allocating and undoing the refactoring if that works better.
[PATCH net-next v12 5/7] sch_cake: Add DiffServ handling
This adds support for DiffServ-based priority queueing to CAKE. If the shaper is in use, each priority tier gets its own virtual clock, which limits that tier's rate to a fraction of the overall shaped rate, to discourage trying to game the priority mechanism. CAKE defaults to a simple, three-tier mode that interprets most code points as "best effort", but places CS1 traffic into a low-priority "bulk" tier which is assigned 1/16 of the total rate, and a few code points indicating latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7) into a "latency sensitive" high-priority tier, which is assigned 1/4 rate. The other supported DiffServ modes are a 4-tier mode matching the 802.11e precedence rules, as well as two 8-tier modes, one of which implements strict precedence of the eight priority levels. This commit also adds an optional DiffServ 'wash' mode, which will zero out the DSCP fields of any packet passing through CAKE. While this can technically be done with other mechanisms in the kernel, having the feature available in CAKE significantly decreases configuration complexity; and the implementation cost is low on top of the other DiffServ-handling code. Filters and applications can set the skb->priority field to override the DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches CAKE's qdisc handle, the minor number will be interpreted as a priority tier if it is less than or equal to the number of configured priority tiers. 
Signed-off-by: Toke Høiland-Jørgensen--- net/sched/sch_cake.c | 407 +- 1 file changed, 401 insertions(+), 6 deletions(-) diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c index e1038a7b6686..f0f94d536e51 100644 --- a/net/sched/sch_cake.c +++ b/net/sched/sch_cake.c @@ -297,6 +297,68 @@ static void cobalt_set_enqueue_time(struct sk_buff *skb, static u16 quantum_div[CAKE_QUEUES + 1] = {0}; +/* Diffserv lookup tables */ + +static const u8 precedence[] = { + 0, 0, 0, 0, 0, 0, 0, 0, + 1, 1, 1, 1, 1, 1, 1, 1, + 2, 2, 2, 2, 2, 2, 2, 2, + 3, 3, 3, 3, 3, 3, 3, 3, + 4, 4, 4, 4, 4, 4, 4, 4, + 5, 5, 5, 5, 5, 5, 5, 5, + 6, 6, 6, 6, 6, 6, 6, 6, + 7, 7, 7, 7, 7, 7, 7, 7, +}; + +static const u8 diffserv8[] = { + 2, 5, 1, 2, 4, 2, 2, 2, + 0, 2, 1, 2, 1, 2, 1, 2, + 5, 2, 4, 2, 4, 2, 4, 2, + 3, 2, 3, 2, 3, 2, 3, 2, + 6, 2, 3, 2, 3, 2, 3, 2, + 6, 2, 2, 2, 6, 2, 6, 2, + 7, 2, 2, 2, 2, 2, 2, 2, + 7, 2, 2, 2, 2, 2, 2, 2, +}; + +static const u8 diffserv4[] = { + 0, 2, 0, 0, 2, 0, 0, 0, + 1, 0, 0, 0, 0, 0, 0, 0, + 2, 0, 2, 0, 2, 0, 2, 0, + 2, 0, 2, 0, 2, 0, 2, 0, + 3, 0, 2, 0, 2, 0, 2, 0, + 3, 0, 0, 0, 3, 0, 3, 0, + 3, 0, 0, 0, 0, 0, 0, 0, + 3, 0, 0, 0, 0, 0, 0, 0, +}; + +static const u8 diffserv3[] = { + 0, 0, 0, 0, 2, 0, 0, 0, + 1, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 2, 0, 2, 0, + 2, 0, 0, 0, 0, 0, 0, 0, + 2, 0, 0, 0, 0, 0, 0, 0, +}; + +static const u8 besteffort[] = { + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 0, 0, 0, 0, 0, 0, +}; + +/* tin priority order for stats dumping */ + +static const u8 normal_order[] = {0, 1, 2, 3, 4, 5, 6, 7}; +static const u8 bulk_order[] = {1, 0, 2, 3}; + #define REC_INV_SQRT_CACHE (16) static u32 cobalt_rec_inv_sqrt_cache[REC_INV_SQRT_CACHE] = {0}; @@ -1219,6 +1281,46 @@ static unsigned int cake_drop(struct Qdisc *sch, struct 
sk_buff **to_free) return idx + (tin << 16); } +static void cake_wash_diffserv(struct sk_buff *skb) +{ + switch (skb->protocol) { + case htons(ETH_P_IP): + ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0); + break; + case htons(ETH_P_IPV6): + ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0); + break; + default: + break; + } +} + +static u8 cake_handle_diffserv(struct sk_buff *skb, u16 wash) +{ + u8 dscp; + + switch (skb->protocol) { + case htons(ETH_P_IP): + dscp = ipv4_get_dsfield(ip_hdr(skb)) >> 2; + if (wash && dscp) + ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0); + return dscp; + + case htons(ETH_P_IPV6): + dscp = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2; + if (wash && dscp) + ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0); + return dscp; + + case htons(ETH_P_ARP): + return 0x38; /* CS7 - Net Control */ + + default: + /* If there is no Diffserv field, treat
[PATCH net-next v12 7/7] sch_cake: Conditionally split GSO segments
At lower bandwidths, the transmission time of a single GSO segment can add an unacceptable amount of latency due to HOL blocking. Furthermore, with a software shaper, any tuning mechanism employed by the kernel to control the maximum size of GSO segments is thrown off by the artificial limit on bandwidth. For this reason, we split GSO segments into their individual packets iff the shaper is active and configured to a bandwidth <= 1 Gbps. Signed-off-by: Toke Høiland-Jørgensen--- net/sched/sch_cake.c | 99 +- 1 file changed, 73 insertions(+), 26 deletions(-) diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c index 1ce81d919f73..dca276806e9f 100644 --- a/net/sched/sch_cake.c +++ b/net/sched/sch_cake.c @@ -82,6 +82,7 @@ #define CAKE_QUEUES (1024) #define CAKE_FLOW_MASK 63 #define CAKE_FLOW_NAT_FLAG 64 +#define CAKE_SPLIT_GSO_THRESHOLD (12500) /* 1Gbps */ /* struct cobalt_params - contains codel and blue parameters * @interval: codel initial drop rate @@ -1474,36 +1475,73 @@ static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch, if (unlikely(len > b->max_skblen)) b->max_skblen = len; - cobalt_set_enqueue_time(skb, now); - get_cobalt_cb(skb)->adjusted_len = cake_overhead(q, skb); - flow_queue_add(flow, skb); - - if (q->ack_filter) - ack = cake_ack_filter(q, flow); + if (skb_is_gso(skb) && q->rate_flags & CAKE_FLAG_SPLIT_GSO) { + struct sk_buff *segs, *nskb; + netdev_features_t features = netif_skb_features(skb); + unsigned int slen = 0; + + segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK); + if (IS_ERR_OR_NULL(segs)) + return qdisc_drop(skb, sch, to_free); + + while (segs) { + nskb = segs->next; + segs->next = NULL; + qdisc_skb_cb(segs)->pkt_len = segs->len; + cobalt_set_enqueue_time(segs, now); + get_cobalt_cb(segs)->adjusted_len = cake_overhead(q, + segs); + flow_queue_add(flow, segs); + + sch->q.qlen++; + slen += segs->len; + q->buffer_used += segs->truesize; + b->packets++; + segs = nskb; + } - if (ack) { - b->ack_drops++; - 
sch->qstats.drops++; - b->bytes += qdisc_pkt_len(ack); - len -= qdisc_pkt_len(ack); - q->buffer_used += skb->truesize - ack->truesize; - if (q->rate_flags & CAKE_FLAG_INGRESS) - cake_advance_shaper(q, b, ack, now, true); + /* stats */ + b->bytes+= slen; + b->backlogs[idx]+= slen; + b->tin_backlog += slen; + sch->qstats.backlog += slen; + q->avg_window_bytes += slen; - qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(ack)); - consume_skb(ack); + qdisc_tree_reduce_backlog(sch, 1, len); + consume_skb(skb); } else { - sch->q.qlen++; - q->buffer_used += skb->truesize; - } + /* not splitting */ + cobalt_set_enqueue_time(skb, now); + get_cobalt_cb(skb)->adjusted_len = cake_overhead(q, skb); + flow_queue_add(flow, skb); + + if (q->ack_filter) + ack = cake_ack_filter(q, flow); + + if (ack) { + b->ack_drops++; + sch->qstats.drops++; + b->bytes += qdisc_pkt_len(ack); + len -= qdisc_pkt_len(ack); + q->buffer_used += skb->truesize - ack->truesize; + if (q->rate_flags & CAKE_FLAG_INGRESS) + cake_advance_shaper(q, b, ack, now, true); + + qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(ack)); + consume_skb(ack); + } else { + sch->q.qlen++; + q->buffer_used += skb->truesize; + } - /* stats */ - b->packets++; - b->bytes+= len; - b->backlogs[idx]+= len; - b->tin_backlog += len; - sch->qstats.backlog += len; - q->avg_window_bytes += len; + /* stats */ + b->packets++; + b->bytes+= len; + b->backlogs[idx]+= len; + b->tin_backlog += len; + sch->qstats.backlog += len; + q->avg_window_bytes += len; + } if
[PATCH net-next v12 3/7] sch_cake: Add optional ACK filter
The ACK filter is an optional feature of CAKE which is designed to improve performance on links with very asymmetrical rate limits. On such links (which are unfortunately quite prevalent, especially for DSL and cable subscribers), the downstream throughput can be limited by the number of ACKs capable of being transmitted in the *upstream* direction. Filtering ACKs can, in general, have adverse effects on TCP performance because it interferes with ACK clocking (especially in slow start), and it reduces the flow's resiliency to ACKs being dropped further along the path. To alleviate these drawbacks, the ACK filter in CAKE tries its best to always keep enough ACKs queued to ensure forward progress in the TCP flow being filtered. It does this by only filtering redundant ACKs. In its default 'conservative' mode, the filter will always keep at least two redundant ACKs in the queue, while in 'aggressive' mode, it will filter down to a single ACK. The ACK filter works by inspecting the per-flow queue on every packet enqueue. Starting at the head of the queue, the filter looks for another eligible packet to drop (so the ACK being dropped is always closer to the head of the queue than the packet being enqueued). An ACK is eligible only if it ACKs *fewer* cumulative bytes than the new packet being enqueued. This prevents duplicate ACKs from being filtered (unless SACK options are also present), to avoid interfering with retransmission logic. In aggressive mode, an eligible packet is always dropped, while in conservative mode, at least two ACKs are kept in the queue. Only pure ACKs (with no data segments) are considered eligible for dropping, but when an ACK with data segments is enqueued, this can cause another pure ACK to become eligible for dropping. The approach described above ensures that this ACK filter avoids most of the drawbacks of a naive filtering mechanism that only keeps flow state but does not inspect the queue.
This is the rationale for including the ACK filter in CAKE itself rather than as a separate module (as the TC filter, for instance). Our performance evaluation has shown that on a 30/1 Mbps link with a bidirectional traffic test (RRUL), turning on the ACK filter on the upstream link improves downstream throughput by ~20% (both modes) and upstream throughput by ~12% in conservative mode and ~40% in aggressive mode, at the cost of ~5ms of inter-flow latency due to the increased congestion. In *really* pathological cases, the effect can be a lot more; for instance, the ACK filter increases the achievable downstream throughput on a link with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5 Mbps to ~25 Mbps). Finally, even though we consider the ACK filter to be safer than most, we do not recommend turning it on everywhere: on more symmetrical link bandwidths the effect is negligible at best. Signed-off-by: Toke Høiland-Jørgensen--- net/sched/sch_cake.c | 260 ++ 1 file changed, 258 insertions(+), 2 deletions(-) diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c index d515f18f8460..65439b643c92 100644 --- a/net/sched/sch_cake.c +++ b/net/sched/sch_cake.c @@ -755,6 +755,239 @@ static void flow_queue_add(struct cake_flow *flow, struct sk_buff *skb) skb->next = NULL; } +static struct iphdr *cake_get_iphdr(const struct sk_buff *skb, + struct ipv6hdr *buf) +{ + unsigned int offset = skb_network_offset(skb); + struct iphdr *iph; + + iph = skb_header_pointer(skb, offset, sizeof(struct iphdr), buf); + + if (!iph) + return NULL; + + if (iph->version == 4 && iph->protocol == IPPROTO_IPV6) + return skb_header_pointer(skb, offset + iph->ihl * 4, + sizeof(struct ipv6hdr), buf); + + else if (iph->version == 4) + return iph; + + else if (iph->version == 6) + return skb_header_pointer(skb, offset, sizeof(struct ipv6hdr), + buf); + + return NULL; +} + +static struct tcphdr *cake_get_tcphdr(const struct sk_buff *skb, + void *buf, unsigned int bufsize) +{ + 
unsigned int offset = skb_network_offset(skb); + const struct ipv6hdr *ipv6h; + const struct tcphdr *tcph; + const struct iphdr *iph; + struct ipv6hdr _ipv6h; + struct tcphdr _tcph; + + ipv6h = skb_header_pointer(skb, offset, sizeof(_ipv6h), &_ipv6h); + + if (!ipv6h) + return NULL; + + if (ipv6h->version == 4) { + iph = (struct iphdr *)ipv6h; + offset += iph->ihl * 4; + + /* special-case 6in4 tunnelling, as that is a common way to get +* v6 connectivity in the home +*/ + if (iph->protocol == IPPROTO_IPV6) { + ipv6h =
[PATCH net-next v12 6/7] sch_cake: Add overhead compensation support to the rate shaper
This commit adds configurable overhead compensation support to the rate shaper. With this feature, userspace can configure the actual bottleneck link overhead and encapsulation mode used, which will be used by the shaper to calculate the precise duration of each packet on the wire. This feature is needed because CAKE is often deployed one or two hops upstream of the actual bottleneck (which can be, e.g., inside a DSL or cable modem). In this case, the link layer characteristics and overhead reported by the kernel do not match the actual bottleneck. Being able to set the actual values in use makes it possible to configure the shaper rate much closer to the actual bottleneck rate (our experience shows it is possible to get within 0.1% of the actual physical bottleneck rate), thus keeping latency low without sacrificing bandwidth. The overhead compensation has three tunables: A fixed per-packet overhead size (which, if set, will be accounted from the IP packet header), a minimum packet size (MPU) and a framing mode supporting either ATM or PTM framing. We include a set of common keywords in TC to help users configure the right parameters. If no overhead value is set, the value reported by the kernel is used.
Signed-off-by: Toke Høiland-Jørgensen--- net/sched/sch_cake.c | 124 ++ 1 file changed, 123 insertions(+), 1 deletion(-) diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c index f0f94d536e51..1ce81d919f73 100644 --- a/net/sched/sch_cake.c +++ b/net/sched/sch_cake.c @@ -271,6 +271,7 @@ enum { struct cobalt_skb_cb { ktime_t enqueue_time; + u32 adjusted_len; }; static u64 us_to_ns(u64 us) @@ -1120,6 +1121,88 @@ static u64 cake_ewma(u64 avg, u64 sample, u32 shift) return avg; } +static u32 cake_calc_overhead(struct cake_sched_data *q, u32 len, u32 off) +{ + if (q->rate_flags & CAKE_FLAG_OVERHEAD) + len -= off; + + if (q->max_netlen < len) + q->max_netlen = len; + if (q->min_netlen > len) + q->min_netlen = len; + + len += q->rate_overhead; + + if (len < q->rate_mpu) + len = q->rate_mpu; + + if (q->atm_mode == CAKE_ATM_ATM) { + len += 47; + len /= 48; + len *= 53; + } else if (q->atm_mode == CAKE_ATM_PTM) { + /* Add one byte per 64 bytes or part thereof. +* This is conservative and easier to calculate than the +* precise value. 
+*/ + len += (len + 63) / 64; + } + + if (q->max_adjlen < len) + q->max_adjlen = len; + if (q->min_adjlen > len) + q->min_adjlen = len; + + return len; +} + +static u32 cake_overhead(struct cake_sched_data *q, const struct sk_buff *skb) +{ + const struct skb_shared_info *shinfo = skb_shinfo(skb); + unsigned int hdr_len, last_len = 0; + u32 off = skb_network_offset(skb); + u32 len = qdisc_pkt_len(skb); + u16 segs = 1; + + q->avg_netoff = cake_ewma(q->avg_netoff, off << 16, 8); + + if (!shinfo->gso_size) + return cake_calc_overhead(q, len, off); + + /* borrowed from qdisc_pkt_len_init() */ + hdr_len = skb_transport_header(skb) - skb_mac_header(skb); + + /* + transport layer */ + if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | + SKB_GSO_TCPV6))) { + const struct tcphdr *th; + struct tcphdr _tcphdr; + + th = skb_header_pointer(skb, skb_transport_offset(skb), + sizeof(_tcphdr), &_tcphdr); + if (likely(th)) + hdr_len += __tcp_hdrlen(th); + } else { + struct udphdr _udphdr; + + if (skb_header_pointer(skb, skb_transport_offset(skb), + sizeof(_udphdr), &_udphdr)) + hdr_len += sizeof(struct udphdr); + } + + if (unlikely(shinfo->gso_type & SKB_GSO_DODGY)) + segs = DIV_ROUND_UP(skb->len - hdr_len, + shinfo->gso_size); + else + segs = shinfo->gso_segs; + + len = shinfo->gso_size + hdr_len; + last_len = skb->len - shinfo->gso_size * (segs - 1); + + return (cake_calc_overhead(q, len, off) * (segs - 1) + + cake_calc_overhead(q, last_len, off)); +} + static void cake_heap_swap(struct cake_sched_data *q, u16 i, u16 j) { struct cake_heap_entry ii = q->overflow_heap[i]; @@ -1197,7 +1280,7 @@ static int cake_advance_shaper(struct cake_sched_data *q, struct sk_buff *skb, ktime_t now, bool drop) { - u32 len = qdisc_pkt_len(skb); + u32 len = get_cobalt_cb(skb)->adjusted_len; /* charge packet bandwidth to
[PATCH net-next v12 4/7] sch_cake: Add NAT awareness to packet classifier
When CAKE is deployed on a gateway that also performs NAT (which is a common deployment mode), the host fairness mechanism cannot distinguish internal hosts from each other, and so fails to work correctly. To fix this, we add an optional NAT awareness mode, which will query the kernel conntrack mechanism to obtain the pre-NAT addresses for each packet and use that in the flow and host hashing. When the shaper is enabled and the host is already performing NAT, the cost of this lookup is negligible. However, in unlimited mode with no NAT being performed, there is a significant CPU cost at higher bandwidths. For this reason, the feature is turned off by default. Signed-off-by: Toke Høiland-Jørgensen--- net/sched/sch_cake.c | 73 ++ 1 file changed, 73 insertions(+) diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c index 65439b643c92..e1038a7b6686 100644 --- a/net/sched/sch_cake.c +++ b/net/sched/sch_cake.c @@ -71,6 +71,12 @@ #include #include +#if IS_REACHABLE(CONFIG_NF_CONNTRACK) +#include <net/netfilter/nf_conntrack_core.h> +#include <net/netfilter/nf_conntrack_zones.h> +#include <net/netfilter/nf_conntrack.h> +#endif + #define CAKE_SET_WAYS (8) #define CAKE_MAX_TINS (8) #define CAKE_QUEUES (1024) @@ -514,6 +520,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars, return drop; } +#if IS_REACHABLE(CONFIG_NF_CONNTRACK) + +static void cake_update_flowkeys(struct flow_keys *keys, +const struct sk_buff *skb) +{ + const struct nf_conntrack_tuple *tuple; + enum ip_conntrack_info ctinfo; + struct nf_conn *ct; + bool rev = false; + + if (tc_skb_protocol(skb) != htons(ETH_P_IP)) + return; + + ct = nf_ct_get(skb, &ctinfo); + if (ct) { + tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo)); + } else { + const struct nf_conntrack_tuple_hash *hash; + struct nf_conntrack_tuple srctuple; + + if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb), + NFPROTO_IPV4, dev_net(skb->dev), + &srctuple)) + return; + + hash = nf_conntrack_find_get(dev_net(skb->dev), +&nf_ct_zone_dflt, +&srctuple); + if (!hash) + return; + + rev = true; + ct = nf_ct_tuplehash_to_ctrack(hash); + tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir); + } + + keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip; + keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip; + + if (keys->ports.ports) { + keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all; + keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all; + } + if (rev) + nf_ct_put(ct); +} +#else +static void cake_update_flowkeys(struct flow_keys *keys, +const struct sk_buff *skb) +{ + /* There is nothing we can do here without CONNTRACK */ +} +#endif + /* Cake has several subtle multiple bit settings. In these cases you * would be matching triple isolate mode as well. */ @@ -541,6 +601,9 @@ static u32 cake_hash(struct cake_tin_data *q, const struct sk_buff *skb, skb_flow_dissect_flow_keys(skb, &keys, FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL); + if (flow_mode & CAKE_FLOW_NAT_FLAG) + cake_update_flowkeys(&keys, skb); + /* flow_hash_from_keys() sorts the addresses by value, so we have * to preserve their order in a separate data structure to treat * src and dst host addresses as independently selectable. @@ -1727,6 +1790,12 @@ static int cake_change(struct Qdisc *sch, struct nlattr *opt, q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) & CAKE_FLOW_MASK); + if (tb[TCA_CAKE_NAT]) { + q->flow_mode &= ~CAKE_FLOW_NAT_FLAG; + q->flow_mode |= CAKE_FLOW_NAT_FLAG * + !!nla_get_u32(tb[TCA_CAKE_NAT]); + } + if (tb[TCA_CAKE_RTT]) { q->interval = nla_get_u32(tb[TCA_CAKE_RTT]); @@ -1892,6 +1961,10 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff *skb) if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter)) goto nla_put_failure; + if (nla_put_u32(skb, TCA_CAKE_NAT, + !!(q->flow_mode & CAKE_FLOW_NAT_FLAG))) + goto nla_put_failure; + return nla_nest_end(skb, opts); nla_put_failure:
[PATCH net-next v12 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc
sch_cake targets the home router use case and is intended to squeeze the most bandwidth and latency out of even the slowest ISP links and routers, while presenting an API simple enough that even an ISP can configure it. Example of use on a cable ISP uplink: tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter To shape a cable download link (ifb and tc-mirred setup elided) tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash CAKE is filled with: * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel derived Flow Queuing system, which autoconfigures based on the bandwidth. * A novel "triple-isolate" mode (the default) which balances per-host and per-flow FQ even through NAT. * A deficit-based shaper, that can also be used in an unlimited mode. * 8-way set-associative hashing to reduce flow collisions to a minimum. * A reasonable interpretation of various diffserv latency/loss tradeoffs. * Support for zeroing diffserv markings for entering and exiting traffic. * Support for interacting well with Docsis 3.0 shaper framing. * Extensive support for DSL framing types. * Support for ack filtering. * Extensive statistics for measuring loss, ECN markings, and latency variation. A paper describing the design of CAKE is available at https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN). This patch adds the base shaper and packet scheduler, while subsequent commits add the optional (configurable) features. The full userspace API and most data structures are included in this commit, but options not understood in the base version will be ignored. Various versions have been baking as an out of tree build for kernel versions going back to 3.10, as the embedded router world has been running a few years behind mainline Linux. A stable version has been generally available on lede-17.01 and later.
sch_cake replaces a combination of iptables, tc filter, htb and fq_codel in the sqm-scripts, with sane defaults and vastly simpler configuration. CAKE's principal author is Jonathan Morton, with contributions from Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller, Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht, and Loganaden Velvindron. Testing from Pete Heist, Georgios Amanakis, and the many other members of the c...@lists.bufferbloat.net mailing list.

tc -s qdisc show dev eth2
qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 100.0ms raw overhead 0
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 0b of 500b
 capacity estimate: 100Mbit
 min/max network layer size: 65535 / 0
 min/max overhead-adjusted size: 65535 / 0
 average network hdr offset: 0
              Bulk  Best Effort    Voice
 thresh    6250Kbit     100Mbit   25Mbit
 target       5.0ms       5.0ms    5.0ms
 interval   100.0ms     100.0ms  100.0ms
 pk_delay       0us         0us      0us
 av_delay       0us         0us      0us
 sp_delay       0us         0us      0us
 pkts             0           0        0
 bytes            0           0        0
 way_inds         0           0        0
 way_miss         0           0        0
 way_cols         0           0        0
 drops            0           0        0
 marks            0           0        0
 ack_drop         0           0        0
 sp_flows         0           0        0
 bk_flows         0           0        0
 un_flows         0           0        0
 max_len          0           0        0
 quantum        300        1514      762

Tested-by: Pete Heist Tested-by: Georgios Amanakis Signed-off-by: Dave Taht Signed-off-by: Toke Høiland-Jørgensen --- include/uapi/linux/pkt_sched.h | 105 ++ net/sched/Kconfig | 11 net/sched/Makefile |1 net/sched/sch_cake.c | 1739 4 files changed, 1856 insertions(+) create mode 100644 net/sched/sch_cake.c diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h index 37b5096ae97b..883e84f008d7 100644 --- a/include/uapi/linux/pkt_sched.h +++ b/include/uapi/linux/pkt_sched.h @@ -934,4 +934,109 @@ enum { #define TCA_CBS_MAX (__TCA_CBS_MAX - 1) +/* CAKE */ +enum { + TCA_CAKE_UNSPEC, + TCA_CAKE_BASE_RATE64, + TCA_CAKE_DIFFSERV_MODE, + TCA_CAKE_ATM, + TCA_CAKE_FLOW_MODE, + TCA_CAKE_OVERHEAD, + TCA_CAKE_RTT, + TCA_CAKE_TARGET, + TCA_CAKE_AUTORATE, + TCA_CAKE_MEMORY, + TCA_CAKE_NAT, + 
TCA_CAKE_RAW, + TCA_CAKE_WASH, + TCA_CAKE_MPU, +
[PATCH net-next v12 0/7] sched: Add Common Applications Kept Enhanced (cake) qdisc
This patch series adds the CAKE qdisc, and has been split up to ease review. I have attempted to split out each configurable feature into its own patch. The first commit adds the base shaper and packet scheduler, while subsequent commits add the optional features. The full userspace API and most data structures are included in this commit, but options not understood in the base version will be ignored. The result of applying the entire series is identical to the out of tree version that has seen extensive testing in previous deployments, most notably as an out of tree patch to OpenWrt. However, note that I have only compile tested the individual patches; so the whole series should be considered as a unit. --- Changelog v12: - Get rid of custom time typedefs. Use ktime_t for time and u64 for duration instead. v11: - Fix overhead compensation calculation for GSO packets - Change configured rate to be u64 (I ran out of bits before I ran out of CPU when testing the effects of the above) v10: - Christmas tree gardening (fix variable declarations to be in reverse line length order) v9: - Remove duplicated checks around kvfree() and just call it unconditionally. - Don't pass __GFP_NOWARN when allocating memory - Move options in cake_dump() that are related to optional features to later patches implementing the features. - Support attaching filters to the qdisc and use the classification result to select flow queue. - Support overriding diffserv priority tin from skb->priority v8: - Remove inline keyword from function definitions - Simplify ACK filter; remove the complex state handling to make the logic easier to follow. This will potentially be a bit less efficient, but I have not been able to measure a difference. v7: - Split up patch into a series to ease review. - Constify the ACK filter. v6: - Fix 6in4 encapsulation checks in ACK filter code - Checkpatch fixes v5: - Refactor ACK filter code and hopefully fix the safety issues properly this time.
v4: - Only split GSO packets if shaping at speeds <= 1Gbps - Fix overhead calculation code to also work for GSO packets - Don't re-implement kvzalloc() - Remove local header include from out-of-tree build (fixes kbuild-bot complaint). - Several fixes to the ACK filter: - Check pskb_may_pull() before deref of transport headers. - Don't run ACK filter logic on split GSO packets - Fix TCP sequence number compare to deal with wraparounds v3: - Use IS_REACHABLE() macro to fix compilation when sch_cake is built-in and conntrack is a module. - Switch the stats output to use nested netlink attributes instead of a versioned struct. - Remove GPL boilerplate. - Fix array initialisation style. v2: - Fix kbuild test bot complaint - Clean up the netlink ABI - Fix checkpatch complaints - A few tweaks to the behaviour of cake based on testing carried out while writing the paper. --- Toke Høiland-Jørgensen (7): sched: Add Common Applications Kept Enhanced (cake) qdisc sch_cake: Add ingress mode sch_cake: Add optional ACK filter sch_cake: Add NAT awareness to packet classifier sch_cake: Add DiffServ handling sch_cake: Add overhead compensation support to the rate shaper sch_cake: Conditionally split GSO segments include/uapi/linux/pkt_sched.h | 105 ++ net/sched/Kconfig | 11 net/sched/Makefile |1 net/sched/sch_cake.c | 2709 4 files changed, 2826 insertions(+) create mode 100644 net/sched/sch_cake.c
[PATCH net-next v12 2/7] sch_cake: Add ingress mode
The ingress mode is meant to be enabled when CAKE runs downlink of the
actual bottleneck (such as on an IFB device). The mode changes the
shaper to also account dropped packets to the shaped rate, as these have
already traversed the bottleneck.

Enabling ingress mode will also tune the AQM to always keep at least two
packets queued *for each flow*. This is done by scaling the minimum
queue occupancy level that will disable the AQM by the number of active
bulk flows. The rationale for this is that retransmits are more
expensive in ingress mode, since dropped packets have to traverse the
bottleneck again when they are retransmitted; thus, being more lenient
and keeping a minimum number of packets queued will improve throughput
in cases where the number of active flows is so large that they saturate
the bottleneck even at their minimum window size.

This commit also adds a separate switch to enable ingress mode rate
autoscaling. If enabled, the autoscaling code will observe the actual
traffic rate and adjust the shaper rate to match it. This can help avoid
latency increases in the case where the actual bottleneck rate decreases
below the shaped rate. Short-term spikes in the rate estimate are
smoothed out by an EWMA filter.
Signed-off-by: Toke Høiland-Jørgensen
---
 net/sched/sch_cake.c | 85 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 81 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 422cfccbf37f..d515f18f8460 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -433,7 +433,8 @@ static bool cobalt_queue_empty(struct cobalt_vars *vars,
 static bool cobalt_should_drop(struct cobalt_vars *vars,
 			       struct cobalt_params *p,
 			       ktime_t now,
-			       struct sk_buff *skb)
+			       struct sk_buff *skb,
+			       u32 bulk_flows)
 {
 	bool next_due, over_target, drop = false;
 	ktime_t schedule;
@@ -457,6 +458,7 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
 	sojourn = ktime_to_ns(ktime_sub(now, cobalt_get_enqueue_time(skb)));
 	schedule = ktime_sub(now, vars->drop_next);
 	over_target = sojourn > p->target &&
+		      sojourn > p->mtu_time * bulk_flows * 2 &&
 		      sojourn > p->mtu_time * 4;
 	next_due = vars->count && schedule >= 0;
@@ -910,6 +912,9 @@ static unsigned int cake_drop(struct Qdisc *sch, struct sk_buff **to_free)
 	b->tin_dropped++;
 	sch->qstats.drops++;
 
+	if (q->rate_flags & CAKE_FLAG_INGRESS)
+		cake_advance_shaper(q, b, skb, now, true);
+
 	__qdisc_drop(skb, to_free);
 	sch->q.qlen--;
@@ -986,8 +991,46 @@ static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		cake_heapify_up(q, b->overflow_idx[idx]);
 
 	/* incoming bandwidth capacity estimate */
-	q->avg_window_bytes = 0;
-	q->last_packet_time = now;
+	if (q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS) {
+		u64 packet_interval = \
+			ktime_to_ns(ktime_sub(now, q->last_packet_time));
+
+		if (packet_interval > NSEC_PER_SEC)
+			packet_interval = NSEC_PER_SEC;
+
+		/* filter out short-term bursts, eg. wifi aggregation */
+		q->avg_packet_interval = \
+			cake_ewma(q->avg_packet_interval,
+				  packet_interval,
+				  (packet_interval > q->avg_packet_interval ?
+					   2 : 8));
+
+		q->last_packet_time = now;
+
+		if (packet_interval > q->avg_packet_interval) {
+			u64 window_interval = \
+				ktime_to_ns(ktime_sub(now,
+						      q->avg_window_begin));
+			u64 b = q->avg_window_bytes * (u64)NSEC_PER_SEC;
+
+			do_div(b, window_interval);
+			q->avg_peak_bandwidth =
+				cake_ewma(q->avg_peak_bandwidth, b,
+					  b > q->avg_peak_bandwidth ? 2 : 8);
+			q->avg_window_bytes = 0;
+			q->avg_window_begin = now;
+
+			if (ktime_after(now,
+					ktime_add_ms(q->last_reconfig_time,
+						     250))) {
+				q->rate_bps = (q->avg_peak_bandwidth * 15) >> 4;
+				cake_reconfigure(sch);
+			}
+		}
+	} else {
+		q->avg_window_bytes = 0;
+		q->last_packet_time = now;
+	}
 
 	/* flowchain */
 	if (!flow->set || flow->set == CAKE_SET_DECAYING) {
@@ -1246,14 +1289,26 @@ static struct sk_buff
[PATCH] bpf: add __printf verification to bpf_verifier_vlog
__printf is useful to verify format and arguments. The
'bpf_verifier_vlog' function is used twice in verifier.c; in both cases
the caller function already uses the __printf gcc attribute.

This removes the following warning, triggered with W=1:

  kernel/bpf/verifier.c:176:2: warning: function might be possible
  candidate for 'gnu_printf' format attribute
  [-Wsuggest-attribute=format]

Signed-off-by: Mathieu Malaterre
---
 include/linux/bpf_verifier.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 7e61c395fddf..ebf78f8ddfa1 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -197,8 +197,8 @@ struct bpf_verifier_env {
 	u32 subprog_cnt;
 };
 
-void bpf_verifier_vlog(struct bpf_verifier_log *log, const char *fmt,
-		       va_list args);
+__printf(2, 0) void bpf_verifier_vlog(struct bpf_verifier_log *log,
+				      const char *fmt, va_list args);
 __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
 					   const char *fmt, ...);
-- 
2.11.0
Re: [PATCH net-next v2 2/2] drivers: net: Remove device_node checks with of_mdiobus_register()
Hello!

On 05/16/2018 02:56 AM, Florian Fainelli wrote:

> A number of drivers have the following pattern:
> 
> 	if (np)
> 		of_mdiobus_register()
> 	else
> 		mdiobus_register()
> 
> which the implementation of of_mdiobus_register() now takes care of.
> Remove that pattern in drivers that strictly adhere to it.
> 
> Signed-off-by: Florian Fainelli
[...]
> diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
> index ac621f44237a..02e8982519ce 100644
> --- a/drivers/net/dsa/bcm_sf2.c
> +++ b/drivers/net/dsa/bcm_sf2.c
> @@ -450,12 +450,8 @@ static int bcm_sf2_mdio_register(struct dsa_switch *ds)
> 	priv->slave_mii_bus->parent = ds->dev->parent;
> 	priv->slave_mii_bus->phy_mask = ~priv->indir_phy_mask;
> 
> -	if (dn)
> -		err = of_mdiobus_register(priv->slave_mii_bus, dn);
> -	else
> -		err = mdiobus_register(priv->slave_mii_bus);
> -
> -	if (err)
> +	err = of_mdiobus_register(priv->slave_mii_bus, dn);
> +	if (err && dn)

   of_node_put() checks for NULL.

> 		of_node_put(dn);
> 
> 	return err;
[...]
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index d4604bc8eb5b..f3e43db0d6cb 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -2052,13 +2052,9 @@ static int fec_enet_mii_init(struct platform_device *pdev)
> 	fep->mii_bus->parent = &pdev->dev;
> 
> 	node = of_get_child_by_name(pdev->dev.of_node, "mdio");
> -	if (node) {
> -		err = of_mdiobus_register(fep->mii_bus, node);
> +	err = of_mdiobus_register(fep->mii_bus, node);
> +	if (node)
> 		of_node_put(node);

   Same comment here.

[...]
> diff --git a/drivers/net/ethernet/renesas/sh_eth.c
> b/drivers/net/ethernet/renesas/sh_eth.c
> index 5970d9e5ddf1..8dd41e08a6c6 100644
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
> @@ -3025,15 +3025,10 @@ static int sh_mdio_init(struct sh_eth_private *mdp,
> 		 pdev->name, pdev->id);
> 
> 	/* register MDIO bus */
> -	if (dev->of_node) {
> -		ret = of_mdiobus_register(mdp->mii_bus, dev->of_node);
> -	} else {
> -		if (pd->phy_irq > 0)
> -			mdp->mii_bus->irq[pd->phy] = pd->phy_irq;
> -
> -		ret = mdiobus_register(mdp->mii_bus);
> -	}
> +	if (pd->phy_irq > 0)
> +		mdp->mii_bus->irq[pd->phy] = pd->phy_irq;
> 
> +	ret = of_mdiobus_register(mdp->mii_bus, dev->of_node);
> 	if (ret)
> 		goto out_free_bus;
> 

   This part is:

Acked-by: Sergei Shtylyov

[...]

> diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
> index 91761436709a..8dff87ec6d99 100644
> --- a/drivers/net/usb/lan78xx.c
> +++ b/drivers/net/usb/lan78xx.c
> @@ -1843,12 +1843,9 @@ static int lan78xx_mdio_init(struct lan78xx_net *dev)
> 	}
> 
> 	node = of_get_child_by_name(dev->udev->dev.of_node, "mdio");
> -	if (node) {
> -		ret = of_mdiobus_register(dev->mdiobus, node);
> +	ret = of_mdiobus_register(dev->mdiobus, node);
> +	if (node)
> 		of_node_put(node);

   of_node_put() checks for NULL, again...

MBR, Sergei
Re: [PATCH net-next 3/3] udp: only use paged allocation with scatter-gather
On Tue, May 15, 2018 at 7:57 PM, Willem de Bruijn wrote:
> On Tue, May 15, 2018 at 4:04 PM, Willem de Bruijn wrote:
>> On Tue, May 15, 2018 at 10:14 AM, Willem de Bruijn wrote:
>>> On Mon, May 14, 2018 at 7:45 PM, Eric Dumazet wrote:
>>>> On 05/14/2018 04:30 PM, Willem de Bruijn wrote:
>>>>> I don't quite follow. The reported crash happens in the protocol
>>>>> layer, because of this check. With pagedlen we have not allocated
>>>>> sufficient space for the skb_put.
>>>>>
>>>>>         if (!(rt->dst.dev->features & NETIF_F_SG)) {
>>>>>                 unsigned int off;
>>>>>
>>>>>                 off = skb->len;
>>>>>                 if (getfrag(from, skb_put(skb, copy),
>>>>>                             offset, copy, off, skb) < 0) {
>>>>>                         __skb_trim(skb, off);
>>>>>                         err = -EFAULT;
>>>>>                         goto error;
>>>>>                 }
>>>>>         } else {
>>>>>                 int i = skb_shinfo(skb)->nr_frags;
>>>>>
>>>>> Are you referring to a separate potential issue in the gso layer?
>>>>> If a bonding device advertises SG, but a slave does not, then
>>>>> skb_segment on the slave should build linear segs? I have not
>>>>> tested that.
>>>>
>>>> Given that the device attribute could change under us, we need to
>>>> not crash, even if initially we thought NETIF_F_SG was available.
>>>>
>>>> Unless you want to hold RTNL in UDP xmit :)
>>>>
>>>> Ideally, GSO should be always on, as we did for TCP. Otherwise, I
>>>> can guarantee syzkaller will hit again.
>>>
>>> Ah, right. Thanks, Eric!
>>>
>>> I'll read that feature bit only once.
>>
>> This issue is actually deeper and not specific to gso.
>> With corking it is trivial to turn off sg in between calls.
>>
>> I'll need to send a separate fix for that.
>
> This would do it. The extra branch is unfortunate, but I see no easy
> way around it for the corking case.
>
> It will obviously not build a linear skb, but validate_xmit_skb will
> clean that up for such edge cases.
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 66340ab750e6..e7daec7c7421 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1040,7 +1040,8 @@ static int __ip_append_data(struct sock *sk,
>                 if (copy > length)
>                         copy = length;
> 
> -               if (!(rt->dst.dev->features & NETIF_F_SG)) {
> +               if (!(rt->dst.dev->features & NETIF_F_SG) &&
> +                   skb_tailroom(skb) >= copy) {
>                         unsigned int off;

Reminder that this is a separate draft patch to net, unrelated to gso.

A simpler branch

> -               if (!(rt->dst.dev->features & NETIF_F_SG)) {
> +               if (skb_tailroom(skb) >= copy) {

is probably sufficient, but might have subtle side-effects when SG is
off, where allocation padding allows data to fit that currently would
be added as a frag. Risky for a stable patch with no significant
benefit.

On the other extreme, I can define

  bool sg = rt->dst.dev->features & NETIF_F_SG;

and refer to that in both current sites that test the flag. But this
will not help the corking case where the function is entered twice for
the same skb. I'll add that in the net-next gso fix where the flag is
tested three times. But I intend to send this snippet (also for v6) as
is.
Re: [PATCH bpf-next] samples/bpf: Decrement ttl in fib forwarding example
On 05/16/2018 01:20 AM, David Ahern wrote:
> Only consider forwarding packets if ttl in received packet is > 1 and
> decrement ttl before handing off to bpf_redirect_map.
> 
> Signed-off-by: David Ahern

Looks good, applied to bpf-next, thanks David!
Re: [PATCH bpf-next v6 2/4] bpf: sockmap, add hash map support
On 05/15/2018 11:09 PM, Y Song wrote:
> On Tue, May 15, 2018 at 12:01 PM, Daniel Borkmann wrote:
>> On 05/14/2018 07:00 PM, John Fastabend wrote:
[...]
>>>  enum bpf_prog_type {
>>> @@ -1855,6 +1856,52 @@ struct bpf_stack_build_id {
>>>  *		Egress device index on success, 0 if packet needs to
>>>  *		continue up the stack for further processing or a
>>>  *		negative error in case of failure.
>>> + * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags)
>>
>> When you rebase please fix this up properly next time and add a
>> newline in between the helpers. I fixed this up while applying.
>
> I guess the tools/include/uapi/linux/bpf.h may also need fixup to be
> in sync with main bpf.h.

Yep agree, just fixed it up, thanks!
Re: [RFC bpf-next 00/11] Add socket lookup support
On Wed, May 16, 2018 at 12:05:06PM -0700, Joe Stringer wrote:
>>
>> A few open points:
>> * Currently, the lookup interface only returns either a valid socket
>>   or a NULL pointer. This means that if there is any kind of issue
>>   with the tuple, such as it provides an unsupported protocol number,
>>   or the socket can't be found, then we are unable to differentiate
>>   these cases from one another. One natural approach to improve this
>>   could be to return an ERR_PTR from the bpf_sk_lookup() helper. This
>>   would be more complicated but maybe it's worthwhile.
>
> This suggestion would add a lot of complexity, and there's not many
> legitimately different error cases. There's:
> * Unsupported socket type
> * Cannot find netns
> * Tuple argument is the wrong size
> * Can't find socket
>
> If we split the helpers into protocol-specific types, the first one
> would be addressed. The last one is addressed by returning NULL. It
> seems like a reasonable compromise to me to return NULL also in the
> middle two cases as well, and rely on the BPF writer to provide valid
> arguments.
>
>> * No ordering is defined between sockets. If the tuple could find
>>   multiple sockets, then it will arbitrarily return one. It is up to
>>   the caller to handle this. If we wish to handle this more reliably
>>   in future, we could encode an ordering preference in the flags
>>   field.
>
> Doesn't need to be addressed with this series, there is scope for
> addressing these cases when the use case arises.

Thanks for summarizing the conf call discussion.
Looking forward to non-rfc patches :)
[PATCH 3/3] sh_eth: add R8A77980 support
Finally, add support for the DT probing of the R-Car V3H (AKA R8A77980)
-- it's the only R-Car gen3 SoC having the GEther controller -- others
have only EtherAVB...

Based on the original (and large) patch by Vladimir Barinov.

Signed-off-by: Vladimir Barinov
Signed-off-by: Sergei Shtylyov
---
 Documentation/devicetree/bindings/net/sh_eth.txt |  1 +
 drivers/net/ethernet/renesas/sh_eth.c            | 44 +++
 2 files changed, 45 insertions(+)

Index: net-next/Documentation/devicetree/bindings/net/sh_eth.txt
===================================================================
--- net-next.orig/Documentation/devicetree/bindings/net/sh_eth.txt
+++ net-next/Documentation/devicetree/bindings/net/sh_eth.txt
@@ -14,6 +14,7 @@ Required properties:
 	      "renesas,ether-r8a7791" if the device is a part of R8A7791 SoC.
 	      "renesas,ether-r8a7793" if the device is a part of R8A7793 SoC.
 	      "renesas,ether-r8a7794" if the device is a part of R8A7794 SoC.
+	      "renesas,gether-r8a77980" if the device is a part of R8A77980 SoC.
 	      "renesas,ether-r7s72100" if the device is a part of R7S72100 SoC.
 	      "renesas,rcar-gen1-ether" for a generic R-Car Gen1 device.
	      "renesas,rcar-gen2-ether" for a generic R-Car Gen2 or RZ/G1
Index: net-next/drivers/net/ethernet/renesas/sh_eth.c
===================================================================
--- net-next.orig/drivers/net/ethernet/renesas/sh_eth.c
+++ net-next/drivers/net/ethernet/renesas/sh_eth.c
@@ -753,6 +753,49 @@ static struct sh_eth_cpu_data rcar_gen2_
 	.rmiimode	= 1,
 	.magic		= 1,
 };
+
+/* R8A77980 */
+static struct sh_eth_cpu_data r8a77980_data = {
+	.soft_reset	= sh_eth_soft_reset_gether,
+
+	.set_duplex	= sh_eth_set_duplex,
+	.set_rate	= sh_eth_set_rate_gether,
+
+	.register_type	= SH_ETH_REG_GIGABIT,
+
+	.edtrr_trns	= EDTRR_TRNS_GETHER,
+	.ecsr_value	= ECSR_PSRTO | ECSR_LCHNG | ECSR_ICD | ECSR_MPD,
+	.ecsipr_value	= ECSIPR_PSRTOIP | ECSIPR_LCHNGIP | ECSIPR_ICDIP |
+			  ECSIPR_MPDIP,
+	.eesipr_value	= EESIPR_RFCOFIP | EESIPR_ECIIP |
+			  EESIPR_FTCIP | EESIPR_TDEIP | EESIPR_TFUFIP |
+			  EESIPR_FRIP | EESIPR_RDEIP | EESIPR_RFOFIP |
+			  EESIPR_RMAFIP | EESIPR_RRFIP |
+			  EESIPR_RTLFIP | EESIPR_RTSFIP |
+			  EESIPR_PREIP | EESIPR_CERFIP,
+
+	.tx_check	= EESR_FTC | EESR_CD | EESR_RTO,
+	.eesr_err_check	= EESR_TWB1 | EESR_TWB | EESR_TABT | EESR_RABT |
+			  EESR_RFE | EESR_RDE | EESR_RFRMER |
+			  EESR_TFE | EESR_TDE | EESR_ECI,
+	.fdr_value	= 0x070f,
+
+	.apr		= 1,
+	.mpr		= 1,
+	.tpauser	= 1,
+	.bculr		= 1,
+	.hw_swap	= 1,
+	.nbst		= 1,
+	.rpadir		= 1,
+	.rpadir_value	= 2 << 16,
+	.no_trimd	= 1,
+	.no_ade		= 1,
+	.xdfar_rw	= 1,
+	.hw_checksum	= 1,
+	.select_mii	= 1,
+	.magic		= 1,
+	.cexcr		= 1,
+};
 #endif /* CONFIG_OF */
 
 static void sh_eth_set_rate_sh7724(struct net_device *ndev)
@@ -3134,6 +3177,7 @@ static const struct of_device_id sh_eth_
 	{ .compatible = "renesas,ether-r8a7791", .data = &rcar_gen2_data },
 	{ .compatible = "renesas,ether-r8a7793", .data = &rcar_gen2_data },
 	{ .compatible = "renesas,ether-r8a7794", .data = &rcar_gen2_data },
+	{ .compatible = "renesas,gether-r8a77980", .data = &r8a77980_data },
 	{ .compatible = "renesas,ether-r7s72100", .data = &r7s72100_data },
 	{ .compatible = "renesas,rcar-gen1-ether", .data = &rcar_gen1_data },
 	{ .compatible = "renesas,rcar-gen2-ether", .data = &rcar_gen2_data },
[PATCH 2/3] sh_eth: add EDMR.NBST support
The R-Car V3H (AKA R8A77980) GEther controller adds the DMA burst mode
bit (NBST) in EDMR and the manual tells to always set it before doing
any DMA.

Based on the original (and large) patch by Vladimir Barinov.

Signed-off-by: Vladimir Barinov
Signed-off-by: Sergei Shtylyov
---
 drivers/net/ethernet/renesas/sh_eth.c | 4 ++++
 drivers/net/ethernet/renesas/sh_eth.h | 2 ++
 2 files changed, 6 insertions(+)

Index: net-next/drivers/net/ethernet/renesas/sh_eth.c
===================================================================
--- net-next.orig/drivers/net/ethernet/renesas/sh_eth.c
+++ net-next/drivers/net/ethernet/renesas/sh_eth.c
@@ -1434,6 +1434,10 @@ static int sh_eth_dev_init(struct net_de
 	sh_eth_write(ndev, mdp->cd->trscer_err_mask, TRSCER);
 
+	/* DMA transfer burst mode */
+	if (mdp->cd->nbst)
+		sh_eth_modify(ndev, EDMR, EDMR_NBST, EDMR_NBST);
+
 	if (mdp->cd->bculr)
 		sh_eth_write(ndev, 0x800, BCULR);	/* Burst sycle set */
 
Index: net-next/drivers/net/ethernet/renesas/sh_eth.h
===================================================================
--- net-next.orig/drivers/net/ethernet/renesas/sh_eth.h
+++ net-next/drivers/net/ethernet/renesas/sh_eth.h
@@ -184,6 +184,7 @@ enum GECMR_BIT {
 
 /* EDMR */
 enum DMAC_M_BIT {
+	EDMR_NBST = 0x80,
 	EDMR_EL = 0x40, /* Litte endian */
 	EDMR_DL1 = 0x20, EDMR_DL0 = 0x10,
 	EDMR_SRST_GETHER = 0x03,
@@ -505,6 +506,7 @@ struct sh_eth_cpu_data {
 	unsigned bculr:1;	/* EtherC have BCULR */
 	unsigned tsu:1;		/* EtherC have TSU */
 	unsigned hw_swap:1;	/* E-DMAC have DE bit in EDMR */
+	unsigned nbst:1;	/* E-DMAC has NBST bit in EDMR */
 	unsigned rpadir:1;	/* E-DMAC have RPADIR */
 	unsigned no_trimd:1;	/* E-DMAC DO NOT have TRIMD */
 	unsigned no_ade:1;	/* E-DMAC DO NOT have ADE bit in EESR */
[PATCH 1/3] sh_eth: add RGMII support
The R-Car V3H (AKA R8A77980) GEther controller adds support for the
RGMII PHY interface mode as a new value for the RMII_MII register.

Based on the original (and large) patch by Vladimir Barinov.

Signed-off-by: Vladimir Barinov
Signed-off-by: Sergei Shtylyov
---
 drivers/net/ethernet/renesas/sh_eth.c | 3 +++
 1 file changed, 3 insertions(+)

Index: net-next/drivers/net/ethernet/renesas/sh_eth.c
===================================================================
--- net-next.orig/drivers/net/ethernet/renesas/sh_eth.c
+++ net-next/drivers/net/ethernet/renesas/sh_eth.c
@@ -466,6 +466,9 @@ static void sh_eth_select_mii(struct net
 	u32 value;
 
 	switch (mdp->phy_interface) {
+	case PHY_INTERFACE_MODE_RGMII:
+		value = 0x3;
+		break;
 	case PHY_INTERFACE_MODE_GMII:
 		value = 0x2;
 		break;
[PATCH 0/3] Add R8A77980 GEther support
Hello!

Here's a set of 3 patches against DaveM's 'net-next.git' repo. They
(gradually) add R8A77980 GEther support to the 'sh_eth' driver, starting
with a couple of new register bits/values introduced with this chip, and
ending with adding a new 'struct sh_eth_cpu_data' instance connected to
the new DT "compatible" prop value...

[1/3] sh_eth: add RGMII support
[2/3] sh_eth: add EDMR.NBST support
[3/3] sh_eth: add R8A77980 support

MBR, Sergei
[RFC PATCH] net: hns3: hns3_pci_sriov_configure() can be static
Fixes: fdb793670a00 ("net: hns3: Add support of .sriov_configure in HNS3 driver")
Signed-off-by: Fengguang Wu
---
 hns3_enet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index e85ff38..3617b9d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1579,7 +1579,7 @@ static void hns3_remove(struct pci_dev *pdev)
  * Enable or change the number of VFs. Called when the user updates the number
  * of VFs in sysfs.
  **/
-int hns3_pci_sriov_configure(struct pci_dev *pdev, int num_vfs)
+static int hns3_pci_sriov_configure(struct pci_dev *pdev, int num_vfs)
 {
 	int ret;
Re: [PATCH net-next 09/10] net: hns3: Add support of .sriov_configure in HNS3 driver
Hi Peng,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url: https://github.com/0day-ci/linux/commits/Salil-Mehta/Misc-Bug-Fixes-and-clean-ups-for-HNS3-Driver/20180516-211239
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__

sparse warnings: (new ones prefixed by >>)

   drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:266:16: sparse: expression using sizeof(void)
   drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:266:16: sparse: expression using sizeof(void)
>> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:1582:5: sparse: symbol 'hns3_pci_sriov_configure' was not declared. Should it be static?
   drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:2513:21: sparse: expression using sizeof(void)
   drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:2706:22: sparse: expression using sizeof(void)
   drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:2706:22: sparse: expression using sizeof(void)

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [PATCH net-next v2 0/3] net: Allow more drivers with COMPILE_TEST
On 05/16/2018 12:07 PM, David Miller wrote:
> From: David Miller
> Date: Wed, 16 May 2018 15:06:59 -0400 (EDT)
> 
>> So applied, thanks.
> 
> Nevermind, eventually got a build failure:
> 
> ERROR: "knav_queue_open" [drivers/net/ethernet/ti/keystone_netcp.ko] undefined!
> make[1]: *** [scripts/Makefile.modpost:92: __modpost] Error 1
> make: *** [Makefile:1276: modules] Error 2

Snap, ok, let me do some more serious build testing with different
architectures here. Sorry about that.
-- 
Florian
[PATCH v3] {net, IB}/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'
When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used
to free it.

Fixes: 1cbe6fc86ccfe ("IB/mlx5: Add support for CQE compressing")
Fixes: fed9ce22bf8ae ("net/mlx5: E-Switch, Add API to create vport rx rules")
Fixes: 9efa75254593d ("net/mlx5_core: Introduce access functions to query vport RoCE fields")
Signed-off-by: Christophe JAILLET
---
v1 -> v2: More places to update have been added to the patch
v2 -> v3: Add Fixes tag

3 patches with one Fixes tag each should probably be better, but
honestly, I won't send a v4. Feel free to split it if needed.
---
 drivers/infiniband/hw/mlx5/cq.c                            | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/vport.c            | 6 +++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 77d257ec899b..6d52ea03574e 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -849,7 +849,7 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
 	return 0;
 
 err_cqb:
-	kfree(*cqb);
+	kvfree(*cqb);
 
 err_db:
 	mlx5_ib_db_unmap_user(to_mucontext(context), &cq->db);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 35e256eb2f6e..b123f8a52ad8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -663,7 +663,7 @@ static int esw_create_vport_rx_group(struct mlx5_eswitch *esw)
 	esw->offloads.vport_rx_group = g;
 out:
-	kfree(flow_group_in);
+	kvfree(flow_group_in);
 	return err;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
index 177e076b8d17..719cecb182c6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
@@ -511,7 +511,7 @@ int mlx5_query_nic_vport_system_image_guid(struct mlx5_core_dev *mdev,
 	*system_image_guid = MLX5_GET64(query_nic_vport_context_out, out,
 					nic_vport_context.system_image_guid);
 
-	kfree(out);
+	kvfree(out);
 
 	return 0;
 }
@@ -531,7 +531,7 @@ int mlx5_query_nic_vport_node_guid(struct mlx5_core_dev *mdev, u64 *node_guid)
 	*node_guid = MLX5_GET64(query_nic_vport_context_out, out,
 				nic_vport_context.node_guid);
 
-	kfree(out);
+	kvfree(out);
 
 	return 0;
 }
@@ -587,7 +587,7 @@ int mlx5_query_nic_vport_qkey_viol_cntr(struct mlx5_core_dev *mdev,
 	*qkey_viol_cntr = MLX5_GET(query_nic_vport_context_out, out,
 				   nic_vport_context.qkey_violation_counter);
 
-	kfree(out);
+	kvfree(out);
 
 	return 0;
 }
-- 
2.17.0
Re: [PATCH net-next v2 0/3] net: Allow more drivers with COMPILE_TEST
From: David Miller
Date: Wed, 16 May 2018 15:06:59 -0400 (EDT)

> So applied, thanks.

Nevermind, eventually got a build failure:

ERROR: "knav_queue_open" [drivers/net/ethernet/ti/keystone_netcp.ko] undefined!
make[1]: *** [scripts/Makefile.modpost:92: __modpost] Error 1
make: *** [Makefile:1276: modules] Error 2

Reverted.
Re: [PATCH net-next v2 0/3] net: Allow more drivers with COMPILE_TEST
From: Florian Fainelli
Date: Wed, 16 May 2018 11:52:55 -0700

> This patch series includes more drivers to be build tested with
> COMPILE_TEST enabled. This helps cover some of the issues I just ran
> into with missing a driver *sigh*.
> 
> Changes in v2:
> 
> - allow FEC to build outside of CONFIG_ARM/ARM64 by defining a layout
>   of registers, this is not meant to run, so this is not a real issue
>   if we are not matching the correct register layout

Ok, this is a lot better. But man, some of these drivers...

drivers/net/ethernet/ti/davinci_cpdma.c: In function 'cpdma_desc_pool_destroy':
drivers/net/ethernet/ti/davinci_cpdma.c:194:7: warning: format '%d' expects argument of type 'int', but argument 2 has type 'size_t {aka long unsigned int}' [-Wformat=]
     "cpdma_desc_pool size %d != avail %d",
      ^
     gen_pool_size(pool->gen_pool),
     ~

and on and on and on...

But I'm really happy to see FEC and others at least being build tested
in more scenarios.

So applied, thanks.
Re: [RFC bpf-next 00/11] Add socket lookup support
On 9 May 2018 at 14:06, Joe Stringer wrote:
> This series proposes a new helper for the BPF API which allows BPF
> programs to perform lookups for sockets in a network namespace. This
> would allow programs to determine early on in processing whether the
> stack is expecting to receive the packet, and perform some action (eg
> drop, forward somewhere) based on this information.
>
> The series is structured roughly into:
> * Misc refactor
> * Add the socket pointer type
> * Add reference tracking to ensure that socket references are freed
> * Extend the BPF API to add sk_lookup() / sk_release() functions
> * Add tests/documentation
>
> The helper proposed in this series includes a parameter for a tuple
> which must be filled in by the caller to determine the socket to look
> up. The simplest case would be filling with the contents of the
> packet, ie mapping the packet's 5-tuple into the parameter. In common
> cases, it may alternatively be useful to reverse the direction of the
> tuple and perform a lookup, to find the socket that initiates this
> connection; and if the BPF program ever performs a form of IP address
> translation, it may further be useful to be able to look up arbitrary
> tuples that are not based upon the packet, but instead based on state
> held in BPF maps or hardcoded in the BPF program.
>
> Currently, access into the socket's fields are limited to those which
> are otherwise already accessible, and are restricted to read-only
> access.
>
> A few open points:
> * Currently, the lookup interface only returns either a valid socket
>   or a NULL pointer. This means that if there is any kind of issue
>   with the tuple, such as it provides an unsupported protocol number,
>   or the socket can't be found, then we are unable to differentiate
>   these cases from one another. One natural approach to improve this
>   could be to return an ERR_PTR from the bpf_sk_lookup() helper. This
>   would be more complicated but maybe it's worthwhile.
This suggestion would add a lot of complexity, and there's not many legitimately different error cases. There's: * Unsupported socket type * Cannot find netns * Tuple argument is the wrong size * Can't find socket If we split the helpers into protocol-specific types, the first one would be addressed. The last one is addressed by returning NULL. It seems like a reasonable compromise to me to return NULL also in the middle two cases as well, and rely on the BPF writer to provide valid arguments. > * No ordering is defined between sockets. If the tuple could find multiple > sockets, then it will arbitrarily return one. It is up to the caller to > handle this. If we wish to handle this more reliably in future, we could > encode an ordering preference in the flags field. Doesn't need to be addressed with this series, there is scope for addressing these cases when the use case arises. > * Currently this helper is only defined for TC hook point, but it should also > be valid at XDP and perhaps some other hooks. Easy to add support for XDP on demand, initial implementation doesn't need it.
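For context, a TC-BPF usage sketch of the proposed helper, based only on the names mentioned in this RFC (bpf_sk_lookup()/bpf_sk_release() and a caller-filled tuple). This is hypothetical: the final signature, tuple layout, and flag semantics are exactly what is still under discussion here and may differ in the non-RFC series:

```c
/* Hypothetical fragment of a cls_bpf program -- not the RFC's actual
 * selftest code. Assumes iph/tcph have already been bounds-checked. */
struct bpf_sock_tuple tuple = {};

/* fill the tuple from the packet's 5-tuple (IPv4/TCP shown) */
tuple.ipv4.saddr = iph->saddr;
tuple.ipv4.daddr = iph->daddr;
tuple.ipv4.sport = tcph->source;
tuple.ipv4.dport = tcph->dest;

/* netns and flags arguments assumed; NULL covers all error cases
 * per the compromise discussed above */
struct bpf_sock *sk = bpf_sk_lookup(skb, &tuple, sizeof(tuple.ipv4), 0, 0);
if (!sk)
	return TC_ACT_SHOT;	/* stack is not expecting this packet */

/* ... read-only inspection of permitted sk fields ... */

bpf_sk_release(sk);		/* reference tracking makes this mandatory */
return TC_ACT_OK;
```

The verifier-side reference tracking added earlier in the series is what forces the bpf_sk_release() call: a program that can leak the returned pointer would be rejected at load time.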
Re: [PATCH net-next] cxgb4: update LE-TCAM collection for T6
From: Rahul Lakkireddy
Date: Wed, 16 May 2018 19:51:15 +0530

> For T6, clip table is separated from main TCAM. So, update LE-TCAM
> collection logic to collect clip table TCAM as well. IPv6 takes
> 4 entries in clip table TCAM compared to 2 entries in main TCAM.
>
> Also, in case of errors, keep LE-TCAM collected so far and set the
> status to partial dump.
>
> Signed-off-by: Rahul Lakkireddy
> Signed-off-by: Ganesh Goudar

Applied, thanks.