Re: [v3] Bluetooth: btrsi: rework dependencies
Arnd Bergmann wrote:

> The linkage between the bluetooth driver and the wireless
> driver is not defined properly, leading to build problems
> such as:
>
> warning: (BT_HCIRSI) selects RSI_COEX which has unmet direct dependencies
> (NETDEVICES && WLAN && WLAN_VENDOR_RSI && BT_HCIRSI && RSI_91X)
> drivers/net/wireless/rsi/rsi_91x_main.o: In function `rsi_read_pkt':
> (.text+0x205): undefined reference to `rsi_bt_ops'
>
> As the dependency is actually the reverse (RSI_91X uses
> the BT_RSI driver, not the other way round), this changes
> the dependency to match, and enables the bluetooth driver
> from the RSI_COEX symbol.
>
> Fixes: 38aa4da50483 ("Bluetooth: btrsi: add new rsi bluetooth driver")
> Acked-by: Marcel Holtmann
> Signed-off-by: Arnd Bergmann

Patch applied to wireless-drivers-next.git, thanks.

255dd5b79d54 Bluetooth: btrsi: rework dependencies

--
https://patchwork.kernel.org/patch/10285795/
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [v2] rsi: Remove stack VLA usage
"Tobin C. Harding" wrote:

> The use of stack Variable Length Arrays needs to be avoided, as they
> can be a vector for stack exhaustion, which can be both a runtime bug
> (kernel Oops) or a security flaw (overwriting memory beyond the
> stack). Also, in general, as code evolves it is easy to lose track of
> how big a VLA can get. Thus, we can end up having runtime failures
> that are hard to debug. This is part of the directive[1] to remove all
> VLAs from the kernel and build with -Wvla.
>
> Currently the rsi code uses a VLA based on a function argument to
> `rsi_sdio_load_data_master_write()`. Both of these functions
>
>     rsi_sdio_reinit_device()
>     rsi_probe()
>
> start the call chain:
>
>     rsi_hal_device_init()
>       rsi_load_fw()
>         auto_fw_upgrade()
>           ping_pong_write()
>             rsi_sdio_load_data_master_write()
>
> [Without familiarity with the code] it appears that none of the 4 locks
>
>     mutex
>     rx_mutex
>     tx_mutex
>     tx_bus_mutex
>
> are held when `rsi_sdio_load_data_master_write()` is called. It is
> therefore safe to use kmalloc() with GFP_KERNEL.
>
> We can avoid using the VLA by using `kmalloc()` and freeing the memory
> on all exit paths: change the buffer from 'u8 array' to 'u8 *', call
> `kmalloc()` to allocate memory for the buffer, and use a goto statement
> to call `kfree()` on all return paths.
>
> It can be expected that this patch will result in a small increase in
> overhead due to the use of `kmalloc()`; however, this code is only
> called on initialization (and re-initialization), so this overhead
> should not degrade performance.
>
> [1] https://lkml.org/lkml/2018/3/7/621
>
> Signed-off-by: Tobin C. Harding

Patch applied to wireless-drivers-next.git, thanks.

44f98a9332e4 rsi: Remove stack VLA usage

--
https://patchwork.kernel.org/patch/10283841/
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
[PATCH net 0/1] net/smc: fix clc problem
Dave,

the conversion from kernel_recvmsg() to sock_recvmsg() introduced a
problem for SMC. This patch fixes it.

Thanks, Ursula

Ursula Braun (1):
  net/smc: use announced length in sock_recvmsg()

 net/smc/smc_clc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--
2.13.5
[bpf-next V6 PATCH 14/15] xdp: transition into using xdp_frame for return API
Changing the API xdp_return_frame() to take a struct xdp_frame as its argument seems like a natural choice. But there are some subtle performance details here that need extra care; the handling chosen is deliberate.

When de-referencing the xdp_frame on a remote CPU during DMA-TX completion, the result is that the cache-line changes to the "Shared" state. Later, when the page is reused for RX, this xdp_frame cache-line is written, which changes the state to "Modified".

This situation already happens (naturally) for virtio_net, tun and cpumap, as the xdp_frame pointer is the queued object. In tun and cpumap, the ptr_ring is used for efficiently transferring cache-lines (with pointers) between CPUs. Thus, the only option is to de-reference the xdp_frame.

It is only the ixgbe driver that had an optimization by which it could avoid de-referencing the xdp_frame. The driver already has a TX-ring queue, which (in case of remote DMA-TX completion) has to be transferred between CPUs anyhow. In this data area, we stored a struct xdp_mem_info and a data pointer, which allowed us to avoid de-referencing the xdp_frame.

To compensate for this, a prefetchw is used to tell the cache coherency protocol about our access pattern. My benchmarks show that this prefetchw is enough to compensate for the extra cost in the ixgbe driver.
Signed-off-by: Jesper Dangaard Brouer
--- drivers/net/ethernet/intel/ixgbe/ixgbe.h | 4 +--- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 17 +++-- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 1 + drivers/net/tun.c | 4 ++-- drivers/net/virtio_net.c | 2 +- include/net/xdp.h | 2 +- kernel/bpf/cpumap.c | 6 +++--- net/core/xdp.c | 4 +++- 8 files changed, 23 insertions(+), 17 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index cbc20f199364..dfbc15a45cb4 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -240,8 +240,7 @@ struct ixgbe_tx_buffer { unsigned long time_stamp; union { struct sk_buff *skb; - /* XDP uses address ptr on irq_clean */ - void *data; + struct xdp_frame *xdpf; }; unsigned int bytecount; unsigned short gso_segs; @@ -249,7 +248,6 @@ struct ixgbe_tx_buffer { DEFINE_DMA_UNMAP_ADDR(dma); DEFINE_DMA_UNMAP_LEN(len); u32 tx_flags; - struct xdp_mem_info xdp_mem; }; struct ixgbe_rx_buffer { diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index ff069597fccf..e6e9b28ecfba 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, /* free the skb */ if (ring_is_xdp(tx_ring)) - xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem); + xdp_return_frame(tx_buffer->xdpf); else napi_consume_skb(tx_buffer->skb, napi_budget); @@ -2376,6 +2376,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, xdp.data_hard_start = xdp.data - ixgbe_rx_offset(rx_ring); xdp.data_end = xdp.data + size; + prefetchw(xdp.data_hard_start); /* xdp_frame write */ skb = ixgbe_run_xdp(adapter, rx_ring, &xdp); } @@ -5787,7 +5788,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring) /* Free all the Tx ring sk_buffs */ if (ring_is_xdp(tx_ring)) -
xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem); + xdp_return_frame(tx_buffer->xdpf); else dev_kfree_skb_any(tx_buffer->skb); @@ -8333,16 +8334,21 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter, struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()]; struct ixgbe_tx_buffer *tx_buffer; union ixgbe_adv_tx_desc *tx_desc; + struct xdp_frame *xdpf; u32 len, cmd_type; dma_addr_t dma; u16 i; - len = xdp->data_end - xdp->data; + xdpf = convert_to_xdp_frame(xdp); + if (unlikely(!xdpf)) + return -EOVERFLOW; + + len = xdpf->len; if (unlikely(!ixgbe_desc_unused(ring))) return IXGBE_XDP_CONSUMED; - dma = dma_map_single(ring->dev, xdp->data, len, DMA_TO_DEVICE); + dma = dma_map_single(ring->dev, xdpf->data, len, DMA_TO_DEVICE); if (dma_mapping_error(ring->dev, dma)) return
[bpf-next V6 PATCH 11/15] page_pool: refurbish version of page_pool code
The ndo_xdp_xmit API needs a fast page-recycle mechanism for returning pages on DMA-TX completion time, one with good cross-CPU performance, given that DMA-TX completion can happen on a remote CPU.

Refurbish my page_pool code, which was presented[1] at MM-summit 2016. The page_pool code was adapted to not depend on the page allocator or integrate into struct page. The DMA mapping feature is kept, even though it will not be activated/used in this patchset.

[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

V2: Adjustments requested by Tariq - Changed page_pool_create return codes: don't return NULL, only ERR_PTR, as this simplifies err handling in drivers. V4: many small improvements and cleanups - Add DOC comment section, that can be used by kernel-doc - Improve fallback mode, to work better with refcnt based recycling, e.g. remove a WARN as pointed out by Tariq, e.g. quicker fallback if ptr_ring is empty. V5: Fixed SPDX license as pointed out by Alexei V6: Adjustments requested by Eric Dumazet - Adjust cacheline_aligned_in_smp usage/placement - Move rcu_head in struct page_pool - Free pages quicker on destroy, minimize resources delayed for an RCU period - Remove code for forward/backward compat ABI interface

Signed-off-by: Jesper Dangaard Brouer
--- include/net/page_pool.h | 129 +++ net/core/Makefile | 1 net/core/page_pool.c | 317 +++ 3 files changed, 447 insertions(+) create mode 100644 include/net/page_pool.h create mode 100644 net/core/page_pool.c diff --git a/include/net/page_pool.h b/include/net/page_pool.h new file mode 100644 index ..1fe77db59518 --- /dev/null +++ b/include/net/page_pool.h @@ -0,0 +1,129 @@ +/* SPDX-License-Identifier: GPL-2.0 + * + * page_pool.h + * Author: Jesper Dangaard Brouer + * Copyright (C) 2016 Red Hat, Inc.
+ */ + +/** + * DOC: page_pool allocator + * + * This page_pool allocator is optimized for the XDP mode that + * uses one-frame-per-page, but has fallbacks that act like the + * regular page allocator APIs. + * + * Basic use involves replacing alloc_pages() calls with the + * page_pool_alloc_pages() call. Drivers should likely use + * page_pool_dev_alloc_pages() in place of dev_alloc_pages(). + * + * If page_pool handles DMA mapping (using page->private), then the API + * user is responsible for invoking page_pool_put_page() once. In case + * of an elevated refcnt, the DMA state is released, assuming other + * users of the page will eventually call put_page(). + * + * If no DMA mapping is done, then it can act as a shim layer that + * falls through to alloc_page. As no state is kept on the page, the + * regular put_page() call is sufficient. + */ +#ifndef _NET_PAGE_POOL_H +#define _NET_PAGE_POOL_H + +#include /* Needed by ptr_ring */ +#include +#include + +#define PP_FLAG_DMA_MAP 1 /* Should page_pool do the DMA map/unmap */ +#define PP_FLAG_ALL PP_FLAG_DMA_MAP + +/* + * Fast allocation side cache array/stack + * + * The cache size and refill watermark is related to the network + * use-case. The NAPI budget is 64 packets. After a NAPI poll the RX + * ring is usually refilled and the max consumed elements will be 64, + * thus a natural max size of objects needed in the cache. + * + * Keeping room for more objects is due to the XDP_DROP use-case. + * XDP_DROP allows the opportunity to recycle objects directly into + * this array, as it shares the same softirq/NAPI protection. If the + * cache is already full (or partly full) then the XDP_DROP recycles + * would have to take a slower code path.
+ */ +#define PP_ALLOC_CACHE_SIZE 128 +#define PP_ALLOC_CACHE_REFILL 64 +struct pp_alloc_cache { + u32 count; + void *cache[PP_ALLOC_CACHE_SIZE]; +}; + +struct page_pool_params { + unsigned int flags; + unsigned int order; + unsigned int pool_size; + int nid; /* NUMA node id to allocate pages from */ + struct device *dev; /* device, for DMA pre-mapping purposes */ + enum dma_data_direction dma_dir; /* DMA mapping direction */ +}; + +struct page_pool { + struct rcu_head rcu; + struct page_pool_params p; + + /* +* Data structure for the allocation side +* +* The driver allocation side usually already performs some kind +* of resource protection. Piggyback on this protection, and +* require the driver to protect the allocation side. +* +* For NIC drivers this means: allocate a page_pool per +* RX-queue. The RX-queue is already protected by +* softirq/BH scheduling and napi_schedule. NAPI schedule +* guarantees that a single napi_struct will only be scheduled +* on a single CPU (see napi_schedule). +*/ + struct pp_alloc_cache alloc ____cacheline_aligned_in_smp; + + /* Data structure for
[bpf-next V6 PATCH 12/15] xdp: allow page_pool as an allocator type in xdp_return_frame
New allocator type MEM_TYPE_PAGE_POOL for page_pool usage. The registered allocator page_pool pointer is not available directly from xdp_rxq_info, but it could be (if needed). For now, the driver should keep separate track of the page_pool pointer, which it should use for RX-ring page allocation.

As suggested by Saeed, to maintain a symmetric API it is the driver's responsibility to allocate/create and free/destroy the page_pool. Thus, after the driver has called xdp_rxq_info_unreg(), it is the driver's responsibility to free the page_pool, but with an RCU free call. This is done easily via the page_pool helper page_pool_destroy() (which avoids touching any driver code during the RCU callback, which could happen after the driver has been unloaded).

Signed-off-by: Jesper Dangaard Brouer
--- include/net/xdp.h | 3 +++ net/core/xdp.c | 45 +++-- 2 files changed, 38 insertions(+), 10 deletions(-) diff --git a/include/net/xdp.h b/include/net/xdp.h index 859aa9b737fe..98b55eaf8fd7 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -36,6 +36,7 @@ enum mem_type { MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */ MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */ + MEM_TYPE_PAGE_POOL, MEM_TYPE_MAX, }; @@ -44,6 +45,8 @@ struct xdp_mem_info { u32 id; }; +struct page_pool; + struct xdp_rxq_info { struct net_device *dev; u32 queue_index; diff --git a/net/core/xdp.c b/net/core/xdp.c index e7f11f05086d..cc3238474e8c 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -8,6 +8,7 @@ #include #include #include +#include #include @@ -27,7 +28,10 @@ static struct rhashtable *mem_id_ht; struct xdp_mem_allocator { struct xdp_mem_info mem; - void *allocator; + union { + void *allocator; + struct page_pool *page_pool; + }; struct rhash_head node; struct rcu_head rcu; }; @@ -74,7 +78,9 @@ void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu) /* Allow this ID to be reused */ ida_simple_remove(&mem_id_pool, xa->mem.id); - /* TODO: Depending on allocator type/pointer free resources */
+ /* Notice, driver is expected to free the *allocator, +* e.g. page_pool, and MUST also use RCU free. +*/ /* Poison memory */ xa->mem.id = 0x; @@ -243,8 +249,11 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, xdp_rxq->mem.type = type; - if (!allocator) + if (!allocator) { + if (type == MEM_TYPE_PAGE_POOL) + return -EINVAL; /* Setup time check page_pool req */ return 0; + } /* Delay init of rhashtable to save memory if feature isn't used */ if (!mem_id_init) { @@ -290,15 +299,31 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model); void xdp_return_frame(void *data, struct xdp_mem_info *mem) { - if (mem->type == MEM_TYPE_PAGE_SHARED) { + struct xdp_mem_allocator *xa; + struct page *page; + + switch (mem->type) { + case MEM_TYPE_PAGE_POOL: + rcu_read_lock(); + /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */ + xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params); + page = virt_to_head_page(data); + if (xa) + page_pool_put_page(xa->page_pool, page); + else + put_page(page); + rcu_read_unlock(); + break; + case MEM_TYPE_PAGE_SHARED: page_frag_free(data); - return; - } - - if (mem->type == MEM_TYPE_PAGE_ORDER0) { - struct page *page = virt_to_page(data); /* Assumes order0 page*/ - + break; + case MEM_TYPE_PAGE_ORDER0: + page = virt_to_page(data); /* Assumes order0 page*/ put_page(page); + break; + default: + /* Not possible, checked in xdp_rxq_info_reg_mem_model() */ + break; } } EXPORT_SYMBOL_GPL(xdp_return_frame);
[bpf-next V6 PATCH 15/15] xdp: transition into using xdp_frame for ndo_xdp_xmit
Change the API ndo_xdp_xmit to take a struct xdp_frame instead of struct xdp_buff. This brings xdp_return_frame and ndo_xdp_xmit in sync. This builds towards changing the API further to become a bulk API, because xdp_buff is not a queue-able object while xdp_frame is. V4: Adjust for commit 59655a5b6c83 ("tuntap: XDP_TX can use native XDP")

Signed-off-by: Jesper Dangaard Brouer
--- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 21 +++-- drivers/net/tun.c | 19 --- drivers/net/virtio_net.c | 24 ++-- include/linux/netdevice.h | 4 ++-- net/core/filter.c | 17 +++-- 5 files changed, 54 insertions(+), 31 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index e6e9b28ecfba..f78096ed4c86 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -2252,7 +2252,7 @@ static struct sk_buff *ixgbe_build_skb(struct ixgbe_ring *rx_ring, #define IXGBE_XDP_TX 2 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter, - struct xdp_buff *xdp); + struct xdp_frame *xdpf); static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter, struct ixgbe_ring *rx_ring, @@ -2260,6 +2260,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter, { int err, result = IXGBE_XDP_PASS; struct bpf_prog *xdp_prog; + struct xdp_frame *xdpf; u32 act; rcu_read_lock(); @@ -2273,7 +2274,12 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter, case XDP_PASS: break; case XDP_TX: - result = ixgbe_xmit_xdp_ring(adapter, xdp); + xdpf = convert_to_xdp_frame(xdp); + if (unlikely(!xdpf)) { + result = IXGBE_XDP_CONSUMED; + break; + } + result = ixgbe_xmit_xdp_ring(adapter, xdpf); break; case XDP_REDIRECT: err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog); @@ -8329,20 +8335,15 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb, } static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter, - struct xdp_buff *xdp) + struct
xdp_frame *xdpf) { struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()]; struct ixgbe_tx_buffer *tx_buffer; union ixgbe_adv_tx_desc *tx_desc; - struct xdp_frame *xdpf; u32 len, cmd_type; dma_addr_t dma; u16 i; - xdpf = convert_to_xdp_frame(xdp); - if (unlikely(!xdpf)) - return -EOVERFLOW; - len = xdpf->len; if (unlikely(!ixgbe_desc_unused(ring))) @@ -9995,7 +9996,7 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp) } } -static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) +static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf) { struct ixgbe_adapter *adapter = netdev_priv(dev); struct ixgbe_ring *ring; @@ -10011,7 +10012,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) if (unlikely(!ring)) return -ENXIO; - err = ixgbe_xmit_xdp_ring(adapter, xdp); + err = ixgbe_xmit_xdp_ring(adapter, xdpf); if (err != IXGBE_XDP_TX) return -ENOSPC; diff --git a/drivers/net/tun.c b/drivers/net/tun.c index a7e42ae1b220..da0402ebc5ce 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1293,18 +1293,13 @@ static const struct net_device_ops tun_netdev_ops = { .ndo_get_stats64= tun_net_get_stats64, }; -static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) +static int tun_xdp_xmit(struct net_device *dev, struct xdp_frame *frame) { struct tun_struct *tun = netdev_priv(dev); - struct xdp_frame *frame; struct tun_file *tfile; u32 numqueues; int ret = 0; - frame = convert_to_xdp_frame(xdp); - if (unlikely(!frame)) - return -EOVERFLOW; - rcu_read_lock(); numqueues = READ_ONCE(tun->numqueues); @@ -1328,6 +1323,16 @@ static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) return ret; } +static int tun_xdp_tx(struct net_device *dev, struct xdp_buff *xdp) +{ + struct xdp_frame *frame = convert_to_xdp_frame(xdp); + + if (unlikely(!frame)) + return -EOVERFLOW; + + return tun_xdp_xmit(dev, frame); +} + static void tun_xdp_flush(struct net_device *dev) { struct 
tun_struct *tun = netdev_priv(dev); @@ -1675,7 +1680,7 @@
[bpf-next V6 PATCH 13/15] mlx5: use page_pool for xdp_return_frame call
This patch shows how it is possible to have both the driver-local page cache, which uses elevated refcnt for "catching"/avoiding SKB put_page returning the page through the page allocator, and at the same time have pages getting returned to the page_pool from ndo_xdp_xmit DMA completion.

The performance improvement for XDP_REDIRECT in this patch is really good. Especially considering that (currently) the xdp_return_frame API and page_pool_put_page() do per-frame operations of both rhashtable ID-lookup and locked return into the (page_pool) ptr_ring. (It is the plan to remove these per-frame operations in a followup patchset.)

The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe, with xdp_redirect_map (using devmap). The target/maximum capability of ixgbe is 13Mpps (on this HW setup).

Before this patch for mlx5, XDP-redirected frames were returned via the page allocator. The single-flow performance was 6Mpps, and if I started two flows the collective performance dropped to 4Mpps, because we hit the page allocator lock (further negative scaling occurs).

Two test scenarios need to be covered for the xdp_return_frame API: DMA-TX completion running on same-CPU or cross-CPU free/return. Results were same-CPU=10Mpps and cross-CPU=12Mpps. This is very close to our 13Mpps max target. The reason the max target isn't reached in the cross-CPU test is likely due to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe-to-ixgbe testing). It is also planned to remove this unnecessary DMA unmap in a later patchset.

V2: Adjustments requested by Tariq - Changed page_pool_create return codes to not return NULL, only ERR_PTR, as this simplifies err handling in drivers.
- Save a branch in mlx5e_page_release - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ V5: Updated patch desc

Signed-off-by: Jesper Dangaard Brouer
Reviewed-by: Tariq Toukan
--- drivers/net/ethernet/mellanox/mlx5/core/en.h | 3 ++ drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 40 + drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 16 ++-- 3 files changed, 47 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h index 28cc26debeda..ab91166f7c5a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -53,6 +53,8 @@ #include "mlx5_core.h" #include "en_stats.h" +struct page_pool; + #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v) #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN) @@ -535,6 +537,7 @@ struct mlx5e_rq { /* XDP */ struct bpf_prog *xdp_prog; struct mlx5e_xdpsq xdpsq; + struct page_pool *page_pool; /* control */ struct mlx5_wq_ctrl wq_ctrl; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 2e4ca0f15b62..e3915fc4d987 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -35,6 +35,7 @@ #include #include #include +#include #include "eswitch.h" #include "en.h" #include "en_tc.h" @@ -387,10 +388,11 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, struct mlx5e_rq_param *rqp, struct mlx5e_rq *rq) { + struct page_pool_params pp_params = { 0 }; struct mlx5_core_dev *mdev = c->mdev; void *rqc = rqp->rqc; void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq); - u32 byte_count; + u32 byte_count, pool_size; int npages; int wq_sz; int err; @@ -429,10 +431,13 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, rq->buff.map_dir = rq->xdp_prog ?
DMA_BIDIRECTIONAL : DMA_FROM_DEVICE; rq->buff.headroom = params->rq_headroom; + pool_size = 1 << params->log_rq_size; switch (rq->wq_type) { case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ: + pool_size = pool_size * MLX5_MPWRQ_PAGES_PER_WQE; + rq->post_wqes = mlx5e_post_rx_mpwqes; rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe; @@ -506,13 +511,30 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, rq->mkey_be = c->mkey_be; } - /* This must only be activate for order-0 pages */ - if (rq->xdp_prog) { - err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, -MEM_TYPE_PAGE_ORDER0, NULL); - if (err) - goto err_rq_wq_destroy; + /* Create a page_pool and register it with rxq */ + pp_params.order = rq->buff.page_order; + pp_params.flags = 0; /* No-internal DMA mapping in page_pool */ + pp_params.pool_size = pool_size; + pp_params.nid =
[bpf-next V6 PATCH 09/15] mlx5: register a memory model when XDP is enabled
Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame. This enables a different memory model, thus activating another code path in the xdp_return_frame API. V2: Fixed issues pointed out by Tariq.

Signed-off-by: Jesper Dangaard Brouer
Reviewed-by: Tariq Toukan
--- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 8  1 file changed, 8 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index da94c8cba5ee..2e4ca0f15b62 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -506,6 +506,14 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, rq->mkey_be = c->mkey_be; } + /* This must only be activate for order-0 pages */ + if (rq->xdp_prog) { + err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, +MEM_TYPE_PAGE_ORDER0, NULL); + if (err) + goto err_rq_wq_destroy; + } + for (i = 0; i < wq_sz; i++) { struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
[bpf-next V6 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
Use the IDA infrastructure for getting a cyclic increasing ID number, that is used for keeping track of each registered allocator per RX-queue xdp_rxq_info. Instead of using the IDR infrastructure, which uses a radix tree, use a dynamic rhashtable for creating the ID-to-pointer lookup table, because this is faster.

The problem being solved here is that the xdp_rxq_info pointer (stored in xdp_buff) cannot be used directly, as its guaranteed lifetime is too short. The info is needed on a (potentially) remote CPU during DMA-TX completion time. In an xdp_frame the xdp_mem_info is stored, when it got converted from an xdp_buff, which is sufficient for the simple page refcnt based recycle schemes.

For more advanced allocators there is a need to store a pointer to the registered allocator. Thus, there is a need to guard the lifetime or validity of the allocator pointer, which is done through this rhashtable ID-to-pointer map. The removal and validity of the allocator and helper struct xdp_mem_allocator is guarded by RCU. The allocator will be created by the driver, and registered with xdp_rxq_info_reg_mem_model(). It is up for debate who is responsible for freeing the allocator pointer or invoking the allocator destructor function. In any case, this must happen via RCU freeing.

V4: Per req of Jason Wang - Use xdp_rxq_info_reg_mem_model() in all drivers implementing XDP_REDIRECT, even though it's not strictly necessary when allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).
V6: Per req of Alex Duyck - Introduce rhashtable_lookup() call in later patch

Signed-off-by: Jesper Dangaard Brouer
--- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 9 + drivers/net/tun.c | 6 + drivers/net/virtio_net.c | 7 + include/net/xdp.h | 14 -- net/core/xdp.c | 223 - 5 files changed, 241 insertions(+), 18 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 45520eb503ee..ff069597fccf 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -6360,7 +6360,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter, struct device *dev = rx_ring->dev; int orig_node = dev_to_node(dev); int ring_node = -1; - int size; + int size, err; size = sizeof(struct ixgbe_rx_buffer) * rx_ring->count; @@ -6397,6 +6397,13 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter, rx_ring->queue_index) < 0) goto err; + err = xdp_rxq_info_reg_mem_model(&rx_ring->xdp_rxq, +MEM_TYPE_PAGE_SHARED, NULL); + if (err) { + xdp_rxq_info_unreg(&rx_ring->xdp_rxq); + goto err; + } + rx_ring->xdp_prog = adapter->xdp_prog; return 0; diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 6750980d9f30..81fddf9cc58f 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -846,6 +846,12 @@ static int tun_attach(struct tun_struct *tun, struct file *file, tun->dev, tfile->queue_index); if (err < 0) goto out; + err = xdp_rxq_info_reg_mem_model(&tfile->xdp_rxq, +MEM_TYPE_PAGE_SHARED, NULL); + if (err < 0) { + xdp_rxq_info_unreg(&tfile->xdp_rxq); + goto out; + } err = 0; } diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 6c4220450506..48c86accd3b8 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1312,6 +1312,13 @@ static int virtnet_open(struct net_device *dev) if (err < 0) return err; + err = xdp_rxq_info_reg_mem_model(&vi->rq[i].xdp_rxq, +MEM_TYPE_PAGE_SHARED, NULL); + if (err < 0) { + xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq); + return err; + } +
virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi); virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi); } diff --git a/include/net/xdp.h b/include/net/xdp.h index 33ecf87eada0..859aa9b737fe 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -41,6 +41,7 @@ enum mem_type { struct xdp_mem_info { u32 type; /* enum mem_type, but known size type */ + u32 id; }; struct xdp_rxq_info { @@ -99,18 +100,7 @@ struct xdp_frame *convert_to_xdp_frame(struct
[bpf-next V6 PATCH 08/15] bpf: cpumap convert to use generic xdp_frame
The generic xdp_frame format was inspired by cpumap's own internal xdp_pkt format. It is now time to convert it over to the generic xdp_frame format. The cpumap needs one extra field, dev_rx.

Signed-off-by: Jesper Dangaard Brouer
--- include/net/xdp.h | 1 + kernel/bpf/cpumap.c | 100 ++- 2 files changed, 29 insertions(+), 72 deletions(-) diff --git a/include/net/xdp.h b/include/net/xdp.h index 1c69c07ee929..33ecf87eada0 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -67,6 +67,7 @@ struct xdp_frame { * while mem info is valid on remote CPU. */ struct xdp_mem_info mem; + struct net_device *dev_rx; /* used by cpumap */ }; /* Convert xdp_buff to xdp_frame */ diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 3e4bbcbe3e86..bcdc4dea5ce7 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -159,52 +159,8 @@ static void cpu_map_kthread_stop(struct work_struct *work) kthread_stop(rcpu->kthread); } -/* For now, xdp_pkt is a cpumap internal data structure, with info - * carried between enqueue to dequeue. It is mapped into the top - * headroom of the packet, to avoid allocating separate mem. - */ -struct xdp_pkt { - void *data; - u16 len; - u16 headroom; - u16 metasize; - /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time, -* while mem info is valid on remote CPU. -*/ - struct xdp_mem_info mem; - struct net_device *dev_rx; -}; - -/* Convert xdp_buff to xdp_pkt */ -static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp) -{ - struct xdp_pkt *xdp_pkt; - int metasize; - int headroom; - - /* Assure headroom is available for storing info */ - headroom = xdp->data - xdp->data_hard_start; - metasize = xdp->data - xdp->data_meta; - metasize = metasize > 0 ?
metasize : 0; - if (unlikely((headroom - metasize) < sizeof(*xdp_pkt))) - return NULL; - - /* Store info in top of packet */ - xdp_pkt = xdp->data_hard_start; - - xdp_pkt->data = xdp->data; - xdp_pkt->len = xdp->data_end - xdp->data; - xdp_pkt->headroom = headroom - sizeof(*xdp_pkt); - xdp_pkt->metasize = metasize; - - /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */ - xdp_pkt->mem = xdp->rxq->mem; - - return xdp_pkt; -} - static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, -struct xdp_pkt *xdp_pkt) +struct xdp_frame *xdpf) { unsigned int frame_size; void *pkt_data_start; @@ -219,7 +175,7 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, * would be preferred to set frame_size to 2048 or 4096 * depending on the driver. * frame_size = 2048; -* frame_len = frame_size - sizeof(*xdp_pkt); +* frame_len = frame_size - sizeof(*xdp_frame); * * Instead, with info avail, skb_shared_info in placed after * packet len. This, unfortunately fakes the truesize. @@ -227,21 +183,21 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, * is not at a fixed memory location, with mixed length * packets, which is bad for cache-line hotness. 
*/ - frame_size = SKB_DATA_ALIGN(xdp_pkt->len) + xdp_pkt->headroom + + frame_size = SKB_DATA_ALIGN(xdpf->len) + xdpf->headroom + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - pkt_data_start = xdp_pkt->data - xdp_pkt->headroom; + pkt_data_start = xdpf->data - xdpf->headroom; skb = build_skb(pkt_data_start, frame_size); if (!skb) return NULL; - skb_reserve(skb, xdp_pkt->headroom); - __skb_put(skb, xdp_pkt->len); - if (xdp_pkt->metasize) - skb_metadata_set(skb, xdp_pkt->metasize); + skb_reserve(skb, xdpf->headroom); + __skb_put(skb, xdpf->len); + if (xdpf->metasize) + skb_metadata_set(skb, xdpf->metasize); /* Essential SKB info: protocol and skb->dev */ - skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx); + skb->protocol = eth_type_trans(skb, xdpf->dev_rx); /* Optional SKB info, currently missing: * - HW checksum info (skb->ip_summed) @@ -259,11 +215,11 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring) * invoked cpu_map_kthread_stop(). Catch any broken behaviour * gracefully and warn once. */ - struct xdp_pkt *xdp_pkt; + struct xdp_frame *xdpf; - while ((xdp_pkt = ptr_ring_consume(ring))) - if (WARN_ON_ONCE(xdp_pkt)) - xdp_return_frame(xdp_pkt, _pkt->mem); + while ((xdpf = ptr_ring_consume(ring))) + if
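The scheme both xdp_pkt and xdp_frame use — stashing the frame's metadata in the packet's own headroom instead of allocating it separately — can be sketched in plain user-space C. All names here are illustrative stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for xdp_frame: it lives at the very top of the
 * buffer's headroom, so no separate allocation is needed. */
struct frame_meta {
	void *data;              /* start of packet payload */
	unsigned short len;      /* payload length */
	unsigned short headroom; /* headroom left below this struct */
};

/* Mirrors what convert_to_xdp_frame() does: fail when the headroom
 * cannot hold the metadata, otherwise write it at hard_start. */
struct frame_meta *to_frame(void *hard_start, void *data, void *data_end)
{
	size_t headroom = (char *)data - (char *)hard_start;

	if (headroom < sizeof(struct frame_meta))
		return NULL; /* no room to stash the metadata */

	struct frame_meta *f = hard_start;

	f->data = data;
	f->len = (unsigned short)((char *)data_end - (char *)data);
	f->headroom = (unsigned short)(headroom - sizeof(*f));
	return f;
}
```

The consumer on the remote CPU can then recover everything it needs from the buffer itself, which is exactly why the patch can drop the cpumap-private struct.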
[bpf-next V6 PATCH 07/15] virtio_net: convert to use generic xdp_frame and xdp_return_frame API
The virtio_net driver assumes XDP frames are always released based on page refcnt (via put_page). Thus, it only queues the XDP data pointer address and uses virt_to_head_page() to retrieve the struct page.

Use the XDP return API to get away from such assumptions. Instead queue an xdp_frame, which allows us to use the xdp_return_frame API when releasing the frame.

Signed-off-by: Jesper Dangaard Brouer
---
 drivers/net/virtio_net.c | 31 +--
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 23374603e4d9..6c4220450506 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -419,30 +419,41 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi, struct xdp_buff *xdp)
 {
 	struct virtio_net_hdr_mrg_rxbuf *hdr;
-	unsigned int len;
+	struct xdp_frame *xdpf, *xdpf_sent;
 	struct send_queue *sq;
+	unsigned int len;
 	unsigned int qp;
-	void *xdp_sent;
 	int err;
 
 	qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + smp_processor_id();
 	sq = &vi->sq[qp];
 
 	/* Free up any pending old buffers before queueing new ones. */
-	while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		struct page *sent_page = virt_to_head_page(xdp_sent);
+	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
+		xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
 
-		put_page(sent_page);
-	}
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	/* virtqueue want to use data area in-front of packet */
+	if (unlikely(xdpf->metasize > 0))
+		return -EOPNOTSUPP;
+
+	if (unlikely(xdpf->headroom < vi->hdr_len))
+		return -EOVERFLOW;
 
-	xdp->data -= vi->hdr_len;
+	/* Make room for virtqueue hdr (also change xdpf->headroom?) */
+	xdpf->data -= vi->hdr_len;
 	/* Zero header and leave csum up to XDP layers */
-	hdr = xdp->data;
+	hdr = xdpf->data;
 	memset(hdr, 0, vi->hdr_len);
+	hdr->hdr.hdr_len = xdpf->len; /* Q: is this needed? */
+	xdpf->len += vi->hdr_len;
 
-	sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
+	sg_init_one(sq->sg, xdpf->data, xdpf->len);
 
-	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp->data, GFP_ATOMIC);
+	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdpf, GFP_ATOMIC);
 	if (unlikely(err))
 		return false; /* Caller handle free/refcnt */
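The headroom checks added above guard a common operation: moving the data pointer backwards to prepend a device header in front of the payload. A user-space sketch of that pattern (the function and names are illustrative, not the driver's API):

```c
#include <assert.h>
#include <string.h>

/* Move the payload pointer back by hdr_len and zero the new header,
 * the way __virtnet_xdp_xmit() adjusts xdpf->data and xdpf->len.
 * Returns -1 (standing in for -EOVERFLOW) when the headroom is too
 * small to hold the header. */
int push_header(unsigned char **data, unsigned int *len,
		unsigned int headroom, unsigned int hdr_len)
{
	if (headroom < hdr_len)
		return -1;      /* would underflow the buffer */
	*data -= hdr_len;       /* header lives in former headroom */
	memset(*data, 0, hdr_len);
	*len += hdr_len;        /* header now counts toward frame length */
	return 0;
}
```

Checking before moving the pointer is the whole point: once metadata or insufficient headroom is detected, the frame must be rejected rather than corrupted.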
RE: [RFC PATCH v2] net: phy: Added device tree binding for dev-addr and dev-addr code check-up
> -Original Message- > From: Florian Fainelli [mailto:f.faine...@gmail.com] > Sent: Tuesday, March 27, 2018 1:45 AM > To: Vicenţiu Galanopulo; > netdev@vger.kernel.org; linux-ker...@vger.kernel.org; robh...@kernel.org; > mark.rutl...@arm.com; da...@davemloft.net; mar...@holtmann.org; > devicet...@vger.kernel.org > Cc: Madalin-cristian Bucur ; Alexandru Marginean > > Subject: Re: [RFC PATCH v2] net: phy: Added device tree binding for dev-addr > and dev-addr code check-up > > On 03/23/2018 08:05 AM, Vicentiu Galanopulo wrote: > > Reason for this patch is that the Inphi PHY has a vendor specific > > address space for accessing the > > C45 MDIO registers - starting from 0x1e. > > > > A search of the dev-addr property is done in of_mdiobus_register. > > If the property is found in the PHY node, > > of_mdiobus_register_static_phy is called. This is a wrapper function > > for of_mdiobus_register_phy which finds the device in package based on > > dev-addr and fills devices_addrs: > > devices_addrs is a new field added to phy_c45_device_ids. > > This new field will store the dev-addr property on the same index > > where the device in package has been found. > > In order to have dev-addr in get_phy_c45_ids(), mdio_c45_ids is passed > > from of_mdio.c to phy_device.c as an external variable. > > In get_phy_device a copy of the mdio_c45_ids is done over the local > > c45_ids (wich are empty). After the copying, the c45_ids will also > > contain the static device found from dev-addr. > > Having dev-addr stored in devices_addrs, in get_phy_c45_ids(), when > > probing the identifiers, dev-addr can be extracted from devices_addrs > > and probed if devices_addrs[current_identifier] is not 0. > > This way changing the kernel API is avoided completely. > > > > As a plus to this patch, num_ids in get_phy_c45_ids, has the value 8 > > (ARRAY_SIZE(c45_ids->device_ids)), > > but the u32 *devs can store 32 devices in the bitfield. 
> > If a device is stored in *devs, in bits 32 to 9, it will not be found.
> > This is the reason for changing in phy.h, the size of device_ids
> > array.
> >
> > Signed-off-by: Vicentiu Galanopulo
> > ---
>
> Correct me if I am completely misunderstanding the problem, but have you
> considered doing the following:
>
> - if all you need is to replace instances of loops that do:
>
> 	if (phydev->is_c45) {
> 		for (i = 1; i < num_ids; i++) {
> 			if (!(phydev->c45_ids.devices_in_package & (1 << i)))
> 				continue;
>
> with one that starts at dev-addr, as specified by Device Tree, then I suspect
> there is an easier way to do what you want rather than your fairly intrusive
> patch to do that:

Sorry for the lengthy comment, and sorry if this is redundant; I'm just trying to explain my patch as best I can.

The loop goes through devices_in_package, and where it finds a bit that is set, it continues to get the PHY device ID. But to have devices_in_package with those bits set, a previous query of the MDIO_DEVS2 and MDIO_DEVS1 registers is necessary. And in that MDIO bus query, dev-addr, from the device tree, is part of the whole register address:

reg_addr = MII_ADDR_C45 | dev_addr << 16 | MDIO_DEVS2;

Andrew suggested in his first comment that the device tree lookup could be done in of_mdiobus_register(), probably because of the loop which already checks the properties of the PHY. My understanding of his comments was that I could propagate dev-addr, as a parameter, down the call hierarchy: of_mdiobus_register_phy() -> get_phy_device() -> get_phy_id() -> get_phy_c45_ids() -> get_phy_c45_devs_in_pkg() (the dev_addr parameter existed here).

This is where the loop you're mentioning needed some altering, because the loop index is actually the dev_addr parameter from get_phy_c45_devs_in_pkg(), and it runs from 1 to 7 (num_ids = 8). This worked for the other PHYs which had dev-addr in this range, but it doesn't work for the Inphi PHY, which has dev_addr = 30 (0x1e).

After I did the extension of device_ids from 8 to 32 to match the devs bitfield (u32 *devs = &c45_ids->devices_in_package;) in get_phy_c45_ids():

- u32 device_ids[8];
+ u32 device_ids[32];

I realized that dev_addr for other PHY vendors could be larger than 31 (it is just a coincidence that for the Inphi PHY it is 30 and it fits), and that dev-addr should be a separate parameter that still has to match the bit position in *devs (&c45_ids->devices_in_package).

So I didn't have to change the loop to start from dev-addr, just let it check whether the bit is set in *devs (as before), and if that bit corresponds to the Inphi PHY (dev-addr has been set in the PHY device tree node), use dev-addr when getting the PHY ID. The loop start index has to remain the same because other PHY vendors have dev-addr 1 to 7, and PHY discovery succeeds. For other issues that I had with
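For illustration only (this is not the kernel helper), the devices-in-package scan under discussion amounts to walking a 32-bit mask; widening device_ids to 32 entries is what lets an address like the Inphi's 30 (0x1e) be representable at all:

```c
#include <assert.h>

/* Return the first set device address in a C45 devices-in-package
 * bitmask, scanning all 32 possible positions rather than only 1..7,
 * or -1 if none is set. Illustrative sketch, not the kernel code. */
int first_device_in_package(unsigned int devs)
{
	for (int i = 1; i < 32; i++)
		if (devs & (1u << i))
			return i;
	return -1;
}
```

With the old 8-entry limit, a device announcing itself only at bit 30 would simply never be visited by the loop.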
Re: [net-next 03/15] net/mlx5e: PFC stall prevention support
On 25-Mar-18 19:18, Andrew Lunn wrote: >>> Shouldn't you map a value of MLX5E_PFC_PREVEN_AUTO_TOUT_MSEC back to >>> PFC_STORM_PREVENTION_AUTO? >> >> We discussed this point internally, mapping MLX5E_PFC_PREVEN_AUTO_TOUT_MSEC >> (100) to >> PFC_STORM_PREVENTION_AUTO might cause confusion when the user explicitly >> asks for 100msec timeout >> and gets auto in his following query. >> Also, this way the "auto" timeout is visible to the user, which might help >> him get an initial >> clue of which values are recommended. > > Yes, this is a fair point, which is why i asked the question. Either > way, it can cause confusion. 'I configured it to auto, but it always > returns 100, not auto.' > > Whatever is decided, it should be consistent across drivers. So please > add some documentation to the ethtool header file about what is > expected. We didn't want to limit other drivers implementation, but I agree that consistency across drivers is important in this case. We will find a proper place to document it. > > Andrew >
Re: [PATCH bpf-next] bpf: sockmap: initialize sg table entries properly
On 3/27/2018 12:15 PM, John Fastabend wrote:
> On 03/25/2018 11:54 PM, Prashant Bhole wrote:
>> When CONFIG_DEBUG_SG is set, sg->sg_magic is initialized to SG_MAGIC,
>> when sg table is initialized using sg_init_table(). Magic is checked
>> while navigating the scatterlist. We hit BUG_ON when magic check is
>> failed.
>>
>> Fixed following things:
>> - Initialization of sg table in bpf_tcp_sendpage() was missing,
>>   initialized it using sg_init_table()
>>
>> - bpf_tcp_sendmsg() initializes sg table using sg_init_table() before
>>   entering the loop, but further consumed sg entries are initialized
>>   using memset. Fixed it by replacing memset with sg_init_table() in
>>   function bpf_tcp_push()
>>
>> Signed-off-by: Prashant Bhole
>> ---
>>  kernel/bpf/sockmap.c | 11 +++
>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
>> index 69c5bccabd22..8a848a99d768 100644
>> --- a/kernel/bpf/sockmap.c
>> +++ b/kernel/bpf/sockmap.c
>> @@ -312,7 +312,7 @@ static int bpf_tcp_push(struct sock *sk, int apply_bytes,
>>  		md->sg_start++;
>>  		if (md->sg_start == MAX_SKB_FRAGS)
>>  			md->sg_start = 0;
>> -		memset(sg, 0, sizeof(*sg));
>> +		sg_init_table(sg, 1);
>
> Looks OK here.
>
>>  		if (md->sg_start == md->sg_end)
>>  			break;
>> @@ -763,10 +763,14 @@ static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
>>  	lock_sock(sk);
>>
>> -	if (psock->cork_bytes)
>> +	if (psock->cork_bytes) {
>>  		m = psock->cork;
>> -	else
>> +		sg = &m->sg_data[m->sg_end];
>> +	} else {
>>  		m = &md;
>> +		sg = m->sg_data;
>> +		sg_init_table(sg, MAX_SKB_FRAGS);
>> +	}
>
> sg_init_table() does an unnecessary memset() though. We
> probably either want a new scatterlist API or just open
> code this,
>
> #ifdef CONFIG_DEBUG_SG
> {
> 	unsigned int i;
> 	for (i = 0; i < nents; i++)
> 		sgl[i].sg_magic = SG_MAGIC;
> }

Similar sg_init_table() is present in bpf_tcp_sendmsg(). I agree that it causes an unnecessary memset, but I don't agree with the open-coded fix. I am still with V1. Thanks.

-Prashant
Re: [PATCH 4/4] selftests/bpf: fix compiling errors
On 03/27/2018 05:06 AM, Du, Changbin wrote: > On Mon, Mar 26, 2018 at 08:02:30PM -0700, Alexei Starovoitov wrote: >> On Tue, Mar 27, 2018 at 10:20:10AM +0800, Du, Changbin wrote: >>> On Mon, Mar 26, 2018 at 07:55:13AM -0700, Alexei Starovoitov wrote: On Mon, Mar 26, 2018 at 05:23:28PM +0800, changbin...@intel.com wrote: > Signed-off-by: Changbin Du> --- > tools/testing/selftests/bpf/Makefile | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/tools/testing/selftests/bpf/Makefile > b/tools/testing/selftests/bpf/Makefile > index 5c43c18..dc0fdc8 100644 > --- a/tools/testing/selftests/bpf/Makefile > +++ b/tools/testing/selftests/bpf/Makefile > @@ -10,7 +10,8 @@ ifneq ($(wildcard $(GENHDR)),) >GENFLAGS := -DHAVE_GENHDR > endif > > -CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) > -I../../../include > +CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) \ > + -I../../../include -I../../../../usr/include > LDLIBS += -lcap -lelf -lrt -lpthread > > # Order correspond to 'make run_tests' order > @@ -62,7 +63,7 @@ else >CPU ?= generic > endif > > -CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \ > +CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi > -I../../../../usr/include \ > -Wno-compare-distinct-pointer-types Nack. I suspect that will break the build for everyone else who's doing it in the directory itself instead of the outer one. >>> >>> This one? But I didn't see any problem. >> >> because the build was lucky and additional path ../../../../usr/include >> didn't point >> to a valid dir? Agree. > I am sorry but I don't understand why you mean *lucky*. Of cause, the path is > valid. The problem is that this suddenly requires users to do a 'make headers_install' in order to populate usr/include/ directory in the first place. While it's annoying enough for BPF samples where this is needed, I absolutely don't want to introduce this for BPF kselftests. It's the wrong approach. 
Besides, in tools infra, there is a tools/arch/*/include/uapi/asm/bitsperlong.h header copy already, so we really need to use that instead. Please adapt your patch accordingly and respin. Please also Cc us and netdev@vger.kernel.org for BPF kselftests changes. Thanks, Daniel
Re: [next] rsi: remove redundant duplicate assignment of buffer_size
Colin Ian King wrote:

> From: Colin Ian King
>
> Variable buffer_size is re-assigned the same value, this duplicated
> assignment is redundant and can be removed.
>
> Cleans up clang warning:
> drivers/net/wireless/rsi/rsi_91x_usb.c:140:4: warning: Value stored
> to 'buffer_size' is never read
>
> Signed-off-by: Colin Ian King

Patch applied to wireless-drivers-next.git, thanks.

a31f9314114c rsi: remove redundant duplicate assignment of buffer_size

-- 
https://patchwork.kernel.org/patch/10282247/
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: rtlwifi: rtl8821ae: fix spelling mistake: "Aboslute" -> "Absolute"
Colin Ian King wrote:

> From: Colin Ian King
>
> Trivial fix to spelling mistake in RT_TRACE message text.
>
> Signed-off-by: Colin Ian King
> Acked-by: Larry Finger

"Absolute"

-- 
https://patchwork.kernel.org/patch/10292111/
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH v2 bpf-next] bpf, tracing: unbreak lttng
On 03/27/2018 02:07 AM, Steven Rostedt wrote: > On Mon, 26 Mar 2018 16:02:20 -0700 > Alexei Starovoitovwrote: > >> for_each_kernel_tracepoint() is used by out-of-tree lttng module >> and therefore cannot be changed. > > This is false and misleading. NACK. Steven, while I really don't care about this particular function, you wrote a few emails ago, quote: Look, the tracepoint code was written by Mathieu for LTTng, and perf and ftrace were able to benefit because of it, as well as your bpf code. For this, we agreed to keep this function around for his use, as its the only thing he requires. Everyone has been fine with that. [...] Having that function for LTTng does not hurt us. And I will NACK removing it. So later saying that "this is false and misleading" is kind of misleading by itself. ;-) Anyway, it would probably make sense to add a comment to for_each_kernel_tracepoint() that this is used by LTTng so that people don't accidentally remove it due to missing in-tree user. I would think some sort of clarification/background in a comment or such would have avoided the whole confusion and resulting discussion around this in the first place. Btw, in networking land, as soon as there is no in-tree user for a particular kernel function, it will get ripped out, no matter what. Given this is also the typical convention in the kernel, it may have caused some confusion with above preference. Anyway, given v6 is out now, I've tossed the old series from bpf-next tree. So I hope we can all move on with some more constructive discussion. :-) Thanks everyone, Daniel
Re: [PATCH bpf-next] bpf: sockmap: initialize sg table entries properly
On 03/27/2018 10:41 AM, Prashant Bhole wrote:
> On 3/27/2018 12:15 PM, John Fastabend wrote:
>> On 03/25/2018 11:54 PM, Prashant Bhole wrote:
>>> When CONFIG_DEBUG_SG is set, sg->sg_magic is initialized to SG_MAGIC,
>>> when sg table is initialized using sg_init_table(). Magic is checked
>>> while navigating the scatterlist. We hit BUG_ON when magic check is
>>> failed.
>>>
>>> Fixed following things:
>>> - Initialization of sg table in bpf_tcp_sendpage() was missing,
>>>   initialized it using sg_init_table()
>>>
>>> - bpf_tcp_sendmsg() initializes sg table using sg_init_table() before
>>>   entering the loop, but further consumed sg entries are initialized
>>>   using memset. Fixed it by replacing memset with sg_init_table() in
>>>   function bpf_tcp_push()
>>>
>>> Signed-off-by: Prashant Bhole
>>> ---
>>>  kernel/bpf/sockmap.c | 11 +++
>>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
>>> index 69c5bccabd22..8a848a99d768 100644
>>> --- a/kernel/bpf/sockmap.c
>>> +++ b/kernel/bpf/sockmap.c
>>> @@ -312,7 +312,7 @@ static int bpf_tcp_push(struct sock *sk, int apply_bytes,
>>>  		md->sg_start++;
>>>  		if (md->sg_start == MAX_SKB_FRAGS)
>>>  			md->sg_start = 0;
>>> -		memset(sg, 0, sizeof(*sg));
>>> +		sg_init_table(sg, 1);
>>
>> Looks OK here.
>>
>>>  		if (md->sg_start == md->sg_end)
>>>  			break;
>>> @@ -763,10 +763,14 @@ static int bpf_tcp_sendpage(struct sock *sk, struct page *page,
>>>  	lock_sock(sk);
>>>
>>> -	if (psock->cork_bytes)
>>> +	if (psock->cork_bytes) {
>>>  		m = psock->cork;
>>> -	else
>>> +		sg = &m->sg_data[m->sg_end];
>>> +	} else {
>>>  		m = &md;
>>> +		sg = m->sg_data;
>>> +		sg_init_table(sg, MAX_SKB_FRAGS);
>>> +	}
>>
>> sg_init_table() does an unnecessary memset() though. We
>> probably either want a new scatterlist API or just open
>> code this,
>>
>> #ifdef CONFIG_DEBUG_SG
>> {
>> 	unsigned int i;
>> 	for (i = 0; i < nents; i++)
>> 		sgl[i].sg_magic = SG_MAGIC;
>> }
>
> Similar sg_init_table() is present in bpf_tcp_sendmsg().
> I agree that it causes unnecessary memset, but I don't agree with open coded fix.

But then let's fix it properly and add a static inline helper to the include/linux/scatterlist.h header like ...

static inline void sg_init_debug_marker(struct scatterlist *sgl,
					unsigned int nents)
{
#ifdef CONFIG_DEBUG_SG
	unsigned int i;

	for (i = 0; i < nents; i++)
		sgl[i].sg_magic = SG_MAGIC;
#endif
}

... and reuse it in all the places that would otherwise open-code this, as well as sg_init_table():

void sg_init_table(struct scatterlist *sgl, unsigned int nents)
{
	memset(sgl, 0, sizeof(*sgl) * nents);
	sg_init_debug_marker(sgl, nents);
	sg_mark_end(&sgl[nents - 1]);
}

This would be a lot cleaner than having this duplicated in various places.

Thanks,
Daniel
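The helper being proposed is an instance of a general debug-poisoning pattern: stamp a magic value into each element only when a debug config is enabled, so later integrity checks can catch corrupted or uninitialized entries without paying any cost in production builds. A standalone sketch (the struct layout and the DEBUG_SG switch here are illustrative, not the real scatterlist):

```c
#include <assert.h>

#define DEBUG_SG 1
#define SG_MAGIC 0x87654321UL

/* Simplified stand-in for struct scatterlist. */
struct scatterlist_sketch {
	unsigned long sg_magic;
	unsigned long page_link;
	unsigned int length;
};

/* Stamp the debug marker into every entry, without the full memset
 * that sg_init_table() performs. */
void sg_init_debug_marker_sketch(struct scatterlist_sketch *sgl,
				 unsigned int nents)
{
#if DEBUG_SG
	for (unsigned int i = 0; i < nents; i++)
		sgl[i].sg_magic = SG_MAGIC;
#endif
}
```

A validation path can then `assert(sg->sg_magic == SG_MAGIC)` before walking an entry, which is exactly the BUG_ON that the original report tripped over.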
Re: [PATCH net] vhost: correctly remove wait queue during poll failure
On 27 March 2018 17:28, Darren Kenny wrote:
> Hi Jason,
>
> On Tue, Mar 27, 2018 at 11:47:22AM +0800, Jason Wang wrote:
>> We tried to remove vq poll from wait queue, but do not check whether
>> or not it was in a list before. This will lead double free. Fixing
>> this by checking poll->wqh to make sure it was in a list.
>
> This text seems at odds with the code below, instead of checking
> poll->wqh, you are removing that check...
>
> Maybe the text needs rewording?

Yes, I admit it's bad, thanks for pointing out. How about:

"Fixing this by switching to use vhost_poll_stop() which zeros poll->wqh after removing poll from waitqueue to make sure it won't be freed twice."

Thanks

> Thanks,
>
> Darren.
>
>> Reported-by: syzbot+c0272972b01b872e6...@syzkaller.appspotmail.com
>> Fixes: 2b8b328b61c79 ("vhost_net: handle polling errors when setting backend")
>> Signed-off-by: Jason Wang
>> ---
>>  drivers/vhost/vhost.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 1b3e8d2d..5d5a9d9 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -212,8 +212,7 @@ int vhost_poll_start(struct vhost_poll *poll, struct file *file)
>>  	if (mask)
>>  		vhost_poll_wakeup(&poll->wait, 0, 0, poll_to_key(mask));
>>  	if (mask & EPOLLERR) {
>> -		if (poll->wqh)
>> -			remove_wait_queue(poll->wqh, &poll->wait);
>> +		vhost_poll_stop(poll);
>>  		ret = -EINVAL;
>>  	}
>> --
>> 2.7.4
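The fix works because of a classic teardown idiom: the stop helper clears the handle it tears down, so calling it twice — or after a failed start — becomes a harmless no-op instead of a double remove/free. A minimal sketch with illustrative types (free() stands in for remove_wait_queue(); this is not the vhost code itself):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for struct vhost_poll. */
struct poll_sketch {
	void *wqh; /* non-NULL while registered on a wait queue */
};

/* Mirrors the idea behind vhost_poll_stop(): only detach when
 * registered, then clear wqh so a repeated stop cannot detach (or
 * free) the same resource twice. */
void poll_stop_sketch(struct poll_sketch *p)
{
	if (p->wqh) {
		free(p->wqh); /* stands in for remove_wait_queue() */
		p->wqh = NULL;
	}
}
```

Centralizing the "check, detach, clear" sequence in one helper is what removes the open-coded `if (poll->wqh)` from the error path.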
Re: [PATCH net-next nfs 1/6] net: Convert rpcsec_gss_net_ops
On 26.03.2018 21:36, J. Bruce Fields wrote:
> On Fri, Mar 23, 2018 at 02:53:34PM -0400, Anna Schumaker wrote:
>> On 03/13/2018 06:49 AM, Kirill Tkhai wrote:
>>> These pernet_operations initialize and destroy sunrpc_net_id refered
>>> per-net items. Only used global list is cache_list, and accesses
>>> already serialized.
>>>
>>> sunrpc_destroy_cache_detail() check for list_empty() without
>>> cache_list_lock, but when it's called from
>>> unregister_pernet_subsys(), there can't be callers in parallel, so
>>> we won't miss list_empty() in this case.
>>>
>>> Signed-off-by: Kirill Tkhai
>>
>> It might make sense to take these and the other NFS patches through
>> the net tree, since the pernet_operations don't yet have the async
>> field in my tree (and I therefore can't compile once these are
>> applied).
>
> Ditto for the nfsd patch, so, for what it's worth:
>
> Acked-by: J. Bruce Fields
>
> for that patch.
>
> --b.

Thanks, Bruce.

Kirill
[PATCH] vhost-net: add time limitation for tx polling
handle_tx() will delay rx for a long time when busy tx polling udp packets with short length (i.e. 1-byte udp payload), because the VHOST_NET_WEIGHT limit takes into account only sent bytes, not time. It's not fair to handle_rx(), so the maximum time of tx polling needs to be limited as well.

Signed-off-by: Haibin Zhang
---
 drivers/vhost/net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8139bc70ad7d..dc9218a3a75b 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -473,6 +473,7 @@ static void handle_tx(struct vhost_net *net)
 	struct socket *sock;
 	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
 	bool zcopy, zcopy_used;
+	unsigned long start = jiffies;
 
 	mutex_lock(&vq->mutex);
 	sock = vq->private_data;
@@ -580,7 +581,7 @@ static void handle_tx(struct vhost_net *net)
 		else
 			vhost_zerocopy_signal_used(net, vq);
 		vhost_net_tx_packet(net);
-		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
+		if (unlikely(total_len >= VHOST_NET_WEIGHT) || unlikely(jiffies - start >= 1)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
-- 
2.12.3
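The shape of the resulting loop — bounded by both a byte budget and an elapsed-time budget, whichever trips first — can be sketched in user space. Names and limits here are illustrative, not the vhost code:

```c
#include <assert.h>
#include <time.h>

/* Test feed used by the assertions below: a stream of 1-byte items
 * that dries up after 1000 items. */
long seq;
int one_byte_item(void)
{
	return seq++ < 1000 ? 1 : 0;
}

/* Process items until either the byte budget or the time budget is
 * exhausted; returns the number of items handled. The time check is
 * what keeps one busy direction (many tiny packets) from starving
 * the other. */
int run_bounded(int (*next_len)(void), long byte_budget, double sec_budget)
{
	clock_t start = clock();
	long total = 0;
	int handled = 0;
	int len;

	while ((len = next_len()) > 0) {
		total += len;
		handled++;
		if (total >= byte_budget ||
		    (double)(clock() - start) / CLOCKS_PER_SEC >= sec_budget)
			break; /* yield: reschedule instead of looping on */
	}
	return handled;
}
```

With a pure byte budget, tiny payloads allow enormous iteration counts before the budget trips; adding the time bound caps the wall-clock damage regardless of payload size.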
[PATCH net] lan78xx: Crash in lan78xx_writ_reg (Workqueue: events lan78xx_deferred_multicast_write)
Description: Crash was reported with syzkaller pointing to the lan78xx_write_reg routine.

Root cause: Proper cleanup of workqueues and init/setup routines was not happening in failure conditions.

Fix: Handle the error conditions by cleaning up the queues and init/setup routines.

Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
Reported-by: Andrey Konovalov
Signed-off-by: Raghuram Chary J
---
 drivers/net/usb/lan78xx.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 60a604cc7647..11176070b345 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2863,8 +2863,7 @@ static int lan78xx_bind(struct lan78xx_net *dev, struct usb_interface *intf)
 	if (ret < 0) {
 		netdev_warn(dev->net,
 			    "lan78xx_setup_irq_domain() failed : %d", ret);
-		kfree(pdata);
-		return ret;
+		goto out1;
 	}
 
 	dev->net->hard_header_len += TX_OVERHEAD;
@@ -2872,14 +2871,32 @@ static int lan78xx_bind(struct lan78xx_net *dev, struct usb_interface *intf)
 
 	/* Init all registers */
 	ret = lan78xx_reset(dev);
+	if (ret) {
+		netdev_warn(dev->net, "Registers INIT FAILED");
+		goto out2;
+	}
 
 	ret = lan78xx_mdio_init(dev);
+	if (ret) {
+		netdev_warn(dev->net, "MDIO INIT FAILED.");
+		goto out2;
+	}
 
 	dev->net->flags |= IFF_MULTICAST;
 
 	pdata->wol = WAKE_MAGIC;
 
 	return ret;
+
+out2:
+	lan78xx_remove_irq_domain(dev);
+
+out1:
+	netdev_warn(dev->net, "Bind routine FAILED");
+	cancel_work_sync(&pdata->set_multicast);
+	cancel_work_sync(&pdata->set_vlan);
+	kfree(pdata);
+	return ret;
 }
 
 static void lan78xx_unbind(struct lan78xx_net *dev, struct usb_interface *intf)
@@ -2891,6 +2908,8 @@ static void lan78xx_unbind(struct lan78xx_net *dev, struct usb_interface *intf)
 	lan78xx_remove_mdio(dev);
 
 	if (pdata) {
+		cancel_work_sync(&pdata->set_multicast);
+		cancel_work_sync(&pdata->set_vlan);
 		netif_dbg(dev, ifdown, dev->net, "free pdata");
 		kfree(pdata);
 		pdata = NULL;
-- 
2.16.2
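The fix follows the kernel's standard goto-unwind idiom: each failure jumps to a label that releases exactly the resources acquired so far, in reverse order. A compact sketch (resources and names are illustrative, not the driver's):

```c
#include <assert.h>
#include <stdlib.h>

/* Acquire two resources; if the second acquisition fails, unwind the
 * first before returning. Mirrors the out2:/out1: structure added to
 * lan78xx_bind(). fail_second forces the failure for testing. */
int bind_sketch(int fail_second, void **a_out, void **b_out)
{
	void *a = malloc(32);
	if (!a)
		return -1;

	void *b = fail_second ? NULL : malloc(32);
	if (!b)
		goto out1; /* undo only what succeeded so far */

	*a_out = a;
	*b_out = b;
	return 0;

out1:
	free(a);
	return -1;
}
```

The label ordering matters: a failure after N acquisitions jumps to the label that frees those N, and later labels fall through to earlier ones, so nothing is leaked and nothing is freed twice.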
Re: [RFC PATCH 00/24] Introducing AF_XDP support
On Mon, 26 Mar 2018 14:58:02 -0700 William Tu wrote:

> > Again high count for NMI ?!?
> >
> > Maybe you just forgot to tell perf that you want it to decode the
> > bpf_prog correctly?
> >
> > https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
> >
> > Enable via:
> > $ sysctl net/core/bpf_jit_kallsyms=1
> >
> > And use perf report (while BPF is STILL LOADED):
> >
> > $ perf report --kallsyms=/proc/kallsyms
> >
> > E.g. for emailing this you can use this command:
> >
> > $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms \
> >     --no-children --stdio -g none | head -n 40
>
> Thanks, I followed the steps, the result of l2fwd:
>
> # Total Lost Samples: 119
> #
> # Samples: 2K of event 'cycles:ppp'
> # Event count (approx.): 25675705627
> #
> # Overhead  CPU  Command  Shared Object     Symbol
> # ........  ...  .......  ................  ..............................
> #
>   10.48%  013  xdpsock  xdpsock           [.] main
>    9.77%  013  xdpsock  [kernel.vmlinux]  [k] clflush_cache_range
>    8.45%  013  xdpsock  [kernel.vmlinux]  [k] nmi
>    8.07%  013  xdpsock  [kernel.vmlinux]  [k] xsk_sendmsg
>    7.81%  013  xdpsock  [kernel.vmlinux]  [k] __domain_mapping
>    4.95%  013  xdpsock  [kernel.vmlinux]  [k] ixgbe_xmit_frame_ring
>    4.66%  013  xdpsock  [kernel.vmlinux]  [k] skb_store_bits
>    4.39%  013  xdpsock  [kernel.vmlinux]  [k] syscall_return_via_sysret
>    3.93%  013  xdpsock  [kernel.vmlinux]  [k] pfn_to_dma_pte
>    2.62%  013  xdpsock  [kernel.vmlinux]  [k] __intel_map_single
>    2.53%  013  xdpsock  [kernel.vmlinux]  [k] __alloc_skb
>    2.36%  013  xdpsock  [kernel.vmlinux]  [k] iommu_no_mapping
>    2.21%  013  xdpsock  [kernel.vmlinux]  [k] alloc_skb_with_frags
>    2.07%  013  xdpsock  [kernel.vmlinux]  [k] skb_set_owner_w
>    1.98%  013  xdpsock  [kernel.vmlinux]  [k] __kmalloc_node_track_caller
>    1.94%  013  xdpsock  [kernel.vmlinux]  [k] ksize
>    1.84%  013  xdpsock  [kernel.vmlinux]  [k] validate_xmit_skb_list
>    1.62%  013  xdpsock  [kernel.vmlinux]  [k] kmem_cache_alloc_node
>    1.48%  013  xdpsock  [kernel.vmlinux]  [k] __kmalloc_reserve.isra.37
>    1.21%  013  xdpsock  xdpsock           [.] xq_enq
>    1.08%  013  xdpsock  [kernel.vmlinux]  [k] intel_alloc_iova

You did use net/core/bpf_jit_kallsyms=1 and the correct perf command for decoding bpf_prog, so the perf top #3 'nmi' is likely a real NMI call... which looks wrong.

> And l2fwd under "perf stat" looks OK to me. There is little context
> switches, cpu is fully utilized, 1.17 insn per cycle seems ok.
>
> Performance counter stats for 'CPU(s) 6':
>
>         1.787420  cpu-clock (msec)         #   1.000 CPUs utilized
>               24  context-switches         #   0.002 K/sec
>                0  cpu-migrations           #   0.000 K/sec
>                0  page-faults              #   0.000 K/sec
>   22,361,333,647  cycles                   #   2.236 GHz
>   13,458,442,838  stalled-cycles-frontend  #  60.19% frontend cycles idle
>   26,251,003,067  instructions             #   1.17 insn per cycle
>                                            #   0.51 stalled cycles per insn
>    4,938,921,868  branches                 # 493.853 M/sec
>        7,591,739  branch-misses            #   0.15% of all branches
>
>     10.000835769 seconds time elapsed

This perf stat also indicates something is wrong. The 1.17 insn per cycle is NOT okay, it is too low (compared to what I usually see, e.g. 2.36 insn per cycle). It clearly says you have 'stalled-cycles-frontend' and '60.19% frontend cycles idle'. This means your CPU has issues/bottlenecks fetching instructions. Explained by Andi Kleen here [1]

[1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
[PATCH v3 0/2] of_net: Implement of_get_nvmem_mac_address helper
Posted this as a small set now, with an (optional) second patch that shows how the changes work and what I've used to test the code on a Topic Miami board.

v3:
  Add patch that implements mac in nvmem for the Cadence MACB controller
  Remove the integrated of_get_mac_address call

v2:
  Use of_nvmem_cell_get to avoid needing the associated device
  Use void* instead of char*
  Add devicetree binding doc

Mike Looijmans (2):
  of_net: Implement of_get_nvmem_mac_address helper
  net: macb: Try to retrieve MAC address from nvmem provider

 Documentation/devicetree/bindings/net/ethernet.txt |  2 ++
 drivers/net/ethernet/cadence/macb_main.c           | 12 +--
 drivers/of/of_net.c                                | 40 ++
 include/linux/of_net.h                             |  6 
 4 files changed, 57 insertions(+), 3 deletions(-)

-- 
1.9.1
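A helper like of_get_nvmem_mac_address() ultimately has to validate the bytes it reads before using them; a sketch of that validity check in plain C (illustrative, not the proposed kernel code — the kernel uses is_valid_ether_addr() for this):

```c
#include <assert.h>
#include <string.h>

/* A MAC address read from an nvmem cell is only usable if the cell is
 * exactly 6 bytes and the value is neither all-zero nor multicast
 * (LSB of the first octet set). Returns 0 and copies on success. */
int mac_from_blob(const unsigned char *blob, size_t len, unsigned char mac[6])
{
	static const unsigned char zero[6] = {0};

	if (len != 6)
		return -1; /* wrong cell size */
	if (memcmp(blob, zero, 6) == 0)
		return -1; /* all-zero: address never programmed */
	if (blob[0] & 1)
		return -1; /* multicast bit: invalid as a device address */
	memcpy(mac, blob, 6);
	return 0;
}
```

Rejecting a bad blob lets the caller fall back to other address sources (device tree property, random address) rather than bringing the interface up with garbage.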
RE: [RFC PATCH v2] net: phy: Added device tree binding for dev-addr and dev-addr code check-up
> -Original Message- > From: Rob Herring [mailto:r...@kernel.org] > Sent: Tuesday, March 27, 2018 1:25 AM > To: Vicenţiu Galanopulo> Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org; > mark.rutl...@arm.com; da...@davemloft.net; mar...@holtmann.org; > devicet...@vger.kernel.org; Madalin-cristian Bucur ; > Alexandru Marginean > Subject: Re: [RFC PATCH v2] net: phy: Added device tree binding for dev-addr > and dev-addr code check-up > > On Fri, Mar 23, 2018 at 10:05:22AM -0500, Vicentiu Galanopulo wrote: > > Reason for this patch is that the Inphi PHY has a vendor specific > > address space for accessing the > > C45 MDIO registers - starting from 0x1e. > > > > A search of the dev-addr property is done in of_mdiobus_register. > > If the property is found in the PHY node, > > of_mdiobus_register_static_phy is called. This is a wrapper function > > for of_mdiobus_register_phy which finds the device in package based on > > dev-addr and fills devices_addrs: > > devices_addrs is a new field added to phy_c45_device_ids. > > This new field will store the dev-addr property on the same index > > where the device in package has been found. > > In order to have dev-addr in get_phy_c45_ids(), mdio_c45_ids is passed > > from of_mdio.c to phy_device.c as an external variable. > > In get_phy_device a copy of the mdio_c45_ids is done over the local > > c45_ids (wich are empty). After the copying, the c45_ids will also > > contain the static device found from dev-addr. > > Having dev-addr stored in devices_addrs, in get_phy_c45_ids(), when > > probing the identifiers, dev-addr can be extracted from devices_addrs > > and probed if devices_addrs[current_identifier] is not 0. > > This way changing the kernel API is avoided completely. > > > > As a plus to this patch, num_ids in get_phy_c45_ids, has the value 8 > > (ARRAY_SIZE(c45_ids->device_ids)), > > but the u32 *devs can store 32 devices in the bitfield. > > If a device is stored in *devs, in bits 32 to 9, it will not be found. 
> > This is the reason for changing in phy.h, the size of device_ids > > array. > > > > Signed-off-by: Vicentiu Galanopulo > > --- > > Documentation/devicetree/bindings/net/phy.txt | 6 ++ > > Please split bindings to separate patch. Thanks Rob for your comments. I was considering doing that after I reach a more stable, non-RFC version of the patch. I also added a v3, before your comments, I think... so in the next one, v4, I will split the binding to a separate patch. > > > drivers/net/phy/phy_device.c | 22 +- > > drivers/of/of_mdio.c | 98 > > ++- > > include/linux/phy.h | 5 +- > > 4 files changed, 125 insertions(+), 6 deletions(-) > > > > diff --git a/Documentation/devicetree/bindings/net/phy.txt > > b/Documentation/devicetree/bindings/net/phy.txt > > index d2169a5..82692e2 100644 > > --- a/Documentation/devicetree/bindings/net/phy.txt > > +++ b/Documentation/devicetree/bindings/net/phy.txt > > @@ -61,6 +61,11 @@ Optional Properties: > > - reset-deassert-us: Delay after the reset was deasserted in microseconds. > >If this property is missing the delay will be skipped. > > > > +- dev-addr: If set, it indicates the device address of the PHY to be > > +used > > + when accessing the C45 PHY registers over MDIO. It is used for > > +vendor specific > > + register space addresses that do no conform to standard address for > > +the MDIO > > + registers (e.g. MMD30) > > This is a 2nd MDIO address, right? Can't you just append this to reg property? > > Rob Yes, it is a 2nd MDIO address which is coming from the PHY vendor. This address cannot be obtained by querying the MDIO bus, it's specified in the PHY datasheet. The current kernel implementation was relying on the fact that this address is in the decimal 0 to 7 range. That worked for the PHYs which already have a kernel driver, but for the new Inphi PHY, which I'm trying to add, it didn't. If I were to append the dev-addr address to the reg property, I would have to fit two 32bit variable into a single one... 
In my opinion the code is clearer having the two addresses as distinct variables, and so far the comments from Andrew and Florian didn't go in this direction. Vicentiu
[PATCH net 1/1] net/smc: use announced length in sock_recvmsg()
Not every CLC proposal message needs the maximum buffer length. Due to the MSG_WAITALL flag, it is important to use the peeked real length when receiving the message. Fixes: d63d271ce2b5ce ("smc: switch to sock_recvmsg()") Signed-off-by: Ursula Braun --- net/smc/smc_clc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c index 8ac51583a063..15c213250606 100644 --- a/net/smc/smc_clc.c +++ b/net/smc/smc_clc.c @@ -133,7 +133,7 @@ int smc_clc_wait_msg(struct smc_sock *smc, void *buf, int buflen, /* receive the complete CLC message */ memset(&msg, 0, sizeof(struct msghdr)); - iov_iter_kvec(&msg.msg_iter, READ | ITER_KVEC, &vec, 1, buflen); + iov_iter_kvec(&msg.msg_iter, READ | ITER_KVEC, &vec, 1, datlen); krflags = MSG_WAITALL; smc->clcsock->sk->sk_rcvtimeo = CLC_WAIT_TIME; len = sock_recvmsg(smc->clcsock, &msg, krflags); -- 2.13.5
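For readers unfamiliar with MSG_WAITALL, here is a minimal user-space sketch (not the SMC kernel code; function and names are mine) of why the receive length must match the announced message length rather than the maximum buffer size:

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* With MSG_WAITALL, recv() blocks until the *requested* number of bytes
 * has arrived, so the request must match the announced message length
 * (datlen in the patch), not the maximum buffer size (buflen). */
int recv_announced_len(void)
{
    int sv[2];
    char buf[64];
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* the peer sends a 5-byte message; a real CLC header would announce 5 */
    if (write(sv[0], "hello", 5) != 5)
        return -1;

    /* ask for exactly the announced length: returns all 5 bytes */
    n = recv(sv[1], buf, 5, MSG_WAITALL);

    close(sv[0]);
    close(sv[1]);
    return (int)n;
}
```

Requesting sizeof(buf) here with MSG_WAITALL would instead block waiting for 59 more bytes, which is exactly the stall the patch avoids by passing datlen.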
Re: [PATCH net] vhost: correctly remove wait queue during poll failure
Hi Jason, On Tue, Mar 27, 2018 at 11:47:22AM +0800, Jason Wang wrote: We tried to remove vq poll from wait queue, but do not check whether or not it was in a list before. This will lead to a double free. Fixing this by checking poll->wqh to make sure it was in a list. This text seems at odds with the code below; instead of checking poll->wqh, you are removing that check... Maybe the text needs rewording? Thanks, Darren. Reported-by: syzbot+c0272972b01b872e6...@syzkaller.appspotmail.com Fixes: 2b8b328b61c79 ("vhost_net: handle polling errors when setting backend") Signed-off-by: Jason Wang --- drivers/vhost/vhost.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 1b3e8d2d..5d5a9d9 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -212,8 +212,7 @@ int vhost_poll_start(struct vhost_poll *poll, struct file *file) if (mask) vhost_poll_wakeup(&poll->wait, 0, 0, poll_to_key(mask)); if (mask & EPOLLERR) { - if (poll->wqh) - remove_wait_queue(poll->wqh, &poll->wait); + vhost_poll_stop(poll); ret = -EINVAL; } -- 2.7.4
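To illustrate the point being discussed: after the patch, the "was it queued?" check lives inside the stop routine itself, so the error path can call it unconditionally. A simplified sketch of that idempotent-stop shape (types and names are stand-ins, not the vhost ones):

```c
#include <stddef.h>

/* The stop routine itself checks whether the poller is still on a wait
 * queue, so calling it twice cannot remove the same wait-queue entry
 * twice (the double-free scenario in the commit message). */
struct demo_poll {
    void *wqh;      /* non-NULL while queued on a wait queue */
    int removals;   /* counts real dequeue operations */
};

void demo_poll_stop(struct demo_poll *poll)
{
    if (poll->wqh) {
        poll->removals++;   /* stands in for remove_wait_queue() */
        poll->wqh = NULL;
    }
}
```

With this shape, the EPOLLERR path can simply call the stop routine; a second call is a harmless no-op.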
Re: [2/4] wireless: Use octal not symbolic permissions
Joe Perches wrote: > Prefer the direct use of octal for permissions. > > Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace > and some typing. > > Miscellanea: > > o Whitespace neatening around these conversions. > > Signed-off-by: Joe Perches Patch applied to wireless-drivers-next.git, thanks. 2ef00c53049b wireless: Use octal not symbolic permissions -- https://patchwork.kernel.org/patch/10305719/ https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Re: [PATCH 4/4] selftests/bpf: fix compiling errors
On Tue, Mar 27, 2018 at 10:52:57AM +0200, Daniel Borkmann wrote: > On 03/27/2018 05:06 AM, Du, Changbin wrote: > > On Mon, Mar 26, 2018 at 08:02:30PM -0700, Alexei Starovoitov wrote: > >> On Tue, Mar 27, 2018 at 10:20:10AM +0800, Du, Changbin wrote: > >>> On Mon, Mar 26, 2018 at 07:55:13AM -0700, Alexei Starovoitov wrote: > On Mon, Mar 26, 2018 at 05:23:28PM +0800, changbin...@intel.com wrote: > > Signed-off-by: Changbin Du> > --- > > tools/testing/selftests/bpf/Makefile | 5 +++-- > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > diff --git a/tools/testing/selftests/bpf/Makefile > > b/tools/testing/selftests/bpf/Makefile > > index 5c43c18..dc0fdc8 100644 > > --- a/tools/testing/selftests/bpf/Makefile > > +++ b/tools/testing/selftests/bpf/Makefile > > @@ -10,7 +10,8 @@ ifneq ($(wildcard $(GENHDR)),) > >GENFLAGS := -DHAVE_GENHDR > > endif > > > > -CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) > > -I../../../include > > +CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) \ > > + -I../../../include -I../../../../usr/include > > LDLIBS += -lcap -lelf -lrt -lpthread > > > > # Order correspond to 'make run_tests' order > > @@ -62,7 +63,7 @@ else > >CPU ?= generic > > endif > > > > -CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \ > > +CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi > > -I../../../../usr/include \ > > -Wno-compare-distinct-pointer-types > > Nack. > I suspect that will break the build for everyone else who's doing it in > the directory > itself instead of the outer one. > >>> > >>> This one? But I didn't see any problem. > >> > >> because the build was lucky and additional path ../../../../usr/include > >> didn't point > >> to a valid dir? > > Agree. > > > I am sorry but I don't understand what you mean by *lucky*. Of course, the path > > is valid. 
> > The problem is that this suddenly requires users to do a 'make > headers_install' in > order to populate usr/include/ directory in the first place. While it's > annoying > enough for BPF samples where this is needed, I absolutely don't want to > introduce > this for BPF kselftests. It's the wrong approach. Besides, in tools infra, > there is > a tools/arch/*/include/uapi/asm/bitsperlong.h header copy already, so we > really need > to use that instead. Please adapt your patch accordingly and respin. Please > also Cc > us and netdev@vger.kernel.org for BPF kselftests changes. > > Thanks, > Daniel Thanks for the explanation. So we expect that tools/arch/*/include is in the searching list, right? The current makefile seems not. How do you get this built? changbin@gvt-dell-host:~/work/linux/tools/testing/selftests/bpf$ make -p [...] clang -I. -I./include/uapi -I../../../include/uapi -Wno-compare-distinct-pointer-types \ -O2 -target bpf -emit-llvm -c test_pkt_access.c -o - | \ llc -march=bpf -mcpu=generic -filetype=obj -o /home/changbin/work/linux/tools/testing/selftests/bpf/test_pkt_access.o In file included from test_pkt_access.c:9: In file included from ../../../include/uapi/linux/bpf.h:11: In file included from ./include/uapi/linux/types.h:5: /usr/include/asm-generic/int-ll64.h:11:10: fatal error: 'asm/bitsperlong.h' file not found #include <asm/bitsperlong.h> -- Thanks, Changbin Du
[PATCH iproute2-next] tc: Fix compilation error with old iptables
The compat_rev field does not exist in old versions of iptables, e.g. iptables 1.4. Fixes: dd29621578d2 ("tc: add em_ipt ematch for calling xtables matches from tc matching context") Signed-off-by: Roi Dayan --- tc/em_ipt.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tc/em_ipt.c b/tc/em_ipt.c index 9d2b2f9b46d4..aa2edd63c550 100644 --- a/tc/em_ipt.c +++ b/tc/em_ipt.c @@ -41,7 +41,9 @@ static struct xtables_globals em_tc_ipt_globals = { .program_version = "0.1", .orig_opts = original_opts, .opts = original_opts, +#if (XTABLES_VERSION_CODE >= 11) .compat_rev = xtables_compatible_revision, +#endif }; static struct xt_entry_match *fake_xt_entry_match(int data_size, void *data) -- 2.7.0
[bpf-next V6 PATCH 00/15] XDP redirect memory return API
This patchset works towards supporting different XDP RX-ring memory allocators, as this will be needed by the AF_XDP zero-copy mode. The patchset uses mlx5 as the sample driver, which gets XDP_REDIRECT implemented on the RX side, but not ndo_xdp_xmit (as this API is subject to change throughout the patchset). A new struct xdp_frame is introduced (modeled after the cpumap xdp_pkt), and both ndo_xdp_xmit and the new xdp_return_frame end up using this. Support for a driver-supplied allocator is implemented, and a refurbished version of page_pool is the first return allocator type introduced. This will be an integration point for AF_XDP zero-copy. The mlx5 driver evolves into using the page_pool, and sees a performance increase (when redirecting with ndo_xdp_xmit out the ixgbe driver) from 6 Mpps to 12 Mpps. The patchset stops at the 15-patch limit, but more API changes are planned, specifically extending the ndo_xdp_xmit and xdp_return_frame APIs to support bulking, as this will address some known limits. V2: Updated according to Tariq's feedback V3: Updated based on feedback from Jason Wang and Alex Duyck V4: Updated based on feedback from Tariq and Jason V5: Fix SPDX license, add Tariq's reviews, improve patch desc for perf test V6: Updated based on feedback from Eric Dumazet and Alex Duyck --- Jesper Dangaard Brouer (15): mlx5: basic XDP_REDIRECT forward support xdp: introduce xdp_return_frame API and use in cpumap ixgbe: use xdp_return_frame API xdp: move struct xdp_buff from filter.h to xdp.h xdp: introduce a new xdp_frame type tun: convert to use generic xdp_frame and xdp_return_frame API virtio_net: convert to use generic xdp_frame and xdp_return_frame API bpf: cpumap convert to use generic xdp_frame mlx5: register a memory model when XDP is enabled xdp: rhashtable with allocator ID to pointer mapping page_pool: refurbish version of page_pool code xdp: allow page_pool as an allocator type in xdp_return_frame mlx5: use page_pool for xdp_return_frame call xdp: transition into using xdp_frame for return 
API xdp: transition into using xdp_frame for ndo_xdp_xmit drivers/net/ethernet/intel/ixgbe/ixgbe.h |3 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 37 ++ drivers/net/ethernet/mellanox/mlx5/core/en.h |4 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 36 ++ drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 42 ++- drivers/net/tun.c | 60 ++-- drivers/net/virtio_net.c | 52 ++- drivers/vhost/net.c |7 include/linux/filter.h| 24 -- include/linux/if_tun.h|4 include/linux/netdevice.h |4 include/net/page_pool.h | 129 + include/net/xdp.h | 83 + kernel/bpf/cpumap.c | 132 +++-- net/core/Makefile |1 net/core/filter.c | 17 + net/core/page_pool.c | 317 + net/core/xdp.c| 258 + 18 files changed, 1034 insertions(+), 176 deletions(-) create mode 100644 include/net/page_pool.h create mode 100644 net/core/page_pool.c --
[bpf-next V6 PATCH 01/15] mlx5: basic XDP_REDIRECT forward support
This implements basic XDP redirect support in the mlx5 driver. Notice that ndo_xdp_xmit() is NOT implemented, because that API needs some changes that this patchset is working towards. The main purpose of this patch is to have different drivers doing XDP_REDIRECT, to show how different memory models behave in a cross-driver world. Update(pre-RFCv2 Tariq): Need to DMA unmap page before xdp_do_redirect, as the return API does not exist yet to keep this mapped. Update(pre-RFCv3 Saeed): Don't mix XDP_TX and XDP_REDIRECT flushing, introduce xdpsq.db.redirect_flush boolean. Signed-off-by: Jesper Dangaard Brouer Reviewed-by: Tariq Toukan --- drivers/net/ethernet/mellanox/mlx5/core/en.h|1 + drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 27 --- 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h index 4c9360b25532..28cc26debeda 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -398,6 +398,7 @@ struct mlx5e_xdpsq { struct { struct mlx5e_dma_info *di; bool doorbell; + bool redirect_flush; } db; /* read only */ diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index 8cce90dc461d..6dcc3e8fbd3e 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -236,14 +236,20 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq, return 0; } +static inline void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, + struct mlx5e_dma_info *dma_info) +{ + dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq), + rq->buff.map_dir); +} + void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info, bool recycle) { if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info)) return; - dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq), - rq->buff.map_dir); + mlx5e_page_dma_unmap(rq, 
dma_info); put_page(dma_info->page); } @@ -822,9 +828,10 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di, void *va, u16 *rx_headroom, u32 *len) { - const struct bpf_prog *prog = READ_ONCE(rq->xdp_prog); + struct bpf_prog *prog = READ_ONCE(rq->xdp_prog); struct xdp_buff xdp; u32 act; + int err; if (!prog) return false; @@ -845,6 +852,15 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq, if (unlikely(!mlx5e_xmit_xdp_frame(rq, di, &xdp))) trace_xdp_exception(rq->netdev, prog, act); return true; + case XDP_REDIRECT: + /* When XDP enabled then page-refcnt==1 here */ + err = xdp_do_redirect(rq->netdev, &xdp, prog); + if (!err) { + rq->wqe.xdp_xmit = true; /* XDP xmit owns page */ + rq->xdpsq.db.redirect_flush = true; + mlx5e_page_dma_unmap(rq, di); + } + return true; default: bpf_warn_invalid_xdp_action(act); case XDP_ABORTED: @@ -1107,6 +1123,11 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget) xdpsq->db.doorbell = false; } + if (xdpsq->db.redirect_flush) { + xdp_do_flush_map(); + xdpsq->db.redirect_flush = false; + } + mlx5_cqwq_update_db_record(&cq->wq); /* ensure cq space is freed before enabling more cqes */
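The redirect_flush flag above follows a common NAPI batching pattern: per-packet work only sets a flag, and the expensive flush (xdp_do_flush_map() in the real driver) runs once at the end of the poll loop. A small stand-alone sketch of the shape (all names here are stand-ins, not the mlx5 ones):

```c
/* One flush per poll loop, not one per packet. */
int flush_count;    /* counts flushes, stands in for xdp_do_flush_map() */

struct demo_sq {
    int redirect_flush;
};

void demo_handle_pkt(struct demo_sq *sq)
{
    sq->redirect_flush = 1;    /* per-packet work is just a flag set */
}

void demo_poll(struct demo_sq *sq, int budget)
{
    for (int i = 0; i < budget; i++)
        demo_handle_pkt(sq);

    if (sq->redirect_flush) {  /* deferred, batched flush */
        flush_count++;
        sq->redirect_flush = 0;
    }
}
```

This is also why the patch keeps XDP_TX doorbells and XDP_REDIRECT flushing separate: each batch mechanism has its own flag and its own flush point.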
[bpf-next V6 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap
Introduce an xdp_return_frame API, and convert over cpumap as the first user, given it already has a queued XDP frame structure to leverage. V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck. V6: Remove comment that id will be added later (Req by Alex Duyck) Signed-off-by: Jesper Dangaard Brouer --- include/net/xdp.h | 27 +++ kernel/bpf/cpumap.c | 60 +++ net/core/xdp.c | 18 +++ 3 files changed, 81 insertions(+), 24 deletions(-) diff --git a/include/net/xdp.h b/include/net/xdp.h index b2362ddfa694..257762e8a3ad 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -33,16 +33,43 @@ * also mandatory during RX-ring setup. */ +enum mem_type { + MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */ + MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */ + MEM_TYPE_MAX, +}; + +struct xdp_mem_info { + u32 type; /* enum mem_type, but known size type */ +}; + struct xdp_rxq_info { struct net_device *dev; u32 queue_index; u32 reg_state; + struct xdp_mem_info mem; } ____cacheline_aligned; /* perf critical, avoid false-sharing */ + +static inline +void xdp_return_frame(void *data, struct xdp_mem_info *mem) +{ + if (mem->type == MEM_TYPE_PAGE_SHARED) + page_frag_free(data); + + if (mem->type == MEM_TYPE_PAGE_ORDER0) { + struct page *page = virt_to_page(data); /* Assumes order0 page*/ + + put_page(page); + } +} + int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq, struct net_device *dev, u32 queue_index); void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq); void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq); bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq); +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, + enum mem_type type, void *allocator); #endif /* __LINUX_NET_XDP_H__ */ diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index a4bb0b34375a..3e4bbcbe3e86 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -137,27 +138,6 @@ static struct bpf_map 
*cpu_map_alloc(union bpf_attr *attr) return ERR_PTR(err); } -static void __cpu_map_queue_destructor(void *ptr) -{ - /* The tear-down procedure should have made sure that queue is -* empty. See __cpu_map_entry_replace() and work-queue -* invoked cpu_map_kthread_stop(). Catch any broken behaviour -* gracefully and warn once. -*/ - if (WARN_ON_ONCE(ptr)) - page_frag_free(ptr); -} - -static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) -{ - if (atomic_dec_and_test(&rcpu->refcnt)) { - /* The queue should be empty at this point */ - ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor); - kfree(rcpu->queue); - kfree(rcpu); - } -} - static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) { atomic_inc(&rcpu->refcnt); @@ -188,6 +168,10 @@ struct xdp_pkt { u16 len; u16 headroom; u16 metasize; + /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time, +* while mem info is valid on remote CPU. +*/ + struct xdp_mem_info mem; struct net_device *dev_rx; }; @@ -213,6 +197,9 @@ static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp) xdp_pkt->headroom = headroom - sizeof(*xdp_pkt); xdp_pkt->metasize = metasize; + /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */ + xdp_pkt->mem = xdp->rxq->mem; + return xdp_pkt; } @@ -265,6 +252,31 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, return skb; } +static void __cpu_map_ring_cleanup(struct ptr_ring *ring) +{ + /* The tear-down procedure should have made sure that queue is +* empty. See __cpu_map_entry_replace() and work-queue +* invoked cpu_map_kthread_stop(). Catch any broken behaviour +* gracefully and warn once. 
+*/ + struct xdp_pkt *xdp_pkt; + + while ((xdp_pkt = ptr_ring_consume(ring))) + if (WARN_ON_ONCE(xdp_pkt)) + xdp_return_frame(xdp_pkt, &xdp_pkt->mem); +} + +static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) +{ + if (atomic_dec_and_test(&rcpu->refcnt)) { + /* The queue should be empty at this point */ + __cpu_map_ring_cleanup(rcpu->queue); + ptr_ring_cleanup(rcpu->queue, NULL); + kfree(rcpu->queue); + kfree(rcpu); + } +} + static int cpu_map_kthread_run(void *data) { struct bpf_cpu_map_entry *rcpu = data; @@ -307,7 +319,7 @@ static int cpu_map_kthread_run(void *data)
[bpf-next V6 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API
The tuntap driver invented its own driver-specific way of queuing XDP packets, by storing the xdp_buff information in the top of the XDP frame data. Convert it over to use the more generic xdp_frame structure. The main problem with the in-driver method is that the xdp_rxq_info pointer cannot be trusted/used when dequeueing the frame. V3: Remove check based on feedback from Jason Signed-off-by: Jesper Dangaard Brouer --- drivers/net/tun.c | 43 --- drivers/vhost/net.c|7 --- include/linux/if_tun.h |4 ++-- 3 files changed, 26 insertions(+), 28 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index baeafa004463..6750980d9f30 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -248,11 +248,11 @@ struct veth { __be16 h_vlan_TCI; }; -bool tun_is_xdp_buff(void *ptr) +bool tun_is_xdp_frame(void *ptr) { return (unsigned long)ptr & TUN_XDP_FLAG; } -EXPORT_SYMBOL(tun_is_xdp_buff); +EXPORT_SYMBOL(tun_is_xdp_frame); void *tun_xdp_to_ptr(void *ptr) { @@ -660,10 +660,10 @@ static void tun_ptr_free(void *ptr) { if (!ptr) return; - if (tun_is_xdp_buff(ptr)) { - struct xdp_buff *xdp = tun_ptr_to_xdp(ptr); + if (tun_is_xdp_frame(ptr)) { + struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr); - put_page(virt_to_head_page(xdp->data)); + xdp_return_frame(xdpf->data, &xdpf->mem); } else { __skb_array_destroy_skb(ptr); } @@ -1290,17 +1290,14 @@ static const struct net_device_ops tun_netdev_ops = { static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) { struct tun_struct *tun = netdev_priv(dev); - struct xdp_buff *buff = xdp->data_hard_start; - int headroom = xdp->data - xdp->data_hard_start; + struct xdp_frame *frame; struct tun_file *tfile; u32 numqueues; int ret = 0; - /* Assure headroom is available and buff is properly aligned */ - if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp))) - return -ENOSPC; - - *buff = *xdp; + frame = convert_to_xdp_frame(xdp); + if (unlikely(!frame)) + return -EOVERFLOW; rcu_read_lock(); @@ -1315,7 +1312,7 @@ static int 
tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp) /* Encode the XDP flag into lowest bit for consumer to differ * XDP buffer from sk_buff. */ - if (ptr_ring_produce(&tfile->tx_ring, tun_xdp_to_ptr(buff))) { + if (ptr_ring_produce(&tfile->tx_ring, tun_xdp_to_ptr(frame))) { this_cpu_inc(tun->pcpu_stats->tx_dropped); ret = -ENOSPC; } @@ -1993,11 +1990,11 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from) static ssize_t tun_put_user_xdp(struct tun_struct *tun, struct tun_file *tfile, - struct xdp_buff *xdp, + struct xdp_frame *xdp_frame, struct iov_iter *iter) { int vnet_hdr_sz = 0; - size_t size = xdp->data_end - xdp->data; + size_t size = xdp_frame->len; struct tun_pcpu_stats *stats; size_t ret; @@ -2013,7 +2010,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun, iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso)); } - ret = copy_to_iter(xdp->data, size, iter) + vnet_hdr_sz; + ret = copy_to_iter(xdp_frame->data, size, iter) + vnet_hdr_sz; stats = get_cpu_ptr(tun->pcpu_stats); u64_stats_update_begin(&stats->syncp); @@ -2181,11 +2178,11 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile, return err; } - if (tun_is_xdp_buff(ptr)) { - struct xdp_buff *xdp = tun_ptr_to_xdp(ptr); + if (tun_is_xdp_frame(ptr)) { + struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr); - ret = tun_put_user_xdp(tun, tfile, xdp, to); - put_page(virt_to_head_page(xdp->data)); + ret = tun_put_user_xdp(tun, tfile, xdpf, to); + xdp_return_frame(xdpf->data, &xdpf->mem); } else { struct sk_buff *skb = ptr; @@ -2424,10 +2421,10 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len, static int tun_ptr_peek_len(void *ptr) { if (likely(ptr)) { - if (tun_is_xdp_buff(ptr)) { - struct xdp_buff *xdp = tun_ptr_to_xdp(ptr); + if (tun_is_xdp_frame(ptr)) { + struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr); - return xdp->data_end - xdp->data; + return xdpf->len; } return __skb_array_len_with_tag(ptr); } else { diff --git a/drivers/vhost/net.c
[bpf-next V6 PATCH 05/15] xdp: introduce a new xdp_frame type
This is needed to convert drivers tuntap and virtio_net. This is a generalization of what is done inside cpumap, which will be converted later. Signed-off-by: Jesper Dangaard Brouer--- include/net/xdp.h | 40 1 file changed, 40 insertions(+) diff --git a/include/net/xdp.h b/include/net/xdp.h index 48cc5d3ee10e..1c69c07ee929 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -58,6 +58,46 @@ struct xdp_buff { struct xdp_rxq_info *rxq; }; +struct xdp_frame { + void *data; + u16 len; + u16 headroom; + u16 metasize; + /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time, +* while mem info is valid on remote CPU. +*/ + struct xdp_mem_info mem; +}; + +/* Convert xdp_buff to xdp_frame */ +static inline +struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp) +{ + struct xdp_frame *xdp_frame; + int metasize; + int headroom; + + /* Assure headroom is available for storing info */ + headroom = xdp->data - xdp->data_hard_start; + metasize = xdp->data - xdp->data_meta; + metasize = metasize > 0 ? metasize : 0; + if (unlikely((headroom - metasize) < sizeof(*xdp_frame))) + return NULL; + + /* Store info in top of packet */ + xdp_frame = xdp->data_hard_start; + + xdp_frame->data = xdp->data; + xdp_frame->len = xdp->data_end - xdp->data; + xdp_frame->headroom = headroom - sizeof(*xdp_frame); + xdp_frame->metasize = metasize; + + /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */ + xdp_frame->mem = xdp->rxq->mem; + + return xdp_frame; +} + static inline void xdp_return_frame(void *data, struct xdp_mem_info *mem) {
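The trick in convert_to_xdp_frame() above is that the frame descriptor is written into the packet buffer's own headroom, so no extra allocation is needed. A tiny user-space sketch of the same idea (simplified stand-in types, not the kernel's):

```c
#include <stddef.h>

/* Simplified stand-in for struct xdp_frame. */
struct demo_frame {
    unsigned char *data;
    unsigned int len;
    unsigned int headroom;
};

/* Store the descriptor at the start of the buffer's headroom, like
 * convert_to_xdp_frame() does with xdp->data_hard_start. */
struct demo_frame *demo_convert(unsigned char *hard_start,
                                unsigned char *data,
                                unsigned char *data_end)
{
    size_t headroom = (size_t)(data - hard_start);

    if (headroom < sizeof(struct demo_frame))
        return NULL;    /* not enough headroom to hold the descriptor */

    struct demo_frame *f = (struct demo_frame *)hard_start;
    f->data = data;
    f->len = (unsigned int)(data_end - data);
    f->headroom = (unsigned int)(headroom - sizeof(struct demo_frame));
    return f;
}
```

The remaining headroom is recorded so a later consumer (e.g. an skb build path) still knows how much space is left in front of the payload.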
[bpf-next V6 PATCH 03/15] ixgbe: use xdp_return_frame API
Extend struct ixgbe_tx_buffer to store the xdp_mem_info. Notice that this could be optimized further by putting this into a union in the struct ixgbe_tx_buffer, but this patchset works towards removing this again. Thus, this is not done. Signed-off-by: Jesper Dangaard Brouer --- drivers/net/ethernet/intel/ixgbe/ixgbe.h |1 + drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |6 -- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index c1e3a0039ea5..cbc20f199364 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -249,6 +249,7 @@ struct ixgbe_tx_buffer { DEFINE_DMA_UNMAP_ADDR(dma); DEFINE_DMA_UNMAP_LEN(len); u32 tx_flags; + struct xdp_mem_info xdp_mem; }; struct ixgbe_rx_buffer { diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 85369423452d..45520eb503ee 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, /* free the skb */ if (ring_is_xdp(tx_ring)) - page_frag_free(tx_buffer->data); + xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem); else napi_consume_skb(tx_buffer->skb, napi_budget); @@ -5787,7 +5787,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring) /* Free all the Tx ring sk_buffs */ if (ring_is_xdp(tx_ring)) - page_frag_free(tx_buffer->data); + xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem); else dev_kfree_skb_any(tx_buffer->skb); @@ -8351,6 +8351,8 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter, dma_unmap_len_set(tx_buffer, len, len); dma_unmap_addr_set(tx_buffer, dma, dma); tx_buffer->data = xdp->data; + tx_buffer->xdp_mem = xdp->rxq->mem; + tx_desc->read.buffer_addr = cpu_to_le64(dma); /* put descriptor type bits */
[bpf-next V6 PATCH 04/15] xdp: move struct xdp_buff from filter.h to xdp.h
This is done to prepare for the next patch, and it is also nice to move this XDP related struct out of filter.h. Signed-off-by: Jesper Dangaard Brouer--- include/linux/filter.h | 24 +--- include/net/xdp.h | 22 ++ 2 files changed, 23 insertions(+), 23 deletions(-) diff --git a/include/linux/filter.h b/include/linux/filter.h index 109d05ccea9a..340ba653dd80 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -30,6 +30,7 @@ struct sock; struct seccomp_data; struct bpf_prog_aux; struct xdp_rxq_info; +struct xdp_buff; /* ArgX, context and stack frame pointer register positions. Note, * Arg1, Arg2, Arg3, etc are used as argument mappings of function @@ -499,14 +500,6 @@ struct bpf_skb_data_end { void *data_end; }; -struct xdp_buff { - void *data; - void *data_end; - void *data_meta; - void *data_hard_start; - struct xdp_rxq_info *rxq; -}; - struct sk_msg_buff { void *data; void *data_end; @@ -769,21 +762,6 @@ int xdp_do_redirect(struct net_device *dev, struct bpf_prog *prog); void xdp_do_flush_map(void); -/* Drivers not supporting XDP metadata can use this helper, which - * rejects any room expansion for metadata as a result. 
- */ -static __always_inline void -xdp_set_data_meta_invalid(struct xdp_buff *xdp) -{ - xdp->data_meta = xdp->data + 1; -} - -static __always_inline bool -xdp_data_meta_unsupported(const struct xdp_buff *xdp) -{ - return unlikely(xdp->data_meta > xdp->data); -} - void bpf_warn_invalid_xdp_action(u32 act); struct sock *do_sk_redirect_map(struct sk_buff *skb); diff --git a/include/net/xdp.h b/include/net/xdp.h index 257762e8a3ad..48cc5d3ee10e 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -50,6 +50,13 @@ struct xdp_rxq_info { struct xdp_mem_info mem; } cacheline_aligned; /* perf critical, avoid false-sharing */ +struct xdp_buff { + void *data; + void *data_end; + void *data_meta; + void *data_hard_start; + struct xdp_rxq_info *rxq; +}; static inline void xdp_return_frame(void *data, struct xdp_mem_info *mem) @@ -72,4 +79,19 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq); int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, enum mem_type type, void *allocator); +/* Drivers not supporting XDP metadata can use this helper, which + * rejects any room expansion for metadata as a result. + */ +static __always_inline void +xdp_set_data_meta_invalid(struct xdp_buff *xdp) +{ + xdp->data_meta = xdp->data + 1; +} + +static __always_inline bool +xdp_data_meta_unsupported(const struct xdp_buff *xdp) +{ + return unlikely(xdp->data_meta > xdp->data); +} + #endif /* __LINUX_NET_XDP_H__ */
Re: [PATCH] vhost-net: add time limitation for tx polling
On 2018-03-27 17:12, haibinzhang (张海斌) wrote: handle_tx() will delay rx for a long time when busy tx polling udp packets with short length (i.e. 1-byte udp payload), because setting VHOST_NET_WEIGHT takes into account only sent bytes but not time. Interesting. Looking at vhost_can_busy_poll() it tries to poke pending vhost work and exit the busy loop if it found one. So I believe something blocks the work queuing. E.g. did reverting 8241a1e466cd56e6c10472cac9c1ad4e54bc65db fix the issue? It's not fair to handle_rx(), so we need to limit the max time of tx polling. Signed-off-by: Haibin Zhang --- drivers/vhost/net.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 8139bc70ad7d..dc9218a3a75b 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -473,6 +473,7 @@ static void handle_tx(struct vhost_net *net) struct socket *sock; struct vhost_net_ubuf_ref *uninitialized_var(ubufs); bool zcopy, zcopy_used; + unsigned long start = jiffies; Checking jiffies is tricky, it needs to be converted to ms or some other unit. mutex_lock(&vq->mutex); sock = vq->private_data; @@ -580,7 +581,7 @@ static void handle_tx(struct vhost_net *net) else vhost_zerocopy_signal_used(net, vq); vhost_net_tx_packet(net); - if (unlikely(total_len >= VHOST_NET_WEIGHT)) { + if (unlikely(total_len >= VHOST_NET_WEIGHT) || unlikely(jiffies - start >= 1)) { How is the value 1 determined here? And we need a complete test to make sure this won't affect other use cases. Another thought is to introduce another limit on #packets, but this needs benchmarking too. Thanks vhost_poll_queue(&vq->poll); break; }
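The double budget being proposed (bytes plus time, and possibly a packet count) can be sketched in isolation; all names and quota values here are illustrative, not the vhost ones:

```c
/* Exit the TX loop when either the byte quota (VHOST_NET_WEIGHT-like)
 * or a time quota is exceeded, so RX is not starved by a long stream
 * of tiny packets that never hits the byte limit. */
#define DEMO_BYTE_QUOTA 4096
#define DEMO_TIME_QUOTA 4   /* in the same fake "jiffies" unit as below */

int demo_tx_loop(const int *pkt_len, int npkts, int start, int per_pkt_cost)
{
    int total_len = 0, now = start, sent = 0;

    for (int i = 0; i < npkts; i++) {
        total_len += pkt_len[i];
        now += per_pkt_cost;   /* stand-in for time passing */
        sent++;
        if (total_len >= DEMO_BYTE_QUOTA || now - start >= DEMO_TIME_QUOTA)
            break;             /* requeue work and give RX a chance */
    }
    return sent;
}
```

With tiny packets the time quota is what breaks the loop; with large packets the byte quota fires first, matching the existing VHOST_NET_WEIGHT behaviour.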
Re: [PATCH 4/4] selftests/bpf: fix compiling errors
On 03/27/2018 11:00 AM, Du, Changbin wrote: > On Tue, Mar 27, 2018 at 10:52:57AM +0200, Daniel Borkmann wrote: >> On 03/27/2018 05:06 AM, Du, Changbin wrote: >>> On Mon, Mar 26, 2018 at 08:02:30PM -0700, Alexei Starovoitov wrote: On Tue, Mar 27, 2018 at 10:20:10AM +0800, Du, Changbin wrote: > On Mon, Mar 26, 2018 at 07:55:13AM -0700, Alexei Starovoitov wrote: >> On Mon, Mar 26, 2018 at 05:23:28PM +0800, changbin...@intel.com wrote: >>> Signed-off-by: Changbin Du>>> --- >>> tools/testing/selftests/bpf/Makefile | 5 +++-- >>> 1 file changed, 3 insertions(+), 2 deletions(-) >>> >>> diff --git a/tools/testing/selftests/bpf/Makefile >>> b/tools/testing/selftests/bpf/Makefile >>> index 5c43c18..dc0fdc8 100644 >>> --- a/tools/testing/selftests/bpf/Makefile >>> +++ b/tools/testing/selftests/bpf/Makefile >>> @@ -10,7 +10,8 @@ ifneq ($(wildcard $(GENHDR)),) >>>GENFLAGS := -DHAVE_GENHDR >>> endif >>> >>> -CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) >>> -I../../../include >>> +CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) \ >>> + -I../../../include -I../../../../usr/include >>> LDLIBS += -lcap -lelf -lrt -lpthread >>> >>> # Order correspond to 'make run_tests' order >>> @@ -62,7 +63,7 @@ else >>>CPU ?= generic >>> endif >>> >>> -CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \ >>> +CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi >>> -I../../../../usr/include \ >>> -Wno-compare-distinct-pointer-types >> >> Nack. >> I suspect that will break the build for everyone else who's doing it in >> the directory >> itself instead of the outer one. > > This one? But I didn't see any problem. because the build was lucky and additional path ../../../../usr/include didn't point to a valid dir? >> >> Agree. >> >>> I am sorry but I don't understand what you mean by *lucky*. Of course, the path >>> is valid. 
>> The problem is that this suddenly requires users to do a 'make >> headers_install' in >> order to populate usr/include/ directory in the first place. While it's >> annoying >> enough for BPF samples where this is needed, I absolutely don't want to >> introduce >> this for BPF kselftests. It's the wrong approach. Besides, in tools infra, >> there is >> a tools/arch/*/include/uapi/asm/bitsperlong.h header copy already, so we >> really need >> to use that instead. Please adapt your patch accordingly and respin. Please >> also Cc >> us and netdev@vger.kernel.org for BPF kselftests changes. >> > Thanks for the explanation. So we expect that tools/arch/*/include is in the > searching list, right? > The current makefile seems not. How do you get this built? E.g. take a look at tools/include/asm/barrier.h or tools/include/uapi/asm/bpf_perf_event.h just to name two examples. We'd need something similar to this which then points to the arch specific includes. Thanks, Daniel > changbin@gvt-dell-host:~/work/linux/tools/testing/selftests/bpf$ make -p > [...] > clang -I. -I./include/uapi -I../../../include/uapi > -Wno-compare-distinct-pointer-types \ > -O2 -target bpf -emit-llvm -c test_pkt_access.c -o - | \ > llc -march=bpf -mcpu=generic -filetype=obj -o > /home/changbin/work/linux/tools/testing/selftests/bpf/test_pkt_access.o > In file included from test_pkt_access.c:9: > In file included from ../../../include/uapi/linux/bpf.h:11: > In file included from ./include/uapi/linux/types.h:5: > /usr/include/asm-generic/int-ll64.h:11:10: fatal error: 'asm/bitsperlong.h' > file not found > #include <asm/bitsperlong.h> > >
[PATCH v3 2/2] net: macb: Try to retrieve MAC address from nvmem provider
Call of_get_nvmem_mac_address() to fetch the MAC address from an nvmem cell, if one is provided in the device tree. This allows the address to be stored in an I2C EEPROM device for example. Signed-off-by: Mike Looijmans--- drivers/net/ethernet/cadence/macb_main.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c index e84afcf..eabe14f 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -3950,10 +3950,16 @@ static int macb_probe(struct platform_device *pdev) dev->max_mtu = ETH_DATA_LEN; mac = of_get_mac_address(np); - if (mac) + if (mac) { ether_addr_copy(bp->dev->dev_addr, mac); - else - macb_get_hwaddr(bp); + } else { + err = of_get_nvmem_mac_address(np, bp->dev->dev_addr); + if (err) { + if (err == -EPROBE_DEFER) + goto err_out_free_netdev; + macb_get_hwaddr(bp); + } + } err = of_get_phy_mode(np); if (err < 0) { -- 1.9.1
[PATCH v3 1/2] of_net: Implement of_get_nvmem_mac_address helper
It's common practice to store MAC addresses for network interfaces into nvmem devices. However the kernel lacks the code to actually do this, so this patch adds of_get_nvmem_mac_address() for drivers to obtain the address from an nvmem cell provider. This is particularly useful on devices where the ethernet interface cannot be configured by the bootloader, for example because it's in an FPGA. Signed-off-by: Mike Looijmans--- Documentation/devicetree/bindings/net/ethernet.txt | 2 ++ drivers/of/of_net.c| 40 ++ include/linux/of_net.h | 6 3 files changed, 48 insertions(+) diff --git a/Documentation/devicetree/bindings/net/ethernet.txt b/Documentation/devicetree/bindings/net/ethernet.txt index 2974e63..cfc376b 100644 --- a/Documentation/devicetree/bindings/net/ethernet.txt +++ b/Documentation/devicetree/bindings/net/ethernet.txt @@ -10,6 +10,8 @@ Documentation/devicetree/bindings/phy/phy-bindings.txt. the boot program; should be used in cases where the MAC address assigned to the device by the boot program is different from the "local-mac-address" property; +- nvmem-cells: phandle, reference to an nvmem node for the MAC address; +- nvmem-cell-names: string, should be "mac-address" if nvmem is to be used; - max-speed: number, specifies maximum speed in Mbit/s supported by the device; - max-frame-size: number, maximum transfer unit (IEEE defined MTU), rather than the maximum frame size (there's contradiction in the Devicetree diff --git a/drivers/of/of_net.c b/drivers/of/of_net.c index d820f3e..1c5d372 100644 --- a/drivers/of/of_net.c +++ b/drivers/of/of_net.c @@ -7,6 +7,7 @@ */ #include #include +#include #include #include #include @@ -80,3 +81,42 @@ const void *of_get_mac_address(struct device_node *np) return of_get_mac_addr(np, "address"); } EXPORT_SYMBOL(of_get_mac_address); + +/** + * Obtain the MAC address from an nvmem provider named 'mac-address' through + * device tree. + * On success, copies the new address into memory pointed to by addr and + * returns 0. 
Returns a negative error code otherwise. + * @np:Device tree node containing the nvmem-cells phandle + * @addr: Pointer to receive the MAC address using ether_addr_copy() + */ +int of_get_nvmem_mac_address(struct device_node *np, void *addr) +{ + struct nvmem_cell *cell; + const void *mac; + size_t len; + int ret; + + cell = of_nvmem_cell_get(np, "mac-address"); + if (IS_ERR(cell)) + return PTR_ERR(cell); + + mac = nvmem_cell_read(cell, ); + + nvmem_cell_put(cell); + + if (IS_ERR(mac)) + return PTR_ERR(mac); + + if (len < 6 || !is_valid_ether_addr(mac)) { + ret = -EINVAL; + } else { + ether_addr_copy(addr, mac); + ret = 0; + } + + kfree(mac); + + return ret; +} +EXPORT_SYMBOL(of_get_nvmem_mac_address); diff --git a/include/linux/of_net.h b/include/linux/of_net.h index 9cd72aa..90d81ee 100644 --- a/include/linux/of_net.h +++ b/include/linux/of_net.h @@ -13,6 +13,7 @@ struct net_device; extern int of_get_phy_mode(struct device_node *np); extern const void *of_get_mac_address(struct device_node *np); +extern int of_get_nvmem_mac_address(struct device_node *np, void *addr); extern struct net_device *of_find_net_device_by_node(struct device_node *np); #else static inline int of_get_phy_mode(struct device_node *np) @@ -25,6 +26,11 @@ static inline const void *of_get_mac_address(struct device_node *np) return NULL; } +static inline int of_get_nvmem_mac_address(struct device_node *np, void *addr) +{ + return -ENODEV; +} + static inline struct net_device *of_find_net_device_by_node(struct device_node *np) { return NULL; -- 1.9.1
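To illustrate how a board would consume this new helper, here is a hypothetical device-tree fragment wiring an ethernet node to an EEPROM-backed nvmem cell. The node names, I2C address, and EEPROM offset are purely illustrative (they are not taken from the patch); only the `nvmem-cells`/`nvmem-cell-names` properties and the "mac-address" cell name come from the binding above:

```dts
eeprom: eeprom@50 {
	compatible = "atmel,24c02";
	reg = <0x50>;
	#address-cells = <1>;
	#size-cells = <1>;

	/* 6-byte MAC address stored at offset 0xfa of the EEPROM */
	eth_mac: mac-address@fa {
		reg = <0xfa 0x6>;
	};
};

&macb0 {
	nvmem-cells = <&eth_mac>;
	nvmem-cell-names = "mac-address";
};
```

With such a fragment in place, of_get_nvmem_mac_address() looks up the "mac-address" cell via of_nvmem_cell_get(), reads it, and rejects anything that is not a valid 6-byte ethernet address before copying it into the caller's buffer.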
Re: [RFC PATCH 00/24] Introducing AF_XDP support
2018-03-26 23:58 GMT+02:00 William Tu: > Hi Jesper, > > Thanks a lot for your prompt reply. > >>> Hi, >>> I also did an evaluation of AF_XDP, however the performance isn't as >>> good as above. >>> I'd like to share the result and see if there are some tuning suggestions. >>> >>> System: >>> 16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz >>> Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode >> >> Hmmm, why is X540-AT2 not able to use XDP natively? > > Because I'm only able to use ixgbe driver for this NIC, > and AF_XDP patch only has i40e support? > It's only i40e that supports zero copy. As for native XDP support, only XDP_REDIRECT support is required and ixgbe does support XDP_REDIRECT -- unfortunately, ixgbe still needs a patch to work properly, which is in net-next: ed93a3987128 ("ixgbe: tweak page counting for XDP_REDIRECT"). >> >>> AF_XDP performance: >>> Benchmark XDP_SKB >>> rxdrop 1.27 Mpps >>> txpush 0.99 Mpps >>> l2fwd 0.85 Mpps >> >> Definitely too low... >> > I did another run, the rxdrop seems better. > Benchmark XDP_SKB > rxdrop 2.3 Mpps > txpush 1.05 Mpps > l2fwd 0.90 Mpps > >> What is the performance if you drop packets via iptables? >> >> Command: >> $ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP >> > I did > # iptables -t raw -I PREROUTING -p udp -i enp10s0f0 -j DROP > # iptables -nvL -t raw; sleep 10; iptables -nvL -t raw > > and I got 2.9Mpps. > >>> NIC configuration: >>> the command >>> "ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16" >>> doesn't work on my ixgbe driver, so I use ntuple: >>> >>> ethtool -K enp10s0f0 ntuple on >>> ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1 >>> then >>> echo 1 > /proc/sys/net/core/bpf_jit_enable >>> ./xdpsock -i enp10s0f0 -r -S --queue=1 >>> >>> I also take a look at perf result: >>> For rxdrop: >>> 86.56% xdpsock xdpsock [.] main >>> 9.22% xdpsock [kernel.vmlinux] [k] nmi >>> 4.23% xdpsock xdpsock [.] 
xq_enq >> >> It looks very strange that you see non-maskable interrupts (NMI) being >> this high... >> > yes, that's weird. Looking at the perf annotate of nmi, > it shows 100% spent on nop instruction. > >> >>> For l2fwd: >>> 20.81% xdpsock xdpsock [.] main >>> 10.64% xdpsock [kernel.vmlinux][k] clflush_cache_range >> >> Oh, clflush_cache_range is being called! > > I thought clflush_cache_range is high because we have many smp_rmb, smp_wmb > in the xdpsock queue/ring management userspace code. > (perf shows that 75% of this 10.64% is spent on the mfence instruction.) > >> Does your system use an IOMMU ? >> > Yes. > With CONFIG_INTEL_IOMMU=y > and I saw some related functions called (ex: intel_alloc_iova). > >>> 8.46% xdpsock [kernel.vmlinux][k] xsk_sendmsg >>> 6.72% xdpsock [kernel.vmlinux][k] skb_set_owner_w >>> 5.89% xdpsock [kernel.vmlinux][k] __domain_mapping >>> 5.74% xdpsock [kernel.vmlinux][k] alloc_skb_with_frags >>> 4.62% xdpsock [kernel.vmlinux][k] netif_skb_features >>> 3.96% xdpsock [kernel.vmlinux][k] ___slab_alloc >>> 3.18% xdpsock [kernel.vmlinux][k] nmi >> >> Again high count for NMI ?!? >> >> Maybe you just forgot to tell perf that you want it to decode the >> bpf_prog correctly? >> >> https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols >> >> Enable via: >> $ sysctl net/core/bpf_jit_kallsyms=1 >> >> And use perf report (while BPF is STILL LOADED): >> >> $ perf report --kallsyms=/proc/kallsyms >> >> E.g. for emailing this you can use this command: >> >> $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms >> --no-children --stdio -g none | head -n 40 >> > > Thanks, I followed the steps, the result of l2fwd > # Total Lost Samples: 119 > # > # Samples: 2K of event 'cycles:ppp' > # Event count (approx.): 25675705627 > # > # Overhead CPU Command Shared Object Symbol > # ... ... .. > .. > # > 10.48% 013 xdpsock xdpsock [.] 
main > 9.77% 013 xdpsock [kernel.vmlinux][k] clflush_cache_range > 8.45% 013 xdpsock [kernel.vmlinux][k] nmi > 8.07% 013 xdpsock [kernel.vmlinux][k] xsk_sendmsg > 7.81% 013 xdpsock [kernel.vmlinux][k] __domain_mapping > 4.95% 013 xdpsock [kernel.vmlinux][k] ixgbe_xmit_frame_ring > 4.66% 013 xdpsock [kernel.vmlinux][k] skb_store_bits > 4.39% 013 xdpsock [kernel.vmlinux][k] syscall_return_via_sysret > 3.93% 013 xdpsock [kernel.vmlinux][k] pfn_to_dma_pte > 2.62% 013 xdpsock [kernel.vmlinux][k] __intel_map_single > 2.53% 013 xdpsock [kernel.vmlinux][k] __alloc_skb > 2.36% 013 xdpsock [kernel.vmlinux][k] iommu_no_mapping > 2.21% 013 xdpsock [kernel.vmlinux][k] alloc_skb_with_frags > 2.07% 013 xdpsock
Re: [RFC PATCH 00/24] Introducing AF_XDP support
2018-03-27 1:03 GMT+02:00 Alexander Duyck: > On Mon, Mar 26, 2018 at 3:54 PM, Tushar Dave wrote: [...] >> >> What's the implication here. Should IOMMU be disabled? >> I'm asking because I do see a huge difference while running pktgen test for >> my performance benchmarks, with and without intel_iommu. >> >> >> -Tushar > > For the Intel parts the IOMMU can be expensive primarily for Tx, since > it should have minimal impact if the Rx pages are pinned/recycled. I > am assuming the same is true here for AF_XDP, Bjorn can correct me if > I am wrong. > For the non-zc case the DMA mapping is done in the Tx fast path, so there, as Alex says, you'll definitely see a performance penalty. For Rx the page-recycle mechanism (Intel drivers) usually avoids doing any DMA mappings in the fast-path. As for AF_XDP zerocopy mode, we do the DMA mapping up front (avoiding the single-use mappings), to avoid that performance hit. Keep in mind, though, that the IOTLB is still in play, and usually performs worse under pressure than the non-IOMMU case. > Basically the IOMMU can make creating/destroying a DMA mapping really > expensive. The easiest way to work around it in the case of the Intel > IOMMU is to boot with "iommu=pt" which will create an identity mapping > for the host. The downside is though that you then have the entire > system accessible to the device unless a new mapping is created for it > by assigning it to a new IOMMU domain. > > Thanks. > > - Alex
[PATCH net 2/2] net/mlx4_core: Fix memory leak while deleting slave's resources
From: Moshe Shemeshmlx4_delete_all_resources_for_slave in resource tracker should free all memory allocated for a slave. While releasing memory of fs_rule, it misses releasing memory of fs_rule->mirr_mbox. Fixes: 78efed275117 ('net/mlx4_core: Support mirroring VF DMFS rules on both ports') Signed-off-by: Moshe Shemesh Signed-off-by: Tariq Toukan --- drivers/net/ethernet/mellanox/mlx4/resource_tracker.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c index 606a0e0beeae..29e50f787349 100644 --- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c +++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c @@ -5088,6 +5088,7 @@ static void rem_slave_fs_rule(struct mlx4_dev *dev, int slave) >res_tree[RES_FS_RULE]); list_del(_rule->com.list); spin_unlock_irq(mlx4_tlock(dev)); + kfree(fs_rule->mirr_mbox); kfree(fs_rule); state = 0; break; -- 1.8.3.1
[PATCH net 0/2] mlx4 misc fixes for 4.16
Hi Dave, This patchset contains misc bug fixes from the team to the mlx4 Core and Eth drivers. Patch 1 by Eran fixes mixed user control of PFC and Global pause; please queue it to -stable for >= v4.8. Patch 2 by Moshe fixes a resource leak in the slave's delete flow; please queue it to -stable for >= v4.5. Series generated against net commit: 3c82b372a9f4 net: dsa: mt7530: fix module autoloading for OF platform drivers Thanks, Tariq. Eran Ben Elisha (1): net/mlx4_en: Fix mixed PFC and Global pause user control requests Moshe Shemesh (1): net/mlx4_core: Fix memory leak while deleting slave's resources drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c | 72 -- drivers/net/ethernet/mellanox/mlx4/en_ethtool.c| 33 +- drivers/net/ethernet/mellanox/mlx4/en_main.c | 4 +- .../net/ethernet/mellanox/mlx4/resource_tracker.c | 1 + 4 files changed, 63 insertions(+), 47 deletions(-) -- 1.8.3.1
Re: [PATCH v7 0/7] netdev: intel: Eliminate duplicate barriers on weakly-ordered archs
Jeff, On 3/23/2018 10:34 PM, ok...@codeaurora.org wrote: > On 2018-03-23 19:58, Jeff Kirsher wrote: >> On Fri, 2018-03-23 at 14:53 -0700, Alexander Duyck wrote: >>> On Fri, Mar 23, 2018 at 11:52 AM, Sinan Kaya>>> wrote: >>> > Code includes wmb() followed by writel() in multiple places. writel() >>> > already has a barrier on some architectures like arm64. >>> > >>> > This ends up CPU observing two barriers back to back before executing >>> > the >>> > register write. >>> > >>> > Since code already has an explicit barrier call, changing writel() to >>> > writel_relaxed(). >>> > >>> > I did a regex search for wmb() followed by writel() in each drivers >>> > directory. >>> > I scrubbed the ones I care about in this series. >>> > >>> > I considered "ease of change", "popular usage" and "performance >>> > critical >>> > path" as the determining criteria for my filtering. >>> > >>> > We used relaxed API heavily on ARM for a long time but >>> > it did not exist on other architectures. For this reason, relaxed >>> > architectures have been paying double penalty in order to use the >>> > common >>> > drivers. >>> > >>> > Now that relaxed API is present on all architectures, we can go and >>> > scrub >>> > all drivers to see what needs to change and what can remain. >>> > >>> > We start with mostly used ones and hope to increase the coverage over >>> > time. >>> > It will take a while to cover all drivers. >>> > >>> > Feel free to apply patches individually. >>> >>> I looked over the set and they seem good. >>> >>> Reviewed-by: Alexander Duyck >> >> Grrr, patch 1 does not apply cleanly to my next-queue tree (dev-queue >> branch). I will deal with this series in a day or two, after I have dealt >> with my driver pull requests. > > Sorry, you will have to replace the ones you took from me. Double sorry now. I don't know if you have been following "RFC on writel and writel_relaxed" thread or not but there are some new developments about wmb() requirement. 
Basically, wmb() should never be used before writel() as writel() seems to provide the coherency and observability guarantees. wmb()+writel_relaxed() is slower on some architectures than plain writel(). I'll have to rework these patches to have writel() only. Are you able to drop the applied ones so that I can post V8 or is it too late? Sinan -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH] net/9p: fix potential refcnt problem of trans module
When specifying trans_mod multiple times in a mount, it may cause an inaccurate refcount of the trans module. Also, in the error case of option parsing, we should put the trans module if we have already got one. Signed-off-by: Chengguang Xu--- net/9p/client.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/9p/client.c b/net/9p/client.c index b433aff..7ccfb4b 100644 --- a/net/9p/client.c +++ b/net/9p/client.c @@ -190,7 +190,9 @@ static int parse_opts(char *opts, struct p9_client *clnt) p9_debug(P9_DEBUG_ERROR, "problem allocating copy of trans arg\n"); goto free_and_return; -} + } + + v9fs_put_trans(clnt->trans_mod); clnt->trans_mod = v9fs_get_trans_by_name(s); if (clnt->trans_mod == NULL) { pr_info("Could not find request transport: %s\n", @@ -226,6 +228,7 @@ static int parse_opts(char *opts, struct p9_client *clnt) } free_and_return: + v9fs_put_trans(clnt->trans_mod); kfree(tmp_options); return ret; } -- 1.8.3.1
Re: [PATCH net v2] udp6: set dst cache for a connected sk before udp_v6_send_skb
On 26.03.2018 20:02, Martin KaFai Lau wrote: > On Mon, Mar 26, 2018 at 05:48:47PM +0300, Alexey Kodanev wrote: >> After commit 33c162a980fe ("ipv6: datagram: Update dst cache of a >> connected datagram sk during pmtu update"), when the error occurs on >> sending datagram in udpv6_sendmsg() due to ICMPV6_PKT_TOOBIG type, >> error handler can trigger the following path and call ip6_dst_store(): >> >> udpv6_err() >> ip6_sk_update_pmtu() >> ip6_datagram_dst_update() >> ip6_dst_lookup_flow(), can create a RTF_CACHE clone > Instead of ip6_dst_lookup_flow(), > you meant the RTF_CACHE route created in ip6_update_pmtu() > Right, or even earlier... I was using vti tunnel and it invokes skb_dst_update_pmtu() on this error, then sends ICMPv6_PKT_TOOBIG. >> ... >> ip6_dst_store() >> >> It can happen before a connected UDP socket invokes ip6_dst_store() >> in the end of udpv6_sendmsg(), on destination release, as a result, >> the last one changes dst to the old one, preventing getting updated >> dst cache on the next udpv6_sendmsg() call. >> >> This patch moves ip6_dst_store() in udpv6_sendmsg(), so that it is >> invoked after ip6_sk_dst_lookup_flow() and before udp_v6_send_skb(). > After this patch, the above udpv6_err() path could not happen after > ip6_sk_dst_lookup_flow() and before the ip6_dst_store() in udpv6_sendmsg()? > May be we could minimize this if save it in ip6_sk_dst_lookup_flow() for a connected UDP sockets only if we're not getting it from a cache for some reason? diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index a8a9195..0204f52 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1115,13 +1115,30 @@ struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 *fl6, * error code. 
*/ struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6, -const struct in6_addr *final_dst) +const struct in6_addr *final_dst, +bool connected) { struct dst_entry *dst = sk_dst_check(sk, inet6_sk(sk)->dst_cookie); dst = ip6_sk_dst_check(sk, dst, fl6); - if (!dst) - dst = ip6_dst_lookup_flow(sk, fl6, final_dst); + if (dst) + return dst; + + dst = ip6_dst_lookup_flow(sk, fl6, final_dst); + + if (connected && !IS_ERR(dst)) + ip6_dst_store(sk, dst_clone(dst), ...); Thanks, Alexey >> >> Also, increase refcnt for dst, when passing it to ip6_dst_store() >> because after that the dst cache can be released by other calls >> to ip6_dst_store() with the same socket. >> >> Fixes: 33c162a980fe ("ipv6: datagram: Update dst cache of a connected >> datagram sk during pmtu update") >> Signed-off-by: Alexey Kodanev>> --- >> >> v2: * remove 'release_dst:' label >> >> * move ip6_dst_store() below MSG_CONFIRM check as >> suggested by Eric and add dst_clone() >> >> * add 'Fixes' commit. >> >> >> net/ipv6/udp.c | 29 +++-- >> 1 file changed, 11 insertions(+), 18 deletions(-) >> >> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c >> index 52e3ea0..4508e5a 100644 >> --- a/net/ipv6/udp.c >> +++ b/net/ipv6/udp.c >> @@ -1303,6 +1303,16 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr >> *msg, size_t len) >> goto do_confirm; >> back_from_confirm: >> >> +if (connected)>> + ip6_dst_store(sk, dst_clone(dst), >> + ipv6_addr_equal(, >sk_v6_daddr) ? >> + >sk_v6_daddr : NULL, >> +#ifdef CONFIG_IPV6_SUBTREES >> + ipv6_addr_equal(, >saddr) ? 
>> + >saddr : >> +#endif >> + NULL); >> + >> /* Lockless fast path for the non-corking case */ >> if (!corkreq) { >> struct sk_buff *skb; >> @@ -1314,7 +1324,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, >> size_t len) >> err = PTR_ERR(skb); >> if (!IS_ERR_OR_NULL(skb)) >> err = udp_v6_send_skb(skb, ); >> -goto release_dst; >> +goto out; >> } >> >> lock_sock(sk); >> @@ -1348,23 +1358,6 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr >> *msg, size_t len) >> err = np->recverr ? net_xmit_errno(err) : 0; >> release_sock(sk); >> >> -release_dst: >> -if (dst) { >> -if (connected) { >> -ip6_dst_store(sk, dst, >> - ipv6_addr_equal(, >> >sk_v6_daddr) ? >> - >sk_v6_daddr : NULL, >> -#ifdef CONFIG_IPV6_SUBTREES >> - ipv6_addr_equal(, >saddr) ? >> - >saddr : >> -#endif >> - NULL); >> -} else { >> -
Re: [PATCH net 1/1] net sched actions: fix dumping which requires several messages to user space
On 18-03-26 02:58 PM, Craig Dillabaugh wrote: Fixes a bug in the tcf_dump_walker function that can cause some actions to not be reported when dumping a large number of actions. This issue became more aggravated when the cookies feature was added. In particular this issue is manifest when large cookie values are assigned to the actions and when enough actions are created that the resulting table must be dumped in multiple batches. The number of actions returned in each batch is limited by the total number of actions and the memory buffer size. With small cookies the numeric limit is reached before the buffer size limit, which avoids the code path triggering this bug. When large cookies are used the buffer fills before the numeric limit, and the erroneous code path is hit. For example after creating 32 csum actions with the cookie $ tc actions ls action csum total acts 26 action order 0: csum (tcp) action continue index 1 ref 1 bind 0 cookie . action order 25: csum (tcp) action continue index 26 ref 1 bind 0 cookie total acts 6 action order 0: csum (tcp) action continue index 28 ref 1 bind 0 cookie .. action order 5: csum (tcp) action continue index 32 ref 1 bind 0 cookie Note that the action with index 27 is omitted from the report. Fixes: 4b3550ef530c ("[NET_SCHED]: Use nla_nest_start/nla_nest_end") Signed-off-by: Craig Dillabaugh Good catch. Acked-by: Jamal Hadi Salim cheers, jamal
[PATCH net 1/2] net/mlx4_en: Fix mixed PFC and Global pause user control requests
From: Eran Ben ElishaGlobal pause and PFC configuration should be mutually exclusive (i.e. only one of them at most can be set). However, once PFC was turned off, driver automatically turned Global pause on. This is a bug. Fix the driver behaviour to turn off PFC/Global once the user turned the other on. This also fixed a weird behaviour that at a current time, the profile had both PFC and global pause configuration turned on, which is Hardware-wise impossible and caused returning false positive indication to query tools. In addition, fix error code when setting global pause or PFC to change metadata only upon successful change. Also, removed useless debug print. Fixes: af7d51852631 ("net/mlx4_en: Add DCB PFC support through CEE netlink commands") Fixes: c27a02cd94d6 ("mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC") Signed-off-by: Eran Ben Elisha Signed-off-by: Tariq Toukan --- drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c | 72 ++--- drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 33 +++- drivers/net/ethernet/mellanox/mlx4/en_main.c| 4 +- 3 files changed, 62 insertions(+), 47 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c index 1a0c3bf86ead..752a72499b4f 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c @@ -156,57 +156,63 @@ static int mlx4_en_dcbnl_getnumtcs(struct net_device *netdev, int tcid, u8 *num) static u8 mlx4_en_dcbnl_set_all(struct net_device *netdev) { struct mlx4_en_priv *priv = netdev_priv(netdev); + struct mlx4_en_port_profile *prof = priv->prof; struct mlx4_en_dev *mdev = priv->mdev; + u8 tx_pause, tx_ppp, rx_pause, rx_ppp; if (!(priv->dcbx_cap & DCB_CAP_DCBX_VER_CEE)) return 1; if (priv->cee_config.pfc_state) { int tc; + rx_ppp = prof->rx_ppp; + tx_ppp = prof->tx_ppp; - priv->prof->rx_pause = 0; - priv->prof->tx_pause = 0; for (tc = 0; tc < CEE_DCBX_MAX_PRIO; tc++) { u8 tc_mask = 1 << tc; switch 
(priv->cee_config.dcb_pfc[tc]) { case pfc_disabled: - priv->prof->tx_ppp &= ~tc_mask; - priv->prof->rx_ppp &= ~tc_mask; + tx_ppp &= ~tc_mask; + rx_ppp &= ~tc_mask; break; case pfc_enabled_full: - priv->prof->tx_ppp |= tc_mask; - priv->prof->rx_ppp |= tc_mask; + tx_ppp |= tc_mask; + rx_ppp |= tc_mask; break; case pfc_enabled_tx: - priv->prof->tx_ppp |= tc_mask; - priv->prof->rx_ppp &= ~tc_mask; + tx_ppp |= tc_mask; + rx_ppp &= ~tc_mask; break; case pfc_enabled_rx: - priv->prof->tx_ppp &= ~tc_mask; - priv->prof->rx_ppp |= tc_mask; + tx_ppp &= ~tc_mask; + rx_ppp |= tc_mask; break; default: break; } } - en_dbg(DRV, priv, "Set pfc on\n"); + rx_pause = !!(rx_ppp || tx_ppp) ? 0 : prof->rx_pause; + tx_pause = !!(rx_ppp || tx_ppp) ? 0 : prof->tx_pause; } else { - priv->prof->rx_pause = 1; - priv->prof->tx_pause = 1; - en_dbg(DRV, priv, "Set pfc off\n"); + rx_ppp = 0; + tx_ppp = 0; + rx_pause = prof->rx_pause; + tx_pause = prof->tx_pause; } if (mlx4_SET_PORT_general(mdev->dev, priv->port, priv->rx_skb_size + ETH_FCS_LEN, - priv->prof->tx_pause, - priv->prof->tx_ppp, - priv->prof->rx_pause, - priv->prof->rx_ppp)) { + tx_pause, tx_ppp, rx_pause, rx_ppp)) { en_err(priv, "Failed setting pause params\n"); return 1; } + prof->tx_ppp = tx_ppp; + prof->rx_ppp = rx_ppp; + prof->tx_pause = tx_pause; + prof->rx_pause = rx_pause; + return 0; } @@ -408,6 +414,7 @@ static int mlx4_en_dcbnl_ieee_setpfc(struct net_device *dev, struct
Re: [PATCH v7 0/7] netdev: Eliminate duplicate barriers on weakly-ordered archs
Dave, On 3/26/2018 12:48 PM, David Miller wrote: > From: Sinan Kaya> Date: Sun, 25 Mar 2018 10:39:14 -0400 > >> Code includes wmb() followed by writel() in multiple places. writel() >> already has a barrier on some architectures like arm64. >> >> This ends up CPU observing two barriers back to back before executing the >> register write. >> >> Since code already has an explicit barrier call, changing writel() to >> writel_relaxed(). >> >> I did a regex search for wmb() followed by writel() in each drivers >> directory. >> I scrubbed the ones I care about in this series. >> >> I considered "ease of change", "popular usage" and "performance critical >> path" as the determining criteria for my filtering. >> >> We used relaxed API heavily on ARM for a long time but >> it did not exist on other architectures. For this reason, relaxed >> architectures have been paying double penalty in order to use the common >> drivers. >> >> Now that relaxed API is present on all architectures, we can go and scrub >> all drivers to see what needs to change and what can remain. >> >> We start with mostly used ones and hope to increase the coverage over time. >> It will take a while to cover all drivers. >> >> Feel free to apply patches individually. >> >> Changes since v6: >> - bring back amazon ena and add mmiowb, remove >> ena_com_write_sq_doorbell_rel(). >> - remove extra mmiowb in bnx2x >> - correct spelling mistake in bnx2x: Replace doorbell barrier() with wmb() > > Series applied, thank you. > I don't know if you have been following "RFC on writel and writel_relaxed" thread or not but there are some new developments about wmb() requirement. Basically, wmb() should never be used before writel() as writel() seem to provide coherency and observability guarantee. wmb()+writel_relaxed() is slower on some architectures than plain writel() I'll have to rework these patches to have writel() only. Are you able to drop the applied ones so that I can post V7 or is it too late? 
Sinan -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH net-next] net: Export net->ipv6.sysctl.ip_nonlocal_bind to /proc
Please, ignore this. Thanks, Kirill On 27.03.2018 14:24, Kirill Tkhai wrote: > Currently, this parameter can be configured via sysctl > only. But sysctl is considered a deprecated interface > (man 2 sysctl), and it can only be applied to current's net > namespace (this requires to do setns() to change it in > not current's net ns). > > So, let's export the parameter to /proc in standard way, > and this allows to access another process namespace > via /proc/[pid]/net/ip6_nonlocal_bind. > > Signed-off-by: Kirill Tkhai> --- > net/ipv6/proc.c | 48 > 1 file changed, 48 insertions(+) > > diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c > index 6e57028d2e91..2d0aa59c2d0d 100644 > --- a/net/ipv6/proc.c > +++ b/net/ipv6/proc.c > @@ -312,6 +312,47 @@ int snmp6_unregister_dev(struct inet6_dev *idev) > return 0; > } > > +static int nonlocal_bind_show(struct seq_file *seq, void *v) > +{ > + struct net *net = seq->private; > + > + seq_printf(seq, "%d\n", !!net->ipv6.sysctl.ip_nonlocal_bind); > + return 0; > +} > + > +static int open_nonlocal_bind(struct inode *inode, struct file *file) > +{ > + return single_open_net(inode, file, nonlocal_bind_show); > +} > + > +static ssize_t write_nonlocal_bind(struct file *file, const char __user > *ubuf, > +size_t count, loff_t *ppos) > +{ > + struct net *net = ((struct seq_file *)file->private_data)->private; > + char buf[3]; > + > + if (*ppos || count <= 0 || count > sizeof(buf)) > + return -EINVAL; > + > + if (copy_from_user(buf, ubuf, count)) > + return -EFAULT; > + buf[0] -= '0'; > + if ((count == 3 && buf[2] != '\0') || > + (count >= 2 && buf[1] != '\n') || > + (buf[0] != 0 && buf[0] != 1)) > + return -EINVAL; > + > + net->ipv6.sysctl.ip_nonlocal_bind = buf[0]; > + return count; > +} > + > +static const struct file_operations nonlocal_bind_ops = { > + .open = open_nonlocal_bind, > + .read = seq_read, > + .write = write_nonlocal_bind, > + .release = single_release_net, > +}; > + > static int __net_init ipv6_proc_init_net(struct net *net) > { 
> if (!proc_create("sockstat6", 0444, net->proc_net, > @@ -321,12 +362,18 @@ static int __net_init ipv6_proc_init_net(struct net > *net) > if (!proc_create("snmp6", 0444, net->proc_net, _seq_fops)) > goto proc_snmp6_fail; > > +if (!proc_create_data("ip6_nonlocal_bind", 0644, > +net->proc_net, _bind_ops, net)) > + goto proc_bind_fail; > + > net->mib.proc_net_devsnmp6 = proc_mkdir("dev_snmp6", net->proc_net); > if (!net->mib.proc_net_devsnmp6) > goto proc_dev_snmp6_fail; > return 0; > > proc_dev_snmp6_fail: > + remove_proc_entry("ip6_nonlocal_bind", net->proc_net); > +proc_bind_fail: > remove_proc_entry("snmp6", net->proc_net); > proc_snmp6_fail: > remove_proc_entry("sockstat6", net->proc_net); > @@ -337,6 +384,7 @@ static void __net_exit ipv6_proc_exit_net(struct net *net) > { > remove_proc_entry("sockstat6", net->proc_net); > remove_proc_entry("dev_snmp6", net->proc_net); > + remove_proc_entry("ip6_nonlocal_bind", net->proc_net); > remove_proc_entry("snmp6", net->proc_net); > } > >
possible deadlock in rtnl_lock (5)
Hello,

syzbot hit the following crash on upstream commit
3eb2ce825ea1ad89d20f7a3b5780df850e4be274 (Sun Mar 25 22:44:30 2018 +)
Linux 4.16-rc7

syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=a46d6abf9d56b1365a72
So far this crash happened 27 times on net-next, upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6524202618191872
syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5383267238805504
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5136472378179584
Kernel config: https://syzkaller.appspot.com/x/.config?id=-8440362230543204781
compiler: gcc (GCC) 7.1.1 20170620

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+a46d6abf9d56b1365...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.

IPVS: sync thread started: state = BACKUP, mcast_ifn = sit0, syncid = 0, id = 0
IPVS: sync thread started: state = BACKUP, mcast_ifn = sit0, syncid = 0, id = 0
IPVS: stopping backup sync thread 4500 ...

WARNING: possible recursive locking detected
4.16.0-rc7+ #3 Not tainted
syzkaller688027/4497 is trying to acquire lock:
 (rtnl_mutex){+.+.}, at: [] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74

but task is already holding lock:
IPVS: stopping backup sync thread 4495 ...
 (rtnl_mutex){+.+.}, at: [ ] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
  lock(rtnl_mutex);
  lock(rtnl_mutex);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

2 locks held by syzkaller688027/4497:
 #0: (rtnl_mutex){+.+.}, at: [ ] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
 #1: (ipvs->sync_mutex){+.+.}, at: [<703f78e3>] do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388

stack backtrace:
CPU: 1 PID: 4497 Comm: syzkaller688027 Not tainted 4.16.0-rc7+ #3
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x24d lib/dump_stack.c:53
 print_deadlock_bug kernel/locking/lockdep.c:1761 [inline]
 check_deadlock kernel/locking/lockdep.c:1805 [inline]
 validate_chain kernel/locking/lockdep.c:2401 [inline]
 __lock_acquire+0xe8f/0x3e00 kernel/locking/lockdep.c:3431
 lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
 __mutex_lock_common kernel/locking/mutex.c:756 [inline]
 __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
 rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
 ip_mc_drop_socket+0x88/0x230 net/ipv4/igmp.c:2643
 inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:413
 sock_release+0x8d/0x1e0 net/socket.c:595
 start_sync_thread+0x2213/0x2b70 net/netfilter/ipvs/ip_vs_sync.c:1924
 do_ip_vs_set_ctl+0x1139/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2389
 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
 nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
 ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1261
 udp_setsockopt+0x45/0x80 net/ipv4/udp.c:2406
 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2975
 SYSC_setsockopt net/socket.c:1849 [inline]
 SyS_setsockopt+0x189/0x360 net/socket.c:1828
 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x446a69
RSP: 002b:7fa1c3a64da8 EFLAGS: 0246 ORIG_RAX: 0036
RAX: ffda RBX: RCX: 00446a69
RDX: 048b RSI: RDI: 0003
RBP: 006e29fc R08: 0018 R09:
R10: 20c0 R11: 0246 R12: 006e29f8
R13: 00676e697279656b R14: 7fa1c3a659c0 R15: 006e2b60

---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkal...@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug report.
Note: all commands must start from beginning of the line in the email body.
[PATCH net V2] vhost: correctly remove wait queue during poll failure
We tried to remove the vq poll from its wait queue, but did not check
whether it was actually on a list first. This can lead to a double free.
Fix this by switching to vhost_poll_stop(), which zeroes poll->wqh after
removing the poll from the waitqueue, to make sure it won't be freed twice.

Cc: Darren Kenny
Reported-by: syzbot+c0272972b01b872e6...@syzkaller.appspotmail.com
Fixes: 2b8b328b61c79 ("vhost_net: handle polling errors when setting backend")
Signed-off-by: Jason Wang
---
Changes from V1:
- tweak the commit log to match the code
---
 drivers/vhost/vhost.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 1b3e8d2d..5d5a9d9 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -212,8 +212,7 @@ int vhost_poll_start(struct vhost_poll *poll, struct file *file)
 	if (mask)
 		vhost_poll_wakeup(&poll->wait, 0, 0, poll_to_key(mask));
 	if (mask & EPOLLERR) {
-		if (poll->wqh)
-			remove_wait_queue(poll->wqh, &poll->wait);
+		vhost_poll_stop(poll);
 		ret = -EINVAL;
 	}
--
2.7.4
Re: [PATCH net V2] vhost: correctly remove wait queue during poll failure
On Tue, Mar 27, 2018 at 08:50:52PM +0800, Jason Wang wrote:
> We tried to remove the vq poll from its wait queue, but did not check
> whether it was actually on a list first. This can lead to a double free.
> Fix this by switching to vhost_poll_stop(), which zeroes poll->wqh after
> removing the poll from the waitqueue, to make sure it won't be freed twice.
>
> Cc: Darren Kenny
> Reported-by: syzbot+c0272972b01b872e6...@syzkaller.appspotmail.com
> Fixes: 2b8b328b61c79 ("vhost_net: handle polling errors when setting backend")
> Signed-off-by: Jason Wang

Reviewed-by: Darren Kenny

> ---
> Changes from V1:
> - tweak the commit log to match the code
> ---
>  drivers/vhost/vhost.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 1b3e8d2d..5d5a9d9 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -212,8 +212,7 @@ int vhost_poll_start(struct vhost_poll *poll, struct file *file)
>  	if (mask)
>  		vhost_poll_wakeup(&poll->wait, 0, 0, poll_to_key(mask));
>  	if (mask & EPOLLERR) {
> -		if (poll->wqh)
> -			remove_wait_queue(poll->wqh, &poll->wait);
> +		vhost_poll_stop(poll);
>  		ret = -EINVAL;
>  	}
> --
> 2.7.4
Re: [net-next PATCH 2/5] soc: ti: K2G: provide APIs to support driver probe deferral
Hello Andrew,

On 03/26/2018 04:48 PM, Andrew Lunn wrote:
> On Mon, Mar 26, 2018 at 04:15:09PM -0400, Murali Karicheri wrote:
>> This patch provides APIs to allow client drivers to support
>> probe deferral. On the K2G SoC, devices can be probed only
>> after the ti_sci_pm_domains driver is probed and ready.
>> As drivers may get probed in a different order, any driver
>> that depends on the knav dma and qmss drivers, for example
>> the netcp network driver, needs to defer probe until
>> the knav devices are probed and ready to service. To do this,
>> add an API to query the device ready status from the knav
>> dma and qmss devices.
>
> Hi Murali
>
> Shouldn't you really re-write this to be a dma driver? You would then
> do something like of_dma_request_slave_channel() in the ethernet
> driver probe function. That probably correctly returns EPROBE_DEFER.
>
>    Andrew

Could you please elaborate? The knav dma and qmss drivers were introduced to support the packet DMA hardware in Keystone NetCP, which couldn't be implemented using the DMA APIs available at the time the driver was introduced. Another reason was that the performance was really bad. We had an earlier internal implementation based on the DMA API, which couldn't be upstreamed at that time because we were mis-using the API for this driver. So we introduced the knav_dma driver to support NetCP. We don't have any plan to re-write the driver at this time.

If your question is about EPROBE_DEFER being returned from an existing knav_dma API, and using that return code to achieve probe deferral instead of introducing these new APIs, I can take a look into that and respond. So please clarify.

--
Murali Karicheri
Linux Kernel, Keystone
Re: possible deadlock in rtnl_lock (5)
syzbot wrote:

[ cc Julian and trimming cc list ]

> syzkaller688027/4497 is trying to acquire lock:
> (rtnl_mutex){+.+.}, at: [ ] rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> but task is already holding lock:
> IPVS: stopping backup sync thread 4495 ...
> (rtnl_mutex){+.+.}, at: [ ] rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
>        CPU0
>
> lock(rtnl_mutex);
> lock(rtnl_mutex);
>
> *** DEADLOCK ***
>
> May be due to missing lock nesting notation

Looks like this is real. Commit e0b26cc997d57305b4097711e12e13992580ae34
("ipvs: call rtnl_lock early") added rtnl_lock when starting the sync
thread, but the socket close path invokes rtnl_lock too:

> stack backtrace:
> rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
> ip_mc_drop_socket+0x88/0x230 net/ipv4/igmp.c:2643
> inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:413
> sock_release+0x8d/0x1e0 net/socket.c:595
> start_sync_thread+0x2213/0x2b70 net/netfilter/ipvs/ip_vs_sync.c:1924
> do_ip_vs_set_ctl+0x1139/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2389
Re: [PATCH] net: fec: set dma_coherent_mask
Hi Greg,

On Mon, Mar 26, 2018 at 3:36 PM, Greg Ungerer wrote:
> As of commit 205e1b7f51e4 ("dma-mapping: warn when there is no
> coherent_dma_mask") the Freescale FEC driver is issuing the following
> warning on driver initialization on ColdFire systems:
>
> WARNING: CPU: 0 PID: 1 at ./include/linux/dma-mapping.h:516 0x40159e20
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper Not tainted 4.16.0-rc7-dirty #4
> Stack from 41833dd8:
> 41833dd8 40259c53 40025534 40279e26 0003 4004e514 41827000
> 400255de 40244e42 0204 40159e20 0009 4024531d
> 40159e20 40244e42 0204 0007
> 40279e26 4028d040 40226576 4003ae88 40279e26 418273f6 41833ef8
> 7fff 418273f2 41867028 4003c9a2 4180ac6c 0004 41833f8c 4013e71c
> 40279e1c 40279e26 40226c16 4013ced2 40279e26 40279e58 4028d040
> Call Trace:
> [<40025534>] 0x40025534
> [<4004e514>] 0x4004e514
> [<400255de>] 0x400255de
> [<40159e20>] 0x40159e20
> [<40159e20>] 0x40159e20
>
> It is not fatal; the driver and the system continue to function normally.
>
> As per the warning, the coherent_dma_mask is not set on this device.
> There is nothing special about DMA memory coherency on this hardware,
> so we can just set the mask to 32 bits during probe.
>
> Signed-off-by: Greg Ungerer

Thanks for your patch!

> ---
>  drivers/net/ethernet/freescale/fec_main.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> Is this the best way to handle this problem?
> Comments welcome...
>
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index d4604bc..3cb130a 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -2702,6 +2702,8 @@ static int fec_enet_alloc_queue(struct net_device *ndev)
>  	int ret = 0;
>  	struct fec_enet_priv_tx_q *txq;
>
> +	dma_set_coherent_mask(&fep->pdev->dev, DMA_BIT_MASK(32));
> +
>  	for (i = 0; i < fep->num_tx_queues; i++) {
>  		txq = kzalloc(sizeof(*txq), GFP_KERNEL);
>  		if (!txq) {

As per your other email, this does not trigger on i.MX systems using DT.
Hence I'm wondering if the ColdFire platform code shouldn't just do the
same thing drivers/of/device.c does, cfr.
https://www.spinics.net/lists/linux-m68k/msg10929.html?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like
that. -- Linus Torvalds
Re: [PATCH net-next 1/4] qed: Populate nvm image attribute shadow.
On Mon, Mar 26, 2018 at 03:13:45AM -0700, Sudarsana Reddy Kalluru wrote:
> This patch add support for populating the flash image attributes.

s/add/adds/

[...]

> -int qed_mcp_bist_nvm_test_get_image_att(struct qed_hwfn *p_hwfn,
> -					struct qed_ptt *p_ptt,
> -					struct bist_nvm_image_att *p_image_att,
> +int qed_mcp_bist_nvm_get_image_att(struct qed_hwfn *p_hwfn,
> +struct qed_ptt *p_ptt,
> +struct bist_nvm_image_att *p_image_att,
>  u32 image_index)

Indentation seems broken.

>
> +int qed_mcp_nvm_info_populate(struct qed_hwfn *p_hwfn)
> +{
> +	struct qed_nvm_image_info *nvm_info = &p_hwfn->nvm_info;
> +	struct qed_ptt *p_ptt;
> +	int rc;
> +	u32 i;
> +
> +	p_ptt = qed_ptt_acquire(p_hwfn);
> +	if (!p_ptt) {
> +		DP_ERR(p_hwfn, "failed to acquire ptt\n");
> +		return -EBUSY;
> +	}
> +
> +	/* Acquire from MFW the amount of available images */
> +	nvm_info->num_images = 0;
> +	rc = qed_mcp_bist_nvm_get_num_images(p_hwfn,
> +					     p_ptt, &nvm_info->num_images);
> +	if (rc == -EOPNOTSUPP) {
> +		DP_INFO(p_hwfn, "DRV_MSG_CODE_BIST_TEST is not supported\n");
> +		goto out;
> +	} else if ((rc != 0) || (nvm_info->num_images == 0)) {

rc || !nvm_info->num_images

> +		DP_ERR(p_hwfn, "Failed getting number of images\n");
> +		goto err0;
> +	}
> +
> +	nvm_info->image_att =
> +	kmalloc(nvm_info->num_images * sizeof(struct bist_nvm_image_att),
> +		GFP_KERNEL);

Indentation can be better than this.

[...]

> --- a/drivers/net/ethernet/qlogic/qed/qed_selftest.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_selftest.c
> @@ -125,10 +125,11 @@ int qed_selftest_nvram(struct qed_dev *cdev)
>  }
>
>  /* Acquire from MFW the amount of available images */
> -	rc = qed_mcp_bist_nvm_test_get_num_images(p_hwfn, p_ptt, &num_images);
> +	rc = qed_mcp_bist_nvm_get_num_images(p_hwfn, p_ptt, &num_images);
>  	if (rc || !num_images) {
>  		DP_ERR(p_hwfn, "Failed getting number of images\n");
> -		return -EINVAL;
> +		rc = -EINVAL;
> +		goto err0;

Well, this one is a bug fix [the failure flow currently leaks a PTT entry].
If you don't want to treat it as one that's fine, but I think it deserves its own patch in the series.

>  	}
>
>  /* Iterate over images and validate CRC */
> @@ -136,8 +137,8 @@ int qed_selftest_nvram(struct qed_dev *cdev)
>  	/* This mailbox returns information about the image required for
>  	 * reading it.
>  	 */
> -	rc = qed_mcp_bist_nvm_test_get_image_att(p_hwfn, p_ptt,
> -						 &image_att, i);
> +	rc = qed_mcp_bist_nvm_get_image_att(p_hwfn, p_ptt,
> +					    &image_att, i);
>  	if (rc) {
>  		DP_ERR(p_hwfn,
>  		       "Failed getting image index %d attributes\n",
> --
> 1.8.3.1
>
Re: [PATCH v3 iproute2-next 1/8] rdma: include rdma-core
On 3/26/18 11:46 PM, Leon Romanovsky wrote:
> On Mon, Mar 26, 2018 at 01:57:32PM -0700, Steve Wise wrote:
>> This avoids requiring rdma-core to be installed on systems.
>>
>> Signed-off-by: Steve Wise
>> ---
>>  rdma/include/rdma/rdma_cma.h | 728 +++
>>  1 file changed, 728 insertions(+)
>>  create mode 100644 rdma/include/rdma/rdma_cma.h
>>
>
> Steve,
>
> Sorry for not spotting it before, but you actually need only 3 enums for
> cm_id_ps_to_str() from rdma_cma.h.
>
> Simply copy/paste that enum into cm_id_ps_to_str().
>
> Thanks

I think this is Jason's point: if that enum is part of the UAPI, why isn't it part of a UAPI header file? We definitely do not want that entire header file brought into iproute2.
[PATCH v2 net 1/1] qede: Fix barrier usage after tx doorbell write.
Since commit c5ad119fb6c09b0297446be05bd66602fa564758 ("net: sched:
pfifo_fast use skb_array") the driver is exposed to an issue where it
hits NULL skbs while handling TX completions. The driver uses mmiowb()
to flush writes to the doorbell BAR, which is a write-combined BAR;
however, on x86 mmiowb() does not flush the write-combining buffer.

This patch fixes the problem by replacing mmiowb() with wmb() after the
write-combined doorbell write, so that the writes are flushed and
synchronized across processors.

V1->V2:
---
This patch was marked as "superseded" in patchwork (not really sure for
what reason). Resending it as v2.

Signed-off-by: Ariel Elior
Signed-off-by: Manish Chopra
---
 drivers/net/ethernet/qlogic/qede/qede_fp.c | 10 --
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index dafc079..2e921ca 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -320,13 +320,11 @@ static inline void qede_update_tx_producer(struct qede_tx_queue *txq)
 	barrier();
 	writel(txq->tx_db.raw, txq->doorbell_addr);
-	/* mmiowb is needed to synchronize doorbell writes from more than one
-	 * processor. It guarantees that the write arrives to the device before
-	 * the queue lock is released and another start_xmit is called (possibly
-	 * on another CPU). Without this barrier, the next doorbell can bypass
-	 * this doorbell. This is applicable to IA64/Altix systems.
+	/* Fence required to flush the write combined buffer, since another
+	 * CPU may write to the same doorbell address and data may be lost
+	 * due to relaxed order nature of write combined bar.
 	 */
-	mmiowb();
+	wmb();
 }

 static int qede_xdp_xmit(struct qede_dev *edev, struct qede_fastpath *fp,
--
1.7.1
Re: [PATCH] staging: fsl-dpaa2/ethsw: Fix TCI values overwrite
On Tue, Mar 27, 2018 at 08:10:50AM -0500, Razvan Stefanescu wrote:
> Previous implementation overwrites the PCP value, assuming the default
> value is 0, instead of 7.
>
> Avoid this by modifying helper function ethsw_port_set_tci() to
> ethsw_port_set_pvid() and making it update only the vlan_id of the
> tci_cfg struct.

Hi Razvan

It is a good idea to explain acronyms, especially for staging, since there are patches for all sorts of devices and you cannot expect everybody to know network-specific acronyms. By PCP you mean Priority Code Point; TCI I have no idea about.

Looking at the code, I think you are changing the flow to become read/modify/write, instead of just write, which was overwriting the previously configured Priority Code Point?

Please try to add more details to your change logs, to help us understand the change.

Andrew
Re: [Patch net] llc: properly handle dev_queue_xmit() return value
Hi,

I am not sure what the next step is from this. Does it mean that a patch is now in the kernel's git tree, or is this just a proposal?

On Tue, Mar 27, 2018 at 1:08 AM, Cong Wang wrote:
> llc_conn_send_pdu() pushes the skb into the write queue and
> calls llc_conn_send_pdus() to flush them out. However, the
> status of dev_queue_xmit() is not returned to the caller,
> in this case, llc_conn_state_process().
>
> llc_conn_state_process() needs to hold the skb no matter
> success or failure, because it still uses it after that;
> therefore we should hold the skb before dev_queue_xmit() when
> that skb is the one being processed by llc_conn_state_process().
>
> For other callers, they can just pass NULL and ignore
> the return value as they are.
>
> Reported-by: Noam Rathaus
> Signed-off-by: Cong Wang
> ---
>  include/net/llc_conn.h | 2 +-
>  net/llc/llc_c_ac.c | 15 +--
>  net/llc/llc_conn.c | 32 +++-
>  3 files changed, 33 insertions(+), 16 deletions(-)
>
> diff --git a/include/net/llc_conn.h b/include/net/llc_conn.h
> index fe994d2e5286..5c40f118c0fa 100644
> --- a/include/net/llc_conn.h
> +++ b/include/net/llc_conn.h
> @@ -103,7 +103,7 @@ void llc_sk_reset(struct sock *sk);
>
>  /* Access to a connection */
>  int llc_conn_state_process(struct sock *sk, struct sk_buff *skb);
> -void llc_conn_send_pdu(struct sock *sk, struct sk_buff *skb);
> +int llc_conn_send_pdu(struct sock *sk, struct sk_buff *skb);
>  void llc_conn_rtn_pdu(struct sock *sk, struct sk_buff *skb);
>  void llc_conn_resend_i_pdu_as_cmd(struct sock *sk, u8 nr, u8 first_p_bit);
>  void llc_conn_resend_i_pdu_as_rsp(struct sock *sk, u8 nr, u8 first_f_bit);
> diff --git a/net/llc/llc_c_ac.c b/net/llc/llc_c_ac.c
> index f59648018060..163121192aca 100644
> --- a/net/llc/llc_c_ac.c
> +++ b/net/llc/llc_c_ac.c
> @@ -389,7 +389,7 @@ static int llc_conn_ac_send_i_cmd_p_set_0(struct sock *sk, struct sk_buff *skb)
>  	llc_pdu_init_as_i_cmd(skb, 0, llc->vS, llc->vR);
>  	rc = llc_mac_hdr_init(skb, llc->dev->dev_addr, llc->daddr.mac);
>  	if (likely(!rc)) {
> -		llc_conn_send_pdu(sk, skb);
> +		rc = llc_conn_send_pdu(sk, skb);
>  		llc_conn_ac_inc_vs_by_1(sk, skb);
>  	}
>  	return rc;
> @@ -916,7 +916,7 @@ static int llc_conn_ac_send_i_rsp_f_set_ackpf(struct sock *sk,
>  	llc_pdu_init_as_i_cmd(skb, llc->ack_pf, llc->vS, llc->vR);
>  	rc = llc_mac_hdr_init(skb, llc->dev->dev_addr, llc->daddr.mac);
>  	if (likely(!rc)) {
> -		llc_conn_send_pdu(sk, skb);
> +		rc = llc_conn_send_pdu(sk, skb);
>  		llc_conn_ac_inc_vs_by_1(sk, skb);
>  	}
>  	return rc;
> @@ -935,14 +935,17 @@ static int llc_conn_ac_send_i_rsp_f_set_ackpf(struct sock *sk,
>  int llc_conn_ac_send_i_as_ack(struct sock *sk, struct sk_buff *skb)
>  {
>  	struct llc_sock *llc = llc_sk(sk);
> +	int ret;
>
>  	if (llc->ack_must_be_send) {
> -		llc_conn_ac_send_i_rsp_f_set_ackpf(sk, skb);
> +		ret = llc_conn_ac_send_i_rsp_f_set_ackpf(sk, skb);
>  		llc->ack_must_be_send = 0 ;
>  		llc->ack_pf = 0;
> -	} else
> -		llc_conn_ac_send_i_cmd_p_set_0(sk, skb);
> -	return 0;
> +	} else {
> +		ret = llc_conn_ac_send_i_cmd_p_set_0(sk, skb);
> +	}
> +
> +	return ret;
>  }
>
>  /**
> diff --git a/net/llc/llc_conn.c b/net/llc/llc_conn.c
> index 9177dbb16dce..110e32bcb399 100644
> --- a/net/llc/llc_conn.c
> +++ b/net/llc/llc_conn.c
> @@ -30,7 +30,7 @@
>  #endif
>
>  static int llc_find_offset(int state, int ev_type);
> -static void llc_conn_send_pdus(struct sock *sk);
> +static int llc_conn_send_pdus(struct sock *sk, struct sk_buff *skb);
>  static int llc_conn_service(struct sock *sk, struct sk_buff *skb);
>  static int llc_exec_conn_trans_actions(struct sock *sk,
>  				       struct llc_conn_state_trans *trans,
> @@ -193,11 +193,11 @@ int llc_conn_state_process(struct sock *sk, struct sk_buff *skb)
>  	return rc;
>  }
>
> -void llc_conn_send_pdu(struct sock *sk, struct sk_buff *skb)
> +int llc_conn_send_pdu(struct sock *sk, struct sk_buff *skb)
>  {
>  	/* queue PDU to send to MAC layer */
>  	skb_queue_tail(&sk->sk_write_queue, skb);
> -	llc_conn_send_pdus(sk);
> +	return llc_conn_send_pdus(sk, skb);
>  }
>
>  /**
> @@ -255,7 +255,7 @@ void llc_conn_resend_i_pdu_as_cmd(struct sock *sk, u8 nr, u8 first_p_bit)
>  	if (howmany_resend > 0)
>  		llc->vS = (llc->vS + 1) % LLC_2_SEQ_NBR_MODULO;
>  	/* any PDUs to re-send are queued up; start sending to MAC */
> -	llc_conn_send_pdus(sk);
> +	llc_conn_send_pdus(sk, NULL);
>  out:;
>  }
>
> @@ -296,7 +296,7 @@ void llc_conn_resend_i_pdu_as_rsp(struct sock *sk, u8 nr, u8
Re: [net-next PATCH 3/5] net: netcp: ethss enhancements to support 2u cpsw h/w on K2G SoC
On 03/26/2018 04:28 PM, Andrew Lunn wrote:
> On Mon, Mar 26, 2018 at 04:15:10PM -0400, Murali Karicheri wrote:
>> The K2G SoC uses the 2u cpsw h/w. It uses RGMII instead of SGMII to
>> interface with the PHY. This patch enhances the driver to check RGMII
>> status instead of SGMII status for link state determination. Also map
>> all of the vlan priorities to zero, as the packet DMA is enabled to
>> receive only flow id 0, which maps to priority zero.
>>
>> Additionally, when a PHY with an RGMII interface requires internal
>> delay, that is set in the phy driver. To support such PHY devices,
>> add phy-mode handling code to the driver using of_get_phy_mode() and
>> pass the obtained phy mode to of_phy_connect().
>
> Hi Murali
>
> Please break this patch up. One patch should do one thing. That makes
> it easy to review. There are too many things going on at once here.
>
>    Andrew

Hello Andrew,

Thanks for the comment. But I am not sure how to break this up, as this is an enhancement to the driver to support a newer version of the cpsw hardware. Without all these pieces together, the driver can't function. Probably I can make the breakup below based on the different functions added. Let me know if this looks good to you. Be aware that patches #2 and #3 are small, and the majority of the code change will be in patch #1, which has to go together.

Patch #1. Add support for the new link interface, RGMII_LINK_MAC_PHY, for K2G - most of the code is for this
Patch #2. Add support for configuring phy_mode - this just adds the phy_mode handling code
Patch #3. Map all vlan priorities to flow id 0

--
Murali Karicheri
Linux Kernel, Keystone
Re: [PATCH 4/4] selftests/bpf: fix compiling errors
On Tue, Mar 27, 2018 at 11:52:27AM +0200, Daniel Borkmann wrote:
> On 03/27/2018 11:00 AM, Du, Changbin wrote:
> > On Tue, Mar 27, 2018 at 10:52:57AM +0200, Daniel Borkmann wrote:
> >> On 03/27/2018 05:06 AM, Du, Changbin wrote:
> >>> On Mon, Mar 26, 2018 at 08:02:30PM -0700, Alexei Starovoitov wrote:
> >>>> On Tue, Mar 27, 2018 at 10:20:10AM +0800, Du, Changbin wrote:
> >>>>> On Mon, Mar 26, 2018 at 07:55:13AM -0700, Alexei Starovoitov wrote:
> >>>>>> On Mon, Mar 26, 2018 at 05:23:28PM +0800, changbin...@intel.com wrote:
> >>>>>>> Signed-off-by: Changbin Du
> >>>>>>> ---
> >>>>>>>  tools/testing/selftests/bpf/Makefile | 5 +++--
> >>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> >>>>>>> index 5c43c18..dc0fdc8 100644
> >>>>>>> --- a/tools/testing/selftests/bpf/Makefile
> >>>>>>> +++ b/tools/testing/selftests/bpf/Makefile
> >>>>>>> @@ -10,7 +10,8 @@ ifneq ($(wildcard $(GENHDR)),)
> >>>>>>>   GENFLAGS := -DHAVE_GENHDR
> >>>>>>> endif
> >>>>>>>
> >>>>>>> -CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include
> >>>>>>> +CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) \
> >>>>>>> +  -I../../../include -I../../../../usr/include
> >>>>>>> LDLIBS += -lcap -lelf -lrt -lpthread
> >>>>>>>
> >>>>>>> # Order correspond to 'make run_tests' order
> >>>>>>> @@ -62,7 +63,7 @@ else
> >>>>>>>   CPU ?= generic
> >>>>>>> endif
> >>>>>>>
> >>>>>>> -CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \
> >>>>>>> +CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi -I../../../../usr/include \
> >>>>>>>   -Wno-compare-distinct-pointer-types
> >>>>>>
> >>>>>> Nack.
> >>>>>> I suspect that will break the build for everyone else who's doing it
> >>>>>> in the directory itself instead of the outer one.
> >>>>>
> >>>>> This one? But I didn't see any problem.
> >>>>
> >>>> because the build was lucky and additional path ../../../../usr/include
> >>>> didn't point to a valid dir?
> >>>
> >>> I am sorry but I don't understand what you mean by *lucky*. Of course,
> >>> the path is valid.
> >>
> >> Agree.
> >>
> >> The problem is that this suddenly requires users to do a 'make headers_install'
> >> in order to populate the usr/include/ directory in the first place. While it's
> >> annoying enough for BPF samples where this is needed, I absolutely don't want
> >> to introduce this for BPF kselftests. It's the wrong approach. Besides, in tools
> >> infra, there is a tools/arch/*/include/uapi/asm/bitsperlong.h header copy
> >> already, so we really need to use that instead. Please adapt your patch
> >> accordingly and respin. Please also Cc us and netdev@vger.kernel.org for BPF
> >> kselftests changes.
> >>
> > Thanks for the explanation. So we expect that tools/arch/*/include is in
> > the search list, right? The current makefile seems not to have it. How do
> > you get this built?
>
> E.g. take a look at tools/include/asm/barrier.h or
> tools/include/uapi/asm/bpf_perf_event.h, just to name two examples. We'd
> need something similar to this which then points to the arch specific
> includes.
>
> Thanks,
> Daniel

ok, I see. But I am going to skip this fix this time, because once one gets fixed, another appears. IMHO, it doesn't sound like a good idea to sync all these files manually. We should have a better solution, I think.

clang -I. -I./include/uapi -I../../../include/uapi -Wno-compare-distinct-pointer-types \
    -O2 -target bpf -emit-llvm -c test_pkt_access.c -o - | \
    llc -march=bpf -mcpu=generic -filetype=obj -o /home/changbin/work/linux/tools/testing/selftests/bpf/test_pkt_access.o
In file included from test_pkt_access.c:12:
/usr/include/linux/ip.h:20:10: fatal error: 'asm/byteorder.h' file not found
#include <asm/byteorder.h>

> changbin@gvt-dell-host:~/work/linux/tools/testing/selftests/bpf$ make -p
> [...]
> clang -I. -I./include/uapi -I../../../include/uapi -Wno-compare-distinct-pointer-types \
>     -O2 -target bpf -emit-llvm -c test_pkt_access.c -o - | \
>     llc -march=bpf -mcpu=generic -filetype=obj -o /home/changbin/work/linux/tools/testing/selftests/bpf/test_pkt_access.o
> In file included from test_pkt_access.c:9:
> In file included from ../../../include/uapi/linux/bpf.h:11:
> In file included from ./include/uapi/linux/types.h:5:
> /usr/include/asm-generic/int-ll64.h:11:10: fatal error: 'asm/bitsperlong.h' file not found
> #include <asm/bitsperlong.h>

--
Thanks,
Changbin Du
[PATCH] staging: fsl-dpaa2/ethsw: Fix TCI values overwrite
Previous implementation overwrites PCP value, assuming the default value is 0, instead of 7.

Avoid this by modifying helper function ethsw_port_set_tci() to ethsw_port_set_pvid() and make it update only the vlan_id of the tci_cfg struct.

Signed-off-by: Razvan Stefanescu
---
 drivers/staging/fsl-dpaa2/ethsw/dpsw-cmd.h | 13 +
 drivers/staging/fsl-dpaa2/ethsw/dpsw.c | 42 ++
 drivers/staging/fsl-dpaa2/ethsw/dpsw.h | 6 +
 drivers/staging/fsl-dpaa2/ethsw/ethsw.c | 37 +-
 4 files changed, 79 insertions(+), 19 deletions(-)

diff --git a/drivers/staging/fsl-dpaa2/ethsw/dpsw-cmd.h b/drivers/staging/fsl-dpaa2/ethsw/dpsw-cmd.h
index 1c203e6..da744f2 100644
--- a/drivers/staging/fsl-dpaa2/ethsw/dpsw-cmd.h
+++ b/drivers/staging/fsl-dpaa2/ethsw/dpsw-cmd.h
@@ -49,6 +49,8 @@
 #define DPSW_CMDID_IF_SET_FLOODING DPSW_CMD_ID(0x047)
 #define DPSW_CMDID_IF_SET_BROADCAST DPSW_CMD_ID(0x048)
 
+#define DPSW_CMDID_IF_GET_TCI DPSW_CMD_ID(0x04A)
+
 #define DPSW_CMDID_IF_SET_LINK_CFG DPSW_CMD_ID(0x04C)
 
 #define DPSW_CMDID_VLAN_ADD DPSW_CMD_ID(0x060)
@@ -206,6 +208,17 @@ struct dpsw_cmd_if_set_tci {
 	__le16 conf;
 };
 
+struct dpsw_cmd_if_get_tci {
+	__le16 if_id;
+};
+
+struct dpsw_rsp_if_get_tci {
+	__le16 pad;
+	__le16 vlan_id;
+	u8 dei;
+	u8 pcp;
+};
+
 #define DPSW_STATE_SHIFT 0
 #define DPSW_STATE_SIZE 4
 
diff --git a/drivers/staging/fsl-dpaa2/ethsw/dpsw.c b/drivers/staging/fsl-dpaa2/ethsw/dpsw.c
index 9b9bc60..3ea957c 100644
--- a/drivers/staging/fsl-dpaa2/ethsw/dpsw.c
+++ b/drivers/staging/fsl-dpaa2/ethsw/dpsw.c
@@ -529,6 +529,48 @@ int dpsw_if_set_tci(struct fsl_mc_io *mc_io,
 }
 
 /**
+ * dpsw_if_get_tci() - Get default VLAN Tag Control Information (TCI)
+ * @mc_io: Pointer to MC portal's I/O object
+ * @cmd_flags: Command flags; one or more of 'MC_CMD_FLAG_'
+ * @token: Token of DPSW object
+ * @if_id: Interface Identifier
+ * @cfg: Tag Control Information Configuration
+ *
+ * Return: Completion status. '0' on Success; Error code otherwise.
+ */
+int dpsw_if_get_tci(struct fsl_mc_io *mc_io,
+		    u32 cmd_flags,
+		    u16 token,
+		    u16 if_id,
+		    struct dpsw_tci_cfg *cfg)
+{
+	struct mc_command cmd = { 0 };
+	struct dpsw_cmd_if_get_tci *cmd_params;
+	struct dpsw_rsp_if_get_tci *rsp_params;
+	int err;
+
+	/* prepare command */
+	cmd.header = mc_encode_cmd_header(DPSW_CMDID_IF_GET_TCI,
+					  cmd_flags,
+					  token);
+	cmd_params = (struct dpsw_cmd_if_get_tci *)cmd.params;
+	cmd_params->if_id = cpu_to_le16(if_id);
+
+	/* send command to mc */
+	err = mc_send_command(mc_io, &cmd);
+	if (err)
+		return err;
+
+	/* retrieve response parameters */
+	rsp_params = (struct dpsw_rsp_if_get_tci *)cmd.params;
+	cfg->pcp = rsp_params->pcp;
+	cfg->dei = rsp_params->dei;
+	cfg->vlan_id = le16_to_cpu(rsp_params->vlan_id);
+
+	return 0;
+}
+
+/**
 * dpsw_if_set_stp() - Function sets Spanning Tree Protocol (STP) state.
 * @mc_io: Pointer to MC portal's I/O object
 * @cmd_flags: Command flags; one or more of 'MC_CMD_FLAG_'
diff --git a/drivers/staging/fsl-dpaa2/ethsw/dpsw.h b/drivers/staging/fsl-dpaa2/ethsw/dpsw.h
index 3335add..82f80c40 100644
--- a/drivers/staging/fsl-dpaa2/ethsw/dpsw.h
+++ b/drivers/staging/fsl-dpaa2/ethsw/dpsw.h
@@ -306,6 +306,12 @@ int dpsw_if_set_tci(struct fsl_mc_io *mc_io,
 		    u16 if_id,
 		    const struct dpsw_tci_cfg *cfg);
 
+int dpsw_if_get_tci(struct fsl_mc_io *mc_io,
+		    u32 cmd_flags,
+		    u16 token,
+		    u16 if_id,
+		    struct dpsw_tci_cfg *cfg);
+
 /**
  * enum dpsw_stp_state - Spanning Tree Protocol (STP) states
  * @DPSW_STP_STATE_BLOCKING: Blocking state
diff --git a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
index c723a04..ab81a6c 100644
--- a/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
+++ b/drivers/staging/fsl-dpaa2/ethsw/ethsw.c
@@ -50,14 +50,23 @@ static int ethsw_add_vlan(struct ethsw_core *ethsw, u16 vid)
 	return 0;
 }
 
-static int ethsw_port_set_tci(struct ethsw_port_priv *port_priv,
-			      struct dpsw_tci_cfg *tci_cfg)
+static int ethsw_port_set_pvid(struct ethsw_port_priv *port_priv, u16 pvid)
 {
 	struct ethsw_core *ethsw = port_priv->ethsw_data;
 	struct net_device *netdev = port_priv->netdev;
+	struct dpsw_tci_cfg tci_cfg = { 0 };
 	bool is_oper;
 	int err, ret;
 
+	err = dpsw_if_get_tci(ethsw->mc_io, 0, ethsw->dpsw_handle,
+			      port_priv->idx, &tci_cfg);
+	if (err) {
+
Re: [PATCH net-next 3/4] qed: Adapter flash update support.
On Mon, Mar 26, 2018 at 03:13:47AM -0700, Sudarsana Reddy Kalluru wrote:
> This patch adds the required driver support for updating the flash or
> non volatile memory of the adapter. At highlevel, flash upgrade comprises
> of reading the flash images from the input file, validating the images and
> writing it to the respective paritions.

s/it/them/

[...]

> + *
> + * /--\
> + * 0B  | 0x4 [command index] |
> + * 4B  | image_type | Options | Number of register settings |
> + * 8B  | Value |
> + * 12B | Mask |
> + * 16B | Offset |
> + * \--/
> + * There can be several Value-Mask-Offset sets as specified by 'Number of...'.
> + * Options - 0'b - Calculate & Update CRC for image
> + */
> +static int qed_nvm_flash_image_access(struct qed_dev *cdev, const u8 **data,
> +				      bool *check_resp)
> +{
> +	struct qed_nvm_image_att nvm_image;
> +	struct qed_hwfn *p_hwfn;
> +	bool is_crc = false;
> +	u32 image_type;
> +	int rc = 0, i;
> +	u16 len;
> +
> +	nvm_image.start_addr = p_hwfn->nvm_info.image_att[i].nvm_start_addr;
> +	nvm_image.length = p_hwfn->nvm_info.image_att[i].len;
> +
> +	DP_VERBOSE(cdev, NETIF_MSG_DRV,
> +		   "Read image %02x; type = %08x; NVM [%08x,...,%08x]\n",
> +		   **data, nvm_image.start_addr, image_type,
> +		   nvm_image.start_addr + nvm_image.length - 1);

Looks like the 3rd and 4th printed parameters are flipped.

> +	(*data)++;
> +	is_crc = !!(**data);

If you'd actually want to be able to use the reserved bits [forward-compatibility] in the future, you should check bit 0 instead of checking the byte.

> +	(*data)++;
> +	len = *((u16 *)*data);
> +	*data += 2;

[...]

> +
> +/* Binary file format -
> + * /--\
> + * 0B | 0x3 [command index] |
> + * 4B | b'0: check_response? | b'1-127 reserved |

This shows there are 128 bits in a 4 byte field.

> + * 8B | File-type | reserved |
> + * \--/
> + * Start a new file of the provided type
> + */
> +static int qed_nvm_flash_image_file_start(struct qed_dev *cdev,
> +					  const u8 **data, bool *check_resp)
> +{
> +	int rc;
> +
> +	*data += 4;
> +	*check_resp = !!(**data);

Like above.

> +	*data += 4;
> +
> +	DP_VERBOSE(cdev, NETIF_MSG_DRV,
> +		   "About to start a new file of type %02x\n", **data);
> +	rc = qed_mcp_nvm_put_file_begin(cdev, **data);
> +	*data += 4;
> +
> +	return rc;
> +}
> +
> +/* Binary file format -
> + * /--\
> + * 0B | 0x2 [command index] |
> + * 4B | Length in bytes |
> + * 8B | b'0: check_response? | b'1-127 reserved |

Same as above.

> + * 12B | Offset in bytes |
> + * 16B | Data ... |
> + * \--/
> + * Write data as part of a file that was previously started. Data should be
> + * of length equal to that provided in the message
> + */
> +static int qed_nvm_flash_image_file_data(struct qed_dev *cdev,
> +					 const u8 **data, bool *check_resp)
> +{
> +	u32 offset, len;
> +	int rc;
> +
> +	*data += 4;
> +	len = *((u32 *)(*data));
> +	*data += 4;
> +	*check_resp = !!(**data);

Same as above.

> +	*data += 4;
> +	offset = *((u32 *)(*data));
> +	*data += 4;
> +
> +	DP_VERBOSE(cdev, NETIF_MSG_DRV,
> +		   "About to write File-data: %08x bytes to offset %08x\n",
> +		   len, offset);
> +
> +	rc = qed_mcp_nvm_write(cdev, QED_PUT_FILE_DATA, offset,
> +			       (char *)(*data), len);
> +	*data += len;
> +
> +	return rc;
> +}

[...]

> +
> +static int qed_nvm_flash(struct qed_dev *cdev, const char *name)
> +{
> +	rc = qed_nvm_flash_image_validate(cdev, image, );
> +	if (rc)
> +		goto exit;
> +
> +	while (data < data_end) {
> +		bool check_resp = false;
> +
> +		/* Parse
Re: [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek()
Hello!

On 3/27/2018 2:33 AM, NeilBrown wrote:

> The documentation for rhashtable_walk_peek() wrong. It claims to
> return the *next* entry, whereas it in fact returns the *previous*
> entry.
> However if no entries have yet been returned - or if the iterator
> was reset due to a resize event, then rhashtable_walk_peek() *does*
> return the next entry, but also advances the iterator.
>
> I suspect that this interface should be discarded and the one user
> should be changed to not require it. Possibly this patch should be
> seen as a first step in that conversation.
>
> This patch mostly corrects the documentation, but does make a small
> code change so that the documentation can be correct without listing
> too many special cases. I don't think the one user will be affected
> by the code change.
>
> Signed-off-by: NeilBrown
> ---
>  lib/rhashtable.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 3825c30aaa36..24a57ca494cb 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -853,13 +853,17 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
>  EXPORT_SYMBOL_GPL(rhashtable_walk_next);
>
>  /**
> - * rhashtable_walk_peek - Return the next object but don't advance the iterator
> + * rhashtable_walk_peek - Return the previously returned object without advancing the iterator
>   * @iter: Hash table iterator
>   *
> - * Returns the next object or NULL when the end of the table is reached.
> + * Returns the last object returned,

Sounds somewhat tautological. :-)

> or NULL if no object has yet been returned.
> + * If the previously returned object has since been removed, then some other arbitrary
> + * object maybe returned, or possibly NULL will be returned. In that case, the
> + * iterator might be advanced.
>   *
>   * Returns -EAGAIN if resize event occurred. Note that the iterator
> - * will rewind back to the beginning and you may continue to use it.
> + * will rewind back to the beginning and rhashtable_walk_next() should be
> + * used to get the next object.
>   */
>  void *rhashtable_walk_peek(struct rhashtable_iter *iter)
>  {
> @@ -880,7 +884,12 @@ void *rhashtable_walk_peek(struct rhashtable_iter *iter)
>  	 * the table hasn't changed.
>  	 */
>  	iter->skip--;
> -	}
> +	} else
> +		/* ->skip is only zero after rhashtable_walk_start()
> +		 * or when the iterator is reset. In this case there
> +		 * is no previous object to return.
> +		 */
> +		return NULL;

CodingStyle: need {} on the *else* branch if the 1st branch has them.

>
>  	return __rhashtable_walk_find_next(iter);
>  }

MBR, Sergei
[PATCH net-next] net: Export net->ipv6.sysctl.ip_nonlocal_bind to /proc
Currently, this parameter can be configured via sysctl only. But sysctl is considered a deprecated interface (man 2 sysctl), and it can only be applied to the current net namespace (changing it in another net namespace requires a setns() call). So, let's export the parameter to /proc in the standard way; this allows accessing another process's namespace via /proc/[pid]/net/ip6_nonlocal_bind.

Signed-off-by: Kirill Tkhai
---
net/ipv6/proc.c | 48 1 file changed, 48 insertions(+) diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c index 6e57028d2e91..2d0aa59c2d0d 100644 --- a/net/ipv6/proc.c +++ b/net/ipv6/proc.c @@ -312,6 +312,47 @@ int snmp6_unregister_dev(struct inet6_dev *idev) return 0; } +static int nonlocal_bind_show(struct seq_file *seq, void *v) +{ + struct net *net = seq->private; + + seq_printf(seq, "%d\n", !!net->ipv6.sysctl.ip_nonlocal_bind); + return 0; +} + +static int open_nonlocal_bind(struct inode *inode, struct file *file) +{ + return single_open_net(inode, file, nonlocal_bind_show); +} + +static ssize_t write_nonlocal_bind(struct file *file, const char __user *ubuf, + size_t count, loff_t *ppos) +{ + struct net *net = ((struct seq_file *)file->private_data)->private; + char buf[3]; + + if (*ppos || count <= 0 || count > sizeof(buf)) + return -EINVAL; + + if (copy_from_user(buf, ubuf, count)) + return -EFAULT; + buf[0] -= '0'; + if ((count == 3 && buf[2] != '\0') || + (count >= 2 && buf[1] != '\n') || + (buf[0] != 0 && buf[0] != 1)) + return -EINVAL; + + net->ipv6.sysctl.ip_nonlocal_bind = buf[0]; + return count; +} + +static const struct file_operations nonlocal_bind_ops = { + .open = open_nonlocal_bind, + .read = seq_read, + .write = write_nonlocal_bind, + .release = single_release_net, +}; + static int __net_init ipv6_proc_init_net(struct net *net) { if (!proc_create("sockstat6", 0444, net->proc_net, @@ -321,12 +362,18 @@ static int __net_init ipv6_proc_init_net(struct net *net) if (!proc_create("snmp6", 0444, net->proc_net, _seq_fops)) goto
proc_snmp6_fail; +if (!proc_create_data("ip6_nonlocal_bind", 0644, + net->proc_net, _bind_ops, net)) + goto proc_bind_fail; + net->mib.proc_net_devsnmp6 = proc_mkdir("dev_snmp6", net->proc_net); if (!net->mib.proc_net_devsnmp6) goto proc_dev_snmp6_fail; return 0; proc_dev_snmp6_fail: + remove_proc_entry("ip6_nonlocal_bind", net->proc_net); +proc_bind_fail: remove_proc_entry("snmp6", net->proc_net); proc_snmp6_fail: remove_proc_entry("sockstat6", net->proc_net); @@ -337,6 +384,7 @@ static void __net_exit ipv6_proc_exit_net(struct net *net) { remove_proc_entry("sockstat6", net->proc_net); remove_proc_entry("dev_snmp6", net->proc_net); + remove_proc_entry("ip6_nonlocal_bind", net->proc_net); remove_proc_entry("snmp6", net->proc_net); }
Re: [PATCH net-next v2 0/2] kernel: add support to collect hardware logs in crash recovery kernel
Rahul Lakkireddywrites: > On Saturday, March 03/24/18, 2018 at 20:50:52 +0530, Eric W. Biederman wrote: >> >> Rahul Lakkireddy writes: >> >> > On production servers running variety of workloads over time, kernel >> > panic can happen sporadically after days or even months. It is >> > important to collect as much debug logs as possible to root cause >> > and fix the problem, that may not be easy to reproduce. Snapshot of >> > underlying hardware/firmware state (like register dump, firmware >> > logs, adapter memory, etc.), at the time of kernel panic will be very >> > helpful while debugging the culprit device driver. >> > >> > This series of patches add new generic framework that enable device >> > drivers to collect device specific snapshot of the hardware/firmware >> > state of the underlying device in the crash recovery kernel. In crash >> > recovery kernel, the collected logs are exposed via /sys/kernel/crashdd/ >> > directory, which is copied by user space scripts for post-analysis. >> > >> > A kernel module crashdd is newly added. In crash recovery kernel, >> > crashdd exposes /sys/kernel/crashdd/ directory containing device >> > specific hardware/firmware logs. >> >> Have you looked at instead of adding a sysfs file adding the dumps >> as additional elf notes in /proc/vmcore? >> > > I see the crash recovery kernel's memory is not present in any of the > the PT_LOAD headers. So, makedumpfile is not collecting the dumps > that are in crash recovery kernel's memory. > > Also, are you suggesting exporting the dumps themselves as PT_NOTE > instead? I'll look into doing it this way. Yes. I was suggesting exporting the dumps themselves as PT_NOTE in /proc/vmcore. I think that will allow makedumpfile to collect your new information without modification. Eric
Re: [PATCH] net: fec: set dma_coherent_mask
Hi, > > + dma_set_coherent_mask(>pdev->dev, DMA_BIT_MASK(32)); > + > for (i = 0; i < fep->num_tx_queues; i++) { > txq = kzalloc(sizeof(*txq), GFP_KERNEL); > if (!txq) { dma_set_coherent_mask() can fail, so the return value should be checked and a failure be handled accordingly. Regards, Lino
Re: [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek()
Neil, 2018-03-27 1:33 GMT+02:00 NeilBrown: > The documentation for rhashtable_walk_peek() wrong. It claims to > return the *next* entry, whereas it in fact returns the *previous* > entry. > However if no entries have yet been returned - or if the iterator > was reset due to a resize event, then rhashtable_walk_peek() > *does* return the next entry, but also advances the iterator. > > I suspect that this interface should be discarded and the one user > should be changed to not require it. Possibly this patch should be > seen as a first step in that conversation. > > This patch mostly corrects the documentation, but does make a > small code change so that the documentation can be correct without > listing too many special cases. I don't think the one user will > be affected by the code change. how about I come up with a replacement so that we can remove rhashtable_walk_peek straight away without making it differently broken in the meantime? Thanks, Andreas
Re: [PATCH net-next 0/6] rxrpc: Fixes
Sorry, that should be net, not net-next, in the subject line. David
Re: [PATCH v6 bpf-next 08/11] bpf: introduce BPF_RAW_TRACEPOINT
- On Mar 27, 2018, at 6:48 PM, Alexei Starovoitov a...@fb.com wrote: > On 3/27/18 2:04 PM, Steven Rostedt wrote: >> >> +#ifdef CONFIG_BPF_EVENTS >> +#define BPF_RAW_TP() . = ALIGN(8); \ Given that the section consists of a 16-bytes structure elements on architectures with 8 bytes pointers, this ". = ALIGN(8)" should be turned into a STRUCT_ALIGN(), especially given that the compiler is free to up-align the structure on 32 bytes. This could explain the kasan splat you are experiencing. Thanks, Mathieu >> + VMLINUX_SYMBOL(__start__bpf_raw_tp) = .; \ >> + KEEP(*(__bpf_raw_tp_map)) \ >> + VMLINUX_SYMBOL(__stop__bpf_raw_tp) = .; > > that looks to be correct, but something wrong with it. > > Can you try your mini test with kasan on ? > > I'm seeing this crash: > test_stacktrace_[ 18.760662] start 84642438 stop 84644f60 > map_raw_tp:PASS:[ 18.761467] i 1 btp->tp > prog_load raw tp[ 18.762064] kasan: CONFIG_KASAN_INLINE enabled > 0 nsec > [ 18.762704] kasan: GPF could be caused by NULL-ptr deref or user > memory access > [ 18.765125] general protection fault: [#1] SMP KASAN PTI > [ 18.765830] Modules linked in: > [ 18.778358] Call Trace: > [ 18.778674] bpf_raw_tracepoint_open.isra.27+0x92/0x380 > > for some reason the start_bpf_raw_tp is off by 8. > Not sure how it works for you. > > (gdb) p &__bpf_trace_tp_map_sys_exit > $10 = (struct bpf_raw_event_map *) 0x84642440 > <__bpf_trace_tp_map_sys_exit> > > (gdb) p &__start__bpf_raw_tp > $7 = ( *) 0x84642438 > > (gdb) p (void*)(&__start__bpf_raw_tp)+8 > $11 = (void *) 0x84642440 <__bpf_trace_tp_map_sys_exit> -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [iproute PATCH 2/3] ss: Put filter DB parsing into a separate function
On Tue, Mar 27, 2018 at 11:46:01AM -0700, Stephen Hemminger wrote: > On Sat, 24 Mar 2018 19:18:10 +0100 > Phil Sutterwrote: > > > +#define ENTRY(name, ...) { #name, { __VA_ARGS__, MAX_DB }} > > > + ENTRY(all, UDP_DB, DCCP_DB, TCP_DB, RAW_DB, \ > > + UNIX_ST_DB, UNIX_DG_DB, UNIX_SQ_DB, \ > > + PACKET_R_DB, PACKET_DG_DB, NETLINK_DB, \ > > + SCTP_DB, VSOCK_ST_DB, VSOCK_DG_DB), > > Checkpatch complains that line continuations are not necessary here; > and it is right. Macro usage can cross lines. Interesting. Seems I should fix my muscle memory when it comes to macros. :) I'll follow-up with a fixed version. Thanks, Phil
[PATCH V5 net-next 01/14] tcp: Add clean acked data hook
From: Ilya LesokhinCalled when a TCP segment is acknowledged. Could be used by application protocols who hold additional metadata associated with the stream data. This is required by TLS device offload to release metadata associated with acknowledged TLS records. Signed-off-by: Ilya Lesokhin Signed-off-by: Boris Pismenny Signed-off-by: Aviad Yehezkel Signed-off-by: Saeed Mahameed --- include/net/inet_connection_sock.h | 2 ++ include/net/tcp.h | 8 net/ipv4/tcp_input.c | 23 +++ 3 files changed, 33 insertions(+) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index b68fea022a82..2ab6667275df 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops { * @icsk_af_ops Operations which are AF_INET{4,6} specific * @icsk_ulp_ops Pluggable ULP control hook * @icsk_ulp_data ULP private data + * @icsk_clean_acked Clean acked data hook * @icsk_listen_portaddr_node hash to the portaddr listener hashtable * @icsk_ca_state:Congestion control state * @icsk_retransmits: Number of unrecovered [RTO] timeouts @@ -102,6 +103,7 @@ struct inet_connection_sock { const struct inet_connection_sock_af_ops *icsk_af_ops; const struct tcp_ulp_ops *icsk_ulp_ops; void *icsk_ulp_data; + void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq); struct hlist_node icsk_listen_portaddr_node; unsigned int (*icsk_sync_mss)(struct sock *sk, u32 pmtu); __u8 icsk_ca_state:6, diff --git a/include/net/tcp.h b/include/net/tcp.h index 9c9b3768b350..a15e294ced66 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -2101,4 +2101,12 @@ static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk) #if IS_ENABLED(CONFIG_SMC) extern struct static_key_false tcp_have_smc; #endif + +#if IS_ENABLED(CONFIG_TLS_DEVICE) +void clean_acked_data_enable(struct inet_connection_sock *icsk, +void (*cad)(struct sock *sk, u32 ack_seq)); +void clean_acked_data_disable(struct inet_connection_sock *icsk); + +#endif + 
#endif /* _TCP_H */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 451ef3012636..8cfc6d1ac804 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -111,6 +111,23 @@ int sysctl_tcp_max_orphans __read_mostly = NR_FILE; #define REXMIT_LOST1 /* retransmit packets marked lost */ #define REXMIT_NEW 2 /* FRTO-style transmit of unsent/new packets */ +#if IS_ENABLED(CONFIG_TLS_DEVICE) +static DEFINE_STATIC_KEY_FALSE(clean_acked_data_enabled); + +void clean_acked_data_enable(struct inet_connection_sock *icsk, +void (*cad)(struct sock *sk, u32 ack_seq)) +{ + icsk->icsk_clean_acked = cad; + static_branch_inc(_acked_data_enabled); +} + +void clean_acked_data_disable(struct inet_connection_sock *icsk) +{ + static_branch_dec(_acked_data_enabled); + icsk->icsk_clean_acked = NULL; +} +#endif + static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb, unsigned int len) { @@ -3542,6 +3559,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) if (after(ack, prior_snd_una)) { flag |= FLAG_SND_UNA_ADVANCED; icsk->icsk_retransmits = 0; + +#if IS_ENABLED(CONFIG_TLS_DEVICE) + if (static_branch_unlikely(_acked_data_enabled)) + if (icsk->icsk_clean_acked) + icsk->icsk_clean_acked(sk, ack); +#endif } prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) : tp->snd_una; -- 2.14.3
Re: [PATCH v6 bpf-next 08/11] bpf: introduce BPF_RAW_TRACEPOINT
On 3/27/18 4:13 PM, Mathieu Desnoyers wrote: - On Mar 27, 2018, at 6:48 PM, Alexei Starovoitov a...@fb.com wrote: On 3/27/18 2:04 PM, Steven Rostedt wrote: +#ifdef CONFIG_BPF_EVENTS +#define BPF_RAW_TP() . = ALIGN(8); \ Given that the section consists of a 16-bytes structure elements on architectures with 8 bytes pointers, this ". = ALIGN(8)" should be turned into a STRUCT_ALIGN(), especially given that the compiler is free to up-align the structure on 32 bytes. STRUCT_ALIGN fixed the 'off by 8' issue with kasan, but it fails without kasan too. For some reason the whole region __start__bpf_raw_tp - __stop__bpf_raw_tp comes inited with : [ 22.703562] i 1 btp 8288e530 btp->tp func [ 22.704638] i 2 btp 8288e540 btp->tp func [ 22.705599] i 3 btp 8288e550 btp->tp func [ 22.706551] i 4 btp 8288e560 btp->tp func [ 22.707503] i 5 btp 8288e570 btp->tp func [ 22.708452] i 6 btp 8288e580 btp->tp func [ 22.709406] i 7 btp 8288e590 btp->tp func [ 22.710368] i 8 btp 8288e5a0 btp->tp func while gdb shows that everything is good inside vmlinux for exactly these addresses. Some other linker magic missing?
RE: [Intel-wired-lan] [next-queue PATCH v5 7/9] igb: Add MAC address support for ethtool nftuple filters
Hi Aaron, "Brown, Aaron F" writes: [...] > And watching the rx_queue counters continues to be spread across the > different queues. This is with Jeff Kirsher's next queue, kernel > 4.16.0-rc4_next-queue_dev-queue_e31d20a, which has the series of 8 igb > patches applied. > > When I go back and run the an older build (with an earlier version of > the series) of the same tree, 4.16.0-rc4_next-queue_dev-queue_84a3942, > with the same procedure and same systems all the rx traffic is > relegated to queue 0 (or whichever queue I assign it to) for either > the src or dst filter. Here is a sample of my counters after it had > been running netperf_stress over the weekend: The difference in behaviour between v4 and v5 is that v4 is configuring (wrongly) the controller to send all the traffic directed to the local MAC address to queue 0; v5 allows that filter to be added, but it does nothing in reality. I am working on a new version of this series that should work for adding filters that involve the local MAC address. The initial use cases that I had in mind all used MAC addresses different from the local one, but I see that this is indeed useful (and less surprising). Thank you, -- Vinicius
[pull request][net-next 00/15] Mellanox, mlx5 mlx5-updates-2018-03-27
Hi Dave, This series contains Misc updates and cleanups for mlx5e rx path and SQ recovery feature for tx path. For more information please see tag log below. Please pull and let me know if there's any problem. Thanks, Saeed. --- The following changes since commit 5d22d47b9ed96eddb35821dc2cc4f629f45827f7: Merge branch 'sfc-filter-locking' (2018-03-27 13:33:21 -0400) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2018-03-27 for you to fetch changes up to db75373c91b0cfb6a68ad6ae88721e4e21ae6261: net/mlx5e: Recover Send Queue (SQ) from error state (2018-03-27 17:29:28 -0700) mlx5-updates-2018-03-27 (Misc updates & SQ recovery) This series contains Misc updates and cleanups for mlx5e rx path and SQ recovery feature for tx path. From Tariq: (RX updates) - Disable Striding RQ on slow PCI devices: Striding RQ limits the use of the CQE compression feature, which is very critical for performance on slow PCI devices. With this change we prefer CQE compression over Striding RQ only on specific "slow" PCIe links. - RX path cleanups - Private flag to enable/disable striding RQ From Eran: (TX fast recovery) - TX timeout logic improvements, fast SQ recovery and TX error reporting: if a HW error occurred while transmitting on a specific SQ, the driver used to ignore such errors and wait for a TX timeout to occur and reset all the rings. Instead, this series improves the resiliency for such HW errors by detecting TX completions with errors, reporting them, and performing a fast recovery of the specific faulty SQ even before a TX timeout is detected. Thanks, Saeed. 
Eran Ben Elisha (5): net/mlx5e: Move all TX timeout logic to be under state lock mlx5_{ib,core}: Add query SQ state helper function mlx5: Move dump error CQE function out of mlx5_ib for code sharing net/mlx5e: Dump xmit error completions net/mlx5e: Recover Send Queue (SQ) from error state Gal Pressman (1): net/mlx5e: Remove unused max inline related code Tariq Toukan (9): net/mlx5e: Unify slow PCI heuristic net/mlx5e: Disable Striding RQ when PCI is slower than link net/mlx5e: Remove unused define MLX5_MPWRQ_STRIDES_PER_PAGE net/mlx5e: Separate dma base address and offset in dma_sync call net/mlx5e: Use no-offset function in skb header copy net/mlx5e: Remove RQ MPWQE fields from params net/mlx5e: Remove rq_headroom field from params net/mlx5e: Do not reset Receive Queue params on every type change net/mlx5e: Add ethtool priv-flag for Striding RQ drivers/infiniband/hw/mlx5/cq.c| 8 +- drivers/infiniband/hw/mlx5/qp.c| 14 +- drivers/net/ethernet/mellanox/mlx5/core/en.h | 29 +- .../net/ethernet/mellanox/mlx5/core/en_ethtool.c | 83 +++--- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 306 +++-- drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 1 - drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 11 +- drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 6 + drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 4 + drivers/net/ethernet/mellanox/mlx5/core/en_tx.c| 27 +- .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c | 4 +- .../net/ethernet/mellanox/mlx5/core/mlx5_core.h| 5 + drivers/net/ethernet/mellanox/mlx5/core/transobj.c | 25 ++ include/linux/mlx5/cq.h| 6 + include/linux/mlx5/transobj.h | 1 + 15 files changed, 368 insertions(+), 162 deletions(-)
[PATCH iproute2-next 1/2] ip/ila: support json and color
From: Stephen HemmingerUse json print to enhance ila output. Signed-off-by: Stephen Hemminger --- ip/ipila.c | 76 ++ 1 file changed, 37 insertions(+), 39 deletions(-) diff --git a/ip/ipila.c b/ip/ipila.c index 9a324296ffd6..370385c0c375 100644 --- a/ip/ipila.c +++ b/ip/ipila.c @@ -23,6 +23,7 @@ #include "utils.h" #include "ip_common.h" #include "ila_common.h" +#include "json_print.h" static void usage(void) { @@ -47,9 +48,7 @@ static int genl_family = -1; #define ILA_RTA(g) ((struct rtattr *)(((char *)(g)) + \ NLMSG_ALIGN(sizeof(struct genlmsghdr -#define ADDR_BUF_SIZE sizeof(":::") - -static int print_addr64(__u64 addr, char *buff, size_t len) +static void print_addr64(__u64 addr, char *buff, size_t len) { __u16 *words = (__u16 *) __u16 v; @@ -64,38 +63,27 @@ static int print_addr64(__u64 addr, char *buff, size_t len) sep = ""; ret = snprintf([written], len - written, "%x%s", v, sep); - if (ret < 0) - return ret; - written += ret; } - - return written; } -static void print_ila_locid(FILE *fp, int attr, struct rtattr *tb[], int space) +static void print_ila_locid(const char *tag, int attr, struct rtattr *tb[]) { char abuf[256]; - size_t blen; - int i; - if (tb[attr]) { - blen = print_addr64(rta_getattr_u64(tb[attr]), - abuf, sizeof(abuf)); - fprintf(fp, "%s", abuf); - } else { - fprintf(fp, "-"); - blen = 1; - } + if (tb[attr]) + print_addr64(rta_getattr_u64(tb[attr]), +abuf, sizeof(abuf)); + else + snprintf(abuf, sizeof(abuf), "-"); - for (i = 0; i < space - blen; i++) - fprintf(fp, " "); + /* 20 = sizeof(":::") */ + print_string(PRINT_ANY, tag, "%-20s", abuf); } static int print_ila_mapping(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) { - FILE *fp = (FILE *)arg; struct genlmsghdr *ghdr; struct rtattr *tb[ILA_ATTR_MAX + 1]; int len = n->nlmsg_len; @@ -110,31 +98,38 @@ static int print_ila_mapping(const struct sockaddr_nl *who, ghdr = NLMSG_DATA(n); parse_rtattr(tb, ILA_ATTR_MAX, (void *) ghdr + GENL_HDRLEN, len); - print_ila_locid(fp, 
ILA_ATTR_LOCATOR_MATCH, tb, ADDR_BUF_SIZE); - print_ila_locid(fp, ILA_ATTR_LOCATOR, tb, ADDR_BUF_SIZE); + open_json_object(NULL); + print_ila_locid("locator_match", ILA_ATTR_LOCATOR_MATCH, tb); + print_ila_locid("locator", ILA_ATTR_LOCATOR, tb); - if (tb[ILA_ATTR_IFINDEX]) - fprintf(fp, "%-16s", - ll_index_to_name(rta_getattr_u32( - tb[ILA_ATTR_IFINDEX]))); - else - fprintf(fp, "%-10s ", "-"); + if (tb[ILA_ATTR_IFINDEX]) { + __u32 ifindex + = rta_getattr_u32(tb[ILA_ATTR_IFINDEX]); - if (tb[ILA_ATTR_CSUM_MODE]) - fprintf(fp, "%s", - ila_csum_mode2name(rta_getattr_u8( - tb[ILA_ATTR_CSUM_MODE]))); - else - fprintf(fp, "%-10s ", "-"); + print_color_string(PRINT_ANY, COLOR_IFNAME, + "interface", "%-16s", + ll_index_to_name(ifindex)); + } else { + print_string(PRINT_FP, NULL, "%-10s ", "-"); + } + + if (tb[ILA_ATTR_CSUM_MODE]) { + __u8 csum = rta_getattr_u8(tb[ILA_ATTR_CSUM_MODE]); + + print_string(PRINT_ANY, "csum_mode", "%s", +ila_csum_mode2name(csum)); + } else + print_string(PRINT_FP, NULL, "%-10s ", "-"); if (tb[ILA_ATTR_IDENT_TYPE]) - fprintf(fp, "%s", + print_string(PRINT_ANY, "ident_type", "%s", ila_ident_type2name(rta_getattr_u8( tb[ILA_ATTR_IDENT_TYPE]))); else - fprintf(fp, "-"); + print_string(PRINT_FP, NULL, "%s", "-"); - fprintf(fp, "\n"); + print_string(PRINT_FP, NULL, "%s", _SL_); + close_json_object(); return 0; } @@ -156,10 +151,13 @@ static int do_list(int argc, char **argv) exit(1); } + new_json_obj(json); if (rtnl_dump_filter(_rth, print_ila_mapping, stdout) < 0) { fprintf(stderr, "Dump terminated\n"); return 1; } + delete_json_obj(); + fflush(stdout); return 0; } -- 2.16.2
Re: RFC on writel and writel_relaxed
On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt wrote: > > The discussion at hand is about > > dma_buffer->foo = 1;/* WB */ > writel(KICK, DMA_KICK_REGISTER);/* UC */

Yes. That certainly is ordered on x86. In fact, afaik it's ordered even if that writel() might be of type WC, because that only delays writes, it doesn't move them earlier. Whether people *do* that or not, I don't know. But I wouldn't be surprised if they do.

So if it's a DMA buffer, it's "cached". And even cached accesses are ordered wrt MMIO. Basically, to get unordered writes on x86, you need to either use explicitly nontemporal stores, or have a writecombining region with back-to-back writes that actually combine. And nobody really does that nontemporal store thing any more because the hardware that cared pretty much doesn't exist any more. It was too much pain. People use DMA and maybe an UC store for starting the DMA (or possibly a WC buffer that gets multiple stores in ascending order as a stream of commands). Things like UC will force everything to be entirely ordered, but even without UC, loads won't pass loads, and stores won't pass stores.

> Now it appears that this wasn't fully understood back then, and some > people are now saying that x86 might not even provide that semantic > always.

Oh, the above UC case is absolutely guaranteed. And I think even if it's WC, the write to kick off the DMA is ordered wrt the cached write. On x86, I think you need barriers only if you do things like - do two non-temporal stores and require them to be ordered: put a sfence or mfence in between them. - do two WC stores, and make sure they do not combine: put a sfence or mfence between them. - do a store, and a subsequent load from a different address, and neither of them is UC: put a mfence between them. But note that this is literally just "load after store". A "store after load" doesn't need one. I think that's pretty much it. 
For example, the "lfence" instruction is almost entirely pointless on x86 - it was designed back in the time when people *thought* they might re-order loads. But loads don't get re-ordered. At least Intel seems to document that only non-temporal *stores* can get re-ordered wrt each other.

End result: lfence is a historical oddity that can now be used to guarantee that a previous load has finished, and that in turn meant that it is now used in the Spectre mitigations. But it basically has no real memory ordering meaning since nothing passes an earlier load anyway, it's more of a pipeline thing.

But in the end, one question is just "how much do drivers actually _rely_ on the x86 strong ordering?" We do support "smp_wmb()" even though x86 has strong enough ordering that just a barrier suffices. Somebody might just say "screw the x86 memory ordering, we're relaxed, and we'll fix up the drivers we care about". The only issue really is that 99.9% of all testing gets done on x86 unless you look at specific SoC drivers. On ARM, for example, there is likely little reason to care about x86 memory ordering, because there is almost zero driver overlap between x86 and ARM.

*Historically*, the reason for following the x86 IO ordering was simply that a lot of architectures used the drivers that were developed on x86. The alpha and powerpc workstations were *designed* with the x86 IO bus (PCI, then PCIe) and to work with the devices that came with it. ARM? PCIe is almost irrelevant. For ARM servers, if they ever take off, sure. But 99.99% of ARM is about their own SoC's, and so "x86 test coverage" is simply not an issue. How much of an issue is it for Power? Maybe you decide it's not a big deal. Then all the above is almost irrelevant. Linus
Re: [V9fs-developer] [PATCH] net/9p: fix potential refcnt problem of trans module
On 2018/3/28 10:52, cgxu...@gmx.com wrote: > On 2018/3/28 10:10, jiangyiwen wrote: >> >> On 2018/3/27 20:49, Chengguang Xu wrote: >>> When specifying trans_mod multiple times in a mount, >>> it may cause inaccurate refcount of trans module. Also, >>> in the error case of option parsing, we should put the >>> trans module if we have already got. >>> >>> Signed-off-by: Chengguang Xu >>> --- >>> net/9p/client.c | 5 - >>> 1 file changed, 4 insertions(+), 1 deletion(-) >>> >>> diff --git a/net/9p/client.c b/net/9p/client.c >>> index b433aff..7ccfb4b 100644 >>> --- a/net/9p/client.c >>> +++ b/net/9p/client.c >>> @@ -190,7 +190,9 @@ static int parse_opts(char *opts, struct p9_client >>> *clnt) >>> p9_debug(P9_DEBUG_ERROR, >>> "problem allocating copy of trans >>> arg\n"); >>> goto free_and_return; >>> -} >>> + } >>> + >>> + v9fs_put_trans(clnt->trans_mod); >> >> I think this should return error if using multiple times >> in a mount. > > Fail or retake are just different policies how to deal with this kind of > situation, > most filesystems in kernel choose retaking new value, so we’d better keep > consistence > with others unless there is a necessary reason. > > Thanks, > Chengguang. > Yes, you're right, most filesystems do indeed retake the new value. Reviewed-by: Yiwen Jiang > >> >>> clnt->trans_mod = v9fs_get_trans_by_name(s); >>> if (clnt->trans_mod == NULL) { >>> pr_info("Could not find request transport: >>> %s\n", >>> @@ -226,6 +228,7 @@ static int parse_opts(char *opts, struct p9_client >>> *clnt) >>> } >>> >>> free_and_return: >>> + v9fs_put_trans(clnt->trans_mod); >> >> This looks good. >> >>> kfree(tmp_options); >>> return ret; >>> } >>
[PATCH v2 bpf-next 6/9] bpf: Hooks for sys_connect
From: Andrey Ignatov

== The problem ==

See description of the problem in the initial patch of this patch set.

== The solution ==

The patch provides much more reliable in-kernel solution for the 2nd part of the problem: making outgoing connection from desired IP. It adds new attach types `BPF_CGROUP_INET4_CONNECT` and `BPF_CGROUP_INET6_CONNECT` for program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both source and destination of a connection at connect(2) time. Local end of connection can be bound to desired IP using newly introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though, and doesn't support binding to port, i.e. leverages `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this: * looking for a free port is expensive and can affect performance significantly; * there is no use-case for port. As for remote end (`struct sockaddr *` passed by user), both parts of it can be overridden, remote IP and remote port. It's useful if an application inside cgroup wants to connect to another application inside same cgroup or to itself, but knows nothing about IP assigned to the cgroup. Support is added for IPv4 and IPv6, for TCP and UDP. IPv4 and IPv6 have separate attach types for same reason as sys_bind hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound. 
On the other hand, `proto.pre_connect` provides a flexible way to add BPF hooks for connect only for the `proto`s that need them, and to call them at the desired time before `connect`. Since `bpf_bind()` is allowed to bind only to an IP, and the autobind in `inet_dgram_connect` binds only a port, there is no chance of a double bind. `bpf_bind()` sets `force_bind_address_no_port` to bind only to an IP regardless of the value of the `bind_address_no_port` socket field. `bpf_bind()` sets `with_lock` to `false` when calling `__inet_bind()` and `__inet6_bind()` since all call-sites where `bpf_bind()` is called already hold the socket lock. Signed-off-by: Andrey Ignatov Signed-off-by: Alexei Starovoitov --- include/linux/bpf-cgroup.h | 31 include/net/sock.h | 3 +++ include/net/udp.h | 1 + include/uapi/linux/bpf.h | 12 ++- kernel/bpf/syscall.c | 8 net/core/filter.c | 50 ++ net/ipv4/af_inet.c | 13 net/ipv4/tcp_ipv4.c| 16 +++ net/ipv4/udp.c | 14 + net/ipv6/tcp_ipv6.c| 16 +++ net/ipv6/udp.c | 20 +++ 11 files changed, 183 insertions(+), 1 deletion(-) diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index 67dc4a6471ad..c6ab295e6dcb 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -116,12 +116,38 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor, __ret; \ }) +#define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, type) \ +({\ + int __ret = 0; \ + if (cgroup_bpf_enabled) { \ + lock_sock(sk); \ + __ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type);\ + release_sock(sk); \ + } \ + __ret; \ +}) + #define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) \ BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_BIND) #define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) \ BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_BIND) +#define BPF_CGROUP_PRE_CONNECT_ENABLED(sk) (cgroup_bpf_enabled && \ + sk->sk_prot->pre_connect) + +#define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr) \ + BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET4_CONNECT) + +#define 
BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr)
[PATCH v2 bpf-next 1/9] bpf: Check attach type at prog load time
From: Andrey Ignatov == The problem == There are use-cases where a program of some type can be attached to multiple attach points, and those attach points must have different permissions to access the context or to call helpers. E.g. the context structure may have fields for both IPv4 and IPv6, but it doesn't make sense to read from / write to an IPv6 field when the attach point is somewhere in the IPv4 stack. The same applies to BPF helpers: it may make sense to call some helper from one attach point, but not from another, for the same prog type. == The solution == Introduce an `expected_attach_type` field in `union bpf_attr` for the `BPF_PROG_LOAD` command. If the scenario described in "The problem" section is the case for some prog type, the field will be checked twice: 1) At load time the prog type is checked to see whether the attach type for it must be known to validate program permissions correctly. The prog will be rejected with EINVAL if that's the case and `expected_attach_type` is not specified or has an invalid value. 2) At attach time `attach_type` is compared with `expected_attach_type`, if the prog type requires one, and, if they differ, the attach will be rejected with EINVAL. The `expected_attach_type` is now available as part of `struct bpf_prog` in both `bpf_verifier_ops->is_valid_access()` and `bpf_verifier_ops->get_func_proto()` and can be used to check context accesses and calls to helpers correspondingly. 
Initially the idea was discussed by Alexei Starovoitov and Daniel Borkmann here: https://marc.info/?l=linux-netdev&m=152107378717201&w=2 Signed-off-by: Andrey Ignatov Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 5 - include/linux/filter.h | 1 + include/uapi/linux/bpf.h | 5 + kernel/bpf/cgroup.c | 3 ++- kernel/bpf/syscall.c | 31 ++- kernel/bpf/verifier.c| 6 +++--- kernel/trace/bpf_trace.c | 21 ++--- net/core/filter.c| 39 +-- 8 files changed, 84 insertions(+), 27 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 819229c80eca..95a7abd0ee92 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -208,12 +208,15 @@ struct bpf_prog_ops { struct bpf_verifier_ops { /* return eBPF function prototype for verification */ - const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id); + const struct bpf_func_proto * + (*get_func_proto)(enum bpf_func_id func_id, + const struct bpf_prog *prog); /* return true if 'size' wide access at offset 'off' within bpf_context * with 'type' (read or write) is allowed */ bool (*is_valid_access)(int off, int size, enum bpf_access_type type, + const struct bpf_prog *prog, struct bpf_insn_access_aux *info); int (*gen_prologue)(struct bpf_insn *insn, bool direct_write, const struct bpf_prog *prog); diff --git a/include/linux/filter.h b/include/linux/filter.h index 109d05ccea9a..5787280b1e88 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -469,6 +469,7 @@ struct bpf_prog { is_func:1, /* program is a bpf function */ kprobe_override:1; /* Do we override a kprobe? 
*/ enum bpf_prog_type type; /* Type of BPF program */ + enum bpf_attach_typeexpected_attach_type; /* For some prog types */ u32 len;/* Number of filter blocks */ u32 jited_len; /* Size of jited insns in bytes */ u8 tag[BPF_TAG_SIZE]; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 18b7c510c511..dbc8a66b5d7e 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -294,6 +294,11 @@ union bpf_attr { __u32 prog_flags; charprog_name[BPF_OBJ_NAME_LEN]; __u32 prog_ifindex; /* ifindex of netdev to prep for */ + /* For some prog types expected attach type must be known at +* load time to verify attach type specific parts of prog +* (context accesses, allowed helpers, etc). +*/ + __u32 expected_attach_type; }; struct { /* anonymous struct used by BPF_OBJ_* commands */ diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index c1c0b60d3f2f..8730b24ed540 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -545,7 +545,7 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor, EXPORT_SYMBOL(__cgroup_bpf_check_dev_permission); static const struct bpf_func_proto * -cgroup_dev_func_proto(enum bpf_func_id func_id)
[PATCH v2 bpf-next 0/9] bpf: introduce cgroup-bpf bind, connect, post-bind hooks
v1->v2: - support expected_attach_type at prog load time so that the prog (incl. context accesses and calls to helpers) can be validated with regard to the specific attach point it is supposed to be attached to. Later, at attach time, the attach type is checked so that it must be the same as at load time if it was provided - reworked hooks to rely on expected_attach_type, and reduced the number of new prog types from 6 to just 1: BPF_PROG_TYPE_CGROUP_SOCK_ADDR - reused BPF_PROG_TYPE_CGROUP_SOCK for sys_bind post-hooks - add selftests for the post-sys_bind hook For our container management we've been using a complicated and fragile setup consisting of an LD_PRELOAD wrapper intercepting bind and connect calls from all containerized applications. Unfortunately it doesn't work for apps that don't use glibc, and changing all applications that run in the datacenter is not possible due to 3rd party code and libraries (despite being open source code) and the sheer amount of legacy code that has to be rewritten (we're rewriting what we can in parallel). These applications are written without containers in mind and have builtin assumptions about network services. Like an application X expects to connect to localhost:special_port and find service Y there. To move application X and service Y into two different containers, the LD_PRELOAD approach is used to help one service connect to another without rewriting them. Moving these two applications into different L2 (netns) or L3 (vrf) network isolation scopes doesn't help solve the problem, since the applications need to see each other as if they were running on the host without containers. So if app X and app Y ran in different netns, something would need to punch a connectivity hole in those namespaces. That would be a real layering violation (with corresponding network debugging pains), since the clean L2, L3 abstraction would suddenly support something that breaks through the layers. 
Instead we used LD_PRELOAD (and now bpf programs) at bind/connect time to help applications discover and connect to each other. All applications are running in init_netns and there are no vrfs. After bind/connect the normal fib/neighbor core networking logic works as it always should, and the whole system is clean from a network point of view and can be debugged with standard tools. We also considered resurrecting Hannes's afnetns work, but all hierarchical namespace abstractions don't work due to these builtin networking assumptions inside the apps. To run an application inside a cgroup container that was not written with containers in mind we have to create the illusion of running in a non-containerized environment. In some cases we remember the port and container id in the post-bind hook in a bpf map, and when some other task in a different container is trying to connect to a service we need to know where this service is running. It can be remote and it can be local. Both client and service may or may not be written with containers in mind, and this sockaddr rewrite provides a connectivity and load balancing feature. BPF+cgroup looks to be the best solution for this problem. Hence we introduce 3 hooks: - at entry into sys_bind and sys_connect, to let a bpf prog inspect and modify the 'struct sockaddr' provided by user space and fail bind/connect when appropriate - post sys_bind, after the port is allocated The approach works great and has zero overhead for anyone who doesn't use it and very low overhead when deployed. A different use case for this feature is a low-overhead firewall that doesn't need to inspect all packets and works at bind/connect time. 
Andrey Ignatov (9): bpf: Check attach type at prog load time libbpf: Support expected_attach_type at prog load bpf: Hooks for sys_bind selftests/bpf: Selftest for sys_bind hooks net: Introduce __inet_bind() and __inet6_bind bpf: Hooks for sys_connect selftests/bpf: Selftest for sys_connect hooks bpf: Post-hooks for sys_bind selftests/bpf: Selftest for sys_bind post-hooks. include/linux/bpf-cgroup.h| 68 ++- include/linux/bpf.h | 5 +- include/linux/bpf_types.h | 1 + include/linux/filter.h| 11 + include/net/inet_common.h | 2 + include/net/ipv6.h| 2 + include/net/sock.h| 3 + include/net/udp.h | 1 + include/uapi/linux/bpf.h | 51 ++- kernel/bpf/cgroup.c | 39 +- kernel/bpf/syscall.c | 102 - kernel/bpf/verifier.c | 7 +- kernel/trace/bpf_trace.c | 21 +- net/core/filter.c | 435 +-- net/ipv4/af_inet.c| 71 +++- net/ipv4/tcp_ipv4.c | 16 + net/ipv4/udp.c| 14 + net/ipv6/af_inet6.c
Re: [PATCH V5 net-next 06/14] net/tls: Add generic NIC offload infrastructure
On 3/27/2018 4:56 PM, Saeed Mahameed wrote: From: Ilya Lesokhin This patch adds a generic infrastructure to offload TLS crypto to a network device. It enables the kernel TLS socket to skip encryption and authentication operations on the transmit side of the data path, leaving those computationally expensive operations to the NIC. The NIC offload infrastructure builds TLS records and pushes them to the TCP layer just like the SW KTLS implementation, using the same API. TCP segmentation is mostly unaffected. Currently the only exception is that we prevent mixed SKBs where only part of the payload requires offload. In the future we are likely to add a similar restriction following a change cipher spec record. The notable differences between SW KTLS and NIC offloaded TLS implementations are as follows: 1. The offloaded implementation builds "plaintext TLS records": those records contain plaintext instead of ciphertext, and placeholder bytes instead of authentication tags. 2. The offloaded implementation maintains a mapping from TCP sequence number to TLS records. Thus, given a TCP SKB sent from a NIC offloaded TLS socket, we can use the tls NIC offload infrastructure to obtain enough context to encrypt the payload of the SKB. A TLS record is released when the last byte of the record is ack'ed; this is done through the new icsk_clean_acked callback. The infrastructure should be extendable to support various NIC offload implementations. However it is currently written with the implementation below in mind: The NIC assumes that packets from each offloaded stream are sent as plaintext and in-order. It keeps track of the TLS records in the TCP stream. When a packet marked for offload is transmitted, the NIC encrypts the payload in-place and puts authentication tags in the relevant placeholders. The responsibility for handling out-of-order packets (i.e. TCP retransmission, qdisc drops) falls on the netdev driver. 
The netdev driver keeps track of the expected TCP SN from the NIC's perspective. If the next packet to transmit matches the expected TCP SN, the driver advances the expected TCP SN and transmits the packet with the TLS offload indication. If the next packet to transmit does not match the expected TCP SN, the driver calls the TLS layer to obtain the TLS record that includes the TCP SN of the packet for transmission. Using this TLS record, the driver posts a work entry on the transmit queue to reconstruct the NIC TLS state required for the offload of the out-of-order packet. It updates the expected TCP SN accordingly and transmits the now in-order packet. The same queue is used for packet transmission and TLS context reconstruction to avoid the need for flushing the transmit queue before issuing the context reconstruction request. Signed-off-by: Ilya Lesokhin Signed-off-by: Boris Pismenny Signed-off-by: Aviad Yehezkel Signed-off-by: Saeed Mahameed Acked-by: Shannon Nelson --- include/net/tls.h | 120 +-- net/tls/Kconfig | 10 + net/tls/Makefile | 2 + net/tls/tls_device.c | 759 ++ net/tls/tls_device_fallback.c | 454 + net/tls/tls_main.c| 120 --- net/tls/tls_sw.c | 132 7 files changed, 1476 insertions(+), 121 deletions(-) create mode 100644 net/tls/tls_device.c create mode 100644 net/tls/tls_device_fallback.c diff --git a/include/net/tls.h b/include/net/tls.h index 437a746300bf..0a8529e9ec21 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -57,21 +57,10 @@ #define TLS_AAD_SPACE_SIZE 13 -struct tls_sw_context { +struct tls_sw_context_tx { struct crypto_aead *aead_send; - struct crypto_aead *aead_recv; struct crypto_wait async_wait; - /* Receive context */ - struct strparser strp; - void (*saved_data_ready)(struct sock *sk); - unsigned int (*sk_poll)(struct file *file, struct socket *sock, - struct poll_table_struct *wait); - struct sk_buff *recv_pkt; - u8 control; - bool decrypted; - - /* Sending context */ char aad_space[TLS_AAD_SPACE_SIZE]; unsigned int 
sg_plaintext_size; @@ -88,6 +77,50 @@ struct tls_sw_context { struct scatterlist sg_aead_out[2]; }; +struct tls_sw_context_rx { + struct crypto_aead *aead_recv; + struct crypto_wait async_wait; + + struct strparser strp; + void (*saved_data_ready)(struct sock *sk); + unsigned int (*sk_poll)(struct file *file, struct socket *sock, + struct poll_table_struct *wait); + struct sk_buff *recv_pkt; + u8 control; + bool decrypted; +}; + +struct tls_record_info { + struct list_head list; + u32 end_seq; + int len;
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 23:24 -0400, Sinan Kaya wrote: > On 3/27/2018 10:51 PM, Linus Torvalds wrote: > > > The discussion at hand is about > > > > > > dma_buffer->foo = 1;/* WB */ > > > writel(KICK, DMA_KICK_REGISTER);/* UC */ > > > > Yes. That certainly is ordered on x86. In fact, afaik it's ordered > > even if that writel() might be of type WC, because that only delays > > writes, it doesn't move them earlier. > > Now that we clarified x86 myth, Is this guaranteed on all architectures? If not we need to fix it. It's guaranteed on the "main" ones (arm, arm64, powerpc, i386, x86_64). We might need to check with other arch maintainers for the rest. We really want Linux to provide well defined "sane" semantics for the basic writel accessors. Note: We still have open questions about how readl() relates to surrounding memory accesses. It looks like ARM and powerpc do different things here. > We keep getting IA64 exception example. Maybe, this is also corrected since > then. I would think ia64 fixed it back when it was all discussed. I was under the impression all ia64 had "special" was the writel vs. spin_unlock which requires mmiowb, but maybe that was never completely fixed ? > Jose Abreu says "I don't know about x86 but arc architecture doesn't > have a wmb() in the writel() function (in some configs)". Well, it probably should then. > As long as we have these exceptions, these wmb() in common drivers is not > going anywhere and relaxed-arches will continue paying performance penalty. Well, let's fix them or leave them broken, at this point, it doesn't matter. We can give all arch maintainers a wakeup call and start making drivers work based on the documented assumptions. > I see 15% performance loss on ARM64 servers using Intel i40e network > drivers and an XL710 adapter due to CPU keeping itself busy doing barriers > most of the time rather than real work because of sequences like this all over > the place. 
> > dma_buffer->foo = 1;/* WB */ >wmb() > writel(KICK, DMA_KICK_REGISTER);/* UC */ > > I posted several patches last week to remove duplicate barriers on ARM while > trying to make the code friendly with other architectures. > > Basically changing it to > > dma_buffer->foo = 1;/* WB */ > wmb() > writel_relaxed(KICK, DMA_KICK_REGISTER);/* UC */ > mmiowb() > > This is a small step in the performance direction until we remove all > exceptions. > > https://www.spinics.net/lists/netdev/msg491842.html > https://www.spinics.net/lists/linux-rdma/msg62434.html > https://www.spinics.net/lists/arm-kernel/msg642336.html > > Discussion started to move around the need for relaxed API on PPC and then > why wmb() question came up. I'm working on the problem of relaxed APIs for powerpc, but we can keep that orthogonal. As is, today, a wmb() + writel() and a wmb() + writel_relaxed() on powerpc are identical. So changing them will not break us. But I don't see the point of doing that transformation if we can just get the straying archs fixed. It's not like any of them has a significant market presence these days anyway. Cheers, Ben. > Sinan >
[PATCH v2 bpf-next 2/9] libbpf: Support expected_attach_type at prog load
From: Andrey Ignatov Support setting `expected_attach_type` at prog load time in both `bpf/bpf.h` and `bpf/libbpf.h`. Since both headers already have an API to load programs, new functions are added so as not to break backward compatibility for existing ones: * `bpf_load_program_xattr()` is added to `bpf/bpf.h`; * `bpf_prog_load_xattr()` is added to `bpf/libbpf.h`. Both new functions accept structures, `struct bpf_load_program_attr` and `struct bpf_prog_load_attr` respectively, where new fields can be added in the future w/o changing the API. The standard `_xattr` suffix is used to name the new API functions. Since `bpf_load_program_name()` is not used as heavily as `bpf_load_program()`, it was removed in favor of the more generic `bpf_load_program_xattr()`. Signed-off-by: Andrey Ignatov Signed-off-by: Alexei Starovoitov --- tools/include/uapi/linux/bpf.h | 5 ++ tools/lib/bpf/bpf.c| 44 +++-- tools/lib/bpf/bpf.h| 17 +-- tools/lib/bpf/libbpf.c | 105 +++-- tools/lib/bpf/libbpf.h | 8 5 files changed, 133 insertions(+), 46 deletions(-) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index d245c41213ac..b44bcd7814ee 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -294,6 +294,11 @@ union bpf_attr { __u32 prog_flags; char prog_name[BPF_OBJ_NAME_LEN]; __u32 prog_ifindex; /* ifindex of netdev to prep for */ + /* For some prog types expected attach type must be known at +* load time to verify attach type specific parts of prog +* (context accesses, allowed helpers, etc). 
+*/ + __u32 expected_attach_type; }; struct { /* anonymous struct used by BPF_OBJ_* commands */ diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 592a58a2b681..e85a2191f8b5 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -146,26 +146,30 @@ int bpf_create_map_in_map(enum bpf_map_type map_type, const char *name, -1); } -int bpf_load_program_name(enum bpf_prog_type type, const char *name, - const struct bpf_insn *insns, - size_t insns_cnt, const char *license, - __u32 kern_version, char *log_buf, - size_t log_buf_sz) +int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr, + char *log_buf, size_t log_buf_sz) { - int fd; union bpf_attr attr; - __u32 name_len = name ? strlen(name) : 0; + __u32 name_len; + int fd; + + if (!load_attr) + return -EINVAL; + + name_len = load_attr->name ? strlen(load_attr->name) : 0; bzero(&attr, sizeof(attr)); - attr.prog_type = type; - attr.insn_cnt = (__u32)insns_cnt; - attr.insns = ptr_to_u64(insns); - attr.license = ptr_to_u64(license); + attr.prog_type = load_attr->prog_type; + attr.expected_attach_type = load_attr->expected_attach_type; + attr.insn_cnt = (__u32)load_attr->insns_cnt; + attr.insns = ptr_to_u64(load_attr->insns); + attr.license = ptr_to_u64(load_attr->license); attr.log_buf = ptr_to_u64(NULL); attr.log_size = 0; attr.log_level = 0; - attr.kern_version = kern_version; - memcpy(attr.prog_name, name, min(name_len, BPF_OBJ_NAME_LEN - 1)); + attr.kern_version = load_attr->kern_version; + memcpy(attr.prog_name, load_attr->name, + min(name_len, BPF_OBJ_NAME_LEN - 1)); fd = sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr)); if (fd >= 0 || !log_buf || !log_buf_sz) @@ -184,8 +188,18 @@ int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns, __u32 kern_version, char *log_buf, size_t log_buf_sz) { - return bpf_load_program_name(type, NULL, insns, insns_cnt, license, -kern_version, log_buf, log_buf_sz); + struct bpf_load_program_attr load_attr; + + memset(&load_attr, 0, sizeof(struct 
bpf_load_program_attr)); + load_attr.prog_type = type; + load_attr.expected_attach_type = 0; + load_attr.name = NULL; + load_attr.insns = insns; + load_attr.insns_cnt = insns_cnt; + load_attr.license = license; + load_attr.kern_version = kern_version; + + return bpf_load_program_xattr(&load_attr, log_buf, log_buf_sz); } int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns, diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index 8d18fb73d7fb..2f7813d5e357 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -41,13 +41,20 @@ int bpf_create_map_in_map(enum bpf_map_type
[PATCH v2 bpf-next 9/9] selftests/bpf: Selftest for sys_bind post-hooks.
From: Andrey Ignatov Add selftest for attach types `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND`. The main things tested are: * prog load behaves as expected (valid/invalid accesses in prog); * prog attach behaves as expected (load- vs attach-time attach types); * `BPF_CGROUP_INET_SOCK_CREATE` can be attached in a backward compatible way; * post-hooks return expected result and errno. Example: # ./test_sock Test case: bind4 load with invalid access: src_ip6 .. [PASS] Test case: bind4 load with invalid access: mark .. [PASS] Test case: bind6 load with invalid access: src_ip4 .. [PASS] Test case: sock_create load with invalid access: src_port .. [PASS] Test case: sock_create load w/o expected_attach_type (compat mode) .. [PASS] Test case: sock_create load w/ expected_attach_type .. [PASS] Test case: attach type mismatch bind4 vs bind6 .. [PASS] Test case: attach type mismatch bind6 vs bind4 .. [PASS] Test case: attach type mismatch default vs bind4 .. [PASS] Test case: attach type mismatch bind6 vs sock_create .. [PASS] Test case: bind4 reject all .. [PASS] Test case: bind6 reject all .. [PASS] Test case: bind6 deny specific IP & port .. [PASS] Test case: bind4 allow specific IP & port .. [PASS] Test case: bind4 allow all .. [PASS] Test case: bind6 allow all .. 
[PASS] Summary: 16 PASSED, 0 FAILED Signed-off-by: Andrey Ignatov Signed-off-by: Alexei Starovoitov --- tools/include/uapi/linux/bpf.h | 11 + tools/testing/selftests/bpf/Makefile| 4 +- tools/testing/selftests/bpf/test_sock.c | 479 3 files changed, 493 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/bpf/test_sock.c diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 7a8314342e2d..bd9ad31c0ebc 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -150,6 +150,8 @@ enum bpf_attach_type { BPF_CGROUP_INET6_BIND, BPF_CGROUP_INET4_CONNECT, BPF_CGROUP_INET6_CONNECT, + BPF_CGROUP_INET4_POST_BIND, + BPF_CGROUP_INET6_POST_BIND, __MAX_BPF_ATTACH_TYPE }; @@ -940,6 +942,15 @@ struct bpf_sock { __u32 protocol; __u32 mark; __u32 priority; + __u32 src_ip4; /* Allows 1,2,4-byte read. +* Stored in network byte order. +*/ + __u32 src_ip6[4]; /* Allows 1,2,4-byte read. +* Stored in network byte order. +*/ + __u32 src_port; /* Allows 4-byte read. 
+* Stored in host byte order +*/ }; #define XDP_PACKET_HEADROOM 256 diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index c64d4ebc77ff..0a315ddabbf4 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -23,7 +23,8 @@ urandom_read: urandom_read.c # Order correspond to 'make run_tests' order TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \ - test_align test_verifier_log test_dev_cgroup test_tcpbpf_user test_sock_addr + test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \ + test_sock test_sock_addr TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \ test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \ @@ -52,6 +53,7 @@ $(TEST_GEN_PROGS): $(BPFOBJ) $(TEST_GEN_PROGS_EXTENDED): $(OUTPUT)/libbpf.a $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c +$(OUTPUT)/test_sock: cgroup_helpers.c $(OUTPUT)/test_sock_addr: cgroup_helpers.c .PHONY: force diff --git a/tools/testing/selftests/bpf/test_sock.c b/tools/testing/selftests/bpf/test_sock.c new file mode 100644 index ..73bb20cfb9b7 --- /dev/null +++ b/tools/testing/selftests/bpf/test_sock.c @@ -0,0 +1,479 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2018 Facebook + +#include +#include + +#include +#include +#include + +#include + +#include + +#include "cgroup_helpers.h" + +#ifndef ARRAY_SIZE +# define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0])) +#endif + +#define CG_PATH"/foo" +#define MAX_INSNS 512 + +char bpf_log_buf[BPF_LOG_BUF_SIZE]; + +struct sock_test { + const char *descr; + /* BPF prog properties */ + struct bpf_insn insns[MAX_INSNS]; + enum bpf_attach_type expected_attach_type; + enum bpf_attach_type attach_type; + /* Socket properties */ + int domain; + int type; + /* Endpoint to bind() to */ + const char *ip; + unsigned short port; + /* Expected test result */ + enum { + LOAD_REJECT, + ATTACH_REJECT, + 
BIND_REJECT, +
[PATCH v2 bpf-next 3/9] bpf: Hooks for sys_bind
From: Andrey Ignatov == The problem == There is a use-case when all processes inside a cgroup should use one single IP address on a host that has multiple IPs configured. Those processes should use the IP for both ingress and egress, for TCP and UDP traffic. So TCP/UDP servers should be bound to that IP to accept incoming connections on it, and TCP/UDP clients should make outgoing connections from that IP. It should not require changing application code, since that's often not possible. Currently it's solved by intercepting glibc wrappers around syscalls such as `bind(2)` and `connect(2)`. It's done by a shared library that is preloaded for every process in a cgroup so that whenever a TCP/UDP server calls `bind(2)`, the library replaces the IP in the sockaddr before passing the arguments to the syscall. When an application calls `connect(2)` the library transparently binds the local end of the connection to that IP (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid a performance penalty). The shared library approach is fragile though, e.g.: * some applications clear env vars (incl. `LD_PRELOAD`); * `/etc/ld.so.preload` doesn't help since some applications are linked with the option `-z nodefaultlib`; * other applications don't use glibc and there is nothing to intercept. == The solution == The patch provides a much more reliable in-kernel solution for the 1st part of the problem: binding TCP/UDP servers on a desired IP. It does not depend on application environment and implementation details (whether glibc is used or not). It adds a new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND` (similar to the already existing `BPF_CGROUP_INET_SOCK_CREATE`). The new program type is intended to be used with sockets (`struct sock`) in a cgroup and the user-provided `struct sockaddr`. Pointers to both of them are parts of the context passed to programs of the newly added types. 
The new attach types provide hooks in the `bind(2)` system call for both IPv4 and IPv6 so that one can write a program to override the IP addresses and ports a user program tries to bind to, and apply such a program to a whole cgroup. == Implementation notes == [1] Separate attach types for `AF_INET` and `AF_INET6` are added intentionally to prevent reading/writing to offsets that don't make sense for the corresponding socket family. E.g. if the user passes a `sockaddr_in` it doesn't make sense to read from / write to the `user_ip6[]` context fields. [2] Write access to `struct bpf_sock_addr_kern` is implemented using a special field as an additional "register". There are just two registers in `sock_addr_convert_ctx_access`: `src` with the value to write and `dst` with a pointer to the context, which can't be clobbered without breaking later instructions. But the fields that may be written to are not available directly; the address of the corresponding pointer has to be loaded first. To get an additional register, the first one not used by `src` and `dst` is taken, its content is saved to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load the address of the pointer field, and finally the register's content is restored from the temporary field after writing the `src` value. 
Signed-off-by: Andrey Ignatov Signed-off-by: Alexei Starovoitov --- include/linux/bpf-cgroup.h | 21 include/linux/bpf_types.h | 1 + include/linux/filter.h | 10 ++ include/uapi/linux/bpf.h | 23 + kernel/bpf/cgroup.c| 36 +++ kernel/bpf/syscall.c | 36 +-- kernel/bpf/verifier.c | 1 + net/core/filter.c | 232 + net/ipv4/af_inet.c | 7 ++ net/ipv6/af_inet6.c| 7 ++ 10 files changed, 366 insertions(+), 8 deletions(-) diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index 8a4566691c8f..67dc4a6471ad 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -6,6 +6,7 @@ #include struct sock; +struct sockaddr; struct cgroup; struct sk_buff; struct bpf_sock_ops_kern; @@ -63,6 +64,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk, int __cgroup_bpf_run_filter_sk(struct sock *sk, enum bpf_attach_type type); +int __cgroup_bpf_run_filter_sock_addr(struct sock *sk, + struct sockaddr *uaddr, + enum bpf_attach_type type); + int __cgroup_bpf_run_filter_sock_ops(struct sock *sk, struct bpf_sock_ops_kern *sock_ops, enum bpf_attach_type type); @@ -103,6 +108,20 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor, __ret; \ }) +#define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, type) \ +({\ +
Re: [PATCH v13 net-next 08/12] crypto : chtls - CPL handler definition
On 3/27/2018 11:12 PM, Stefano Brivio wrote: > On Tue, 27 Mar 2018 23:06:37 +0530 > Atul Gupta wrote: > >> Exchange messages with hardware to program the TLS session >> CPL handlers for messages received from chip. >> >> Signed-off-by: Atul Gupta >> Signed-off-by: Michael Werner >> Reviewed-by: Sabrina Dubroca >> Reviewed-by: Stefano Brivio > No, I haven't. Stefano, I was not clear on the protocol for the Reviewed-by tag; I wanted to acknowledge the feedback you have provided on the code so far. Will remove this. Thanks Atul >
Re: [PATCH] vhost-net: add time limitation for tx polling(Internet mail)
On 27 Mar 2018 19:26, Jason wrote:
> On 27 Mar 2018 17:12, haibinzhang wrote:
>> handle_tx() will delay rx for a long time when busy tx polling udp packets
>> with short length (i.e. 1-byte udp payload), because setting VHOST_NET_WEIGHT
>> takes into account only sent bytes but not time.
>
> Interesting.
>
> Looking at vhost_can_busy_poll() it tries to poke pending vhost work and
> exit the busy loop if it found one. So I believe something blocks the
> work queueing. E.g. did reverting 8241a1e466cd56e6c10472cac9c1ad4e54bc65db
> fix the issue?

"Busy tx polling" means using netperf to send udp packets with a 1-byte
payload (47-byte total frame length), so handle_tx() is busy sending packets
continuously.

>
>> It's not fair for handle_rx(),
>> so we need to limit the max time of tx polling.
>>
>> ---
>> drivers/vhost/net.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 8139bc70ad7d..dc9218a3a75b 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -473,6 +473,7 @@ static void handle_tx(struct vhost_net *net)
>> 	struct socket *sock;
>> 	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
>> 	bool zcopy, zcopy_used;
>> +	unsigned long start = jiffies;
>
> Checking jiffies is tricky; it needs to be converted to ms or some other
> unit.
>
>>
>> 	mutex_lock(&vq->mutex);
>> 	sock = vq->private_data;
>> @@ -580,7 +581,7 @@ static void handle_tx(struct vhost_net *net)
>> 	else
>> 		vhost_zerocopy_signal_used(net, vq);
>> 	vhost_net_tx_packet(net);
>> -	if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>> +	if (unlikely(total_len >= VHOST_NET_WEIGHT) || unlikely(jiffies - start >= 1)) {
>
> How is the value 1 determined here? And we need a complete test to make
> sure this won't affect other use cases.

We just want <1 ms ping latency, but actually we are not sure what value is
reasonable.
We have some test results using netperf before this patch, as follows:

UDP payload       1 byte   100 bytes   1000 bytes   1400 bytes
Ping avg latency  25 ms    10 ms       2 ms         1.5 ms

What other test cases do you have in mind?

>
> Another thought is to introduce another limit on #packets, but this needs
> benchmarking too.
>
> Thanks
>
>> 		vhost_poll_queue(&vq->poll);
>> 		break;
>> 	}
>
>
Re: [PATCH v13 net-next 01/12] tls: support for Inline tls record
On 3/27/2018 11:53 PM, Stefano Brivio wrote:
> On Tue, 27 Mar 2018 23:06:30 +0530
> Atul Gupta wrote:
>
>> +static struct tls_context *create_ctx(struct sock *sk)
>> +{
>> +	struct inet_connection_sock *icsk = inet_csk(sk);
>> +	struct tls_context *ctx;
>> +
>> +	/* allocate tls context */
>> +	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>> +	if (!ctx)
>> +		return NULL;
>> +
>> +	icsk->icsk_ulp_data = ctx;
>> +	return ctx;
>> +}
>>
>> [...]
>>
>>  static int tls_init(struct sock *sk)
>>  {
>>  	int ip_ver = sk->sk_family == AF_INET6 ? TLSV6 : TLSV4;
>> -	struct inet_connection_sock *icsk = inet_csk(sk);
>>  	struct tls_context *ctx;
>>  	int rc = 0;
>>
>> +	if (tls_hw_prot(sk))
>> +		goto out;
>> +
>>  	/* The TLS ulp is currently supported only for TCP sockets
>>  	 * in ESTABLISHED state.
>>  	 * Supporting sockets in LISTEN state will require us
>> @@ -530,12 +624,11 @@ static int tls_init(struct sock *sk)
>>  		return -ENOTSUPP;
>>
>>  	/* allocate tls context */
>> -	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>> +	ctx = create_ctx(sk);
>>  	if (!ctx) {
>>  		rc = -ENOMEM;
>>  		goto out;
>>  	}
>> -	icsk->icsk_ulp_data = ctx;
>
> Why are you changing this?

Since create_ctx() is called from two places, the assignment is done in the
allocating function rather than duplicating it at each call site.

> This is now equivalent to the original implementation, except that you
> are "hiding" the assignment of icsk->icsk_ulp_data into a function named
> "create_ctx".
>
> Please also note that you are duplicating the "allocate tls context"
> comment.

I will remove the duplicated comment.
[PATCH v2 bpf-next 4/9] selftests/bpf: Selftest for sys_bind hooks
From: Andrey Ignatov

Add a selftest to work with the bpf_sock_addr context from
`BPF_PROG_TYPE_CGROUP_SOCK_ADDR` programs. Try to bind(2) to an IP:port and
apply:
* loads to make sure the context can be read correctly, including narrow
  loads (byte, half) for IP and full-size loads (word) for all fields;
* stores to those fields allowed by the verifier.

All combinations of IPv4/IPv6 and TCP/UDP are tested. Both scenarios are
tested:
* valid programs can be loaded and attached;
* invalid programs can be neither loaded nor attached.

The test passes when the expected data can be read from the context in the
BPF program, and, after the call to bind(2), the socket is bound to the
IP:port pair that was written by the BPF program to the context.

Example:
 # ./test_sock_addr
 Attached bind4 program.
 Test case #1 (IPv4/TCP):
         Requested: bind(192.168.1.254, 4040) ..
            Actual: bind(127.0.0.1, )
 Test case #2 (IPv4/UDP):
         Requested: bind(192.168.1.254, 4040) ..
            Actual: bind(127.0.0.1, )
 Attached bind6 program.
 Test case #3 (IPv6/TCP):
         Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
            Actual: bind(::1, )
 Test case #4 (IPv6/UDP):
         Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
            Actual: bind(::1, )
 ### SUCCESS

Signed-off-by: Andrey Ignatov
Signed-off-by: Alexei Starovoitov
---
 tools/include/uapi/linux/bpf.h               |  23 ++
 tools/lib/bpf/libbpf.c                       |   6 +
 tools/testing/selftests/bpf/Makefile         |   3 +-
 tools/testing/selftests/bpf/test_sock_addr.c | 486 +++
 4 files changed, 517 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_sock_addr.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b44bcd7814ee..a8ec89065496 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -134,6 +134,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_SKB,
 	BPF_PROG_TYPE_CGROUP_DEVICE,
 	BPF_PROG_TYPE_SK_MSG,
+	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
 };

 enum bpf_attach_type {
@@ -145,6 +146,8 @@ enum bpf_attach_type {
 	BPF_SK_SKB_STREAM_VERDICT,
 	BPF_CGROUP_DEVICE,
 	BPF_SK_MSG_VERDICT,
+	BPF_CGROUP_INET4_BIND,
+	BPF_CGROUP_INET6_BIND,
 	__MAX_BPF_ATTACH_TYPE
 };

@@ -1002,6 +1005,26 @@ struct bpf_map_info {
 	__u64 netns_ino;
 } __attribute__((aligned(8)));

+/* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
+ * by user and intended to be used by socket (e.g. to bind to, depends on
+ * attach attach type).
+ */
+struct bpf_sock_addr {
+	__u32 user_family;	/* Allows 4-byte read, but no write. */
+	__u32 user_ip4;		/* Allows 1,2,4-byte read and 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 user_ip6[4];	/* Allows 1,2,4-byte read an 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 user_port;	/* Allows 4-byte read and write.
+				 * Stored in network byte order
+				 */
+	__u32 family;		/* Allows 4-byte read, but no write */
+	__u32 type;		/* Allows 4-byte read, but no write */
+	__u32 protocol;		/* Allows 4-byte read, but no write */
+};
+
 /* User bpf_sock_ops struct to access socket values and specify request ops
  * and their replies.
  * Some of this fields are in network (bigendian) byte order and may need
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 48e3e743ebf7..d7ce8818982c 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1859,6 +1859,9 @@ static void bpf_program__set_expected_attach_type(struct bpf_program *prog,

 #define BPF_PROG_SEC(string, ptype) BPF_PROG_SEC_FULL(string, ptype, 0)

+#define BPF_SA_PROG_SEC(string, ptype) \
+	BPF_PROG_SEC_FULL(string, BPF_PROG_TYPE_CGROUP_SOCK_ADDR, ptype)
+
 static const struct {
 	const char *sec;
 	size_t len;
@@ -1882,10 +1885,13 @@ static const struct {
 	BPF_PROG_SEC("sockops",		BPF_PROG_TYPE_SOCK_OPS),
 	BPF_PROG_SEC("sk_skb",		BPF_PROG_TYPE_SK_SKB),
 	BPF_PROG_SEC("sk_msg",		BPF_PROG_TYPE_SK_MSG),
+	BPF_SA_PROG_SEC("cgroup/bind4",	BPF_CGROUP_INET4_BIND),
+	BPF_SA_PROG_SEC("cgroup/bind6",	BPF_CGROUP_INET6_BIND),
 };

 #undef BPF_PROG_SEC
 #undef BPF_PROG_SEC_FULL
+#undef BPF_SA_PROG_SEC

 static int bpf_program__identify_section(struct bpf_program *prog)
 {
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index f35fb02bdf56..f4717c910874 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -23,7 +23,7 @@ urandom_read:
[PATCH v2 bpf-next 5/9] net: Introduce __inet_bind() and __inet6_bind
From: Andrey Ignatov

Refactor the `bind()` code to make it ready to be called from the BPF helper function `bpf_bind()` (to be added soon). The implementation of `inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and `__inet6_bind()` correspondingly. These functions can be used from both `sk_prot->bind` and `bpf_bind()` contexts.

The new functions have two additional arguments.

`force_bind_address_no_port` forces binding to IP only, w/o checking the `inet_sock.bind_address_no_port` field. It allows binding the local end of a connection to a desired IP in `bpf_bind()` w/o changing the `bind_address_no_port` field of the socket. This is useful since `bpf_bind()` can return an error, and we'd need to restore the original value of `bind_address_no_port` in that case if we had changed it before calling the helper.

`with_lock` specifies whether to lock the socket when working with `struct sk` or not. The argument is set to `true` for `sk_prot->bind`, i.e. the old behavior is preserved. But it will be set to `false` for the `bpf_bind()` use-case. The reason is that all call sites where `bpf_bind()` will be called already hold the socket lock.
Signed-off-by: Andrey Ignatov
Acked-by: Alexei Starovoitov
Signed-off-by: Alexei Starovoitov
---
 include/net/inet_common.h |  2 ++
 include/net/ipv6.h        |  2 ++
 net/ipv4/af_inet.c        | 39 ---
 net/ipv6/af_inet6.c       | 37 -
 4 files changed, 52 insertions(+), 28 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 500f81375200..384b90c62c0b 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -32,6 +32,8 @@ int inet_shutdown(struct socket *sock, int how);
 int inet_listen(struct socket *sock, int backlog);
 void inet_sock_destruct(struct sock *sk);
 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
+int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		bool force_bind_address_no_port, bool with_lock);
 int inet_getname(struct socket *sock, struct sockaddr *uaddr, int peer);
 int inet_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 50a6f0ddb878..2e5fedc56e59 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1066,6 +1066,8 @@ void ipv6_local_error(struct sock *sk, int err, struct flowi6 *fl6, u32 info);
 void ipv6_local_rxpmtu(struct sock *sk, struct flowi6 *fl6, u32 mtu);

 int inet6_release(struct socket *sock);
+int __inet6_bind(struct sock *sock, struct sockaddr *uaddr, int addr_len,
+		 bool force_bind_address_no_port, bool with_lock);
 int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len);
 int inet6_getname(struct socket *sock, struct sockaddr *uaddr, int peer);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 2dec266507dc..e203a39d6988 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -432,30 +432,37 @@ EXPORT_SYMBOL(inet_release);

 int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 {
-	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
 	struct sock *sk = sock->sk;
-	struct inet_sock *inet = inet_sk(sk);
-	struct net *net = sock_net(sk);
-	unsigned short snum;
-	int chk_addr_ret;
-	u32 tb_id = RT_TABLE_LOCAL;
 	int err;

 	/* If the socket has its own bind function then use it. (RAW) */
 	if (sk->sk_prot->bind) {
-		err = sk->sk_prot->bind(sk, uaddr, addr_len);
-		goto out;
+		return sk->sk_prot->bind(sk, uaddr, addr_len);
 	}
-	err = -EINVAL;
 	if (addr_len < sizeof(struct sockaddr_in))
-		goto out;
+		return -EINVAL;

 	/* BPF prog is run before any checks are done so that if the prog
 	 * changes context in a wrong way it will be caught.
 	 */
 	err = BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr);
 	if (err)
-		goto out;
+		return err;
+
+	return __inet_bind(sk, uaddr, addr_len, false, true);
+}
+EXPORT_SYMBOL(inet_bind);
+
+int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		bool force_bind_address_no_port, bool with_lock)
+{
+	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
+	struct inet_sock *inet = inet_sk(sk);
+	struct net *net = sock_net(sk);
+	unsigned short snum;
+	int chk_addr_ret;
+	u32 tb_id = RT_TABLE_LOCAL;
+	int err;

 	if (addr->sin_family != AF_INET) {
 		/* Compatibility games : accept AF_UNSPEC (mapped to AF_INET)
@@ -499,7 +506,8 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
	 * would be illegal to use them (multicast/broadcast) in
	 * which case the sending
[PATCH v2 bpf-next 8/9] bpf: Post-hooks for sys_bind
From: Andrey Ignatov

"Post-hooks" are hooks that are called right before returning from sys_bind. At this point the IP and port are already allocated and no further changes to `struct sock` can happen before returning from sys_bind, but the BPF program has a chance to inspect the socket and change the sys_bind result. Specifically, it can e.g. inspect what port was allocated and, if it doesn't satisfy some policy, force sys_bind to fail and return EPERM to the user.

Another example of usage is recording the IP:port pair in some map to use it in later calls to sys_connect. E.g. if some TCP server inside a cgroup was bound to some IP:port_n, it can be recorded in a map. Later, when some TCP client inside the same cgroup tries to connect to 127.0.0.1:port_n, a BPF hook for sys_connect can override the destination and connect the application to IP:port_n instead of 127.0.0.1:port_n. That helps force all applications inside a cgroup to use the desired IP without breaking applications that e.g. use localhost to communicate with each other.

== Implementation details ==

Post-hooks are implemented as two new attach types, `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND`, for the existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`. Separate attach types for IPv4 and IPv6 are introduced to avoid accessing IPv6 fields in `struct sock` from `inet_bind()` and IPv4 fields from `inet6_bind()`, since those fields might not make sense in such cases.
Signed-off-by: Andrey Ignatov
Signed-off-by: Alexei Starovoitov
---
 include/linux/bpf-cgroup.h |  16 +--
 include/uapi/linux/bpf.h   |  11 +
 kernel/bpf/syscall.c       |  43 +
 net/core/filter.c          | 116 +++--
 net/ipv4/af_inet.c         |  18 ---
 net/ipv6/af_inet6.c        |  21 +---
 6 files changed, 195 insertions(+), 30 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index c6ab295e6dcb..30d15e64b993 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -98,16 +98,24 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
	__ret;							       \
 })

-#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk)			       \
+#define BPF_CGROUP_RUN_SK_PROG(sk, type)			       \
 ({								       \
	int __ret = 0;						       \
	if (cgroup_bpf_enabled) {				       \
-		__ret = __cgroup_bpf_run_filter_sk(sk,		       \
-						   BPF_CGROUP_INET_SOCK_CREATE); \
+		__ret = __cgroup_bpf_run_filter_sk(sk, type);	       \
	}							       \
	__ret;							       \
 })

+#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk)			       \
+	BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET_SOCK_CREATE)
+
+#define BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk)			       \
+	BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET4_POST_BIND)
+
+#define BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk)			       \
+	BPF_CGROUP_RUN_SK_PROG(sk, BPF_CGROUP_INET6_POST_BIND)
+
 #define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, type)			       \
 ({								       \
	int __ret = 0;						       \
@@ -183,6 +191,8 @@ static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
 #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET6_BIND(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET6_POST_BIND(sk) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_CONNECT(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr) ({ 0; })
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 969091fc6ee7..9f5f166f9d73 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -150,6 +150,8 @@ enum bpf_attach_type {
 	BPF_CGROUP_INET6_BIND,
 	BPF_CGROUP_INET4_CONNECT,
 	BPF_CGROUP_INET6_CONNECT,
+	BPF_CGROUP_INET4_POST_BIND,
+	BPF_CGROUP_INET6_POST_BIND,
 	__MAX_BPF_ATTACH_TYPE
 };

@@ -941,6 +943,15 @@ struct bpf_sock {
 	__u32 protocol;
 	__u32 mark;
 	__u32 priority;
+	__u32 src_ip4;		/* Allows 1,2,4-byte read.
+
[PATCH v2 bpf-next 7/9] selftests/bpf: Selftest for sys_connect hooks
From: Andrey Ignatov

Add a selftest for the BPF_CGROUP_INET4_CONNECT and BPF_CGROUP_INET6_CONNECT
attach types. Try to connect(2) to a specified IP:port and test that:
* the remote IP:port pair is overridden;
* the local end of the connection is bound to the specified IP.

All combinations of IPv4/IPv6 and TCP/UDP are tested.

Example:
 # tcpdump -pn -i lo -w connect.pcap 2>/dev/null &
 [1] 478
 # strace -qqf -e connect -o connect.trace ./test_sock_addr.sh
 Wait for testing IPv4/IPv6 to become available ... OK
 Load bind4 with invalid type (can pollute stderr) ... REJECTED
 Load bind4 with valid type ... OK
 Attach bind4 with invalid type ... REJECTED
 Attach bind4 with valid type ... OK
 Load connect4 with invalid type (can pollute stderr) libbpf: load bpf \
   program failed: Permission denied
 libbpf: -- BEGIN DUMP LOG ---
 libbpf: 0: (b7) r2 = 23569
 1: (63) *(u32 *)(r1 +24) = r2
 2: (b7) r2 = 16777343
 3: (63) *(u32 *)(r1 +4) = r2
 invalid bpf_context access off=4 size=4
 [ 1518.404609] random: crng init done
 libbpf: -- END LOG --
 libbpf: failed to load program 'cgroup/connect4'
 libbpf: failed to load object './connect4_prog.o'
 ... REJECTED
 Load connect4 with valid type ... OK
 Attach connect4 with invalid type ... REJECTED
 Attach connect4 with valid type ... OK
 Test case #1 (IPv4/TCP):
         Requested: bind(192.168.1.254, 4040) ..
            Actual: bind(127.0.0.1, )
         Requested: connect(192.168.1.254, 4040) from (*, *) ..
            Actual: connect(127.0.0.1, ) from (127.0.0.4, 56068)
 Test case #2 (IPv4/UDP):
         Requested: bind(192.168.1.254, 4040) ..
            Actual: bind(127.0.0.1, )
         Requested: connect(192.168.1.254, 4040) from (*, *) ..
            Actual: connect(127.0.0.1, ) from (127.0.0.4, 56447)
 Load bind6 with invalid type (can pollute stderr) ... REJECTED
 Load bind6 with valid type ... OK
 Attach bind6 with invalid type ... REJECTED
 Attach bind6 with valid type ...
 OK
 Load connect6 with invalid type (can pollute stderr) libbpf: load bpf \
   program failed: Permission denied
 libbpf: -- BEGIN DUMP LOG ---
 libbpf: 0: (b7) r6 = 0
 1: (63) *(u32 *)(r1 +12) = r6
 invalid bpf_context access off=12 size=4
 libbpf: -- END LOG --
 libbpf: failed to load program 'cgroup/connect6'
 libbpf: failed to load object './connect6_prog.o'
 ... REJECTED
 Load connect6 with valid type ... OK
 Attach connect6 with invalid type ... REJECTED
 Attach connect6 with valid type ... OK
 Test case #3 (IPv6/TCP):
         Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
            Actual: bind(::1, )
         Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *)
            Actual: connect(::1, ) from (::6, 37458)
 Test case #4 (IPv6/UDP):
         Requested: bind(face:b00c:1234:5678::abcd, 6060) ..
            Actual: bind(::1, )
         Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *)
            Actual: connect(::1, ) from (::6, 39315)
 ### SUCCESS
 # egrep 'connect\(.*AF_INET' connect.trace | \
 > egrep -vw 'htons\(1025\)' | fold -b -s -w 72
 502   connect(7, {sa_family=AF_INET, sin_port=htons(4040),
 sin_addr=inet_addr("192.168.1.254")}, 128) = 0
 502   connect(8, {sa_family=AF_INET, sin_port=htons(4040),
 sin_addr=inet_addr("192.168.1.254")}, 128) = 0
 502   connect(9, {sa_family=AF_INET6, sin6_port=htons(6060),
 inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
 sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
 502   connect(10, {sa_family=AF_INET6, sin6_port=htons(6060),
 inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr),
 sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0
 # fg
 tcpdump -pn -i lo -w connect.pcap 2> /dev/null
 # tcpdump -r connect.pcap -n tcp | cut -c 1-72
 reading from file connect.pcap, link-type EN10MB (Ethernet)
 17:57:40.383533 IP 127.0.0.4.56068 > 127.0.0.1.: Flags [S], seq 1333
 17:57:40.383566 IP 127.0.0.1. > 127.0.0.4.56068: Flags [S.], seq 112
 17:57:40.383589 IP 127.0.0.4.56068 > 127.0.0.1.: Flags [.], ack 1, w
 17:57:40.384578 IP 127.0.0.1. > 127.0.0.4.56068: Flags [R.], seq 1,
 17:57:40.403327 IP6 ::6.37458 > ::1.: Flags [S], seq 406513443, win
 17:57:40.403357 IP6 ::1. > ::6.37458: Flags [S.], seq 2448389240, ac
 17:57:40.403376 IP6 ::6.37458 > ::1.: Flags [.], ack 1, win 342, opt
 17:57:40.404263 IP6 ::1. > ::6.37458: Flags [R.], seq 1, ack 1, win

Signed-off-by: Andrey Ignatov
Signed-off-by: Alexei Starovoitov
---
 tools/include/uapi/linux/bpf.h               |  12 ++-
 tools/lib/bpf/libbpf.c                       |   2 +
 tools/testing/selftests/bpf/Makefile         |   5 +-
 tools/testing/selftests/bpf/bpf_helpers.h    |   2 +
 tools/testing/selftests/bpf/connect4_prog.c  |  45 +++
 tools/testing/selftests/bpf/connect6_prog.c  |  61 +++
 tools/testing/selftests/bpf/test_sock_addr.c | 104
Re: RFC on writel and writel_relaxed
On Tue, 2018-03-27 at 16:51 -1000, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt wrote:
> >
> > The discussion at hand is about
> >
> > 	dma_buffer->foo = 1;			/* WB */
> > 	writel(KICK, DMA_KICK_REGISTER);	/* UC */
>
> Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> even if that writel() might be of type WC, because that only delays
> writes, it doesn't move them earlier.

Ok, so this is our answer ...

... snip ... (thanks for the background info!)

> Oh, the above UC case is absolutely guaranteed.

Good. Then:

> The only issue really is that 99.9% of all testing gets done on x86
> unless you look at specific SoC drivers.
>
> On ARM, for example, there is likely little reason to care about x86
> memory ordering, because there is almost zero driver overlap between
> x86 and ARM.
>
> *Historically*, the reason for following the x86 IO ordering was
> simply that a lot of architectures used the drivers that were
> developed on x86. The alpha and powerpc workstations were *designed*
> with the x86 IO bus (PCI, then PCIe) and to work with the devices that
> came with it.
>
> ARM? PCIe is almost irrelevant. For ARM servers, if they ever take
> off, sure. But 99.99% of ARM is about their own SoC's, and so "x86
> test coverage" is simply not an issue.
>
> How much of an issue is it for Power? Maybe you decide it's not a big deal.
>
> Then all the above is almost irrelevant.

So the overlap may not be that nil in practice :-) But even then, that
doesn't matter, as ARM has been happily implementing the same semantics
you describe above for years, as do we on powerpc.
This is why I want (with your agreement) to define clearly, and once and for
all, that the Linux semantics of writel() are that it is ordered with previous
writes to coherent memory (*).

This is already what ARM and powerpc provide and, from what you say, what x86
provides too. I don't see any reason to keep that badly documented and have
drivers randomly growing useless wmb()'s because people don't think it works
on x86 without them!

Once that's sorted, let's tackle the problem of mmiowb vs. spin_unlock and
the problem of writel_relaxed() semantics, but as separate issues :-)

Also, can I assume the above ordering with writel() equally applies to
readl() or not? I.e.:

	dma_buf->foo = 1;
	readl(STUPID_DEVICE_DMA_KICK_ON_READ);

Does that also work on x86? (It does on power; maybe not on ARM.)

Cheers,
Ben.

(*) From a Linux API perspective, all of this is only valid if the memory was
allocated by dma_alloc_coherent(). Anything obtained by dma_map_something()
might have been bounce-buffered or might require extra cache flushes on some
architectures, and thus needs dma_sync_for_{cpu,device} calls.

Cheers,
Ben.